<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta name="generator" content="pdf2htmlEX">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<link rel="stylesheet" href="https://static.pudn.com/base/css/base.min.css">
<link rel="stylesheet" href="https://static.pudn.com/base/css/fancy.min.css">
<link rel="stylesheet" href="https://static.pudn.com/prod/directory_preview_static/624b58668947fd5953ee5026/raw.css">
<script src="https://static.pudn.com/base/js/compatibility.min.js"></script>
<script src="https://static.pudn.com/base/js/pdf2htmlEX.min.js"></script>
<script>
try{
pdf2htmlEX.defaultViewer = new pdf2htmlEX.Viewer({});
}catch(e){}
</script>
<title>DenseCap: Fully Convolutional Localization Networks for Dense Captioning</title>
</head>
<body>
<div id="sidebar" style="display: none">
<div id="outline">
</div>
</div>
<div id="pf1" class="pf w0 h0" data-page-no="1"><div class="pc pc1 w0 h0"><img class="bi x0 y0 w1 h1" alt="" src="https://static.pudn.com/prod/directory_preview_static/624b58668947fd5953ee5026/bg1.jpg"><div class="t m0 x1 h2 y1 ff1 fs0 fc0 sc0 ls0 ws0">DenseCap:<span class="_ _0"> </span>Fully<span class="_ _1"> </span>Con<span class="_ _2"></span>volutional<span class="_ _1"> </span>Localization<span class="_ _1"> </span>Netw<span class="_ _3"></span>orks<span class="_ _1"> </span>f<span class="_ _2"></span>or<span class="_ _1"> </span>Dense<span class="_ _1"> </span>Captioning</div><div class="t m0 x2 h3 y2 ff2 fs1 fc0 sc0 ls0 ws0">Justin<span class="_"> </span>Johnson</div><div class="t m0 x3 h4 y3 ff3 fs2 fc0 sc0 ls0 ws0">∗</div><div class="t m0 x4 h3 y2 ff2 fs1 fc0 sc0 ls0 ws0">Andrej<span class="_"> </span>Karpathy</div><div class="t m0 x5 h4 y3 ff3 fs2 fc0 sc0 ls0 ws0">∗</div><div class="t m0 x6 h3 y2 ff2 fs1 fc0 sc0 ls0 ws0">Li<span class="_"> </span>Fei-Fei</div><div class="t m0 x7 h3 y4 ff2 fs1 fc0 sc0 ls0 ws0">Department<span class="_"> </span>of<span class="_"> </span>Computer<span class="_"> </span>Science,<span class="_"> </span>Stanford<span class="_"> </span>Uni<span class="_ _3"></span>versity</div><div class="t m0 x8 h5 y5 ff4 fs3 fc0 sc0 ls0 ws0">{<span class="ff5">jcjohns,karpathy,feifeili</span>}<span class="ff5">@cs.stanford.edu</span></div><div class="t m0 x9 h6 y6 ff1 fs1 fc0 sc0 ls0 ws0">Abstract</div><div class="t m0 xa h7 y7 ff6 fs4 fc0 sc0 ls0 ws0">W<span class="_ _4"></span>e<span class="_ _1"> </span>intr<span class="_ _2"></span>oduce<span class="_ _5"> </span>the<span class="_ _1"> </span>dense<span class="_ _1"> </span>captioning<span class="_ _1"> </span>task,<span class="_ _6"> </span>which<span class="_ _1"> </span>r<span class="_ _2"></span>equires<span class="_ _1"> </span>a</div><div class="t m0 xa h7 y8 ff6 fs4 fc0 sc0 ls0 ws0">computer<span class="_ _7"> </span>vision<span class="_ _7"> </span>system<span class="_ _7"> 
</span>to<span class="_ _7"> </span>both<span class="_ _7"> </span>localize<span class="_ _7"> </span>and<span class="_ _7"> </span>describe<span class="_ _7"> </span>salient</div><div class="t m0 xa h7 y9 ff6 fs4 fc0 sc0 ls0 ws0">r<span class="_ _2"></span>egions<span class="_"> </span>in<span class="_ _8"> </span>images<span class="_"> </span>in<span class="_ _8"> </span>natural<span class="_"> </span>language.<span class="_ _6"> </span>The<span class="_ _9"> </span>dense<span class="_ _9"> </span>caption-</div><div class="t m0 xa h7 ya ff6 fs4 fc0 sc0 ls0 ws0">ing<span class="_ _9"> </span>task<span class="_ _9"> </span>gener<span class="_ _3"></span>alizes<span class="_ _9"> </span>object<span class="_ _9"> </span>detection<span class="_ _9"> </span>when<span class="_ _9"> </span>the<span class="_ _9"> </span>descriptions</div><div class="t m0 xa h7 yb ff6 fs4 fc0 sc0 ls0 ws0">consist<span class="_ _8"> </span>of<span class="_ _1"> </span>a<span class="_ _8"> </span>single<span class="_ _a"> </span>wor<span class="_ _2"></span>d,<span class="_ _1"> </span>and<span class="_ _a"> </span>Image<span class="_ _8"> </span>Captioning<span class="_ _8"> </span>when<span class="_ _a"> </span>one</div><div class="t m0 xa h7 yc ff6 fs4 fc0 sc0 ls0 ws0">pr<span class="_ _2"></span>edicted<span class="_"> </span>r<span class="_ _2"></span>e<span class="_ _3"></span>gion<span class="_ _7"> </span>covers<span class="_ _7"> </span>the<span class="_ _7"> </span>full<span class="_"> </span>ima<span class="_ _3"></span>ge<span class="_ _3"></span>.<span class="_ _9"> </span>T<span class="_ _4"></span>o<span class="_"> </span>addr<span class="_ _2"></span>ess<span class="_ _7"> </span>the<span class="_"> </span>local-</div><div class="t m0 xa h7 yd ff6 fs4 fc0 sc0 ls0 ws0">ization<span class="_ _7"> </span>and<span class="_ _7"> </span>description<span class="_"> </span>task<span class="_ _7"> </span>jointly<span class="_ _7"> </span>we<span class="_ _7"> </span>pr<span class="_ 
_2"></span>opose<span class="_ _7"> </span>a<span class="_"> </span>Fully<span class="_ _7"> </span>Con-</div><div class="t m0 xa h7 ye ff6 fs4 fc0 sc0 ls0 ws0">volutional<span class="_ _a"> </span>Localization<span class="_ _1"> </span>Network<span class="_ _a"> </span>(FCLN)<span class="_ _1"> </span>ar<span class="_ _3"></span>chitectur<span class="_ _2"></span>e<span class="_ _a"> </span>that</div><div class="t m0 xa h7 yf ff6 fs4 fc0 sc0 ls0 ws0">pr<span class="_ _2"></span>ocesses<span class="_"> </span>an<span class="_"> </span>ima<span class="_ _3"></span>ge<span class="_ _7"> </span>with<span class="_"> </span>a<span class="_"> </span>single<span class="_ _3"></span>,<span class="_"> </span>ef<span class="_ _2"></span>ficient<span class="_"> </span>forwar<span class="_ _3"></span>d<span class="_"> </span>pass,<span class="_ _7"> </span>r<span class="_ _3"></span>e-</div><div class="t m0 xa h7 y10 ff6 fs4 fc0 sc0 ls0 ws0">quir<span class="_ _2"></span>es<span class="_ _5"> </span>no<span class="_ _6"> </span>e<span class="_ _3"></span>xternal<span class="_ _5"> </span>r<span class="_ _3"></span>e<span class="_ _2"></span>gion<span class="_ _5"> </span>pr<span class="_ _3"></span>oposals,<span class="_ _6"> </span>and<span class="_ _5"> </span>can<span class="_ _5"> </span>be<span class="_ _5"> </span>trained</div><div class="t m0 xa h7 y11 ff6 fs4 fc0 sc0 ls0 ws0">end-to-end<span class="_ _9"> </span>with<span class="_ _8"> </span>a<span class="_ _9"> </span>single<span class="_ _8"> </span>r<span class="_ _2"></span>ound<span class="_ _8"> </span>of<span class="_ _9"> </span>optimization.<span class="_ _0"> </span>The<span class="_ _9"> </span>arc<span class="_ _3"></span>hi-</div><div class="t m0 xa h7 y12 ff6 fs4 fc0 sc0 ls0 ws0">tectur<span class="_ _2"></span>e<span class="_ _5"> </span>is<span class="_ _5"> </span>composed<span class="_ _1"> </span>of<span class="_ _5"> </span>a<span class="_ _1"> </span>Convolutional<span class="_ _1"> 
</span>Network,<span class="_ _5"> </span>a<span class="_ _5"> </span>novel</div><div class="t m0 xa h7 y13 ff6 fs4 fc0 sc0 ls0 ws0">dense<span class="_ _b"> </span>localization<span class="_ _b"> </span>layer<span class="_ _4"></span>,<span class="_ _c"> </span>and<span class="_ _b"> </span>a<span class="_ _b"> </span>Recurr<span class="_ _2"></span>ent<span class="_ _b"> </span>Neural<span class="_ _b"> </span>Network</div><div class="t m0 xa h7 y14 ff6 fs4 fc0 sc0 ls0 ws0">language<span class="_ _5"> </span>model<span class="_ _0"> </span>that<span class="_ _6"> </span>generates<span class="_ _6"> </span>the<span class="_ _6"> </span>label<span class="_ _0"> </span>sequences.<span class="_ _d"> </span>W<span class="_ _4"></span>e</div><div class="t m0 xa h7 y15 ff6 fs4 fc0 sc0 ls0 ws0">evaluate<span class="_"> </span>our<span class="_ _9"> </span>network<span class="_"> </span>on<span class="_ _9"> </span>the<span class="_ _9"> </span>V<span class="_ _4"></span>isual<span class="_ _9"> </span>Genome<span class="_ _9"> </span>dataset,<span class="_ _9"> </span>which</div><div class="t m0 xa h7 y16 ff6 fs4 fc0 sc0 ls0 ws0">comprises<span class="_ _5"> </span>94,000<span class="_ _5"> </span>images<span class="_ _1"> </span>and<span class="_ _5"> </span>4,100,000<span class="_ _5"> </span>r<span class="_ _3"></span>e<span class="_ _2"></span>gion-grounded</div><div class="t m0 xa h7 y17 ff6 fs4 fc0 sc0 ls0 ws0">captions.<span class="_ _e"> </span>W<span class="_ _4"></span>e<span class="_ _1"> </span>observe<span class="_ _5"> </span>both<span class="_ _5"> </span>speed<span class="_ _5"> </span>and<span class="_ _5"> </span>accuracy<span class="_ _1"> </span>impro<span class="_ _2"></span>ve-</div><div class="t m0 xa h7 y18 ff6 fs4 fc0 sc0 ls0 ws0">ments<span class="_ _8"> </span>over<span class="_ _a"> </span>baselines<span class="_ _8"> </span>based<span class="_ _a"> </span>on<span class="_ _a"> </span>curr<span class="_ _2"></span>ent<span class="_ _a"> </span>state<span class="_ _a"> 
</span>of<span class="_ _a"> </span>the<span class="_ _8"> </span>art<span class="_ _a"> </span>ap-</div><div class="t m0 xa h7 y19 ff6 fs4 fc0 sc0 ls0 ws0">pr<span class="_ _2"></span>oaches<span class="_"> </span>in<span class="_"> </span>both<span class="_"> </span>gener<span class="_ _3"></span>ation<span class="_"> </span>and<span class="_"> </span>r<span class="_ _2"></span>etrieval<span class="_"> </span>settings.</div><div class="t m0 xa h6 y1a ff1 fs1 fc0 sc0 ls0 ws0">1.<span class="_ _9"> </span>Introduction</div><div class="t m0 xa h8 y1b ff2 fs4 fc0 sc0 ls0 ws0">Our<span class="_"> </span>ability<span class="_"> </span>to<span class="_ _9"> </span>effortlessly<span class="_"> </span>point<span class="_"> </span>out<span class="_"> </span>and<span class="_"> </span>describe<span class="_ _9"> </span>all<span class="_"> </span>aspects</div><div class="t m0 xa h8 y1c ff2 fs4 fc0 sc0 ls0 ws0">of<span class="_ _9"> </span>an<span class="_ _9"> </span>image<span class="_ _9"> </span>relies<span class="_ _8"> </span>on<span class="_ _9"> </span>a<span class="_ _9"> </span>strong<span class="_ _9"> </span>semantic<span class="_ _8"> </span>understanding<span class="_ _9"> </span>of<span class="_ _9"> </span>a</div><div class="t m0 xa h8 y1d ff2 fs4 fc0 sc0 ls0 ws0">visual<span class="_ _8"> </span>scene<span class="_ _a"> </span>and<span class="_ _8"> </span>all<span class="_ _8"> </span>of<span class="_ _a"> </span>its<span class="_ _8"> </span>elements.<span class="_ _f"> </span>Howe<span class="_ _2"></span>ver<span class="_ _2"></span>,<span class="_ _a"> </span>despite<span class="_ _a"> </span>nu-</div><div class="t m0 xa h8 y1e ff2 fs4 fc0 sc0 ls0 ws0">merous<span class="_ _1"> </span>potential<span class="_ _1"> </span>applications,<span class="_ _5"> </span>this<span class="_ _1"> </span>ability<span class="_ _1"> </span>remains<span class="_ _5"> </span>a<span class="_ _1"> </span>chal-</div><div class="t m0 xa h8 y1f ff2 fs4 fc0 sc0 ls0 
ws0">lenge<span class="_ _6"> </span>for<span class="_ _5"> </span>our<span class="_ _6"> </span>state<span class="_ _6"> </span>of<span class="_ _6"> </span>the<span class="_ _6"> </span>art<span class="_ _6"> </span>visual<span class="_ _6"> </span>recognition<span class="_ _6"> </span>systems.</div><div class="t m0 xa h8 y20 ff2 fs4 fc0 sc0 ls0 ws0">In<span class="_ _6"> </span>the<span class="_ _0"> </span>last<span class="_ _6"> </span>few<span class="_ _6"> </span>years<span class="_ _6"> </span>there<span class="_ _0"> </span>has<span class="_ _6"> </span>been<span class="_ _0"> </span>significant<span class="_ _0"> </span>progress</div><div class="t m0 xa h8 y21 ff2 fs4 fc0 sc0 ls0 ws0">in<span class="_ _1"> </span>image<span class="_ _1"> </span>classification<span class="_ _1"> </span>[<span class="fc1">39</span>,<span class="_ _1"> </span><span class="fc1">26</span>,<span class="_ _1"> </span><span class="fc1">53</span>,<span class="_ _1"> </span><span class="fc1">45</span>],<span class="_ _5"> </span>where<span class="_ _5"> </span>the<span class="_ _1"> </span>task<span class="_ _1"> </span>is</div><div class="t m0 xa h8 y22 ff2 fs4 fc0 sc0 ls0 ws0">to<span class="_ _8"> </span>assign<span class="_ _a"> </span>one<span class="_ _a"> </span>label<span class="_ _8"> </span>to<span class="_ _a"> </span>an<span class="_ _a"> </span>image.<span class="_ _f"> </span>Further<span class="_ _8"> </span>work<span class="_ _a"> </span>has<span class="_ _8"> </span>pushed</div><div class="t m0 xa h8 y23 ff2 fs4 fc0 sc0 ls0 ws0">these<span class="_ _7"> </span>advances<span class="_ _7"> </span>along<span class="_ _7"> </span>two<span class="_ _7"> </span>orthogonal<span class="_"> </span>directions:<span class="_"> </span>First,<span class="_"> </span>rapid</div><div class="t m0 xa h8 y24 ff2 fs4 fc0 sc0 ls0 ws0">progress<span class="_"> </span>in<span class="_ _7"> </span>object<span class="_"> </span>detection<span class="_ _7"> </span>[<span 
class="fc1">40</span>,<span class="_"> </span><span class="fc1">14</span>,<span class="_ _7"> </span><span class="fc1">46</span>]<span class="_ _7"> </span>has<span class="_"> </span>produced<span class="_ _7"> </span>mod-</div><div class="t m0 xa h8 y25 ff2 fs4 fc0 sc0 ls0 ws0">els<span class="_ _7"> </span>that<span class="_ _7"> </span>efficiently<span class="_ _7"> </span>identify<span class="_ _7"> </span>and<span class="_ _7"> </span>label<span class="_ _7"> </span>multiple<span class="_ _7"> </span>salient<span class="_"> </span>re<span class="_ _2"></span>gions</div><div class="t m0 xa h8 y26 ff2 fs4 fc0 sc0 ls0 ws0">of<span class="_ _9"> </span>an<span class="_ _8"> </span>image.<span class="_ _0"> </span>Second,<span class="_ _8"> </span>recent<span class="_ _9"> </span>advances<span class="_ _9"> </span>in<span class="_ _9"> </span>image<span class="_ _8"> </span>captioning</div><div class="t m0 xa h8 y27 ff2 fs4 fc0 sc0 ls0 ws0">[<span class="fc1">3</span>,<span class="_ _a"> </span><span class="fc1">32</span>,<span class="_ _1"> </span><span class="fc1">21</span>,<span class="_ _a"> </span><span class="fc1">49</span>,<span class="_ _1"> </span><span class="fc1">51</span>,<span class="_ _a"> </span><span class="fc1">8</span>,<span class="_ _1"> </span><span class="fc1">4</span>]<span class="_ _1"> </span>hav<span class="_ _2"></span>e<span class="_ _1"> </span>expanded<span class="_ _a"> </span>the<span class="_ _1"> </span>complexity<span class="_ _a"> </span>of</div><div class="t m0 xa h8 y28 ff2 fs4 fc0 sc0 ls0 ws0">the<span class="_"> </span>label<span class="_"> </span>space<span class="_"> </span>from<span class="_"> </span>a<span class="_"> </span>fix<span class="_ _2"></span>ed<span class="_"> </span>set<span class="_"> </span>of<span class="_"> </span>categories<span class="_"> </span>to<span class="_"> </span>sequences<span class="_ _7"> </span>of</div><div class="t m0 xa h8 y29 ff2 fs4 fc0 sc0 ls0 ws0">words<span 
class="_"> </span>to<span class="_"> </span>express<span class="_"> </span>significantly<span class="_"> </span>richer<span class="_"> </span>concepts.</div><div class="t m0 xb h8 y2a ff2 fs4 fc0 sc0 ls0 ws0">Howe<span class="_ _2"></span>ver<span class="_ _2"></span>,<span class="_ _5"> </span>despite<span class="_ _a"> </span>encouraging<span class="_ _1"> </span>progress<span class="_ _1"> </span>along<span class="_ _1"> </span>the<span class="_ _1"> </span>label</div><div class="t m0 xa h8 y2b ff2 fs4 fc0 sc0 ls0 ws0">density<span class="_ _7"> </span>and<span class="_ _7"> </span>label<span class="_ _10"> </span>complexity<span class="_ _10"> </span>axes,<span class="_ _10"> </span>these<span class="_ _7"> </span>two<span class="_ _10"> </span>directions<span class="_ _7"> </span>hav<span class="_ _3"></span>e</div><div class="t m0 xc h9 y2c ff7 fs5 fc0 sc0 ls0 ws0">∗</div><div class="t m0 xd ha y2d ff2 fs2 fc0 sc0 ls0 ws0">Both<span class="_"> </span>authors<span class="_"> </span>contributed<span class="_"> </span>equally<span class="_"> </span>to<span class="_"> </span>this<span class="_"> </span>work.</div><div class="c xe y2e w2 hb"><div class="t m0 xf hc y2f ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Classification</span></div><div class="t m0 x10 hc y30 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Cat</span></div><div class="t m0 x11 hc y31 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Captioning</span></div><div class="t m0 x12 hc y32 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">A </span><span class="fc5 sc0">cat </span></div><div class="t m0 x12 hc y33 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">riding </span><span class="fc5 sc0">a </span></div><div class="t m0 x12 hc y34 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">skateboard</span></div><div class="t m0 x13 hc y35 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Detection</span></div><div class="t m0 x14 hc y36 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Cat</span></div><div class="t 
m0 x14 hd y37 ff9 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Skateboard</span></div><div class="t m0 x15 hc y38 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Dense </span><span class="fc5 sc0">Captioning</span></div><div class="t m0 x16 he y39 ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Orange </span><span class="fc5 sc0">spotted </span><span class="fc5 sc0">cat</span></div><div class="t m0 x14 he y3a ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Skateboard </span><span class="fc5 sc0">with </span></div><div class="t m0 x14 he y3b ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">red </span><span class="fc5 sc0">wheels</span></div><div class="t m0 x14 he y3c ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Cat </span><span class="fc5 sc0">riding </span><span class="fc5 sc0">a </span></div><div class="t m0 x14 he y3d ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">skateboard</span></div><div class="t m0 x14 he y3e ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Brown </span><span class="fc5 sc0">hardwood </span></div><div class="t m0 x14 he y3f ff8 fs7 fc0 sc0 ls0 ws0"><span class="fc5 sc0">flooring</span></div><div class="t m0 x14 hc y40 ff8 fs6 fc2 sc0 ls0 ws0"><span class="fc5 sc0">label </span><span class="fc5 sc0">density</span></div><div class="t m0 x17 hf y41 ff8 fs8 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Whole </span><span class="fc5 sc0">Image</span><span class="_ _11"> </span><span class="fc5 sc0">Image </span><span class="fc5 sc0">Regions</span></div><div class="t m0 x18 hc y42 ff8 fs6 fc3 sc0 ls0 ws0"><span class="fc5 sc0">label </span></div><div class="t m0 x19 hc y43 ff8 fs6 fc3 sc0 ls0 ws0"><span class="fc5 sc0">complexity</span></div><div class="t m0 x1a hf y44 ff8 fs8 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Single</span></div><div class="t m0 x1a hf y45 ff8 fs8 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Label</span></div><div class="t m0 x1b hc y46 ff8 fs6 fc0 sc0 ls0 ws0"><span class="fc5 sc0">Sequence</span></div></div><div class="t m0 xe h10 y47 ff2 fs3 
fc0 sc0 ls0 ws0">Figure<span class="_ _a"> </span>1.<span class="_ _1"> </span>W<span class="_ _2"></span>e<span class="_ _a"> </span>address<span class="_ _1"> </span>the<span class="_ _a"> </span>Dense<span class="_ _1"> </span>Captioning<span class="_ _1"> </span>task<span class="_ _a"> </span>(bottom<span class="_ _1"> </span>right)</div><div class="t m0 xe h10 y48 ff2 fs3 fc0 sc0 ls0 ws0">with<span class="_ _10"> </span>a<span class="_ _10"> </span>model<span class="_ _10"> </span>that<span class="_"> </span>jointly<span class="_ _10"> </span>generates<span class="_ _10"> </span>both<span class="_ _10"> </span>dense<span class="_ _10"> </span>and<span class="_ _10"> </span>rich<span class="_ _10"> </span>annotations</div><div class="t m0 xe h10 y49 ff2 fs3 fc0 sc0 ls0 ws0">in<span class="_"> </span>a<span class="_"> </span>single<span class="_"> </span>forward<span class="_"> </span>pass.</div><div class="t m0 xe h8 y4a ff2 fs4 fc0 sc0 ls0 ws0">remained<span class="_"> </span>separate.<span class="_ _9"> </span>In<span class="_"> </span>this<span class="_"> </span>work<span class="_"> </span>we<span class="_"> </span>take<span class="_"> </span>a<span class="_"> </span>step<span class="_"> </span>to<span class="_ _2"></span>wards<span class="_"> </span>uni-</div><div class="t m0 xe h8 y4b ff2 fs4 fc0 sc0 ls0 ws0">fying<span class="_ _9"> </span>these<span class="_ _8"> </span>two<span class="_ _9"> </span>inter-connected<span class="_ _9"> </span>tasks<span class="_ _9"> </span>into<span class="_ _9"> </span>one<span class="_ _8"> </span>joint<span class="_ _8"> </span>frame-</div><div class="t m0 xe h8 y4c ff2 fs4 fc0 sc0 ls0 ws0">work.<span class="_ _e"> </span>First,<span class="_ _6"> </span>we<span class="_ _1"> </span>introduce<span class="_ _5"> </span>the<span class="_ _5"> </span>dense<span class="_ _5"> </span>captioning<span class="_ _5"> </span>task<span class="_ _5"> </span>(see</div><div class="t m0 xe h8 y4d ff2 fs4 fc0 sc0 ls0 ws0">Figure<span 
class="_ _10"> </span><span class="fc4">1</span>),<span class="_"> </span>which<span class="_ _10"> </span>requires<span class="_ _7"> </span>a<span class="_ _7"> </span>model<span class="_ _10"> </span>to<span class="_"> </span>predict<span class="_ _10"> </span>a<span class="_ _10"> </span>set<span class="_ _7"> </span>of<span class="_ _7"> </span>descrip-</div><div class="t m0 xe h8 y4e ff2 fs4 fc0 sc0 ls0 ws0">tions<span class="_"> </span>across<span class="_ _9"> </span>regions<span class="_"> </span>of<span class="_ _9"> </span>an<span class="_ _9"> </span>image.<span class="_ _1"> </span>Object<span class="_ _9"> </span>detection<span class="_"> </span>is<span class="_ _9"> </span>hence</div><div class="t m0 xe h8 y4f ff2 fs4 fc0 sc0 ls0 ws0">recov<span class="_ _3"></span>ered<span class="_ _1"> </span>as<span class="_ _1"> </span>a<span class="_ _1"> </span>special<span class="_ _1"> </span>case<span class="_ _1"> </span>when<span class="_ _1"> </span>the<span class="_ _1"> </span>target<span class="_ _1"> </span>labels<span class="_ _1"> </span>consist</div><div class="t m0 xe h8 y50 ff2 fs4 fc0 sc0 ls0 ws0">of<span class="_ _1"> </span>one<span class="_ _1"> </span>word,<span class="_ _1"> </span>and<span class="_ _5"> </span>image<span class="_ _1"> </span>captioning<span class="_ _1"> </span>is<span class="_ _1"> </span>recovered<span class="_ _a"> </span>when<span class="_ _1"> </span>all</div><div class="t m0 xe h8 y51 ff2 fs4 fc0 sc0 ls0 ws0">images<span class="_"> </span>consist<span class="_"> </span>of<span class="_"> </span>one<span class="_"> </span>region<span class="_"> </span>that<span class="_"> </span>spans<span class="_"> </span>the<span class="_"> </span>full<span class="_"> </span>image.</div><div class="t m0 x1c h8 y52 ff2 fs4 fc0 sc0 ls0 ws0">Additionally<span class="_ _2"></span>,<span class="_ _1"> </span>we<span class="_ _1"> </span>dev<span class="_ _3"></span>elop<span class="_ _a"> </span>a<span class="_ _1"> </span>Fully<span 
class="_ _1"> </span>Con<span class="_ _2"></span>volutional<span class="_ _a"> </span>Local-</div><div class="t m0 xe h8 y53 ff2 fs4 fc0 sc0 ls0 ws0">ization<span class="_ _c"> </span>Network<span class="_ _b"> </span>(FCLN)<span class="_ _c"> </span>for<span class="_ _c"> </span>the<span class="_ _c"> </span>dense<span class="_ _b"> </span>captioning<span class="_ _c"> </span>task.</div><div class="t m0 xe h8 y54 ff2 fs4 fc0 sc0 ls0 ws0">Our<span class="_ _8"> </span>model<span class="_ _8"> </span>is<span class="_ _8"> </span>inspired<span class="_ _a"> </span>by<span class="_ _8"> </span>recent<span class="_ _8"> </span>work<span class="_ _8"> </span>in<span class="_ _8"> </span>image<span class="_ _a"> </span>captioning</div><div class="t m0 xe h8 y55 ff2 fs4 fc0 sc0 ls0 ws0">[<span class="fc1">49</span>,<span class="_ _9"> </span><span class="fc1">21</span>,<span class="_ _8"> </span><span class="fc1">32</span>,<span class="_ _9"> </span><span class="fc1">8</span>,<span class="_ _8"> </span><span class="fc1">4</span>]<span class="_ _9"> </span>in<span class="_ _8"> </span>that<span class="_ _9"> </span>it<span class="_ _8"> </span>is<span class="_ _8"> </span>composed<span class="_ _9"> </span>of<span class="_ _8"> </span>a<span class="_ _9"> </span>Conv<span class="_ _2"></span>olutional</div><div class="t m0 xe h8 y56 ff2 fs4 fc0 sc0 ls0 ws0">Neural<span class="_"> </span>Network<span class="_"> </span>and<span class="_ _9"> </span>a<span class="_"> </span>Recurrent<span class="_ _9"> </span>Neural<span class="_"> </span>Network<span class="_"> </span>language</div><div class="t m0 xe h8 y57 ff2 fs4 fc0 sc0 ls0 ws0">model.<span class="_ _8"> </span>Howe<span class="_ _2"></span>ver<span class="_ _2"></span>,<span class="_"> </span>drawing<span class="_"> </span>on<span class="_"> </span>work<span class="_"> </span>in<span class="_"> </span>object<span class="_ _7"> </span>detection<span class="_"> </span>[<span class="fc1">38</span>],</div><div class="t m0 
xe h8 y58 ff2 fs4 fc0 sc0 ls0 ws0">our<span class="_"> </span>second<span class="_ _7"> </span>core<span class="_"> </span>contrib<span class="_ _3"></span>ution<span class="_"> </span>is<span class="_ _7"> </span>to<span class="_"> </span>introduce<span class="_ _7"> </span>a<span class="_"> </span>ne<span class="_ _3"></span>w<span class="_"> </span>dense<span class="_ _7"> </span>lo-</div><div class="t m0 xe h8 y59 ff2 fs4 fc0 sc0 ls0 ws0">calization<span class="_ _a"> </span>layer<span class="_ _2"></span>.<span class="_ _12"> </span>This<span class="_ _a"> </span>layer<span class="_ _1"> </span>is<span class="_ _1"> </span>fully<span class="_ _1"> </span>differentiable<span class="_ _a"> </span>and<span class="_ _a"> </span>can</div><div class="t m0 xe h8 y5a ff2 fs4 fc0 sc0 ls0 ws0">be<span class="_ _a"> </span>inserted<span class="_ _a"> </span>into<span class="_ _a"> </span>any<span class="_ _8"> </span>neural<span class="_ _a"> </span>network<span class="_ _a"> </span>that<span class="_ _a"> </span>processes<span class="_ _a"> </span>images</div><div class="t m0 xe h8 y5b ff2 fs4 fc0 sc0 ls0 ws0">to<span class="_ _a"> </span>enable<span class="_ _a"> </span>region-le<span class="_ _2"></span>vel<span class="_ _a"> </span>training<span class="_ _a"> </span>and<span class="_ _a"> </span>predictions.<span class="_ _f"> </span>Internally<span class="_ _2"></span>,</div><div class="t m0 xe h8 y5c ff2 fs4 fc0 sc0 ls0 ws0">the<span class="_ _9"> </span>localization<span class="_ _9"> </span>layer<span class="_"> </span>predicts<span class="_ _9"> </span>a<span class="_ _9"> </span>set<span class="_ _9"> </span>of<span class="_ _9"> </span>regions<span class="_ _9"> </span>of<span class="_ _9"> </span>interest<span class="_"> </span>in</div><div class="t m0 xe h8 y5d ff2 fs4 fc0 sc0 ls0 ws0">the<span class="_ _1"> </span>image<span class="_ _1"> </span>and<span class="_ _1"> </span>then<span class="_ _5"> </span>uses<span class="_ _1"> </span>bilinear<span 
class="_ _1"> </span>interpolation<span class="_ _1"> </span>[<span class="fc1">19</span>,<span class="_ _5"> </span><span class="fc1">16</span>]<span class="_ _1"> </span>to</div><div class="t m0 xe h8 y5e ff2 fs4 fc0 sc0 ls0 ws0">smoothly<span class="_"> </span>crop<span class="_"> </span>the<span class="_"> </span>activ<span class="_ _2"></span>ations<span class="_"> </span>in<span class="_"> </span>each<span class="_"> </span>region.</div><div class="t m0 x1c h8 y5f ff2 fs4 fc0 sc0 ls0 ws0">W<span class="_ _4"></span>e<span class="_"> </span>e<span class="_ _2"></span>valuate<span class="_ _10"> </span>the<span class="_"> </span>model<span class="_ _10"> </span>on<span class="_"> </span>the<span class="_ _10"> </span>large-scale<span class="_ _7"> </span>V<span class="_ _2"></span>isual<span class="_"> </span>Genome</div><div class="t m0 xe h8 y60 ff2 fs4 fc0 sc0 ls0 ws0">dataset,<span class="_ _10"> </span>which<span class="_ _7"> </span>contains<span class="_ _10"> </span>94,000<span class="_ _10"> </span>images<span class="_ _7"> </span>and<span class="_ _10"> </span>4,100,000<span class="_ _10"> </span>region</div><div class="t m0 xe h8 y61 ff2 fs4 fc0 sc0 ls0 ws0">captions.<span class="_ _9"> </span>Our<span class="_"> </span>results<span class="_"> </span>sho<span class="_ _2"></span>w<span class="_"> </span>both<span class="_"> </span>performance<span class="_ _7"> </span>and<span class="_"> </span>speed<span class="_"> </span>im-</div><div class="t m0 xe h8 y62 ff2 fs4 fc0 sc0 ls0 ws0">prov<span class="_ _3"></span>ements<span class="_ _9"> </span>ov<span class="_ _3"></span>er<span class="_ _9"> </span>approaches<span class="_ _9"> </span>based<span class="_ _9"> </span>on<span class="_ _9"> </span>pre<span class="_ _3"></span>vious<span class="_ _9"> </span>state<span class="_ _9"> </span>of<span class="_ _9"> </span>the</div><div class="t m0 xe h8 y63 ff2 fs4 fc0 sc0 ls0 ws0">art.<span class="_ _b"> </span>W<span class="_ _4"></span>e<span class="_ 
_a"> </span>make<span class="_ _9"> </span>our<span class="_ _8"> </span>code<span class="_ _8"> </span>and<span class="_ _8"> </span>data<span class="_ _8"> </span>publicly<span class="_ _a"> </span>av<span class="_ _2"></span>ailable<span class="_ _8"> </span>to<span class="_ _8"> </span>sup-</div><div class="t m0 xe h8 y2d ff2 fs4 fc0 sc0 ls0 ws0">port<span class="_"> </span>further<span class="_"> </span>progress<span class="_"> </span>on<span class="_"> </span>the<span class="_"> </span>dense<span class="_"> </span>captioning<span class="_"> </span>task.</div><div class="t m0 x1d h8 y64 ff2 fs4 fc0 sc0 ls0 ws0">1</div></div><div class="pi" data-data='{"ctm":[1.568627,0.000000,0.000000,1.568627,0.000000,0.000000]}'></div></div>
</body>
</html>
<div id="pf2" class="pf w0 h0" data-page-no="2"><div class="pc pc2 w0 h0"><img class="bi x0 y0 w1 h1" alt="" src="https://static.pudn.com/prod/directory_preview_static/624b58668947fd5953ee5026/bg2.jpg"><div class="t m0 xa h6 y65 ff1 fs1 fc0 sc0 ls0 ws0">2.<span class="_ _9"> </span>Related<span class="_ _8"> </span>W<span class="_ _4"></span>ork</div><div class="t m0 xa h8 y66 ff2 fs4 fc0 sc0 ls0 ws0">Our<span class="_ _5"> </span>work<span class="_ _6"> </span>draws<span class="_ _5"> </span>on<span class="_ _6"> </span>recent<span class="_ _5"> </span>work<span class="_ _6"> </span>in<span class="_ _5"> </span>object<span class="_ _6"> </span>detection,<span class="_ _0"> </span>im-</div><div class="t m0 xa h8 y67 ff2 fs4 fc0 sc0 ls0 ws0">age<span class="_"> </span>captioning,<span class="_ _9"> </span>and<span class="_ _9"> </span>soft<span class="_ _9"> </span>spatial<span class="_"> </span>attention<span class="_ _9"> </span>that<span class="_ _9"> </span>allows<span class="_"> </span>down-</div><div class="t m0 xa h8 y68 ff2 fs4 fc0 sc0 ls0 ws0">stream<span class="_"> </span>processing<span class="_"> </span>of<span class="_"> </span>particular<span class="_"> </span>regions<span class="_"> </span>in<span class="_"> </span>the<span class="_"> </span>image.</div><div class="t m0 xa h11 y69 ff1 fs4 fc0 sc0 ls0 ws0">Object<span class="_ _9"> </span>Detection.<span class="_ _6"> </span><span class="ff2">Our<span class="_ _8"> </span>core<span class="_ _9"> </span>visual<span class="_ _9"> </span>processing<span class="_ _8"> </span>module<span class="_ _9"> </span>is<span class="_ _8"> </span>a</span></div><div class="t m0 xa h8 y6a ff2 fs4 fc0 sc0 ls0 ws0">Con<span class="_ _2"></span>volutional<span class="_ _9"> </span>Neural<span class="_ _8"> </span>Network<span class="_ _9"> </span>(CNN)<span class="_ _8"> </span>[<span class="fc1">29</span>,<span class="_ _8"> </span><span class="fc1">26</span>],<span class="_ _8"> </span>which<span class="_ _8"> 
</span>has</div><div class="t m0 xa h8 y6b ff2 fs4 fc0 sc0 ls0 ws0">emerged<span class="_ _a"> </span>as<span class="_ _a"> </span>a<span class="_ _1"> </span>powerful<span class="_ _a"> </span>model<span class="_ _a"> </span>for<span class="_ _1"> </span>visual<span class="_ _1"> </span>recognition<span class="_ _a"> </span>tasks</div><div class="t m0 xa h8 y6c ff2 fs4 fc0 sc0 ls0 ws0">[<span class="fc1">39</span>].<span class="_ _1"> </span>The<span class="_ _9"> </span>first<span class="_ _9"> </span>application<span class="_ _9"> </span>of<span class="_ _9"> </span>these<span class="_"> </span>models<span class="_ _9"> </span>to<span class="_ _9"> </span>dense<span class="_ _9"> </span>predic-</div><div class="t m0 xa h8 y6d ff2 fs4 fc0 sc0 ls0 ws0">tion<span class="_ _a"> </span>tasks<span class="_ _1"> </span>was<span class="_ _a"> </span>introduced<span class="_ _a"> </span>in<span class="_ _1"> </span>R-CNN<span class="_ _a"> </span>[<span class="fc1">14</span>],<span class="_ _5"> </span>where<span class="_ _a"> </span>each<span class="_ _1"> </span>re-</div><div class="t m0 xa h8 y6e ff2 fs4 fc0 sc0 ls0 ws0">gion<span class="_"> </span>of<span class="_"> </span>interest<span class="_"> </span>was<span class="_"> </span>processed<span class="_"> </span>independently<span class="_ _2"></span>.<span class="_ _8"> </span>Further<span class="_"> </span>work</div><div class="t m0 xa h8 y6f ff2 fs4 fc0 sc0 ls0 ws0">has<span class="_ _9"> </span>focused<span class="_ _9"> </span>on<span class="_ _8"> </span>processing<span class="_ _9"> </span>all<span class="_ _8"> </span>regions<span class="_ _9"> </span>with<span class="_ _9"> </span>only<span class="_ _8"> </span>a<span class="_"> </span>single<span class="_ _9"> </span>for-</div><div class="t m0 xa h8 y70 ff2 fs4 fc0 sc0 ls0 ws0">ward<span class="_"> </span>pass<span class="_ _9"> </span>of<span class="_"> </span>the<span class="_ _9"> </span>CNN<span class="_"> </span>[<span class="fc1">17</span>,<span class="_ _9"> </span><span 
class="fc1">13</span>],<span class="_ _9"> </span>and<span class="_"> </span>on<span class="_ _9"> </span>eliminating<span class="_"> </span>explicit</div><div class="t m0 xa h8 y71 ff2 fs4 fc0 sc0 ls0 ws0">region<span class="_ _9"> </span>proposal<span class="_ _8"> </span>methods<span class="_ _8"> </span>by<span class="_ _8"> </span>directly<span class="_ _a"> </span>predicting<span class="_ _8"> </span>the<span class="_ _8"> </span>bound-</div><div class="t m0 xa h8 y72 ff2 fs4 fc0 sc0 ls0 ws0">ing<span class="_ _10"> </span>boxes<span class="_ _10"> </span>either<span class="_ _10"> </span>in<span class="_ _7"> </span>the<span class="_ _10"> </span>image<span class="_ _10"> </span>coordinate<span class="_ _7"> </span>system<span class="_ _10"> </span>[<span class="fc1">46</span>,<span class="_ _7"> </span><span class="fc1">9</span>],<span class="_ _7"> </span>or<span class="_ _10"> </span>in</div><div class="t m0 xa h8 y73 ff2 fs4 fc0 sc0 ls0 ws0">a<span class="_ _9"> </span>fully<span class="_ _9"> </span>con<span class="_ _3"></span>volutional<span class="_"> </span>[<span class="fc1">31</span>]<span class="_ _9"> </span>and<span class="_ _8"> </span>hence<span class="_ _9"> </span>position-in<span class="_ _3"></span>v<span class="_ _3"></span>ariant<span class="_ _9"> </span>set-</div><div class="t m0 xa h8 y74 ff2 fs4 fc0 sc0 ls0 ws0">ting<span class="_"> </span>[<span class="fc1">40</span>,<span class="_"> </span><span class="fc1">38</span>,<span class="_"> </span><span class="fc1">37</span>].<span class="_ _a"> </span>Most<span class="_"> </span>related<span class="_"> </span>to<span class="_"> </span>our<span class="_"> </span>approach<span class="_"> </span>is<span class="_ _9"> </span>the<span class="_"> </span>work</div><div class="t m0 xa h8 y75 ff2 fs4 fc0 sc0 ls0 ws0">of<span class="_ _8"> </span>Ren<span class="_ _a"> </span><span class="ff6">et<span class="_ _a"> </span>al</span>.<span class="_ _a"> </span>[<span class="fc1">38</span>]<span 
class="_ _a"> </span>who<span class="_ _a"> </span>dev<span class="_ _3"></span>elop<span class="_ _8"> </span>a<span class="_ _a"> </span>region<span class="_ _a"> </span>proposal<span class="_ _a"> </span>network</div><div class="t m0 xa h8 y76 ff2 fs4 fc0 sc0 ls0 ws0">(RPN)<span class="_ _5"> </span>that<span class="_ _5"> </span>regresses<span class="_ _5"> </span>from<span class="_ _5"> </span>anchors<span class="_ _5"> </span>to<span class="_ _6"> </span>regions<span class="_ _1"> </span>of<span class="_ _6"> </span>interest.</div><div class="t m0 xa h8 y77 ff2 fs4 fc0 sc0 ls0 ws0">Howe<span class="_ _2"></span>ver<span class="_ _2"></span>,<span class="_ _5"> </span>they<span class="_ _a"> </span>adopt<span class="_ _1"> </span>a<span class="_ _1"> </span>4-step<span class="_ _1"> </span>optimization<span class="_ _1"> </span>process,<span class="_ _5"> </span>while</div><div class="t m0 xa h8 y78 ff2 fs4 fc0 sc0 ls0 ws0">our<span class="_"> </span>approach<span class="_"> </span>does<span class="_ _9"> </span>not<span class="_"> </span>require<span class="_ _9"> </span>training<span class="_"> </span>pipelines.<span class="_ _a"> </span>Addition-</div><div class="t m0 xa h8 y79 ff2 fs4 fc0 sc0 ls0 ws0">ally<span class="_ _2"></span>,<span class="_"> </span>we<span class="_"> </span>replace<span class="_"> </span>their<span class="_"> </span>RoI<span class="_"> </span>pooling<span class="_ _9"> </span>mechanism<span class="_"> </span>with<span class="_"> </span>a<span class="_"> </span>differ<span class="_ _2"></span>-</div><div class="t m0 xa h8 y7a ff2 fs4 fc0 sc0 ls0 ws0">entiable,<span class="_ _8"> </span>spatial<span class="_ _9"> </span>soft<span class="_ _8"> </span>attention<span class="_ _9"> </span>mechanism<span class="_ _8"> </span>[<span class="fc1">19</span>,<span class="_ _8"> </span><span class="fc1">16</span>].<span class="_ _0"> </span>In<span class="_ _9"> </span>par-</div><div class="t m0 xa h8 y7b ff2 fs4 fc0 sc0 ls0 ws0">ticular<span 
class="_ _2"></span>,<span class="_ _8"> </span>this<span class="_ _9"> </span>change<span class="_ _9"> </span>allo<span class="_ _3"></span>ws<span class="_ _9"> </span>us<span class="_ _9"> </span>to<span class="_ _9"> </span>backpropagate<span class="_ _9"> </span>through<span class="_ _9"> </span>the</div><div class="t m0 xa h8 y7c ff2 fs4 fc0 sc0 ls0 ws0">region<span class="_"> </span>proposal<span class="_"> </span>network<span class="_"> </span>and<span class="_"> </span>train<span class="_"> </span>the<span class="_"> </span>whole<span class="_"> </span>model<span class="_"> </span>jointly<span class="_ _4"></span>.</div><div class="t m0 xa h11 y7d ff1 fs4 fc0 sc0 ls0 ws0">Image<span class="_ _5"> </span>Captioning.<span class="_ _13"> </span><span class="ff2">Sev<span class="_ _2"></span>eral<span class="_ _6"> </span>pioneering<span class="_ _6"> </span>approaches<span class="_ _6"> </span>ha<span class="_ _3"></span>ve</span></div><div class="t m0 xa h8 y7e ff2 fs4 fc0 sc0 ls0 ws0">explored<span class="_ _6"> </span>the<span class="_ _6"> </span>task<span class="_ _6"> </span>of<span class="_ _0"> </span>describing<span class="_ _6"> </span>images<span class="_ _0"> </span>with<span class="_ _6"> </span>natural<span class="_ _0"> </span>lan-</div><div class="t m0 xa h8 y7f ff2 fs4 fc0 sc0 ls0 ws0">guage<span class="_ _6"> </span>[<span class="fc1">1</span>,<span class="_ _0"> </span><span class="fc1">27</span>,<span class="_ _0"> </span><span class="fc1">12</span>,<span class="_ _0"> </span><span class="fc1">34</span>,<span class="_ _6"> </span><span class="fc1">42</span>,<span class="_ _0"> </span><span class="fc1">43</span>,<span class="_ _0"> </span><span class="fc1">28</span>,<span class="_ _6"> </span><span class="fc1">20</span>].<span class="_ _14"> </span>More<span class="_ _6"> </span>recent<span class="_ _0"> </span>ap-</div><div class="t m0 xa h8 y80 ff2 fs4 fc0 sc0 ls0 ws0">proaches<span class="_"> </span>based<span class="_"> </span>on<span 
class="_"> </span>neural<span class="_"> </span>networks<span class="_"> </span>hav<span class="_ _2"></span>e<span class="_ _9"> </span>adopted<span class="_"> </span>Recurrent</div><div class="t m0 xa h8 y81 ff2 fs4 fc0 sc0 ls0 ws0">Neural<span class="_ _9"> </span>Networks<span class="_"> </span>(RNNs)<span class="_ _9"> </span>[<span class="fc1">50</span>,<span class="_ _9"> </span><span class="fc1">18</span>]<span class="_ _8"> </span>as<span class="_ _9"> </span>the<span class="_ _9"> </span>core<span class="_ _9"> </span>architectural</div><div class="t m0 xa h8 y82 ff2 fs4 fc0 sc0 ls0 ws0">element<span class="_ _1"> </span>for<span class="_ _1"> </span>generating<span class="_ _1"> </span>captions.<span class="_ _12"> </span>These<span class="_ _1"> </span>models<span class="_ _5"> </span>hav<span class="_ _2"></span>e<span class="_ _5"> </span>pre-</div><div class="t m0 xa h8 y83 ff2 fs4 fc0 sc0 ls0 ws0">viously<span class="_ _6"> </span>been<span class="_ _5"> </span>used<span class="_ _6"> </span>in<span class="_ _6"> </span>language<span class="_ _6"> </span>modeling<span class="_ _5"> </span>[<span class="fc1">2</span>,<span class="_ _6"> </span><span class="fc1">15</span>,<span class="_ _6"> </span><span class="fc1">33</span>,<span class="_ _6"> </span><span class="fc1">44</span>],</div><div class="t m0 xa h8 y84 ff2 fs4 fc0 sc0 ls0 ws0">where<span class="_ _5"> </span>they<span class="_ _5"> </span>are<span class="_ _5"> </span>known<span class="_ _5"> </span>to<span class="_ _5"> </span>learn<span class="_ _6"> </span>po<span class="_ _3"></span>werful<span class="_ _5"> </span>long-term<span class="_ _5"> </span>inter-</div><div class="t m0 xa h8 y85 ff2 fs4 fc0 sc0 ls0 ws0">actions<span class="_ _9"> </span>[<span class="fc1">22</span>].<span class="_ _b"> </span>Several<span class="_ _9"> </span>recent<span class="_ _9"> </span>approaches<span class="_ _8"> </span>to<span class="_ _8"> </span>Image<span class="_ _8"> </span>Caption-</div><div 
class="t m0 xa h8 y86 ff2 fs4 fc0 sc0 ls0 ws0">ing<span class="_"> </span>[<span class="fc1">32</span>,<span class="_"> </span><span class="fc1">21</span>,<span class="_"> </span><span class="fc1">49</span>,<span class="_"> </span><span class="fc1">8</span>,<span class="_ _9"> </span><span class="fc1">4</span>,<span class="_"> </span><span class="fc1">24</span>,<span class="_"> </span><span class="fc1">11</span>]<span class="_"> </span>rely<span class="_"> </span>on<span class="_ _9"> </span>an<span class="_"> </span>RNN</div><div class="t m0 xa h8 y87 ff2 fs4 fc0 sc0 ls0 ws0">language<span class="_ _a"> </span>model<span class="_ _a"> </span>conditioned<span class="_ _1"> </span>on<span class="_ _a"> </span>image<span class="_ _a"> </span>information,<span class="_ _1"> </span>possi-</div><div class="t m0 xa h8 y88 ff2 fs4 fc0 sc0 ls0 ws0">bly<span class="_ _8"> </span>with<span class="_ _8"> </span>soft<span class="_ _8"> </span>attention<span class="_ _8"> </span>mechanisms<span class="_ _8"> </span>[<span class="fc1">51</span>,<span class="_ _a"> </span><span class="fc1">5</span>].<span class="_ _b"> </span>Similar<span class="_ _a"> </span>to<span class="_ _8"> </span>our</div><div class="t m0 xa h8 y89 ff2 fs4 fc0 sc0 ls0 ws0">work,<span class="_ _a"> </span>Karpathy<span class="_ _a"> </span>and<span class="_ _1"> </span>Fei-Fei<span class="_ _a"> </span>[<span class="fc1">21</span>]<span class="_ _a"> </span>run<span class="_ _1"> </span>an<span class="_ _a"> </span>image<span class="_ _a"> </span>captioning</div><div class="t m0 xa h8 y8a ff2 fs4 fc0 sc0 ls0 ws0">model<span class="_ _1"> </span>on<span class="_ _5"> </span>regions<span class="_ _1"> </span>but<span class="_ _1"> </span>they<span class="_ _1"> </span>do<span class="_ _5"> </span>not<span class="_ _1"> </span>tackle<span class="_ _5"> </span>the<span class="_ _5"> </span>joint<span class="_ _1"> </span>task<span class="_ _5"> 
</span>of</div><div class="t m0 xa h8 y8b ff2 fs4 fc0 sc0 ls0 ws0">detection<span class="_"> </span>and<span class="_ _7"> </span>description<span class="_"> </span>in<span class="_"> </span>one<span class="_ _7"> </span>model.<span class="_ _8"> </span>Our<span class="_"> </span>model<span class="_"> </span>is<span class="_ _7"> </span>end-to-</div><div class="t m0 xa h8 y8c ff2 fs4 fc0 sc0 ls0 ws0">end<span class="_ _8"> </span>and<span class="_ _a"> </span>designed<span class="_ _8"> </span>in<span class="_ _a"> </span>such<span class="_ _8"> </span>a<span class="_"> </span>way<span class="_ _a"> </span>that<span class="_ _8"> </span>the<span class="_ _a"> </span>prediction<span class="_ _8"> </span>for<span class="_ _a"> </span>each</div><div class="t m0 xa h8 y8d ff2 fs4 fc0 sc0 ls0 ws0">region<span class="_ _9"> </span>is<span class="_ _9"> </span>a<span class="_ _9"> </span>function<span class="_ _8"> </span>of<span class="_ _9"> </span>the<span class="_ _8"> </span>global<span class="_ _9"> </span>image<span class="_ _8"> </span>context,<span class="_ _9"> </span>which<span class="_ _9"> </span>we</div><div class="t m0 xa h8 y8e ff2 fs4 fc0 sc0 ls0 ws0">show<span class="_ _10"> </span>also<span class="_ _10"> </span>ultimately<span class="_"> </span>leads<span class="_ _10"> </span>to<span class="_ _7"> </span>stronger<span class="_ _7"> </span>performance.<span class="_ _9"> </span>Finally<span class="_ _2"></span>,</div><div class="t m0 xa h8 y8f ff2 fs4 fc0 sc0 ls0 ws0">the<span class="_"> </span>metrics<span class="_"> </span>we<span class="_"> </span>de<span class="_ _2"></span>velop<span class="_"> </span>for<span class="_"> </span>the<span class="_ _7"> </span>dense<span class="_"> </span>captioning<span class="_"> </span>task<span class="_"> </span>are<span class="_"> </span>in-</div><div class="t m0 xa h8 y90 ff2 fs4 fc0 sc0 ls0 ws0">spired<span class="_ _7"> </span>by<span class="_"> </span>metrics<span class="_ _7"> </span>dev<span class="_ _2"></span>eloped<span 
class="_"> </span>for<span class="_ _7"> </span>image<span class="_"> </span>captioning<span class="_ _7"> </span>[<span class="fc1">48</span>,<span class="_ _7"> </span><span class="fc1">7</span>,<span class="_"> </span><span class="fc1">3</span>].</div><div class="t m0 xa h6 y91 ff1 fs1 fc0 sc0 ls0 ws0">3.<span class="_ _9"> </span>Model</div><div class="t m0 xa h11 y92 ff1 fs4 fc0 sc0 ls0 ws0">Overview<span class="_ _4"></span>.<span class="_ _8"> </span><span class="ff2">Our<span class="_"> </span>goal<span class="_"> </span>is<span class="_"> </span>to<span class="_"> </span>design<span class="_"> </span>an<span class="_ _9"> </span>architecture<span class="_"> </span>that<span class="_"> </span>jointly</span></div><div class="t m0 xa h8 y5f ff2 fs4 fc0 sc0 ls0 ws0">localizes<span class="_ _1"> </span>regions<span class="_ _1"> </span>of<span class="_ _5"> </span>interest<span class="_ _1"> </span>and<span class="_ _5"> </span>then<span class="_ _5"> </span>describes<span class="_ _1"> </span>each<span class="_ _5"> </span>with</div><div class="t m0 xa h8 y93 ff2 fs4 fc0 sc0 ls0 ws0">natural<span class="_ _5"> </span>language.<span class="_ _15"> </span>The<span class="_ _5"> </span>primary<span class="_ _5"> </span>challenge<span class="_ _5"> </span>is<span class="_ _6"> </span>to<span class="_ _5"> </span>dev<span class="_ _2"></span>elop<span class="_ _6"> </span>a</div><div class="t m0 xa h8 y61 ff2 fs4 fc0 sc0 ls0 ws0">model<span class="_ _10"> </span>that<span class="_"> </span>supports<span class="_ _10"> </span>end-to-end<span class="_ _10"> </span>training<span class="_ _7"> </span>with<span class="_ _7"> </span>a<span class="_ _7"> </span>single<span class="_ _7"> </span>step<span class="_ _10"> </span>of</div><div class="t m0 xa h8 y62 ff2 fs4 fc0 sc0 ls0 ws0">optimization,<span class="_ _7"> </span>and<span class="_ _7"> </span>both<span class="_ _7"> </span>efficient<span class="_ _10"> </span>and<span class="_ _10"> </span>effecti<span class="_ 
_3"></span>ve<span class="_ _10"> </span>inference.<span class="_ _8"> </span>Our</div><div class="t m0 xa h8 y63 ff2 fs4 fc0 sc0 ls0 ws0">proposed<span class="_"> </span>architecture<span class="_ _9"> </span>(see<span class="_ _9"> </span>Figure<span class="_"> </span><span class="fc4">2</span>)<span class="_ _9"> </span>draws<span class="_"> </span>on<span class="_ _9"> </span>architectural</div><div class="t m0 xa h8 y2d ff2 fs4 fc0 sc0 ls0 ws0">elements<span class="_"> </span>present<span class="_ _9"> </span>in<span class="_ _9"> </span>recent<span class="_ _9"> </span>work<span class="_"> </span>on<span class="_ _9"> </span>object<span class="_ _9"> </span>detection,<span class="_ _9"> </span>image</div><div class="t m0 xe h8 y65 ff2 fs4 fc0 sc0 ls0 ws0">captioning<span class="_ _8"> </span>and<span class="_ _a"> </span>soft<span class="_ _a"> </span>spatial<span class="_ _a"> </span>attention<span class="_ _a"> </span>to<span class="_ _a"> </span>simultaneously<span class="_ _a"> </span>ad-</div><div class="t m0 xe h8 y94 ff2 fs4 fc0 sc0 ls0 ws0">dress<span class="_"> </span>these<span class="_"> </span>design<span class="_"> </span>constraints.</div><div class="t m0 xe h8 y95 ff2 fs4 fc0 sc0 ls0 ws0">In<span class="_ _b"> </span>Section<span class="_ _b"> </span><span class="fc4">3.1<span class="_ _c"> </span></span>we<span class="_ _b"> </span>first<span class="_ _b"> </span>describe<span class="_ _b"> </span>the<span class="_ _c"> </span>components<span class="_ _b"> </span>of<span class="_ _b"> </span>our</div><div class="t m0 xe h8 y96 ff2 fs4 fc0 sc0 ls0 ws0">model.<span class="_ _12"> </span>Then<span class="_ _a"> </span>in<span class="_ _1"> </span>Sections<span class="_ _1"> </span><span class="fc4">3.2<span class="_ _1"> </span></span>and<span class="_ _1"> </span><span class="fc4">3.3<span class="_ _1"> </span></span>we<span class="_ _a"> </span>address<span class="_ _1"> </span>the<span class="_ _1"> </span>loss</div><div class="t m0 xe h8 y97 
ff2 fs4 fc0 sc0 ls0 ws0">function<span class="_"> </span>and<span class="_"> </span>the<span class="_"> </span>details<span class="_"> </span>of<span class="_"> </span>training<span class="_"> </span>and<span class="_"> </span>inference.</div><div class="t m0 xe h12 y98 ff1 fs9 fc0 sc0 ls0 ws0">3.1.<span class="_ _9"> </span>Model<span class="_ _9"> </span>Ar<span class="_ _2"></span>chitecture</div><div class="t m0 xe h11 y99 ff1 fs4 fc0 sc0 ls0 ws0">3.1.1<span class="_ _16"> </span>Con<span class="_ _2"></span>volutional<span class="_ _9"> </span>Netw<span class="_ _2"></span>ork</div><div class="t m0 xe h8 y9a ff2 fs4 fc0 sc0 ls0 ws0">W<span class="_ _4"></span>e<span class="_"> </span>use<span class="_"> </span>the<span class="_"> </span>V<span class="_ _3"></span>GG-16<span class="_"> </span>architecture<span class="_"> </span>[<span class="fc1">41</span>]<span class="_ _7"> </span>for<span class="_"> </span>its<span class="_"> </span>state-of-the-art</div><div class="t m0 xe h13 y9b ff2 fs4 fc0 sc0 ls0 ws0">performance<span class="_ _1"> </span>[<span class="fc1">39</span>].<span class="_ _e"> </span>It<span class="_ _1"> </span>consists<span class="_ _5"> </span>of<span class="_ _5"> </span>13<span class="_ _1"> </span>layers<span class="_ _5"> </span>of<span class="_ _5"> </span><span class="ffa">3<span class="_ _8"> </span><span class="ffb">×<span class="_ _8"> </span></span>3<span class="_ _5"> </span></span>con-</div><div class="t m0 xe h13 y9c ff2 fs4 fc0 sc0 ls0 ws0">volutions<span class="_"> </span>interspersed<span class="_ _8"> </span>with<span class="_ _8"> </span>5<span class="_ _9"> </span>layers<span class="_ _8"> </span>of<span class="_ _9"> </span><span class="ffa">2<span class="_ _9"> </span><span class="ffb">×<span class="_ _17"> </span></span>2<span class="_ _9"> </span></span>max<span class="_ _8"> </span>pooling.</div><div class="t m0 xe h8 y9d ff2 fs4 fc0 sc0 ls0 ws0">W<span class="_ _4"></span>e<span class="_ _5"> </span>remove<span 
class="_ _5"> </span>the<span class="_ _5"> </span>final<span class="_ _5"> </span>pooling<span class="_ _6"> </span>layer<span class="_ _2"></span>,<span class="_ _0"> </span>so<span class="_ _5"> </span>an<span class="_ _5"> </span>input<span class="_ _6"> </span>image<span class="_ _5"> </span>of</div><div class="t m0 xe h13 y9e ff2 fs4 fc0 sc0 ls0 ws0">shape<span class="_"> </span><span class="ffa">3<span class="_ _10"> </span><span class="ffb">×<span class="_ _10"> </span><span class="ffc">W<span class="_ _9"> </span></span>×<span class="_ _10"> </span><span class="ffc">H<span class="_ _a"> </span></span></span></span>giv<span class="_ _2"></span>es<span class="_"> </span>rise<span class="_"> </span>to<span class="_"> </span>a<span class="_ _7"> </span>tensor<span class="_"> </span>of<span class="_"> </span>features<span class="_"> </span>of<span class="_ _7"> </span>shape</div><div class="t m0 xe h13 y9f ffc fs4 fc0 sc0 ls0 ws0">C<span class="_ _18"> </span><span class="ffb">×<span class="_ _19"></span></span>W</div><div class="t m0 x1e h14 ya0 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x1f h13 y9f ffb fs4 fc0 sc0 ls0 ws0">×<span class="_ _19"></span><span class="ffc">H</span></div><div class="t m0 x20 h14 ya0 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x21 h13 y9f ff2 fs4 fc0 sc0 ls0 ws0">where<span class="_ _7"> </span><span class="ffc">C<span class="_ _a"> </span><span class="ffa">=<span class="_ _9"> </span>512</span></span>,<span class="_"> </span><span class="ffc">W</span></div><div class="t m0 x22 h14 ya0 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x23 h13 y9f ffa fs4 fc0 sc0 ls0 ws0">=</div><div class="t m0 x24 h15 ya1 ffe fs4 fc0 sc0 ls0 ws0">⌊</div><div class="t m0 x25 h16 ya2 fff fsa fc0 sc0 ls0 ws0">W</div><div class="t m0 x25 h17 ya3 ff10 fsa fc0 sc0 ls0 ws0">16</div><div class="t m0 x26 h15 ya4 ffe fs4 fc0 sc0 ls0 ws0">⌋</div><div class="t m0 x27 h13 y9f ff2 fs4 fc0 sc0 ls0 ws0">,<span class="_ _7"> </span>and<span class="_ _7"> 
</span><span class="ffc">H</span></div><div class="t m0 x28 h14 ya0 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x29 h13 y9f ffa fs4 fc0 sc0 ls0 ws0">=</div><div class="t m0 x2a h15 ya4 ffe fs4 fc0 sc0 ls0 ws0">⌊</div><div class="t m0 x2b h16 ya5 fff fsa fc0 sc0 ls0 ws0">H</div><div class="t m0 x2b h17 ya3 ff10 fsa fc0 sc0 ls0 ws0">16</div><div class="t m0 x2c h15 ya4 ffe fs4 fc0 sc0 ls0 ws0">⌋</div><div class="t m0 x2d h8 y9f ff2 fs4 fc0 sc0 ls0 ws0">.</div><div class="t m0 xe h8 ya6 ff2 fs4 fc0 sc0 ls0 ws0">The<span class="_ _a"> </span>output<span class="_ _a"> </span>of<span class="_ _a"> </span>this<span class="_ _1"> </span>network<span class="_ _8"> </span>encodes<span class="_ _1"> </span>the<span class="_ _a"> </span>appearance<span class="_ _a"> </span>of<span class="_ _a"> </span>the</div><div class="t m0 xe h8 ya7 ff2 fs4 fc0 sc0 ls0 ws0">image<span class="_ _a"> </span>at<span class="_ _a"> </span>a<span class="_ _a"> </span>set<span class="_ _a"> </span>of<span class="_ _1"> </span>uniformly<span class="_ _a"> </span>sampled<span class="_ _a"> </span>image<span class="_ _a"> </span>locations,<span class="_ _1"> </span>and</div><div class="t m0 xe h8 ya8 ff2 fs4 fc0 sc0 ls0 ws0">forms<span class="_"> </span>the<span class="_"> </span>input<span class="_"> </span>to<span class="_"> </span>the<span class="_"> </span>localization<span class="_"> </span>layer<span class="_ _2"></span>.</div><div class="t m0 xe h11 ya9 ff1 fs4 fc0 sc0 ls0 ws0">3.1.2<span class="_ _16"> </span>Fully<span class="_ _17"> </span>Con<span class="_ _2"></span>volutional<span class="_ _17"> </span>Localization<span class="_ _17"> </span>Layer</div><div class="t m0 xe h8 yaa ff2 fs4 fc0 sc0 ls0 ws0">The<span class="_ _6"> </span>localization<span class="_ _5"> </span>layer<span class="_ _6"> </span>receiv<span class="_ _3"></span>es<span class="_ _6"> </span>an<span class="_ _5"> </span>input<span class="_ _6"> </span>tensor<span class="_ _6"> </span>of<span class="_ _6"> 
</span>activ<span class="_ _2"></span>a-</div><div class="t m0 xe h8 yab ff2 fs4 fc0 sc0 ls0 ws0">tions,<span class="_"> </span>identifies<span class="_ _9"> </span>spatial<span class="_"> </span>regions<span class="_"> </span>of<span class="_"> </span>interest<span class="_ _9"> </span>and<span class="_"> </span>smoothly<span class="_ _9"> </span>ex-</div><div class="t m0 xe h8 yac ff2 fs4 fc0 sc0 ls0 ws0">tracts<span class="_ _1"> </span>a<span class="_ _5"> </span>fixed-sized<span class="_ _1"> </span>representation<span class="_ _1"> </span>from<span class="_ _5"> </span>each<span class="_ _5"> </span>region.<span class="_ _1a"> </span>Our</div><div class="t m0 xe h8 yad ff2 fs4 fc0 sc0 ls0 ws0">approach<span class="_ _5"> </span>is<span class="_ _1"> </span>based<span class="_ _5"> </span>on<span class="_ _5"> </span>that<span class="_ _5"> </span>of<span class="_ _5"> </span>Faster<span class="_ _5"> </span>R-CNN<span class="_ _5"> </span>[<span class="fc1">38</span>],<span class="_ _6"> </span>but<span class="_ _1"> </span>we</div><div class="t m0 xe h8 yae ff2 fs4 fc0 sc0 ls0 ws0">replace<span class="_ _b"> </span>their<span class="_ _b"> </span>RoI<span class="_ _b"> </span>pooling<span class="_ _c"> </span>mechanism<span class="_ _b"> </span>[<span class="fc1">13</span>]<span class="_ _b"> </span>with<span class="_ _b"> </span>bilinear</div><div class="t m0 xe h8 yaf ff2 fs4 fc0 sc0 ls0 ws0">interpolation<span class="_ _6"> </span>[<span class="fc1">19</span>],<span class="_ _b"> </span>allowing<span class="_ _6"> </span>our<span class="_ _6"> </span>model<span class="_ _6"> </span>to<span class="_ _6"> </span>propagate<span class="_ _0"> </span>gra-</div><div class="t m0 xe h8 yb0 ff2 fs4 fc0 sc0 ls0 ws0">dients<span class="_ _5"> </span>backward<span class="_ _6"> </span>through<span class="_ _5"> </span>the<span class="_ _6"> </span>coordinates<span class="_ _5"> </span>of<span class="_ _6"> </span>predicted<span class="_ _5"> </span>re-</div><div 
class="t m0 xe h8 yb1 ff2 fs4 fc0 sc0 ls0 ws0">gions.<span class="_ _9"> </span>This<span class="_"> </span>modification<span class="_ _10"> </span>opens<span class="_"> </span>up<span class="_ _10"> </span>the<span class="_"> </span>possibil<span class="_ _3"></span>ity<span class="_ _7"> </span>of<span class="_"> </span>predict-</div><div class="t m0 xe h8 yb2 ff2 fs4 fc0 sc0 ls0 ws0">ing<span class="_"> </span>af<span class="_ _3"></span>fine<span class="_"> </span>or<span class="_"> </span>morphed<span class="_"> </span>region<span class="_"> </span>proposals<span class="_ _7"> </span>instead<span class="_"> </span>of<span class="_"> </span>bounding</div><div class="t m0 xe h8 yb3 ff2 fs4 fc0 sc0 ls0 ws0">boxes<span class="_"> </span>[<span class="fc1">19</span>],<span class="_"> </span>b<span class="_ _2"></span>ut<span class="_"> </span>we<span class="_"> </span>leave<span class="_"> </span>these<span class="_"> </span>e<span class="_ _3"></span>xtensions<span class="_"> </span>to<span class="_"> </span>future<span class="_"> </span>work.</div><div class="t m0 xe h11 yb4 ff1 fs4 fc0 sc0 ls0 ws0">Inputs/outputs<span class="ff2">.<span class="_ _b"> </span>The<span class="_ _9"> </span>localization<span class="_ _8"> </span>layer<span class="_ _8"> </span>accepts<span class="_ _8"> </span>a<span class="_ _8"> </span>tensor<span class="_ _8"> </span>of</span></div><div class="t m0 xe h13 yb5 ff2 fs4 fc0 sc0 ls0 ws0">activ<span class="_ _2"></span>ations<span class="_ _9"> </span>of<span class="_ _9"> </span>size<span class="_ _9"> </span><span class="ffc">C<span class="_ _8"> </span><span class="ffb">×<span class="_ _17"> </span></span>W</span></div><div class="t m0 x2e h14 yb6 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x2f h13 yb5 ffb fs4 fc0 sc0 ls0 ws0">×<span class="_ _17"> </span><span class="ffc">H</span></div><div class="t m0 x30 h14 yb6 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x31 h8 yb5 ff2 fs4 fc0 sc0 ls0 ws0">.<span class="_ _5"> 
</span>It<span class="_ _9"> </span>then<span class="_ _9"> </span>internally<span class="_ _9"> </span>selects</div><div class="t m0 xe h13 yb7 ffc fs4 fc0 sc0 ls0 ws0">B<span class="_ _9"> </span><span class="ff2">regions<span class="_ _7"> </span>of<span class="_"> </span>interest<span class="_ _7"> </span>and<span class="_"> </span>returns<span class="_ _7"> </span>three<span class="_"> </span>output<span class="_ _7"> </span>tensors<span class="_"> </span>gi<span class="_ _2"></span>ving</span></div><div class="t m0 xe h8 yb8 ff2 fs4 fc0 sc0 ls0 ws0">information<span class="_"> </span>about<span class="_"> </span>these<span class="_"> </span>regions:</div><div class="t m0 x32 h13 yb9 ff2 fs4 fc0 sc0 ls0 ws0">1.<span class="_ _b"> </span><span class="ff1">Region<span class="_ _17"> </span>Coordinates</span>:<span class="_ _8"> </span>A<span class="_"> </span>matrix<span class="_"> </span>of<span class="_"> </span>shape<span class="_"> </span><span class="ffc">B<span class="_ _17"> </span><span class="ffb">×<span class="_ _10"> </span><span class="ffa">4<span class="_ _17"> </span></span></span></span>giving</div><div class="t m0 x33 h8 yba ff2 fs4 fc0 sc0 ls0 ws0">bounding<span class="_"> </span>box<span class="_"> </span>coordinates<span class="_"> </span>for<span class="_"> </span>each<span class="_"> </span>output<span class="_"> </span>region.</div><div class="t m0 x32 h13 ybb ff2 fs4 fc0 sc0 ls0 ws0">2.<span class="_ _b"> </span><span class="ff1">Region<span class="_ _5"> </span>Scores</span>:<span class="_ _f"> </span>A<span class="_ _5"> </span>vector<span class="_ _1"> </span>of<span class="_ _1"> </span>length<span class="_ _5"> </span><span class="ffc">B<span class="_ _6"> </span></span>giving<span class="_ _1"> </span>a<span class="_ _5"> </span>con-</div><div class="t m0 x33 h8 ybc ff2 fs4 fc0 sc0 ls0 ws0">fidence<span class="_ _6"> </span>score<span class="_ _6"> </span>for<span class="_ _6"> </span>each<span class="_ _6"> </span>output<span 
class="_ _6"> </span>region.<span class="_ _13"> </span>Regions<span class="_ _5"> </span>with</div><div class="t m0 x33 h8 ybd ff2 fs4 fc0 sc0 ls0 ws0">high<span class="_ _a"> </span>confidence<span class="_ _a"> </span>scores<span class="_ _a"> </span>are<span class="_ _a"> </span>more<span class="_ _a"> </span>likely<span class="_ _8"> </span>to<span class="_ _a"> </span>correspond</div><div class="t m0 x33 h8 ybe ff2 fs4 fc0 sc0 ls0 ws0">to<span class="_"> </span>ground-truth<span class="_"> </span>regions<span class="_"> </span>of<span class="_"> </span>interest.</div><div class="t m0 x32 h13 ybf ff2 fs4 fc0 sc0 ls0 ws0">3.<span class="_ _b"> </span><span class="ff1">Region<span class="_ _9"> </span>F<span class="_ _2"></span>eatures<span class="ff2">:<span class="_ _8"> </span>A<span class="_"> </span>tensor<span class="_"> </span>of<span class="_"> </span>shape<span class="_"> </span><span class="ffc">B<span class="_ _17"> </span><span class="ffb">×<span class="_ _17"> </span></span>C<span class="_ _9"> </span><span class="ffb">×<span class="_ _7"> </span></span>X<span class="_ _9"> </span><span class="ffb">×<span class="_ _7"> </span></span>Y</span></span></span></div><div class="t m0 x33 h8 yc0 ff2 fs4 fc0 sc0 ls0 ws0">giving<span class="_ _7"> </span>features<span class="_"> </span>for<span class="_ _7"> </span>output<span class="_"> </span>regions;<span class="_ _7"> </span>each<span class="_"> </span>is<span class="_"> </span>represented<span class="_"> </span>by<span class="_ _7"> </span>an</div><div class="t m0 x33 h13 yc1 ffc fs4 fc0 sc0 ls0 ws0">X<span class="_ _9"> </span><span class="ffb">×<span class="_ _17"> </span></span>Y<span class="_ _0"> </span><span class="ff2">grid<span class="_"> </span>of<span class="_"> </span></span>C<span class="_ _1b"></span><span class="ff2">-dimensional<span class="_"> </span>features.</span></div><div class="t m0 xe h11 yc2 ff1 fs4 fc0 sc0 ls0 ws0">Con<span class="_ _2"></span>volutional<span class="_ _5"> </span>Anchors<span 
class="ff2">.<span class="_ _e"> </span>Similar<span class="_ _5"> </span>to<span class="_ _5"> </span>Faster<span class="_ _5"> </span>R-CNN<span class="_ _5"> </span>[<span class="fc1">38</span>],</span></div><div class="t m0 xe h8 y92 ff2 fs4 fc0 sc0 ls0 ws0">our<span class="_ _9"> </span>localization<span class="_ _9"> </span>layer<span class="_ _9"> </span>predicts<span class="_ _8"> </span>region<span class="_ _9"> </span>proposals<span class="_ _9"> </span>by<span class="_ _9"> </span>regress-</div><div class="t m0 xe h8 y5f ff2 fs4 fc0 sc0 ls0 ws0">ing<span class="_ _1"> </span>offsets<span class="_ _a"> </span>from<span class="_ _1"> </span>a<span class="_ _5"> </span>set<span class="_ _1"> </span>of<span class="_ _1"> </span>translation-in<span class="_ _3"></span>variant<span class="_ _a"> </span>anchors.<span class="_ _12"> </span>In</div><div class="t m0 xe h13 y60 ff2 fs4 fc0 sc0 ls0 ws0">particular<span class="_ _2"></span>,<span class="_ _6"> </span>we<span class="_ _1"> </span>project<span class="_ _5"> </span>each<span class="_ _5"> </span>point<span class="_ _1"> </span>in<span class="_ _5"> </span>the<span class="_ _5"> </span><span class="ffc">W</span></div><div class="t m0 x34 h14 yc3 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x35 h13 y60 ffb fs4 fc0 sc0 ls0 ws0">×<span class="_ _8"> </span><span class="ffc">H</span></div><div class="t m0 x36 h14 yc3 ffd fsa fc0 sc0 ls0 ws0">0</div><div class="t m0 x37 h8 y60 ff2 fs4 fc0 sc0 ls0 ws0">grid<span class="_ _1"> </span>of</div><div class="t m0 xe h13 y61 ff2 fs4 fc0 sc0 ls0 ws0">input<span class="_ _9"> </span>features<span class="_"> </span>back<span class="_ _9"> </span>into<span class="_ _9"> </span>the<span class="_ _9"> </span><span class="ffc">W<span class="_ _1"> </span><span class="ffb">×<span class="_ _17"> </span></span>H<span class="_ _1"> </span></span>image<span class="_ _9"> </span>plane,<span class="_ _9"> </span>and<span class="_ _9"> </span>con-</div><div class="t m0 xe h13 
y62 ff2 fs4 fc0 sc0 ls0 ws0">sider<span class="_ _a"> </span><span class="ffc">k<span class="_ _5"> </span></span>anchor<span class="_ _a"> </span>boxes<span class="_ _a"> </span>of<span class="_ _a"> </span>different<span class="_ _a"> </span>aspect<span class="_ _a"> </span>ratios<span class="_ _a"> </span>centered<span class="_ _1"> </span>at</div><div class="t m0 xe h13 y63 ff2 fs4 fc0 sc0 ls0 ws0">this<span class="_ _6"> </span>projected<span class="_ _6"> </span>point.<span class="_ _14"> </span>F<span class="_ _2"></span>or<span class="_ _0"> </span>each<span class="_ _6"> </span>of<span class="_ _0"> </span>these<span class="_ _6"> </span><span class="ffc">k<span class="_ _b"> </span></span>anchor<span class="_ _6"> </span>boxes,</div><div class="t m0 xe h8 y2d ff2 fs4 fc0 sc0 ls0 ws0">the<span class="_ _a"> </span>localization<span class="_ _8"> </span>layer<span class="_ _a"> </span>predicts<span class="_ _a"> </span>a<span class="_ _a"> </span>confidence<span class="_ _a"> </span>score<span class="_ _a"> </span>and<span class="_ _a"> </span>four</div></div><div class="pi" data-data='{"ctm":[1.568627,0.000000,0.000000,1.568627,0.000000,0.000000]}'></div></div>