Fei-Fei Li paper collection / Computer vision papers / Stanford, part 4

  • n0_576111
    About the author
  • 41.8MB
    File size
  • zip
    File format
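  • 2022-04-04 20:15
    Upload date
All papers from the Stanford Vision Lab from 2004 to the present, split into five archives because of the resource size limit. Anyone interested in computer vision can download them for study; they are useful for understanding how the field has developed.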
2016.zip
  • 2016
  • pusiol2016miccai.pdf
    2.2MB
  • ramanathan2016cvpr.pdf
    6.3MB
  • GreeneCognition2016.pdf
    1MB
  • iordan-etal-neuroimage-2016.pdf
    360KB
  • KarpathyICLR2016.pdf
    2.8MB
  • haque2016cvpr.pdf
    4.5MB
  • SAILORS-SIGCSE2016.pdf
    288.7KB
  • yeung2016cvpr.pdf
    1.3MB
  • CC2016.pdf
    966.2KB
  • johnson2016cvpr.pdf
    7.9MB
  • zhu2016cvpr.pdf
    1.8MB
  • lu2016eccv.pdf
    2.3MB
  • haque2016eccv.pdf
    7.5MB
  • CVPR16_N_LSTM.pdf
    1.8MB
  • GreeneJEPG2016.pdf
    2.8MB
Content preview
DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson*, Andrej Karpathy*, Li Fei-Fei
Department of Computer Science, Stanford University
{jcjohns,karpathy,feifeili}@cs.stanford.edu
* Both authors contributed equally to this work.

Abstract

We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state of the art approaches in both generation and retrieval settings.

1. Introduction

Our ability to effortlessly point out and describe all aspects of an image relies on a strong semantic understanding of a visual scene and all of its elements. However, despite numerous potential applications, this ability remains a challenge for our state of the art visual recognition systems. In the last few years there has been significant progress in image classification [39, 26, 53, 45], where the task is to assign one label to an image. Further work has pushed these advances along two orthogonal directions: First, rapid progress in object detection [40, 14, 46] has identified models that efficiently identify and label multiple salient regions of an image. Second, recent advances in image captioning [3, 32, 21, 49, 51, 8, 4] have expanded the complexity of the label space from a fixed set of categories to sequences of words able to express significantly richer concepts.

However, despite encouraging progress along the label density and label complexity axes, these two directions have remained separate. In this work we take a step towards unifying these two inter-connected tasks into one joint framework. First, we introduce the dense captioning task (see Figure 1), which requires a model to predict a set of descriptions across regions of an image. Object detection is hence recovered as a special case when the target labels consist of one word, and image captioning is recovered when all images consist of one region that spans the full image.

[Figure 1: a 2x2 grid of tasks. Label density axis: Whole Image vs. Image Regions. Label complexity axis: Single Label vs. Sequence. Panels: Classification ("Cat"); Captioning ("A cat riding a skateboard"); Detection ("Cat", "Skateboard"); Dense Captioning ("Orange spotted cat", "Skateboard with red wheels", "Cat riding a skateboard", "Brown hardwood flooring").]

Figure 1. We address the Dense Captioning task (bottom right) with a model that jointly generates both dense and rich annotations in a single forward pass.

Additionally, we develop a Fully Convolutional Localization Network (FCLN) for the dense captioning task. Our model is inspired by recent work in image captioning [49, 21, 32, 8, 4] in that it is composed of a Convolutional Neural Network and a Recurrent Neural Network language model. However, drawing on work in object detection [38], our second core contribution is to introduce a new dense localization layer. This layer is fully differentiable and can be inserted into any neural network that processes images to enable region-level training and predictions. Internally, the localization layer predicts a set of regions of interest in the image and then uses bilinear interpolation [19, 16] to smoothly crop the activations in each region.

We evaluate the model on the large-scale Visual Genome dataset, which contains 94,000 images and 4,100,000 region captions. Our results show both performance and speed improvements over approaches based on previous state of the art. We make our code and data publicly available to support further progress on the dense captioning task.
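The introduction says the localization layer uses bilinear interpolation to smoothly crop the activations in each region. The excerpt gives no further detail, so the following is only a minimal sketch of that general idea, not the authors' implementation. It assumes a single-channel feature map stored as a list of rows, a box given in real-valued feature-map coordinates, and an evenly spaced sampling grid with the box corners included; the names `bilinear_sample` and `crop_region` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def bilinear_sample(feat: Grid, y: float, x: float) -> float:
    """Read feat at a real-valued (y, x) by blending the four nearest cells."""
    h, w = len(feat), len(feat[0])
    # Clamp so that the sample position stays inside the grid.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def crop_region(feat: Grid, box: Tuple[float, float, float, float],
                out_h: int, out_w: int) -> Grid:
    """Resample the box (y_min, x_min, y_max, x_max) to a fixed out_h x out_w grid.

    Assumption for this sketch: sample points are evenly spaced and include
    the box corners. The output varies continuously with the box coordinates,
    which is the property that makes such a crop usable inside a network
    trained by gradient descent.
    """
    y_min, x_min, y_max, x_max = box
    out: Grid = []
    for i in range(out_h):
        y = y_min if out_h == 1 else y_min + (y_max - y_min) * i / (out_h - 1)
        row = []
        for j in range(out_w):
            x = x_min if out_w == 1 else x_min + (x_max - x_min) * j / (out_w - 1)
            row.append(bilinear_sample(feat, y, x))
        out.append(row)
    return out


# Toy 3x3 feature map with feat[y][x] = 3*y + x.
feat = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
print(crop_region(feat, (0.5, 0.5, 1.5, 1.5), 2, 2))  # [[2.0, 3.0], [5.0, 6.0]]
```

Because the toy map is linear in both coordinates, bilinear interpolation reproduces it exactly between cells, which makes the printed values easy to check by hand. A box that does not align with the cell grid still yields a fixed-size output, so regions of any size can feed a fixed-size downstream model.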