sigg2012.rar

  • wafad
    Author
  • PDF
    Development tool
  • 744KB
    File size
  • rar
    File format
  • 0
    Times favorited
  • 1 point
    Download points
  • 0
    Downloads
  • 2019-10-15 03:32
    Upload date
Speech enhancement using dictionary learning
sigg2012.rar
  • sigg2012.pdf
    1017.5KB
Description
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 6, AUGUST 2012

Speech Enhancement Using Generative Dictionary Learning
Christian D. Sigg, Member, IEEE, Tomas Dikk, and Joachim M. Buhmann, Senior Member, IEEE

Abstract—The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.

Index Terms—Dictionary learning, sparse coding, speech enhancement.

I. INTRODUCTION

ENHANCING speech degraded by nonstationary real-world interferers is both an important and difficult task. The importance arises from many signal processing applications, including hearing aids, mobile communications, and preprocessing for speech recognition. The difficulty of speech enhancement in these applications arises from the nature of the encountered interferers, which often are nonstationary and potentially speech-like, thereby inducing a significant and time-varying spectral overlap between speech and interferer.

The goal of speech enhancement is twofold: to improve both the perceived quality and the intelligibility of speech, by attenuating the interferer without substantially degrading the speech. Speech of higher quality is perceived as being more comfortable to listen to, for longer periods of time, whereas higher speech intelligibility is measured by lower word error rates in speech recognition scenarios.

Ideally, the performance of speech enhancement algorithms is measured by conducting subjective listening tests with human listeners. Objective measures are designed to approximate subjective quality scores and intelligibility rates. Most objective measures quantify improvement by comparing the (unobserved) clean speech with the degraded speech and the enhanced speech in a perceptually meaningful way. As a consequence, performance evaluation has to be conducted on synthetic mixtures of clean speech and interferer signals.

We consider the setting of a one-to-one conversation in a natural environment, recorded by a single microphone. This setup can be modeled as a linear additive mixture of target clean speech and interferer

    x(t) = s(t) + n(t),    (1)

where x(t) is the time-domain mixture signal at sample t, and s(t) and n(t) are the time-domain speech and interferer signals. Recovering the clean speech signal from the mixture is under-determined without additional assumptions. Our enhancement approach is based on transforming time-domain signals into a suitably chosen feature space, and sparse coding in this feature space using signal models for both the speech and the interferer (called dictionaries). Since speech and many kinds of interferers contain structure, their structured component can be sparsely coded in coherent dictionaries. If both the speech and interferer dictionary is coherent only to its respective structured component in the mixture signal, sparse coding is able to separate the mixture into its structured components and to suppress any unstructured component (i.e., random noise) that is incoherent to both dictionaries. Finally, an estimate of s(t) is obtained by performing the inverse transform from the feature space back to the time-domain.

Since clean speech is never observable in the environment where enhancement is to take place, we learn the speech dictionary on a training corpus. Speech is a well-structured signal class, therefore a pre-trained model remains largely valid during enhancement, even in the speaker independent case. The contrary is true for the interferer, which varies considerably depending on the environment, and which might be a superposition of several sources, requiring a single general interferer …

Manuscript received April 27, 2011; revised October 08, 2011; accepted December 31, 2011. Date of publication February 06, 2012; date of current version March 30, 2012. This work was supported in part by CTI grant 8539.2;2 ESPP-ES. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sharon Gannot. C. D. Sigg is with the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss), Zurich, Switzerland (e-mail: christian@sigg-iten.ch). T. Dikk and J. M. Buhmann are with the Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland (e-mail: tomasdikk@tomasdikk.com; jbuhmann@inf.ethz.ch). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2012.2187194
Comments
    Related recommendations
    • amr-SPEECH.rar
      AMR speech codec program. The version downloaded from the official site has the encoder and decoder separate; here it has been modified to combine them into one.
    • DL---物联网.rar
      Recent papers on deep learning for the Internet of Things; well worth a read.
    • DNN assisted Sphere Decoder.rar
      Deep learning for MIMO: a DNN-assisted sphere decoder.
    • yui-dl: configuration files for my GitHub profile
      :waving_hand: Hi, I'm @yui-dl :eyes: I'm interested in artificial intelligence :seedling: I'm currently studying computer science :revolving_hearts: I'd like to collaborate on deep learning projects about TTS, question answering, and other interesting topics! :closed_mailbox_with_...
    • speech_animation
      Reference: ://dl.acm.org/citation.cfm id 3073699 The project is implemented on the GRID dataset. Data collection: extract landmark points of the lower face using OpenCV and Dlib; filter out position, scale, and rotation effects (shape alignment); generalized Procrustes analysis; shape alignment: shape model; ...
    • mongolian-speech-recognition: Mongolian speech recognition with PyTorch
      Download the Mongolian Bible dataset: cd datasets && python dl_mbspeech.py Precompute mel spectrograms: python preprop_dataset.py --dataset mbspeech Train: python train.py --model crnn --max-epochs 50 --dataset mbspeech --...
    • speech.tar.gz
      A four-microphone-array program built on Baidu Speech, including speech recognition, semantic understanding, speech synthesis, playback, and more. A complete example that beginners can use as a reference.
    • 安全会议论文_DL.rar
      The four top security conferences are ACM CCS, NDSS, S&P, and USENIX Security. This is a collection of deep-learning-related papers from these four conferences in recent years.
    • primer_dl_nlp.zip
      A Primer on Neural Network Models for Natural Language Processing
    • GaussDB_100_1.0.1-DATABASE-REDHAT-64bit.tar.gz
      Installation package for GaussDB 100 on Red Hat (standalone deployment). See my article for the installation steps; the document was compiled from extensive hands-on setup experiments.