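dover_to_calais: a Ruby wrapper around the OpenCalais Web Service that handles a wide range of data sources and provides asynchronous responses and result filtering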

  • dover_to_calais-master
    • .gitignore
    • test
      • test_file_1.pdf
      • test_file_1.html
      • test_file_1.rtf
      • test_file_1.odt
      • test_file_1.doc
      • test_file_1.txt
    • Gemfile
    • .yardopts
    • features
      • data_sources.feature
      • step_definitions
        • data_sources_steps.rb
        • data_mining_steps.rb
        • filtering_steps.rb
      • filtering.feature
      • data_mining.feature
    • LICENSE.txt
    • dover_to_calais.gemspec
    • lib
      • dover_to_calais.rb
      • dover_to_calais
        • models.rb
        • version.rb
        • ontology.rb
    • Rakefile
# **************IMPORTANT NOTICE ************

## As of __30 September__ 2015, the OpenCalais API on which this gem is built has been discontinued by Thomson-Reuters.

A new and significantly changed API is now in use by OpenCalais, which has published details of the changes. Unfortunately this means that DoverToCalais is no longer functional. I don't know at this stage if and when I'll upgrade this gem to the new API.

Thank you for your time and effort in using DoverToCalais.

# DoverToCalais

DoverToCalais allows the user to send a wide range of data sources (files & URLs) to OpenCalais and receive asynchronous responses when OpenCalais has finished processing the inputs. In addition, DoverToCalais enables response filtering in order to find relevant tags and/or tag values.

## What is OpenCalais?

In short, and quoting the OpenCalais creators:

> "*The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.*"

In general, the OpenCalais Simple XML Format (the one used by DoverToCalais) returns three kinds of tags: Entities, Events and Topics.

***Entities*** are static 'things', like Persons, Places, et al. that are involved in the textual context in some capacity. OpenCalais assigns a *relevance* score to each entity to indicate its relevance within the context of the data source's general topic.

***Events*** are facts or actions that pertain to one or more Entities.

***Topics*** are a characterisation or generic description of the data source's context.

We can use these tags and the information within them to extract relevant information from the data or to draw useful conclusions about it.
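To make the three kinds of tags concrete, here is a minimal sketch of pulling them out of a response by hand with Ruby's REXML. The sample document, its element names, its `relevance` attribute and the helper methods `entities_by_relevance`, `events` and `topics` are all illustrative assumptions, not the exact OpenCalais schema and not part of DoverToCalais.

```ruby
require 'rexml/document'

# Hypothetical sample shaped like a Simple XML Format response.
# Element names, attributes and values are assumptions for illustration only.
SAMPLE_RESPONSE = <<~XML
  <OpenCalaisSimple>
    <CalaisSimpleOutputFormat>
      <Company count="2" relevance="0.71">Acme Widgets</Company>
      <City count="1" relevance="0.42">Dover</City>
      <Person count="1" relevance="0.18">Jane Doe</Person>
      <Event count="1">CompanyExpansion</Event>
      <Topics>
        <Topic>Business_Finance</Topic>
      </Topics>
    </CalaisSimpleOutputFormat>
  </OpenCalaisSimple>
XML

# Collects every element that carries a relevance attribute (treated here
# as an entity) and sorts the results by descending relevance.
def entities_by_relevance(xml)
  doc = REXML::Document.new(xml)
  entities = REXML::XPath.match(doc, '//*[@relevance]').map do |el|
    { name: el.name, value: el.text, relevance: el.attributes['relevance'].to_f }
  end
  entities.sort_by { |entity| -entity[:relevance] }
end

# Returns the text of every <Event> element.
def events(xml)
  REXML::XPath.match(REXML::Document.new(xml), '//Event').map(&:text)
end

# Returns the text of every <Topic> element.
def topics(xml)
  REXML::XPath.match(REXML::Document.new(xml), '//Topic').map(&:text)
end

entities_by_relevance(SAMPLE_RESPONSE).each do |entity|
  puts format('%-8s %-14s %.2f', entity[:name], entity[:value], entity[:relevance])
end
puts "Events: #{events(SAMPLE_RESPONSE).join(', ')}"
puts "Topics: #{topics(SAMPLE_RESPONSE).join(', ')}"
```

Sorting by relevance puts the entities most central to the document first, which is usually the starting point when deciding whether a data source is worth a closer look.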
For example, if the data source tags include an *&lt;Event&gt;* with the value of *'CompanyExpansion'*, I can then look for the &lt;City&gt; or &lt;Company&gt; tags to find out which company is expanding and whether it's near my location (hint: they may be looking for more staff :)). Or, I could pick out all &lt;Company&gt;s involved in a &lt;JointVenture&gt;, or all &lt;Person&gt;s implicated in an &lt;Arrest&gt; in my &lt;City&gt;, etc.

From version 0.2.1 onwards, DoverToCalais also supports the rich OpenCalais JSON Output format. This format returns relationships between entities, as well as the tags returned by the Simple XML format, thus allowing a deeper level of data analysis and detection.

## Why use OpenCalais?

There are many reasons, mainly to:

* incorporate tags into other applications, such as search, news aggregation, blogs, catalogs, etc.
* enrich search by looking for deeper, contextual meaning instead of merely phrases or keywords.
* help to discern relationships between semantic entities.
* facilitate data processing and analysis by allowing easy identification of relevant or important data sources and the discarding of irrelevant ones.

## DoverToCalais Features

1. **Multiple data source support**: Thanks to the power of Yomu, DoverToCalais can process a vast range of files (and, of course, web pages), extract text from them and send it to OpenCalais for analysis and tag generation.

2. **Asynchronous responses (callbacks)**: Users can set callbacks to receive the processed meta-data once the OpenCalais Web Service response has been received. Furthermore, a user can set multiple callbacks for the same request (data source), thus enabling cleaner, more modular code.

3. **Result filtering**: DoverToCalais uses the OpenCalais Simple XML Format as the preferred response format.
   The user can work directly with the XML-formatted response or, if feeling a bit lazy, can take advantage of the DoverToCalais filtering functionality and receive specific entities, optionally based on specified conditions.

For more details of the features and code samples, see [Usage](#usage).

## Pre-requisites and dependencies

To use the OpenCalais Web Service and, by extension, DoverToCalais, one needs to possess an OpenCalais API key, which is easily obtainable from the OpenCalais web site.

DoverToCalais requires the presence of a working JRE. Also, if you're going to use the rich JSON output format, you'll need to have Redis running on an accessible node.

## Installation

Add this line to your application's Gemfile:

    gem 'dover_to_calais'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install dover_to_calais

## Compatibility

DoverToCalais has been developed in Ruby 1.9.3 and should work fine on post-1.9.3 MRI versions too. If anyone is successfully running it on other Ruby runtimes, please let me know.

## Usage

Using DoverToCalais is extremely simple.

### The Basics

As DoverToCalais uses the awesome-ness of EventMachine, code must be placed within an EM *run* block:

```ruby
require 'dover_to_calais'

EM.run do

  # use Control + C to stop the EM
  Signal.trap('INT')  { EventMachine.stop }
  Signal.trap('TERM') { EventMachine.stop }

  # we need an API key to use OpenCalais
  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  # create a new dover (pass the URL or file path of the data source)
  dover = DoverToCalais::Dover.new('')

  # parse the text and send it to OpenCalais
  dover.analyse_this

  puts 'do some stuff....'

  # set a callback for when we receive a response
  dover.to_calais { |response| puts response.error ? response.error : response }

  puts 'do some more stuff....'

end
```

This will produce the following result:

> do some stuff.... <br>
> do some more stuff.... <br>
> &lt;?xml version="1.0"?&gt; <br>
> &lt;OpenCalaisSimple&gt; <br>
> ..........
<br>
> (the rest of the XML response from OpenCalais) <br>

As can be observed, the callback (#to_calais) is triggered after the rest of the code has been executed and only when the OpenCalais request has been completed.

Of course, we can analyse more than one source at a time:

```ruby
require 'dover_to_calais'

EM.run do

  # use Control + C to stop the EM
  Signal.trap('INT')  { EventMachine.stop }
  Signal.trap('TERM') { EventMachine.stop }

  DoverToCalais::API_KEY = 'my-opencalais-api-key'

  # one dover per data source: a URL, a local file and a network file
  d1 = DoverToCalais::Dover.new('')
  d2 = DoverToCalais::Dover.new('/home/fred/Documents/RailsRecipes.pdf')
  d3 = DoverToCalais::Dover.new('//network-drive/annual_forecast.doc')

  # send all three sources to OpenCalais
  d1.analyse_this
  d2.analyse_this
  d3.analyse_this

  puts 'do some stuff....'

  # each callback fires when its own response arrives
  d1.to_calais { |response| puts response.error ? response.error : response }
  d2.to_calais { |response| puts response.error ? response.error : response }
  d3.to_calais { |response| puts response.error ? response.error : response }

  puts 'do some more stuff....'

end
```

This will output the two *puts* statements followed by the three callbacks (d1, d2, d3) in the order in which they are triggered, i.e. the first callback to receive a response from OpenCalais will fire first.

### Filtering the response

Why parse the response XML ourselves when DoverToCalais can do it