clj-tokenizer

Category: Search engines
Development tool: Clojure
File size: 3 KB
Downloads: 0
Upload date: 2011-08-21 19:09:48
Uploader: sh-1993
Description: A simple wrapper around the Lucene text tokenizer

File list:
project.clj (407, 2011-08-22)
src (0, 2011-08-22)
src\clj_tokenizer (0, 2011-08-22)
src\clj_tokenizer\core.clj (1810, 2011-08-22)
test (0, 2011-08-22)
test\clj_tokenizer (0, 2011-08-22)
test\clj_tokenizer\test (0, 2011-08-22)
test\clj_tokenizer\test\core.clj (432, 2011-08-22)

# clj-tokenizer

A simple Clojure wrapper around the Lucene text tokenizer. Wrappers are provided for the Lucene [StandardAnalyzer](http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html) and the Lucene [StandardTokenizer](http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html).

For a proper Clojure library for NLP, see [clojure-opennlp](https://github.com/dakrone/clojure-opennlp).

The project can also be run from the command line, where it tokenizes each line of stdin, removes stopwords, and writes the result to stdout.

## Usage

First clone the project, then fetch the dependencies and build the standalone jar:

    lein deps; lein compile; lein uberjar

To use the tokenizer from the command line, run the jar with `java -jar`. For example,

    curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt | java -jar clj-tokenizer-0.1.0-SNAPSHOT-standalone.jar | head -100

tokenizes Herman Melville's Moby Dick and prints the first 100 lines of output.

To use the tokenizer within Clojure, first add the dependency to project.clj:

    [clj-tokenizer "0.1.0"]

To create a token stream:

    (token-seq (token-stream "This is a string."))
    ;; ("This" "is" "a" "string")

To convert to lowercase and remove stopwords:

    (token-seq (token-stream-without-stopwords "This is a string, without the stopwords."))
    ;; ("string" "without" "stopwords")

To stem the words using the Snowball stemmer:

    (token-seq (stemmed (token-stream-without-stopwords "Going to be Stemming some lemmings.")))
    ;; ("go" "stem" "some" "lem")

## License

Copyright (C) 2010 Erik Andrejko

Distributed under the Eclipse Public License, the same as Clojure.
