wikiparse

Category: Other
Development tool: Clojure
File size: 0KB
Downloads: 0
Upload date: 2015-07-21 17:58:53
Uploader: sh-1993
Description: Parse Wikipedia dumps and index (some) page data to Elasticsearch.

File list:
doc/ (0, 2015-07-21)
doc/intro.md (127, 2015-07-21)
project.clj (570, 2015-07-21)
run-fast.sh (180, 2015-07-21)
src/ (0, 2015-07-21)
src/wikiparse/ (0, 2015-07-21)
src/wikiparse/core.clj (10934, 2015-07-21)
test.xml (63, 2015-07-21)
test/ (0, 2015-07-21)
test/wikiparse/ (0, 2015-07-21)
test/wikiparse/core_test.clj (133, 2015-07-21)
wikisample.xml.bz2 (88324, 2015-07-21)

# wikiparse

Imports Wikipedia data dump XML into Elasticsearch.

## Usage

* [NOTE] This is the most cross-platform way to run the script. See the section on running it faster below for a 2-4x faster load.
* Download the pages-articles XML dump; the link is on [this page](http://en.wikipedia.org/wiki/Wikipedia:Database_download#XML_schema). You want pages-articles.xml.bz2. DO NOT UNCOMPRESS THE BZ2 FILE.
* From the releases page, download the [wikiparse JAR](https://github.com/andrewvc/wikiparse/releases).
* Run the JAR on the BZ2 file: `java -jar -Xmx3g -Xms3g wikiparse-0.2.1.jar --es http://localhost:9200 /var/lib/elasticsearch/enwiki-latest-pages-articles.xml.bz2`
* The data will be indexed to an index named `en-wikipedia` by default. This can be changed with the `--index` parameter.

# Running it Faster

The fastest way to run this code is with the `run-fast.sh` shell script in this repo. It shells out to your OS's bzip2 program, which helps with parallelism (at the expense of having to uncompress the bzip2 file twice). It also makes two passes over the input file, which optimizes the writes to Elasticsearch. The source code of the `run-fast.sh` script is included below.

```bash
#!/bin/sh
JAR=$1
DUMP=$2
curl -XDELETE http://localhost:9200
bzip2 -dcf $DUMP | java -Xmx3g -Xms3g -jar $JAR -p redirects &&
bzip2 -dcf $DUMP | java -Xmx3g -Xms3g -jar $JAR -p full
```

## License

Wikisample.bz2 Copyright: http://en.wikipedia.org/wiki/Wikipedia:Copyrights

All code and other files Copyright 2013 Andrew Cholakian and distributed under the Eclipse Public License, the same as Clojure.
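
As a rough illustration of the Usage steps above, the sketch below wraps the single-pass invocation in a small shell function that refuses a dump that is no longer bzip2-compressed and then prints the command to run. The function name `wikiparse_cmd` and its argument order are hypothetical and not part of wikiparse; the `--es` flag, the memory settings, and the default URL are copied from the README example.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with wikiparse): print the java command
# from the Usage section after checking the dump is still compressed.
wikiparse_cmd() {
    jar=$1
    dump=$2
    # Default Elasticsearch URL taken from the README example.
    es_url=${3:-http://localhost:9200}

    case "$dump" in
        *.bz2) ;;  # the README says not to uncompress the dump
        *)
            echo "expected a .bz2 dump, got: $dump" >&2
            return 1
            ;;
    esac

    # Print rather than execute, so the command can be inspected first.
    printf 'java -Xmx3g -Xms3g -jar %s --es %s %s\n' "$jar" "$es_url" "$dump"
}
```

Calling `wikiparse_cmd wikiparse-0.2.1.jar enwiki-latest-pages-articles.xml.bz2` prints the command for review; passing an uncompressed `.xml` path returns a non-zero status instead. Printing instead of executing is a deliberate choice here, since a full import takes a long time and is worth checking before it starts.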
