hnstats
Category: Feature extraction
Development tool: Java
File size: 3148KB
Downloads: 0
Upload date: 2017-09-05 22:19:14
Uploader: sh-1993
Description: HackerNews analytics using word2vec
File list:
_config.yml (26, 2017-09-06)
d3.layout.cloud.js (14613, 2017-09-06)
d3.min.js (149699, 2017-09-06)
data.json (18848, 2017-09-06)
h2-1.4.196.jar (1821816, 2017-09-06)
hnstats-ewan-robertson-208059.jpg (277421, 2017-09-06)
hnstats-ewan-robertson-208059.png (1100447, 2017-09-06)
logging.properties (415, 2017-09-06)
pom.xml (7954, 2017-09-06)
src (0, 2017-09-06)
src\main (0, 2017-09-06)
src\main\webapp (0, 2017-09-06)
src\main\webapp\META-INF (0, 2017-09-06)
src\main\webapp\META-INF\context.xml (38, 2017-09-06)
src\main\webapp\WEB-INF (0, 2017-09-06)
src\main\webapp\WEB-INF\jboss-deployment-structure.xml (143, 2017-09-06)
src\main\webapp\WEB-INF\jboss-web.xml (325, 2017-09-06)
src\main\webapp\WEB-INF\web.xml (423, 2017-09-06)
src\main\webapp\index.html (0, 2017-09-06)
src\test (0, 2017-09-06)
src\test\java (0, 2017-09-06)
src\test\java\test (0, 2017-09-06)
src\test\java\test\BaseUtil.java (12332, 2017-09-06)
src\test\java\test\BasicLineIterator.java (4087, 2017-09-06)
src\test\java\test\DaemonMyStemProvider.java (2187, 2017-09-06)
src\test\java\test\ListSequenceIterator.java (2163, 2017-09-06)
src\test\java\test\NERDemo.java (6800, 2017-09-06)
src\test\java\test\ReadJSON.java (3307, 2017-09-06)
src\test\java\test\StanfordLemmatizer.java (4927, 2017-09-06)
src\test\java\test\TestDumpBigQuery2H2.java (5135, 2017-09-06)
src\test\java\test\TestLemmatizer.java (671, 2017-09-06)
src\test\java\test\TestMinio.java (2122, 2017-09-06)
src\test\java\test\TestNER.java (2338, 2017-09-06)
src\test\java\test\TestShowH2.java (1207, 2017-09-06)
src\test\java\test\TestStemming.java (4558, 2017-09-06)
src\test\java\test\TestWord2Vec.java (4696, 2017-09-06)
src\test\java\test\TestWord2VecDB.java (4018, 2017-09-06)
... ...
![HackerNews analytics](https://github.com/wizecore/hnstats/blob/master/hnstats-ewan-robertson-208059.png)
Using the available [HackerNews](https://news.ycombinator.com) dataset, produce some insight into the most meaningful topics.
## Ultimate reason
* Most discussed topics (and yearly shift)
* Top technology and startups
## Technology behind it
* Java
* Deeplearning4J
* Word2vec
* Stanford CoreNLP (lemmatizing)
## Project
[Online version](http://wizecore.com/hnstats/terms.html)
## Roadmap
- Gather data (DONE)
- Produce JSON (DONE)
- For selected terms: related words trending over the years 2007-2017 (DONE)
- For all terms: display counts for every year (TODO)
- Term cleanup (DONE)
- Auto build SPA, i.e. push -> CI -> deploy (TODO)
- Fine tune word2vec params (see below)
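
The yearly-count item in the roadmap could be approached roughly as in the sketch below. It is not taken from the project: the class `YearlyTermCounts`, its methods, and the sample terms are hypothetical, and it assumes each item has already been reduced to a posting year plus a list of cleaned, lemmatized terms.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of hnstats: counts term occurrences per year.
class YearlyTermCounts {

    // term -> (year -> count); TreeMap keeps terms and years sorted for output.
    private final Map<String, Map<Integer, Integer>> counts = new TreeMap<>();

    /** Registers every term of one item posted in the given year. */
    void add(int year, List<String> terms) {
        for (String term : terms) {
            counts.computeIfAbsent(term, k -> new TreeMap<>())
                  .merge(year, 1, Integer::sum);
        }
    }

    /** Returns how often the term was seen in that year, or 0 if never. */
    int count(String term, int year) {
        return counts.getOrDefault(term, Collections.emptyMap())
                     .getOrDefault(year, 0);
    }

    /** Returns the full term -> year -> count table (read-only view). */
    Map<String, Map<Integer, Integer>> table() {
        return Collections.unmodifiableMap(counts);
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        YearlyTermCounts c = new YearlyTermCounts();
        c.add(2016, Arrays.asList("rust", "compiler"));
        c.add(2016, Arrays.asList("rust"));
        c.add(2017, Arrays.asList("rust", "bitcoin"));
        System.out.println(c.table());
    }
}
```

The nested sorted map is one possible shape for the per-year JSON; whether it matches the layout of the project's `data.json` is not known here.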
## Source repo
The project is hosted on [GitHub](https://github.com/wizecore/hnstats)
## Word2vec tuning
**Help is welcome** in fine-tuning the word2vec parameters. Here is the current setup:
```java
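import org.deeplearning4j.models.word2vec.Word2Vec;

// iter (the corpus iterator) and t (the tokenizer factory) are defined elsewhere.
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)              // ignore words seen fewer than 5 times
        .iterations(1)                    // number of training iterations
        .layerSize(100)                   // dimensionality of each word vector
        .seed(System.currentTimeMillis()) // time-based seed, so runs are not reproducible
        .windowSize(5)                    // size of the context window
        .iterate(iter)
        .tokenizerFactory(t)
        .build();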
```
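
One informal way to judge a parameter change is to look at the nearest neighbours of a few familiar terms before and after retraining. The sketch below shows the idea in plain Java using cosine similarity. The class `RelatedWords` and the toy two-dimensional vectors are hypothetical and are not the project's code. Deeplearning4J offers a comparable lookup on a trained model (`wordsNearest`), which would normally be used instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: rank words by cosine similarity to a query word.
class RelatedWords {

    /** Cosine similarity of two non-zero vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to n words most similar to the query, best match first. */
    static List<String> nearest(Map<String, double[]> vectors, String word, int n) {
        double[] target = vectors.get(word);
        if (target == null) {
            return Collections.emptyList(); // unknown word: nothing to compare
        }
        List<String> others = new ArrayList<>(vectors.keySet());
        others.remove(word); // a word is trivially closest to itself
        others.sort(Comparator
                .comparingDouble((String w) -> cosine(target, vectors.get(w)))
                .reversed());
        return others.subList(0, Math.min(n, others.size()));
    }

    public static void main(String[] args) {
        // Toy vectors, made up for illustration; real ones have layerSize entries.
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("java", new double[] {1.0, 0.0});
        vectors.put("kotlin", new double[] {0.9, 0.1});
        vectors.put("pizza", new double[] {0.0, 1.0});
        System.out.println(nearest(vectors, "java", 2));
    }
}
```

Because the current setup seeds the model from the clock, neighbour lists can differ between runs even with identical parameters, so a fixed seed makes such comparisons easier to interpret.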