telegram_data_clustering_2019

Category: Clustering algorithms
Development tool: HTML
File size: 4557 KB
Downloads: 0
Upload date: 2019-12-29 14:05:05
Uploader: sh-1993
Description: Telegram contest on unsupervised news clustering

File list:
3dparty (0, 2019-12-29)
3dparty\fastText-0.9.1 (0, 2019-12-29)
3dparty\fastText-0.9.1\.circleci (0, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\cmake_test.sh (706, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\config.yml (4248, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\gcc_test.sh (639, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\pip_test.sh (309, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\pull_data.sh (1075, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\python_test.sh (260, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\run_locally.sh (459, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\setup_circleimg.sh (307, 2019-12-29)
3dparty\fastText-0.9.1\.circleci\setup_debian.sh (319, 2019-12-29)
3dparty\fastText-0.9.1\CMakeLists.txt (1933, 2019-12-29)
3dparty\fastText-0.9.1\CODE_OF_CONDUCT.md (241, 2019-12-29)
3dparty\fastText-0.9.1\CONTRIBUTING.md (2061, 2019-12-29)
3dparty\fastText-0.9.1\LICENSE (1080, 2019-12-29)
3dparty\fastText-0.9.1\MANIFEST.in (95, 2019-12-29)
3dparty\fastText-0.9.1\Makefile (2183, 2019-12-29)
3dparty\fastText-0.9.1\alignment (0, 2019-12-29)
3dparty\fastText-0.9.1\alignment\align.py (5335, 2019-12-29)
3dparty\fastText-0.9.1\alignment\eval.py (2478, 2019-12-29)
3dparty\fastText-0.9.1\alignment\example.sh (1408, 2019-12-29)
3dparty\fastText-0.9.1\alignment\unsup_align.py (4616, 2019-12-29)
3dparty\fastText-0.9.1\alignment\utils.py (4803, 2019-12-29)
3dparty\fastText-0.9.1\classification-example.sh (1428, 2019-12-29)
3dparty\fastText-0.9.1\classification-results.sh (3154, 2019-12-29)
3dparty\fastText-0.9.1\crawl (0, 2019-12-29)
3dparty\fastText-0.9.1\crawl\dedup.cc (1237, 2019-12-29)
3dparty\fastText-0.9.1\crawl\download_crawl.sh (1563, 2019-12-29)
3dparty\fastText-0.9.1\crawl\filter_dedup.sh (332, 2019-12-29)
3dparty\fastText-0.9.1\crawl\filter_utf8.cc (3034, 2019-12-29)
3dparty\fastText-0.9.1\crawl\process_wet_file.sh (912, 2019-12-29)
3dparty\fastText-0.9.1\docs (0, 2019-12-29)
3dparty\fastText-0.9.1\docs\aligned-vectors.md (6122, 2019-12-29)
... ...

My entry to the Telegram Data Clustering contest.

https://entry1144-dcround1.usercontent.dev/categories/en/

What's useful here? You can take a sneak peek at a GoLang and C wrapper for fastText. It is an extended version of something I found on the internet, with an extra wrapper function for the vector extraction routine. Basically, it is my first attempt to write something useful in GoLang.

Here are some of the things I implemented and the general approach.

- FastText integration. I used the FastText C++ library for model construction and inference, so I took an existing C wrapper for FastText and extended it to my needs so that I could call it from GoLang. It works smoothly.
- For language detection I used an existing fastText model.
- For news/non-news classification I looked through the list of domain names and cherry-picked domains into two categories: definitely trustworthy and definitely spam/fake. Then I assigned the corresponding class to every article coming from those domains and used only those articles to train the classifier.
- For category detection I labeled a few (5k) English articles using the Google Text Classification API and manually created conversion rules from Google categories to Telegram categories. Using those labels I trained the English category model. As for Russian, I cheated twice: I translated 5k Russian texts into English with the Google Translation API, assigned them categories using the model created for English texts, and trained the Russian model on those pseudo-labels, keeping only the ones whose confidence exceeded a high threshold.

I didn't have time to do the last parts of the contest, Top and Threads, and my few initial experiments on threads gave low-quality results. Naive sentence embeddings with cosine distances clustered poorly, and I spent too much time implementing LSH, as I was aiming for very high performance to practice a bit. Surely, a different approach was required here.
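
The thread experiments mentioned above combined sentence embeddings, cosine distance, and LSH. As a rough illustration only, here is a minimal Go sketch of cosine similarity together with random-hyperplane (sign) LSH, one common LSH family for cosine similarity. The README does not say which LSH scheme was actually used, so the scheme, the `Hasher` type, and all function names below are assumptions, and the toy vectors stand in for real fastText embeddings.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// CosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 when the lengths differ or either vector has zero norm.
func CosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Hasher (hypothetical name) holds random hyperplanes for sign-based LSH.
type Hasher struct {
	planes [][]float64
}

// NewHasher draws up to 64 random Gaussian hyperplanes of dimension dim.
// A fixed seed makes the signatures reproducible.
func NewHasher(dim, bits int, seed int64) *Hasher {
	if bits > 64 {
		bits = 64 // a signature is packed into a single uint64
	}
	rng := rand.New(rand.NewSource(seed))
	planes := make([][]float64, bits)
	for i := range planes {
		planes[i] = make([]float64, dim)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}
	return &Hasher{planes: planes}
}

// Signature sets bit i when v lies on the positive side of hyperplane i.
// v must have at least as many components as the hyperplanes.
func (h *Hasher) Signature(v []float64) uint64 {
	var sig uint64
	for i, p := range h.planes {
		var dot float64
		for j := range p {
			dot += p[j] * v[j]
		}
		if dot > 0 {
			sig |= 1 << uint(i)
		}
	}
	return sig
}

func main() {
	// Toy 4-dimensional "article embeddings" (made up for illustration).
	docs := map[string][]float64{
		"a": {0.9, 0.1, 0.0, 0.2},
		"b": {0.8, 0.2, 0.1, 0.2},
		"c": {-0.7, 0.9, -0.3, 0.0},
	}

	// Articles sharing a signature land in the same candidate bucket.
	h := NewHasher(4, 16, 42)
	buckets := map[uint64][]string{}
	for name, v := range docs {
		sig := h.Signature(v)
		buckets[sig] = append(buckets[sig], name)
	}

	fmt.Println("cos(a,b) =", CosineSimilarity(docs["a"], docs["b"]))
	fmt.Println("cos(a,c) =", CosineSimilarity(docs["a"], docs["c"]))
	fmt.Println("buckets:", buckets)
}
```

In a setup like this, only articles that fall into the same bucket would be compared with `CosineSimilarity`, which avoids the all-pairs comparison. Fewer bits give larger, noisier buckets; more bits give smaller buckets but miss more near-duplicates, which is usually offset by using several independent hashers.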
