dutch-word-embeddings

所属分类:特征抽取
开发工具:Python
文件大小:13KB
下载次数:0
上传日期:2022-02-22 11:15:27
上 传 者sh-1993
说明:  荷兰语单词嵌入,在大量荷兰社交媒体消息和新闻博客论坛帖子上进行训练。
(Dutch word embeddings, trained on a large collection of Dutch social media messages and news blog forum posts.)

文件列表:
CITATION.cff (378, 2022-02-22)
LICENSE.txt (19342, 2022-02-22)
analogies.txt (2880, 2022-02-22)
utils.py (4033, 2022-02-22)

Dutch Word2Vec Model

This repository contains a Word2Vec model trained on a large Dutch corpus, comprised of social media messages and posts from Dutch news, blog and fora. Finding pre-trained Dutch models online can often be quite difficult, especially since most online models are trained on neatly written texts like Wikipedia or newspaper archives. When working with noisy text sources these models usually underperform due to the large number of out-of-vocabulary words used on social platforms and their short-message writing style. By training on a combination of both large and short texts from multiple online sources we've tried to create a model more suited for these types of texts. We are sharing this model to help research using Dutch data sources, so feel free to use it for your projects! If you would like a more up-to-date model, or a model with specific preprocessing steps, we'd be happy to help! Please contact the current maintainer ([*@Alexander Nieuwenhuijse*](https://github.com/severun)) if you'd like to use this model for commercial products. ## Installation The model can be downloaded using the provided utils script ```sh $ git clone https://github.com/coosto/dutch-word-embeddings.git $ cd dutch-word2vec-model $ python3 utils.py download ``` Or directly as an asset from [release page](https://github.com/coosto/dutch-word-embeddings/releases). ## Usage To run a demo (using [gensim](https://github.com/RaRe-Technologies/gensim)) for the downloaded model run following command. It will first output some example analogies and afterward present an interactive prompt to query for nearest neighbour terms. ```sh $ python3 utils.py demo --model model.bin Loading model... Model loaded ... ```` ## Examples If we query for "tomaat" (Dutch for tomato) we get a lot of Dutch vegetables:
Enter word or sentence: tomaat
     Term     |      Distance
-----------------------------------
paprika       |0.8452869653701782   (Bell pepper)
komkommer     |0.79324918***536682   (Cucumber)
courgette     |0.771128237247467    (Zucchini)
spinazie      |0.76975500583***868   (Spinach)
aubergine     |0.7***6535634994507   (Eggplant)
rucola        |0.7631270885467529   (Arugula)
avocado       |0.7610437273979187   (Avocado)
radijs        |0.7554484605789185   (Radish)
tomaatjes     |0.7549760341***4287   (Tomatoes)
tomaten       |0.7525067925453186   (Tomatoes)
Another interesting case is "lidl" (A supermarket chain), which returns a list of other supermarkets:
Enter word or sentence: lidl
      Term     |      Distance
------------------------------------
aldi           |0.9053274***9128113
jumbo          |0.7790680527687073
albert_heijn   |0.7582876086235046
supermarkt     |0.7522222995758057
#lidl          |0.7320866584777832
albert_hein    |0.7161234617233276
albert_heyn    |0.7076783180236816
ekoplaza       |0.6897568702697754
vomar          |0.6830324530601501
nettorama      |0.6739382147789001
This example shows some common typographical errors for a supermarket chain called Albert Heijn. This type of errors would not be in the model is it was trained only on neatly written text, like the Dutch Wikipedia data or newspaper articles, but is included in this model because these errors are made a lot on social media.
# Data selection & Modeling The model was trained using ~600 million individual messages, comprised of Dutch social media messages (624 million messages) and Dutch news, blog and fora posts (36 million messages). All messages were published between 01/01/2017 and 31/12/2017. To improve the quality of the model some basic preprocessing was applied to all the messages, described below. ### Splitting sentences Every individual message was split into separate sentences by searching for punctuation marks, which are only considered to be an end-of-line character if it is not used in the following exceptions: * (Roman) Numerals * Single letters * Abbriviations ### Clean up Given that social media messages are usually not neatly formatted some cleaning of the text is applied: 1. Converting to lowercase 2. Removing HTML/XML tags 3. Replacing *URLs* with the *\* token 4. Replacing *@-mentions* with the *\* token 5. Removing punctuation marks, emojis and unwanted unicode characters 7. Removing sentences with less than 5 tokens. The URLs and @-mentions are removed for both privacy and performance reasons. The last step is applied to remove very small text messages, since they usually do not provide enough relevant context to learn from. ### Deduplication Lastly the entire set of cleaned up sentences is removed from any duplicate training examples and shuffeled into a random order. This results in a training set of 490 million unique preprocessed sentences. ### Training The model was trained using [Google's Word2Vec implementation](https://code.google.com/p/word2vec/). We've selected the Continuous Bag-of-Words (CBOW) model and generate vectors of size 300. The min-count parameter was chosen based on manual inspection of the vocabulary and to limit the size of the model. ```sh word2vec -train input.txt -output model.bin -size 300 -window 10 -negative 10 -hs 0 -cbow 1 -sample 1e-5 -iter 5 -min-count 300 ``` The resulting model contains 250479 vectors and was not pruned or altered in any way. License ---- This work is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License ](https://creativecommons.org/licenses/by-nc/4.0/). Please contact the current maintainer ([*@Alexander Nieuwenhuijse*](https://github.com/severun)) if you wish to use this model with a different license.

近期下载者

相关文件


收藏者