philo2vec

Category: Feature extraction
Development tool: Python
File size: 34996 KB
Downloads: 0
Upload date: 2016-08-12 17:48:03
Uploader: sh-1993
Description: An implementation of word2vec applied to the [Stanford Encyclopedia of Philosophy](http://plato.stanford.edu/).

File list (size in bytes, date):
LICENSE (1067, 2016-08-13)
data.py (2188, 2016-08-13)
data (0, 2016-08-13)
data\data.zip (35895367, 2016-08-13)
models.py (14764, 2016-08-13)
preprocessors.py (4267, 2016-08-13)
requirements.txt (143, 2016-08-13)
utils.py (949, 2016-08-13)

# philo2vec

A TensorFlow implementation of word2vec applied to the [Stanford Encyclopedia of Philosophy](http://plato.stanford.edu/). The implementation supports both `cbow` and `skip gram`. For more background, please have a look at these papers:

* [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [word2vec Parameter Learning Explained](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
* [Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method](http://arxiv.org/pdf/1402.3722v1.pdf)

After training, the model returns some interesting results; see the [interesting results part](https://github.com/mouradmourafiq/philo2vec#some-interesting-results).

Evaluating `hume - empiricist + rationalist`:

```
descartes malebranche spinoza hobbes herder
```

*(screenshot: screen shot 2016-08-12 at 19 19 22)*

### Some interesting results

#### Similarities

Similar words to `death`:

```
untimely ravages grief torment
```

Similar words to `god`:

```
divine De Providentia christ Hesiod
```

Similar words to `love`:

```
friendship affection christ reverence
```

Similar words to `life`:

```
career live lifetime community society
```

Similar words to `brain`:

```
neurological senile nerve nervous
```

#### Operations

Evaluating `hume - empiricist + rationalist`:

```
descartes malebranche spinoza hobbes herder
```

Evaluating `ethics - rational`:

```
hiroshima
```

Evaluating `ethic - reason`:

```
inegalitarian anti-naturalist austere
```

Evaluating `moral - rational`:

```
commonsense
```

Evaluating `life - death + love`:

```
self-positing friendship care harmony
```

Evaluating `death + choice`:

```
regret agony misfortune impending
```

Evaluating `god + human`:

```
divine inviolable yahweh god-like man
```

Evaluating `god + religion`:

```
amida torah scripture buddha sokushinbutsu
```

Evaluating `politic + moral`:

```
rights-oriented normative ethics integrity
```

### The repo contains:

* an object to crawl data from the philosophy encyclopedia: [PlatoData](https://github.com/mouradmourafiq/philo2vec/blob/master/data.py)
* an object to build the vocabulary based on the crawled data: [VocabBuilder](https://github.com/mouradmourafiq/philo2vec/blob/master/preprocessors.py)
* the model that computes the continuous distributed representations of words: [Philo2Vec](https://github.com/mouradmourafiq/philo2vec/blob/master/models.py)

### Installation

The dependencies used for this module can be easily installed with pip:

```
> pip install -r requirements.txt
```

### The params for the VocabBuilder:

* **min_frequency**: the minimum frequency of the words to be used in the model.
* **size**: the size of the vocabulary to keep; the model then uses the top `size` most frequent words.

### The hyperparams of the model:

* **optimizer**: an instance of TensorFlow `Optimizer`, such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.
* **model**: the model used to create the vectorized representation, possible values: `CBOW`, `SKIP_GRAM`.
* **loss_fct**: the loss function used to calculate the error, possible values: `SOFTMAX`, `NCE`.
* **embedding_size**: dimensionality of the word embeddings.
* **neg_sample_size**: number of negative samples for each positive sample.
* **num_skips**: number of skips for a `SKIP_GRAM` model.
* **context_window**: window size; this window is used to create the context for calculating the vector representations [ window target window ]. A sketch of how such a window yields training pairs follows this list.
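To make `context_window` and `num_skips` concrete, here is a minimal illustrative sketch of how a symmetric window of the form [ window target window ] yields training pairs for `CBOW` (the context words predict the target) and `SKIP_GRAM` (the target predicts each context word). This is not code from this repo; the function `window_pairs` and the example sentence are purely illustrative.

```python
# Illustrative sketch, not the repo's preprocessing code.
def window_pairs(tokens, context_window=2):
    """Yield (context, target) pairs for every position with a full window."""
    for i in range(context_window, len(tokens) - context_window):
        target = tokens[i]
        context = (tokens[i - context_window:i] +
                   tokens[i + 1:i + 1 + context_window])
        yield context, target

sentence = "hume was an empiricist philosopher".split()

# CBOW: the (averaged) context predicts the target word.
cbow_examples = list(window_pairs(sentence))

# Skip-gram: each (target, context word) pair is a separate example;
# num_skips would bound how many such pairs are drawn per window.
skip_gram_examples = [(target, c) for context, target in window_pairs(sentence)
                      for c in context]

print(cbow_examples)       # [(['hume', 'was', 'empiricist', 'philosopher'], 'an')]
print(skip_gram_examples)  # [('an', 'hume'), ('an', 'was'), ('an', 'empiricist'), ('an', 'philosopher')]
```

With `NCE` or sampled `SOFTMAX`, `neg_sample_size` then controls how many negative words are drawn against each of these positive pairs during training.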
### Quick usage:

```python
params = {
    'model': Philo2Vec.CBOW,
    'loss_fct': Philo2Vec.NCE,
    'context_window': 5,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

```python
params = {
    'model': Philo2Vec.SKIP_GRAM,
    'loss_fct': Philo2Vec.SOFTMAX,
    'context_window': 2,
    'num_skips': 4,
    'neg_sample_size': 2,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

### About stemming

Since the words are stemmed as part of the preprocessing, some operations are sometimes necessary:

```python
StemmingLookup.stem('religious')  # returns "religi"

StemmingLookup.original_form('religi')  # returns "religion"
```

### Getting similarities

```python
pv.get_similar_words(['rationalist', 'empirist'])
```

### Evaluating operations

```python
pv.evaluate_operation('moral - rational')
```

### Plotting vectorized words

```python
pv.plot(['hume', 'empiricist', 'descart', 'rationalist'])
```

### Training details

#### skip_gram:

*(plots: skip_gram_loss, skip_gram_embeddings, skip_gram_w, skip_gram_b)*

#### cbow:

*(plots: cbow_loss, cbow_embedding, cbow_w, cbow_b)*
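For readers curious what an operation such as `hume - empiricist + rationalist` does, the usual word2vec recipe is vector arithmetic followed by a cosine-similarity ranking over the vocabulary. The sketch below illustrates that recipe with NumPy; it is an assumption about the technique, not the actual `Philo2Vec.evaluate_operation` implementation, and the helper name and arguments are hypothetical.

```python
import numpy as np

def evaluate_operation_sketch(expression, embeddings, word_to_id, id_to_word, top_k=5):
    """Rank vocabulary words by cosine similarity to the combined vector.

    embeddings: (vocab_size, embedding_size) array of trained word vectors.
    Hypothetical helper, not part of this repo.
    """
    sign = 1.0
    result = np.zeros(embeddings.shape[1])
    for token in expression.split():
        if token == '+':
            sign = 1.0
        elif token == '-':
            sign = -1.0
        else:
            result += sign * embeddings[word_to_id[token]]
    # cosine similarity of every vocabulary vector against the result vector
    scores = embeddings @ result / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(result) + 1e-8)
    return [id_to_word[i] for i in np.argsort(-scores)[:top_k]]

# e.g. evaluate_operation_sketch('hume - empiricist + rationalist',
#                                embeddings, word_to_id, id_to_word)
```

A real implementation would typically also stem the query words and exclude them from the returned neighbours.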
