# philo2vec
A TensorFlow implementation of word2vec applied to the [Stanford Encyclopedia of Philosophy](http://plato.stanford.edu/). The implementation supports both `cbow` and `skip gram`.
For more background, have a look at these papers:
* [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [word2vec Parameter Learning Explained](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
* [Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method](http://arxiv.org/pdf/1402.3722v1.pdf)
After training, the model returns some interesting results; see the [interesting results section](https://github.com/mouradmourafiq/philo2vec#some-interesting-results).
For example, evaluating `hume - empiricist + rationalist`:
```
descartes
malebranche
spinoza
hobbes
herder
```
### Some interesting results
#### Similarities
Similar words to `death`:
```
untimely
ravages
grief
torment
```
Similar words to `god`:
```
divine
De Providentia
christ
Hesiod
```
Similar words to `love`:
```
friendship
affection
christ
reverence
```
Similar words to `life`:
```
career
live
lifetime
community
society
```
Similar words to `brain`:
```
neurological
senile
nerve
nervous
```
#### Operations
Evaluating `hume - empiricist + rationalist`:
```
descartes
malebranche
spinoza
hobbes
herder
```
Evaluating `ethics - rational`:
```
hiroshima
```
Evaluating `ethic - reason`:
```
inegalitarian
anti-naturalist
austere
```
Evaluating `moral - rational`:
```
commonsense
```
Evaluating `life - death + love`:
```
self-positing
friendship
care
harmony
```
Evaluating `death + choice`:
```
regret
agony
misfortune
impending
```
Evaluating `god + human`:
```
divine
inviolable
yahweh
god-like
man
```
Evaluating `god + religion`:
```
amida
torah
scripture
buddha
sokushinbutsu
```
Evaluating `politic + moral`:
```
rights-oriented
normative
ethics
integrity
```
### The repo contains:
* an object to crawl data from the philosophy encyclopedia: [PlatoData](https://github.com/mouradmourafiq/philo2vec/blob/master/data.py)
* an object to build the vocabulary based on the crawled data: [VocabBuilder](https://github.com/mouradmourafiq/philo2vec/blob/master/preprocessors.py)
* the model that computes the continuous distributed representations of words: [Philo2Vec](https://github.com/mouradmourafiq/philo2vec/blob/master/models.py)
### Installation
The dependencies for this module can be installed with pip:
```
> pip install -r requirements.txt
```
### The params for the VocabBuilder:
* **min_frequency**: the minimum frequency of the words to be used in the model.
* **size**: the maximum vocabulary size; the model then uses the `size` most frequent words.
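The actual vocabulary building lives in `preprocessors.py`; as a rough illustration of how `min_frequency`/`size`-style filtering typically works (function and bucket names here are hypothetical, not the repo's API), a minimal sketch:

```python
from collections import Counter

def build_vocab(tokens, min_frequency=5, size=10000):
    """Keep the `size` most frequent tokens that occur at least
    `min_frequency` times; everything else maps to an UNK bucket."""
    counts = Counter(tokens)
    kept = [(w, c) for w, c in counts.most_common(size) if c >= min_frequency]
    # index 0 is reserved for unknown / rare words
    word_to_id = {'UNK': 0}
    for word, _ in kept:
        word_to_id[word] = len(word_to_id)
    return word_to_id

tokens = ['reason'] * 6 + ['ethics'] * 5 + ['rare'] * 2
vocab = build_vocab(tokens, min_frequency=5, size=10)
# 'rare' falls below min_frequency and is excluded from the vocabulary
```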
### The hyperparams of the model:
* **optimizer**: an instance of tensorflow `Optimizer`, such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.
* **model**: the model to use to create the vectorized representation, possible values: `CBOW`, `SKIP_GRAM`.
* **loss_fct**: the loss function used to calculate the error, possible values: `SOFTMAX`, `NCE`.
* **embedding_size**: dimensionality of word embeddings.
* **neg_sample_size**: the number of negative samples for each positive sample.
* **num_skips**: the number of skips for a `SKIP_GRAM` model.
* **context_window**: the window size; this window is used to create the context for calculating the vector representations [ window target window ].
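To make the `context_window` parameter concrete, here is a hedged, self-contained sketch of how (context, target) training pairs are typically generated for the two model types; this is an illustration, not the repo's actual batching code:

```python
def cbow_pairs(tokens, window=2):
    """CBOW: predict the target word from its surrounding context."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skip_gram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the target word."""
    pairs = []
    for i in range(window, len(tokens) - window):
        for j in range(i - window, i + window + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

sentence = ['hume', 'was', 'an', 'empiricist', 'philosopher']
# with window=2, only 'an' has a full [ window target window ] context here
```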
### Quick usage:
```python
params = {
'model': Philo2Vec.CBOW,
'loss_fct': Philo2Vec.NCE,
'context_window': 5,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```
```python
params = {
'model': Philo2Vec.SKIP_GRAM,
'loss_fct': Philo2Vec.SOFTMAX,
'context_window': 2,
'num_skips': 4,
'neg_sample_size': 2,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```
### About stemming
Since the words are stemmed as part of the preprocessing, some lookups are sometimes necessary:
```python
StemmingLookup.stem('religious') # returns "religi"
StemmingLookup.original_form('religi') # returns "religion"
```
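`original_form` presumably keeps a reverse mapping from each stem to the surface forms seen during preprocessing. A minimal sketch of one way such a lookup could work (hypothetical implementation, not the repo's code):

```python
from collections import Counter, defaultdict

class ReverseStemLookup:
    """Map each stem back to the most frequent original word seen for it."""

    def __init__(self):
        self._seen = defaultdict(Counter)

    def record(self, original, stem):
        self._seen[stem][original] += 1

    def original_form(self, stem):
        # fall back to the stem itself if it was never recorded
        if stem not in self._seen:
            return stem
        return self._seen[stem].most_common(1)[0][0]

lookup = ReverseStemLookup()
lookup.record('religion', 'religi')
lookup.record('religion', 'religi')
lookup.record('religious', 'religi')
```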
### Getting similarities
```python
pv.get_similar_words(['rationalist', 'empirist'])
```
### Evaluating operations
```python
pv.evaluate_operation('moral - rational')
```
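Under the hood, an operation like `moral - rational` amounts to arithmetic on the word vectors followed by a nearest-neighbour search by cosine similarity. A pure-Python sketch of the idea, with toy 2-d embeddings and hypothetical names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def evaluate_operation(positive, negative, embeddings, top_n=3):
    """Add the `positive` vectors, subtract the `negative` ones, and
    return the words closest to the result by cosine similarity."""
    dims = len(next(iter(embeddings.values())))
    query = [0.0] * dims
    for w in positive:
        query = [q + x for q, x in zip(query, embeddings[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, embeddings[w])]
    candidates = [w for w in embeddings if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(query, embeddings[w]),
                  reverse=True)[:top_n]

# toy embeddings for illustration only
emb = {'moral': [1.0, 1.0], 'rational': [0.0, 1.0],
       'commonsense': [1.0, 0.1], 'logic': [0.0, 0.9]}
```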
### Plotting vectorized words
```python
pv.plot(['hume', 'empiricist', 'descart', 'rationalist'])
```
### Training details
#### skip_gram:
#### cbow: