news-title-stock-prediction-pytorch
所属分类:金融证券系统
开发工具:Python
文件大小:0KB
下载次数:0
上传日期:2018-12-06 03:11:06
上 传 者:
sh-1993
说明: 新闻标题股票预测pytorch,,
(news-title-stock-prediction-pytorch,,)
文件列表:
.idea/ (0, 2018-12-05)
.idea/vcs.xml (180, 2018-12-05)
img_src/ (0, 2018-12-05)
img_src/NTN.png (36655, 2018-12-05)
img_src/lms_cnn.png (38343, 2018-12-05)
network/ (0, 2018-12-05)
network/__init__.py (864, 2018-12-05)
network/lms_cnn.py (5826, 2018-12-05)
network/lms_data_generator.py (6687, 2018-12-05)
network/model_config.py (2596, 2018-12-05)
network/preprocess.py (12046, 2018-12-05)
network/util.py (3803, 2018-12-05)
nlp_vocab/ (0, 2018-12-05)
nlp_vocab/Vocab.py (4678, 2018-12-05)
nlp_vocab/VocabularyProcessor.py (6067, 2018-12-05)
nlp_vocab/__init__.py (875, 2018-12-05)
requirements.txt (29, 2018-12-05)
run_extract_triplet.py (2427, 2018-12-05)
run_predicting.py (6401, 2018-12-05)
run_preprocess.py (3144, 2018-12-05)
run_training.py (1990, 2018-12-05)
scripts/ (0, 2018-12-05)
scripts/get_spacy_model.sh (96, 2018-12-05)
scripts/prepare_chinese_word_segmenter.sh (201, 2018-12-05)
scripts/prepare_fasttext.sh (137, 2018-12-05)
scripts/run_chinese_word_segmenter.sh (171, 2018-12-05)
scripts/train_fasttext_embedding.sh (263, 2018-12-05)
# News titles for stock prediction
Deep learning for stock long short signal prediction
based on chronologically sorted news titles.
## Run
### prepare or train your own embedding for the task
Trained word embedding based on news corpus provided from above dataset:
I have a trained word embedding based on the news corpus to get you started on
the project under data/embedding/fasttext/model.vec
If you want to train your own word embedding:
```bash
sh scripts/prepare_fasttext.sh
# train the word embedding and save embedding under data/embedding, notice you might want to change model name
sh scripts/train_fasttext_embedding.sh data/train_own_word_embedding_example.txt
```
Main code has three part:
run_preprocess.py: given input dataframe, prepare date_news_embeddings and input_dataframe_with_signal for training.
vocabulary_processor and embedding matrix are saved for later use in online prediction.
run_training.py: train long mid short term cnn with daily average news embedding.
run_predicting.py: online predictor for trained model. An english version toy example is given
under run_predicting.py. After training without can mode_config.py, run this script get
a prediction result.
```python
# preprocess data
python run_preprocess.py
# traing model
python run_training.py
# online prediction
python run_predicting.py
```
## Data
### Raw
1. news data
original dataset news from Bloomberg \& Reuters from 20061020 to 20131126
total number of trading days:1786
[google drive](https://drive.google.com/open?id=0B3C8GEFwm08QY3AySmE2Z1daaUE)
please refer to network.preprocess.get_data_dict to get raw text into structured
format.
2. market data
[SPY historical data from 20061020 to 20131126](https://finance.yahoo.com/quote/SPY/history?period1=1161273600&period2=1385395200&interval=1d&filter=history&frequency=1d)
### preprocessed data and data interface to the model
You can find raw data processed method under network.preprocess.
Feel free to try by yourself on cleaning up the raw data.
Now, let's focus on the data format sent to model pipeline.
The interface to the model is a dataframe header as:
`Date, Adj Close, news_title`
An example row:
`2006-10-20 108.961449 [["news_title_1","00:00:00"],["news_title_2","01:00:00"]]`
In short we need date, adjust close price and _sorted_ daily news along reported time from old to new. Currently, the time
is only used for sorting within a day.
You can find complete Bloomberg \& Reuters data interface example at training_dir/input_dataframe.pickle
## Model
### long mid short term CNN
![lms_cnn](img_src/lms_cnn.png)
input to long_mid_short cnn(or deep prediction model mentioned in the paper) has three parts:
of same essence
events embedding: Most recent N days of events embedding with all events within a day represents
by one dense vector(currently averaged events embedding within a day).
With long term use N = 30, mid term use N = 7, short term use N =1.
Inputs on long term events embedding and mid term events embedding are Convolved and then Max pooled
. The output of both long and mid term from max pooling merged with short term input to form
a dense vector represents a long_mid_short term embedding for the prediction on next day long
or short signal.
### Events Embedding
Event embedding can be generalized define as a dense vector represents an event.
(this implementation use word embeddings concatenation of news title sentence as event embedding)
Originally in paper[1], they follows thoughts in [3] and [4] trying to learn events embedding
based on triplet relations.
Even though events embedding learned through method in [1] has around 3% ~ 4% improvements over
word embeddings concatenation of news title sentence as event embedding on the dataset in the paper.
It is not clear events embedding learned through Neural Tensor Network [3][4] can scale well at practise.
There are several issue considering events embedding learned through neural tensor network:
1. events triplet in the news title extraction can lose information: [1] use openIE and dependency parser
to extract (Subject,Verb,Object) triplet assuming openIE extraction contains S,V,O and dependency parser
can help narrow words within the extractions of openIE to come up with the final SVO.
However, the state-of-the-art openIE [5] can only extract 25%~50% of the news titles, leaving the rest of
SVO empty. This leads to huge information loss based on this method.
2. events triplet in the news title extraction is tedious for agile project development:
given above description, to prepare the the event embedding, you need to extract SVO first based
on openIE and dependency parser, then you need to train the Neural Tensor Network to get the event
embedding and save them to disk. Finally can you train CNN on event embeddings to get prediction
input. For now it is not worth around 3% accuracy gain to prepare that much effort.
3. events triplet in the news title extraction is not fully end to end for the project: It is impossible for the loss to be
back-propagate to the event extraction part (aka neural tensor network), given our goal is to learn the next day long short signal.
The author believe if given enough data(currently only 1786 trading days news title), it is worth a deeper model to learn the
information extraction by a complete model itself instead of separate SVO extraction and prediction.
![NTN](img_src/NTN.png)
## Experiments
experiments train/val results on average embedding based on word vector concatenation for news titles:
```
...
Epoch 411/1000
1404/1404 [==============================] - 1s 570us/step - loss: 0.6815 - acc: 0.6524 - val_loss: 0.7171 - val_acc: 0.6125
Epoch 412/1000
1404/1404 [==============================] - 1s 562us/step - loss: 0.6772 - acc: 0.6531 - val_loss: 0.7307 - val_acc: 0.6097
Epoch 413/1000
1404/1404 [==============================] - 1s 517us/step - loss: 0.6875 - acc: 0.6524 - val_loss: 0.7249 - val_acc: 0.6125
Epoch 414/1000
1404/1404 [==============================] - 1s 542us/step - loss: 0.6916 - acc: 0.6531 - val_loss: 0.7229 - val_acc: 0.6125
Epoch 415/1000
1404/1404 [==============================] - 1s 541us/step - loss: 0.6738 - acc: 0.6524 - val_loss: 0.7099 - val_acc: 0.6097
Epoch 416/1000
1404/1404 [==============================] - 1s 556us/step - loss: 0.6732 - acc: 0.6538 - val_loss: 0.7173 - val_acc: 0.6125
Epoch 417/1000
1404/1404 [==============================] - 1s 553us/step - loss: 0.6858 - acc: 0.6531 - val_loss: 0.7100 - val_acc: 0.6154
Epoch 418/1000
1404/1404 [==============================] - 1s 559us/step - loss: 0.6756 - acc: 0.6538 - val_loss: 0.7294 - val_acc: 0.6125
Epoch 419/1000
1404/1404 [==============================] - 1s 537us/step - loss: 0.6852 - acc: 0.6524 - val_loss: 0.7037 - val_acc: 0.6154
Epoch 420/1000
1404/1404 [==============================] - 1s 546us/step - loss: 0.6800 - acc: 0.6531 - val_loss: 0.7146 - val_acc: 0.6125
Epoch 421/1000
1404/1404 [==============================] - 1s 560us/step - loss: 0.6797 - acc: 0.6524 - val_loss: 0.7254 - val_acc: 0.6125
Epoch 422/1000
1404/1404 [==============================] - 1s 523us/step - loss: 0.6837 - acc: 0.6524 - val_loss: 0.7187 - val_acc: 0.6125
Epoch 423/1000
1404/1404 [==============================] - 1s 548us/step - loss: 0.6918 - acc: 0.6517 - val_loss: 0.7261 - val_acc: 0.6125
Epoch 424/1000
1404/1404 [==============================] - 1s 562us/step - loss: 0.6830 - acc: 0.6517 - val_loss: 0.7017 - val_acc: 0.6154
Epoch 425/1000
1404/1404 [==============================] - 1s 542us/step - loss: 0.6797 - acc: 0.6510 - val_loss: 0.7323 - val_acc: 0.6125
Epoch 426/1000
1404/1404 [==============================] - 1s 552us/step - loss: 0.6935 - acc: 0.6517 - val_loss: 0.7419 - val_acc: 0.6125
Epoch 427/1000
1404/1404 [==============================] - 1s 545us/step - loss: 0.6943 - acc: 0.6538 - val_loss: 0.7473 - val_acc: 0.6097
Epoch 428/1000
1404/1404 [==============================] - 1s 561us/step - loss: 0.6908 - acc: 0.6524 - val_loss: 0.7482 - val_acc: 0.6125
Epoch 429/1000
1404/1404 [==============================] - 1s 541us/step - loss: 0.6899 - acc: 0.6517 - val_loss: 0.7246 - val_acc: 0.6125
Epoch 430/1000
1404/1404 [==============================] - 1s 539us/step - loss: 0.6898 - acc: 0.6531 - val_loss: 0.7229 - val_acc: 0.6040
Epoch 431/1000
1404/1404 [==============================] - 1s 519us/step - loss: 0.6867 - acc: 0.6546 - val_loss: 0.7166 - val_acc: 0.6125
Epoch 432/1000
1404/1404 [==============================] - 1s 541us/step - loss: 0.6871 - acc: 0.6524 - val_loss: 0.7144 - val_acc: 0.6125
Epoch 433/1000
1404/1404 [==============================] - 1s 560us/step - loss: 0.6775 - acc: 0.6538 - val_loss: 0.7045 - val_acc: 0.6068
Epoch 434/1000
1404/1404 [==============================] - 1s 544us/step - loss: 0.6713 - acc: 0.6524 - val_loss: 0.7042 - val_acc: 0.6068
Epoch 435/1000
1404/1404 [==============================] - 1s 550us/step - loss: 0.6733 - acc: 0.6531 - val_loss: 0.7130 - val_acc: 0.6125
Epoch 436/1000
1404/1404 [==============================] - 1s 532us/step - loss: 0.6719 - acc: 0.6531 - val_loss: 0.7115 - val_acc: 0.6068
Epoch 437/1000
1404/1404 [==============================] - 1s 572us/step - loss: 0.6743 - acc: 0.6517 - val_loss: 0.7140 - val_acc: 0.6154
Epoch 438/1000
1404/1404 [==============================] - 1s 564us/step - loss: 0.6838 - acc: 0.6531 - val_loss: 0.7051 - val_acc: 0.6125
Epoch 439/1000
1404/1404 [==============================] - 1s 556us/step - loss: 0.6768 - acc: 0.6538 - val_loss: 0.7097 - val_acc: 0.6125
```
# Prepare Chinese Version input for the model
The only difference between chinese and english in the sense of model input is the natural word segmentation on
the english side. To prepare the experiments in Chinese, you need :
## A chinese word segmenter
```bash
# get THULAC, compile and download model
sh scripts/prepare_chinese_word_segmenter.sh
# A toy input example at data/chinese_word_seg_example.txt
sh scripts/run_chinese_word_segmenter.sh data/chinese_word_seg_example.txt
```
Now you have a segmented chinese corpus, you can train your own chinese word
embedding following the above instructions under train your own embedding.
## chinese counterpart input_dataframe.pickle
As mentioned above: the input_dataframe.pickle loads as a dataframe with header
`Date, Adj Close, news_title`
For chinese, you only need use chinese word segmenter to segment each chinese news
title with a space \' \'
An example row:
`2006-10-20 108.961449 [["将 句子 从 繁体 转化 为 简体","00:00:00"],["将 句子 从 繁体 转化 为 简体
","01:00:00"]]`
# references:
[1] [Ding, Xiao, Yue Zhang, Ting Liu and Junwen Duan. “Deep Learning for Event-Driven Stock Prediction.” IJCAI (2015).](https://www.semanticscholar.org/paper/Deep-Learning-for-Event-Driven-Stock-Prediction-Ding-Zhang/4938e8c8c9ea3d351d283181819af5e5801efbed)
[2] [Bojanowski, Piotr, Edouard Grave, Armand Joulin and Tomas Mikolov. “Enriching Word Vectors with Subword Information.” TACL 5 (2017): 135-146.](https://www.semanticscholar.org/paper/Enriching-Word-Vectors-with-Subword-Information-Bojanowski-Grave/0a6383b13794452fb7339a7f8a5384885186ccf6)
[3] [Socher, Richard, Danqi Chen, Christopher D. Manning and Andrew Y. Ng. “Reasoning With Neural Tensor Networks for Knowledge Base Completion.” NIPS (2013).](https://www.semanticscholar.org/paper/Reasoning-With-Neural-Tensor-Networks-for-Knowledg-Socher-Chen/50d53cc562225549457cbc782546bfbe1ac6f0cf)
[4] [Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado and Jeffrey Dean. “Distributed Representations of Words and Phrases and their Compositionality.” NIPS (2013).](https://www.semanticscholar.org/paper/Distributed-Representations-of-Words-and-Phrases-a-Mikolov-Sutskever/762b63d2eb86f8fd0de98a08561b77527ae8f165)
[5] [OpenIE 5.0](https://github.com/dair-iitd/OpenIE-standalone)
[6] [Zhongguo Li, Maosong Sun. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.](https://github.com/thunlp/THULAC)
近期下载者:
相关文件:
收藏者: