financeSpiders
Category: Data collection / crawlers
Development tool: Julia
File size: 217KB
Downloads: 0
Upload date: 2018-06-11 11:18:31
Uploader: sh-1993
Description: Lightweight web scrapers and crawlers for various financial news sources. Disclaimer: Developed for educational purposes only.
File list:
financeScraper (0, 2018-06-11)
financeScraper\blo_AAPL.jl (38931, 2018-06-11)
financeScraper\blo_NVDA.jl (8103, 2018-06-11)
financeScraper\crawlers.py (5760, 2018-06-11)
financeScraper\data (0, 2018-06-11)
financeScraper\data\.ipynb_checkpoints (0, 2018-06-11)
financeScraper\data\.ipynb_checkpoints\Untitled-checkpoint.ipynb (72, 2018-06-11)
financeScraper\data\Untitled.ipynb (4327, 2018-06-11)
financeScraper\data\blo_AAPL-Jun-11-2018.jl (8581, 2018-06-11)
financeScraper\data\blo_NVDA-Jun-11-2018.jl (4719, 2018-06-11)
financeScraper\data\msn_AAPL-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\msn_NVDA-Jun-11-2018.jl (26931, 2018-06-11)
financeScraper\data\mws_AAPL-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\mws_NVDA-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\reu_AAPL-Jun-11-2018.jl (22911, 2018-06-11)
financeScraper\data\reu_NVDA-Jun-11-2018.jl (11874, 2018-06-11)
financeScraper\financeScraper (0, 2018-06-11)
financeScraper\financeScraper\__init__.py (0, 2018-06-11)
financeScraper\financeScraper\__pycache__ (0, 2018-06-11)
financeScraper\financeScraper\__pycache__\__init__.cpython-35.pyc (169, 2018-06-11)
financeScraper\financeScraper\__pycache__\items.cpython-35.pyc (508, 2018-06-11)
financeScraper\financeScraper\__pycache__\pipelines.cpython-35.pyc (1144, 2018-06-11)
financeScraper\financeScraper\__pycache__\settings.cpython-35.pyc (413, 2018-06-11)
financeScraper\financeScraper\__pycache__\utils.cpython-35.pyc (3634, 2018-06-11)
financeScraper\financeScraper\items.py (437, 2018-06-11)
financeScraper\financeScraper\middlewares.py (3613, 2018-06-11)
financeScraper\financeScraper\pipelines.py (754, 2018-06-11)
financeScraper\financeScraper\settings.py (3121, 2018-06-11)
financeScraper\financeScraper\spiders (0, 2018-06-11)
financeScraper\financeScraper\spiders\__init__.py (161, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__ (0, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\__init__.cpython-35.pyc (177, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\crawlers.cpython-35.pyc (2068, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\mwSpider.cpython-35.pyc (1476, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\wsjSpider.cpython-35.pyc (889, 2018-06-11)
financeScraper\financeScraper\spiders\mwSpider.py (1322, 2018-06-11)
financeScraper\financeScraper\spiders\wsjSpider.py (547, 2018-06-11)
financeScraper\financeScraper\utils.py (4296, 2018-06-11)
... ...
# financeSpiders
Lightweight web scrapers and crawlers for various financial news sources.
Disclaimer: Developed for educational purposes only.
### Usage:
Dependencies: python3, Scrapy, Twisted
1. List the stock tickers you are interested in, one per line, in a file, e.g. ```stock.txt```
2. Execute
```python3 crawlers.py -i stock.txt```
3. Output is written to the current directory, in files named '{news_source_name}\_{stock_ticker}.jl'
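
The input and output conventions above can be illustrated with a minimal sketch. It assumes the `.jl` files are JSON Lines (one JSON object per line), which is the extension Scrapy conventionally uses for that feed format. The helper names `read_tickers`, `output_name`, and `load_items` are hypothetical and are not taken from `crawlers.py`.

```python
import json
from pathlib import Path


def read_tickers(path):
    """Return the ticker symbols listed one per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def output_name(source, ticker):
    """Build a file name following '{news_source_name}_{stock_ticker}.jl'."""
    return f"{source}_{ticker}.jl"


def load_items(path):
    """Parse a JSON Lines file (assumed format): one object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                items.append(json.loads(line))
    return items
```

For example, `load_items(output_name("reu", "AAPL"))` would read `reu_AAPL.jl` from the current directory, if that file exists and is in the assumed format.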
### Financial News Sources Supported:
- Wall Street Journal (On hold: a subscription is needed to view articles)
- MarketWatch (WIP: crawling of the infinite-scrolling article list still needs to be handled; see https://stackoverflow.com/questions/25583414/working-with-post-request-to-load-more-articles-with-scrapy-python)
  - Extraction from MarketWatch articles works fully
- Bloomberg (Supported)
- Reuters (Supported)
- MSNBC (Supported)
- TheStreet (Not supported)
- MarketRealist (On hold: paywall)
- SeekingAlpha (Supported)
- Fool (Not supported)
- Investopedia (Not supported)
### Changelog:
- Basic scraping of current related news article headlines, links, and texts
- Examples of scraped data in `financeScraper/*.jl`
- Centralized script: ```crawlers.py``` to simplify execution and pipelining
- Crawls all MarketWatch links and scrapes their articles
- Supports scraping of multiple stock ticker symbols
- Added dynamic parsing based on source news website
- Added support for Reuters articles
- WSJ put on hold; it needs a subscription
#### Feb. 19th, 2018
- Added support for MSNBC
#### Mar. 5th, 2018
- Added support for SeekingAlpha
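
The "dynamic parsing based on source news website" entry could work roughly as in the sketch below, which picks a source by the hostname of an article URL. This is only an illustration, not the project's actual code: the `SOURCE_BY_DOMAIN` table and `source_for` function are hypothetical, the short prefixes are borrowed from the sample data file names, and the domain-to-prefix pairing is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical registry: site domain -> short source prefix.
# The prefixes mirror the sample data file names; the pairing is assumed.
SOURCE_BY_DOMAIN = {
    "reuters.com": "reu",
    "bloomberg.com": "blo",
    "marketwatch.com": "mws",
}


def source_for(url):
    """Return the source prefix for a URL, or None if the site is unknown."""
    host = (urlparse(url).hostname or "").lower()
    for domain, source in SOURCE_BY_DOMAIN.items():
        # Match the bare domain or any subdomain such as www.reuters.com.
        if host == domain or host.endswith("." + domain):
            return source
    return None
```

A crawler could then look up a site-specific parse function by the returned prefix and skip links whose source is `None`.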
### Overall TODOs:
```diff
+ Develop web crawlers to curate article information from current links
+ Create API for scraping specific companies by stock ticker labels
+ More dynamic crawlers that can extract from different news sites
- Support more market news sites, parsing wise
- Add date tags to .jl data files
- Add cron job to periodically scrape at some `time`
- Method to eliminate duplicate articles
```
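
For the duplicate-elimination TODO, one possible approach is to keep only the first item seen for each article link. The sketch below assumes each scraped item is a dict with a `link` field; that field name is a guess based on the "headlines, links, and texts" changelog entry, and `dedupe_items` is a hypothetical helper.

```python
def dedupe_items(items, key="link"):
    """Keep the first item for each distinct value of `key`, preserving order.

    Items that lack the key are always kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for item in items:
        value = item.get(key)
        if value is not None and value in seen:
            continue  # an item with this link was already kept
        if value is not None:
            seen.add(value)
        unique.append(item)
    return unique
```

Matching on the link alone would miss the same article republished under a different URL; comparing normalized headlines or text hashes would be a stricter alternative.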