financeSpiders

Category: Data collection / crawlers
Development tool: Julia
File size: 217KB
Downloads: 0
Upload date: 2018-06-11 11:18:31
Uploader: sh-1993
Description: Lightweight web scrapers and crawlers for various financial news sources. Disclaimer: Developed for educational purposes only.

File list:
financeScraper (0, 2018-06-11)
financeScraper\blo_AAPL.jl (38931, 2018-06-11)
financeScraper\blo_NVDA.jl (8103, 2018-06-11)
financeScraper\crawlers.py (5760, 2018-06-11)
financeScraper\data (0, 2018-06-11)
financeScraper\data\.ipynb_checkpoints (0, 2018-06-11)
financeScraper\data\.ipynb_checkpoints\Untitled-checkpoint.ipynb (72, 2018-06-11)
financeScraper\data\Untitled.ipynb (4327, 2018-06-11)
financeScraper\data\blo_AAPL-Jun-11-2018.jl (8581, 2018-06-11)
financeScraper\data\blo_NVDA-Jun-11-2018.jl (4719, 2018-06-11)
financeScraper\data\msn_AAPL-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\msn_NVDA-Jun-11-2018.jl (26931, 2018-06-11)
financeScraper\data\mws_AAPL-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\mws_NVDA-Jun-11-2018.jl (0, 2018-06-11)
financeScraper\data\reu_AAPL-Jun-11-2018.jl (22911, 2018-06-11)
financeScraper\data\reu_NVDA-Jun-11-2018.jl (11874, 2018-06-11)
financeScraper\financeScraper (0, 2018-06-11)
financeScraper\financeScraper\__init__.py (0, 2018-06-11)
financeScraper\financeScraper\__pycache__ (0, 2018-06-11)
financeScraper\financeScraper\__pycache__\__init__.cpython-35.pyc (169, 2018-06-11)
financeScraper\financeScraper\__pycache__\items.cpython-35.pyc (508, 2018-06-11)
financeScraper\financeScraper\__pycache__\pipelines.cpython-35.pyc (1144, 2018-06-11)
financeScraper\financeScraper\__pycache__\settings.cpython-35.pyc (413, 2018-06-11)
financeScraper\financeScraper\__pycache__\utils.cpython-35.pyc (3634, 2018-06-11)
financeScraper\financeScraper\items.py (437, 2018-06-11)
financeScraper\financeScraper\middlewares.py (3613, 2018-06-11)
financeScraper\financeScraper\pipelines.py (754, 2018-06-11)
financeScraper\financeScraper\settings.py (3121, 2018-06-11)
financeScraper\financeScraper\spiders (0, 2018-06-11)
financeScraper\financeScraper\spiders\__init__.py (161, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__ (0, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\__init__.cpython-35.pyc (177, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\crawlers.cpython-35.pyc (2068, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\mwSpider.cpython-35.pyc (1476, 2018-06-11)
financeScraper\financeScraper\spiders\__pycache__\wsjSpider.cpython-35.pyc (889, 2018-06-11)
financeScraper\financeScraper\spiders\mwSpider.py (1322, 2018-06-11)
financeScraper\financeScraper\spiders\wsjSpider.py (547, 2018-06-11)
financeScraper\financeScraper\utils.py (4296, 2018-06-11)
... ...

# financeSpiders

Lightweight web scrapers and crawlers for various financial news sources. Disclaimer: Developed for educational purposes only.

### Usage:

Dependencies: python3, Scrapy, Twisted

1. List the stock tickers you are interested in, one per line, in a file, e.g. ```stock.txt```
2. Execute ```python3 crawlers.py -i stock.txt```
3. Data output is written to the current directory as `{news_source_name}_{stock_ticker}.jl`

### Financial News Sources Supported:

- Wall Street Journal (HOLD: Needs subscription to view articles...)
- Market Watch (WIP: Handle crawling of infinite scrolling article list, check out https://stackoverflow.com/questions/25583414/working-with-post-request-to-load-more-articles-with-scrapy-python)
  - 100% able to extract from MarketWatch
- Bloomberg (Supported)
- Reuters (Supported)
- MSNBC (Supported)
- TheStreet (Not supported)
- MarketRealist (Hold: paywall)
- SeekingAlpha (Supported)
- Fool (Not supported)
- Investopedia (Not supported)

### Changelog:

- Basic scraping of current related news article headlines, links, and texts
- Examples of scraped data in `financeScraper/*.jl`
- Centralized script: ```crawlers.py``` to simplify execution and pipelining
- Crawls all MarketWatch links and scrapes their articles
- Supports scraping of multiple stock ticker symbols
- Added dynamic parsing based on source news website
- Added support for Reuters articles
- Hold on WSJ, needs subscription

#### Feb. 19th, 2018

- Added support for MSNBC

#### Mar. 5th, 2018

- Added support for SeekingAlpha

### Overall TODOs:

```diff
+ Develop web crawlers to curate article information from current links
+ Create API for scraping specific companies by stock ticker labels
+ More dynamic crawlers that can extract from different news sites
- Support more market news sites, parsing-wise
- Add date tags to .jl data files
- Add cron job to periodically scrape at some `time`
- Method to eliminate duplicate articles
```
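
### Sketch: ticker input, output naming, and duplicate removal

The usage steps and two of the open TODOs (date tags on `.jl` files, eliminating duplicate articles) can be illustrated with a small standalone sketch. It is not taken from `crawlers.py` or `utils.py`. All function names here are hypothetical. It assumes the `.jl` files are JSON Lines (one JSON object per line, as Scrapy's feed export commonly writes) and that each record stores the article URL as a string under a `link` key, which the README does not confirm.

```python
import json
from datetime import date
from pathlib import Path


def read_tickers(path):
    """Return the non-empty, whitespace-stripped lines of a ticker file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def output_name(source, ticker, when=None):
    """Build '{source}_{ticker}.jl', or a date-tagged variant if `when` is set."""
    if when is None:
        return f"{source}_{ticker}.jl"
    # Follows the pattern of names such as 'blo_AAPL-Jun-11-2018.jl'.
    # Note: %b depends on the active locale; the default C locale gives 'Jun'.
    return f"{source}_{ticker}-{when.strftime('%b-%d-%Y')}.jl"


def load_articles(path):
    """Parse a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def dedupe_articles(articles, key="link"):
    """Keep the first article for each value of `key`, preserving order.

    The key name 'link' is an assumption. Records without the key are kept.
    """
    seen = set()
    unique = []
    for article in articles:
        marker = article.get(key)
        if marker is None or marker not in seen:
            unique.append(article)
        if marker is not None:
            seen.add(marker)
    return unique
```

For example, `output_name("reu", "NVDA", date(2018, 6, 11))` would give a file name in the same style as the entries under `financeScraper/data`, and `dedupe_articles(load_articles(name))` would drop repeated links from one scraped file. Deduplicating on the URL is only one possible choice; matching on headline or article text would catch syndicated copies that differ in link.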
