search-engine
Category: Search engine
Development tool: C++
File size: 22584KB
Downloads: 0
Upload date: 2022-10-26 22:50:57
Uploader: sh-1993
Description: Search engine for a large collection of financial news articles
File list:
CMakeLists.txt (261, 2022-10-27)
data (0, 2022-10-27)
data\blogs_0000001.json (3477, 2022-10-27)
data\blogs_0000002.json (1935, 2022-10-27)
data\blogs_0000010.json (3076, 2022-10-27)
data\blogs_0000048.json (3920, 2022-10-27)
data\blogs_0000083.json (2657, 2022-10-27)
data\blogs_0000101.json (5096, 2022-10-27)
data\blogs_0000102.json (6177, 2022-10-27)
data\blogs_0000103.json (4329, 2022-10-27)
data\blogs_0000104.json (4298, 2022-10-27)
data\blogs_0000105.json (4125, 2022-10-27)
data\blogs_0000415.json (2459, 2022-10-27)
data\blogs_0000416.json (3816, 2022-10-27)
data\blogs_0000417.json (3155, 2022-10-27)
data\blogs_0000601.json (3169, 2022-10-27)
data\blogs_0000602.json (3750, 2022-10-27)
data\blogs_0000603.json (6864, 2022-10-27)
data\blogs_0000609.json (1720, 2022-10-27)
data\blogs_0000617.json (4095, 2022-10-27)
data\blogs_0000701.json (4106, 2022-10-27)
data\blogs_0000836.json (4123, 2022-10-27)
data\blogs_0000837.json (4123, 2022-10-27)
data\blogs_0000838.json (4614, 2022-10-27)
data\blogs_0000839.json (5510, 2022-10-27)
data\blogs_0000840.json (1883, 2022-10-27)
data\blogs_0000841.json (2744, 2022-10-27)
data\blogs_0001076.json (1515, 2022-10-27)
data\blogs_0001079.json (2258, 2022-10-27)
data\blogs_0001163.json (7437, 2022-10-27)
data\blogs_0001164.json (4016, 2022-10-27)
data\blogs_0001192.json (1735, 2022-10-27)
data\blogs_0001220.json (4809, 2022-10-27)
data\blogs_0001221.json (4326, 2022-10-27)
data\blogs_0001305.json (1428, 2022-10-27)
... ...
# Search Engine
This repository was created and maintained by Cullen Watson.
Email: cullen@cullenwatson.com
## Functionality
This program is a search engine for a large collection of financial news
articles published from January to May 2018. The full dataset contains more than 300,000 articles; the repo includes
the first 10,000 articles for testing the search engine.
It uses a self-implemented AVL tree to store word objects and a hash table to store the people and organizations associated with each article.
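As a rough illustration of the word index, the sketch below keeps an AVL tree keyed by word, where each node holds the IDs of the articles containing that word. The class name `AvlIndex` and the members `add`, `find`, and `height` are hypothetical and not taken from the repository, whose implementation may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch: AVL tree mapping a word to the IDs of articles containing it.
class AvlIndex {
    struct Node {
        std::string word;
        std::set<int> docs;  // IDs of articles that contain `word`
        int height = 1;
        std::unique_ptr<Node> left, right;
        explicit Node(std::string w) : word(std::move(w)) {}
    };
    std::unique_ptr<Node> root_;

    static int heightOf(const std::unique_ptr<Node>& n) { return n ? n->height : 0; }
    static void update(Node& n) { n.height = 1 + std::max(heightOf(n.left), heightOf(n.right)); }
    static int balance(const Node& n) { return heightOf(n.left) - heightOf(n.right); }

    static std::unique_ptr<Node> rotateRight(std::unique_ptr<Node> y) {
        std::unique_ptr<Node> x = std::move(y->left);
        y->left = std::move(x->right);
        update(*y);
        x->right = std::move(y);
        update(*x);
        return x;
    }
    static std::unique_ptr<Node> rotateLeft(std::unique_ptr<Node> x) {
        std::unique_ptr<Node> y = std::move(x->right);
        x->right = std::move(y->left);
        update(*x);
        y->left = std::move(x);
        update(*y);
        return y;
    }

    static std::unique_ptr<Node> insert(std::unique_ptr<Node> n, const std::string& w, int doc) {
        if (!n) {
            n = std::make_unique<Node>(w);
            n->docs.insert(doc);
            return n;
        }
        if (w < n->word) n->left = insert(std::move(n->left), w, doc);
        else if (n->word < w) n->right = insert(std::move(n->right), w, doc);
        else { n->docs.insert(doc); return n; }  // word already indexed

        update(*n);
        int b = balance(*n);
        if (b > 1) {  // left-heavy: left-left or left-right case
            if (balance(*n->left) < 0) n->left = rotateLeft(std::move(n->left));
            return rotateRight(std::move(n));
        }
        if (b < -1) {  // right-heavy: right-right or right-left case
            if (balance(*n->right) > 0) n->right = rotateRight(std::move(n->right));
            return rotateLeft(std::move(n));
        }
        return n;
    }

public:
    // Records that article `docId` contains `word`.
    void add(const std::string& word, int docId) {
        assert(!word.empty());
        root_ = insert(std::move(root_), word, docId);
    }
    // Returns the article IDs for `word`, or nullptr if the word is not indexed.
    const std::set<int>* find(const std::string& word) const {
        const Node* n = root_.get();
        while (n) {
            if (word < n->word) n = n->left.get();
            else if (n->word < word) n = n->right.get();
            else return &n->docs;
        }
        return nullptr;
    }
    int height() const { return heightOf(root_); }
};
```

Because the tree rebalances on every insertion, lookups stay logarithmic in the number of distinct words even when words arrive in sorted order.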
Articles in the search results are ranked by the term frequency-inverse document frequency (TF-IDF) metric.
The full dataset is available [here](https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles?resource=download&select=2018_02_112b52537b67659ad3609a234388c50a).
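To make the ranking step concrete, here is a minimal sketch of TF-IDF scoring for a single query term. The README does not say which TF-IDF variant the project uses, so the formula below (term count divided by article length, multiplied by the natural log of total articles over articles containing the term) is an assumption, and the names `Posting`, `tfIdf`, and `rankByTfIdf` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-article statistics for one query term.
struct Posting {
    int docId;              // article identifier
    std::size_t termCount;  // occurrences of the term in the article
    std::size_t docLength;  // total number of words in the article
};

// One common TF-IDF variant (assumed, not confirmed by the README):
//   tf  = termCount / docLength
//   idf = ln(totalDocs / docsWithTerm)
double tfIdf(const Posting& p, std::size_t totalDocs, std::size_t docsWithTerm) {
    assert(p.docLength > 0 && docsWithTerm > 0 && docsWithTerm <= totalDocs);
    double tf = static_cast<double>(p.termCount) / static_cast<double>(p.docLength);
    double idf = std::log(static_cast<double>(totalDocs) / static_cast<double>(docsWithTerm));
    return tf * idf;
}

// Returns (docId, score) pairs sorted from highest to lowest score.
// `postings` holds one entry per article that contains the term.
std::vector<std::pair<int, double>> rankByTfIdf(const std::vector<Posting>& postings,
                                                std::size_t totalDocs) {
    std::vector<std::pair<int, double>> ranked;
    ranked.reserve(postings.size());
    for (const Posting& p : postings)
        ranked.emplace_back(p.docId, tfIdf(p, totalDocs, postings.size()));
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;
}
```

For a multi-word query, one plausible extension is to sum each article's per-term scores before sorting.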
## How to Query with Examples
After loading the dataset, you can view statistics about the index, including the most popular words, the average number of words per
article, etc.
There are many ways to query the articles:
* **markets**
This query returns all articles that contain the word markets.
* **AND social network**
This query returns all articles that contain both the words “social” and “network”
(not necessarily as a two-word phrase).
* **AND social network PERSON cramer**
This query returns all articles that contain the words social and network and that
mention cramer as a person entity.
* **AND social network ORG facebook PERSON cramer**
This query returns all articles that contain the words social and network, that
have facebook as an organization entity, and that mention cramer as a person entity.
* **OR snap facebook**
This query returns all articles that contain either snap or facebook.
* **OR facebook meta NOT profits**
This query returns all articles that contain facebook or meta but that do not
contain the word profits.
* **bankruptcy NOT facebook**
This query returns all articles that contain bankruptcy, but not facebook.
* **OR facebook instagram NOT bankruptcy ORG snap PERSON cramer**
This query returns all articles that contain the word facebook or instagram but
that do not contain the word bankruptcy, and that have snap as an organization
entity and cramer as a person entity.
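The query forms above can be sketched as a small parser plus a matching predicate. This is only an illustration under stated assumptions: keywords are upper case as in the examples, a leading `AND` or `OR` sets the mode, and every token after `NOT`, `ORG`, or `PERSON` belongs to that keyword until the next one. The names `Query`, `Article`, `parseQuery`, and `matches` are hypothetical and are not the repository's API.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of a query; field names are illustrative only.
struct Query {
    bool matchAny = false;              // true for OR, false for AND or a single term
    std::vector<std::string> terms;     // words to search for
    std::vector<std::string> excluded;  // words after NOT
    std::vector<std::string> orgs;      // names after ORG
    std::vector<std::string> persons;   // names after PERSON
};

// Hypothetical view of an indexed article.
struct Article {
    std::set<std::string> words, orgs, persons;
};

// Splits on whitespace and routes each token to the list selected by the
// most recent keyword.
Query parseQuery(const std::string& text) {
    Query q;
    std::vector<std::string>* target = &q.terms;
    std::istringstream in(text);
    std::string token;
    bool first = true;
    while (in >> token) {
        if (first && (token == "AND" || token == "OR")) q.matchAny = (token == "OR");
        else if (token == "NOT") target = &q.excluded;
        else if (token == "ORG") target = &q.orgs;
        else if (token == "PERSON") target = &q.persons;
        else target->push_back(token);
        first = false;
    }
    // Sketch-level check: every example query has at least one search term.
    assert(!q.terms.empty());
    return q;
}

// True if the article satisfies the terms, exclusions, and entity filters.
bool matches(const Query& q, const Article& a) {
    auto has = [](const std::set<std::string>& s, const std::string& w) { return s.count(w) > 0; };
    auto inWords = [&](const std::string& w) { return has(a.words, w); };
    bool termsOk = q.matchAny ? std::any_of(q.terms.begin(), q.terms.end(), inWords)
                              : std::all_of(q.terms.begin(), q.terms.end(), inWords);
    return termsOk
        && std::none_of(q.excluded.begin(), q.excluded.end(), inWords)
        && std::all_of(q.orgs.begin(), q.orgs.end(),
                       [&](const std::string& o) { return has(a.orgs, o); })
        && std::all_of(q.persons.begin(), q.persons.end(),
                       [&](const std::string& p) { return has(a.persons, p); });
}
```

Under these assumptions, `parseQuery("bankruptcy NOT facebook")` produces one search term and one excluded term, and `matches` rejects any article whose word set contains facebook.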
## How to Run (must use WSL)
[Video Demo](https://youtu.be/H_8EPUopvew)
Compile and build with CMake. The program takes no command-line arguments.
First, load the dataset by choosing option one in the menu and then entering the path to the dataset.
Type `../data` to use the 10,000 articles included in the repo. The index is built in about a minute, after which you can begin querying.