search-engine

Category: Search engine
Development tool: C++
File size: 22584 KB
Downloads: 0
Upload date: 2022-10-26 22:50:57
Uploader: sh-1993
Description: Search engine for a large collection of financial news articles

File list:
CMakeLists.txt (261, 2022-10-27)
data (0, 2022-10-27)
data\blogs_0000001.json (3477, 2022-10-27)
data\blogs_0000002.json (1935, 2022-10-27)
data\blogs_0000010.json (3076, 2022-10-27)
data\blogs_0000048.json (3920, 2022-10-27)
data\blogs_0000083.json (2657, 2022-10-27)
data\blogs_0000101.json (5096, 2022-10-27)
data\blogs_0000102.json (6177, 2022-10-27)
data\blogs_0000103.json (4329, 2022-10-27)
data\blogs_0000104.json (4298, 2022-10-27)
data\blogs_0000105.json (4125, 2022-10-27)
data\blogs_0000415.json (2459, 2022-10-27)
data\blogs_0000416.json (3816, 2022-10-27)
data\blogs_0000417.json (3155, 2022-10-27)
data\blogs_0000601.json (3169, 2022-10-27)
data\blogs_0000602.json (3750, 2022-10-27)
data\blogs_0000603.json (6864, 2022-10-27)
data\blogs_0000609.json (1720, 2022-10-27)
data\blogs_0000617.json (4095, 2022-10-27)
data\blogs_0000701.json (4106, 2022-10-27)
data\blogs_0000836.json (4123, 2022-10-27)
data\blogs_0000837.json (4123, 2022-10-27)
data\blogs_0000838.json (4614, 2022-10-27)
data\blogs_0000839.json (5510, 2022-10-27)
data\blogs_0000840.json (1883, 2022-10-27)
data\blogs_0000841.json (2744, 2022-10-27)
data\blogs_0001076.json (1515, 2022-10-27)
data\blogs_0001079.json (2258, 2022-10-27)
data\blogs_0001163.json (7437, 2022-10-27)
data\blogs_0001164.json (4016, 2022-10-27)
data\blogs_0001192.json (1735, 2022-10-27)
data\blogs_0001220.json (4809, 2022-10-27)
data\blogs_0001221.json (4326, 2022-10-27)
data\blogs_0001305.json (1428, 2022-10-27)
... ...

# Search Engine

This repository was created and is maintained by Cullen Watson. Email: cullen@cullenwatson.com

## Functionality

This program is a search engine for a large collection of financial news articles from January to May 2018. The dataset contains more than 300,000 articles. The first 10,000 articles are included in the repo for testing the search engine. It uses a self-implemented AVL tree to store word objects and a hash table for the people and organizations associated with each article. Search results are ranked by the term-frequency/inverse document frequency (tf/idf) metric.

The full dataset can be found [here](https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles?resource=download&select=2018_02_112b52537b67659ad3609a234388c50a).

## How to Query with Examples

After loading the dataset, you can view statistics about the index, including the most popular words, the average number of words per article, etc. There are many ways to query the articles:

* **markets**
  This query returns all articles that contain the word markets.
* **AND social network**
  This query returns all articles that contain the words “social” and “network” (they do not have to appear as a two-word phrase).
* **AND social network PERSON cramer**
  This query returns all articles that contain the words social and network and that mention cramer as a person entity.
* **AND social network ORG facebook PERSON cramer**
  This query returns all articles that contain the words social and network, that have facebook as an organization entity, and that mention cramer as a person entity.
* **OR snap facebook**
  This query returns all articles that contain either snap or facebook.
* **OR facebook meta NOT profits**
  This query returns all articles that contain facebook or meta but do not contain the word profits.
* **bankruptcy NOT facebook**
  This query returns all articles that contain bankruptcy but not facebook.
* **OR facebook instagram NOT bankruptcy ORG snap PERSON cramer**
  This query returns any article that contains the word facebook or instagram but does not contain the word bankruptcy, and that has snap as an organization entity and cramer as a person entity.

## How to Run (must use WSL)

[Video Demo](https://youtu.be/H_8EPUopvew)

Compile and build with CMake. There are no command-line arguments. First, load the dataset by entering option one in the menu and then specifying the location of the dataset. Type `../data` to use the 10,000 articles included in the repo. Building the index takes around a minute. You can then begin querying.
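
The tf/idf ranking mentioned under Functionality could look roughly like the sketch below. It is not taken from the repository. The type names, the function names `tfIdf` and `rankByTfIdf`, and the use of `std::map` in place of the project's AVL tree are all assumptions for illustration. The exact tf/idf variant (term count divided by article length, natural-log idf) is also an assumption, since the README only names the metric.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; the repository stores words in
// its own AVL tree rather than std::map.
typedef std::string DocId;
typedef std::map<DocId, int> Postings;     // article -> occurrences of one word
typedef std::pair<DocId, double> Scored;   // article -> tf/idf score

// tf  = occurrences of the word / total words in the article
// idf = ln(total articles / articles containing the word)
double tfIdf(int termCount, int docLength,
             std::size_t totalDocs, std::size_t docsWithTerm) {
    if (termCount <= 0 || docLength <= 0 || docsWithTerm == 0) return 0.0;
    double tf = static_cast<double>(termCount) / docLength;
    double idf = std::log(static_cast<double>(totalDocs) / docsWithTerm);
    return tf * idf;
}

// Scores every article that contains the word and sorts highest score first.
// Articles with an unknown length receive a score of 0.
std::vector<Scored> rankByTfIdf(const Postings& postings,
                                const std::map<DocId, int>& docLengths,
                                std::size_t totalDocs) {
    std::vector<Scored> scored;
    for (Postings::const_iterator p = postings.begin(); p != postings.end(); ++p) {
        std::map<DocId, int>::const_iterator len = docLengths.find(p->first);
        int length = (len != docLengths.end()) ? len->second : 0;
        scored.push_back(Scored(
            p->first, tfIdf(p->second, length, totalDocs, postings.size())));
    }
    std::stable_sort(scored.begin(), scored.end(),
                     [](const Scored& a, const Scored& b) {
                         return a.second > b.second;
                     });
    return scored;
}
```

With this variant, a word that appears in every article gets an idf of zero, so very common words contribute nothing to the ranking.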

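The query syntax shown in the examples can be read as an optional leading `AND` or `OR`, followed by search words, with `NOT`, `ORG`, and `PERSON` switching what the following words mean. A minimal parsing sketch under that reading is shown below. The `ParsedQuery` struct, its field names, and `parseQuery` are hypothetical and not taken from the repository. It also assumes that every word after a keyword belongs to that keyword until the next keyword appears, which the README does not state.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical structure; field names are illustrative, not from the repo.
struct ParsedQuery {
    bool matchAny = false;              // true for OR; false for AND or a single word
    std::vector<std::string> terms;     // words to search for
    std::vector<std::string> excluded;  // words after NOT
    std::vector<std::string> orgs;      // organization entities after ORG
    std::vector<std::string> persons;   // person entities after PERSON
};

// Splits the query on whitespace and routes each word to the list selected
// by the most recent keyword. Keywords are assumed to be upper case.
ParsedQuery parseQuery(const std::string& query) {
    ParsedQuery parsed;
    std::vector<std::string>* target = &parsed.terms;
    std::istringstream in(query);
    std::string token;
    bool first = true;
    while (in >> token) {
        if (first && (token == "AND" || token == "OR")) {
            parsed.matchAny = (token == "OR");
        } else if (token == "NOT") {
            target = &parsed.excluded;
        } else if (token == "ORG") {
            target = &parsed.orgs;
        } else if (token == "PERSON") {
            target = &parsed.persons;
        } else {
            target->push_back(token);
        }
        first = false;
    }
    return parsed;
}
```

For instance, `parseQuery("bankruptcy NOT facebook")` would put `bankruptcy` in `terms` and `facebook` in `excluded`, leaving `matchAny` false.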