song-search-engine-csf469

Category: Feature extraction
Development tool: Python
File size: 35990KB
Downloads: 0
Upload date: 2017-11-25 11:54:19
Uploader: sh-1993
Description: Information Retrieval course assignments (CS469) at BITS Pilani Hyderabad Campus

File list:
.vscode (0, 2017-11-25)
.vscode\settings.json (38, 2017-11-25)
LICENSE (1071, 2017-11-25)
__init__.py (0, 2017-11-25)
__pycache__ (0, 2017-11-25)
__pycache__\naya.cpython-35.pyc (4438, 2017-11-25)
__pycache__\store_scores_gui.cpython-35.pyc (5613, 2017-11-25)
assignment-1.zip (639810, 2017-11-25)
assignment-2.zip (2021690, 2017-11-25)
assignment-3.zip (7052, 2017-11-25)
assignment-3 (0, 2017-11-25)
assignment-3\find_cur.py (4052, 2017-11-25)
assignment-3\recommendations0811.py (8160, 2017-11-25)
assignment-3\svd2.py (3773, 2017-11-25)
assignment-3\svd_recommender.py (2962, 2017-11-25)
corpus.zip (34653094, 2017-11-25)
documentation (0, 2017-11-25)
documentation\code-documetation.md (3467, 2017-11-25)
documentation\design-document.md (3590, 2017-11-25)
final_gui.py (1504, 2017-11-25)
savers.zip (15573598, 2017-11-25)
songname.txt (1345657, 2017-11-25)
static (0, 2017-11-25)
static\bootstrap.min.js (37045, 2017-11-25)
static\jquery.min.js (86658, 2017-11-25)
static\main.css (13142, 2017-11-25)
static\styles (0, 2017-11-25)
static\styles\bootstrap.min.css (121200, 2017-11-25)
static\styles\style.css (143, 2017-11-25)
store_document_tokens_list.py (1453, 2017-11-25)
store_megadict.py (2887, 2017-11-25)
store_scores_gui.py (8121, 2017-11-25)
store_vocabulary.py (1511, 2017-11-25)
templates (0, 2017-11-25)
templates\index.html (1979, 2017-11-25)

## Song Search Engine

- A tf-idf based search engine for searching about 50000 songs. The main purpose of this project is to understand how vector space based retrieval models work.
- More on [Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Install all the dependencies using pip3.

## The program/application can be broken down into the following subparts (actual file names also added):

1. store_document_tokens_list.py: Stores the tokenized words of each document as a list, then stores the resulting list of lists in a JSON file.
2. store_vocabulary.py: Stores all the unique words present in the corpus.
3. store_megadict.py: Creates a dictionary keyed by the words in the vocabulary. Each value is another dictionary keyed by document, whose value is in turn a dictionary containing the TF, IDF and TF-IDF values.
4. store_scores_gui.py: Takes a query as input and calculates the score for each document.
5. final_gui.py: Contains the GUI program, written with the Flask framework for Python, which accepts a query and returns the names of the top 10 documents with the highest scores.

### Order of executing the files

```
$ sudo python3 store_document_tokens_list.py
$ sudo python3 store_vocabulary.py
$ sudo python3 store_megadict.py
$ sudo python3 store_scores_gui.py
$ sudo python3 final_gui.py
```

## Installation:

Run the following in a terminal.

```
$ sudo pip install -r requirements.txt
```

If you face any problem, install `nltk` separately.

### Installing `nltk`

```
$ pip3 install nltk
$ python3
>>> import nltk
>>> nltk.download()
Packages: all
```

## DATA STRUCTURES USED:

### Document_tokens_list

A list of lists. Each inner list holds the stemmed tokens of one file in the corpus, and all of these lists are appended to a single outer list.

Example:

```
[['i', 'play', 'cricket'], ['sachin', 'tendulkar'], ['india', 'is', 'best']]
```

### Vocabulary

A dictionary of all the unique words in the corpus.

Example:

```
{'i': 1, 'play': 2, 'cricket': 3, 'sachin': 4, 'tendulkar': 5, 'india': 6, 'is': 7, 'best': 8}
```

### Prime Dictionary

A nested dictionary with the structure shown in the following example (the numbers are only representational):

```
{'i': {'0': {'TF': 1, 'IDF': 0.8, 'TF-IDF': 0.8},
       '1': {'TF': 2, 'IDF': 0.4, 'TF-IDF': 0.8},
       '2': {'TF': 0, 'IDF': 0.3, 'TF-IDF': 0}},
 'cricket': {'0': {'TF': 2, 'IDF': 0.6, 'TF-IDF': 1.2},
             '1': {'TF': 0, 'IDF': 0.4, 'TF-IDF': 0},
             '2': {'TF': 1, 'IDF': 0.4, 'TF-IDF': 0.4}}}
```

### Scores

A dictionary containing the score of each document, computed by running the cosine similarity algorithm against the input query.

Example:

```
{'0': 0.2323, '1': 0.3125, '2': 0.467}
```

## Creating The GUI

The GUI has been created using the Flask framework for Python, and the front-end web pages have been designed using HTML, CSS and Bootstrap. We have also provided multilingual query support using a Google API. Details about each song are obtained using the iTunes API.

### The Search Engine Home page

![](https://i.imgur.com/uIDeHaq.png?1)

### The Result page

![](https://i.imgur.com/LrN9dEA.png)

### The result page, query in Chinese (Traditional)

![](https://i.imgur.com/GRR3llH.png?1)

## Machine specs:

1. Processor: i7 4700HQ
2. RAM: 24 GB DDR3
3. OS: Ubuntu 16.04 LTS

## Results

Index building time:

- No stemming/lemmatization: 41.67 s
- Stemmed text + stopwords_removal: 146.13 s

Memory usage (RAM) while building the index: around 8 GB for 3000 documents, 1.3 GB for 800 files.

## Members

- [Shubadeep Jana](https://github.com/subhadip7879)
- [Shardul Parab](https://github.com/shardulparab97)
- [Himanshu Gupta](https://github.com/him1411)
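
## Example sketch: tf-idf index and cosine scoring

The nested dictionary and the scores dictionary described under the data structures heading can be illustrated with a small, self-contained sketch. This is not the code from `store_megadict.py` or `store_scores_gui.py`; the function names `build_index`, `score_query` and `top_k` are hypothetical. It also assumes raw term counts for TF and `log10(N / df)` for IDF, since the README does not state which weighting variant the project uses, and it stores only the documents that actually contain a word rather than explicit zero entries.

```python
import math
from collections import Counter


def build_index(document_tokens_list):
    """Build {word: {doc_id: {"TF", "IDF", "TF-IDF"}}} from a list of token lists.

    Assumption: TF is the raw count and IDF is log10(N / df).
    """
    n_docs = len(document_tokens_list)
    counts = [Counter(tokens) for tokens in document_tokens_list]
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for c in counts for word in c)

    index = {}
    for doc_id, c in enumerate(counts):
        for word, tf in c.items():
            idf = math.log10(n_docs / doc_freq[word])
            index.setdefault(word, {})[str(doc_id)] = {
                "TF": tf,
                "IDF": idf,
                "TF-IDF": tf * idf,
            }
    return index


def score_query(query_tokens, index, n_docs):
    """Return {doc_id: cosine similarity between the query and that document}."""
    # Weight the query terms with the same tf-idf scheme as the documents.
    query_weights = {}
    for word, tf in Counter(query_tokens).items():
        postings = index.get(word)
        if postings:
            idf = next(iter(postings.values()))["IDF"]
            query_weights[word] = tf * idf

    # Squared length of every document vector, accumulated over all words.
    doc_sq_norms = {}
    for postings in index.values():
        for doc_id, entry in postings.items():
            doc_sq_norms[doc_id] = doc_sq_norms.get(doc_id, 0.0) + entry["TF-IDF"] ** 2

    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))

    # Dot product between the query vector and each document vector.
    scores = {str(d): 0.0 for d in range(n_docs)}
    for word, weight in query_weights.items():
        for doc_id, entry in index[word].items():
            scores[doc_id] += weight * entry["TF-IDF"]

    # Normalise by both vector lengths; documents with a zero vector score 0.
    for doc_id in scores:
        denom = query_norm * math.sqrt(doc_sq_norms.get(doc_id, 0.0))
        scores[doc_id] = scores[doc_id] / denom if denom else 0.0
    return scores


def top_k(scores, k=10):
    """Return the k (doc_id, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


docs = [["i", "play", "cricket"], ["sachin", "tendulkar"], ["india", "is", "best"]]
index = build_index(docs)
scores = score_query(["cricket"], index, len(docs))
best = top_k(scores, 10)
```

With the three example documents, only document `'0'` contains `cricket`, so it is the only one with a non-zero score and comes first in `best`. Storing only non-zero postings keeps the index smaller than the fully populated structure in the README example, which may matter given the memory figures reported in the Results section.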
