CSI4107_A1

所属分类:其他
开发工具:Java
文件大小:0KB
下载次数:0
上传日期:2024-02-11 18:35:44
上 传 者sh-1993
说明:  沪深4107 A1
(CSI4107 A1)

文件列表:
.gradle/
app/
gradle/wrapper/
gradlew
gradlew.bat
settings.gradle.kts

# Information Retrieval System This project is an infomation retrieval system built using Apache Lucene. ## Group Information **Students:** - Ishan Phadte 300238878 - Lauren Gu 300320106 - Angus Leung 300110509 **Division of Work:** - Ishan Phadte: Part 1 & Part 2 - Lauren Gu: Part 3 and Apache Lucene Implementation - Angus Leung: Part 2 ## Prerequisites - Java Development Kit (JDK) ## Instructions Final Code can be found at https://github.com/laurgu/CSI4107_A1 Initial Fork from https://github.com/IshanPhadte776/CSI4107 1. **Clone the repository:** ```bash git clone https://github.com/laurgu/CSI4107_A1.git ``` 2. **Build the project using gradle wrapper:** ```bash ./gradlew build ``` **_Or using gradle if installed:_** `gradle build` 3. **Build the index:** ```bash ./gradlew runLuceneIndex ``` **_Or if gradle installed:_** `gradle runLuceneIndex` 4. **Run the the search function using gradle wrapper:** ```bash ./gradlew run ``` **_Or using gradle if installed:_** `gradle run` 5. **To view results, navigate to app directory then open the Results.txt file:** ```bash cd app notepad Results.txt ``` ## Functionality ### Part 1 1. **Jsoup Parsing:** The files in the coll folder are read using Jsoup which allows ill-formatted xml files to still be parsed. 2. **Document Extraction:** The seperate documents in each file are extracted by identifying \ tags. The \ and \ of each document are extracted. 3. **Preprocessing:** The documents are preprocessed using a custom analyzer that extends the analyzer class provided by Lucene. It tokenizes words and applies lowercasing and stop word removal. ### Part 2 The index is created using Lucene's "Index Writer". This index is written to a file where it can be reused for different queries. ### Part 3 1. The query string is preprocessed using the custome analyzer, like how documents are preprocessed in Part 1. 2. Querying is done using Lucene's "Index Searcher" to search the index build in Part 2. It retrieves the top 1000 results for a given query and these results are written to a txt file. ### Optimizations Initially, we implemented our IR system using the tf-idf weighting system. For comparison we implemented this Lucene version and found it seemed to produce more accurate results without significantly impacting the runtime. ### Data Structures A Set was used for stops words because each stop word is unique and order isn't required A Dictionary was used for the inverted index because we needed a key value data structure and a dictionary fits the description ### Sample of 100 Tokens ['nation', 'governors', 'appealed', 'whitehouse', 'sunday', 'relief', '163', 'federal', 'rules', 'regulations', 'andheard', 'former', 'governor', 'call', 'constitutional', 'convention', 'torestore', 'states', 'rights', 'new', 'hampshire', 'gov', 'john', 'h', 'sununu', 'opening', 'nationalgovernors', 'association', 'winter', 'meeting', 'said', 'time', 'hascome', 'press', 'new', 'division', 'authority', 'statesand', 'washington', 'erosion', 'fundamental', 'balance', 'struck200', 'years', 'ago', 'philadelphia', 'sununu', 'nga', 'chairman', 'said', 'ata', 'news', 'conference', 'gaveling', 'first', 'plenary', 'session', 'toorder', 'president', 'reagan', 'black', 'tie', 'dinner', 'governors', 'sundaynight', 'told', 'governors', 'envied', 'balanced', 'budgetrequirements', 'line', 'item', 'vetoes', 'many', 'possess', 'notone', 'would', 'put', 'mess', 'inwashington', 'budget', 'time', 'president', 'said', 'also', 'said', 'want', 'tie', 'successor', 'hands', 'butexpressed', 'hope', 'next', 'president', 'would', 'continue', 'tradition', 'ofinviting', 'governors', 'white'] ### Results For some queries, fewer than 1000 queries were provided as only the provided queries were deemed relevant

近期下载者

相关文件


收藏者