Document-QA-System

Category: Feature extraction
Development tool: Jupyter Notebook
Upload date: 2024-04-08 23:39:19
Uploader: sh-1993
Description: A Python script extracts and preprocesses text from a PDF document, then applies multiple text embedding techniques (BoW, TF-IDF, Word2Vec, GloVe, FastText, BERT, Sentence Transformers) to answer user questions.

File list:
Text_GPT.ipynb
Text_QA.py

### Introduction
#### The system extracts text from PDF documents and prepares it with NLP preprocessing, then uses several embedding models to generate numerical representations of that text. A Streamlit interface lets users query a PDF and retrieve the relevant text content.

### Methodology
#### -> PDF Text Extraction: The system extracts text from PDF documents using the fitz library (PyMuPDF).
#### -> Text Preprocessing: The extracted text is tokenized, stop words and punctuation are removed, and the remaining tokens are lemmatized to prepare them for analysis.
#### -> Text Embedding Models: Bag-of-Words, TF-IDF, Word2Vec (CBOW and Skip-Gram), GloVe, FastText, BERT, and Sentence Transformers are applied to the preprocessed text to generate numerical representations.
#### -> User Interaction: Through a Streamlit interface, the user uploads a PDF document, enters a question, and selects an embedding model.
#### -> Answer Generation: The system computes the similarity between the user's question and each sentence in the document using the chosen embedding model. The most similar sentences are presented as the answer to the query.

### Results
#### For a given query, the system returns the sentences from the PDF document that are most similar to the question under the selected embedding model.
![image](https://github.com/LoheshM/Document-QA-System/assets/116341584/65212952-44de-4858-853c-159e83554368)
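
The extract, preprocess, embed, and rank steps described in the methodology can be sketched with TF-IDF as the embedding model. This is a minimal illustration and not the code in `Text_QA.py`. The function names, the small stop-word list, and the regex-based sentence splitter are all assumptions made for the example, and lemmatization is left out. The TF-IDF weighting is written out in plain Python so the ranking logic stays visible; only `extract_text` needs PyMuPDF.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would
# likely use a fuller list from an NLP library).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and",
              "what", "which", "for", "on", "by", "from"}


def extract_text(pdf_path):
    """Read the text of every page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF; only needed when a real PDF is read
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess(sentence):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF dict per token list, using smoothed IDF."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def answer(question, text, top_k=1):
    """Rank document sentences by similarity to the question."""
    sentences = split_sentences(text)
    # The question is vectorized together with the sentences so that both
    # share one vocabulary and one set of IDF weights.
    vectors = tfidf_vectors([preprocess(s) for s in sentences] + [preprocess(question)])
    q_vec = vectors[-1]
    ranked = sorted(zip(sentences, vectors[:-1]),
                    key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:top_k]]


sample = ("PyMuPDF reads text from PDF pages. "
          "TF-IDF weights rare terms more heavily. "
          "Streamlit builds the user interface.")
print(answer("Which library builds the interface?", sample))
```

With a real document, `answer(question, extract_text("paper.pdf"), top_k=3)` would return the three highest-scoring sentences; `paper.pdf` is a placeholder path. Swapping in another embedding model only changes how the vectors are built, since the cosine-similarity ranking stays the same.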
