document-classification-using-bert

Category: Clustering Algorithms
Development tool: Jupyter Notebook
File size: 0KB
Downloads: 0
Upload date: 2023-10-07 14:38:33
Uploader: sh-1993
Description: Document classification using BERT | 97% accuracy

File list (size in bytes, last modified):
.dockerignore (17, 2023-10-28)
Dockerfile (565, 2023-10-28)
document_classification_using_bert.ipynb (272785, 2023-10-28)
modeling_service/ (0, 2023-10-28)
modeling_service/samples/ (0, 2023-10-28)
modeling_service/samples/convert_image_curl_request.txt (106364, 2023-10-28)
modeling_service/samples/doc_app.png (115669, 2023-10-28)
modeling_service/samples/examples/ (0, 2023-10-28)
modeling_service/samples/examples/email.png (89586, 2023-10-28)
modeling_service/samples/examples/resume.png (79642, 2023-10-28)
modeling_service/samples/examples/scientific_publication.png (164995, 2023-10-28)
modeling_service/services/ (0, 2023-10-28)
modeling_service/services/app.py (3939, 2023-10-28)
modeling_service/services/modeling.py (1046, 2023-10-28)
modeling_service/utils/ (0, 2023-10-28)
modeling_service/utils/convert_image.py (510, 2023-10-28)
modeling_service/utils/data_preprocessing.py (2645, 2023-10-28)
modeling_service/utils/model_prediction.py (1167, 2023-10-28)
modeling_service/utils/ocr.py (82, 2023-10-28)
requirements.txt (1861, 2023-10-28)
start.sh (137, 2023-10-28)

# Document Classification with OCR and BERT

### Overview

Document Classification with OCR and BERT is a project aimed at automatically categorizing textual images into predefined classes. This repository contains the code and resources necessary to train a document classification model that combines Optical Character Recognition (OCR) with the Bidirectional Encoder Representations from Transformers (BERT) model. The model is deployed using FastAPI and Docker.

![image](https://github.com/yesdeepakmittal/document-classification-using-bert/blob/master/modeling_service/samples/doc_app.png)

### Project Highlights

- **Automated Document Classification**: Classify textual images into categories without manual intervention, enabling efficient sorting and organization of large document datasets.
- **OCR Integration**: Utilize Tesseract OCR, a popular open-source text recognition engine, to extract textual content from images, enabling the model to work with image-based documents.
- **BERT-based Document Understanding**: Leverage BERT, a state-of-the-art language model, to understand the context and semantics of the extracted text, improving the accuracy of document classification.
- **Flexibility and Customization**: Adapt the project to your specific use case by easily modifying the number of classes, the training data, and the model architecture.

### How it Works

- **Text Extraction with Tesseract OCR**:
  - Images containing textual content are processed with Tesseract OCR to extract the text.
  - The extracted text is preprocessed and tokenized for further analysis.
- **BERT Model Training**:
  - The preprocessed text and its corresponding labels are used to fine-tune a BERT-based document classification model.
  - The model learns to classify documents into the predefined categories.
- **Inference and Classification**:
  - The trained model classifies new textual images into the appropriate classes.
  - The predictions enable automated sorting and organization of documents based on their content.
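
A minimal sketch of this inference pipeline, assuming a fine-tuned checkpoint and three document classes; the class names, checkpoint path, and sample image below are illustrative placeholders, not the repository's actual configuration:

```python
# Hypothetical OCR -> BERT inference sketch. Model path, labels, and the
# sample image are assumptions for illustration.
import pytesseract
import torch
from PIL import Image
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["email", "resume", "scientific_publication"]  # assumed class names

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/fine-tuned-bert",  # assumed local checkpoint from the notebook
    num_labels=len(LABELS),
)
model.eval()


def classify_document(image: Image.Image) -> str:
    """Extract text from an image with Tesseract, then classify it with BERT."""
    text = pytesseract.image_to_string(image)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]


if __name__ == "__main__":
    print(classify_document(Image.open("modeling_service/samples/examples/resume.png")))
```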

### Prerequisites

- Python 3.x (developed with Python 3.10.12 on Ubuntu 22.04)
- Libraries: transformers, torch, pytesseract, PIL, FastAPI, Gradio
- Tesseract OCR installed
- Docker

### Usage

- **Clone the repository**: ```git clone https://github.com/yesdeepakmittal/document-classification-using-bert.git```
- **Create a virtual environment and install the libraries listed in the `requirements.txt` file**
- **Train the model using the Jupyter Notebook**
- **Serve the model using FastAPI and deploy it using Docker** (a serving sketch follows this list)
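
A minimal sketch of what such a FastAPI service might look like; the repository's actual service lives in `modeling_service/services/app.py` and may differ in detail, and the `document_pipeline` module (holding the `classify_document` helper sketched earlier) is hypothetical:

```python
# Hypothetical serving sketch; the repository's actual service is in
# modeling_service/services/app.py and may differ in detail.
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

from document_pipeline import classify_document  # hypothetical module holding the sketch above

app = FastAPI(title="Document Classification Service")


@app.post("/classify")
async def classify(file: UploadFile = File(...)):
    """Run OCR + BERT classification on an uploaded image."""
    image = Image.open(io.BytesIO(await file.read()))
    return {"filename": file.filename, "predicted_class": classify_document(image)}
```

Run locally with `uvicorn app:app --reload` (assuming the file is saved as `app.py`).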

### Challenges & Remedies

1. **Computational**:
   - Training a BERT model is only practical with a dedicated GPU.
     - **Remedy**: Used the GPU available in Google Colab.
   - Preprocessing the text of a single document takes at least 30 seconds, which is infeasible when working with thousands of documents.
     - **Remedy**: Run the preprocessing task on a multi-core processor and save the processed text to a .txt file (see the parallel-preprocessing sketch after this list).
2. **OCR Engine Performance**:
   - The model's input is the text extracted by the OCR engine, so the more accurate the OCR engine, the better the fine-tuning.
   - **Remedy**: A premium OCR engine such as Google Vision OCR performs better and returns results faster than the Tesseract OCR engine used in this project.
3. **Data Quality & Quantity**:
   - BERT models require large amounts of data for effective training, and obtaining a substantial, well-labeled dataset can be challenging, especially for specific domains.
   - **Remedy**:
     - **Data Augmentation**: Apply techniques such as synonym replacement to artificially increase the size of your dataset.
     - **Domain-Specific Pretraining**: Consider using domain-specific pretrained BERT models.
4. **Training Challenges**:
   - Training large transformer models like BERT can be time-consuming, especially when the dataset is vast and the model architecture is complex.
   - **Remedy**:
     - **Gradient Accumulation**: Simulate training with larger batch sizes without significantly increasing GPU memory requirements (see the training-loop sketch after this list).
5. **Fine-Tuning Challenges**:
   - Finding the optimal learning rate, batch size, and number of epochs for fine-tuning BERT can be challenging and time-consuming.
   - **Remedy**: Hyperparameter tuning over multiple candidate values, combined with early stopping.
6. **Label Imbalance**:
   - Classes might not be balanced, leading to biased models.
   - **Remedy**: Assign higher weights to minority classes during loss calculation to penalize their misclassification more heavily (the training-loop sketch after this list includes a class-weighted loss).
7. **Python Version Difference**:
   - An incompatible Python version can fail to load the pretrained model for serving.
   - **Remedy**: Use the same Python version for model training and model loading.
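
A sketch of the multi-core preprocessing remedy from item 1, using Python's standard multiprocessing pool to run Tesseract OCR in parallel and persist each result as a .txt file; the directory layout and PNG-only glob are placeholders:

```python
# Parallel OCR preprocessing sketch; input/output directories are placeholders.
from multiprocessing import Pool
from pathlib import Path

import pytesseract
from PIL import Image

IMAGE_DIR = Path("data/images")  # placeholder input directory
TEXT_DIR = Path("data/texts")    # placeholder output directory


def ocr_to_txt(image_path: Path) -> None:
    """Extract text from one image and save it, so slow OCR runs only once."""
    text = pytesseract.image_to_string(Image.open(image_path))
    (TEXT_DIR / image_path.with_suffix(".txt").name).write_text(text)


if __name__ == "__main__":
    TEXT_DIR.mkdir(parents=True, exist_ok=True)
    with Pool() as pool:  # one worker per CPU core by default
        pool.map(ocr_to_txt, sorted(IMAGE_DIR.glob("*.png")))
```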
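
And a sketch combining the remedies from items 4 and 6: a training loop with gradient accumulation and an inverse-frequency class-weighted cross-entropy loss. Class counts, hyperparameters, and the batch layout are placeholders:

```python
# Gradient accumulation + class-weighted loss sketch; numbers are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

ACCUMULATION_STEPS = 4  # effective batch size = loader batch size * 4
class_counts = torch.tensor([500.0, 120.0, 80.0])  # placeholder per-class counts

# Inverse-frequency weights: minority classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(class_counts)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


def train_epoch(loader: DataLoader) -> None:
    """One epoch with gradients accumulated over ACCUMULATION_STEPS batches."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):  # batches assumed to be dicts of tensors
        logits = model(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        ).logits
        # Scale so the accumulated gradient averages over the accumulation window.
        loss = loss_fn(logits, batch["labels"]) / ACCUMULATION_STEPS
        loss.backward()
        if (step + 1) % ACCUMULATION_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```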

### Data Source

[Kaggle](https://www.kaggle.com/datasets/ritvik1909/document-classification-dataset)