# Building and Deploying A Text Classification Web App
-----------------
**Web App**: https://docwebapp-j3zdo3lhcq-uc.a.run.app/

## About
------
In this project, over a series of blog posts, I'll be building a model for [document classification](https://en.wikipedia.org/wiki/Document_classification), also known as [text classification](https://monkeylearn.com/text-classification/), and deploying the model as part of a web application to predict the topic of research papers from their abstracts.

## 1st Blog Post: Dealing With Imbalanced Data
-----
In the first blog post I will be working with the [Scikit-learn](http://scikit-learn.org/) library and an imbalanced dataset (corpus) that I will create from summaries of papers published on [arXiv](https://arxiv.org). The topic of each paper is already labeled as the category, alleviating the need for me to label the dataset. The imbalance in the dataset is caused by the uneven number of samples in each of the categories we are trying to predict. Imbalanced data occurs quite frequently in classification problems and makes developing a good model more challenging. Often it is too expensive or not possible to get more data for the classes that have too few samples. Developing strategies for dealing with imbalanced data is therefore paramount for creating a good classification model. I will cover some of the basics of dealing with imbalanced data using the [Imbalanced-Learn](https://imbalanced-learn.readthedocs.io/en/stable/) library, as well as building a Naive Bayes classifier and a Support Vector Machine using [Scikit-learn](http://scikit-learn.org/). I will also cover the basics of term frequency-inverse document frequency (TF-IDF) and visualizing it using the [Plotly](https://plotly.com/python/) library.
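To make the imbalance discussion concrete, here is a minimal sketch (not the post's actual code) of a TF-IDF + Naive Bayes pipeline next to a class-weighted Support Vector Machine in Scikit-learn; the tiny "abstracts" corpus and its labels below are invented for illustration:

```python
# Toy example: handling class imbalance with class_weight="balanced".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Imbalanced toy corpus: 4 "physics" samples vs. 2 "math" samples.
abstracts = [
    "quantum entanglement in photonic systems",
    "dark matter constraints from galaxy rotation curves",
    "neutrino oscillation measurements at the detector",
    "gravitational waves from binary black hole mergers",
    "a proof of a conjecture in algebraic topology",
    "homology groups of compact manifolds",
]
labels = ["physics", "physics", "physics", "physics", "math", "math"]

# Naive Bayes baseline over TF-IDF features.
nb = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
nb.fit(abstracts, labels)

# class_weight="balanced" reweights each class inversely to its frequency,
# so the minority "math" class is not drowned out by "physics".
svm = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(class_weight="balanced")),
])
svm.fit(abstracts, labels)

print(svm.predict(["homology groups of manifolds"])[0])  # → math
```

With so few samples this is only a shape for the real workflow; the actual posts train on the full arXiv-derived corpus.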
## 2nd Blog Post: Using The Natural Language Toolkit
-----
In this blog post I picked up from the last one and went over using the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) to improve the performance of our text classification models. Specifically, we covered how to remove stopwords and how to apply stemming and lemmatization. I applied each of these to the weighted Support Vector Machine model and performed a grid search to find the optimal parameters for our models. Finally, I persisted our model to disk using [Joblib](https://joblib.readthedocs.io/en/latest/) so that we can use it as part of a REST API.

## 3rd Blog Post: A Machine Learning Powered Web App
-----
In this post we'll build out a serverless web app using a few technologies. The advantage of a serverless framework for me is cost effectiveness: I don't pay much at all unless people use my web app a ton, and I don't expect people to visit this app very often. The trade-off of the serverless framework is latency, which I can live with. I'll first go over how to convert my text classification model from the [last post](http://michael-harmon.com/blog/NLP2.html) into a REST API using [FastAPI](https://fastapi.tiangolo.com/) and [Joblib](https://joblib.readthedocs.io/en/latest/). Using our model this way allows us to send paper abstracts as [JSON](https://en.wikipedia.org/wiki/JSON) through an HTTP request and get back the predicted topic label for the abstract. After this I'll build out a web application using [FastAPI](https://fastapi.tiangolo.com/) and [Bootstrap](https://getbootstrap.com/). Using Bootstrap gives us a beautiful, responsive website without having to write custom CSS or JavaScript. Finally, I'll go over deploying both the model API and the web app using [Docker](https://www.docker.com/) and [Google Cloud Run](https://cloud.google.com/run) to build out a serverless web application!
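The persistence step that bridges the second and third posts can be sketched as follows; the toy training data and the file name `model.joblib` are placeholders of mine, not the project's actual artifacts:

```python
# Minimal sketch: persist a fitted text classification pipeline with Joblib
# and reload it, as an API process would do at startup.
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(class_weight="balanced")),
])
pipeline.fit(
    ["stellar spectra of distant galaxies",
     "prime gaps and analytic number theory"],
    ["astro-ph", "math"],
)

# Dump the whole pipeline (vectorizer + classifier) as one artifact, so the
# REST API only needs a single joblib.load() rather than separate pieces.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(pipeline, model_path)

restored = joblib.load(model_path)
print(restored.predict(["analytic number theory of prime gaps"])[0])  # → math
```

Bundling the vectorizer with the classifier in one `Pipeline` avoids train/serve skew: the exact same preprocessing is applied to incoming abstracts at prediction time.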
## How To Run This:
------
To use the notebooks in this project, first download [Docker](https://www.docker.com/), then start the notebook server with the command:

    docker-compose up

and go to the posted URL. To recreate the REST API and web app, use the commands listed in `modealapi` and `webapp` respectively.