Classification-Model-BBC-News

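Category: Feature extraction
Development tool: Jupyter Notebook
File size: 21757 KB
Downloads: 0
Upload date: 2021-12-01 15:22:13
Uploader: sh-1993
Description: ...-of-words or TF-IDF matrix, then fit the data to various classification models to classify BBC news stories...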
(I have been tasked with solving a text analytics problem using Python. I was given a dataset (bbc-text.csv) consisting of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The CSV file has two columns, the first being “category”, which holds the label for each document: business, entertainmen...)

File list:
AA Assignment Presentation.pptx (9725322, 2021-12-01)
AA Assignment Report - Chen Han.docx (9902980, 2021-12-01)
AA_Assignment - Chen Han.ipynb (245343, 2021-12-01)
Model improvement ws.xlsx (8346, 2021-12-01)
bbc-text.csv (5057493, 2021-12-01)
imdb_export.csv (10156434, 2021-12-01)
stopwords.txt (2485, 2021-12-01)

## AA_Classification_Modelling_Project

## Description / Business understanding
I was given a dataset (bbc-text.csv) consisting of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005.
The CSV file has two columns. The first is “category”, which holds the label for each document: business, entertainment, politics, sport, or tech. The second is “text”, which holds the BBC news story itself.
The goal is to preprocess and transform the data into either a bag-of-words or a TF-IDF matrix, fit the data to various classification models that classify the BBC news stories into these categories, and finally compare the performance of each model.

## Data Understanding and pre-processing
The first step is text data pre-processing: the data is extracted from the CSV file and stored in a data frame, then cleansed using techniques such as removing stop words, stemming and lemmatizing.
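
The cleansing step could be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical (the project ships its own `stopwords.txt`), and stemming and lemmatizing are left out because the source does not show which library was used for them.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the project uses its own stopwords.txt, whose contents are not shown here.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


sample = "The UK economy grew by 0.5% in the third quarter of 2004."
cleaned = clean_text(sample)
print(cleaned)
```

With a pandas data frame, a function like this would typically be applied to the “text” column, for example with `df["text"].apply(clean_text)`.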

![image](https://user-images.githubusercontent.com/73086331/144260531-a443c8a6-44b1-47e5-a249-db1442ebf194.png)
^ The data frame has two columns: the category column indicates which category the story in the text column belongs to
![image](https://user-images.githubusercontent.com/73086331/144260436-cbdb1dbc-7aee-432d-9c01-93afa401f902.png)
^Before cleansing
![image](https://user-images.githubusercontent.com/73086331/144260788-1de3e656-f2ba-4849-8a1f-7fb6155049bb.png)
^After cleansing

## Data Transformation
Lastly, the cleansed text data is transformed into a bag-of-words or TF-IDF matrix for the classification modelling later.
Step 2 is text data understanding: keywords are first extracted from each document using the TF-IDF matrix, and the extracted keywords can then be analysed using Association Rule Mining.
The goal is to visualize and understand the associations between the keywords, either by category or across all documents.

### Transform Data using bag-of-words matrix
![image](https://user-images.githubusercontent.com/73086331/144260***1-bc443d49-b8f6-41aa-be2f-8dcae9feb2bc.png)
### Transform Data using tf-idf value in matrix
![image](https://user-images.githubusercontent.com/73086331/144261125-f3432f66-fa3e-4128-8d8c-***970f5570***.png)
### Extract Keywords
![image](https://user-images.githubusercontent.com/73086331/144261283-2b61***aa-0312-4736-adda-194b69b4e07f.png)
### Association Rule Mining
![image](https://user-images.githubusercontent.com/73086331/144261333-d7310ca4-a21e-41d9-9c8b-36219d6d5265.png)
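
Association rules between keywords are usually scored by support, confidence and lift. The plain-Python sketch below computes these for keyword pairs only. The `pair_rules` helper, the keyword sets and the support threshold are all made up for illustration, and the notebook may well rely on a dedicated library instead.

```python
from itertools import permutations


def pair_rules(transactions, min_support=0.5):
    """Score rules a -> b between single keywords by support, confidence, lift."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(itemset <= t for t in transactions) / n

    rules = {}
    for a, b in permutations(items, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue
        confidence = s_ab / support({a})
        lift = confidence / support({b})
        rules[(a, b)] = {"support": s_ab, "confidence": confidence, "lift": lift}
    return rules


# Each set holds the keywords extracted from one document (invented data).
transactions = [
    {"film", "award"},
    {"film", "award", "actor"},
    {"film", "box"},
    {"market", "shares"},
]
rules = pair_rules(transactions, min_support=0.5)
for (a, b), metrics in rules.items():
    print(f"{a} -> {b}: {metrics}")
```

A lift above 1 suggests the two keywords co-occur more often than they would if they were independent.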

## Modelling (Classification model)
Step 3 applies classification models to the bag-of-words or TF-IDF matrix to classify the BBC news stories into different categories.
### 1. Logistic Regression (word count)
![image](https://user-images.githubusercontent.com/73086331/144261484-c762e06a-9214-4552-9176-73514d76dc55.png)
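
A word-count classifier of this kind can be assembled as a scikit-learn pipeline. The following sketch trains on four invented two-category documents instead of the BBC data, and `max_iter=1000` is an assumption chosen to avoid convergence warnings, not a setting taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data with two of the five categories.
train_texts = [
    "shares profits market bank",
    "market economy profits growth",
    "match goal team win",
    "team coach match season",
]
train_labels = ["business", "business", "sport", "sport"]

# The pipeline turns raw text into word counts, then fits the classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

predictions = model.predict(["bank profits growth", "coach team goal"])
print(predictions)
```

Swapping `LogisticRegression` for `SGDClassifier` or `RandomForestClassifier` in the same pipeline gives the other two word-count models.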
### 2. SGD Classifier (word count)
![image](https://user-images.githubusercontent.com/73086331/144261612-3eaf6848-3dc0-492a-bed2-51afe9263cf8.png)
### 3. Random Forest (word count)
![image](https://user-images.githubusercontent.com/73086331/144261680-fd69552d-24***-454f-b14a-6d33ce97f548.png)

## Model Evaluation
The models' performance is evaluated on the testing data and can be further improved by tuning model hyperparameters or by further cleansing or transforming the text data. The last step is to summarize the findings, which can be a collection of accuracy scores from the different models using either the bag-of-words or the TF-IDF matrix, and to choose the most suitable model for classifying future data. Finally, some possible improvements are identified.
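
Hyperparameter tuning of this sort is often done with a cross-validated grid search. The sketch below tunes only the regularisation strength `C` of a logistic regression on invented data. The grid values and `cv=2` are assumptions for illustration, and the source does not say which parameters the notebook actually tuned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Invented data: four documents per category.
texts = [
    "shares profits market bank",
    "market economy profits growth",
    "bank economy shares growth",
    "profits bank market economy",
    "match goal team win",
    "team coach match season",
    "goal season coach win",
    "team win match goal",
]
labels = ["business"] * 4 + ["sport"] * 4

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```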
### Comparing the 6 findings (including models using TF-IDF Matrix as input)
![image](https://user-images.githubusercontent.com/73086331/144261750-34167f50-cc19-41c0-ae5e-6751da08bd0a.png)
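
A comparison of three classifiers across two input matrices can be produced with a simple loop over pipelines. The sketch below uses invented data and default model settings apart from fixed random seeds, so its scores say nothing about the results reported for the BBC dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Invented train/test split standing in for the BBC data.
train_texts = [
    "shares profits market bank",
    "market economy profits growth",
    "match goal team win",
    "team coach match season",
]
train_labels = ["business", "business", "sport", "sport"]
test_texts = ["bank profits growth", "coach team goal"]
test_labels = ["business", "sport"]

vectorizers = {"word count": CountVectorizer, "tf-idf": TfidfVectorizer}
classifiers = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "SGD Classifier": lambda: SGDClassifier(random_state=0),
    "Random Forest": lambda: RandomForestClassifier(random_state=0),
}

# Fit every classifier on every input representation and record test accuracy.
scores = {}
for vec_name, make_vectorizer in vectorizers.items():
    for clf_name, make_classifier in classifiers.items():
        pipeline = make_pipeline(make_vectorizer(), make_classifier())
        pipeline.fit(train_texts, train_labels)
        predicted = pipeline.predict(test_texts)
        scores[(clf_name, vec_name)] = accuracy_score(test_labels, predicted)

for (clf_name, vec_name), score in scores.items():
    print(f"{clf_name} ({vec_name}): {score:.2f}")
```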
## Decision
Logistic Regression with the TF-IDF matrix as input is the best model for this problem.
