SMS-Spam-Detection-NLP

Category: Mathematical computing
Development tool: Jupyter Notebook
Upload date: 2024-01-02 12:47:21
Uploader: sh-1993
Description: A spam detection project implemented in Python, using data cleaning, exploratory data analysis, and text preprocessing. Naive Bayes models were trained and evaluated; the Bernoulli Naive Bayes classifier reached 97% accuracy and 97.35% precision.

File list:
00. Concepts/
app.py
ham_WordCloud.png
model.pkl
not spam.JPG
sms-spam-detection.ipynb
spam.JPG
spam.csv
spam_WordCloud.png
vectorizer.pkl

# SMS Spam Detection: NLP

## 1. Data Cleaning

* Objective: Prepare the dataset for analysis and model training by addressing missing values, duplicates, and irrelevant columns.
* Data Loading and Inspection:
  - Loaded the 'spam.csv' dataset with encoding "ISO-8859-1".
  - Explored the structure of the dataset with a focus on the initial entries and random samples.
  - Checked the shape of the dataset, revealing 5572 entries and 5 columns.
* Handling Missing Values:
  - Investigated missing values in the dataset using `df.info()`.
  - Dropped three columns ('Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4') with high null values.
  - Renamed the remaining columns to 'target' and 'sms_text'.
* Label Encoding:
  - Encoded the 'target' column ('ham' or 'spam') numerically using scikit-learn's LabelEncoder.
* Handling Duplicates:
  - Explored and identified 403 duplicate entries.
  - Removed duplicates, resulting in a dataset with 5169 entries.

## 2. Exploratory Data Analysis (EDA)

* Objective: Understand the distribution and characteristics of the dataset.
* Statistical Analysis:
  - Utilized descriptive statistics to analyze numerical features such as the number of characters, words, and sentences in the messages.
  - Conducted separate analyses for ham and spam messages, revealing variations in character count, word count, and sentence count.
* Data Visualization:
  - Created pie charts to visualize the class distribution, highlighting the imbalance between ham and spam messages.
  - Employed histograms and pair plots for numerical features, providing insights into the underlying patterns.

## 3. Text (Data) Preprocessing

* Objective: Prepare the text data for model training by applying various preprocessing techniques.
* Text Transformation:
  - Lowercased the text data.
  - Tokenized and removed special characters.
  - Removed stop words and punctuation.
  - Applied stemming using the Porter Stemmer from NLTK.
* Word Cloud Visualization:
  - Utilized WordCloud to visualize the most frequent words in both spam and ham messages.

## 4. Model Building

* Objective: Develop machine learning models for SMS spam detection.
* Feature Extraction:
  - Used CountVectorizer to convert text data into numerical format suitable for model training.
  - Obtained a feature matrix with 6708 features.
* Model Selection:
  - Implemented three Naive Bayes models – Gaussian, Multinomial, and Bernoulli.
* Model Evaluation:
  - Evaluated models based on accuracy, precision, and confusion matrices.

![spam](https://github.com/ArpitaSatsangi/SMS-Spam-Detection-NLP/assets/107709451/27a77b2f-1382-4a23-9737-fe25fcbf6418)

![not spam](https://github.com/ArpitaSatsangi/SMS-Spam-Detection-NLP/assets/107709451/efb1439f-5d21-4de4-a5b3-f9f9e76d874b)

## Conclusion

The SMS spam detection project successfully addressed data cleaning, exploratory data analysis, text preprocessing, and model building. Leveraging Naive Bayes models, particularly the Bernoulli Naive Bayes, the project demonstrated high accuracy and precision in distinguishing between spam and ham messages. The comprehensive approach to data preprocessing and thoughtful model selection showcased the effectiveness of the developed system in combating unwanted SMS communication.
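
The text transformation steps from section 3 (lowercasing, tokenizing, dropping special characters, punctuation, and stop words) can be sketched roughly as follows. This is a dependency-free illustration, not the notebook's own code: `transform_text` and the tiny `STOP_WORDS` set are hypothetical names, the regular-expression tokenizer is an assumed stand-in for whatever tokenizer the notebook uses, and the Porter stemming step is deliberately left out so the sketch does not require NLTK.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real stop-word list (for example NLTK's English list) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your",
              "for", "and", "of", "in", "on", "at", "now"}


def transform_text(text: str) -> List[str]:
    """Lowercase, tokenize, and drop special characters, punctuation and stop words."""
    text = text.lower()
    # Keep only runs of letters and digits; punctuation and other
    # special characters are discarded by the pattern itself.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Filter out stop words (stemming is intentionally omitted here).
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(transform_text("WINNER!! Claim your FREE prize now, call 0800-123"))
```

In a fuller pipeline each remaining token would then be passed through a stemmer before vectorization.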
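
To illustrate what a Bernoulli Naive Bayes classifier does with word-presence features, and how the accuracy and precision metrics from section 4 are computed, here is a minimal from-scratch sketch. It rests on several assumptions: the project presumably uses scikit-learn's implementations rather than anything like this, `TinyBernoulliNB` and `accuracy_and_precision` are hypothetical names, the add-one (Laplace) smoothing is a choice of this sketch, and the toy messages are invented and unrelated to 'spam.csv'.

```python
import math
from typing import Dict, List, Sequence, Tuple


class TinyBernoulliNB:
    """Minimal Bernoulli Naive Bayes over binary word-presence features."""

    def fit(self, docs: Sequence[Sequence[str]], labels: Sequence[int]) -> "TinyBernoulliNB":
        self.vocab: List[str] = sorted({tok for doc in docs for tok in doc})
        self.classes: List[int] = sorted(set(labels))
        self.log_prior: Dict[int, float] = {}
        self.p_word: Dict[int, Dict[str, float]] = {}
        for c in self.classes:
            class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
            n_c = len(class_docs)
            self.log_prior[c] = math.log(n_c / len(docs))
            # Add-one smoothing: (docs containing word + 1) / (docs in class + 2)
            self.p_word[c] = {
                w: (sum(w in d for d in class_docs) + 1) / (n_c + 2)
                for w in self.vocab
            }
        return self

    def predict_one(self, doc: Sequence[str]) -> int:
        present = set(doc)
        best_class, best_score = self.classes[0], -math.inf
        for c in self.classes:
            score = self.log_prior[c]
            for w in self.vocab:
                p = self.p_word[c][w]
                # Unlike the multinomial variant, absent words also
                # contribute, through log(1 - p).
                score += math.log(p) if w in present else math.log(1.0 - p)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def predict(self, docs: Sequence[Sequence[str]]) -> List[int]:
        return [self.predict_one(d) for d in docs]


def accuracy_and_precision(y_true: Sequence[int], y_pred: Sequence[int],
                           positive: int = 1) -> Tuple[float, float]:
    """Accuracy = correct / total; precision = TP / (TP + FP) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Invented toy data: 1 = spam, 0 = ham (already tokenized).
train_docs = [["free", "prize", "call"], ["win", "free", "cash"],
              ["claim", "prize", "now"], ["see", "you", "tonight"],
              ["call", "me", "later"], ["lunch", "tonight"]]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = [["free", "cash", "prize"], ["see", "you", "later"]]
test_labels = [1, 0]

model = TinyBernoulliNB().fit(train_docs, train_labels)
preds = model.predict(test_docs)
print(preds, accuracy_and_precision(test_labels, preds))
```

Precision is the metric to watch for a spam filter, since a false positive means a legitimate message gets flagged as spam.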
