naive_bayes_classifier
Category: Feature extraction
Development tool: C++
File size: 1051KB
Downloads: 0
Upload date: 2015-04-28 05:42:22
Uploader: sh-1993
Description: Naive Bayes text classifier using TF-IDF
File list:
Makefile (94, 2015-04-28)
classify.cpp (8766, 2015-04-28)
predictions.txt (7834, 2015-04-28)
testing.txt (944709, 2015-04-28)
training.txt (2550699, 2015-04-28)
# Naive Bayes Classifier
#### C++
Text-based naive Bayes classifier with TF-IDF weighting and a smoothing factor
Design Decisions:
My design is as follows. First, I instantiate four maps, one per category, that track each word and its count within training.txt. I also instantiate four maps that hold the probability of each word given each category. I read the training file one line at a time with getline and process each word: if the term is not already in the corresponding map I add it, otherwise I increase its count by 1. A computeTotalWords function finds the total number of words in each map (category). A computeProbability() function computes each probability with my algorithm and stores it in the probability maps. A classify(queue q) function takes a queue holding the words of one line of the test file, processes each word, computes the probability of that line for each category, and returns the string for the category with the maximum probability.
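The counting and classification steps described above can be sketched roughly as follows. This is not the code in classify.cpp: the type aliases, the `addLine` helper, the function signatures, and the use of one outer map keyed by category label (instead of four separate maps) are all assumptions made for illustration. Treating a word that is missing from a category's map as contributing 0 is also an assumption.

```cpp
#include <limits>
#include <map>
#include <queue>
#include <sstream>
#include <string>

using WordCounts = std::map<std::string, int>;     // word -> count in one category
using WordScores = std::map<std::string, double>;  // word -> precomputed score

// Hypothetical helper: adds every whitespace-separated word of `text`
// to `counts`, inserting new words and incrementing existing ones.
void addLine(const std::string& text, WordCounts& counts) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        ++counts[word];  // operator[] value-initializes a missing entry to 0
    }
}

// Total number of word occurrences in one category (assumed signature).
long computeTotalWords(const WordCounts& counts) {
    long total = 0;
    for (const auto& entry : counts) {
        total += entry.second;
    }
    return total;
}

// Sums the score of every queued word for each category and returns the
// label with the largest sum (empty string if there are no categories).
// Assumption: a word absent from a category's map contributes 0.
std::string classify(std::queue<std::string> words,
                     const std::map<std::string, WordScores>& scoresByCategory) {
    std::map<std::string, double> totals;
    for (const auto& category : scoresByCategory) {
        totals[category.first] = 0.0;
    }
    while (!words.empty()) {
        const std::string word = words.front();
        words.pop();
        for (const auto& category : scoresByCategory) {
            auto it = category.second.find(word);
            if (it != category.second.end()) {
                totals[category.first] += it->second;
            }
        }
    }
    std::string best;
    double bestTotal = -std::numeric_limits<double>::infinity();
    for (const auto& total : totals) {
        if (total.second > bestTotal) {
            bestTotal = total.second;
            best = total.first;
        }
    }
    return best;
}
```

In this sketch, each training line would be passed to `addLine` with the map for its category, and each test line would be split into a queue of words before calling `classify`.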
Algorithm Details:
For the algorithm, I chose log-scaled TF with a smoothing factor, combined with IDF. Given a particular term, my formula to calculate the probability is log((1 + X/(Y+m)) * (1 + log(A/B))), where:
X = number of times a term occurs in a given category
Y = number of words in a given category
m = smoothing factor, total number of lines in the file
A = total # of documents
B = # of documents with that term
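As a rough illustration, the per-term formula could be written as a small function. The name `termScore` and its signature are hypothetical, and the sketch assumes the expression groups as log((1 + X/(Y+m)) * (1 + log(A/B))) with natural logarithms; a different grouping of the parentheses would give different values.

```cpp
#include <cmath>

// Per-term score under the assumed grouping
//   log( (1 + X / (Y + m)) * (1 + log(A / B)) )
// X: number of times the term occurs in the category
// Y: number of words in the category
// m: smoothing factor
// A: total number of documents
// B: number of documents containing the term (must be > 0)
double termScore(double X, double Y, double m, double A, double B) {
    const double tf = 1.0 + X / (Y + m);         // smoothed term-frequency factor
    const double idf = 1.0 + std::log(A / B);    // inverse-document-frequency factor
    return std::log(tf * idf);
}
```

Under this reading, a term that never occurs in a category (X = 0) and appears in every document (A = B) scores log(1 * 1) = 0, and the score grows as the term becomes more frequent in the category or rarer across documents.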