naive_bayes_classifier

Category: Feature extraction
Development tool: C++
File size: 1051 KB
Downloads: 0
Upload date: 2015-04-28 05:42:22
Uploader: sh-1993
Description: Naive Bayes text classifier using TF-IDF

File list:
Makefile (94, 2015-04-28)
classify.cpp (8766, 2015-04-28)
predictions.txt (7834, 2015-04-28)
testing.txt (944709, 2015-04-28)
training.txt (2550699, 2015-04-28)

# Naive Bayes Classifier

#### C++ text-based naive Bayes classifier with TF/IDF smoothing

Design Decisions:

My design is as follows. First, I instantiate 4 maps, one for each of the categories, to keep track of each word and its count within training.txt. I also instantiate 4 maps to keep track of the probability of each word given each category.

I read the training file with getline, one line at a time, and process each word: if the term is not already in the map for its category, I add it; otherwise I increase its count by 1.

- computeTotalWords finds the total number of words within each map (category).
- computeProbability() computes the probability of each word with my algorithm and stores it in the probability maps.
- classify(queue q) takes a queue holding the words of one line of the test file and processes each word in it. It computes the probability of that line for each category, chooses the maximum, and returns the corresponding category string.

Algorithm Details:

For the algorithm, I chose log TF with a smoothing factor, combined with IDF. Given a particular term, my formula to calculate its probability is:

`log((1 + X/(Y+m)) * (1 + log(A/B)))`

where:

- X = number of times the term occurs in a given category
- Y = number of words in a given category
- m = smoothing factor, set to the total number of lines in the file
- A = total number of documents
- B = number of documents containing the term
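
The scoring formula and the classification step described above can be sketched roughly as follows. This is a minimal illustration, not the code in classify.cpp: the `Category` and `Corpus` structs, the `termScore` helper, and the two-argument form of `classify` are hypothetical names introduced here. It also makes several assumptions: the formula is parenthesized as `log((1 + X/(Y+m)) * (1 + log(A/B)))`, the per-word values are summed to score a line, words never seen in training are skipped, and scores are computed on the fly rather than being stored in probability maps first.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical per-category statistics (names are illustrative only).
struct Category {
    std::string label;                  // category name returned by classify
    std::map<std::string, int> counts;  // word -> occurrences in this category (X)
    long totalWords = 0;                // total words in this category (Y)
};

// Hypothetical corpus-wide statistics.
struct Corpus {
    std::vector<Category> categories;
    std::map<std::string, int> docFreq; // word -> documents containing it (B)
    double totalDocs = 0;               // total number of documents (A)
    double m = 0;                       // smoothing factor
};

// Sums the counts stored in one category map.
long computeTotalWords(const std::map<std::string, int>& counts) {
    long total = 0;
    for (const auto& entry : counts) total += entry.second;
    return total;
}

// Assumed reading of the formula: log((1 + X/(Y+m)) * (1 + log(A/B))).
double termScore(double X, double Y, double m, double A, double B) {
    return std::log((1.0 + X / (Y + m)) * (1.0 + std::log(A / B)));
}

// Scores the words of one test line against every category and returns the
// label with the highest total. Summing per-word scores is an assumption.
std::string classify(std::queue<std::string> q, const Corpus& corpus) {
    std::vector<double> sums(corpus.categories.size(), 0.0);
    while (!q.empty()) {
        const std::string word = q.front();
        q.pop();
        const auto df = corpus.docFreq.find(word);
        if (df == corpus.docFreq.end()) continue;  // unseen word: skipped (assumption)
        for (std::size_t i = 0; i < corpus.categories.size(); ++i) {
            const Category& c = corpus.categories[i];
            const auto it = c.counts.find(word);
            const double X = (it == c.counts.end()) ? 0.0 : it->second;
            sums[i] += termScore(X, static_cast<double>(c.totalWords), corpus.m,
                                 corpus.totalDocs, df->second);
        }
    }
    std::string best;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < sums.size(); ++i) {
        if (sums[i] > bestScore) {
            bestScore = sums[i];
            best = corpus.categories[i].label;
        }
    }
    return best;
}
```

Because a word that is absent from a category still gets X = 0 rather than being dropped for that category alone, every category is scored over the same set of words, which keeps the totals comparable.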
