Using TFIDF with Naive Bayes

Sarah Alduayj on 11 Jul 2018
Commented: Oscar Green on 10 May 2019
I'm building a sentiment classification model using TFIDF and naive Bayes, but the model keeps misclassifying the second class, even though TFIDF worked fine when I used it with other models such as SVM and random forest. Below I describe my data and the steps I used.

I have 2000 comments (1000 positive, 1000 negative). I did the following steps:

1) Data preprocessing
cleanTextData = erasePunctuation(textData);
cleanTextData = lower(cleanTextData);
words = stopWords;
cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments = removeWords(cleanDocuments,words);
cleanDocuments = normalizeWords(cleanDocuments);
cleanDocuments(1:10)
%% Bag of Words
cleanBag = bagOfWords(cleanDocuments)
cleanBag = removeInfrequentWords(cleanBag,2) % remove words with frequency less than or equal to 2
%% Remove empty documents caused by preprocessing
[cleanBag,idx] = removeEmptyDocuments(cleanBag);
Then I computed the TFIDF matrix:
predictors = tfidf(cleanBag,'Normalized',true,'TFWeight','log','IDFWeight','smooth');
Then I passed the results to my naive Bayes model:
t = templateNaiveBayes('DistributionNames','mvmn');
CVMdl = fitcecoc(predictors,response,'KFold',10,'Learners',t,'FitPosterior',true,'Coding','onevsone','ResponseName','response');
But the confusion matrix gives the following results:
 C1     C2
____    __
 990    10
1000     0
It seems to be assigning almost all of the 2000 observations to a single class. Please advise; I have tried almost everything I know and whatever others have suggested. This is related to my master's thesis and I only have a few weeks to submit it.
4 comments
Christopher Creutzig on 26 Nov 2018
Edited: Christopher Creutzig on 26 Nov 2018
Do you have to use naïve Bayes, or did you try other models and get even worse results?
With only two classes, I do not see why you use fitcecoc, which is an interface to use multiple binary classifiers to build a multi-class one. You could use fitclinear instead, which in my experience is pretty good at the kind of high-dimensional fitting required in text analytics.
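A minimal sketch of what switching to fitclinear could look like for a two-class problem. The toy strings and labels below are hypothetical stand-ins for the 2000 comments, and the choice of 'Learner','logistic' is an assumption made for illustration, not something from the question:

```matlab
% Toy data standing in for the real comments (hypothetical text and labels).
textData = [repmat("great movie loved it", 10, 1); ...
            repmat("terrible movie hated it", 10, 1)];
response = categorical([repmat("positive", 10, 1); ...
                        repmat("negative", 10, 1)]);

% Same style of preprocessing as in the question, shortened.
documents = tokenizedDocument(lower(erasePunctuation(textData)));
bag = bagOfWords(documents);
[bag, idx] = removeEmptyDocuments(bag);
response(idx) = [];            % keep labels aligned with the remaining documents

predictors = tfidf(bag);       % sparse documents-by-words matrix

% fitclinear fits a binary linear model directly and accepts sparse input.
rng(0);                        % reproducible fold assignment
CVMdl = fitclinear(predictors, response, 'KFold', 10, 'Learner', 'logistic');
predicted = kfoldPredict(CVMdl);
C = confusionmat(response, predicted)
```

Because fitclinear handles the binary case itself, no error-correcting output code wrapper is needed here.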
Oscar Green on 10 May 2019
One thing I've done in the past is to aggregate/discretize into log-frequency buckets and treat those as features. It's a bit of a hack, but so is naive Bayes, and it ends up working pretty well.
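One possible reading of this bucketing idea, as a rough sketch: raw term counts are mapped to a handful of roughly logarithmic bins, and the bin indices are then given to naive Bayes as categorical predictors. The bin edges, the toy data, and the use of fitcnb with 'mvmn' are all assumptions made for illustration; the comment does not specify them:

```matlab
% Toy data standing in for the real comments (hypothetical text and labels).
textData = [repmat("good good good film", 10, 1); ...
            repmat("bad bad film film", 10, 1)];
response = categorical([repmat("positive", 10, 1); ...
                        repmat("negative", 10, 1)]);

bag = bagOfWords(tokenizedDocument(textData));
counts = full(bag.Counts);     % documents-by-words matrix of raw counts

% Map counts to bucket indices 1..5 for counts of 0, 1, 2-3, 4-7, and 8 or more.
% These edges are an assumed choice of log-spaced bins.
edges = [0 1 2 4 8 Inf];
buckets = discretize(counts, edges);

% 'mvmn' treats every predictor as categorical, which suits bucket indices.
Mdl = fitcnb(buckets, response, 'DistributionNames', 'mvmn');
predicted = predict(Mdl, buckets);
```

With only a few distinct bucket values per word, each category has enough observations to estimate a probability, which is not the case when every distinct TFIDF value becomes its own category.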

Answers (0)