The goal of this experiment is to show how classification results can be improved by adding unlabeled documents to the training corpus and using expectation maximization (EM) to incorporate them during training.


A subset of the Reuters data was used, consisting of samples from 3 categories. Of this corpus, 80% (996 documents) was used for training and 10% (126 documents) was used as a development test set. The remaining 10% is being withheld until the model is ready for final testing.
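The split above can be sketched as follows. This is a minimal illustration, not the script actually used: the function name, seed, and corpus size are assumptions, and exact partition sizes may differ by a document or two from the counts reported above depending on how the percentages are rounded.

```python
# Hypothetical sketch of an 80/10/10 train/dev/test split.
import random

def split_corpus(docs, seed=0):
    """Shuffle documents and split into 80% train, 10% dev, 10% test."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    n = len(docs)
    n_train = int(0.8 * n)
    n_dev = int(0.1 * n)
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]  # withheld until final testing
    return train, dev, test

train, dev, test = split_corpus(range(1245))
```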

A total of 18 tests were conducted. For each test, a portion of the training data was unlabeled: in the first test, about 10% of the instances had their labels removed, and each subsequent test unlabeled an increasing share of the data. After the labels were removed, a multinomial naive Bayes classifier was trained with EM to take advantage of the unlabeled data. For comparison, a multinomial naive Bayes classifier was then trained, without EM, on only the remaining labeled portion of the data. The results of these experiments are shown below.
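The EM training loop described above can be sketched as follows. This is an illustrative implementation, not the code actually used in the experiment: documents are assumed to be word-count vectors, and all function and variable names are made up for this sketch. The E-step computes class posteriors for the unlabeled documents; the M-step re-estimates the naive Bayes parameters from labeled documents (with fixed labels) plus unlabeled documents weighted by those posteriors.

```python
# Minimal semi-supervised EM for a multinomial naive Bayes classifier.
import numpy as np

def train_nb(counts, resp, alpha=1.0):
    """M-step: estimate log priors and log word probabilities from
    (possibly soft) class responsibilities, with Laplace smoothing."""
    class_mass = resp.sum(axis=0)                      # (n_classes,)
    priors = class_mass / class_mass.sum()
    word_counts = resp.T @ counts                      # (n_classes, n_words)
    denom = word_counts.sum(axis=1, keepdims=True) + alpha * counts.shape[1]
    word_probs = (word_counts + alpha) / denom
    return np.log(priors), np.log(word_probs)

def posterior(counts, log_priors, log_word_probs):
    """E-step: P(class | document) under the multinomial NB model."""
    log_joint = counts @ log_word_probs.T + log_priors
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def nb_em(labeled_counts, labels, unlabeled_counts, n_classes, n_iter=10):
    """Initialize from the labeled data, then alternate E- and M-steps,
    keeping the labeled responsibilities fixed at their true labels."""
    hard = np.eye(n_classes)[labels]                   # one-hot labels
    lp, lw = train_nb(labeled_counts, hard)
    all_counts = np.vstack([labeled_counts, unlabeled_counts])
    for _ in range(n_iter):
        resp_u = posterior(unlabeled_counts, lp, lw)   # E-step on unlabeled
        resp = np.vstack([hard, resp_u])
        lp, lw = train_nb(all_counts, resp)            # M-step on everything
    return lp, lw
```

The classifier without EM corresponds to calling `train_nb` once on the labeled portion alone; the comparison in the experiment is between that model and the output of `nb_em`.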

Tests were repeated with a larger data set. XXX - Describe here.


All files were corrupted!


The xls file has been updated to include the results from the larger data set: nbemVSnb.xls




The first graph shows that using the unlabeled data keeps the accuracy of the classifier relatively level. In comparison, the classifier without EM degrades sharply as less and less labeled data is available to it for training.

nlp/naive-bayes-with-em-and-unlabeled-documents-vs-naive-bayes.txt · Last modified: 2015/04/23 15:44 by ryancha