This assignment is designed to:

- solidify the ideas of text classification by classifying proper names or text documents
- provide hands-on experience with the graphical model known as Naïve Bayes
- provide experience with smoothing a distribution in the Bayesian manner, using a prior
- introduce the idea of a confusion matrix to facilitate error analysis
- build understanding necessary for problems that rely on text classification, such as spam detection; document sorting, routing, and filtering; and language identification

For this project, you will build a model that predicts a label for a given input datum. The input may be a proper name such as “Washington” or a document such as a news article. If you are working with proper names, the model should predict the type of proper name (place, movie, drug, person, or company). If you are working with documents, the model should predict the category to which the document belongs. To accomplish this task, you will conduct supervised learning, in which all of the training data is labeled. You will train a Naïve Bayes model with the multivariate categorical event model and simple add-one smoothing (i.e., a Dirichlet prior).

Choose your dataset.

- Option 1: a data set consisting of a list of names gathered from multiple sources by Dan Klein of U.C. Berkeley.
- Each name is labeled with one of five possible labels: PLACE, MOVIE, DRUG, PERSON, COMPANY.
- Download here: Labeled Proper Names

- Option 2: the 20 Newsgroups data set consisting of 20,000 Usenet posts.
- Each document belongs to one of 20 Usenet newsgroups.
- Download here: 20 Newsgroups
- Be sure to remove any headers that reveal a document's true label.

- Option 3: find your own labeled document data set with at least 5 categories and at least 5000 documents.

Dedicate one section of your report to specify your choice of data set. If you chose your own, then explain the source and content of the data.

- [20 points] Implement a classifier as a conditional query over classes on a Naïve Bayes (a.k.a. class-conditional unigram) model.
- Smooth using add-one smoothing (i.e., using a Dirichlet prior). If you wish to try some value other than 1, you may do so, but be sure to report what you are doing.
- If you are classifying names, use characters (rather than words) as the evidence.
- If you are classifying documents, use words as the evidence.
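Putting these requirements together, a smoothed Naïve Bayes classifier might be sketched as follows. This is only an illustration under our own naming choices (`train_nb`, `predict`, the `alpha` pseudo-count), not a required design; pass characters as tokens for names and words as tokens for documents.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Train a Naive Bayes model with add-alpha (Dirichlet) smoothing.

    examples: list of (label, tokens) pairs; tokens are characters for
    proper names or words for documents.
    """
    class_counts = Counter()
    token_counts = defaultdict(Counter)   # label -> token -> count
    vocab = set()
    for label, tokens in examples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab, alpha

def predict(model, tokens):
    """Return the label maximizing log P(c) + sum_i log P(w_i | c)."""
    class_counts, token_counts, vocab, alpha = model
    n = sum(class_counts.values())
    v = len(vocab)
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        total = sum(token_counts[label].values())
        score = math.log(count / n)
        for t in tokens:
            # Add-alpha smoothing: every token gets pseudo-count alpha,
            # so unseen tokens receive nonzero probability.
            score += math.log((token_counts[label][t] + alpha)
                              / (total + alpha * v))
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log space, as above, avoids floating-point underflow when multiplying many small probabilities.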

- [5 points] Implement a baseline classifier.
- In general, the purpose of a baseline algorithm is to show how well a problem can be solved by an easy, obvious solution. It is a useful point of comparison when you are coming up with more interesting/fancy solutions. For example, if you run your fancy algorithm on a problem and it generates a solution that looks good, somebody could argue that it's because the problem is easy. Similarly, if your fancy algorithm's solution looks bad, you could argue that it's because the problem is very hard. If you can come up with a simple common-sense baseline to compare with, you'll be able to show that your fancy solution has added real value when it beats that baseline, regardless of whether the absolute level of performance is high or low. Implement one of the following baselines and see how your better classifier compares against it:
- You could use the most frequent label as a weak baseline.
- You could use a Naïve Bayes classifier without smoothing as a stronger baseline.
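For instance, the most-frequent-label baseline amounts to only a few lines. This is a sketch with names of our own choosing, not a prescribed interface:

```python
from collections import Counter

def train_baseline(train_labels):
    # Predict whichever label occurs most often in the training data.
    return Counter(train_labels).most_common(1)[0][0]

def baseline_accuracy(majority_label, test_labels):
    # Fraction of test items the constant prediction gets right.
    return sum(l == majority_label for l in test_labels) / len(test_labels)
```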

- [10 points] Implement a confusion matrix as illustrated here. A confusion matrix is simply a two-dimensional table with true labels (“REF” for reference) along one axis and predicted labels (“HYP” for hypothesis) along the other. Cells in the matrix / table contain the count of the instances with true label REF confused for predicted label HYP. Entries along the diagonal capture correct classification, as illustrated in the figure.
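One simple way to accumulate and display such a matrix is a nested mapping from REF to HYP counts. The sketch below is illustrative; the exact tabular layout is up to you:

```python
from collections import defaultdict

def confusion_matrix(refs, hyps):
    """Count how often true label REF was predicted as HYP."""
    matrix = defaultdict(lambda: defaultdict(int))
    for ref, hyp in zip(refs, hyps):
        matrix[ref][hyp] += 1
    return matrix

def print_matrix(matrix):
    """Print a table with REF labels down the rows, HYP across columns."""
    labels = sorted(set(matrix) | {h for row in matrix.values() for h in row})
    width = max(len(l) for l in labels) + 2
    print("".rjust(width) + "".join(l.rjust(width) for l in labels))
    for ref in labels:
        row = "".join(str(matrix[ref][hyp]).rjust(width) for hyp in labels)
        print(ref.rjust(width) + row)
```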

[Extra Credit: 20 points]

Implement your classifier in such a way that the model makes weaker independence assumptions. In addition to a Naïve Bayes model, which has a Markov order-0 model of evidence ($P(w|c)$), implement an additional model with a higher Markov order: e.g., an order-1 model of evidence conditions each word on its predecessor: $P(w_i|c,w_{i-1})$. Consider experimenting with different Markov orders.

Be sure to smooth the distribution properly; i.e., in the Markov order-1 case, make sure that you estimate $P(w_i|c,w_{i-1})$ in such a way that the smoothed estimates still constitute a valid distribution.
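One way to keep the order-1 estimates valid is to apply add-alpha smoothing per conditioning context, so that for every $(c, w_{i-1})$ pair the estimates sum to one over the vocabulary. The sketch below (names and the `<s>` start symbol are our own choices) illustrates the idea:

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # padding symbol so the first token also has a predecessor

def train_order1(examples, alpha=1.0):
    """Collect counts for estimating P(w_i | c, w_{i-1}) with add-alpha smoothing."""
    bigram = defaultdict(Counter)   # (label, prev) -> token -> count
    context = Counter()             # (label, prev) -> total count
    vocab = set()
    class_counts = Counter()
    for label, tokens in examples:
        class_counts[label] += 1
        prev = START
        for t in tokens:
            bigram[(label, prev)][t] += 1
            context[(label, prev)] += 1
            vocab.add(t)
            prev = t
    return class_counts, bigram, context, vocab, alpha

def order1_logprob(model, label, tokens):
    """log P(c) + sum_i log P(w_i | c, w_{i-1}) under the smoothed estimates."""
    class_counts, bigram, context, vocab, alpha = model
    v = len(vocab)
    prev = START
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for t in tokens:
        # For each (label, prev) context, these smoothed estimates sum
        # to 1 over the vocabulary, so each is a valid distribution.
        score += math.log((bigram[(label, prev)][t] + alpha)
                          / (context[(label, prev)] + alpha * v))
        prev = t
    return score
```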

*Please limit the non-code portion of your report to 4 pages maximum.*

For this assignment, write a well-structured report on the work you have done, including the following ingredients:

- [5 points] **Time**: Include at the top of page 1 of your report a clear measure (in hours) of how long it took you to complete this project.
- [10 points] **Data**: Dedicate one section of your report to specify your choice of data set. If you chose your own, then explain the source and content of the data. Explain how the data is partitioned so that training and test data are distinct.
- [10 points] **Design Decisions**: Specify what you built and what choices you made. Please do not simply submit a debug or work log. *Be sure to explain how you implemented the smoothing of the estimates of the probability of a word given a class. What data structures did you use? How are they consulted in order to make a decision?*
- [20 points] **Results**: Include the accuracies, confusion matrices, etc., for each model on a test data set separate from the training set.
- **Error Analysis**: Include error analysis that responds to the questions raised below – enough to convince us that you looked at the specific behavior of your models and thought about what they’re doing wrong and how you’d fix them. The primary goal of error analysis is to understand what your classifiers are doing well, what they’re doing badly, and why.
  - Use your confusion matrix to show how often each pair of labels is confused.
  - Use diagrams, graphs, and interesting examples, where needed, to make your points.

- [10 points] **Questions**: Address the following questions:
  - How are the results surprising?
  - Why do you think some labels are easier or harder?
  - Identify some specific errors that are particularly revealing as to why or how the system makes errors.
  - What cases would you (as a human) have trouble with? Identify a few and discuss them.
  - If you were going to invest a large amount of effort into raising the accuracy of this system, what would you do and why?
- **Validation**: For your model(s), *demonstrate* how or *justify* why the distributions in your model are proper distributions.
  - Option 1: To *justify* that your distributions are proper (i.e., valid), you need to *explain* why the local model parts of your overall model ($P(c)$ and $P(w|c)$) are proper distributions.
  - Option 2: To *demonstrate* that your distributions are proper, show that the local model parts of your overall model ($P(c)$ and $P(w|c)$) are proper distributions by actually summing over all values in the range of the random variable over which each distribution assigns probability.
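If you take the *demonstrate* route, the check can be done programmatically: sum each local distribution and verify the result. The sketch below assumes your model can expose its smoothed probabilities as a mapping; the helper name is our own:

```python
def is_proper_distribution(probs, tol=1e-9):
    """A mapping value -> probability is a proper distribution iff every
    probability is nonnegative and the probabilities sum to 1."""
    return (all(p >= 0 for p in probs.values())
            and abs(sum(probs.values()) - 1.0) <= tol)

# Example: add-one-smoothed P(w|c) over a toy three-word vocabulary.
counts = {"a": 3, "b": 1, "c": 0}          # word counts within class c
total, v = sum(counts.values()), len(counts)
p_w_given_c = {w: (n + 1) / (total + v) for w, n in counts.items()}
```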

- **Feedback**: Include at the end of your report a short section titled “Feedback”. Reflect on your experience in the project and provide any concrete feedback you have for us that would help make this project a better learning experience.
- **Code**: Include your code at the end of your report.

- [10 points] Write your report with clarity and easy-to-follow structure.

[Extra: up to 10 points]

You can think of the posterior class probability $P(c | d)$ as the classifier's confidence. Have your evaluation code compute how your classifier’s posterior class probability correlates with its accuracy in order to assess how useful this confidence really is. We are talking about statistical correlation; see the article on correlation on Wikipedia as one possible source of further insight. In essence, take the outcome (right = 1, wrong = 0) and plot it against the posterior probability for the chosen class; this allows you to see the correlation visually. The square of the correlation coefficient is the fraction of variance explained by a linear fit. Each (outcome, probability) pair is a data point used to compute the correlation, as explained in your book, Wikipedia, etc. After computing this measure, address the following question:

- Are more confident decisions more likely to be correct? Why or why not?
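The Pearson correlation between outcomes and posteriors can be computed directly, without external libraries. A minimal sketch (the function name is our own):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# xs: 1 if the classifier was right on a test datum, 0 otherwise
# ys: P(chosen class | d) for each test datum
```

With a binary variable on one side, this is also known as the point-biserial correlation.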

Your report should be submitted as a .pdf document via Learning Suite.

Thanks go to Dan Klein of U. C. Berkeley for the labeled proper name data set.