Assigned Thursday, January 23
Due Tuesday, February 18 at 11:59:59 p.m.
Total points: 100
When you hand in your results from this homework, submit your source code and your written report in separate files.
On this homework, you must work on your own and submit your own results written in your own words.
You will train and test your two classifiers on six data sets. For five of them, the number of classes is M=2 and the number of dimensions is l=2. For the sixth, the number of classes is M=4 and the number of dimensions is l=4. Because M>2 in that case, you must use a multi-class technique (e.g. Kessler's construction or ECOC) for Perceptron and Winnow.
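As one possible multi-class approach (a minimal sketch, not a required implementation), a one-versus-rest scheme trains one binary Perceptron per class and predicts the class whose linear machine scores highest. All names here (train_perceptron_ovr, predict, lr, max_epochs) are my own illustration, not part of the assignment:

```python
import numpy as np

def train_perceptron_ovr(X, y, n_classes, lr=1.0, max_epochs=100):
    """One-versus-rest Perceptron: one binary weight vector per class."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # append a bias feature
    W = np.zeros((n_classes, d + 1))
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(Xb, y):
            for c in range(n_classes):
                target = 1 if label == c else -1   # class c vs. rest
                if target * (W[c] @ x) <= 0:       # misclassified by machine c
                    W[c] += lr * target * x
                    mistakes += 1
        if mistakes == 0:                          # converged (separable case)
            break
    return W

def predict(W, X):
    """Assign each vector to the class with the largest linear score."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W.T, axis=1)
```

Counting the updates per epoch also gives you the iteration counts the report asks for.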
The data sets were generated according to different probability distributions and some are linearly separable while others are not (each testing set is generated in the same fashion as its corresponding training set). In your Bayesian classifiers, do not forget that all classes might not be equally probable!
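To account for unequal priors, the Bayesian decision rule compares log p(x|omega_i) + log P(omega_i), with P(omega_i) estimated from the class frequencies in the training data. A minimal sketch assuming Gaussian class-conditional densities (the function names are my own, and your density model may differ):

```python
import numpy as np

def fit_gaussian_bayes(class_data):
    """Estimate prior, mean, and covariance per class.
    class_data: list of (n_i, l) arrays, one array per class."""
    n_total = sum(len(X) for X in class_data)
    params = []
    for X in class_data:
        prior = len(X) / n_total          # classes need not be equiprobable!
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        params.append((prior, mean, cov))
    return params

def classify(params, x):
    """Pick the class maximizing log p(x|omega_i) + log P(omega_i)."""
    scores = []
    for prior, mean, cov in params:
        diff = x - mean
        inv = np.linalg.inv(cov)
        log_pdf = -0.5 * diff @ inv @ diff - 0.5 * np.log(np.linalg.det(cov))
        scores.append(log_pdf + np.log(prior))
    return int(np.argmax(scores))
```

Note that dropping the log-prior term would implicitly assume equiprobable classes, which the data sets do not guarantee.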
When you choose values for your different parameters, choose widely varying numbers (e.g. differing by factors of 2 or more), so you can see how the parameter values impact training speed and classification error.
Report on the performance of these classifiers on the different data sets with the different parameter values (one approach is to hold all parameter values fixed while varying one and measuring its effect). You might also try varying the number of training examples to see which classifiers perform better with less data. Include comments on the time to train (in both number of iterations and real time), the time to test, and the error rates during training and testing. Also, for each data set, note whether a classifier tended to make a particular type of error. That is, were all of its misclassified feature vectors near each other? If so, explain this phenomenon.
Contrast the performances (error rates and times) of these classifiers on the different sets. From these results, what can you infer about the characteristics of each data set in terms of linear separability, probability distribution, etc.? Based on the characteristics of each data set, which classifier do you feel is most appropriate? Do you think better results could be obtained by using other methods (including those from Chapter 4)? Which methods do you think would improve performance and why?
When you hand in your results, submit source code electronically as well as a well-written, detailed report (much of your grade will depend on your presentation).
The data sets are available on the web. Each directory has 2M files: classi.train and classi.test for i=1,...,M. The *.train files are for training the classifier; the *.test files are for testing it (if you wish to experiment with various amounts of training data, you can prune some data out of the training sets). The file classi.* contains one omega_i feature vector per line, with real numbers separated by spaces. You may write scripts to reformat the data before it goes to your executables, so long as you keep the training and testing sets separate. However, you must provide these scripts in your .tar.gz file. That is, I should be able to run your main scripts on the original data sets to evaluate your programs.
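A reformatting script might simply read each class file and attach labels. This sketch assumes the naming scheme described above (one vector per line, real numbers separated by spaces); load_dataset is a hypothetical helper, not a provided tool:

```python
import numpy as np

def load_dataset(directory, n_classes, split="train"):
    """Read class1.<split>, ..., classM.<split> into (X, y).
    Labels are 0-based: vectors from classi get label i-1."""
    X_parts, y_parts = [], []
    for i in range(1, n_classes + 1):
        vecs = np.loadtxt(f"{directory}/class{i}.{split}", ndmin=2)
        X_parts.append(vecs)
        y_parts.append(np.full(len(vecs), i - 1))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

Keeping split="train" and split="test" as separate calls preserves the required separation between training and testing data.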
Extra Credit (15 pts) Find a good SVM or RBF toolset that is freely available (or write your own). Download and compile it (if necessary) and run it on different training and testing sets (including the ones I generated, if you wish). Briefly report on its performance with different network architectures and parameters. Also report on its ease of use.
Last modified 16 August 2011; please report problems to sscott AT cse.