Assigned Tuesday, February 21
Due Sunday, April 2 at 11:59 p.m. (extended from the original due date of Thursday, March 23 at 11:59 p.m.)
Total points: 120
When you hand in your results from this homework, you should submit each of the required items (described below) in a separate file. Submit everything by the due date and time using the web-based handin program.
On this homework, you must work on your own and submit your own results written in your own words.
Of course, your kernel cannot do everything on its own; it needs help. So your first step will be to find, download, and compile an SVM package, or at least an optimizer for the quadratic program. (You may also implement your own.) You will then implement (not merely download) your kernel and modify your chosen SVM package to work with it. There are multiple ways of doing this: you may implement your kernel within the source code of the SVM package, or you may implement a stand-alone program that outputs your kernel's Gram matrix for a particular data set, then add to your SVM package some code that simply reads in that matrix. (I personally recommend the latter since it allows you to implement your kernel in the language of your choice; the former may also require you to modify the SVM package's data structures and/or input format. But either way, it is your decision.)
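If you take the stand-alone route, the program can be quite small. The sketch below uses a toy RBF kernel on numeric vectors as a stand-in for whatever kernel you implement, and writes the matrix in LIBSVM's precomputed-kernel format (label, then a 0:serial feature, then one feature per column); adapt the output to whatever your chosen SVM package expects.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Toy kernel on fixed-length numeric vectors; replace with your own."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram_matrix(examples, kernel):
    """Compute the full symmetric Gram matrix K[i][j] = k(x_i, x_j)."""
    n = len(examples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = kernel(examples[i], examples[j])
    return K

def write_precomputed(path, labels, K):
    """Write the Gram matrix in LIBSVM precomputed-kernel format:
    each line is "<label> 0:<serial> 1:K(i,1) 2:K(i,2) ..."."""
    with open(path, "w") as f:
        for i, (y, row) in enumerate(zip(labels, K), start=1):
            feats = " ".join(f"{j + 1}:{v:.6f}" for j, v in enumerate(row))
            f.write(f"{y} 0:{i} {feats}\n")
```

The advantage of writing the matrix to a file is that the kernel code never has to link against the SVM package at all.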
Of course, you also need data. Lots and lots of data. Since your kernel has a different input space than the kernels of some of your classmates, I cannot simply assign a set of data for everyone. Thus you will need to come up with your own data sets that are appropriate for your kernel. If your chosen kernel has applications to a research project you're doing, I suggest you start there. You might also try contacting the authors of the papers you read when choosing your kernel, to find out where they got their data. Another possible source is the UCI Machine Learning Repository, although I doubt you'll find data there appropriate for graph or string kernels (I may be wrong, however).
You will download two separate data sets, each representing a different learning task. E.g. if your kernel is a string kernel, then one of your data sets might consist of biological sequences, with positive and negative examples of sequences from a specific protein family. The second data set might have positive and negative examples of another protein family (a different data set, or at least a different representation of the same data, e.g. secondary versus primary structure), DNA sequences with and without certain properties, or data from an information retrieval application that does and does not match a particular query. Partition each set S into two disjoint subsets S_train and S_test, where the latter has size at least 30 (but the larger the better). You will obviously train on the examples in S_train and test on those in S_test.
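A minimal sketch of the partitioning step, which shuffles each set and holds out at least 30 examples for S_test (the names and sizes here are just the assignment's requirements, not a prescribed API):

```python
import random

def split(examples, test_size=30, seed=0):
    """Shuffle and partition examples into disjoint (train, test) subsets,
    with exactly test_size (>= 30 per the assignment) test examples."""
    if len(examples) <= test_size:
        raise ValueError("need more than test_size examples to split")
    pool = list(examples)
    random.Random(seed).shuffle(pool)  # fixed seed makes the split repeatable
    return pool[test_size:], pool[:test_size]
```

Fixing the random seed lets you rerun experiments on exactly the same partition.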
Report on the performance of your classifiers on the different data sets using receiver operating characteristic (ROC) curves. (For more information, see the "ROC Analysis" portion of my CSE 878 slides and Peter Flach's ROC tutorial.) Comment on the learner's performance and, where applicable, contrast your results with those reported in the papers where you found your kernel. Also, for each data set, note whether your classifier tended to make a particular type of error. I.e. were many of its misclassified examples (where "misclassified" means a positive ranked below several negatives, or a negative ranked above many positives) similar to each other in input and/or feature space? If so, explain this phenomenon.
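Computing the ROC points yourself is straightforward, assuming your classifier outputs a real-valued score where higher means "more positive": sort examples by score and sweep the decision threshold, recording one (false-positive rate, true-positive rate) point per example. A minimal sketch:

```python
def roc_points(scores, labels):
    """Return the list of (FPR, TPR) points obtained by sweeping the
    decision threshold over classifier scores (higher = more positive).
    labels are 1 (positive) or anything else (negative)."""
    P = sum(1 for y in labels if y == 1)  # number of positives
    N = len(labels) - P                   # number of negatives
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

A perfect ranking traces the left and top edges of the unit square; plotting these points for each data set gives the curves the report asks for.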
When you hand in your results, submit source code electronically as well as a well-written, detailed report (much of your grade will depend on your presentation). Extra credit opportunities include experimenting with single-class SVMs, experimenting with more data sets (or other representations of the same data sets), and anything else that goes above and beyond what is asked for in this homework.
To get you started in your search for SVM packages, you may look at this list or at the Weka site, which has an implementation of SMO. Further, SVMlight is a good package. You can also look at LIBSVM or its Weka version (note that this version is not at the Weka site).
One more note: those using kernels on discrete structures (e.g. strings) might get a kernel matrix that is "diagonally dominant", i.e. the entries on the diagonal could be much larger than those off the diagonal. This can cause problems with learning, e.g. higher generalization error. If you have trouble, you might try a simple trick from Schölkopf et al.: take the log of each entry in the kernel matrix, or raise each entry to the power 1/p for some p > 1. Then, to ensure that the resultant matrix is positive definite (PD), multiply it by itself. See the paper for more information.
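A sketch of one variant of that trick, under the assumptions that you use the entrywise power 1/p (the log variant is analogous) and that the kernel matrix is symmetric, so multiplying the dampened matrix A by itself gives A^T A, which is guaranteed positive semidefinite. Plain nested lists keep it dependency-free:

```python
def dampen(K, p=2.0):
    """Entrywise |k|^(1/p), preserving sign; assumes p > 1.
    This shrinks large diagonal entries relative to off-diagonal ones."""
    return [[(abs(v) ** (1.0 / p)) * (1 if v >= 0 else -1) for v in row]
            for row in K]

def matmul(A, B):
    """Plain matrix product of nested-list matrices."""
    n, m, q = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(q)]
            for i in range(n)]

def fix_dominance(K, p=2.0):
    """Dampen the matrix entrywise, then square it to restore (semi)definiteness."""
    A = dampen(K, p)
    return matmul(A, A)  # A symmetric, so A*A = A^T A is positive semidefinite
```

Consult the Schölkopf et al. paper for the conditions under which this preserves the information in the original kernel.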
Last modified 16 August 2011.