Assigned Tuesday, February 21
Due Sunday, April 2 at 11:59 p.m. (extended from the original due date of Thursday, March 23 at 11:59 p.m.)
Total points: 120
When you hand in your results from this homework, you should submit each of the required items (described below) in a separate file. Submit everything by the due date and time using the web-based handin program.
On this homework, you must work on your own and submit your own results written in your own words.
Of course, your kernel cannot do everything on its own; it needs help. So your first step will be to find, download, and compile an SVM package, or at least an optimizer for the quadratic program. (You may also implement your own.) You will then implement (not merely download) your kernel and modify your chosen SVM package to work with it. There are multiple ways of doing this: you may implement your kernel within the source code of the SVM package, or you may implement a stand-alone program that outputs your kernel's Gram matrix for a particular data set, then add to your SVM package some code that simply reads in that matrix. (I personally recommend the latter since it allows you to implement your kernel in the language of your choice; the former may also require you to modify the SVM package's data structures and/or input format. But either way, it is your decision.)
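If you take the stand-alone route, the program can be quite small. The sketch below uses a toy RBF kernel on numeric vectors as a stand-in for whatever kernel you implement, and writes the matrix in LIBSVM's precomputed-kernel format (label, then a 0:serial feature, then one feature per column); adapt the output to whatever your chosen SVM package expects.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Toy kernel on fixed-length numeric vectors; replace with your own."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram_matrix(examples, kernel):
    """Compute the full symmetric Gram matrix K[i][j] = k(x_i, x_j)."""
    n = len(examples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = kernel(examples[i], examples[j])
    return K

def write_precomputed(path, labels, K):
    """Write the Gram matrix in LIBSVM precomputed-kernel format:
    each line is "<label> 0:<serial> 1:K(i,1) 2:K(i,2) ..."."""
    with open(path, "w") as f:
        for i, (y, row) in enumerate(zip(labels, K), start=1):
            feats = " ".join(f"{j + 1}:{v:.6f}" for j, v in enumerate(row))
            f.write(f"{y} 0:{i} {feats}\n")
```

The advantage of writing the matrix to a file is that the kernel code never has to link against the SVM package at all.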
Of course, you also need data. Lots and lots of data. Since your kernel has a different input space than the kernels of some of your classmates, I cannot simply assign a set of data for everyone. Thus you will need to come up with your own data sets that are appropriate for your kernel. If your chosen kernel has applications to a research project you're doing, I suggest you start there. You might also try contacting the authors of the papers you read when choosing your kernel, to find out where they got their data. Another possible source is the UCI Machine Learning Repository, although I doubt you'll find data there appropriate for graph or string kernels (I may be wrong, however).
You will download two separate data sets, each representing a different learning task. E.g. if your kernel is a string kernel, then one of your data sets might consist of biological sequences, with positive and negative examples of sequences from a specific protein family. The second data set might have positive and negative examples of another protein family (a different data set, or at least a different representation of the same data, e.g. secondary versus primary structure), DNA sequences with and without certain properties, or data from an information retrieval application that does and does not match a particular query. Partition each set S into two disjoint subsets S_train and S_test, where the latter has size at least 30 (but the larger the better). You will obviously train on the examples in S_train and test on those in S_test.
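A minimal sketch of the partitioning step, which shuffles each set and holds out at least 30 examples for S_test (the names and sizes here are just the assignment's requirements, not a prescribed API):

```python
import random

def split(examples, test_size=30, seed=0):
    """Shuffle and partition examples into disjoint (train, test) subsets,
    with exactly test_size (>= 30 per the assignment) test examples."""
    if len(examples) <= test_size:
        raise ValueError("need more than test_size examples to split")
    pool = list(examples)
    random.Random(seed).shuffle(pool)  # fixed seed makes the split repeatable
    return pool[test_size:], pool[:test_size]
```

Fixing the random seed lets you rerun experiments on exactly the same partition.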
Report on the performance of your classifiers on the different data sets using receiver operating characteristic (ROC) curves. (For more information, see the "ROC Analysis" portion of my CSE 878 slides and Peter Flach's ROC tutorial.) Comment on the learner's performance and, where applicable, contrast your results with those reported in the papers where you found your kernel. Also, for each data set, note whether your classifier tended to make a particular type of error. I.e. were many of its misclassified examples (where "misclassified" means a positive ranked below several negatives, or a negative ranked above many positives) similar to each other in input and/or feature space? If so, explain this phenomenon.
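Computing the ROC points yourself is straightforward, assuming your classifier outputs a real-valued score where higher means "more positive": sort examples by score and sweep the decision threshold, recording one (false-positive rate, true-positive rate) point per example. A minimal sketch:

```python
def roc_points(scores, labels):
    """Return the list of (FPR, TPR) points obtained by sweeping the
    decision threshold over classifier scores (higher = more positive).
    labels are 1 (positive) or anything else (negative)."""
    P = sum(1 for y in labels if y == 1)  # number of positives
    N = len(labels) - P                   # number of negatives
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

A perfect ranking traces the left and top edges of the unit square; plotting these points for each data set gives the curves the report asks for.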
When you hand in your results, submit source code electronically as well as a well-written, detailed report (much of your grade will depend on your presentation). Extra credit opportunities include experimenting with single-class SVMs, experimenting with more data sets (or other representations of the same data sets), and anything else that goes above and beyond what is asked for in this homework.
To get you started in your search for SVM packages, you may look at this list or at the Weka site, which has an implementation of SMO. Further, SVMlight is a good package. You can also look at LIBSVM or its Weka version (note that this version is not at the Weka site).
One more note: those using kernels on discrete structures (e.g. strings) might get a kernel matrix that is "diagonally dominant", i.e. the entries on the diagonal could be much larger than those off the diagonal. This can cause problems with learning, e.g. higher generalization error. If you have trouble, you might try a simple trick from Schölkopf et al.: take the log of each entry in the kernel matrix, or raise each entry to the power 1/p for some p > 1. Then, to ensure that the resultant matrix is positive definite (PD), multiply it by itself. See the paper for more information.
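A sketch of one variant of that trick, under the assumptions that you use the entrywise power 1/p (the log variant is analogous) and that the kernel matrix is symmetric, so multiplying the dampened matrix A by itself gives A^T A, which is guaranteed positive semidefinite. Plain nested lists keep it dependency-free:

```python
def dampen(K, p=2.0):
    """Entrywise |k|^(1/p), preserving sign; assumes p > 1.
    This shrinks large diagonal entries relative to off-diagonal ones."""
    return [[(abs(v) ** (1.0 / p)) * (1 if v >= 0 else -1) for v in row]
            for row in K]

def matmul(A, B):
    """Plain matrix product of nested-list matrices."""
    n, m, q = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(q)]
            for i in range(n)]

def fix_dominance(K, p=2.0):
    """Dampen the matrix entrywise, then square it to restore (semi)definiteness."""
    A = dampen(K, p)
    return matmul(A, A)  # A symmetric, so A*A = A^T A is positive semidefinite
```

Consult the Schölkopf et al. paper for the conditions under which this preserves the information in the original kernel.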
Last modified 16 August 2011.