CSCE 478/878 (Fall 2003) Homework 1

Assigned Wednesday, September 24
Due Sunday, October 12 at 11:59:59 p.m.
Now due Tuesday, October 14 at 11:59:59 p.m.

When you hand in your results from this homework, you should submit the following, in separate files:

  1. A single .tar.gz or .tar.Z file (make sure you use a UNIX-based compression program) called username.tar.gz where username is your username on cse. In this tar file, put:
  2. A single .pdf file with your writeup of the results for all the homework problems, including the last problem. Only PDF will be accepted; submit exactly one PDF file, with the name username.pdf, where username is your username on cse. Include all your plots in this file, as well as a detailed summary of your experimental setup, results, and conclusions. If you have several plots, you might put a few example ones in the main text and defer the rest to an appendix. Remember that the quality of your writeup strongly affects your grade. See the web page on "Tips on Presenting Technical Material".
Submit everything by the due date and time using the web-based handin program.

On this homework, you must work on your own and submit your own results written in your own words.


  1. (40 pts) Using the Lasso API, implement the ID3 algorithm from table 3.1 (p. 56). You will train and test your algorithm on three different data sets from the UCI Machine Learning Repository. You may choose any three data sets that you wish, but you should note the following when making your selections.

    Let U1, U2, and U3 be your three data sets from UCI. From each set Ui remove 30 examples, placing them in set Ti (Ti will serve as the test set for experiment i). We will refer to the set of examples left over in Ui as Di, i.e. Di is the set of examples in Ui that are not in Ti. For each pair (Di, Ti), do the following.

    1. Choose 20 numbers evenly spaced between 10 and |Di| (the size of Di). Call these numbers s1, s2, ..., s20.
    2. For j = 1, ..., 20, uniformly at random select sj examples from Di without replacement (i.e., do not select the same example twice). Use these examples to learn a decision tree with ID3, and test it on the test set Ti. (Note that all 20 tests use the same test set; this is important to allow comparisons between different runs.)
    3. Using the 20 error rates generated in this loop, plot a curve of test error versus training-set size.

    Thus you will end up with three plots, one per Ui, each with 20 points.
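    The bookkeeping for this experiment can be sketched in Python. This is only a sketch of the setup, not the required implementation (the learner itself must use the Lasso API), and the helper names `training_sizes` and `split_off_test_set` are hypothetical:

    ```python
    import random

    def training_sizes(n_train, k=20, low=10):
        """Return k integer sizes spaced evenly between low and n_train."""
        step = (n_train - low) / (k - 1)
        return [round(low + j * step) for j in range(k)]

    def split_off_test_set(examples, n_test=30, rng=random):
        """Uniformly hold out n_test examples as T_i; the rest form D_i."""
        shuffled = list(examples)
        rng.shuffle(shuffled)        # uniform random permutation
        return shuffled[n_test:], shuffled[:n_test]
    ```

    For each size sj returned by `training_sizes(len(Di))`, you would then draw sj examples from Di without replacement, train, and record the error on Ti.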

    You are to submit a detailed, well-written report, with real conclusions and everything. In particular, you should answer the following questions. How did increasing the training set size influence generalization error? Did overfitting occur? If not, can you push the learner to the point of overfitting? Why or why not? In your report, you should also discuss how you randomly selected the test sets Ti and how you subsampled Di to get the training sets. From your report, the reader should be able to get enough information to repeat your experiments, and the reader should be convinced that your methods are sound, e.g. that your subsampling methods are sufficiently random. Refer to Numerical Recipes online if you have questions about simulating random processes.
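    On arguing that the subsampling is sufficiently random: Python's `random.sample`, for instance, already draws uniformly without replacement, and the equivalent hand-rolled method is a partial Fisher-Yates shuffle, whose uniformity is easy to justify in a report. A sketch, assuming the examples are held in a list:

    ```python
    import random

    def sample_without_replacement(pool, k, rng=random):
        """Partial Fisher-Yates shuffle: every size-k subset is equally likely."""
        items = list(pool)
        for i in range(k):
            j = rng.randrange(i, len(items))   # uniform over positions i..n-1
            items[i], items[j] = items[j], items[i]
        return items[:k]
    ```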

    Extra credit opportunities for this problem include (but are not limited to) running on extra data sets, handling continuous-valued attributes, and handling unspecified attribute values. The amount of extra credit is commensurate with the level of extra effort and the quality of your report of the results.
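    At the heart of the Table 3.1 algorithm is the information-gain computation used to choose the splitting attribute. A minimal Python sketch (again, the actual submission must use the Lasso API; the dict-of-attributes representation with a "label" key is an assumption for illustration):

    ```python
    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
        total = len(labels)
        return -sum((c / total) * log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(examples, attr, target="label"):
        """Gain(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)."""
        total = len(examples)
        by_value = {}
        for e in examples:                      # partition S by the value of A
            by_value.setdefault(e[attr], []).append(e[target])
        remainder = sum(len(ls) / total * entropy(ls)
                        for ls in by_value.values())
        return entropy([e[target] for e in examples]) - remainder
    ```

    ID3 then recursively splits on the attribute with the highest gain, stopping when a node's examples all share one label or no attributes remain.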

  2. (5 pts) Do Problem 3.2 on p. 77.

  3. (15 pts) Do Problem 7.2 on p. 227.

  4. (5 pts) State how many hours you spent on each problem of this homework assignment (for CSCE 878 students, this includes the next two problems).

    The following two problems are only for students registered for CSCE 878. CSCE 478 students who do these will receive extra credit, but the amount will be less than the number of points indicated.

  5. (20 pts) Do Problem 7.6 on p. 228. (NOTE: The delta value in the problem should be 0.05, not 0.95. However, using 0.95 will not change the curves very much.) Hand in your source code and data sets as part of your solution to this problem, as well as a brief report of your results, including a discussion of how you generated the examples and how you generated the empirical and theoretical plots. You do not need to use the Lasso API on this problem, but you may if you wish.

  6. (10 pts) A binary decision stump is a depth-1 decision tree, i.e. it has a root node and two leaves. What is the VC dimension of the hypothesis class of binary decision stumps defined over the real plane? Argue that your answer is correct.


Last modified 16 August 2011; please report problems to sscott AT cse.