CSCE 478/878 (Fall 2008) Homework 1

Assigned Monday, September 22
Due Friday, October 10 at 11:59 p.m.

When you hand in your results from this homework, you should submit the following, in separate files:

  1. A single .tar.gz or .tar.Z file (make sure you use a UNIX-based compression program) called username.tar.gz where username is your username on cse. In this tar file, put:
  2. A single .pdf file with your writeup of the results for all the homework problems, including the last problem. Only pdf will be accepted, and you should only submit one pdf file, with the name username.pdf, where username is your username on cse. Include all your plots in this file, as well as a detailed summary of your experimental setup, results, and conclusions. If you have several plots, you might put a few example ones in the main text and defer the rest to an appendix. Remember that the quality of your writeup strongly affects your grade. See the web page on "Tips on Presenting Technical Material".

Submit everything by the due date and time using the web-based handin program.

On this homework, you must work on your own and submit your own results written in your own words.


  1. (40 pts) Implement the ID3 algorithm from Table 3.1 (p. 56). You will train and test your algorithm on three different data sets from the UCI Machine Learning Repository. You must use both the "Vote" and "Monks I" data sets. For the third data set, you may choose any that you wish, but you should note the following when making your selection.
  2. Let U1, U2, and U3 be your three data sets. From each set Ui remove 30 examples, placing them in set Ti (Ti will serve as the test set for experiment i). We will refer to the set of examples left over in Ui as Di, i.e. Di is the set of examples in Ui that are not in Ti. For each pair (Di, Ti), do the following.

    1. Choose 5 evenly spaced numbers between 10 and |Di| (the size of Di), inclusive. Call these numbers s1, s2, ..., s5.
    2. For j = 1, ..., 5, uniformly at random select sj examples from Di without replacement (i.e. do not select the same example twice). Use these examples to learn a decision tree with ID3 and test it with the test set Ti. Repeat this process three times for each j and take the average. Thus you will run 15 experiments, generating 5 average error rates, one per value of j. (Note that all 15 tests are made on the same test set; this is important to allow comparisons between different runs.)
    3. Using the 5 average error rates generated in the above loops, plot a curve of test error versus size of the training set.

    Thus you will end up with three plots, one per Ui, each with 5 points.
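
The sampling procedure in the steps above can be sketched roughly as follows. This is only an illustrative outline, not a required design: the function names (`evenly_spaced_sizes`, `split_test_set`, `learning_curve`), the `(features, label)` example format, and the majority-class stand-in learner are all assumptions made for the sketch. A real submission would pass an ID3 trainer in place of `train_majority`.

```python
import random
from collections import Counter


def evenly_spaced_sizes(low, high, count=5):
    """Return `count` integers spread evenly from low to high, inclusive."""
    step = (high - low) / (count - 1)
    return [round(low + k * step) for k in range(count)]


def split_test_set(examples, test_size, rng):
    """Shuffle a copy of U_i and hold out `test_size` examples as T_i."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (D_i, T_i)


def learning_curve(train_pool, test_set, train_fn, error_fn, rng, repeats=3):
    """Average test error for each training-set size s_j."""
    curve = []
    for size in evenly_spaced_sizes(10, len(train_pool)):
        errors = []
        for _ in range(repeats):
            # rng.sample draws without replacement.
            sample = rng.sample(train_pool, size)
            model = train_fn(sample)
            # Every run is scored on the same test set T_i.
            errors.append(error_fn(model, test_set))
        curve.append((size, sum(errors) / repeats))
    return curve


def train_majority(sample):
    """Hypothetical stand-in learner: predict the most common label."""
    return Counter(label for _, label in sample).most_common(1)[0][0]


def error_rate(model, test_set):
    """Fraction of test examples whose label differs from the prediction."""
    return sum(1 for _, label in test_set if label != model) / len(test_set)
```

Passing an explicitly seeded generator such as `random.Random(0)` as `rng` is one way to make the runs repeatable, which helps when describing the setup in the report.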

    You are to submit a detailed, well-written report with substantive conclusions. In particular, you should answer the following questions. How did increasing the training set size influence generalization error? Did overfitting occur? If not, can you push the learner to the point of overfitting? Why or why not? In your report, you should also discuss how you randomly selected the test sets Ti and how you subsampled Di to get the training sets. From your report, the reader should be able to get enough information to repeat your experiments, and the reader should be convinced that your methods are sound, e.g. that your subsampling methods are sufficiently random. Refer to Numerical Recipes if you have questions about simulating random processes.

    Extra credit opportunities for this problem include (but are not limited to) running on extra data sets, handling continuous-valued attributes, and handling unspecified attribute values. The amount of extra credit is commensurate with the level of extra effort and the quality of your report of the results.
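
For orientation, the attribute-selection measure that ID3 relies on (entropy and information gain) can be sketched as below. This is a minimal illustration for discrete attributes only, assuming each example is a `(feature_dict, label)` pair; the function names are made up for the sketch, and it does not cover tree construction, continuous-valued attributes, or unspecified values.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(examples, attribute):
    """Expected reduction in entropy from splitting on `attribute`.

    `examples` is assumed to be a list of (feature_dict, label) pairs.
    """
    labels = [label for _, label in examples]
    # Group the labels by the value each example takes on the attribute.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append(label)
    remainder = sum(
        (len(part) / len(examples)) * entropy(part)
        for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(examples, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))
```

On a set with 9 positive and 5 negative examples, `entropy` evaluates to about 0.940 bits.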

  3. (15 pts) Do Problem 2.4 on p. 48.

  4. (5 pts) Do Problem 3.2 on p. 77.

  5. (5 pts) State how many hours you spent on each problem of this homework assignment (for CSCE 878 students, this includes the next problem).
  6. The following problem is only for students registered for CSCE 878. CSCE 478 students who do it will receive extra credit, but the amount will be less than the number of points indicated.

  7. (20 pts) Rerun your ID3 experiments with rule post-pruning, testing on the same test sets as before. You will therefore need to set aside at least 30 of your training examples for validation during pruning. Generate the same plots as before to determine the decrease in test error that rule post-pruning yields. (To keep the comparison with the non-pruning case fair, you are advised not to train on the validation set in Problem 1, i.e. build the tree on exactly the same set of examples for this problem and for Problem 1.)
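
The pruning step for a single rule might look roughly like the sketch below. It assumes the tree has already been converted to rules, each a list of `(attribute, value)` preconditions plus a class label, and that validation examples are `(feature_dict, label)` pairs. The names `rule_accuracy` and `prune_rule` are hypothetical. Applying a removal that leaves accuracy unchanged, and scoring a rule that covers no validation examples as 0.0, are choices made for this sketch rather than requirements of the assignment.

```python
def rule_accuracy(preconditions, label, validation):
    """Fraction of validation examples matched by the rule that have its label."""
    covered = [
        y for x, y in validation
        if all(x.get(attr) == value for attr, value in preconditions)
    ]
    if not covered:
        return 0.0  # assumption: a rule that covers nothing scores zero
    return sum(1 for y in covered if y == label) / len(covered)


def prune_rule(preconditions, label, validation):
    """Greedily drop preconditions while validation accuracy does not fall."""
    current = list(preconditions)
    best = rule_accuracy(current, label, validation)
    while current:
        # Try removing each precondition in turn and keep the best result.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        scored = [(rule_accuracy(c, label, validation), c) for c in candidates]
        top_accuracy, top_rule = max(scored, key=lambda pair: pair[0])
        if top_accuracy < best:
            break  # every possible removal hurts, so stop pruning
        current, best = top_rule, top_accuracy
    return current, best
```

After each rule is pruned this way, the rules would still need to be ordered by estimated accuracy before being used for classification.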

Last modified 16 August 2011; please report problems to sscott.