Assigned Wednesday, September 24
Originally due Sunday, October 12 at 11:59:59 p.m.
Now due Tuesday, October 14 at 11:59:59 p.m.
When you hand in your results from this homework, you should submit the following, in separate files:
On this homework, you must work on your own and submit your own results written in your own words.
Let U1, U2, and U3 be your three data sets from UCI. From each set Ui remove 30 examples, placing them in set Ti (Ti will serve as the test set for experiment i). We will refer to the set of examples left over in Ui as Di, i.e. Di is the set of examples in Ui that are not in Ti. For each pair (Di, Ti), do the following.
Thus you will end up with three plots, one per Ui, each with 20 points.
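The split of each Ui into a 30-example test set Ti and the remainder Di might be sketched as follows. This is only an illustration under stated assumptions: the data set is assumed to be already loaded as a Python list of examples, and the function name `split_holdout` and the `seed` parameter are hypothetical, not part of the assignment.

```python
import random


def split_holdout(examples, n_test=30, seed=0):
    """Return (D, T): T holds n_test randomly chosen examples, D holds the rest.

    `examples` is assumed to be a list with one entry per example.
    `seed` is a hypothetical parameter; recording it makes the split repeatable.
    """
    if len(examples) <= n_test:
        raise ValueError("data set must contain more than n_test examples")

    rng = random.Random(seed)           # private generator, independent of global state
    indices = list(range(len(examples)))
    rng.shuffle(indices)                # uniform random permutation of the indices

    test_idx = set(indices[:n_test])    # first n_test shuffled indices form T
    T = [examples[i] for i in indices[:n_test]]
    D = [ex for i, ex in enumerate(examples) if i not in test_idx]
    return D, T
```

Splitting on indices rather than on values keeps duplicate examples distinct, so Di and Ti always partition Ui even when two rows of the data set are identical.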
You are to submit a detailed, well-written report, with real conclusions and everything. In particular, you should answer the following questions. How did increasing the training set size influence generalization error? Did overfitting occur? If not, can you push the learner to the point of overfitting? Why or why not? In your report, you should also discuss how you randomly selected the test sets Ti and how you subsampled Di to get the training sets. From your report, the reader should be able to get enough information to repeat your experiments, and the reader should be convinced that your methods are sound, e.g. that your subsampling methods are sufficiently random. Refer to Numerical Recipes online if you have questions about simulating random processes.
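One way to make the subsampling of Di repeatable, so that a reader could rerun the experiments, is to draw each training set from a seeded generator and report the seed. The sketch below is only an assumption about how that might look: the helper name `subsample`, the seed value, and the sizes in the usage example are hypothetical, and whether training sets should be drawn independently for each size is a design choice to justify in the report.

```python
import random


def subsample(D, size, rng):
    """Draw `size` distinct examples from D uniformly at random, without replacement.

    `rng` is a random.Random instance; seeding it and reporting the seed
    lets someone else reproduce exactly the same training sets.
    """
    if not 0 < size <= len(D):
        raise ValueError("size must be between 1 and len(D)")
    return rng.sample(D, size)


if __name__ == "__main__":
    rng = random.Random(2024)            # hypothetical seed, reported alongside the results
    D = list(range(70))                  # stand-in for the examples left over in Di
    for size in (10, 20, 30):            # hypothetical training-set sizes
        train = subsample(D, size, rng)
        print(size, sorted(train)[:5])
```

`random.Random.sample` selects without replacement, so no example appears twice in a training set.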
Extra credit opportunities for this problem include (but are not limited to) running on extra data sets, handling continuous-valued attributes, and handling unspecified attribute values. The amount of extra credit is commensurate with the level of extra effort and the quality of your report of the results.
The following two problems are only for students registered for CSCE 878. CSCE 478 students who do these will receive extra credit, but the amount will be less than the number of points indicated.
Last modified 16 August 2011; please report problems to sscott AT cse.