Assigned Monday, November 1
Due Wednesday, November 17 at 11:59 p.m.
When you hand in your results from this homework, you should submit the following, in separate files:
Implement the boosting algorithm. Use it to build an ensemble of either (1) single-node ANNs using your GD/EG implementation from Homework 2, or (2) decision stumps (depth-1 decision trees) using your ID3 implementation from Homework 1 (you may use the Weka version of these [or related] algorithms if you wish). If you use ID3, make sure you limit the trees to depth 1. You may have your learners train on data sets resampled according to the boosting distribution, or you may have each learner use its knowledge of the distribution over the training set to directly minimize weighted training error. (You can do the latter for ANNs by modifying the objective function that GD or EG minimizes; this is worth extra points.)
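As one point of reference, here is a minimal sketch of an AdaBoost loop over decision stumps, not a required design. The names WeightedStump, adaboost, and ensemble_predict are hypothetical stand-ins for your own ID3 (depth-limited) or Weka learner; the sketch assumes binary labels in {-1, +1} and trains each stump directly on the current distribution (the second option described above).

```python
# Minimal AdaBoost sketch with weighted decision stumps.
# WeightedStump, adaboost, and ensemble_predict are illustrative names,
# not part of any assigned interface; substitute your own learner.
import numpy as np

class WeightedStump:
    """Depth-1 decision tree trained to minimize weighted training error."""
    def fit(self, X, y, w):
        n, d = X.shape
        best_err = np.inf
        for j in range(d):                      # try each feature
            for thresh in np.unique(X[:, j]):   # and each split point
                for sign in (1, -1):            # and both polarities
                    pred = sign * np.where(X[:, j] <= thresh, 1, -1)
                    err = np.sum(w[pred != y])  # distribution-weighted error
                    if err < best_err:
                        best_err = err
                        self.j, self.thresh, self.sign = j, thresh, sign
        return self

    def predict(self, X):
        return self.sign * np.where(X[:, self.j] <= self.thresh, 1, -1)

def adaboost(X, y, rounds=50):
    """Return the weak hypotheses and their vote weights (alphas)."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # uniform initial distribution
    hyps, alphas = [], []
    for t in range(rounds):
        h = WeightedStump().fit(X, y, D)        # train on the distribution
        pred = h.predict(X)
        eps = np.sum(D[pred != y])              # weighted training error
        if eps == 0 or eps >= 0.5:              # weak-learning assumption fails
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        D /= D.sum()                            # renormalize to a distribution
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas

def ensemble_predict(hyps, alphas, X):
    """Weighted majority vote of the weak hypotheses."""
    votes = sum(a * h.predict(X) for h, a in zip(hyps, alphas))
    return np.sign(votes)
```

If you choose the resampling option instead, you could draw a bootstrap sample from the training set according to D (e.g., with np.random.choice) each round and train an unweighted learner on that sample.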
Run your boosted classifier on the same data sets that you used in the previous homeworks. Keep track of your error on the training set after each round of boosting, and use these values to generate a plot like those in Figure 4.9 on p. 110: training error versus boosting round, where error is the sample error on the set [p. 130], not squared error. (For extra credit, you may also plot error on an independent validation set.) Also report your final error results on the test set, either with confidence intervals or with an ROC curve. When did the training error go to zero? Did overfitting occur?
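One hedged sketch of how the per-round error curve and a confidence interval could be computed, assuming the adaboost helper from the sketch above; X_train, y_train, and the output file name are placeholders for your own data sets:

```python
# Sketch of per-round error tracking and a sample-error confidence
# interval; adaboost() is the hypothetical helper sketched earlier,
# and X_train / y_train are placeholders for your own data.
import numpy as np
import matplotlib.pyplot as plt

def error_per_round(hyps, alphas, X, y):
    """Sample error of the partial ensemble after each boosting round."""
    errs = []
    votes = np.zeros(len(y))
    for h, a in zip(hyps, alphas):
        votes += a * h.predict(X)                  # add round t's weighted vote
        errs.append(np.mean(np.sign(votes) != y))  # fraction misclassified
    return errs

def confidence_interval(err, n, z=1.96):
    """Approximate 95% CI for a sample error measured on n examples."""
    half = z * np.sqrt(err * (1 - err) / n)
    return err - half, err + half

hyps, alphas = adaboost(X_train, y_train, rounds=100)
train_errs = error_per_round(hyps, alphas, X_train, y_train)

plt.plot(range(1, len(train_errs) + 1), train_errs)
plt.xlabel("Boosting round")
plt.ylabel("Training sample error")
plt.savefig("boosting_error.png")
```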
The following problem is only for students registered for CSCE 878. CSCE 478 students who do it will receive extra credit, but the amount will be less than the number of points indicated.
Last modified 16 August 2011; please report problems to sscott.