CSCE 478/878 (Fall 2014) Homework 1

Assigned Thursday, September 11
Due Sunday, September 28 at 11:59 p.m.

When you hand in your results from this homework, you should submit the following, in separate files:

  1. A zip file called, where username is your username on cse. In this zip file, put:
  2. A single .pdf file with your writeup of the results for all the homework problems. Only pdf will be accepted, and you should only submit one pdf file, with the name username.pdf, where username is your username on cse. Include all your plots in this file, as well as a detailed summary of your experimental setup, results, and conclusions. If you have several plots, you might put a few example ones in the main text and defer the rest to an appendix. Remember that the quality of your writeup strongly affects your grade. See the web page on "Tips on Presenting Technical Material".

Submit everything by the due date and time using the web-based handin program.

On this homework, you must work on your own and submit your own results written in your own words.

  1. Assume that you are given a set of training data {(5, +), (8, +), (7, +), (1, −), (12, −), (15, −)}, where each instance consists of a single, integer-valued attribute and a binary label. The label comes from a function C, which is represented as a single interval. Formally, the interval is represented by two points a and b, and an instance x is labeled as positive if and only if a ≤ x ≤ b.
    1. (2 pts) Describe a hypothesis that is consistent with the training set.
    2. (3 pts) Describe the version space and compute its size.
    3. (5 pts) Suppose that your learning algorithm is allowed to pose queries, in which the algorithm can ask a teacher the label C(x) for any instance x. Specify a query x1 that is guaranteed to reduce the size of the version space, regardless of the answer. Specify a query x2 that is guaranteed to not change the size of the version space, regardless of the answer.
    4. (7 pts) Suppose that you start with a size-three training set {(x1, −), (x2, +), (x3, −)}, where 0 < x1 < x2 < x3. Describe a series of fewer than 2 log₂ x3 queries that will reduce the size of the version space to a single hypothesis.
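
The interval representation in Problem 1 can be explored with a short Python sketch. It is only an illustration under stated assumptions: the names `predict`, `is_consistent`, and `version_space` are hypothetical, and the enumeration assumes integer endpoints within a bounded range, which the problem statement does not require.

```python
from itertools import product

# Training set from Problem 1: (attribute value, label), True = positive.
TRAIN = [(5, True), (8, True), (7, True), (1, False), (12, False), (15, False)]

def predict(a, b, x):
    """Interval hypothesis [a, b]: x is positive iff a <= x <= b."""
    return a <= x <= b

def is_consistent(a, b, examples):
    """True if the interval [a, b] labels every example correctly."""
    return all(predict(a, b, x) == label for x, label in examples)

def version_space(examples, lo, hi):
    """All consistent (a, b) with integer endpoints lo <= a <= b <= hi.

    The bounds lo and hi and the integer restriction are assumptions
    made so that the space can be enumerated by brute force.
    """
    return [(a, b) for a, b in product(range(lo, hi + 1), repeat=2)
            if a <= b and is_consistent(a, b, examples)]
```

A call such as `is_consistent(5, 8, TRAIN)` checks one candidate interval, and adding a queried example to `TRAIN` before calling `version_space` again shows whether that query shrank the set of consistent hypotheses.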

  2. Give a decision tree to represent each of the following boolean functions (⊕ is exclusive OR, ∼ is negation):
    1. (2 pts) A ∧ [∼ B]
    2. (2 pts) A ∨ [B ∧ C]
    3. (2 pts) A ⊕ B
    4. (2 pts) [A ∧ B] ∨ [C ∧ D]
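
One way to check a hand-drawn tree against its boolean function is to encode the tree and compare it with the function on every truth assignment. The sketch below does this in Python for A ∧ [∼ B]; the tuple encoding and the names `classify` and `matches` are assumptions chosen for illustration, and the tree shown is only one of several valid trees.

```python
from itertools import product

# A tree is either a bool leaf or a tuple
# (attribute, subtree_if_false, subtree_if_true).
# Hypothetical encoding of one possible tree for A AND (NOT B):
# test A first, and test B only when A is true.
TREE_A_AND_NOT_B = ("A", False, ("B", True, False))

def classify(tree, assignment):
    """Walk the tree using a dict such as {"A": True, "B": False}."""
    while not isinstance(tree, bool):
        attribute, if_false, if_true = tree
        tree = if_true if assignment[attribute] else if_false
    return tree

def matches(tree, function, names):
    """Compare the tree with a boolean function on every truth assignment."""
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if classify(tree, assignment) != function(**assignment):
            return False
    return True
```

For example, `matches(TREE_A_AND_NOT_B, lambda A, B: A and not B, ["A", "B"])` runs the comparison over all four assignments of A and B.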

  3. Consider the following set of training examples and answer the questions below. Show your work.
    Instance  Label  a1  a2
    1         +      T   T
    2         +      T   T
    3         −      T   F
    4         +      F   F
    5         −      F   T
    6         −      F   T
    1. (5 pts) What is the entropy of the data set?
    2. (5 pts) Which attribute (a1 or a2) would the algorithm ID3 choose next?
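
The quantities in Problem 3 can be computed with a few lines of Python. This is a minimal sketch, not a reference solution: the function names are hypothetical, and the labels of instances 3, 5, and 6 are assumed to be negative, since the label is binary and those rows carry no "+".

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on rows[i][attribute]."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Table from Problem 3. ASSUMPTION: instances 3, 5, and 6 are negative.
ROWS = [{"a1": "T", "a2": "T"}, {"a1": "T", "a2": "T"}, {"a1": "T", "a2": "F"},
        {"a1": "F", "a2": "F"}, {"a1": "F", "a2": "T"}, {"a1": "F", "a2": "T"}]
LABELS = ["+", "+", "-", "+", "-", "-"]
```

ID3 splits on the attribute with the largest gain, so comparing `information_gain(ROWS, LABELS, "a1")` with `information_gain(ROWS, LABELS, "a2")` is enough to check a hand calculation.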

  4. (40 pts) Implement the ID3 algorithm. You will train and test your algorithm on three different data sets from the UCI Machine Learning Repository. You must use both the "Congressional Voting Records" and "Monks I" data sets. For the third data set, you may choose any that you wish, but you should note the following when making your selection.
  5. Let U1, U2, and U3 be your three data sets. From each set Ui remove 30 randomly-selected examples, placing them in set Ti (Ti will serve as the test set for experiment i). We will refer to the set of examples left over in Ui as Di, i.e., Di is the set of examples in Ui that are not in Ti. Use Di to train a decision tree with your ID3 implementation, and test that tree on Ti. Report the accuracies for each data set in your writeup.
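
As a rough outline of the train-and-test loop described above, the Python sketch below builds a tree by greedy information-gain splitting and measures accuracy and tree size. It is one possible shape for an implementation, not the required one. It assumes discrete attributes stored in dicts, labels that are not tuples, and a majority-label fallback for attribute values never seen in training; all function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain(rows, labels, attr):
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Return a leaf label or a node (attr, {value: subtree}, majority_label)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    children = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attrs if a != best])
    return (best, children, majority)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, children, majority = tree
        # Value never seen in training: fall back to the node's majority
        # label (a design assumption, not part of the assignment text).
        if row[attr] not in children:
            return majority
        tree = children[row[attr]]
    return tree

def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

def tree_size(tree):
    """Number of nodes in the tree, counting leaves."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1].values())
```

With training data from Di and held-out data from Ti, `accuracy(id3(train_rows, train_labels, attrs), test_rows, test_labels)` gives the test accuracy, and `tree_size` gives one measure of how large the learned tree is.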

    You are to submit a detailed, well-written report, with conclusions that you can justify with your results. In particular, you should attempt to answer the following questions. How large a tree did ID3 produce for each learning problem? Did overfitting occur (why do you think so or not)? Would pruning the tree reduce overfitting? In your report, you should also discuss how you randomly selected the test sets Ti. From your report, the reader should be able to get enough information to repeat your experiments, and the reader should be convinced that your methods are sound, e.g., that your sampling methods are sufficiently random. Refer to Numerical Recipes if you have questions about simulating random processes.
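
One repeatable way to draw the test sets is to sample indices without replacement from a seeded generator and to report the seed. The Python sketch below illustrates the idea; the function name `holdout_split` and the default seed are assumptions, and whether Python's built-in generator is random enough for your purposes is something your report should still argue.

```python
import random

def holdout_split(examples, test_size=30, seed=0):
    """Return (train, test), with test_size examples drawn uniformly
    at random without replacement."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    indices = list(range(len(examples)))
    test_idx = set(rng.sample(indices, test_size))
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [ex for i, ex in enumerate(examples) if i in test_idx]
    return train, test
```

Calling `holdout_split(U, test_size=30, seed=s)` with the same seed `s` always returns the same pair (Di, Ti), which lets a reader reproduce the experiment exactly.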

    Extra credit opportunities for this problem include (but are not limited to) running on extra data sets, varying the training set size to gauge overfitting, handling continuous-valued attributes, and handling unspecified attribute values. The amount of extra credit is commensurate with the level of extra effort and the quality of your report of the results.

    A note on the Congressional Voting Records data. Each vote result (attribute) has three possible values: "yea", "nay", and "unknown disposition". The third value can be treated as a legitimate attribute value; it does not have to be considered unspecified, since there is valuable information in it.
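
Treating the third value as an ordinary attribute value means the loader should pass it through unchanged rather than mapping it to a missing-value marker. The Python sketch below assumes a comma-separated layout with the class label first and the votes encoded as `y`, `n`, and `?`; check this against the data file you download. The function name `load_votes` and the attribute names `vote1`, `vote2`, ... are hypothetical.

```python
import csv
import io

def load_votes(text):
    """Parse comma-separated rows: class label first, then one value per vote."""
    rows, labels = [], []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        labels.append(record[0])
        # Keep "?" as an ordinary third value instead of treating it as missing.
        rows.append({f"vote{i}": value
                     for i, value in enumerate(record[1:], start=1)})
    return rows, labels

# Made-up two-line sample with three votes, for illustration only.
SAMPLE = "republican,n,y,?\ndemocrat,?,y,n\n"
```

Here `load_votes(SAMPLE)` yields one dict per record, and the `?` entries appear as regular values that a decision tree can branch on.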

    The following problem is only for students registered for CSCE 878. CSCE 478 students who do it will receive extra credit, but the amount will be less than the number of points indicated.

  6. (25 pts) Rerun your ID3 experiments with rule post-pruning, testing on the same data sets as before. You will therefore need to set aside at least 30 of your training examples from Di for validation during pruning. Report accuracies as before to determine the decrease in test error that rule post-pruning yields. (To make your comparisons fair with the non-pruning case, you are advised not to train on the validation set in Problem 4, i.e., build the tree on the exact same set of examples Di for this problem and Problem 4.)

Last modified 09 October 2014; please report problems to sscott.