CSCE 478/878 (Fall 2016) Homework 1
Assigned Thursday, September 8
Due Sunday, September 25 at 11:59 p.m.
When you hand in your results from this homework,
you should submit the following, in separate
files:
- A zip file
called username.zip, where username is
your username on cse. In this zip file, put:
- Source code in the language of your choice (in plain text files).
- A makefile and a README file facilitating compilation and running
of your code (include a description of command line options).
If we cannot easily re-create your experiments, you might not get full
credit.
- All your data and results (in plain text files).
- A single .pdf file with your writeup of the results for all the
homework problems.
Only PDF will be accepted, and you should submit only one PDF
file, named username.pdf, where username is
your username on cse. Include all your plots in this file, as
well as a detailed summary of your experimental setup, results,
and conclusions. If you have several plots, you might put a few
example ones in the main text and defer the rest to an appendix.
Remember that the quality of your writeup strongly affects your grade.
See the web page on
"Tips
on Presenting Technical Material".
Submit everything by the
due date and time using the
handin program.
On this homework, you must work with your homework partner.
- Assume that you are given a set of training data {(5, +),
(8, +),
(7, +),
(1, −),
(12, −),
(15, −)}, where each instance consists of a single, integer-valued attribute and a binary label.
The label comes from a function C, which is represented as a single interval. Formally,
the interval is represented by two points a and b, and an instance x is
labeled as positive if and only if a ≤ x ≤ b.
- (2 pts) Describe a hypothesis that is consistent with the training set.
- (3 pts) Describe the version space and compute its size.
- (5 pts) Suppose that your learning algorithm is allowed to pose queries, in which
the algorithm can ask a teacher the label C(x) for any instance x. Specify a query
x1 that is guaranteed
to reduce the size of the version space, regardless of the answer. Specify a query
x2 that is guaranteed not to change the size of the version space, regardless of the answer.
- (7 pts) Suppose that you start with a size-three training set
{(x1, −),
(x2, +),
(x3, −)}, where
0 < x1 < x2 < x3.
Describe a series of fewer than 2(log2 x3) queries that will reduce the size of the version
space to a single hypothesis.
- Give a decision tree to represent each of the following boolean functions (⊕ is exclusive
OR, ∼ is negation):
- (2 pts) A ∧ [∼ B]
- (2 pts) A ∨ [B ∧ C]
- (2 pts) A ⊕ B
- (2 pts) [A ∧ B] ∨ [C ∧ D]
- Consider the following set of training examples and answer the questions below. Show your work.
Instance   Label   a1   a2
--------   -----   --   --
   1         +     T    T
   2         +     T    T
   3         −     T    F
   4         +     F    F
   5         −     F    T
   6         −     F    T
- (5 pts) What is the entropy of the data set?
- (5 pts) Which attribute (a1 or a2) would the algorithm ID3 choose next?
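For reference, the entropy and information-gain computations that ID3 relies on
can be sketched as follows (Python; the example representation and function
names are illustrative assumptions, not part of the assignment):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels, e.g. ['+', '+', '-']."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, attribute):
        """Expected entropy reduction from splitting `examples` on `attribute`.
        Each example is assumed to be a (dict_of_attribute_values, label) pair."""
        labels = [label for _, label in examples]
        total = len(examples)
        remainder = 0.0
        for value in {attrs[attribute] for attrs, _ in examples}:
            subset = [label for attrs, label in examples if attrs[attribute] == value]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder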
- (40 pts)
Implement the ID3 algorithm.
You will train and test your algorithm on three different data sets from the
UCI Machine
Learning Repository. You must use both the "Congressional Voting Records" and
"Monks I" data sets. For the third data set, you may choose any
that you wish, but you should note the following
when making your selection.
- You will use these same three data sets in future homeworks.
A third data set with more than two classes poses no problem for ID3,
but in future homeworks you may have to adapt your binary classifiers
to work with multiclass data. This is possible, but requires a little
more work, which will also earn extra credit.
- It is better to use larger data sets (though not enormous)
so you can more easily split the
data into training and testing (and validation)
sets. You will set aside at least 30 examples
per set for testing (and validation). Thus if your data set has only,
e.g., 50 examples, very few are left over for training, and you may not get a
very good result.
- Beware of data sets with unspecified attribute values. If you opt
for such sets and cleverly handle these cases, you will receive extra
credit.
Let U1, U2, and
U3 be your three data sets.
From each set Ui remove 30 randomly-selected examples,
placing them in set Ti (Ti will
serve as the test set for experiment i). We will refer to
the set of examples left over in Ui as Di,
i.e., Di is the set of examples in Ui that
are not in Ti.
Use Di to train a decision tree with your ID3 implementation, and
test that tree on Ti. Report the accuracies for each data set in
your writeup.
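One possible way to draw each Ti is sketched below (the seed value, function
name, and data representation are illustrative assumptions, not requirements):

    import random

    def split_test_set(examples, test_size=30, seed=478):
        """Randomly set aside `test_size` examples as T_i; the remainder forms D_i."""
        rng = random.Random(seed)    # fixed seed so the split can be reproduced
        shuffled = list(examples)    # copy so the original order is untouched
        rng.shuffle(shuffled)
        return shuffled[test_size:], shuffled[:test_size]   # (D_i, T_i)

Recording the seed (or the indices of the drawn examples) in your report is one
way to let a reader repeat the split exactly.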
You are to submit a detailed,
well-written
report, with conclusions that you can justify with your results. In particular, you
should attempt to
answer the following questions. How large a tree did ID3 produce for each
learning problem? Did overfitting occur (i.e., is there a different hypothesis
that generalizes better)? Would pruning the tree reduce overfitting?
In your report, you should also discuss how you randomly selected the test sets
Ti. From your report, the reader should be able to get enough
information to repeat your experiments, and the reader should be convinced that
your methods are sound, e.g., that your sampling methods are sufficiently
random. Refer to
Numerical
Recipes if you have questions about simulating random
processes.
Extra credit opportunities for this problem include (but are not
limited to) running on extra data sets, varying the training set size to
gauge overfitting, handling continuous-valued
attributes, and handling unspecified attribute values. The amount of
extra credit is commensurate with the level of extra effort and the
quality of your report of the results.
A note on the Congressional Voting Records data.
Each vote result (attribute) has three
possible values:
"yea",
"nea", and
"unknown disposition". The third value can be treated as
a legitimate attribute value; it does not have to be considered unspecified, since there
is valuable information in it.
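For instance, assuming the standard house-votes-84.data layout from the UCI
repository (class label first, then 16 comma-separated votes), a loader can
simply keep the third value as-is (a sketch; all names are illustrative):

    def load_votes(path):
        """Read house-votes-84.data; each vote keeps one of three values: 'y', 'n', '?'."""
        examples = []
        with open(path) as f:
            for line in f:
                fields = line.strip().split(',')
                if len(fields) < 2:
                    continue                  # skip blank lines
                label, votes = fields[0], fields[1:]
                attrs = {'vote%d' % i: v for i, v in enumerate(votes)}
                examples.append((attrs, label))
        return examples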
- (25 pts) Update your program to convert your tree into a set of rules and then perform
rule post-pruning. Then
rerun your ID3 experiments with rule post-pruning, testing
on the same data sets as before. You will therefore need to set aside at least
30 of your training examples from Di for validation during pruning. Report
accuracies as before to determine the decrease in test error that rule post-pruning
yields. (To make your comparisons fair with the non-pruning case, you are
advised not to train on the validation set in Problem 4 either, i.e., build the tree on
the same training examples, namely Di minus the validation examples, for this problem and for Problem 4.)
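A minimal sketch of the pruning step for a single rule follows (the tree-to-rule
conversion and the accuracy function are assumed to exist elsewhere in your own
code; all names are hypothetical):

    def prune_rule(rule, validation_set, accuracy):
        """Greedily drop preconditions as long as validation accuracy does not decrease.
        `rule` is (preconditions, predicted_label); `accuracy` scores a rule on a data set."""
        preconds, label = rule
        best = accuracy((preconds, label), validation_set)
        improved = True
        while improved and preconds:
            improved = False
            for p in list(preconds):
                candidate = [q for q in preconds if q != p]
                score = accuracy((candidate, label), validation_set)
                if score >= best:            # keep the shorter rule on ties
                    preconds, best, improved = candidate, score, True
                    break
        return (preconds, label)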