Introduction

LassoWrapper provides a simple way to run and test LASSO Learner API components from the command line. It offers only a basic interface to the component.

It is very important to maintain a clear division of responsibility between the wrapper and the Learner component. The Learner component should be written (for the purposes of this exercise) so that it receives only numeric feature data. Any permutation or mapping of attributes should be performed in LearnerWrapper, not in the LASSO component. Similarly, the LASSO Learner component should NOT perform console IO aside from debug messages for your own use. All interaction with the user should be performed in LassoWrapper.
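As an illustration of keeping attribute mapping in the wrapper, here is a minimal sketch of a value-to-code mapper the wrapper might use before handing data to the learner. FeatureMapper and its method names are hypothetical, not part of the provided code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps each distinct string value of a text attribute to a small integer,
// so the learner component only ever sees numeric feature vectors.
public class FeatureMapper {
    private final Map<String, Integer> codes = new HashMap<>();

    // Returns a stable numeric code for a text value, assigning a new
    // code the first time a value is seen.
    public int codeFor(String value) {
        return codes.computeIfAbsent(value, v -> codes.size());
    }

    public static void main(String[] args) {
        FeatureMapper sky = new FeatureMapper();
        List<Integer> row = new ArrayList<>();
        for (String v : new String[] {"sunny", "rainy", "sunny"}) {
            row.add(sky.codeFor(v));
        }
        System.out.println(row); // [0, 1, 0]
    }
}
```

One such mapper per text attribute keeps codes consistent between training and testing, provided the mapping itself is stored (or rebuilt identically) across runs.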

It is ABSOLUTELY ESSENTIAL to your future sanity that you do not modify the LASSO API functions. You may add private functions to the class, but do not change the parameters or return types of any existing function. This cannot be overemphasized. You may make other changes to LassoWrapper (beyond those required) if you'd like, although changing its general operation, including the command-line arguments, is not recommended.

Steps

  1. Download LearnerWrapper and DumbLearner in either C++ or Java formats.
  2. Run LearnerWrapper with DumbLearner and understand the enjoysport2 file format and names file.
  3. Download UCI data. Examine the file structure. Make any changes to LearnerWrapper needed to handle your datasets. You do not need to maintain compatibility with enjoysport2; it is merely provided as an example. Note that enjoysport2 has space-delimited features, but your UCI data may be delimited some other way. You may adjust the existing string tokenization functions or write your own (up to you).
  4. Ensure that you are properly mapping and parsing the file by putting print statements in the train and/or classify functions in the API: print out the feature vector and make sure that it is being produced correctly.
  5. Create your own classifier class. You may want to start with just a blank API or you may want to copy DumbLearner. Rename the file to whatever you want. Adjust the Makefile to reflect the new name. Adjust LearnerWrapper to invoke your class instead of DumbLearner.
  6. Build a decision tree classifier as specified in the homework.
  7. Build a disk serializer.
  8. Modify LearnerWrapper to generate results. You may find it useful to save your results to a file. You may want to build functions which calculate the error.
  9. Run experiments using UCI data.
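For step 8, an error-calculation helper can be as simple as the following sketch; ErrorCalc is a hypothetical name, not part of the provided code.

```java
// Hypothetical helper: fraction of predictions that disagree with the
// true labels, i.e. the classification error rate.
public class ErrorCalc {
    public static double errorRate(int[] predicted, int[] actual) {
        int wrong = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] != actual[i]) wrong++;
        }
        return (double) wrong / predicted.length;
    }

    public static void main(String[] args) {
        // One disagreement out of four examples -> 0.25
        System.out.println(errorRate(new int[] {0, 1, 1, 0},
                                     new int[] {0, 1, 0, 0})); // 0.25
    }
}
```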

Performing Experiments with LassoWrapper

LassoWrapper comes with a simple set of test examples based on the enjoysport example in Tom Mitchell's book. In LassoWrapper and LASSO, training and testing are distinct operations in which any classifier built during training must be saved to the disk and then re-read during testing. To run the examples:
  C++:
    ./learnerWrapper train enjoysport2 3 2
    ./learnerWrapper test enjoysport2 3 2

  Java:
    java LearnerWrapper train enjoysport2 3 2
    java LearnerWrapper test enjoysport2 3 2
Arguments for both versions:

  1. Operation - either "train" or "test"
  2. Data Name - name of the dataset
  3. Feature Count - Number of features excluding labels
  4. Label Value Count - Number of possible labels in data
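The four arguments above might be parsed along these lines; WrapperArgs is a hypothetical illustration, not the actual LearnerWrapper code.

```java
// Holds the four command-line parameters described above.
public class WrapperArgs {
    public final String operation;      // "train" or "test"
    public final String dataName;       // dataset name, e.g. "enjoysport2"
    public final int featureCount;      // number of features, excluding labels
    public final int labelValueCount;   // number of possible label values

    public WrapperArgs(String[] args) {
        operation = args[0];
        dataName = args[1];
        featureCount = Integer.parseInt(args[2]);
        labelValueCount = Integer.parseInt(args[3]);
    }

    public static void main(String[] args) {
        WrapperArgs a = new WrapperArgs(
                new String[] {"train", "enjoysport2", "3", "2"});
        System.out.println(a.operation + " " + a.dataName + " "
                + a.featureCount + " " + a.labelValueCount);
    }
}
```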

File Structure

LearnerWrapper expects separate files for training and testing. Discussion of this follows later in the document. In addition, a ".names" file describes the expected formatting of both the training and test files. See enjoysport2.names for an example. The format is simple: each line describes either a "numeric" feature, a "text" feature (which will be mapped to a numeric one), or a "label". In the example LearnerWrapper, the label is expected to be the last field. You may want to enhance this if you use data that has the label first, for example.
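Assuming one declaration per line in the .names file (an assumption based on the description above, not on enjoysport2.names itself), locating the label column might look like this sketch:

```java
import java.util.List;

// Interprets a parsed .names file: each line declares one column as
// "numeric", "text" (to be mapped to a number), or "label".
// The one-declaration-per-line layout is an assumption.
public class NamesInfo {
    // Returns the column index declared as "label", or -1 if none is found.
    public static int labelIndex(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().equals("label")) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("text", "text", "text", "label");
        System.out.println(labelIndex(lines)); // 3 -- the label is last
    }
}
```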

Using UCI data with LearnerWrapper

Instead of trying to make UCI data files work directly with LearnerWrapper, it may be easier to create a utility that prepares the *.train and *.test files for you. Note that you need to randomly select examples from the original file; a simple program or script that selects examples randomly by line number should suffice.
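A minimal sketch of such a utility, assuming a whole-file shuffle and a 2/3 train split (both the ratio and the file-naming convention are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// One-off utility: randomly splits a data file into *.train and *.test.
public class SplitData {
    // Shuffles the examples with a fixed seed (for repeatable experiments)
    // and returns {train, test} using a 2/3 split.
    public static List<List<String>> split(List<String> lines, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed));
        int cut = copy.size() * 2 / 3;
        return List.of(copy.subList(0, cut), copy.subList(cut, copy.size()));
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<List<String>> parts = split(lines, 42L);
        Files.write(Paths.get(args[0] + ".train"), parts.get(0));
        Files.write(Paths.get(args[0] + ".test"), parts.get(1));
    }
}
```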

Handling Serialization

Since training and testing happen during different runtimes, you must store a representation of the classifier to disk and restore it. In our example DumbLearner, we simply calculate a set of probabilities and store those to disk. In the C++ version, we store the information in text format and restore it with simple C++ IO constructs. In the Java version, we use Object Serialization to store our data structures directly to disk. The file is created at the end of training (trainStop()) and is read at the start of classification (classifyStart()).
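A sketch of the Java-version approach using Object Serialization; the helper class and the file name "dumb.model" are assumptions for illustration, not the actual DumbLearner code:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Writes a serializable data structure at the end of training and reads
// it back before classifying, mirroring trainStop()/classifyStart().
public class ModelIO {
    public static void save(Map<Integer, Double> probs, String path)
            throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(probs);
        }
    }

    @SuppressWarnings("unchecked")
    public static Map<Integer, Double> load(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(path))) {
            return (Map<Integer, Double>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Integer, Double> probs = new HashMap<>();
        probs.put(0, 0.25);
        save(probs, "dumb.model");                      // end of trainStop()
        System.out.println(load("dumb.model").get(0));  // classifyStart()
    }
}
```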

Clearly, saving a tree structure to disk is more challenging than saving a set of probabilities. Here are some possible techniques:
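One possible technique in the Java version is to make the tree nodes themselves serializable, so Object Serialization writes the whole tree recursively in a single call. The TreeNode class below is a hypothetical sketch, not the homework's actual class:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Because children is itself an array of Serializable nodes, a single
// writeObject on the root serializes the entire tree.
public class TreeNode implements Serializable {
    private static final long serialVersionUID = 1L;

    int splitFeature = -1;                  // feature tested here; -1 at a leaf
    int label = -1;                         // label predicted at a leaf
    TreeNode[] children = new TreeNode[0];  // one child per feature value

    public static void saveTree(TreeNode root, String path) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(root);
        }
    }

    public static TreeNode loadTree(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(path))) {
            return (TreeNode) in.readObject();
        }
    }
}
```

In the C++ version, where no built-in serializer exists, an equivalent approach is a recursive pre-order write of each node's fields to a text file, rebuilding nodes in the same order when reading.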

Further Documentation

Web-based documentation for the C++ version is available here and the Java version here.