Homework 2: CSCE 970 (Spring 2009)

Assigned Monday, February 9
Due Friday, February 27 (originally Monday, February 23)
Total points: 70

When you hand in your results from this homework, you should submit the following, in separate files:

A single .tar.gz, .tar.Z, or .zip file called username.tar.gz (or username.tar.Z, etc.) where username is your username on cse. In this archive file, put:

Source code in the language of your choice (in plain text files).
A makefile and a README file facilitating compilation and running of your code (include a description of command line options). If I cannot easily re-create your experiments, you might not get full credit.
All your data (in plain text files), except for data that we provide to you.

A single .pdf file with your writeup of the results for all the homework problems, including the last problem. Only pdf will be accepted, and you should only submit one pdf file, with the name username.pdf, where username is your username on cse. Include all your plots in this file, as well as a detailed summary of your experimental setup, results, and conclusions. If you have several plots, you might put a few example ones in the main text and defer the rest to an appendix. Remember that the quality of your writeup strongly affects your grade. See the web page on "Tips on Presenting Technical Material".

Submit everything by the due date and time using the web-based handin program.

On this homework, you must work on your own and submit your own results written in your own words.

(15 pts) Do Exercise 3.6 on page 67 of Durbin's book.

(50 pts) In this exercise, you will implement a program to infer hidden Markov models via the Baum-Welch algorithm and to evaluate these models using log likelihood. As with Homework 1, you may assume that the sequences are the result of dice rolls from a 3-dice model. In contrast to Homework 1, your training data consist of multiple sequences (one per line in the input files) rather than one long sequence. Thus you should also include begin and end states in your model. You should use pseudocounts to initialize your models so there are no 0 probabilities.
You will use these two data sets: data1.txt and data2.txt. Each of these files consists of several lines. Each line is a sequence of outcomes of dice rolls (numbered 0–5). In contrast with Homework 1, no state information is embedded in these sequences, but you may assume that only three states emit symbols in this model (and the begin and end states determine each sequence's length).
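As a starting point, the input format and the pseudocount initialization described above might be sketched as follows. This is only an illustration under stated assumptions: the exact layout of each line in data1.txt and data2.txt is not specified here, so the parser simply keeps the digits 0-5 and ignores everything else, and the names `parse_sequences`, `random_model`, and the dictionary layout are hypothetical choices, not part of the assignment.

```python
import random

NUM_STATES = 3    # emitting states, as stated in the assignment
NUM_SYMBOLS = 6   # die faces, numbered 0-5


def parse_sequences(lines):
    """Turn each non-empty line into a list of ints in 0..5.

    Assumption: a line holds the digits 0-5, possibly separated by
    spaces or commas; every other character is ignored.
    """
    sequences = []
    for line in lines:
        rolls = [int(ch) for ch in line if ch in "012345"]
        if rolls:
            sequences.append(rolls)
    return sequences


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def random_model(rng, pseudocount=1.0):
    """Random starting HMM with explicit begin and end transitions.

    Each weight is a random draw plus a pseudocount, so no probability is 0.
    """
    def row(n):
        return normalize([rng.random() + pseudocount for _ in range(n)])

    return {
        # begin -> emitting state k
        "begin": row(NUM_STATES),
        # state k -> state l; the last column is k -> end
        "trans": [row(NUM_STATES + 1) for _ in range(NUM_STATES)],
        # state k emits symbol x
        "emit": [row(NUM_SYMBOLS) for _ in range(NUM_STATES)],
    }
```

With a file on disk, `parse_sequences(open("data1.txt"))` would yield one list of rolls per line. Because every transition row carries an extra column for the end state, the model also assigns a probability to each sequence's length.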

You will repeat the following steps three times for each of the two data sets. First, read in the data set. Second, randomly subsample half the sequences in the data set and use Baum-Welch to infer a hidden Markov model based on the subsampled half. Third, compute the log likelihood of seeing the other half of the data set given the model you just inferred. (Put another way, the subsampled half is the training set and the other half is the test set.) Once you've done this three times, graphically display in your report the model that has the highest likelihood. Then repeat this entire process for the second data set. Thus at the end, you will have built six models, computed six log likelihoods, and graphically displayed two models (one from data1 and one from data2) in your report.
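The split and evaluation steps above could look roughly like the sketch below. It assumes a model stored as a dictionary with `begin`, `trans`, and `emit` entries (a hypothetical layout chosen for illustration, with the end state as the last column of each transition row) and uses the scaled forward algorithm to compute the log likelihood. The Baum-Welch training itself is deliberately left out.

```python
import math
import random


def split_half(sequences, rng):
    """Randomly split sequences into (train, test) halves."""
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def log_likelihood(model, seq):
    """log P(seq | model) via the scaled forward algorithm.

    Assumed layout (hypothetical):
      model["begin"][k]    : P(begin -> k)
      model["trans"][k][l] : P(k -> l); the last column is P(k -> end)
      model["emit"][k][x]  : P(state k emits x)
    The sequence must be non-empty.
    """
    begin, trans, emit = model["begin"], model["trans"], model["emit"]
    n = len(begin)
    end = n  # column index of the end state in each transition row
    f = [begin[k] * emit[k][seq[0]] for k in range(n)]
    log_p = 0.0
    for x in seq[1:]:
        # Rescale to avoid underflow, keeping the log of the scale factor.
        scale = sum(f)
        log_p += math.log(scale)
        f = [f_k / scale for f_k in f]
        f = [emit[l][x] * sum(f[k] * trans[k][l] for k in range(n))
             for l in range(n)]
    # Finish with the transition into the end state.
    log_p += math.log(sum(f[k] * trans[k][end] for k in range(n)))
    return log_p


def total_log_likelihood(model, sequences):
    """Log likelihood of a set of independent sequences."""
    return sum(log_likelihood(model, s) for s in sequences)
```

A typical use would be `train, test = split_half(sequences, random.Random(seed))`, followed by training on `train` and calling `total_log_likelihood(model, test)`. When the number of sequences is odd, this sketch puts the extra sequence in the test half.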

You are to submit a detailed, well-written report, with conclusions. In particular, you should answer the following questions. How much variance was there in the measured log likelihoods for each model? Are you comfortable with running each training set three times and taking the maximum, or are more rounds necessary? Can you think of other ways to avoid getting trapped in local maxima? Of course, this is merely the minimum that is required in your report. Other experiments that you run and other interesting questions that you answer might yield extra points.
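One way to explore these questions is to run more rounds and look at the spread of the resulting log likelihoods. The sketch below is a generic harness under stated assumptions: `train_fn` and `score_fn` are hypothetical callables standing in for your own training and evaluation code, and random restarts are shown only as one common approach, not as the expected answer.

```python
import random
import statistics


def best_of_restarts(train_fn, score_fn, num_restarts, seed=0):
    """Run train_fn from several random starts; return (best_model, all_scores).

    train_fn(rng) -> model and score_fn(model) -> log likelihood are
    hypothetical callables supplied by the caller.
    """
    rng = random.Random(seed)
    best_model, best_score, scores = None, float("-inf"), []
    for _ in range(num_restarts):
        # Give each round its own generator so rounds are reproducible.
        model = train_fn(random.Random(rng.getrandbits(32)))
        score = score_fn(model)
        scores.append(score)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, scores


def spread(scores):
    """Sample mean and variance of log likelihoods (needs >= 2 values)."""
    return statistics.mean(scores), statistics.variance(scores)
```

Calling `spread(scores)` after, say, three rounds and again after more rounds gives a direct way to see whether the maximum is still changing as rounds are added.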

(5 pts) State how many hours you spent on each problem of this homework assignment.

Last modified 16 August 2011; please report problems to sscott.