CSCE 471/871 (Spring 2011): Homework 1

Assigned Friday, February 6
Due Friday, February 18 at 11:59 p.m.
Total points: 70

When you hand in your results from this homework, you should submit the following, in separate files:

A single tarball or zip file called username.tar.gz or username.zip, where username is your username on cse. In this archive, put:
- Source code in the language of your choice (in plain text files).
- A makefile and a README file facilitating compilation and running of your code (include a description of command line options). If we cannot easily re-create your experiments, you might not get full credit.
- All your data and results (in plain text files).
A single PDF file with your writeup of the results for all the homework problems. Only PDF will be accepted, and you should submit exactly one PDF file, named username.pdf, where username is your username on cse. Include all your plots in this file, as well as a detailed summary of your experimental setup, results, and conclusions. If you have several plots, you might put a few examples in the main text and defer the rest to an appendix. Remember that the quality of your writeup strongly affects your grade. See the web page on "Tips on Presenting Technical Material".

Submit everything by the due date and time using the web-based handin program.

On this homework, you must work on your own and submit your own results written in your own words.

(70 pts) In this assignment, you will compare BLAST with one of the (optimal) dynamic programming algorithms described in the text, in terms of both running time and the quality of the hits returned.

First, implement the dynamic programming algorithm that returns multiple local alignments, described in the "Repeated Matches" section of Durbin, pages 24–26. Then use it to search this database with the sequences stored in this file and this file. These files are all in FASTA format, a simple text-based file format: the name of a protein and other relevant information is given on a line that starts with ">" and ends with a newline, and everything after that newline, until the next line whose first character is ">", is an amino acid in the protein.

When you conduct the search, use the scoring matrices BLOSUM62 and PAM70. With both matrices, use a linear gap penalty of 4 per gap position (a gap of length g costs 4g). Also, use at least two different values of the threshold T for your algorithm. For all eight experiments (2 query sequences times 2 matrices times 2 values of T), report all the significant local alignments you get from your search, and report the average time to search each sequence in the database (normalized by sequence length). Finally, model your output format after that of BLAST.
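The FASTA description and the repeated-matches recurrence can be sketched as follows. This is a minimal illustration, not a reference solution: the names read_fasta, repeated_matches, and toy_score are hypothetical, the toy match/mismatch scorer merely stands in for a real BLOSUM62 or PAM70 lookup, and the code assumes d > 0 and T >= 0. Here x plays the role of the database sequence and y the query, and F[i][0] accumulates the scores of completed matches, each reduced by T.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def toy_score(a, b):
    # Assumption: placeholder scorer; swap in a BLOSUM62/PAM70 lookup.
    return 5 if a == b else -4


def repeated_matches(x, y, score, d, T):
    """Find non-overlapping matches of (parts of) y in x, each scoring above T.

    Returns (hits, total), where total = sum over hits of (hit score - T).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 2)]          # row n+1 holds only F(n+1, 0)
    move = [["S"] * (m + 1) for _ in range(n + 2)]     # S=start, D=diag, U=up, L=left
    src = [0] * (n + 2)   # src[i] = j if F[i][0] came from F[i-1][j] - T, else 0
    for i in range(1, n + 2):
        best, best_j = F[i - 1][0], 0
        for j in range(1, m + 1):
            if F[i - 1][j] - T > best:
                best, best_j = F[i - 1][j] - T, j
        F[i][0], src[i] = best, best_j
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j], move[i][j] = max(
                (F[i][0], "S"),
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "D"),
                (F[i - 1][j] - d, "U"),
                (F[i][j - 1] - d, "L"),
                key=lambda c: c[0],
            )
    # Traceback from F(n+1, 0), walking up column 0 and into each match.
    hits, i = [], n + 1
    while i > 0:
        j = src[i]
        i -= 1
        if j == 0:
            continue                      # x[i] was left unmatched
        end_i, end_j, top = i, j, F[i][j]
        ax, ay = [], []
        while move[i][j] != "S":
            mv = move[i][j]
            if mv == "D":
                ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
            elif mv == "U":
                ax.append(x[i - 1]); ay.append("-"); i -= 1
            else:
                ax.append("-"); ay.append(y[j - 1]); j -= 1
        hits.append({
            "score": top - F[i][0],
            "x_start": i + 1, "x_end": end_i,      # 1-based, inclusive
            "y_start": j + 1, "y_end": end_j,
            "x_aln": "".join(reversed(ax)), "y_aln": "".join(reversed(ay)),
        })
    hits.reverse()
    return hits, F[n + 1][0]
```

A call such as repeated_matches(db_seq, query, toy_score, d=4, T=10) returns the list of matches in left-to-right order along db_seq. The table uses O(nm) memory, which is the simplest choice for a sketch but may need rethinking for long database sequences.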

You will then compare your results to BLAST. The executable for running BLASTP (for searching protein sequences) is
/usr/bin/blastp
on cse, or you can download your own copy from http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download. For more information on running this and other BLAST tools, you can refer to the BLAST web page. You may use the default arguments, except for the following:
(1) use a gap open penalty of 6 and a gap extension penalty of 2;
(2) vary the window size (use values of 2 and 3);
(3) use both BLOSUM62 and PAM70 as your scoring matrices; and
(4) vary the threshold for extending hits (use the same values of T you used in the dynamic programming part).

When running blastp, you need to have a binary version of database.txt. You can create this using the BLAST program makeblastdb (also on cse), or you may download the files directly. As with your dynamic programming algorithm, for all sixteen experiments (2 query sequences times 2 matrices times 2 window sizes times 2 thresholds) you are to report all the significant local alignments you get from your search, and report the average time to search each sequence in the database (normalized by sequence length).
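One way to keep the sixteen BLAST runs organized is to generate the command lines programmatically. The sketch below assumes the BLAST+ option names (-query, -db, -matrix, -word_size, -threshold, -gapopen, -gapextend, -out, and for makeblastdb -in, -dbtype, -out); confirm them against blastp -help on your system. The query file names, database name, output naming scheme, and the two T values are placeholders, not values prescribed by the assignment.

```python
from itertools import product

BLASTP = "/usr/bin/blastp"


def makeblastdb_command(fasta="database.txt", name="database"):
    # Assumption: "database" is a hypothetical name for the binary database.
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", name]


def blastp_commands(queries, db, thresholds,
                    matrices=("BLOSUM62", "PAM70"), word_sizes=(2, 3)):
    """Build one argument list per (query, matrix, window size, T) combination."""
    commands = []
    for query, matrix, w, t in product(queries, matrices, word_sizes, thresholds):
        out = f"{query}.{matrix}.w{w}.T{t}.out"   # hypothetical naming scheme
        commands.append([
            BLASTP, "-query", query, "-db", db,
            "-matrix", matrix,
            "-word_size", str(w),
            "-threshold", str(t),
            "-gapopen", "6", "-gapextend", "2",
            "-out", out,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder query files and T values, for illustration only.
    for cmd in blastp_commands(["query1.fasta", "query2.fasta"], "database", [11, 13]):
        print(" ".join(cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) once the database has been built, which also makes it straightforward to wrap every run in a timer.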

A note on BLAST's score reports: BLAST rescales the scores of its hits before reporting them. For example, if a scoring matrix is in units of 1/3 bits, BLAST will scale the scores to full bits before reporting. You should take this into consideration when comparing hit scores. You can still compare the hits themselves directly without worrying about rescaling: for example, whether a DP hit is a supersequence of a BLAST hit does not depend on the scale. For more information on BLAST's rescaling procedure, see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
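To put a raw DP score on the same scale as a BLAST bit score, a small helper along these lines may be useful. It uses the standard normalization S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters BLAST reports for the chosen matrix and gap costs; the values in the usage note are illustrative only. The contains helper is a hypothetical convenience for the scale-free comparison of hit coordinates.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def contains(outer, inner):
    """True if the interval inner = (start, end) lies within outer (inclusive)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]
```

For a matrix in units of 1/3 bits, lam is ln(2)/3, so with K set to 1 purely for illustration, bit_score(30, math.log(2) / 3, 1.0) gives 10 bits. In practice, take lam and K from the BLAST report for the same matrix and gap penalties rather than guessing them.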

You are to submit a detailed, well-written report, with conclusions. In particular, you should answer the following questions. Were there alignments that BLAST missed that the dynamic programming algorithm found? Was missing these alignments worth the speedup that you observed in BLAST over DP? How much of a speedup was there? Of course, this is merely the minimum required in your report. Other experiments that you run (e.g. with other parameter values, scoring schemes, query sequences, or databases) and other interesting questions that you answer might yield extra points. Correctly modifying your dynamic programming algorithm to use the same affine gap penalty function that BLAST uses and rerunning your experiments will certainly yield extra points.
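The timing and speedup figures for the report could be gathered with a harness like the one below. It is a sketch under one reading of "average time to search each sequence, normalized by sequence length": time each database sequence separately, divide by its length, and average the results. The function names and the search(query, sequence) callback signature are assumptions made for illustration.

```python
import time


def mean_time_per_residue(search, query, database):
    """Average of (seconds to search a sequence / its length) over the database.

    database is an iterable of (name, sequence) pairs; empty sequences are skipped.
    """
    per_residue = []
    for _name, seq in database:
        if not seq:
            continue
        start = time.perf_counter()
        search(query, seq)
        per_residue.append((time.perf_counter() - start) / len(seq))
    return sum(per_residue) / len(per_residue) if per_residue else 0.0


def speedup(dp_seconds, blast_seconds):
    """How many times faster BLAST was than DP on the same workload."""
    return dp_seconds / blast_seconds
```

Since blastp searches the whole database in one invocation, a comparable BLAST number would probably have to come from the wall-clock time of the full run divided by the total number of residues; state in the writeup which normalization you used for each method.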


Last modified 16 August 2011; please report problems to sscott AT cse.