CSC 155 Assignment: DNA Sequence Alignment with Dynamic Programming Part II: Implementation and Testing Due Thursday 5/5, submit under designation "dna" For the actual implementation of the assignment, you can still help eachother, but each of you should write YOUR OWN programs. You MUST implement BOTH the "simple" and "advanced" scoring schemes as described in the handout, and do it in a modular manner (such as using inheritance). I will NOT accept two different versions of the program, one for each scoring scheme. I will also be looking at your code, and it should follow the outline we've agreed upon, unless there was a significant problem no one could have forseen. To test your program, you need to: 1. Print out the alignment and alignment score, such as: CATTAATTACACTCTCGCACTCAC-CACCAAACATCCTA-AACCCAGACAGGCCTCGACTCC | ||| | | ||||| || | || | | || || | ||| | ||| | -ACTAAACA-AGACTCGC-CTGTCTAACTAGGGAGTTTATAATGAACCGTGGCGTAGAC-CA Alignment score is 27 (sample was produced using the advanced scoring scheme) 2. Use small, hand crafted examples, such as (but not limited to): AATCG ATCAG (checks if it figures out where to put the gaps) ********* For at least one example, you need to show how your program generated output that corresponds to what you have worked out on paper. You need to be able to print the matrix generated by your program, and show that it corresponds to what you have done by hand. ********* 3. Use large, randomly generated sequences. Test your program several times before declaring that "it works". Be sure to use examples where the lengths of the sequences are not the same. Here's a function that will generate a random sequence of length n, and store the characters into an array (that already exist): #include void generate(char *A, int n) { int i, j; for(i=1;i<=n;i++) { j = random() % 4; switch (j) { case 0: A[i]='A'; break; case 1: A[i]='C'; break; case 2: A[i]='G'; break; default: A[i]='T'; break; } // switch } // for } ************** Other Hints ****************** Notice that in the function above, the loop started at index 1 (which means you need to allocate at least n+1 bytes to the array). This is to make the array coorespond to the matrix, which also must have an extra row and column. -------- Geeky C Trick: In pure C, there is no separate datatype for strings. A string is just an array of chars, so there is no need to create a separate string to represent the sequence if you need to print it. However, the array must have a 0 byte (not '0' but just 0) at the end of the string, to indicate where the string ended. You must insert this zero manually. That is, you can do: char *A = new char[102]; generate(A,100); A[101] = 0; cout << A << endl; ------- When using the random() function to generate random numbers, you'll find that it always generates the same numbers unless we call srandom(int) - to "seed" the random number generator. The method that I prefer is to use a command line argument for this purpose. That way, when I need a different set of data, I can just pass the program a different number, and if I want to reproduce the same data, I can use the same number: int main(int arc, char *argv[]) { int r = atoi(argv[1]); // 1st command line argument; srandom(r); ... ------ Here's a problem that have caused me trouble in the past: when dealing with a coordinate system, we usually think in terms of x, then y, or column then row (this is how the tutorial refers to the coordinates). But in the C/C++ (and Java) programming language, 2D arrays are stored in row-major order, and we typically give the row (y) coordinate first. Thus your program may behave in a way that's different from how you envision it. Be very careful, and consistent, with how you deal with array coordinates. ------ Finally, the tutorials did not cover one detail, which only revealed itself as I was writing my own program: During the "Traceback" stage of the algorithm, you can find yourself on either the left or top edge of the matrix (that is, when y==0 or x==0). Under the "simple" scoring scheme, this doesn't cause a problem, because there's no gap penalty. However, when you're using the advanced scheme, then can't just look at the top or left neighbors in the normal way, because they're values will be zero. At this point, you must adopt a different method for traceback: if y==0 but x>0, then you can only go left, and if y>0 and x==0, then you can only go up. Your program must deal with this situation, regardless of what scoring scheme it's using.