CSC 17 Lab: DNA Sequence Alignment with Dynamic Programming The Needleman-Wunsch-Sellers algorithm for DNA alignment is a classic application of Dynamic Programming. It can be adopted to more than just DNA analysis, including approximate string matching, voice matching, etc. The PROBLEM that the algorithm is trying to solve is the following: Given two sequences (not necessarily DNA sequences) A and B, compute a *score* that measures how similar (how close) they are to each other. This is called "alignment". It may be necessary to insert gaps in either sequence to obtain the optimal alignment. For example: what's the best way to align GATCTCG and GCCTCAG? We could just line them up character by character (assuming the sequences are strings) GATCTCT GCCTCAT But only two pairs of corresponding characters match each other. A better way to align them would be to insert some gaps: GATCTC_T GC_CTCAT Now five pairs of characters match. This is a "better" alignment. The goal is to find an optimal alignment and produce a numerical score: the score of the second alignment should be higher than the first. The two sequences do NOT have to be of the same length because gaps can always be inserted if necessary to produce alignments of the same length. This is an informal description. The algorithm is the following: Given a *score function* int score(int i, int k) that compares the ith char in string A and the kth char in string B and a "gap penalty" function int W(); // returns zero or negative to represent a penalty Different versions of Needleman Wunsch may require different types of arguments to the W function. For the part I of the lab, we assume that it does not require any arguments (but this will change). This function should return zero or a NEGATIVE number indicating a penalty. Assuming that there are Strings A and B, the following recursive function calculates the optimal alignment score between the first i chars in A and the first k chars in B: int D(int i, int k) { // goal is to compute the largest of three values: (assume D(0,0)=0) // D(i-1,k-1)+score(i,k) // D(i-1,k) + W() // D(i,k-1) + W() if (i<=0 && k<=0) return 0; // base case int a, b, c; a = b= c = 0x80000000; // smallest possible 32 bit signed integer if (i>0 && k>0) a = D(i-1,k-1) + score(i,k); if (i>0) b = D(i-1,k) + W(); if (k>0) c = D(i,k-1) + W(); return Math.max(a,Math.max(b,c)); // return largest of a,b,c }//D then to calculate the optimal alignment score between A and B: int ALIGNMENT_SCORE = D(A.length()-1,B.length()-1); To get the indices of characters in A and B to align properly with the arguments of D, we should first insert a dummy character infront of each string: A = "."+A; B = "."+B. //////// IMPORTANT! Now the first char of a sequence is A.charAt(1). We do this because the index 0 is reserved to represent "before" the first char, or the zero-length prefix of the sequence. Refer to class notes and web articles to understand the *logic* of the algorithm. Is the recursive function D the best way to implement this algorithm??? Review the principal stages and requirements of Dynamic Programming as described in the basic "Routes" example: 1. Define a recursive solution to the problem 2. Determine if the solution is: a. of exponential time complexity and therefore needs optimization b. "stateless," which means multiple calls will return the same value 3. Choose between top-down (memoization) or bottom-up dynamic programming. Memoization is preferred if not all values in a matrix are needed. 4. From the optimal solution matrix, we should be able to derive at least one optimal solution (there could be more than one equally- optimal solutions). This requires a "trace-back" stage. Stage 1: the recursive solution has been given to you as the function D Stage 2: It is not too hard to see that the algorithm runs in exponential time. Let n be the sum of the lengths of sequences A and B. Then the recurrence relation for the running time of D, in terms of its three recursive calls, is: T(n) = T(n-2) + 2T(n-1) + 1 T(n) is therefore between 2T(n-1)+1 and 3T(n-1)+1, both of which are exponential time (2**n and 3**n). If an O(3**n) algorithm takes 1 millisecond (0.001 second) for n=10, then for n=50 the algorithm will take approximately 385 million years. If you don't believe my math then you can try to prove me wrong experimentally (extra credit will be given after you're done). Futhermore, the function D is stateless because it does not change any external (non-local) variables. This assumes that the score function is also stateless and that sequences A and B do not change while D is running. These properties are easy to guarantee. Stage 3: Since the calculation of the each value depends on the the value to the left, the value above, and the value to the upper-left, you should be able to see that we can implement this algorithm most efficiently using BOTTOM-UP dynamic programming. ==== Write a superclass to implement the algorithm ==== This time, instead of giving you a superclass I WILL GIVE YOU SOME SUBCLASSES. Each subclass represents one possible scoring scheme. A scoring scheme is defined by the score(int i, int k) function and the int W() function. **Please call your class "absdna" as in "abstract dna" /// simple scoring scheme, no gap penalty class scheme1 extends absdna { //put this in a different file @Override public int score(int i, int k) { if (A.charAt(i) == B.charAt(k)) return 1; else return 0; } @Override public int W() { return 0; } // constructors public scheme1(int n, int m) { super(n,m); } public scheme1(String n, String m, boolean arefiles) {super(n,m,arefiles);} }// scheme1 class scheme2 extends absdna { // this is from the sample on wikipedia article @Override public int score(int i, int k) { if (A.charAt(i) == B.charAt(k)) return 1; else return -1; } @Override public int W() { return -1; } // constructors public scheme2(int n, int m) { super(n,m); } public scheme2(String n, String m, boolean arefiles) {super(n,m,arefiles);} }// scheme2 publc class scheme3 extends absdna { // another scheme, with a main @Override public int score(int i, int k) { if (A.charAt(i) == B.charAt(k)) return 2; else return 0; } @Override public int W() { return -1; } // constructors public scheme3(int n, int m) { super(n,m); } public scheme3(String n, String m, boolean arefiles) {super(n,m,arefiles);} public static void main(String[] args) { scheme1 S1 = new scheme1(10,11); scheme2 S2 = new scheme2("GATTACA","GCATGCG",false); scheme3 S3 = new scheme3("seq1.dna","seq2.dna",true); //load from files S1.align(); S2.align(); S3.align(); }//main }// scheme3 In order for these subclasses to compile your absdna class must contain: 1. String variables A and B representing the two sequences. These variables should be declared 'protected' so they can be accessed in a subclass 2. Virtual methods int score(int i, int k); and int W(); that can be overridden. 3. Two constructors, a. absdna(int n, int m): takes two integers and should generate A as random DNA sequence of length n, and B of length m. Use the provided 'random_dna' function given below. b. absdna(String a, String b, boolean arefiles): for this version of the constructor, the two strings a and b can either be actual sequences, or name files that contain sequences that must be loaded. Use the provided 'load_dna' function given below. For example, you can say ' new sample1("ACGTC","AGTC",false) ' or, for longer DNA sequences stored in files "seq1.dna", "seq2.dna", ' new sample1("seq1.dna","seq2.dna",true) ' 4. A method public void align(); This function must implement the Needleman-Wunsch algorithm in a way that works for any scoring scheme. It will probably need to call several other functions. It must ultimately produce output in the following format: possible output from S1: -CAGTAGGA-CG || || | | CCATTA-AATCA Alignment score: 6 possible output from S2: G-ATTACA | | | | GCA-TGCG Alignment score: 0 Note that there may be more than one optimal alignment, but the score should be the same for all of them. ////////////////// Some provided functions. The only actual code that I will give you are the following static functions. These are just provided to help you write the constructors and to debug your code by printing out 2D arrays for small examples. The functions require you to import java.io.*; import java.util.Optional; // ... place these in your absdna class: public static String random_dna(int n) { // random dna sequence of length n if (n<1) return ""; char[] C = new char[n]; char[] DNA = {'A','C','G','T'}; for(int i=0;i load_dna(String filename) { Optional answer = Optional.empty(); try { BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(filename))); answer = Optional.of(br.readLine()); br.close(); } // IOExceptions MUST be caught catch (IOException ie) {} return answer; }//load_dna // for debugging only! only call on small strings public static void printmatrix(String A, String B, int[][] M) { if (A==null || B==null || M==null || M[0]==null) return; if (A.length() != M.length || B.length() != M[0].length) return; int rows = A.length(); int cols = B.length(); System.out.print(" "); for(int i=0;i