Lecture 14 Algorithm Analysis Arne Kutzner Hanyang University / Seoul Korea
Sequence Alignments
Sequence Alignment Problem
Given two sequences a, b over some alphabet Σ.
Problem: Find some scheme so that a and b fit together.
Example:
  a = GATTACATAAGTTTT
  b = GCATGCUTGCTCTT
Possible alignment:
  G - A T T A C A T A A G - T T T T
  G C A T G - C U T - - G C T C T T
A column with two equal symbols is a match (e.g. G/G), a column with two different symbols is a mismatch (e.g. T/G), and a column containing "-" is a gap.
2016/11 Algorithm Analysis
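The three column types of the example alignment can be counted with a few lines of Python. This is only a small sketch; the function name classify_columns and the use of "-" as gap symbol are assumptions made for illustration.

```python
def classify_columns(top, bottom, gap="-"):
    """Count match, mismatch and gap columns of a given alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            counts["gap"] += 1        # one sequence has no symbol in this column
        elif x == y:
            counts["match"] += 1      # equal symbols
        else:
            counts["mismatch"] += 1   # different symbols
    return counts


# The alignment of a = GATTACATAAGTTTT and b = GCATGCUTGCTCTT from the slide
top = "G-ATTACATAAG-TTTT"
bottom = "GCATG-CUT--GCTCTT"
counts = classify_columns(top, bottom)
print(counts)
```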
Instances of the Alignment Problem
Global Alignment = end-to-end alignment
Local Alignment = best subsection alignment
Example: Align FTALLLAAV to FTFTALILLAVAV
[Figure: alignment of FTALLLAAV to FTFTALILLAVAV]
Needleman-Wunsch Algorithm
For global alignments
Input:
  Two strings a and b over some alphabet Σ
  A scoring system that defines:
    a bonus for matches and a penalty for mismatches
    a penalty for inserting a one-symbol gap into a
    a penalty for inserting a one-symbol gap into b (this is equivalent to deleting one symbol of a)
Technique used by the algorithm: dynamic programming based on a matrix computation
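Matrix Initialization
Initialize the border of the matrix with d * i and d * j respectively, i.e. $F_{i,0} = d \cdot i$ and $F_{0,j} = d \cdot j$, where d is the gap penalty.
[Figure: matrix spanned by a and b; with d = -1 the border cells hold -1, -2, -3, ...]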
Compute Cell Values
Each cell $F_{i,j}$ is computed from its three neighbours. Example with $F_{i-1,j-1} = 3$, $F_{i-1,j} = 2$ and $F_{i,j-1} = 1$:
  Match (+1):              $F_{i-1,j-1} + 1 = 4$
  Mismatch (-1):           $F_{i-1,j-1} - 1 = 2$
  Delete (gap in b) (-1):  $F_{i-1,j} - 1 = 1$
  Insert (gap in a) (-1):  $F_{i,j-1} - 1 = 0$
Only one of match and mismatch applies to a given cell, depending on whether the two symbols are equal.
Take the maximum of the applicable values as the value of $F_{i,j}$ and store the direction it came from (the blue arrow in the figure).
Pseudocode for Matrix Computation
NW-Matrix(A, B, S, d, F)
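A minimal Python sketch of the matrix computation, loosely following the signature NW-Matrix(A, B, S, d, F). It assumes that S is a function returning the match/mismatch score of two symbols, that a indexes the rows and b the columns, and that F is returned instead of being passed in; the helper simple_score is hypothetical.

```python
def nw_matrix(a, b, S, d):
    """Fill the Needleman-Wunsch matrix F for strings a (rows) and b (columns)."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Border initialization with multiples of the gap penalty d
    for i in range(1, m + 1):
        F[i][0] = d * i
    for j in range(1, n + 1):
        F[0][j] = d * j
    # Every inner cell is the maximum over its three neighbours
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = F[i - 1][j - 1] + S(a[i - 1], b[j - 1])  # match or mismatch
            delete = F[i - 1][j] + d                            # gap in b
            insert = F[i][j - 1] + d                            # gap in a
            F[i][j] = max(diagonal, delete, insert)
    return F


def simple_score(x, y):
    """Assumed static scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


F = nw_matrix("GATTACA", "GCATGCU", simple_score, -1)
print(F[-1][-1])  # score of the best global alignment
```

The sketch stores only the scores. The arrows can either be stored in a second matrix during the fill or be recomputed later from the neighbouring cell values.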
Example
[Figure: example matrix; legend: match, mismatch, insertion, deletion]
Alignment Computation on the Foundation of the Matrix
Start at the bottom-right cell (last row, last column) and follow the arrows until you reach the leftmost column or the topmost row; the remaining symbols of the other sequence are then aligned against gaps.
The path can be ambiguous, so there can be more than one best alignment.
Pseudocode for Alignment Computation
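The alignment computation can be sketched in Python as follows. The sketch is self-contained, so it repeats the matrix fill. As an assumption for brevity, it does not store the arrows but recomputes them from the neighbouring cells, and it prefers the diagonal arrow whenever several arrows are possible. The default scores (+1, -1, -1) are the ones used on the slides.

```python
def nw_align(a, b, match=1, mismatch=-1, d=-1):
    """Return (score, aligned a, aligned b) for one best global alignment."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = d * i
    for j in range(1, n + 1):
        F[0][j] = d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + d, F[i][j - 1] + d)

    # Walk back from the bottom-right cell to the top-left cell
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:      # diagonal arrow
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + d:    # arrow up: gap in b
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # arrow left: gap in a
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = nw_align("GATTACA", "GCATGCU")
print(score)
print(top)
print(bottom)
```

Because ties are resolved in a fixed order, the sketch reports only one of the possibly many best alignments.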
Example (cont.)
Following the arrows yields more than one alignment:
  G - A T T A C A
  G C A T G - C U
  (column 2: insertion, column 5: mismatch, column 6: deletion, column 8: mismatch)

  G - A T T A C A
  G C A T - G C U

  G - A T T A C A
  G C A - T G C U
Complexity Analysis
Let m = length(a) and n = length(b)
Matrix computation: $\Theta(m \cdot n)$
Alignment computation: $O(\max\{m, n\})$
Together: $\Theta(m \cdot n)$
In practice, this is quite expensive with respect to time as well as space.
Similarity Matrix
Static values for match and mismatch can be replaced by a similarity matrix.
[Figure: example similarity matrix]
In the field of Bio-IT, several predefined similarity matrices for amino acids exist:
  BLOSUM (BLOcks SUbstitution Matrix)
  PAM (Point Accepted Mutation)
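A similarity matrix can be represented as a simple lookup table. The values below are hypothetical and chosen only for illustration; they are not taken from BLOSUM or PAM. The names TOY_SIMILARITY, similarity and score_ungapped are assumptions as well.

```python
# Hypothetical similarity matrix over the alphabet {A, C, G, T}.
# The values are made up for this sketch, not taken from BLOSUM or PAM.
TOY_SIMILARITY = {
    "A": {"A": 2, "C": -2, "G": -1, "T": -2},
    "C": {"A": -2, "C": 2, "G": -2, "T": -1},
    "G": {"A": -1, "C": -2, "G": 2, "T": -2},
    "T": {"A": -2, "C": -1, "G": -2, "T": 2},
}


def similarity(x, y):
    """Look up the score for aligning symbol x with symbol y."""
    return TOY_SIMILARITY[x][y]


def score_ungapped(a, b):
    """Score two equally long strings column by column, without gaps."""
    return sum(similarity(x, y) for x, y in zip(a, b))


total = score_ungapped("GATT", "GACT")
print(total)
```

A function like similarity can replace the static match/mismatch values wherever a cell of the matrix is computed.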
Smith-Waterman Algorithm
For local alignments
Input:
  Two strings a and b over some alphabet Σ
  Similarity scoring scheme $s(a_i, b_j)$ over the alphabet Σ
  Gap-scoring scheme $W_i$
Output: Scoring matrix $H$
A variation of the Needleman-Wunsch algorithm that makes it work for local alignments.
Matrix Computation
The matrix H is built as follows:
[Formula: recurrence for H; the terms that differ from the NW-Alg. are highlighted]
where m is the length of a, and n is the length of b.
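The standard textbook form of the Smith-Waterman recurrence, written with the symbols s, W and H from the previous slide, is sketched below. It is an assumption here that $W_k$ denotes the (negative) score of a gap of length k. Compared with the NW-Alg., the border is initialized with 0 instead of multiples of d, and 0 appears as an additional candidate in the maximum.

```latex
\begin{aligned}
H_{i,0} &= 0, \quad 0 \le i \le m \\
H_{0,j} &= 0, \quad 0 \le j \le n \\
H_{i,j} &= \max
\begin{cases}
0 \\
H_{i-1,j-1} + s(a_i, b_j) \\
\max_{k \ge 1} \{ H_{i-k,j} + W_k \} \\
\max_{l \ge 1} \{ H_{i,j-l} + W_l \}
\end{cases}
\quad 1 \le i \le m,\ 1 \le j \le n
\end{aligned}
```

With a linear gap penalty, i.e. $W_k = k \cdot W_1$, the two inner maxima reduce to $H_{i-1,j} + W_1$ and $H_{i,j-1} + W_1$.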
Computation of alignments
Backtracking as in the NW-Alg., but with a significant difference: search for the cell with the highest score and start from there. Backtracking ends at the first cell with score 0.
[Figure: matrix H with the cell holding the highest score and the backtracking area]
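A compact Python sketch of the whole local alignment, under two assumptions: static match/mismatch scores instead of a similarity matrix, and a linear gap penalty (every gap symbol costs the same). The input strings are hypothetical and only serve to show that the shared substring is found.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # border stays 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # 0 is an additional candidate: a local alignment may start anywhere
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtracking starts at the cell with the highest score and stops at score 0
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = smith_waterman("xxGATTACAyy", "zzzGATTACAww")
print(best, top, bottom)
```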
Complexity Analysis
Same as for the Needleman-Wunsch Alg.:
Let m = length(a) and n = length(b)
Matrix computation: $\Theta(m \cdot n)$
Backtracking (alignment computation): $O(\max\{m, n\})$
Together: $\Theta(m \cdot n)$
In practice, this is quite expensive with respect to time as well as space.
How to overcome the demanding space-time requirements of NW-Alg. and SW-Alg.?
Many solutions … Long story …
Heuristic approaches:
  BLAST (one of the standard tools for sequence alignment nowadays)
  BLAT (fast, but considerably less sensitive than BLAST)
  BWT-based approaches such as Bowtie or BWA
Many of the above tools/algorithms rely on some form of seeding before starting the core alignment.
Seeding technique
Step 1: Somehow "digest" (break into smaller pieces) sequence b (let us call it the query sequence).
Step 2 (seeding): Align these short segments quickly using some form of precomputed dictionary that comprises data for sequence a (let us call a the reference sequence).
Step 3: Use the output of step 2 to limit the search space for further alignment activities.
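The three steps can be sketched in Python as follows. The sketch assumes the simplest possible variant: the query is cut into non-overlapping segments of fixed length k, the dictionary is a hash table of all length-k substrings of the reference, and only exact matches count as seeds. Real tools use more refined schemes; all function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Precomputed dictionary: every length-k substring -> its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seeds(query, index, k):
    """Steps 1 and 2: cut the query into k-mers and look each one up."""
    seeds = []
    for q in range(0, len(query) - k + 1, k):
        for r in index.get(query[q:q + k], []):
            seeds.append((q, r))   # (position in query, position in reference)
    return seeds


def area_of_interest(seeds, query_len, reference_len):
    """Step 3: section of the reference that can contain the query."""
    if not seeds:
        return None
    offsets = [r - q for q, r in seeds]   # where the query would start
    start = max(0, min(offsets))
    end = min(reference_len, max(offsets) + query_len)
    return start, end


reference = "TTTTTTGATTACAGGCATCCCCCC"
query = "GATTACAGGCAT"
k = 4
seeds = find_seeds(query, build_kmer_index(reference, k), k)
area = area_of_interest(seeds, len(query), len(reference))
print(seeds, area)
```

Only the section reference[start:end] then has to be passed to the expensive alignment algorithm.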
Example: Local alignment by seeding
1. Digest: cut the query sequence into segments (numbered 1 to 5 in the figure).
2. Align (seeding): locate the segments on the reference sequence.
3. Compute the area of interest: the section of the reference indicated by the seeds.
4. Cut and SW align: align the query sequence against this section of the reference to obtain the local alignment.
Efficient seeding technique
Suffix array: contains the starting positions of all suffixes of a string, sorted in lexicographical order of the suffixes.
Example: the word banana$
[Figure: the suffixes of banana$, sorted and stored as an array]
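For the word banana$ the suffix array can be computed with a naive Python sketch. It assumes plain sorting of all suffixes, which is easy to read but far too slow and memory-hungry for genome-sized strings.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
sa = suffix_array(text)
for pos in sa:
    print(pos, text[pos:])
```

The sentinel $ sorts before all letters, so the shortest suffix comes first.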
Search a suffix array …
Where is "an" in banana$?
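Because the suffixes are sorted, all suffixes that start with a given pattern form one contiguous block of the suffix array, which can be located by binary search. A minimal sketch, with hypothetical function names and a naive suffix array construction:

```python
def suffix_array(text):
    """Naive construction: sort the start positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text using binary search on sa."""
    k = len(pattern)

    # First suffix whose length-k prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-k prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana$"
sa = suffix_array(text)
hits = find_occurrences(text, sa, "an")
print(hits)
```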
How to search suffix arrays efficiently?
FM-Index:
  Makes use of the Burrows-Wheeler transform (BWT)
  Stores precomputed symbol counts in tabular form (occurrence table)
  Foundation of the aligners Bowtie and BWA
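The idea can be sketched in Python with a deliberately naive FM-index that counts the occurrences of a pattern by backward search. The sketch assumes that the text ends with a unique sentinel $ that is smaller than all other symbols, and it stores the full occurrence table, whereas real implementations sample and compress it. All names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (text must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(text):
    """Return (C, occ, n) for the text."""
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of symbols in the text that are smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in the first i symbols of the BWT
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count_occurrences(pattern, C, occ, n):
    """Backward search: process the pattern from its last symbol to its first."""
    lo, hi = 0, n                      # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


C, occ, n = build_fm_index("banana$")
print(bwt("banana$"))
print(count_occurrences("an", C, occ, n))
```

Each pattern symbol costs only two table lookups, so the search time depends on the pattern length and not on the length of the text.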
ITBE working group at Hanyang University
Projects in the joint field of Information Technology (Computer Science) and Microbiology (Genetics):
  Analysis of genes/gene families by using/combining available computational tools
  Development of specially tailored solutions/algorithms for specific kinds of problems
Example (big data analysis): Taxonomic heat diagrams that show the expression/occurrence of some gene with respect to some given taxonomy
Heat diagram for Gene FAM72
[Figure: example heat diagrams]