PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B 鍾承宏 B 王凱平 B 莊謹譽 B 張智翔 B 洪錫全 B 郭立翔

Agenda PatternHunter Spaced Seed Algorithm Performance PatternHunter II Algorithm Performance Translated PatternHunter

PatternHunter – Spaced Seed

Outline A short review of BLAST. Some definitions and background. What are the differences and similarities between BLAST and PatternHunter? Why is PatternHunter better? Nonconsecutive seeds. Proof.

Blast Algorithm Find seeded matches. Extend to HSPs (High-scoring Segment Pairs). Gapped extension by dynamic programming. Report significant local alignments.

A short review about BLAST Find hits. BLAST first scans the database for words that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit.

A short review about BLAST Find HSPs. An HSP (High-scoring Segment Pair) is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a relatively short distance of one another.

A short review about BLAST Generate gapped alignment This means that two or more HSPs in BLAST with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.

A short review about BLAST In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the resulting HSP has a normalized score of at least Sg bits, then a gapped extension is triggered.

Some definitions, some background. Similarity: how similar two sequences are, usually taken to mean the probability that the two sequences carry the same symbol at an aligned position. Sensitivity: the probability of finding a local alignment. Specificity: among all local alignments found, the fraction that are homologous.

Define the Seed Defining the seed: w -> weight, the number of positions required to match (Blastn: 11, MegaBlast: 28); model -> relative positions of the w required letters; m -> length of the model “window”. Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

m = 18, w = 11, model 111010010100110111: PatternHunter's most sensitive model. Seed parameters: 1 – exact match required; 0 – no match required, any value. Letters: 0, 1. The Blastn seed is all “1”s. Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

Seed, Hit, Homology What is a seed? Seeds determine how an algorithm looks for hits What is a hit? Hits indicate a similarity that may indicate a homology Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

Human-Mouse genome homology hit (figure: pairwise alignment of a human and a mouse DNA segment). Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

Example: Consider the following two sequences:
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
What is the difference between BLAST and PatternHunter in finding the seed? Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

BLAST uses “consecutive seeds” In BLAST, we often use the consecutive model with weight 11.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
However, the longest run of consecutive matches is 9, so the weight-11 consecutive seed fails to find the alignment between the two sequences. Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

Consecutive seeds There is also a dilemma for BLAST-type searches. Sensitivity needs shorter seeds, which produce too many random hits and slow computation. Speed needs longer seeds, which lose distant homologies. Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

PatternHunter uses “non-consecutive seed” In PatternHunter, we often use the spaced model with weight 11 and length 18.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no
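The comparison on the two example sequences can be checked mechanically. The following is a minimal sketch, not PatternHunter's implementation: it assumes the two sequences are already aligned position by position, and it writes the weight-11, length-18 model 111*1**1*1**11*111 with 0 in place of *. The function names are hypothetical.

```python
def match_vector(a, b):
    """1 where the two aligned sequences agree, 0 elsewhere."""
    return [int(x == y) for x, y in zip(a, b)]


def seed_hits(model, matches):
    """Start offsets where every '1' of the model lands on a match."""
    care = [k for k, c in enumerate(model) if c == "1"]
    return [
        i
        for i in range(len(matches) - len(model) + 1)
        if all(matches[i + k] for k in care)
    ]


S1 = "GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT"
S2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
matches = match_vector(S1, S2)

CONSECUTIVE = "1" * 11               # Blastn-style seed
SPACED = "111010010100110111"        # spaced model, weight 11, length 18

print(seed_hits(CONSECUTIVE, matches))  # no window of 11 consecutive matches
print(seed_hits(SPACED, matches))       # the spaced model finds a hit
```

The mismatches at the don't-care positions are exactly what lets the spaced model hit where the consecutive one cannot.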

Consecutive vs. Nonconsecutive? The non-consecutive seed is the primary difference and strength of PatternHunter. Blastn: 11111111111 PatternHunter: 111010010100110111 Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no

A trivial comparison between spaced and consecutive seed Consider 111 and 1101. To fail seed 111, we can use 110110110110…, which has 66.66% similarity. But we can prove that seed 1101 will hit every sufficiently long region with more than 61% similarity. Reference Ming Li, NHC2005

Proof Suppose there is a length-100 region which is not hit by 1101. We can break the region into blocks of the form $1^a0^b$. A run of four or more 1s would contain a hit, so $a \le 3$. Besides the last block, the other blocks fall into the following few cases: $10^b$ for $b \ge 1$; $110^b$ for $b \ge 2$; $1110^b$ for $b \ge 2$. In each block, similarity $\le 3/5$. The last block has at most 3 matches. So, in total there are at most 61 matches in 100 positions. The similarity is $\le 61\%$. Reference Ming Li, NHC2005
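The bound in this proof can be cross-checked by a small dynamic program. This is an illustrative sketch rather than anything from the slides: it computes the largest number of matches a 0/1 region of length n can contain without being hit by a given seed, remembering only the last len(seed) - 1 bits. The function name is hypothetical.

```python
def max_matches_without_hit(seed, n):
    """Largest number of 1s in a length-n 0/1 string that the seed never hits."""
    span = len(seed)
    care = [k for k, c in enumerate(seed) if c == "1"]
    best = {(): 0}  # last (span - 1) bits -> best count of 1s so far
    for _ in range(n):
        nxt = {}
        for suffix, ones in best.items():
            for bit in (0, 1):
                window = suffix + (bit,)
                # A full window whose care positions are all 1 is a hit: forbidden.
                if len(window) == span and all(window[k] for k in care):
                    continue
                key = window[1:] if len(window) == span else window
                if nxt.get(key, -1) < ones + bit:
                    nxt[key] = ones + bit
        best = nxt
    return max(best.values())


print(max_matches_without_hit("111", 100))   # consecutive seed of weight 3
print(max_matches_without_hit("1101", 100))  # spaced seed of weight 3
```

For length 100 the two values correspond to the 66.66% pattern 110110… and to the 61% bound of the proof.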

Formalize Given an i.i.d. sequence (homology region) with Pr(1)=p and Pr(0)=1-p for each bit, which seed is more likely to hit this region? BLAST seed: 11111111111 Spaced seed: 111*1**1*1**11*111 Reference Ming Li, NHC2005

Expect Less, Get More Lemma: The expected number of hits of a weight-$W$, length-$M$ seed model within a length-$L$ region with homology level $p$ is $(L-M+1)p^W$. Proof. $E(\#\text{hits}) = \sum_{i=1}^{L-M+1} p^W$ ■ Example: In a region of length 64 with p=0.7: Pr(BLAST seed hits)=0.3, E(# of hits by BLAST seed)=1.07; Pr(optimal spaced seed hits)=0.466, 50% more; E(# of hits by spaced seed)=0.93, 14% less. Reference Ming Li, NHC2005
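The two expected values in the example follow directly from the lemma. A small sketch, with a hypothetical function name, assuming the BLAST seed has W = M = 11 and the spaced seed has W = 11, M = 18:

```python
def expected_hits(L, M, W, p):
    """Expected number of hits of a weight-W, length-M seed in a length-L region."""
    return (L - M + 1) * p ** W


blast = expected_hits(L=64, M=11, W=11, p=0.7)
spaced = expected_hits(L=64, M=18, W=11, p=0.7)
print(round(blast, 2), round(spaced, 2))
```

The spaced seed has fewer starting positions (47 instead of 54), which is why its expected hit count is lower even though its hit probability is higher.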

Why Is Spaced Seed Better? A wrong, but intuitive, proof: seed $s$, interval $I$, similarity $p$. $E(\#\text{hits}) = \Pr(s\ \text{hits}) \cdot E(\#\text{hits} \mid s\ \text{hits})$. Thus: $\Pr(s\ \text{hits}) = Lp^w / E(\#\text{hits} \mid s\ \text{hits})$. For the optimized spaced seed 111*1**1*1**11*111, consider $E(\#\text{hits} \mid s\ \text{hits})$: a copy of the seed shifted by 1, 2 or 3 positions has 6 non-overlapping 1s (probability $p^6$), a copy shifted by 4 has 7 (probability $p^7$), … For spaced seed: the divisor is $1+p^6+p^6+p^6+p^7+\dots$ For BLAST seed: the divisor is bigger: $1+p+p^2+p^3+\dots$ Reference Ming Li, NHC2005

Simulated sensitivity curves Reference Ming Li, NHC2005

Observations of spaced seeds Seed models with different shapes can detect different homologies. Two consequences: (1) Some models may detect more homologies than others: more sensitive homology search (PatternHunter I). (2) Several seed models can be used simultaneously to hit more homologies: approaching 100% sensitive homology search (PatternHunter II). Reference Ming Li, NHC2005

PatternHunter – Algorithm & Performance

Outline Hit generation Hit extension Gapped extension Performance

Hit generation Index created for each position in the query sequence

Hit generation Similar to MegaBlast: hash tables. Encode A, T, C, G into the binary codes 00, 01, 10, 11 respectively. Find every seed occurrence in one of the sequences and record its offset in the hash table.

Hit generation An example: Now we want to find hits between sequences S and T

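Spaced seed For sequence T: scan the model along the sequence and encode the letters at the model's 1 positions, with A = 00, T = 01, C = 10, G = 11. Example: from ATATGCAT‧‧ the model 1101011 reads the seed ATTCA = 0001011000 (binary) = 88. Weight = 5, so the value is between 0 and 2^10-1.

After filling in the hash table … each entry holds the positions in T where that value occurs (NULL if none). For each position in S: 1. Calculate the int value. 2. Find hits by looking up that value in the table.

Hash tables: space required The table has 4^w integer entries, plus |T| integers for the positions in T. Total: 4^(w+1)+4|T| bytes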
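The indexing and lookup steps can be sketched as follows. This is a simplified illustration, not PatternHunter's code: it uses a Python dict instead of a fixed array of 4^w entries, assumes the sequences contain only A, T, C and G, and uses the weight-5 model 1101011, which is an assumption chosen because it reads ATTCA from ATATGCAT. All function names are hypothetical.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}  # encoding from the slide


def encode(letters):
    """Pack letters into an integer, two bits per letter."""
    value = 0
    for ch in letters:
        value = (value << 2) | CODE[ch]
    return value


def seed_key(seq, pos, model):
    """Integer key of the letters under the model's 1 positions at offset pos."""
    return encode(seq[pos + k] for k, m in enumerate(model) if m == "1")


def build_index(t, model):
    """Map each key to the list of offsets in T where it occurs."""
    index = {}
    for pos in range(len(t) - len(model) + 1):
        index.setdefault(seed_key(t, pos, model), []).append(pos)
    return index


def find_hits(s, t, model):
    """All (offset in S, offset in T) pairs that agree on the model's 1 positions."""
    index = build_index(t, model)
    return [
        (i, j)
        for i in range(len(s) - len(model) + 1)
        for j in index.get(seed_key(s, i, model), [])
    ]


MODEL = "1101011"  # hypothetical weight-5 model
print(encode("ATTCA"))
print(find_hits("GATTACA", "GACTGCA", MODEL))
```

The two toy sequences differ only under the model's 0 positions, so they still produce a hit.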

Cost a lot to make a hash table? If the number of hits found for one index is large, the cost of computing index is relatively negligible.

Hit extension HSP: High-scoring Segment Pair. Scan those hits with a window, and choose the highest-scoring one.

Hit extension S T The chosen hit

Hit extension Set the mid point of the chosen hit as the cut point, split the graph into 4

Hit extension S T

Then run Smith-Waterman in 2 of the 4 parts, until the score reaches the dropoff score.

Hit extension S T Smith-Waterman Cost=1/2*O(mn)

Hit extension If the resulting segment pair has a score below a certain minimum, ignore it. Otherwise we have an HSP and proceed to the next step, gapped extension.

Hit extension A question: when doing extension in 2 ways, how to synchronize the score?

To find the best way to extend an HSP to the left across gaps. To extend an HSP we try all candidates from a diagonal-sorted set. Penalty for gap open + gap extension + cropping Gapped Extension

Search front Gapped Extension

Optimal Left Too Far Right Optimal Left From left to right

Too Far Right Optimal Left From left to right

Descriptions in the paper We use a red-black tree for this. An HSP is inserted when the optimal alignment to its left is found, and retired from the tree once newly generated HSPs are too far beyond its right endpoint to make use of it.

Thought 1 The first one will be inserted  Fast

Better Worse Start End May not find the best one Thought 1

Not complete HSP Insert HSP when the optimal alignment to its left is found Thought 2

Close but short (Bad) Far but long (Good) Insert both HSPs Next turn Thought 2

Tree 2 Tree 1 Thought 1 Retired alignments are put into a priority queue according to their scores.

Performance Ref. Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no Ref. Altschul,S.F. et al (1997) Nucleic Acids Res., 25, 3389 – 3402.

PatternHunter II

Outline Overview PatternHunter II design Computing hit probability Finding seeds set Seed performance PHII performance

Overview PatternHunter: spaced seed. PH2: designed for better sensitivity: “Achieve a sensitivity approaching that of Smith-Waterman with a speed similar to the default Blastn”. Extends the single spaced seed to multiple ones. Two main problems: large memory required for multiple hash tables; complexity of finding the optimal seed combination.

PatternHunter II design A hash table is built for each seed. All hits generated from all hash tables are used for gap extension. In two-hit mode, two nearby hits can be from different hash tables.

PatternHunter II design (cont.) Large memory problem: divide into smaller segments, e.g., with k = 8, w = 11, and n = 32 × 10^6, the hash tables use about 256 MBytes of memory. Extend alignments across division boundaries. Still may lose alignments.

Computing hit probability Use DP, but extend the algorithm from a single seed to multiple seeds. Definitions: a homologous region R with length L; the substring from i to j is denoted by R[i : j]; a set of k seeds A = {a_1, …, a_k}; A hits R if there is an a_i that hits R; p is called the similarity level of R if R has p% identities.

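Computing hit probability (cont.) For a binary string b and 0 ≤ i ≤ L, define f(i, b) as the probability that A hits R[0 : i], given that R[0 : i] ends with b. The goal is to find f(L, ε). For any i > |b|, we have f(i, b) = (1 − p) f(i, 0b) + p f(i, 1b). We can compute f(i,b) from other f(i’,b’) computed earlier.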

Computing hit probability (cont.) Definition: b is compatible with a seed a if b[|b|-j] = 1 whenever a[|a|-j] = 1, for 0 < j ≤ min(|a|, |b|). Define B to be the set of binary strings that are not hit by A but are compatible with some a in A. B(x) denotes the longest proper prefix of x in B.

Computing hit probability (cont.) First, ε is in B. Suppose b is in B; then b is compatible with some a in A by definition, and therefore 1b is also compatible with some a in A. If 1b is not in B, it must be hit by some a’ in A, so f(i,1b)=1. If 0b is not in B, it cannot be hit by A, so it must be incompatible with every a in A, and f(i,0b)=f(i-|b|+|b’|, 0b’), where 0b’=B(0b).
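To make the quantity being computed concrete, here is a deliberately simplified dynamic program for the hit probability of a seed set. It is not the paper's Algorithm DP: instead of the compatible-suffix set B it tracks every possible suffix of length (longest seed - 1), which is only practical for short seeds. A brute-force enumeration is included as a cross-check. All names are hypothetical.

```python
from itertools import product


def ends_with_hit(seeds, window):
    """True if some seed hits the region ending at the last bit of window."""
    for s in seeds:
        if len(window) >= len(s):
            tail = window[-len(s):]
            if all(bit == 1 for c, bit in zip(s, tail) if c == "1"):
                return True
    return False


def hit_probability(seeds, L, p):
    """Probability that at least one seed hits an i.i.d. region of length L."""
    span = max(len(s) for s in seeds)
    unhit = {(): 1.0}  # suffix of a not-yet-hit prefix -> probability mass
    for _ in range(L):
        nxt = {}
        for suffix, mass in unhit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = suffix + (bit,)
                if ends_with_hit(seeds, window):
                    continue  # this mass is now hit and leaves the table
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + mass * q
        unhit = nxt
    return 1.0 - sum(unhit.values())


def hit_probability_bruteforce(seeds, L, p):
    """Same quantity by enumerating all 2^L regions (small L only)."""
    total = 0.0
    for region in product((0, 1), repeat=L):
        if any(ends_with_hit(seeds, region[:e]) for e in range(1, L + 1)):
            ones = sum(region)
            total += p ** ones * (1.0 - p) ** (L - ones)
    return total


print(hit_probability(["111"], 10, 0.7), hit_probability(["1101"], 10, 0.7))
```

With seeds of length 18 the suffix table in this sketch can grow to 2^17 entries, which hints at why restricting attention to suffixes compatible with some seed matters in practice.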

Computing hit probability (cont.) Ref. Li,M. et al, (2004) Comput. Biol., 2, 417 – 440.

Computing hit probability (cont.) Can also compute k -hits probability Change f(i,b) to f(i,b,k) We already have k = 1. By induction, compute each f(i,b,k) from f(i,b,k-1)

Computing hit probability (cont.) Ref. Li,M. et al, (2004) Comput. Biol., 2, 417 – 440.

Computing hit probability (cont.) Complexity It is proved that computing the hit probability of multiple seeds is NP-hard The time complexity of the algorithm is which

Computing hit probability (cont.) Implement Algorithm DP on PC It took 0.70 sec to compute hit probability for a set of 16 weight-11 seeds with length < 21 on a random region with length 64 It only took 0.37 sec for the same number of set and the same length but change the weight to 12 The running time largely depends on the maximum number of 0 in every seed

Finding seeds set Cannot enumerate all possible seed sets with Algorithm DP: their number is exponential! Also, finding the optimal spaced seed set is proved NP-hard. Use a “greedy” method.

Finding seeds set (cont.) Compute the first seed a_1 which maximizes the hit probability of the set {a_1}. Then compute the second seed a_2 for the set {a_1, a_2}, then a_3, … Compute a_i until the desired number of seeds or the desired hit probability is reached.

Finding seeds set (cont.) May not optimize the hit probability. It is still time-consuming: e.g., it took 12 CPU days on a Pentium 4 3GHz PC to compute a set of 16 weight-11 seeds, each no longer than 21. It takes much longer if the seeds become slightly longer. Need a different approach.

Finding seeds set (cont.) Suppose we already have N seeds, and C is the candidate set for the (N+1)-th seed. For each c in C, estimate the hit probability on m random region samples; m is reasonably large, such as 500. Remove the worst-performing half of C, and increase m to 2m. Repeat until only one seed is left.
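The halving procedure on this slide can be sketched as below. This is an illustration under stated assumptions, not the PatternHunter II code: regions are sampled as i.i.d. 0/1 strings with similarity p, the same samples are shared by all candidates within a round, and the region length L = 64 and p = 0.7 are borrowed from the earlier example. Function names are hypothetical.

```python
import random


def seed_hits(seed, region):
    """True if the seed hits the 0/1 region at some offset."""
    care = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(region[i + k] for k in care)
        for i in range(len(region) - len(seed) + 1)
    )


def estimate(chosen, candidate, regions):
    """Fraction of sampled regions hit by the chosen seeds plus the candidate."""
    seeds = list(chosen) + [candidate]
    hit = sum(any(seed_hits(s, r) for s in seeds) for r in regions)
    return hit / len(regions)


def pick_next_seed(chosen, candidates, L=64, p=0.7, m=500, rng=None):
    """Successive halving: keep the better half, double the sample size."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > 1:
        regions = [[rng.random() < p for _ in range(L)] for _ in range(m)]
        pool.sort(key=lambda c: estimate(chosen, c, regions), reverse=True)
        pool = pool[: (len(pool) + 1) // 2]  # drop the worst-performing half
        m *= 2
    return pool[0]


candidates = ["11111111111", "111010010100110111", "1101100110101111", "111101011101111"]
print(pick_next_seed([], candidates, rng=random.Random(1)))
```

Cheap, noisy estimates are spent on the many weak candidates, and the larger samples are reserved for the few that survive.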

Seed performance Two ways to increase the sensitivity: Increase the number of seeds Reduce the weight of a single seed Both increase running time The sensitivity of “ doubling the number of seeds ” is approximately equal to “ reducing the weight of a single seed by 1 ” At high level, doubling the number of seeds achieves better sensitivity

Seed performance (cont.) From low to high: Solid curves: using the first k =(1, 2, 4, 8, 16) weight-11 seeds Dashed curves: single optimal weight w =(10, 9, 8, 7) seeds Ref. Li,M. et al, (2004) Comput. Biol., 2, 417 – 440.

Comparison Sensitivity / Speed PatternHunter II Blast Smith-Waterman algorithm SSearch

SSearch Configuration Smith-Waterman algorithm A sub-program in the FASTA package FASTA package ftp://ftp.virginia.edu/pub/FASTA/

Common Environment Score scheme Match = 1 Mismatch = -1 Gapopen = -5 Gapextension = -1 Local alignments scores >= 16

Common Environment DNA sequences 2 sets of human and mouse EST sequences ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ month.est_human.Z month.est_mouse.Z Pentium IV 3GHz Linux PC

Term Explanation EST: Expressed Sequence Tag. A unique stretch of DNA within a coding region of a gene that is useful for identifying genes. A short sub-sequence of a transcribed sequence.

Term Explanation Coding Regions: Regions of DNA/RNA sequences that code for proteins. A coding region usually starts with a start codon (ATG) and ends with a stop codon. The coding region of a gene is the portion of DNA that is transcribed into mRNA and translated into proteins.

Repeat Masking Fact: the sequences contain long runs of identical letters, especially of As and Ts (an example will be shown later). Solution: turn every run of ten or more identical letters into Ns.
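The masking rule can be written as a single regular-expression substitution. A minimal sketch, assuming upper-case sequences over A, C, G, T and a hypothetical function name; the threshold of ten comes from the slide.

```python
import re


def mask_repeats(seq, min_run=10):
    """Replace every run of min_run or more identical letters with Ns."""
    # One letter followed by at least (min_run - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_repeats("ACGT" + "A" * 12 + "CG"))
```

The replacement keeps the sequence length unchanged, so alignment coordinates reported afterwards still refer to the original positions.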

SSearch Result Num of human ’ s EST: 4 Num of mouse ’ s EST: 2005 EST example (show) Ref. Li,M. et al, (2004) Comput. Biol., 2, 417 – 440.

Optimal Versus Sub-Optimal Neither PatternHunter nor Blast tries to compute the optimal alignments for the homologies they have found. Q: Why not find the optimal alignments? Ans: use Blast or PH2 to “detect”, then compute.

Found If SSearch finds a local alignment with score x for a pair of ESTs and PatternHunter II finds a local alignment with score >= x/2, the pair counts as “found”.

Sensitivity Definition Smith-Waterman finds y pairs of ESTs with local alignment score at least x. For another program, y’ of the y pairs are found with alignment score >= x/2. Ratio: y’ / y

Blastn Configuration Version NCBI ’ s website -F F option To turn off the low-complexity region filtering Weight 11 seeds

Speed comparison Ref. Li,M. et al, (2004) Comput. Biol., 2, 417 – 440.

Sensitivity comparison From low to high Dashed: Blastn, seed weight 11 Solid: PH II, 1, 2, 4, 8 seeds weight 11

Compare with other seeds From left to right PH II, two weight 11 seeds PH II, one weight 10 seed HMM model,

Seed Selection Use heuristic or exponential time algorithms For general seed selection problem PTAS polynomial time approximation scheme

Homology Search Time-consuming DNA-DNA searches Blastn translated DNA-protein searches tBlastx tPH protein-protein searches Small query and database sizes

Conclusion Optimized spaced seeds Blastn & PH II Same sensitivity Speeds up by times Optimized multiple spaced seeds PH II & Smith-Waterman Approximately same sensitivity >1000 times faster

Translated PatternHunter

Outline What ’ s translated search? BLAST ’ s translated search Translated Pattern Hunter Performance

What's translated search? To translate a DNA sequence into a protein sequence for alignment with another protein sequence. But what is “translation”?

What's translation? In biology, “translation” means translating a nucleotide sequence into amino acids (AA) with a universal genetic code map, three nucleotides (one codon) at a time. The DNA sequence is first transcribed into an RNA sequence in which all T's are replaced by U's. AUGUCACUAGAAUCGUUAUAG MetSerLeuGluSerLeu.

The Genetic code We can use translation in homology search since the genetic code is universal. Degeneracy: some codons map to the same AA; they usually differ in the third base of the codon. Translation is one-way: DNA → Protein. AUGUCACUAGAAUCGUUAUAG MetSerLeuGluSerLeu.

Why do we need translated search? When a DNA database or a protein database is not available: Blastx (DNA query, protein database), tBlastn (protein query, DNA database). To find very distant homologies: tBlastx (DNA query and database, both translated), the slowest but it finds more functional and structural homology in addition to sequence homology. Why?

Substitution Matrix Some AAs are similar in their chemical or physical properties Not only match/mismatch in substitution anymore! Stop codon is assigned the most negative score in BLAST and tPH PAM (Point Accepted Mutation) Based on global alignment of closely related proteins (1% divergence for PAM1) BLOSUM (BLOck SUbstitution Matrix) Based on local alignment of divergent proteins (62% similarity for BLOSUM 62)

Substitution Matrix Short alignments need to be relatively strong to rise above background noise, so they can only detect closely related homologies.
Query length | Substitution matrix | Gap costs
<35 | PAM-30 | (9,1)
35-50 | PAM-70 | (10,1)
50-85 | BLOSUM-80 | (10,1)
>85 | BLOSUM-62 | (10,1)
The matrices are ordered from related (top) to divergent (bottom). Adapted from NCBI: substitution matrix

BLAST's translated search The same in Blastx, tBlastn, tBlastx: aligns the 6-frame translations of the DNA sequence against another protein sequence.

Reading Frame of DNA Sequence The DNA sequence can be read in six reading frames, three in the forward and three in the reverse direction. (Figure: the two strands of a sequence with their six translations; one frame is marked as an Open Reading Frame.)

BLAST ’ s translated search 1. Translate the DNA sequence into all 6 possible frames 2. Align each frame against the protein sequence, just like BLASTp. 3. The pairs with significant scores are reported
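Step 1, the six-frame translation, can be sketched in a few lines. This is an illustration with hypothetical function names, assuming the standard genetic code, one-letter amino acid codes and `*` for stop codons; it is not BLAST's implementation.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (first base slowest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA or RNA string codon by codon; unknown codons become X."""
    dna = seq.upper().replace("U", "T")
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(seq):
    """Three forward-strand frames followed by three reverse-complement frames."""
    dna = seq.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(translate("AUGUCACUAGAAUCGUUAUAG"))
for frame in six_frames("AUGUCACUAGAAUCGUUAUAG"):
    print(frame)
```

Each of the six strings would then be aligned against the protein sequence as in step 2.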

How good is significant? The expected number of alignments scoring S or greater between two sequences of lengths m and n is $E = mnKe^{-\lambda S}$, or $E = mn2^{-S'}$ with the normalized (bit) score $S' = (\lambda S - \ln K)/\ln 2$. K and λ, used for normalization, depend on the sequence composition. A different K, λ is used for each frame. Non-coding sequences tend to yield alignments of marginal significance.
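The two forms of the E-value are the same quantity once the raw score is converted to a bit score with the standard normalization S' = (λS − ln K)/ln 2. The sketch below checks this numerically; the values of K and λ and the sequence lengths are illustrative assumptions, not parameters taken from the slides.

```python
import math


def evalue_from_raw(m, n, S, K, lam):
    """E = m * n * K * exp(-lambda * S) for a raw score S."""
    return m * n * K * math.exp(-lam * S)


def bit_score(S, K, lam):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def evalue_from_bits(m, n, bits):
    """E = m * n * 2^(-S') for a bit score S'."""
    return m * n * 2.0 ** (-bits)


# Hypothetical parameters, for illustration only.
m, n, S, K, lam = 300, 1_000_000, 60, 0.13, 0.32
print(evalue_from_raw(m, n, S, K, lam))
print(evalue_from_bits(m, n, bit_score(S, K, lam)))
```

Bit scores make results comparable across scoring systems because K and λ are folded into the score itself.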

Translated PatternHunter The version of PH for translated search Compared with PatternHunter, tPH uses very different algorithms for hit generation and gapped extensions

Hit Generation in tPH Weight = 5 instead of 11. Space complexity: $20^5 \approx 4^{11}$ in PH. Length = 6 or 7. Does not require exact matches: a hit means all five pairs have scores ≥ 0 and the total score is above a tolerance T. Uses BLOSUM 62. Multiple seeds are used.

Hit Generation in tPH Seed = 1011, T=7 (Figure: the translated query is matched against the indexed subject; all possible hits are listed, e.g. MetXAlaGln, MetXValGlu, MetXAlaGlu and GlnXValLeu, ArgXValIle, ArgXValLeu.)

Gapped Extension in tPH The same as in BLAST? BLAST can ’ t handle frame shift errors Huh?

Frame Shift Error When a single nucleotide is deleted or inserted, it causes the reading frame to shift. MetThrVal GluSerLeu. AACGUUUUCUACUAGAAAGAG AsnAspThrArgIleValIle A BLAST can't detect such variation: it aligns the 6 frames with the subject independently. In fact, most frame shift mutations completely abolish the protein's function; they are usually lethal.

Frame Shift Error In this example BLAST can only find at most two separated segments. tPH can connect them with a single deletion of “C”. How?

Gapped Extension in tPH tPH regards the DNA sequence as a sequence of overlapping codons. It uses a modified Smith-Waterman algorithm that can take frame shifts into account. Substitution: $S(i-1, j-3) + \sigma(p_i, n_{[j-2..j]})$. Insertion of DNA: $S(i, j-1) + \text{frameshift}$. Insertion of DNA: $S(i, j-2) + \text{frameshift}$. Insertion of AA: $S(i, j-3) + \text{gap}$. Deletion of AA: $S(i-1, j) + \text{gap}$.
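The five recurrences can be turned into a small local-alignment routine. This is a toy sketch, not tPH: a simple match/mismatch score stands in for σ (tPH uses a substitution matrix), gap opening is not distinguished from gap extension, and the penalties (frameshift -1, gap -2) are taken as defaults from the scoring-scheme slide. Function and parameter names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frameshift_local_score(protein, dna, match=2, mismatch=-1, gap=-2, frameshift=-1):
    """Best local score of a protein against DNA, allowing frame shifts."""
    dna = dna.upper().replace("U", "T")
    rows, cols = len(protein) + 1, len(dna) + 1
    S = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            options = [0]  # local alignment: a score never drops below zero
            if j >= 1:
                options.append(S[i][j - 1] + frameshift)  # skip 1 nucleotide
            if j >= 2:
                options.append(S[i][j - 2] + frameshift)  # skip 2 nucleotides
            if j >= 3:
                options.append(S[i][j - 3] + gap)         # codon with no AA partner
            if i >= 1:
                options.append(S[i - 1][j] + gap)         # AA with no codon partner
            if i >= 1 and j >= 3:
                aa = CODON.get(dna[j - 3:j], "X")
                sigma = match if aa == protein[i - 1] else mismatch
                options.append(S[i - 1][j - 3] + sigma)   # substitution
            S[i][j] = max(options)
            best = max(best, S[i][j])
    return best


print(frameshift_local_score("MSLES", "ATGTCACTAGAATCG"))   # in-frame DNA
print(frameshift_local_score("MSLES", "ATGTCACCTAGAATCG"))  # one extra nucleotide
```

In the second call a single inserted nucleotide splits the coding sequence across two frames, and the frameshift move joins the two halves at the cost of one penalty instead of reporting two separate segments.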

Scoring Scheme n: GACACUAGAAUCG, P: Asp Arg Tyr Ser. Query: GAC ACU A-- GAA --- UCG (Asp Thr --- Glu Tyr Ser). Subject: Asp Arg Tyr Ser. Substitution: $S(i-1, j-3) + \sigma(p_i, n_{[j-2..j]})$. Frameshift: $S(i, j-1) + \text{frameshift}$ (-1), $S(i, j-2) + \text{frameshift}$ (-1). Insertion: $S(i, j-3) + \text{gap}$ (-2). Deletion: $S(i-1, j) + \text{gap}$ (-2).

Performance Evaluation 4407 human expressed sequence tag (EST) sequences Split in the middle as subject and query

Number of Alignments Found T=12 for BLAST 3x speed Higher sensitivity Ref. Derek Kisman et al, Bioinformatics Vol. 21 no

Unique Alignment Found Most contains frameshifts Ref. Derek Kisman et al, Bioinformatics Vol. 21 no

Using 4 Seeds Differs from PH2 Short seeds High dependency between seeds Ref. Derek Kisman et al, Bioinformatics Vol. 21 no

Reference PatternHunter: Bin Ma, John Tromp, Ming Li, Bioinformatics Vol. 18 no; Ming Li, NHC2005. PatternHunter II: Li,M., Ma,B., Kisman,D. and Tromp,J. (2004) Comput. Biol., 2, 417–440; NTU R 林語君's PowerPoint.

Reference tPatternHunter: Derek Kisman, Ming Li, Bin Ma, and Li Wang, Bioinformatics Vol. 21 no. Others: Wikipedia, NCBI.

Thank you for your attention!