1
Class 4: Sequence Alignment II: Gaps, Heuristic Search
2
Alignment with Gaps: Example

Alignment 1:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
3
Gaps
- Both alignments have the same number of matches and spaces, but alignment 2 seems better.
- Definition: a gap is any maximal, consecutive run of spaces in a single string. The length of a gap is the number of spaces in it.
- Example 1 has 11 gaps; example 2 has only 2 gaps.
- Idea: develop alignment scores that take gaps (not spaces) into account.
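The gap definition above can be checked mechanically. The following is a minimal sketch, assuming alignments are stored as pairs of equal-length strings with `-` as the space character; the helper names `gap_stats` and `alignment_gap_stats` are made up for this illustration.

```python
import re

def gap_stats(row, space="-"):
    """Return (number of gaps, number of spaces) for one row of an alignment.

    A gap is a maximal run of consecutive space characters.
    """
    runs = re.findall(re.escape(space) + "+", row)
    return len(runs), sum(len(run) for run in runs)

def alignment_gap_stats(alignment):
    """Sum the gap and space counts over both rows of an alignment."""
    per_row = [gap_stats(row) for row in alignment]
    return sum(g for g, _ in per_row), sum(s for _, s in per_row)

alignment_1 = ("AAC-AATTAAG-ACTAC-GTTCATGAC",
               "A-CGA-TTA-GCAC-ACTG-T-A-GA-")
alignment_2 = ("AACAATTAAGACTACGTTCATGAC---",
               "AACAATT--------GTTCATGACGCA")

print(alignment_gap_stats(alignment_1))  # (11, 11): 11 gaps, 11 spaces
print(alignment_gap_stats(alignment_2))  # (2, 11): 2 gaps, 11 spaces
```

Both alignments contain 11 spaces, but they are grouped into 11 gaps in the first and only 2 in the second.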
4
Biological Motivation
- Number of mutational events:
  - A single gap is due to a single event that removed a number of residues.
  - Each gap is due to a distinct, independent event.
- Protein structure:
  - Protein secondary structure consists of alpha helices, beta sheets, and loops.
  - Loops of varying size can lead to very similar structure.
5
Biological Motivation
6
cDNA Matching
- cDNA is the sequence after splicing (introns have been removed) and editing.
- We expect regions of high similarity, separated by long gaps.
7
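Gap Penalty Models (I)
- Constant model
  - Gives each gap a constant score; spaces are free.
  - Maximize: (sum of column scores) - W_g × (number of gaps)
  - Time: O(mn)
  - Works well with cDNA matching.
- Affine model
  - Penalty for starting a gap + penalty for each additional space.
  - Each gap of length q costs W_g + q W_s.
  - Maximize: (sum of column scores) - W_g × (number of gaps) - W_s × (number of spaces)
  - Time: O(mn)
  - Widely used.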
8
Gap Penalty Models (II)
- Convex model
  - Each extra space contributes less penalty.
  - The gap function is convex in its length.
  - Example: W_g + log(q)
  - Time: O(mn log m)
  - A better model of biology.
- General model
  - The weight of a gap is some arbitrary function w(q).
  - Time: O(mn^2 + nm^2)
9
Example Revised

Alignment 1:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
10
Indel Model
Scoring parameters: Match: +1, Indel: -2

Alignment 1 (score: -6):
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2 (score: -6):
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
11
Constant Model
Scoring parameters: Match: +1, Open gap: -2

Alignment 1 (score: -6):
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2 (score: 12):
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
12
Affine Model
Scoring parameters: Match: +1, Open gap: -2, each space: -1

Alignment 1 (score: -17):
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2 (score: 1):
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
13
Convex Model
Scoring parameters: Match: +1, Open gap: -2, gap length: -log n

Alignment 1 (score: -6):
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment 2 (score: ~7):
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
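The scores on the four model slides can be reproduced with a few lines of code. This is a sketch under stated assumptions: mismatched columns score 0 (the slides give no mismatch score, and neither example alignment contains a mismatch), and the convex model uses a base-2 logarithm, which is the base that reproduces the "~7" on the slide. The function names are illustrative only.

```python
import math
import re

def matches_and_gaps(top, bottom):
    """Count matching columns and collect the length of every gap in both rows."""
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    gap_lengths = [len(run) for row in (top, bottom)
                   for run in re.findall(r"-+", row)]
    return matches, gap_lengths

def score(top, bottom, model):
    """Score an alignment with match = +1 under one of the four gap models."""
    matches, gaps = matches_and_gaps(top, bottom)
    if model == "indel":        # every space costs 2
        penalty = 2 * sum(gaps)
    elif model == "constant":   # every gap costs 2, spaces are free
        penalty = 2 * len(gaps)
    elif model == "affine":     # 2 per gap plus 1 per space
        penalty = 2 * len(gaps) + sum(gaps)
    elif model == "convex":     # 2 per gap plus log2(gap length); base 2 is assumed
        penalty = sum(2 + math.log2(q) for q in gaps)
    else:
        raise ValueError(f"unknown model: {model}")
    return matches - penalty

alignment_1 = ("AAC-AATTAAG-ACTAC-GTTCATGAC",
               "A-CGA-TTA-GCAC-ACTG-T-A-GA-")
alignment_2 = ("AACAATTAAGACTACGTTCATGAC---",
               "AACAATT--------GTTCATGACGCA")

for model in ("indel", "constant", "affine", "convex"):
    print(model, score(*alignment_1, model), round(score(*alignment_2, model), 2))
```

Both alignments have 16 matching columns and 11 spaces, so only the gap-aware models separate them.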
14
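Affine Weight Model
We divide all possible alignments of the prefixes s[1..i] and t[1..j] into 3 types:
- Type 1: s[i] is aligned with t[j].
- Type 2: the alignment ends with a gap in s, so t[j] is aligned with a space.
- Type 3: the alignment ends with a gap in t, so s[i] is aligned with a space.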
15
Affine Weight Model Recurrence relations:
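A common textbook form of the recurrences for the three alignment types is sketched below. The names V, G, E, F and the column score σ are assumptions and need not match the notation on the slide; W_g and W_s are the gap-opening and per-space weights of the affine model. G covers alignments where s[i] is aligned with t[j], E those where t[j] is aligned with a space, and F those where s[i] is aligned with a space.

```latex
\begin{align*}
V(i,j) &= \max\{\, G(i,j),\; E(i,j),\; F(i,j) \,\} \\
G(i,j) &= V(i-1,j-1) + \sigma(s_i, t_j) \\
E(i,j) &= \max\{\, E(i,j-1),\; V(i,j-1) - W_g \,\} - W_s \\
F(i,j) &= \max\{\, F(i-1,j),\; V(i-1,j) - W_g \,\} - W_s
\end{align*}
```

Extending an existing gap pays only W_s, while opening a new one pays W_g + W_s.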
16
Affine Weight Model
- Initial condition:
- Optimal alignment:
- Complexity: Time: O(mn), Space: O(mn)
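A minimal sketch of the O(mn) affine-gap dynamic program follows. It assumes a common choice of initial conditions (a leading gap of length q costs W_g + q W_s) and takes the optimal global score from the corner cell V[n][m]. The table names V, E, F, the function name, and the default scores (match +1, mismatch 0, W_g = 2, W_s = 1) are assumptions for illustration.

```python
NEG_INF = float("-inf")

def affine_global_score(s, t, match=1, mismatch=0, w_g=2, w_s=1):
    """Optimal global alignment score under an affine gap penalty."""
    n, m = len(s), len(t)
    # V: best score overall; E: alignment ends with a space in s;
    # F: alignment ends with a space in t.
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    V[0][0] = 0
    for i in range(1, n + 1):            # s[1..i] aligned against spaces only
        F[i][0] = V[i][0] = -w_g - i * w_s
    for j in range(1, m + 1):            # t[1..j] aligned against spaces only
        E[0][j] = V[0][j] = -w_g - j * w_s

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Either extend an existing gap (pay w_s) or open a new one (pay w_g + w_s).
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - w_g) - w_s
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - w_g) - w_s
            V[i][j] = max(diag, E[i][j], F[i][j])
    return V[n][m]

print(affine_global_score("AAAGGG", "AAAGGGCCC"))  # 1: six matches, one gap of length 3
```

The three tables use O(mn) space; since each row depends only on the previous one, keeping two rows per table would suffice when only the score is needed.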
17
Affine Weight Model
This model has a natural explanation as a finite state automaton.
[Figure: automaton with three states A, B, and C; edges labeled S(i,j), W_g + W_s, and W_s]
18
Alignment in Real Life
- One of the major uses of alignments is to find sequences in a "database".
- Such collections contain a massive number of sequences (on the order of 10^6).
- Finding homologies in these databases with standard dynamic programming can take too long.
- Example:
  - Query protein: 232 AAs
  - NR protein DB: 2.7 million sequences; 748 million AAs
  - m × n ≈ 1.7 × 10^11 cells!
19
Heuristic Search
- Instead, most searches rely on heuristic procedures.
- These are not guaranteed to find the best match.
- Sometimes, they will completely miss a high-scoring match.
- We now describe the main ideas used by some of these procedures.
  - Actual implementations often contain additional tricks and hacks.
20
Basic Intuition
- The main resource-consuming factor in standard DP is deciding where the gaps are. If there were no gaps, life would be easy!
- Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often do contain long stretches of gap-less matches.
- These heuristics try to find significant local gap-less matches and then extend them.
21
Banded DP
Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m.
- If the optimal global alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.
22
Banded DP
- To find such a path, it suffices to search in a diagonal region of the matrix.
  - If the diagonal band has presumed width a, then the dynamic programming step takes O(an) time.
  - This is much faster than the O(n^2) of standard DP in this case.
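The band restriction can be sketched as follows. To keep the example short it uses a simple per-space (indel) penalty rather than the affine model, and it treats the band as the set of cells with |i - j| ≤ a; both choices, like the function name and the default scores, are assumptions for illustration. Only O(an) cells are filled in.

```python
def banded_global_score(s, t, a, match=1, mismatch=0, indel=-2):
    """Global alignment score restricted to cells (i, j) with |i - j| <= a."""
    n, m = len(s), len(t)
    if abs(n - m) > a:
        raise ValueError("band too narrow: the corner cell (n, m) lies outside it")
    neg_inf = float("-inf")

    # Row 0: aligning the empty prefix of s with t[1..j] costs j indels.
    prev = {j: j * indel for j in range(0, min(m, a) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - a), min(m, i + a) + 1):
            if j == 0:
                cur[j] = i * indel        # s[1..i] against spaces only
                continue
            best = neg_inf
            if j - 1 in prev:             # diagonal move: s[i] aligned with t[j]
                best = max(best, prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch))
            if j - 1 in cur:              # horizontal move: t[j] against a space
                best = max(best, cur[j - 1] + indel)
            if j in prev:                 # vertical move: s[i] against a space
                best = max(best, prev[j] + indel)
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_global_score("ACGT", "AGT", a=1))  # 1: three matches, one indel
```

If the true optimal path leaves the band, the result is only a lower bound on the optimal score, which is why the band width a is a presumed value.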
23
Banded DP
Problem (for local alignment):
If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate the quality of the match.
- However, we do not know i, j, p, q!
- How do we select which subsequences to align using banded DP?
24
FASTA Overview
- Main idea: find (fast!) "good" diagonals and extend them to complete matches.
- Suppose that we have a relatively long gap-less local match (diagonal):
  …AGCGCCATGGATTGAGCGA…
  …TGCGACATTGATCGACCTA…
- Can we find "clues" that will let us find it quickly?
25
Signature of a Match
Assumption: good matches contain several "patches" of perfect matches.
s: AGCGCCATGGATTGAGCGA
t: TGCGACATTGATCGACCTA
26
FASTA
Given s and t, and a parameter k:
- Find all pairs (i,j) such that s[i..i+k] and t[j..j+k] match perfectly.
- Locate sets of pairs that are on the same diagonal by sorting according to i-j, thus locating diagonals that contain many close pairs.
- This is faster than O(nm)!
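The pair-finding step can be sketched with a hash table of words. This is an illustration rather than FASTA's actual implementation: it indexes every length-k word of t (0-based positions, words of exactly k characters), looks up each word of s, and groups the resulting (i, j) pairs by diagonal i - j. The function name is made up.

```python
from collections import defaultdict

def kmer_hits_by_diagonal(s, t, k):
    """Group exact k-mer matches (i, j) between s and t by diagonal i - j."""
    # Index every k-mer of t by its start position.
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)

    # Look up every k-mer of s and bucket the hits by diagonal.
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append((i, j))
    return diagonals

s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
hits = kmer_hits_by_diagonal(s, t, k=3)
for diagonal in sorted(hits):
    print(diagonal, hits[diagonal])
```

Building the index takes time linear in the length of t, and the lookups cost time proportional to the length of s plus the number of hits, which is typically far below nm.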
27
FASTA
- Extend the "best" diagonal matches to imperfect (yet ungapped) matches and compute alignment scores per diagonal. Pick the best-scoring matches.
- Try to combine close diagonals into potential gapped matches, picking the best-scoring matches.
- Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments.
Most applications of FASTA use very small k (2 for proteins, and 4-6 for DNA).
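Scoring an imperfect, ungapped match along one diagonal amounts to finding the best-scoring contiguous run on that diagonal. The sketch below does this with a single maximum-subarray scan; the match and mismatch scores (+1 and -1) and the function name are assumptions, and real FASTA implementations use substitution matrices and further refinements.

```python
def best_ungapped_segment(s, t, diagonal, match=1, mismatch=-1):
    """Best-scoring ungapped run on the diagonal i - j = diagonal.

    Returns (score, start, end), where s[start:end] is the aligned
    segment of s and the matching segment of t starts at start - diagonal.
    """
    i, j = max(diagonal, 0), max(-diagonal, 0)
    best = (0, i, i)
    run_score, run_start = 0, i
    while i < len(s) and j < len(t):
        run_score += match if s[i] == t[j] else mismatch
        if run_score <= 0:
            # A non-positive prefix never helps; restart after this column.
            run_score, run_start = 0, i + 1
        elif run_score > best[0]:
            best = (run_score, run_start, i + 1)
        i += 1
        j += 1
    return best

s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
print(best_ungapped_segment(s, t, diagonal=0))  # (8, 1, 15)
```

On the main diagonal of the example strings, the three perfect patches are joined across single mismatches into one segment, s[1:15], with score 8.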
28
BLAST Overview
- FASTA's drawback is its reliance on perfect matches.
- BLAST (Basic Local Alignment Search Tool) uses a similar intuition, but relies on high-scoring matches rather than exact matches.
- Given parameters: length k and threshold T.
  - Two strings s and t of length k are a high-scoring pair (HSP) if d(s,t) > T.
29
High-Scoring Pair
Given a query string s, BLAST constructs all words w ("neighborhood words") such that w is an HSP with a k-substring of s.
Note: not all k-mers have an HSP in s.
30
BLAST: phase 1
- Phase 1: compile a list of word pairs (k=3) above threshold T.
- Example: for the following query: …FSGTWYA… (query word is in green)
- A list of words (k=3) is:
  FSG  SGT  GTW  TWY  WYA
  YSG  TGT  ATW  SWY  WFA
  FTG  SVT  GSW  TWF  WYS
31
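BLAST: phase 1
Neighborhood words for the query word GTW (threshold T = 11):

Word   Per-residue scores   Total
GTW    6, 5, 11             22
GSW    6, 1, 11             18
ATW    0, 5, 11             16
NTW    0, 5, 11             16
GTY    6, 5, 2              13
(neighborhood word hits above threshold)
GNW                         10
GAW                         9
(neighborhood word hits below threshold)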
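Generating neighborhood words amounts to scoring every candidate k-mer against the query word and keeping those above T. The sketch below is a toy version: it uses only the pair scores that can be read off the slide's table (G-G 6, T-T 5, W-W 11, S-T 1, A-G 0, N-G 0, Y-W 2) and treats every other pair as disallowed, which is a simplification; a real search would use a full substitution matrix. The names `PAIR_SCORES`, `pair_score`, and `neighborhood` are made up.

```python
from itertools import product

# Pair scores taken from the per-residue columns of the slide's table.
# All other pairs are treated as disallowed in this toy example.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("A", "G"): 0, ("N", "G"): 0, ("Y", "W"): 2,
}

def pair_score(a, b):
    """Symmetric lookup; unlisted pairs score minus infinity."""
    return PAIR_SCORES.get((a, b), PAIR_SCORES.get((b, a), float("-inf")))

def neighborhood(word, alphabet, threshold):
    """Return {candidate: score} for all candidates scoring above threshold."""
    hits = {}
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(pair_score(c, q) for c, q in zip(candidate, word))
        if total > threshold:
            hits["".join(candidate)] = total
    return hits

words = neighborhood("GTW", alphabet="AGNSTWY", threshold=11)
for w, sc in sorted(words.items(), key=lambda item: -item[1]):
    print(w, sc)
```

Enumerating all candidates costs |alphabet|^k score evaluations per query word, which is affordable only because k is small.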
32
BLAST: phase 2
- Search the database for perfect matches with neighborhood words. Those are "hits" for further alignment.
- We can locate seed words in a large database in a single pass, given that the database is properly preprocessed (using hashing techniques).
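The preprocessing idea can be illustrated with an ordinary hash table that maps each k-mer to the places where it occurs. This is a sketch, not BLAST's actual index format; the database contents, sequence identifiers, and function names below are invented for the example.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map every k-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def find_hits(index, neighborhood_words):
    """Look up each neighborhood word; words absent from the database are skipped."""
    return {w: index[w] for w in neighborhood_words if w in index}

# Hypothetical two-sequence database.
database = {"seq1": "MKGTWYA", "seq2": "AATWG"}
index = build_word_index(database, k=3)
print(find_hits(index, ["GTW", "ATW", "NTW"]))
```

The index is built once per database; each query then needs only one table lookup per neighborhood word.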
33
Extending Potential Matches
- Once a hit is found, BLAST attempts to find a local alignment that extends it.
- Seeds on the same diagonal tend to be combined (as in FASTA).
34
- An improvement: look for 2 HSPs on close diagonals.
  - Extend the alignment between them.
  - Fewer extensions are considered.
- There is a version of BLAST that involves gapped extensions.
- Generally faster than FASTA, arguably better.
[Figure: two HSPs on a diagonal]
35
BLAST Variants
- blastn (nucleotide BLAST)
- blastp (protein BLAST)
- tblastn (protein query, translated DB BLAST)
- blastx (translated query, protein DB BLAST)
- tblastx (translated query, translated DB BLAST)
- bl2seq (pairwise alignment)
36
Biological Databases
- Today, most biological information can be freely accessed on the web.
- One can:
  - Search for information on a known gene
  - Check if a sequence exists in a database
  - Find a homologous protein, helping us guess:
    - Structure
    - Function
37
Databases and Tools
- Important gateways:
  - National Center for Biotechnology Information (GenBank): http://www.ncbi.nlm.nih.gov/
  - European Bioinformatics Institute (EMBL-Bank): http://www.ebi.ac.uk/
  - Expert Protein Analysis System (SwissProt): http://www.expasy.org/
→ Different tools and DBs allow biologists a rich suite of queries.
38
Database Types
- Nucleotide DBs (GenBank, EMBL-Bank):
  - Contain any and every type of DNA fragment: full cDNA, ESTs, repeats, fragments
  - "Dirty" and redundant
- Protein DBs (SwissProt):
  - Contain amino-acid sequences for full proteins
  - High quality, strict screening process
  - Lots of annotated information on each protein