Sequence Alignment Kun-Mao Chao (趙坤茂)

Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao

Bioinformatics

Bioinformatics and Computational Biology-Related Journals: Bioinformatics (previously called CABIOS); Bulletin of Mathematical Biology; Computers and Biomedical Research; Genome Research; Genomics; Journal of Bioinformatics and Computational Biology; Journal of Computational Biology; Journal of Molecular Biology; Nature; Nucleic Acids Research; Science

Bioinformatics and Computational Biology-Related Conferences: Intelligent Systems for Molecular Biology (ISMB); Pacific Symposium on Biocomputing (PSB); The Annual International Conference on Research in Computational Molecular Biology (RECOMB); The IEEE Computer Society Bioinformatics Conference (CSB); ...

Bioinformatics and Computational Biology-Related Books:
Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995)
Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995)
Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996)
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)
Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)
Introduction to Bioinformatics, by Arthur M. Lesk (2002)

Useful Websites
MIT Biology Hypertextbook: http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html
The International Society for Computational Biology: http://www.iscb.org/
National Center for Biotechnology Information (NCBI, NIH): http://www.ncbi.nlm.nih.gov/
European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/
DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

Sequence Alignment

Dot Matrix
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (C G G A T C A T) labeling the columns.]
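A dot matrix marks every cell whose row and column symbols are identical. The following is a minimal sketch of that idea; the function name `dot_matrix` and the `*` / `.` rendering are illustrative choices, not taken from the slides.

```python
def dot_matrix(a, b):
    """Return one string per symbol of a; column j is '*' when a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


a, b = "CTTAACT", "CGGATCAT"
rows = dot_matrix(a, b)

# Print Sequence B across the top and Sequence A down the side.
print("  " + " ".join(b))
for symbol, row in zip(a, rows):
    print(symbol + " " + " ".join(row))
```

Diagonal runs of `*` correspond to stretches where the two sequences agree without gaps.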

Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT (Sequence A)
CGGATCA--T (Sequence B)

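Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT
CGGATCA--T
Match: a column with identical symbols (e.g., C over C). Mismatch: a column with different symbols (e.g., T over C). Insertion gap: a run of gap symbols in Sequence A (--- over GGA). Deletion gap: a run of gap symbols in Sequence B (AC over --).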

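Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: alignment graph, a grid with Sequence B (C G G A T C A T) across the top and Sequence A (C T T A A C T) down the side; the alignment below is drawn as a path through the grid.]
C---TTAACT
CGGATCA--T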

A simple scoring scheme
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C - - - T T A A C T
C G G A T C A - - T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
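The column-by-column sum above can be reproduced with a few lines of code. This is a minimal sketch; the helper names `w` and `alignment_score` are illustrative, and the default values are the ones given in the scoring scheme.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column under the simple scoring scheme."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # prints 12
```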

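An optimal alignment: the alignment of maximum score
Let A = a1a2…am and B = b1b2…bn.
Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
With proper initializations, Si,j can be computed as follows:
\[ S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases} \]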

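Computing Si,j
[Figure: the entry in row i, column j is reached from the entry above with w(ai, -), from the entry to the left with w(-, bj), and from the diagonal entry with w(ai, bj); the final entry Sm,n is in the bottom-right corner.]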

Initializations
Sequence B (columns): C G G A T C A T
Sequence A (rows): C T T A A C T
Row 0: 0 -3 -6 -9 -12 -15 -18 -21 -24
Column 0: 0 -3 -6 -9 -12 -15 -18 -21

S3,5 = ?
[Figure: the partially filled table for Sequence A (C T T A A C T, rows) and Sequence B (C G G A T C A T, columns); row 1 reads 8 5 2 -1 -4 -7 -10 -13, and the entry S3,5 is still to be computed.]

S3,5 = 5
S3,5 = max{S2,4 + w(T, T), S2,5 - 3, S3,4 - 3} = max{-3 + 8, 7 - 3, -5 - 3} = 5.
[Figure: the completed table for Sequence A (C T T A A C T, rows) and Sequence B (C G G A T C A T, columns); the optimal score is S7,8 = 14.]

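C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
The completed table (Sequence A down the side, Sequence B across the top):
        C   G   G   A   T   C   A   T
    0  -3  -6  -9 -12 -15 -18 -21 -24
C  -3   8   5   2  -1  -4  -7 -10 -13
T  -6   5   3   0  -3   7   4   1  -2
T  -9   2   0  -2  -5   5   2  -1   9
A -12  -1  -3  -5   6   3   0  10   7
A -15  -4  -6  -8   3   1  -2   8   5
C -18  -7  -9 -11   0  -2   9   6   3
T -21 -10 -12 -14  -3   8   6   4  14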
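The table-filling procedure can be sketched in code: each entry is the maximum over its upper, left, and diagonal neighbors, and a traceback from the bottom-right corner recovers one optimal alignment. The function name `global_alignment` is an illustrative choice; when several alignments tie, the traceback below returns just one of them, which need not be the one shown on the slide.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, row_a, row_b) for one optimal global alignment."""
    m, n = len(a), len(b)

    def w(x, y):
        return match if x == y else mismatch

    # S[i][j] = best score for aligning a[:i] with b[:j].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back from the corner, preferring the diagonal move on ties.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(score)  # prints 14
print(row_a)
print(row_b)
```

The table takes O(mn) time and space; only the score, not the traceback, can be obtained with two rows of storage.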

Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal alignment?

Initializations
Sequence B (columns): G A A T C T G C
Sequence A (rows): C A A T T G A
Row 0: 0 -3 -6 -9 -12 -15 -18 -21 -24
Column 0: 0 -3 -6 -9 -12 -15 -18 -21

S4,2 = ?
[Figure: the partially filled table for Sequence A (C A A T T G A, rows) and Sequence B (G A A T C T G C, columns); the entry S4,2 is still to be computed.]

S5,5 = ?
[Figure: the same table with more rows filled in; the entry S5,5 is still to be computed.]

S5,5 = 14
[Figure: the completed table; the optimal score is 27.]

C A A T - T G A
G A A T C T G C
-5 +8 +8 +8 -3 +8 +8 -5 = 27
[Figure: the completed table for Sequence A (C A A T T G A, rows) and Sequence B (G A A T C T G C, columns), with the optimal alignment traced back from the final entry, 27.]

Global Alignment vs. Local Alignment

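An optimal local alignment
Si,j: the score of an optimal local alignment ending at ai and bj
With proper initializations (S0,j = Si,0 = 0), Si,j can be computed as follows:
\[ S_{i,j} = \max \begin{cases} 0 \\ S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases} \]
The best local alignment score is the largest Si,j in the table.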

local alignment
Match: 8; Mismatch: -5; Gap symbol: -3
[Figure: the partially filled local-alignment table for Sequence A (C T T A A C T, rows) and Sequence B (C G G A T C A T, columns); one entry is marked "?".]

local alignment
Match: 8; Mismatch: -5; Gap symbol: -3
[Figure: the completed local-alignment table; the best score is 18.]

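A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18 (the best score)
[Figure: the completed local-alignment table for Sequence A (C T T A A C T, rows) and Sequence B (C G G A T C A T, columns), with the best-scoring local alignment traced back from the entry 18.]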
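The local-alignment table differs from the global one in two ways: every entry is floored at 0, and the answer is the largest entry anywhere in the table rather than the corner. A minimal sketch that keeps only two rows and returns the best score (the function name `local_alignment_score` is an illustrative choice):

```python
def local_alignment_score(a, b, match=8, mismatch=-5, gap=-3):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)          # row i-1 of the table
    for x in a:
        cur = [0] * (len(b) + 1)       # row i; column 0 stays 0
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("CTTAACT", "CGGATCAT"))  # prints 18
```

Recovering the aligned segments themselves would require keeping the full table (or the position of the best entry) and tracing back until an entry equal to 0 is reached.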

Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal local alignment?

Did you get it right?
[Figure: the completed local-alignment table for Sequence A (C A A T T G A, rows) and Sequence B (G A A T C T G C, columns); the best score is 37.]

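A A T - T G
A A T C T G
8 + 8 + 8 - 3 + 8 + 8 = 37
[Figure: the completed local-alignment table for Sequence A (C A A T T G A, rows) and Sequence B (G A A T C T G C, columns), with the best-scoring local alignment traced back from the entry 37.]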

Affine gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
Each gap is charged an extra gap-open penalty: -4.
C - - - T T A A C T
C G G A T C A - - T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
The alignment contains two gaps, each charged -4.
Alignment score: 12 - 4 - 4 = 4
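Scoring a given alignment with a per-gap charge only requires detecting where each run of gap symbols starts. The sketch below is illustrative (the name `score_with_gap_open` and its parameters are assumptions); setting `gap_symbol=0` turns the per-gap charge into a constant penalty that does not depend on gap length.

```python
def score_with_gap_open(row_a, row_b, match=8, mismatch=-5,
                        gap_symbol=-3, gap_open=-4):
    """Score an alignment where each gap pays gap_open once plus gap_symbol per symbol."""
    score = 0
    prev_gap = None  # which row held the gap symbol in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            cur_gap = "a" if x == "-" else "b"
            if cur_gap != prev_gap:      # a new gap starts in this column
                score += gap_open
            score += gap_symbol
            prev_gap = cur_gap
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


print(score_with_gap_open("C---TTAACT", "CGGATCA--T"))  # prints 4
```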

Affine gap penalties
A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
Three cases for alignment endings:
1. ...x over ...x: an aligned pair
2. ...x over ...-: a deletion
3. ...- over ...x: an insertion

Affine gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

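Affine gap penalties (A gap of length k is penalized x + k·y.)
\[ D(i,j) = \max \{ D(i-1,j) - y,\; S(i-1,j) - x - y \} \]
\[ I(i,j) = \max \{ I(i,j-1) - y,\; S(i,j-1) - x - y \} \]
\[ S(i,j) = \max \{ S(i-1,j-1) + w(a_i, b_j),\; D(i,j),\; I(i,j) \} \]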

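Affine gap penalties
[Figure: each grid point carries three values, S, I, and D. A D value is reached from the D value above at cost -y or from the S value above at cost -x-y; an I value is reached from the I value to the left at cost -y or from the S value to the left at cost -x-y; an S value is reached from the diagonal S value with w(ai, bj), or from the D or I value at the same point.]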
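The three tables S, D, and I can be filled together in O(mn) time. The sketch below assumes the usual form of the recurrences (extend an open gap at cost y, or open a new one from S at cost x + y); the function name `affine_global_score` and the boundary handling are illustrative choices rather than the slides' own code.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, x=4, y=3):
    """Global alignment score when a gap of length k costs x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):            # leading gap in b of length i
        D[i][0] = S[i][0] = -x - i * y
    for j in range(1, n + 1):            # leading gap in a of length j
        I[0][j] = S[0][j] = -x - j * y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, D[i][j], I[i][j])
    return S[m][n]


print(affine_global_score("CTTAACT", "CGGATCAT"))  # prints 10
```

With x = 0 the result reduces to the simple scoring scheme, since every gap symbol then costs y and nothing more.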

Constant gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: 0 (w(-, x) = w(x, -) = 0)
Each gap is charged a constant penalty: -4.
C - - - T T A A C T
C G G A T C A - - T
+8 0 0 0 +8 -5 +8 0 0 +8 = +27
The alignment contains two gaps, each charged -4.
Alignment score: 27 - 4 - 4 = 19

Constant gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

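Constant gap penalties (Each gap is penalized a constant x, regardless of its length.)
\[ D(i,j) = \max \{ D(i-1,j),\; S(i-1,j) - x \} \]
\[ I(i,j) = \max \{ I(i,j-1),\; S(i,j-1) - x \} \]
\[ S(i,j) = \max \{ S(i-1,j-1) + w(a_i, b_j),\; D(i,j),\; I(i,j) \} \]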

Restricted affine gap penalties
A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c.
Five cases for alignment endings:
1. ...x over ...x: an aligned pair
2. ...x over ...-: a deletion
3. ...- over ...x: an insertion
4 and 5. for long gaps

Restricted affine gap penalties

D(i, j) vs. D’(i, j)
Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length <= c. Then D(i, j) >= D’(i, j).
Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c. Then D(i, j) <= D’(i, j).

k best local alignments
Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
BLAST (Altschul et al., 1990; Altschul et al., 1997)

FASTA
1. Find runs of identities, and identify regions with the highest density of identities.
2. Re-score using a PAM matrix, and keep top scoring segments.
3. Eliminate segments that are unlikely to be part of the alignment.
4. Optimize the alignment in a band.

FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: dot plot of Sequence A against Sequence B.]
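One common way to locate regions dense in identities, assumed here since the slide does not spell out the mechanics, is to index the k-tuples of one sequence and count shared k-tuples per diagonal of the dot plot. The function name `diagonal_hits` and the choice k = 2 are illustrative.

```python
from collections import Counter, defaultdict


def diagonal_hits(a, b, k=2):
    """Count shared k-tuples per diagonal, where diagonal = (offset in b) - (offset in a)."""
    index = defaultdict(list)              # k-tuple -> start offsets in a
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            counts[j - i] += 1
    return counts


counts = diagonal_hits("CAATTGA", "GAATCTGC")
print(counts.most_common(1))  # prints [(0, 2)]
```

Diagonals with the highest counts are the candidates passed on to re-scoring in Step 2.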

FASTA Step 2: Re-score using a PAM matrix, and keep top scoring segments.

FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

FASTA Step 4: Optimize the alignment in a band.
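Restricting the dynamic program to a band means evaluating only table entries close to a chosen diagonal. The sketch below is a simplification under stated assumptions: it centers the band on the main diagonal, uses the simple global recurrence with the match/mismatch/gap values from the earlier slides, and takes a hypothetical `band` half-width; FASTA itself bands around the region found in the earlier steps.

```python
def banded_global_score(a, b, band, match=8, mismatch=-5, gap=-3):
    """Global alignment score using only table entries with |i - j| <= band."""
    m, n = len(a), len(b)
    if abs(m - n) > band:
        raise ValueError("band too narrow to reach the final entry")
    NEG = float("-inf")
    # Row 0: only the first band+1 entries lie inside the band.
    prev = [j * gap if j <= band else NEG for j in range(n + 1)]
    for i in range(1, m + 1):
        # For clarity each row is allocated at full length; only the
        # entries inside the band are ever computed.
        cur = [NEG] * (n + 1)
        for j in range(max(0, i - band), min(n, i + band) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]


print(banded_global_score("CTTAACT", "CGGATCAT", band=1))  # prints 14
```

The banded score can never exceed the unrestricted optimum; it equals it whenever some optimal alignment stays inside the band, as happens in this example.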

BLAST Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman) The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences (for DNA: identities +5; mismatches -4).
The MSP score may be computed in time proportional to the product of the lengths of the two sequences. (How?)
An exact procedure is too time-consuming, so BLAST heuristically attempts to calculate the MSP score.
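One way to answer the "How?" is a gap-free version of the local-alignment table: each entry depends only on its diagonal neighbor, so the whole table costs time proportional to the product of the two lengths. A minimal sketch, with the illustrative name `msp_score` and the DNA scores quoted above:

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Best score of any pair of equal-length segments (no gaps allowed)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Either extend the segment pair along the diagonal or start afresh.
            cur[j] = max(0, prev[j - 1] + (match if x == y else mismatch))
            best = max(best, cur[j])
        prev = cur
    return best


print(msp_score("ACGTACGT", "ACGAACGT"))  # prints 31
```

Here the best segment pair is the full main diagonal: seven identities and one mismatch, 7·5 - 4 = 31.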

BLAST
1. Build the hash table for Sequence A.
2. Scan Sequence B for hits.
3. Extend hits.

BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8)
AAA
AAC
..
AGA 1
..
ATC 3
..
CGA 5
..
GAT 2 6
..
TCG 4
..
TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
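The DNA part of the table above can be sketched with a dictionary keyed by 3-tuples, followed by a scan of a second sequence for exact word hits. The names `build_lookup` and `find_hits` are illustrative, and the query "GATC" is a made-up stand-in for Sequence B; the protein case, which also adds neighboring words scoring at least T, is not covered here.

```python
from collections import defaultdict


def build_lookup(seq, k=3):
    """Map each k-tuple of seq to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return dict(table)


def find_hits(table, query, k=3):
    """Return (position in A, position in query) for every shared k-tuple."""
    hits = []
    for j in range(len(query) - k + 1):
        for pos_a in table.get(query[j:j + k], []):
            hits.append((pos_a, j + 1))
    return hits


table = build_lookup("AGATCGAT")
print(table["GAT"])              # prints [2, 6]
print(find_hits(table, "GATC"))  # prints [(2, 1), (6, 1), (3, 2)]
```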

BLAST Step 2: Scan sequence B for hits.

BLAST
Step 2: Scan sequence B for hits.
Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
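The termination rule can be illustrated with an ungapped extension to the right of an exact word hit. This is a sketch under assumptions: the function name `extend_right`, the threshold parameter `drop`, and the example sequences are hypothetical, and the scores are the DNA values (+5, -4) from the MSP slide. Extension to the left would be symmetric.

```python
def extend_right(a, b, i, j, k, match=5, mismatch=-4, drop=10):
    """Extend the exact hit a[i:i+k] == b[j:j+k] (0-based) rightwards without gaps.

    Returns (best score, length of the best segment pair starting at the hit).
    """
    score = best = k * match
    best_len = k
    p, q = i + k, j + k
    while p < len(a) and q < len(b):
        score += match if a[p] == b[q] else mismatch
        if score > best:
            best, best_len = score, p - i + 1
        elif best - score > drop:
            break                      # the score has faded too far below the best
        p += 1
        q += 1
    return best, best_len


print(extend_right("GATCGATTACA", "GATCGATGGGG", 0, 0, 3))  # prints (35, 7)
```

In this example the extension keeps growing through seven identities, then stops after three consecutive mismatches push the running score more than `drop` below the best score seen.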

Remarks Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. The idea of filtration was used in both FASTA and BLAST.