LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore
Sequence Analysis Methods
Gene and Protein Sequence Alignment as a Mathematical Problem
Example:
  Sequence a: ATTCTTGC
  Sequence b: ATCCTATTCTAGC
Best alignment (one gap marked in the figure):
  ATTCTTGC
  ATCCTATTCTAGC
Bad alignment (two gaps marked in the figure):
  AT TCTT GC
  ATCCTATTCTAGC
What is a good alignment?
How to rate an alignment?
  Match: +8 (w(x, y) = 8, if x = y)
  Mismatch: -5 (w(x, y) = -5, if x ≠ y)
  Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
  Figure: aligned columns a1 a2 a3 ... x ... over b1 b2 b3 ... y ...
Pairwise Alignment
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
An alignment of a and b:
  C---TTAACT
  CGGATCA--T
Column types labeled in the figure: insertion gap, match, mismatch, deletion gap.
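Alignment Graph
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  C---TTAACT
  CGGATCA--T
  Figure: grid with sequence b (C G G A T C A T) across the top and sequence a (C T T A A C T) down the side; the alignment is drawn as a path through the grid, with the insertion gap and the deletion gap labeled.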
Graphic representation and pathway of an alignment
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  Figure: grid with b (C G G A T C A T) across the top and a (C T T A A C T) down the side; the highlighted pathway through the grid corresponds to the alignment
  C---TTAACT
  CGGATCA--T
Graphic representation and pathway of an alignment
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  Figure: the same grid with a different pathway, corresponding to the alignment
  CTTAACT-
  CGGATCAT
Use of graph to generate alignments
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  Each pathway through the grid (b across the top, a down the side) generates one alignment.
  Pathway 1:
    -CTTAACT
    CGGATCAT
  Pathway 2:
    -C--TTAACT
    CGGATC-AT-
  Pathway 3 (as extracted from the figure):
    CTTAACT - - - - - CGGATCAT
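The correspondence between a pathway and an alignment can be sketched in a few lines of Python. The move letters are an assumption made for this illustration and do not appear on the slides: `D` is a diagonal step (one symbol from each sequence), `H` is a horizontal step (a symbol of b against a gap in a), and `V` is a vertical step (a symbol of a against a gap in b).

```python
def path_to_alignment(a, b, moves):
    """Turn a path through the alignment grid into two gapped strings.

    Assumed (hypothetical) move encoding:
      'D' diagonal   -> a[i] aligned with b[j]
      'H' horizontal -> gap in a, b[j] consumed
      'V' vertical   -> a[i] consumed, gap in b
    """
    i = j = 0
    top, bottom = [], []
    for move in moves:
        if move == "D":
            top.append(a[i]); bottom.append(b[j]); i += 1; j += 1
        elif move == "H":
            top.append("-"); bottom.append(b[j]); j += 1
        elif move == "V":
            top.append(a[i]); bottom.append("-"); i += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(a) or j != len(b):
        raise ValueError("path does not end at the bottom-right corner")
    return "".join(top), "".join(bottom)


a, b = "CTTAACT", "CGGATCAT"
for moves in ("DHHHDDDVVD", "DDDDDDDH", "HDDDDDDD"):
    top, bottom = path_to_alignment(a, b, moves)
    print(moves, top, bottom, sep="\n", end="\n\n")
```

Every path from the top-left corner to the bottom-right corner yields a valid alignment, which is why the choice among paths needs a score.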
Which pathway is better?
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  Multiple pathways are possible, and each pathway has its own alignment score.
Alignment Score
  Sequence a: CTTAACT
  Sequence b: CGGATCAT
  C---TTAACT
  CGGATCA--T
  Running score along the pathway, column by column:
    8, 8-3 = 5, 5-3 = 2, 2-3 = -1, -1+8 = 7, 7-5 = 2 (T/C is a mismatch), 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12
  Alignment score: 12
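A minimal sketch of the column-by-column scoring, using the lecture's weights (+8 match, -5 mismatch, -3 per gap symbol). The function names are illustrative only. Note that the sixth column of this alignment pairs T with C, which is a mismatch and contributes -5.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the lecture's weights."""
    if x == "-" and y == "-":
        raise ValueError("a pairwise alignment has no gap-gap columns")
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def running_scores(top, bottom):
    """Return the cumulative score after each column of the alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows must have the same length")
    totals, total = [], 0
    for x, y in zip(top, bottom):
        total += w(x, y)
        totals.append(total)
    return totals


totals = running_scores("C---TTAACT", "CGGATCA--T")
print(totals)       # cumulative score after each column
print(totals[-1])   # final alignment score
```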
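An optimal alignment -- the alignment of maximum score
  Let A = a1a2…am and B = b1b2…bn.
  Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj.
  With proper initializations, Si,j can be computed as follows:
  \[ S_{i,j} = \max \{\, S_{i-1,j-1} + w(a_i, b_j),\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j) \,\} \]
Computing Si,j
  Figure: cell (i, j) is reached from the diagonal neighbour with w(ai, bj), from the neighbour in row i-1 with w(ai, -), and from the neighbour in column j-1 with w(-, bj). The optimal alignment score is Sm,n.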
Initializations (gap symbol: -3)
  S0,0 = 0
  S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
  S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21
  Figure: matrix with b (C G G A T C A T) across the top and a (C T T A A C T) down the side; the first row reads -3 -6 -9 -12 -15 -18 -21 -24.
S1,1 = ? (Match: 8, Mismatch: -5, Gap symbol: -3)
  Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
  Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
  Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
  Optimal: S1,1 = 8
S1,2 = ? (Match: 8, Mismatch: -5, Gap symbol: -3)
  Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
  Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
  Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
  Optimal: S1,2 = 5
S2,1 = ? (Match: 8, Mismatch: -5, Gap symbol: -3)
  Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
  Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
  Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
  Optimal: S2,1 = 5
S2,2 = ? (Match: 8, Mismatch: -5, Gap symbol: -3)
  Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
  Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
  Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
  Optimal: S2,2 = 3
S3,5 = ? (a3 = T, b5 = T)
  Option 1: S3,5 = S2,4 + w(a3, b5) = -3 + 8 = 5
  Option 2: S3,5 = S2,5 + w(a3, -) = 7 - 3 = 4
  Option 3: S3,5 = S3,4 + w(-, b5) = -5 - 3 = -8
  Optimal: S3,5 = 5
  Filling the remaining cells in the same way gives the optimal score S7,8 = 14.
  Figure: partially filled score matrix with b (C G G A T C A T) across the top and a (C T T A A C T) down the side.
Optimal alignment, recovered by tracing back from S7,8:
  CTTAAC-T
  CGGATCAT
  8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
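The table-filling and traceback steps above can be sketched as follows. This is a plain implementation of the recurrence with the lecture's weights, not a reference implementation; the function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made for this sketch, and other tie-breaking orders can return a different alignment with the same score.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the global score matrix S and trace back one optimal alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: i gap symbols
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: j gap symbols
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,   # align a_i with b_j
                          S[i - 1][j] + gap,     # a_i against a gap
                          S[i][j - 1] + gap)     # b_j against a gap
    # Traceback from S[m][n] to S[0][0].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        w = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom)), S


score, top, bottom, S = global_alignment("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
print(S[3][5])   # any intermediate cell can be read off the matrix
```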
Local vs. Global Sequence Alignment
Example:
  DNA sequence a: ATTCTTGC
  DNA sequence b: ATCCTATTCTAGC
Local alignment (one gap marked in the figure; gaps ignored in local alignments):
  ATTCTTGC
  ATCCTATTCTAGC
Global alignment (two gaps marked in the figure; gaps counted in global alignments):
  AT TCTT GC
  ATCCTATTCTAGC
Global Alignment vs. Local Alignment
  Global alignment: all sections are counted.
  Local alignment: only local sections (normally separated by gaps) are counted.
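An optimal local alignment
  Si,j: the score of an optimal local alignment ending at ai and bj.
  With proper initializations, Si,j can be computed as follows:
  \[ S_{i,j} = \max \{\, 0,\; S_{i-1,j-1} + w(a_i, b_j),\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j) \,\} \]
Initializations (Match: 8, Mismatch: -5, Gap symbol: -3)
  Si,0 = 0 and S0,j = 0 for all i and j.
  Figure: matrix with b (C G G A T C A T) across the top and a (C T T A A C T) down the side.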
S1,1 = ? (Match: 8, Mismatch: -5, Gap symbol: -3)
  Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
  Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3
  Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3
  Option 4: S1,1 = 0
  Optimal: S1,1 = 8
Local alignment (Match: 8, Mismatch: -5, Gap symbol: -3)
  Figure: partially filled local score matrix with b (C G G A T C A T) across the top and a (C T T A A C T) down the side; entries shown include 8, 5, 2, 3, 13, 11, 10, 7 and 18.
  The best score is the largest entry in the matrix: 18.
  A-C-T
  ATCAT
  8 - 3 + 8 - 3 + 8 = 18
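A minimal sketch of the local variant, assuming the same weights as the global example. The only changes are the zero initialization, the extra option 0 in the maximum, and a traceback that starts at the largest entry and stops at the first zero. The function name and the tie-breaking order are choices made for this sketch.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the local score matrix and trace back from its largest entry."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback: stop as soon as a zero entry is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_alignment("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
```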
BLAST: Basic Local Alignment Search Tool
Procedure:
  Divide all sequences into overlapping constituent words (size k).
  Build the hash table for sequence a.
  Scan sequence b for hits.
  Extend hits.
BLAST Basic Local Alignment Search Tool Step 1: Hash table for sequence A
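The word-indexing and scanning steps can be sketched as below. This is a simplified illustration that only reports exact word matches; the function names, the choice k = 3, and the use of the two DNA sequences from the start of the lecture are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Hash table: every overlapping k-letter word -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def scan_for_hits(index, seq_b, k):
    """Slide over seq_b and report (position in a, position in b) for each shared word."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], []):
            hits.append((i, j))
    return hits


a, b, k = "ATTCTTGC", "ATCCTATTCTAGC", 3
index = build_word_index(a, k)
hits = scan_for_hits(index, b, k)
print(dict(index))
print(hits)
# Hits with the same value of j - i lie on the same diagonal of the alignment grid.
print({j - i for i, j in hits})
```

Building the table costs one pass over sequence a, and each word of sequence b is then looked up in constant expected time, which is what makes the scan fast compared with filling a full score matrix.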
Amino acid similarity matrix PAM 120: Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.
Amino acid similarity matrix PAM 250: This is a more widely used score matrix for ranking the level of similarity between two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.
Amino acid similarity matrix BLOSUM 45: The BLOSUM matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, BLOSUM matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.
BLAST: Basic Local Alignment Search Tool
Step 2: Use all of the 2-letter words in the query sequence to scan against the database sequence, and mark those with score > 8.
Note: Marked points can be on the diagonal or off the diagonal.
Example word-pair scores from the figure: LN:LN = 9, NF:NY = 8, GW:PW = 10
BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits. Terminate the extension if its score fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
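One way to read "terminate if the score of the extension fades away" is a drop-off rule: keep extending along the diagonal, remember the best score seen, and stop once the running score falls more than a fixed amount below that best. The sketch below is an ungapped illustration under that assumption; the parameter `x_drop`, its default value, and the reuse of the +8 / -5 weights are choices made for this example and are not taken from the slides or from the BLAST source code.

```python
def extend_one_side(a, b, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in direction step (+1 or -1); return (best gain, columns kept)."""
    run = best = kept = walked = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        run += match if a[i] == b[j] else mismatch
        walked += 1
        if run > best:
            best, kept = run, walked      # new best: keep everything up to here
        elif best - run > x_drop:
            break                         # score has faded too far below the best
        i += step
        j += step
    return best, kept


def extend_hit(a, b, i, j, k, match=8, mismatch=-5, x_drop=10):
    """Extend a length-k word hit at a[i:i+k], b[j:j+k] in both directions, without gaps.

    Returns (score, start in a, start in b, length of the extended segment).
    """
    seed = sum(match if a[i + t] == b[j + t] else mismatch for t in range(k))
    right, r_len = extend_one_side(a, b, i + k, j + k, +1, match, mismatch, x_drop)
    left, l_len = extend_one_side(a, b, i - 1, j - 1, -1, match, mismatch, x_drop)
    return seed + left + right, i - l_len, j - l_len, k + l_len + r_len


a, b = "ATTCTTGC", "ATCCTATTCTAGC"
score, start_a, start_b, length = extend_hit(a, b, 0, 5, 3)
print(score, a[start_a:start_a + length], b[start_b:start_b + length])
```

A larger `x_drop` lets the extension step over an isolated mismatch; a smaller one stops earlier and returns a shorter segment.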
Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
  Seq1: GCTC
  Seq2: AC
  Seq3: GATC
An alignment of the three sequences:
  GC-TC
  A---C
  G-ATC
How to score an MSA? Sum-of-Pairs (SP-score)
The SP-score of an MSA is the sum of the scores of all pairwise alignments it contains:
  Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
  Score(GC-TC, A---C) = -5 - 3 + 8 - 3 + 8 = 5
  Score(GC-TC, G-ATC) = 8 - 3 - 3 + 8 + 8 = 18
  Score(A---C, G-ATC) = -5 + 8 - 3 - 3 + 8 = 5
  SP-score = 5 + 18 + 5 = 28
Note that in this example a column pairing two gap symbols is scored +8, like a match.
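A minimal sketch of the sum-of-pairs computation. To reproduce the slide's arithmetic, a column that pairs two gap symbols is scored like a match (+8) by default; the `gap_gap` parameter is an addition made for this sketch so that the other common convention, scoring such columns as 0, can be compared.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3, gap_gap=8):
    """Score one column of a pairwise alignment induced by an MSA."""
    if x == "-" and y == "-":
        return gap_gap            # slide convention: counted like a match
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def pair_score(row1, row2, **weights):
    return sum(column_score(x, y, **weights) for x, y in zip(row1, row2))


def sp_score(rows, **weights):
    """Sum the pairwise scores over every pair of rows in the MSA."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    return sum(pair_score(r1, r2, **weights) for r1, r2 in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
for r1, r2 in combinations(msa, 2):
    print(r1, r2, pair_score(r1, r2))
print("SP-score:", sp_score(msa))
print("SP-score with gap-gap columns scored 0:", sp_score(msa, gap_gap=0))
```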
Position Specific Iterated BLAST
PSI-BLAST is a rather permissive alignment tool that can find more distantly related sequences than FASTA or BLAST. In many cases it is much more sensitive to weak but biologically relevant sequence similarities.
Position Specific Iterated BLAST
PSI-BLAST is used for:
  Distant homology detection
  Fold assignment: profile-profile comparison
  Domain identification
  Evolutionary analysis (e.g. tree building)
  Sequence annotation / function assignment
  Profile export to other programs
  Sequence clustering
  Structural genomics target selection
Position Specific Iterated BLAST
  Collect all database sequence segments that have been aligned with the query sequence with an E-value below the set threshold (default 0.001, but all sequences with E < 10 are displayed for manual inclusion).
  Construct a position-specific scoring matrix for the collected sequences. Rough idea:
    Align all sequences to the query sequence as the template.
    Assign weights to the sequences.
    Construct the position-specific scoring matrix.
  Iterate.
How does PSI-BLAST work?
  Take a sequence:
    MGLLTREIF--ILQQ
  Search for similar sequences in a full sequence database.
  Sequences are multiply aligned:
    MGLLTREIF--ILQQ
    FGLGRT-I-T-YMTN
    -GLVRT-I---LGLE
    FGLLRT-I---YMTQ
  Construct a profile, and represent conservation in each position numerically:
    A 029001100003200
    C 000070000000000
    .
    Y 002000080202000
  A profile holds more information than a single sequence: use the profile to retrieve additional sequences.
  Using the profile, search for similar sequences in a full sequence database. New sequences in the multiple alignment:
    MGLLTREIF--ILQQ
    FGLLRT-I-T-YMTN
    -RLTRD-I---LGLY
    FGLLRT-I---FMTS
  Construct a new profile:
    A 027005101003200
    C 000070000000000
    .
    Y 202000060202000
  After several iterations of this procedure we have:
    Sequence information, including links to annotation
    Several sets of multiple alignments
    Profiles, derived by us or by PSI-BLAST
    Threshold information (alignment statistics)
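The idea of representing conservation numerically at each position can be sketched as follows. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: all sequences get equal weight, the background frequency is assumed uniform (1/20 per amino acid), and a fixed pseudocount of 1 is added to every residue. All three choices, and the function names, are assumptions made for this sketch.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_counts(alignment, gap="-"):
    """Count the residues in each column of a multiple alignment, ignoring gaps."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    return [Counter(row[c] for row in alignment if row[c] != gap)
            for c in range(length)]


def log_odds_profile(alignment, pseudocount=1.0):
    """Per-column log2 odds of each amino acid against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for counts in column_counts(alignment):
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                        for aa in AMINO_ACIDS})
    return profile


def score_segment(profile, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


alignment = ["MGLLTREIF--ILQQ",
             "FGLLRT-I-T-YMTN",
             "-RLTRD-I---LGLY",
             "FGLLRT-I---FMTS"]
counts = column_counts(alignment)
print(counts[2])                      # fully conserved column
profile = log_odds_profile(alignment)
print(round(profile[2]["L"], 2), round(profile[2]["A"], 2))
```

Conserved columns give large positive scores for the conserved residue and negative scores for everything else, so a database segment is rewarded for matching exactly where the family is conserved.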
Consensus sequence
  A sequence where each position is defined by majority vote based on a multiple sequence alignment.
  Use the consensus sequence for database search.
  Example fragments from the figure: PEAINYGRFTPFS I KSDVW
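A minimal sketch of the majority vote. Two details are not specified on the slide and are assumptions here: gap symbols do not vote (a column made only of gaps yields a gap), and ties go to the residue that appears first from the top of the alignment, which follows from how `Counter.most_common` orders equal counts.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Majority-vote consensus of a multiple alignment, one symbol per column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    result = []
    for col in range(length):
        votes = Counter(row[col] for row in alignment if row[col] != gap)
        # most_common(1) returns the first-seen residue among tied counts.
        result.append(votes.most_common(1)[0][0] if votes else gap)
    return "".join(result)


print(consensus(["GC-TC", "A---C", "G-ATC"]))
print(consensus(["MGLLTREIF--ILQQ",
                 "FGLLRT-I-T-YMTN",
                 "-RLTRD-I---LGLY",
                 "FGLLRT-I---FMTS"]))
```

A consensus sequence keeps only the winner of each column, so it discards the information about how strongly each position is conserved; a profile retains it.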
Flow chart of PSI-BLAST
  Take a sequence -> search for similar sequences in a full sequence database -> sequences are multiply aligned:
    MGLLTREIF--ILQQ
    FGLGRT-I-T-YMTN
    -GLVRT-I---LGLE
    FGLLRT-I---YMTQ
  -> construct a profile, and represent conservation in each position numerically:
    A 029001100003200
    C 000070000000000
    .
    Y 002000080202000
  A profile holds more information than a single sequence: use the profile to retrieve additional sequences.
  -> using the profile, search for similar sequences in a full sequence database -> new sequences in the multiple alignment:
    FGLLRT-I-T-YMTN
    -RLTRD-I---LGLY
    FGLLRT-I---FMTS
  -> construct a new profile:
    A 027005101003200
    C 000070000000000
    .
    Y 202000060202000
  -> new iteration -> next new iteration ...
PSI-BLAST NCBI PSI-BLAST tutorial : http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
Summary of today's lecture
Sequence alignment methods revisited:
  Pairwise alignment
  Multiple sequence alignment
  BLAST
  PSI-BLAST
Use of PSI-BLAST to probe protein function