Bioinformatics: sequence alignment and BLAST. From Jonathan Pevsner, Bioinformatics and Functional Genomics, Wiley-Blackwell (2015)
Outline: pairwise alignment. Overview and examples. Definitions: homologs, paralogs, orthologs. Assigning scores to aligned amino acids: Dayhoff’s PAM matrices. Alignment algorithms: Needleman-Wunsch, Smith-Waterman.
Outline, part 2. BLAST: practical use, algorithm, strategies. Finding distantly related proteins: PSI-BLAST, hidden Markov models (Bao’s lecture). BLAST-like tools for genomic DNA: PatternHunter, Megablast, BLAT, BLASTZ.
Learning objectives. Define homologs, paralogs, orthologs. Perform pairwise alignments (NCBI BLAST). Understand how scores are assigned to aligned amino acids using Dayhoff’s PAM matrices. Explain how the Needleman-Wunsch algorithm performs global pairwise alignments.
Pairwise alignments in the 1950s. β-corticotropin (sheep): ala gly glu asp asp glu. Corticotropin A (pig): asp gly ala glu asp glu. Oxytocin: CYIQNCPLG. Vasopressin: CYFQNCPRG. Page 46
Early example of sequence alignment: globins (1961). [Figure: myoglobin.] H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.
Pairwise sequence alignment is the most fundamental operation of bioinformatics • It is used to decide if two proteins (or genes) are related structurally or functionally • It is used to identify domains or motifs that are shared between proteins • It is the basis of BLAST searching • It is used in the analysis of genomes Page 47
Pairwise alignment: protein sequences can be more informative than DNA • protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties • codons are degenerate: changes in the third position often do not alter the amino acid that is specified • protein sequences offer a longer “look-back” time DNA sequences can be translated into protein, and then used in pairwise alignments
Pairwise alignment: protein sequences can be more informative than DNA. Many times, however, DNA alignments are appropriate: --to confirm the identity of a cDNA --to study noncoding regions of DNA --to study DNA polymorphisms --example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
Definition: pairwise alignment The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Page 53
Definition: homology. Homology: similarity attributed to descent from a common ancestor. Page 49
myoglobin (NP_005359) 2MM1 Beta globin (NP_000509) 2HHB Page 49
Definitions: two types of homology Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Page 49
Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs. You can view these sequences at www.bioinfbook.org (document 3.1) Page 51
Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs. Page 52
Orthologs and paralogs are often viewed in a single tree Source: NCBI
General approach to pairwise alignment Choose two sequences Select an algorithm that generates a score Allow gaps (insertions, deletions) Score reflects degree of similarity Alignments can be global or local Estimate probability that the alignment occurred by chance
Calculation of an alignment score Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
Find BLAST from the home page of NCBI and select protein BLAST…
Choose align two or more sequences… Page 52
Enter the two sequences (as accession numbers or in the fasta format) and click BLAST. Optionally select “Algorithm parameters” and note the matrix option.
Pairwise alignment result of human beta globin and myoglobin. Query = HBB; Subject = MB (myoglobin RefSeq). Information about this alignment: score, expect value, identities, positives, gaps… The middle row displays identities, with a + sign for similar matches. Page 53
Pairwise alignment result of human beta globin and myoglobin: the score is a sum of match, mismatch, gap creation, and gap extension scores Page 53
Pairwise alignment result of human beta globin and myoglobin: the score is a sum of match, mismatch, gap creation, and gap extension scores. V matching V earns +4; T matching L earns -1. These scores come from a “scoring matrix”! Page 53
Definitions: homology and identity. Homology: similarity attributed to descent from a common ancestor. Identity: the extent to which two (nucleotide or amino acid) sequences are invariant. Page 50
Definition: similarity The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. Identity The extent to which two sequences are invariant. Conservation Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue. Page 51
Mind the gaps. First gap position scores -11; second gap position scores -1. Gap creation tends to have a large negative score; gap extension involves a small penalty. Page 55
Gaps • Positions at which a letter is paired with a null are called gaps. • Gap scores are typically negative. • Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. Thus there are separate penalties for gap creation and gap extension. • In BLAST, it is rarely necessary to change gap values from the default.
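The scoring rules above (a score per aligned pair, a large penalty for gap creation, a small one for gap extension) can be sketched in a few lines of Python. This is only an illustration, not BLAST's implementation: the function name `score_alignment` and the toy alignment are hypothetical, the pair scores are limited to the two values quoted on the slides (V/V = +4, T/L = -1), and the gap values are the -11 and -1 shown above.

```python
def score_alignment(top, bottom, pair_scores, gap_first=-11, gap_extend=-1):
    """Sum pair scores and affine gap penalties over two aligned strings.

    '-' marks a gap. The first position of a gap run costs gap_first;
    each further position in the same run costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_extend if gap_row == row else gap_first
            gap_row = row
        else:
            # Look the pair up in either order (the matrix is symmetric).
            key = (x, y) if (x, y) in pair_scores else (y, x)
            total += pair_scores[key]
            gap_row = None
    return total


# Only the two values quoted on the slides; a real matrix has all 210 pairs.
TOY_SCORES = {("V", "V"): 4, ("T", "L"): -1}

# Hypothetical alignment: V/V, T/L, a two-residue gap, V/V.
print(score_alignment("VT--V", "VLAAV", TOY_SCORES))  # 4 - 1 - 11 - 1 + 4 = -5
```

Because the two-residue gap costs -12 rather than -22, one long gap is penalized less than two separate gaps of the same total length, which is the point of separating creation from extension.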
Pairwise alignment of retinol-binding protein and b-lactoglobulin: Example of an alignment with internal, terminal gaps 1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss): Example of an alignment with few gaps 1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48 :: || || || .||.||. .| :|||:.|:.| |||.||||| 1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47 . . . . . 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98 |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || | 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148 ||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| | 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199 |||:||| | || || |||| :..|:| .|| : | |:|: 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
Pairwise sequence alignment allows us to look back billions of years ago (BYA). [Timeline from 4 to 1 BYA, labeled: origin of life; earliest fossils; origin of eukaryotes; eukaryote/archaea; fungi/animal; plant/animal; insects.] When you do a pairwise alignment of homologous human and plant proteins, you are studying sequences that last shared a common ancestor 1.5 billion years ago! Page 56
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: example of extremely high conservation
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 57
Multiple sequence alignment of human lipocalin paralogs: example of extremely low conservation ~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1 LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo. VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass. VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD... complement component 8 Page 57
Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey). Example: lys found at 58% of arg sites. Black: identity. Gray: very conservative substitutions (>40% occurrence). White: fairly conservative substitutions (>21% occurrence). Red: no substitutions observed. Page 93
Where we’re heading: to a PAM250 log odds scoring matrix that assigns scores… Page 69
…and to a whole series of scoring matrices such as PAM10 Page 69
Dayhoff’s 34 protein superfamilies
Protein                               PAMs per 100 million years
Ig kappa chain                        37
Kappa casein                          33
luteinizing hormone β                 30
lactalbumin                           27
complement component 3                27
epidermal growth factor               26
proopiomelanocortin                   21
pancreatic ribonuclease               21
haptoglobin alpha                     20
serum albumin                         19
phospholipase A2, group IB            19
prolactin                             17
carbonic anhydrase C                  16
Hemoglobin α                          12
Hemoglobin β                          12
apolipoprotein A-II                   10
lysozyme                              9.8
gastrin                               9.8
myoglobin                             8.9
nerve growth factor                   8.5
myelin basic protein                  7.4
thyroid stimulating hormone β         7.4
parathyroid hormone                   7.3
parvalbumin                           7.0
trypsin                               5.9
insulin                               4.4
calcitonin                            4.3
arginine vasopressin                  3.6
adenylate kinase 1                    3.2
triosephosphate isomerase 1           2.8
vasoactive intestinal peptide         2.6
glyceraldehyde phosph. dehydrogenase  2.2
cytochrome c                          2.2
collagen                              1.7
troponin C, skeletal muscle           1.5
alpha crystallin B chain              1.5
glucagon                              1.2
glutamate dehydrogenase               0.9
histone H2B, member Q                 0.9
ubiquitin                             0
Page 59
Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin
Dayhoff’s approach to assigning scores for any two aligned amino acid residues Dayhoff et al. defined the score of two aligned residues i,j as 10 times the log of how likely it is to observe these two residues (based on the empirical observation of how often they are aligned in nature) divided by the background probability of finding these amino acids by chance. This provides a score for each pair of residues. Page 58
Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins? Dayhoff (1978) p.346. Page 61
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: columns of residues may have high or low conservation fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA Page 57
The relative mutability of amino acids: Asn 134, Ser 120, Asp 106, Glu 102, Ala 100, Thr 97, Ile 96, Met 94, Gln 93, Val 74, His 66, Arg 65, Lys 56, Pro 56, Gly 49, Tyr 41, Phe 41, Leu 40, Cys 20, Trp 18. Note that alanine is normalized to a value of 100. Trp and Cys are least mutable. Asn and Ser are most mutable. Page 63
Normalized frequencies of amino acids: Gly 8.9%, Ala 8.7%, Leu 8.5%, Lys 8.1%, Ser 7.0%, Val 6.5%, Thr 5.8%, Pro 5.1%, Glu 5.0%, Asp 4.7%, Arg 4.1%, Asn 4.0%, Phe 4.0%, Gln 3.8%, Ile 3.7%, His 3.4%, Cys 3.3%, Tyr 3.0%, Met 1.5%, Trp 1.0%. (On the slide, blue marks amino acids with 6 codons and red marks those with 1 codon.) These frequencies f_i sum to 1. Page 63
Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAM. We have considered three kinds of information: [1] a table of the number of accepted point mutations (PAMs); [2] relative mutabilities of the amino acids; [3] normalized frequencies of the amino acids in PAM data. This information can be combined into a “mutation probability matrix” in which each element M_ij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM). Page 63
Dayhoff’s PAM1 mutation probability matrix Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side) Page 66
Substitution Matrix A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM.
PAM matrices: point-accepted mutations. PAM matrices are based on global alignments of closely related proteins. PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids (a single position may change more than once). All the PAM data come from closely related proteins (>85% amino acid identity). Page 63
Dayhoff’s PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins Top: original amino acid Side: replacement amino acid Page 68
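Dayhoff’s PAM2000 mutation probability matrix: the rules for very distantly related proteins. At this distance every column is the same: whatever the original amino acid, each replacement amino acid appears with a probability equal to its normalized frequency. Rows shown (replacement amino acid, same value in every column): A Ala 8.7%; R Arg 4.1%; N Asn 4.0%; D Asp 4.7%; C Cys 3.3%; Q Gln 3.8%; E Glu 5.0%; G Gly 8.9%. Top: original amino acid. Side: replacement amino acid. Page 68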
PAM250 mutation probability matrix Top: original amino acid Side: replacement amino acid Page 68
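The PAM0, PAM250, and PAM2000 matrices can be read as powers of the PAM1 mutation probability matrix: PAM0 is the zeroth power (the identity, nothing changes), and at very large distances every column approaches the background frequencies. The sketch below illustrates this with a hypothetical two-letter alphabet rather than Dayhoff's 20 × 20 data; the matrix `TOY_PAM1` and the helper names are assumptions made for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical two-letter "PAM1": element [i][j] is the probability that the
# residue in column j is replaced by the residue in row i. Columns sum to 1.
# Its stationary (background) frequencies are 0.75 and 0.25.
TOY_PAM1 = [[0.99, 0.03],
            [0.01, 0.97]]

pam0 = mat_pow(TOY_PAM1, 0)        # identity: no change at all
pam250 = mat_pow(TOY_PAM1, 250)    # extrapolated from the toy PAM1
pam2000 = mat_pow(TOY_PAM1, 2000)  # columns converge to the background

for name, m in (("PAM0", pam0), ("PAM250", pam250), ("PAM2000", pam2000)):
    print(name, [[round(x, 4) for x in row] for row in m])
```

In the toy PAM2000 both columns are essentially (0.75, 0.25), mirroring the way each row of Dayhoff's PAM2000 matrix shows a single frequency repeated across all columns.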
PAM250 log odds scoring matrix Page 69
Why do we go from a mutation probability matrix to a log odds matrix? We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them). Page 69
How do we go from a mutation probability matrix to a log odds matrix? The cells in a log odds matrix consist of an “odds ratio”: (the probability that an alignment is authentic) / (the probability that the alignment was random). The score S for an alignment of residues a,b is given by: S(a,b) = 10 × log10(M_ab / p_b). As an example, for tryptophan aligned with tryptophan, S(Trp,Trp) = 10 × log10(0.55 / 0.010) = 17.4. Page 69
Normalized frequencies of amino acids Arg 4.1% Asn 4.0% Phe 4.0% Gln 3.8% Ile 3.7% His 3.4% Cys 3.3% Tyr 3.0% Met 1.5% Trp 1.0%
What do the numbers mean in a log odds matrix? S(Trp,Trp) = 10 × log10(0.55 / 0.010) = 17.4. A score of +17 for tryptophan means that this alignment is about 50 times more likely than a chance alignment of two Trp residues. If S(a,b) = 17 and the odds ratio M_ab / p_b = x, then 17 = 10 × log10(x), so 1.7 = log10(x) and x = 10^1.7 ≈ 50. Page 58
What do the numbers mean in a log odds matrix? A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids. Page 58
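The arithmetic on the last few slides can be checked with a short Python sketch. The function names are hypothetical; the numbers (0.55 for Trp staying Trp at PAM250, 1.0% for the background frequency of Trp) are the ones quoted above.

```python
import math


def log_odds_score(m_ab, p_b):
    """Dayhoff-style score: 10 times the log10 of the odds ratio M_ab / p_b."""
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score):
    """Invert the score: how many times more likely than chance the pairing is."""
    return 10 ** (score / 10)


print(round(log_odds_score(0.55, 0.010), 1))  # Trp aligned with Trp
print(round(odds_ratio(17), 1))               # what a score of +17 means
print(round(odds_ratio(2), 1))                # what a score of +2 means
print(odds_ratio(0), odds_ratio(-10))         # neutral, and one tenth of chance
```

Working in logarithms is what allows the scores of successive aligned positions to be added rather than multiplied.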
PAM250 log odds scoring matrix Page 58
PAM10 log odds scoring matrix Page 59
[Figure: choice of scoring matrix for more conserved versus less conserved comparisons. More conserved: rat versus mouse globin (two nearly identical proteins). Less conserved: rat versus bacterial globin (two distantly related proteins).] Page 72
BLOSUM Matrices. BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix. BLOSUM62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity: within each block, sequences that are at least 62% identical are collapsed into a single cluster before substitutions are counted. Page 70
BLOSUM Matrices. [Figure: along an axis of percent amino acid identity (100, 62, 30), sequences at or above a cutoff are collapsed: 80% for BLOSUM80, 62% for BLOSUM62, 30% for BLOSUM30.]
BLOSUM Matrices. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. Page 72
BLOSUM62 scoring matrix Page 73
Two randomly diverging protein sequences change in a negatively exponential fashion. [Figure: percent identity versus evolutionary distance in PAMs, with the “twilight zone” marked.] Page 74
At PAM1, two proteins are 99% identical. At PAM10.7, there are 10 differences per 100 residues. At PAM80, there are 50 differences per 100 residues. At PAM250, there are 80 differences per 100 residues. [Figure: percent identity versus differences per 100 residues, with the “twilight zone” marked.] Page 75
PAM: “Accepted point mutation”. Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.) Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related. PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968). Page 75
Percent identity between two proteins: what percent is significant? [Figure: scale of percent identity: 100%, 80%, 65%, 30%, 23%, 19%.] We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing >30% amino acid identity over a substantial region are usually homologous.
Two kinds of sequence alignment: global and local. We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail next time. Page 76
Global alignment with the algorithm of Needleman and Wunsch (1970) • Two sequences can be compared in a matrix along x- and y-axes. • If they are identical, a path along a diagonal can be drawn • Find the optimal subpaths, and add them up to achieve the best score. This involves --adding gaps when needed --allowing for conservative substitutions --choosing a scoring system (simple or complicated) • N-W is guaranteed to find optimal alignment(s) Page 76
Three steps to global alignment with the Needleman-Wunsch algorithm: [1] set up a matrix [2] score the matrix [3] identify the optimal alignment(s) Page 76
Four possible outcomes in aligning two sequences: [1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!) Page 77
Start Needleman-Wunsch with an identity matrix Page 77
Fill in the matrix using “dynamic programming” Page 78
Traceback to find the optimal (best) pairwise alignment Page 79
Needleman-Wunsch: dynamic programming. N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score. Page 80
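The three steps (set up a matrix, score it, trace back) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the textbook's worked example: it uses a hypothetical match/mismatch score instead of a PAM or BLOSUM matrix and a single linear gap penalty instead of separate creation and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, top, bottom)."""
    rows, cols = len(a) + 1, len(b) + 1

    # [1] Set up the matrix: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # [2] Score the matrix: each cell extends the best of three subpaths.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # diagonal
                              score[i - 1][j] + gap,             # vertical
                              score[i][j - 1] + gap)             # horizontal

    # [3] Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Each cell is computed once from its three neighbors, so the optimal alignment is found without enumerating every possible alignment.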
Try using needle, an implementation of the Needleman-Wunsch global alignment algorithm, to find the optimum alignment (including gaps): http://www.ebi.ac.uk/emboss/align/ Page 81
Queries: beta globin (NP_000509), alpha globin (NP_000549)
Global alignment versus local alignment. Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other. Local alignment finds optimally matching regions within two sequences (“subsequences”). Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences. Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough. Page 82
How the Smith-Waterman algorithm works. Set up a matrix between two proteins (size m+1, n+1). No values in the scoring matrix can be negative! S ≥ 0. The score in each cell is the maximum of four values: [1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch); [2] s(i,j-1) – gap penalty; [3] s(i-1,j) – gap penalty; [4] zero. Page 82
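The four-way maximum can be written out directly. The sketch below is an illustration under assumed toy scores (match +2, mismatch -1, linear gap penalty of 2); a rigorous tool such as SSEARCH uses a full scoring matrix and affine gaps.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of strings a and b; returns (score, top, bottom)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]  # (m+1) x (n+1), first row/column zero
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(h[i - 1][j - 1] + pair(i, j),  # [1] match or mismatch
                          h[i][j - 1] + gap,             # [2] gap
                          h[i - 1][j] + gap,             # [3] gap
                          0)                             # [4] never negative
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The shared core ACGT is found even though the flanking residues differ.
print(smith_waterman("XXACGTXX", "YYYACGTY"))  # (8, 'ACGT', 'ACGT')
```

The zero floor is what makes the alignment local: a poorly matching region resets the score instead of dragging down a good region that follows it.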
Smith-Waterman algorithm allows the alignment of subsets of sequences Page 83
Try using SSEARCH to perform a rigorous Smith-Waterman local alignment: http://fasta.bioch.virginia.edu/
Queries: beta globin (NP_000509), alpha globin (NP_000549)
Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST. Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the query length and the size of an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W. Page 84
Statistical significance of pairwise alignment. We will discuss the statistical significance of alignment scores in the next lecture (BLAST). A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha?
Pairwise alignment: key points • Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity. • The score of a pairwise alignment includes positive values for exact matches, and other scores for mismatches and gaps. • PAM and BLOSUM matrices provide a set of rules for assigning scores. PAM10 and BLOSUM80 are examples of matrices appropriate for the comparison of closely related sequences. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins. • Global and local alignments can be made.
BLAST BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible.
Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include: identifying orthologs and paralogs; discovering new genes or proteins; discovering variants of genes or proteins; investigating expressed sequence tags (ESTs); exploring protein structure and function.
Four components to a BLAST search (1) Choose the sequence (query) (2) Select the BLAST program (3) Choose the database to search (4) Choose optional parameters Then click “BLAST”
Step 1: Choose your sequence Sequence can be input in FASTA format or as accession number
Example of the FASTA format for a BLAST query
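A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a rough illustration, the sketch below parses such text in plain Python. The header text is hypothetical; the sequence is the first 50 residues of the retinol-binding protein alignment shown earlier, split over two lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical header; sequence taken from the RBP alignment above.
EXAMPLE = """>example_RBP hypothetical header for illustration
MKWVWALLLLAAWAAAERDCRVSSF
RVKENFDKARFSGTWYAMAKKDPEG
"""

for name, seq in parse_fasta(EXAMPLE):
    print(name, len(seq))
```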
Step 2: Choose the BLAST program: blastn (nucleotide BLAST), blastp (protein BLAST), tblastn (translated BLAST), blastx (translated BLAST), tblastx (translated BLAST)
Choose the BLAST program
Program   Input     Database   Reading-frame comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
DNA potentially encodes six proteins
Reading frames on the top strand: 5’ CAT CAA; 5’ ATC AAC; 5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reading frames on the bottom strand: 5’ GTG GGT; 5’ TGG GTA; 5’ GGG TAG
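The six reading frames on this slide can be reproduced with a short Python sketch: three codon offsets on the given strand and three on its reverse complement. The function names are hypothetical; the sequence is the top strand from the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split both strands into codons at offsets 0, 1, and 2."""
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{strand_name}{offset + 1}"] = codons
    return frames


SLIDE_DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"

for name, codons in six_frames(SLIDE_DNA).items():
    print(name, " ".join(codons[:2]))
```

This is why blastx and tblastn each make 6 comparisons per query and database pair, and tblastx makes 6 × 6 = 36.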
Step 3: choose the database. nr = non-redundant (most general database); dbEST = database of expressed sequence tags; dbSTS = database of sequence tag sites; gss = genomic survey sequences; htgs = high throughput genomic sequence
Step 4a: Select optional search parameters: organism, Entrez query, algorithm
Step 4a: optional blastp search parameters: Expect, Word size, Scoring matrix, Filter, mask
Step 4a: optional blastn search parameters: Expect, Word size, Match/mismatch scores, Filter, mask
BLAST: optional parameters You can... • choose the organism to search • turn filtering on/off • change the substitution matrix • change the expect (e) value • change the word size • change the output format
(a) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Default settings: Unfiltered (“composition-based statistics”) Our starting point: search human insulin against worm RefSeq proteins by blastp using default parameters
(b) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: No compositional adjustment Note that the bit score, Expect value, and percent identity all change with the “no compositional adjustment” option
(c) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: conditional compositional score matrix adjustment Note that the bit score, Expect value, and percent identity all change with the compositional score matrix adjustment
(d) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: Filter low complexity regions Note that the bit score, Expect value, and percent identity all change with the filter option
Filtering (the filtered sequence is the query in lowercase and grayed out). (e) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only
(e) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only Note that the bit score, Expect value, and percent identity could change with the “mask for lookup table only” option
Step 4b: optional formatting parameters Alignment view Descriptions Alignments
BLAST format options
BLAST search output: multiple alignment format
BLAST search output: top portion. [Figure labels: database, query, program, taxonomy.]
BLAST search output: graphical output
BLAST search output: tabular output. High scores, low E values. Cut-off: 0.05? 10^-10?
BLAST search output: alignment output
Outline of today’s lecture BLAST Practical use Algorithm Strategies Finding distantly related proteins: PSI-BLAST Hidden Markov models BLAST-like tools for genomic DNA PatternHunter Megablast BLAT, BLASTZ
BLAST: background on sequence alignment There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).
BLAST: background on sequence alignment [2] The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. BLAST is a heuristic approximation to local alignment. It examines only part of the search space.
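To make the local-alignment idea concrete, here is a minimal sketch of the Smith-Waterman scoring recursion. The function name `local_alignment_score` and the scoring values (match +3, mismatch -3, linear gap penalty 2) are illustrative assumptions, not values from the lecture; a real protein search would use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def local_alignment_score(a, b, match=3, mismatch=-3, gap=2):
    """Return the best local (Smith-Waterman) alignment score of a and b.

    Uses a linear gap penalty and keeps only two rows of the table,
    because only the score is reported, not the alignment itself.
    """
    best = 0
    prev = [0] * (len(b) + 1)              # row i-1 of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)          # row i; column 0 stays 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] - gap             # gap in b
            left = curr[j - 1] - gap       # gap in a
            # The floor of 0 lets an alignment restart anywhere (local).
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TGTTACGG", "GGTTGACTA"))
```

Removing the floor of 0 and initializing the first row and column with cumulative gap penalties would turn the same recursion into a global (Needleman-Wunsch) alignment.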
How a BLAST search works “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)
How the original BLAST algorithm works: three phases
Phase 1: compile a list of word pairs (w=3) above threshold T
Example: for a human RBP query …FSGTWYA… (query word is in yellow)
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
Phase 1: compile a list of words (w=3)
Word  Residue scores  Total
GTW   6, 5, 11        22     neighborhood word hits > threshold
GSW   6, 1, 11        18
ATW   0, 5, 11        16
NTW   0, 5, 11        16
GTY   6, 5, 2         13
(threshold T=11)
GNW                   10     neighborhood word hits < below threshold
GAW                   9
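The word scores listed above can be reproduced with a short sketch. It assumes only the per-residue scores shown on the slide, plus one extra pair (W aligned with A, scored -3 as in BLOSUM62) added here purely to illustrate a word that falls below T. The names `PAIR_SCORES`, `word_score`, and `neighborhood` are hypothetical; a real implementation would score all 8,000 possible three-letter words against a full matrix.

```python
# Pairwise scores for the residue pairs needed below, keyed as
# (query residue, candidate residue). The first seven are the values
# shown on the slide; ("W", "A") = -3 is an added illustrative entry.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0,
    ("W", "Y"): 2, ("W", "A"): -3,
}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(PAIR_SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=11):
    """Return (word, score) pairs scoring at least `threshold`, best first."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda ws: ws[1], reverse=True)


if __name__ == "__main__":
    words = ["GTW", "GSW", "ATW", "NTW", "GTY", "GTA"]
    for word, score in neighborhood("GTW", words):
        print(word, score)
```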
Pairwise alignment scores are determined using a scoring matrix such as BLOSUM62
How a BLAST search works: 3 phases Phase 2: scan the database for entries that match the compiled list. This is fast and relatively easy.
How a BLAST search works: 3 phases Phase 3: when you find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
(The hit is the shared word GTW; it is extended in both directions.)
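The extension step can be sketched as an ungapped walk outward from the word hit. This is a simplified illustration under stated assumptions: identity scoring (match +5, mismatch -4) stands in for a real substitution matrix, and `x_drop` is a hypothetical name for the cutoff on how far the running score may fall below the best score seen. The function name `extend_hit` is also an assumption.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len=3,
               match=5, mismatch=-4, x_drop=10):
    """Ungapped extension of a word hit in both directions.

    Returns (score, q_start, q_end) for the best-scoring segment, with
    q_end exclusive. Extension in one direction stops once the running
    score has fallen more than `x_drop` below the best score seen.
    """
    def pair(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    score = sum(pair(query[q_pos + k], subject[s_pos + k])
                for k in range(word_len))

    def extend(step, q_i, s_i):
        """Walk outward from (q_i, s_i); return (best gain, residues kept)."""
        running = best = 0
        length = best_len = 0
        while 0 <= q_i < len(query) and 0 <= s_i < len(subject):
            running += pair(query[q_i], subject[s_i])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q_i += step
            s_i += step
        return best, best_len

    right_gain, right_len = extend(+1, q_pos + word_len, s_pos + word_len)
    left_gain, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return (score + left_gain + right_gain,
            q_pos - left_len,
            q_pos + word_len + right_len)


if __name__ == "__main__":
    rbp = "KENFDKARFSGTWYAMAKKDPEG"
    lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"
    print(extend_hit(rbp, lactoglobulin, 10, 10))  # the GTW hit
```

With identity scoring the extension is short; a substitution matrix rewards conservative replacements and would typically let the segment grow further.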
How a BLAST search works: 3 phases In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.
How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it in the legacy blastall program, enter “-f 16” or “-f 5” in the advanced options; in BLAST+ the corresponding blastp option is “-threshold”. (To find BLAST+ go to BLAST help download.)
For blastn, the word size is typically 7, 11, or 15 (EXACT match). Changing the word size for DNA is analogous to changing the threshold for proteins. w=15 gives fewer matches and is faster than w=11 or w=7. For megablast (see below), the word size is 28 and can be adjusted to 64. What will this do? Megablast is VERY fast for finding closely related DNA sequences!
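The effect of word size on the number of exact-match seeds can be illustrated with a small sketch. The function name `shared_words` and the two test sequences are made up for the example; the point is only that a longer word can never produce more seed positions than a shorter one.

```python
def shared_words(query, subject, w):
    """Count positions in `subject` whose w-mer also occurs in `query`."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    return sum(
        1
        for j in range(len(subject) - w + 1)
        if subject[j:j + w] in query_words
    )


if __name__ == "__main__":
    # Hypothetical 40-base query and a copy with one substitution.
    query = "ACGTTGCAAGCTTAGGCTAACGGATCCATGCAAGTCCGTA"
    subject = query[:20] + ("A" if query[20] != "A" else "C") + query[21:]
    for w in (7, 11, 15):
        print(w, shared_words(query, subject, w))
```

A single mismatch destroys every word that overlaps it, so larger w values leave fewer seeds to extend. That is why large words are fast but suited only to closely related sequences.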
How to interpret a BLAST search: expect value It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.
[Figure: normal probability distribution; probability (0 to 0.40) plotted against x (-5 to 5)]
The probability density function of the extreme value distribution (characteristic value u = 0 and decay constant λ = 1) [Figure: normal distribution and extreme value distribution; probability (0 to 0.40) plotted against x (-5 to 5)]
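The two curves compared here can be evaluated directly. The sketch below assumes the standard Gumbel (type I extreme value) density, $f(x) = \lambda\,e^{-\lambda(x-u) - e^{-\lambda(x-u)}}$, with u = 0 and λ = 1 as in the figure; the function names are illustrative.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def evd_pdf(x, u=0.0, lam=1.0):
    """Gumbel (type I extreme value) density with mode u and decay lam."""
    t = lam * (x - u)
    return lam * math.exp(-t - math.exp(-t))


if __name__ == "__main__":
    for x in (-3, -1, 0, 1, 3, 5):
        print(f"{x:>3}  normal={normal_pdf(x):.5f}  evd={evd_pdf(x):.5f}")
```

The extreme value density is skewed: it falls off very quickly on the left and decays slowly on the right, which is the tail that matters when judging high alignment scores.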
How to interpret a BLAST search: expect value The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: $E = Kmn\,e^{-\lambda S}$
$E = Kmn\,e^{-\lambda S}$ This equation is derived from a description of the extreme value distribution.
S = the score
E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistical parameters
Some properties of the equation $E = Kmn\,e^{-\lambda S}$ The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The expected score for aligning a random pair of amino acids must be negative; otherwise, long alignments of random sequences would accumulate high scores. Parameter K is a scaling factor for the search space (database). For E=1, one match with a similar score is expected to occur by chance. For a much larger or smaller database, you would expect E to vary accordingly.
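These properties are easy to check numerically. In the sketch below the defaults K = 0.041 and λ = 0.267 are illustrative values chosen for the example, not parameters taken from the lecture; BLAST prints the λ and K it actually used at the end of each search report. The function name `expect_value` is likewise an assumption.

```python
import math


def expect_value(score, m, n, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    m, n = 200, 5_000_000          # hypothetical query and database lengths
    for s in (40, 60, 80, 100):
        print(s, expect_value(s, m, n))
    # Doubling the database length doubles E for the same score.
    print(expect_value(80, m, 2 * n) / expect_value(80, m, n))
```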
From raw scores to bit scores There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores). Bit scores are normalized for the scoring system through λ and K, so they can be compared between different database searches, even when different scoring matrices are used: $S' = (\lambda S - \ln K)/\ln 2$ The E value corresponding to a given bit score is: $E = mn\,2^{-S'}$ The search space still enters through m and n, so the same bit score gives a larger E value in a larger database.
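A short sketch confirms that the bit-score form of the E value agrees with the raw-score form, since $2^{-S'} = K e^{-\lambda S}$. The default K and λ are illustrative assumptions, as are the function names.

```python
import math


def bit_score(raw_score, k=0.041, lam=0.267):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2.0)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score, m, n, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m, n = 200, 5_000_000          # hypothetical search space
    s = 100
    print(bit_score(s))
    print(e_from_bits(bit_score(s), m, n), e_from_raw(s, m, n))
```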
How to interpret BLAST: E values and p values The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. $p = 1 - e^{-E}$
How to interpret BLAST: E values and p values Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
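The tabulated values follow from $p = 1 - e^{-E}$ and can be regenerated with a few lines. The function name `p_from_e` is illustrative; `math.expm1` is used so that very small E values do not lose precision.

```python
import math


def p_from_e(e_value):
    """p = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
        print(f"{e:<8} {p_from_e(e):.8f}")
```

For small E the series expansion $1 - e^{-E} \approx E$ explains why E and p become nearly identical below about 0.05.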
How to interpret BLAST: overview
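[Figure: BLAST search summary with annotations: word size w = 3; 10 is the E value; gap penalties; BLOSUM matrix; threshold score = 11; length of database]
[Figure: BLAST search statistics with annotations: EVD parameters; 147 – 111 = 36; m, n, and mn; effective search space = mn = length of query x db length]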
Why set the E value to 20,000? Suppose you perform a search with a short query (e.g. 9 amino acids). There are not enough residues to accumulate a big score (or a small E value). Indeed, a match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be “real” and of interest to you. By setting the E value cutoff to 20,000 you do not change the way the search was done, but you do change which results are reported to you.
BLAST search strategies General concepts How to evaluate the significance of your results How to handle too many results How to handle too few results BLAST searching with HIV-1 pol, a multidomain protein
Sometimes a real match has an E value > 1 …try a reciprocal BLAST to confirm
Sometimes a similar E value occurs for a short exact match and a long, less exact match [Figure: two alignments with similar E values, one short and nearly exact, the other long with only 31% identity]
Assessing whether proteins are homologous RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins.
The universe of lipocalins (each dot is a protein) retinol-binding protein odorant-binding protein apolipoprotein D
BLAST search with PAEP as a query finds many other lipocalins
Using human beta globin as a query, here are the blastp results searching against human RefSeq proteins (PAM30 matrix). Where is myoglobin? It’s absent! We need to use PSI-BLAST.
Two problems standard BLAST cannot solve [1] Use human beta globin as a query against human RefSeq proteins, and blastp does not “find” human myoglobin. This is because the two proteins are too distantly related. PSI-BLAST at NCBI as well as hidden Markov models easily solve this problem. [2] How can we search using 10,000 base pairs as a query, or even millions of base pairs? Many BLAST-like tools for genomic DNA are available such as PatternHunter, Megablast, BLAT, and BLASTZ.
Specialized BLAST servers Organism-specific BLAST sites Molecule-specific BLAST sites Specialized algorithms (WU-BLAST 2.0)
Ensembl BLAST output includes an ideogram
Position specific iterated BLAST: PSI-BLAST The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query.
PSI-BLAST is performed in five steps [1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)
[Figure: residues observed at aligned positions of the multiple sequence alignment: R,I,K; C; D,E,T; K,R,T; N,L,Y,G]
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
...
37 S   2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
38 G   0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
39 T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
40 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
41 Y  -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
42 A   4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
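Step [3] scores database sequences against the PSSM rather than against a fixed matrix: each residue is looked up in the row for the position it aligns to. The sketch below uses the six PSSM rows shown earlier for positions 37 to 42 (query residues S, G, T, W, Y, A). The names `AMINO_ACIDS`, `PSSM_ROWS`, and `pssm_score` are illustrative, and gaps are ignored for simplicity.

```python
# Column order of the PSSM, as in the table header.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows for positions 37-42 of the PSSM excerpt (query residues S G T W Y A).
PSSM_ROWS = [
    [2, -1, 0, -1, -1, 0, 0, 0, -1, -2, -3, 0, -2, -3, -1, 4, 1, -3, -2, -2],
    [0, -3, -1, -2, -3, -2, -2, 6, -2, -4, -4, -2, -3, -4, -2, 0, -2, -3, -3, -4],
    [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -3, -2, 0],
    [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
    [-2, -2, -2, -3, -3, -2, -2, -3, 2, -2, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1],
    [4, -2, -2, -2, -1, -1, -1, 0, -2, -2, -2, -1, -1, -3, -1, 1, 0, -3, -2, 0],
]


def pssm_score(segment, rows=PSSM_ROWS):
    """Score an ungapped segment against consecutive PSSM positions."""
    if len(segment) != len(rows):
        raise ValueError("segment length must equal the number of PSSM rows")
    return sum(row[AMINO_ACIDS.index(res)] for row, res in zip(rows, segment))


if __name__ == "__main__":
    print(pssm_score("SGTWYA"))  # the query residues themselves
    print(pssm_score("AGTWYS"))  # a related segment with two substitutions
```

Because every position has its own row, the same substitution can be penalized differently depending on where it occurs, which is what a single PAM or BLOSUM matrix cannot express.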
Results of a PSI-BLAST search
Iteration   # hits   # hits > threshold
1           104      49
2           173      96
3           236      178
4           301      240
5           344      283
6           342      298
7           378      310
8           382      320
PSI-BLAST search: human RBP versus RefSeq, iteration 1
PSI-BLAST search: human RBP versus RefSeq, iteration 2
PSI-BLAST search: human RBP versus RefSeq, iteration 3
RBP4 match to ApoD, PSI-BLAST iteration 1
RBP4 match to ApoD, PSI-BLAST iteration 2
RBP4 match to ApoD, PSI-BLAST iteration 3
The universe of lipocalins (each dot is a protein) retinol-binding protein odorant-binding protein apolipoprotein D
Scoring matrices let you focus on the big (or small) picture retinol-binding protein Fig. 5.7 Page 151 your RBP query
Scoring matrices let you focus on the big (or small) picture PAM250 PAM30 retinol-binding protein retinol-binding protein Blosum80 Blosum45
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM retinol-binding protein retinol-binding protein
PSI-BLAST: performance assessment Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity.
PSI-BLAST: the problem of corruption PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.
PSI-BLAST: the problem of corruption Corruption is defined as the presence of at least one false positive alignment with an E value < $10^{-4}$ after five iterations. Three approaches to stopping corruption:
[1] Apply filtering of biased composition regions
[2] Adjust E value from 0.001 (default) to a lower value such as E = 0.0001.
[3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.
Conserved domain database (CDD) uses RPS-BLAST Main idea: you can search a query protein against a database of position-specific scoring matrices
Multiple sequence alignment to profile HMMs • In the 1990s people began to see that aligning sequences to profiles gave much more information than pairwise alignment alone. • Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue in each column of a multiple sequence alignment • HMMs are probabilistic models • Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments
HMMER: build a hidden Markov model
Determining effective sequence number ... done. [4]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [x]
Constructed a profile HMM (length 230)
Average score: 411.45 bits
Minimum score: 353.73 bits
Maximum score: 460.63 bits
Std. deviation: 52.58 bits
HMMER: calibrate a hidden Markov model
HMM file: lipocalins.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples: 5000
random seed: 1034351005
histogram(s) saved to: [not saved]
POSIX threads: 2
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM : x
mu : -123.894508
lambda : 0.179608
max : -79.334000
HMMER: search an HMM against GenBank
Scores for complete sequences (score includes all domains):
Sequence                          Description       Score    E-value   N
--------                          -----------       -----    -------   ---
gi|20888903|ref|XP_129259.1|      (XM_129259) ret   461.1    1.9e-133  1
gi|132407|sp|P04916|RETB_RAT      Plasma retinol-   458.0    1.7e-132  1
gi|20548126|ref|XP_005907.5|      (XM_005907) sim   454.9    1.4e-131  1
gi|5803139|ref|NP_006735.1|       (NM_006744) ret   454.6    1.7e-131  1
gi|20141667|sp|P02753|RETB_HUMAN  Plasma retinol-   451.1    1.9e-130  1
.
gi|16767588|ref|NP_463203.1|      (NC_003197) out   318.2    1.9e-90   1

gi|5803139|ref|NP_006735.1|: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
                *->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
                   mkwV++LLLLaA  +  +aAErd      Crv+s    frVkeNFD+
  gi|5803139   1   MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK   33

                   erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
                   +r++GtWY++aKkDp  E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
  gi|5803139  34   ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL   80

                   eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
                   +N+++cAD+vGT+t++E          dPak+k+Ky+GvaSflq+G+dd+
  gi|5803139  81   NNWDVCADMVGTFTDTE----------DPAKFKMKYWGVASFLQKGNDDH   120

Fig. 5.13 Page 159
PFAM is a database of HMMs and an essential resource for protein families http://pfam.sanger.ac.uk/
BLAST-related tools for genomic DNA The analysis of genomic DNA presents special challenges: There are exons (protein-coding sequence) and introns (intervening sequences). There may be sequencing errors or polymorphisms. The comparison may be between related species (e.g. human and mouse).
BLAST-related tools for genomic DNA Recently developed tools include: MegaBLAST at NCBI. BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy. See http://genome.ucsc.edu SSAHA at Ensembl uses a strategy similar to that of BLAT. See http://www.ensembl.org
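The "mirror image" strategy can be sketched as follows: index the words of the database once, then look up every word of the query in that index. This is a toy illustration, not BLAT itself. The function names `build_index` and `find_hits` are hypothetical, and the choice to index non-overlapping database words while scanning overlapping query words is an assumption of this sketch.

```python
from collections import defaultdict


def build_index(genome, k=11):
    """Map each non-overlapping k-mer of `genome` to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_hits(query, index, k=11):
    """Slide along the query one base at a time and look up every k-mer.

    Returns a list of (query position, genome position) seed hits.
    """
    hits = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], ()):
            hits.append((q_pos, g_pos))
    return hits


if __name__ == "__main__":
    # Tiny made-up example with k=5 so the words are easy to read.
    genome = "AAAAACCCCCGGGGGTTTTT"
    index = build_index(genome, k=5)
    print(find_hits("TTCCCCCGG", index, k=5))
```

Building the index is the expensive step, but it is done once per genome; each query then costs only a series of dictionary lookups, which is why this design suits many queries against one fixed assembly.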
PatternHunter
MegaBLAST at NCBI
MegaBLAST
To access BLAT, visit http://genome.ucsc.edu “BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates.” --BLAT website
Paste DNA or protein sequence here in the FASTA format
BLAT output includes browser and other formats
Blastz
Blastz (laj software): human versus rhesus duplication
Blastz (laj software): human versus rhesus gap
BLAT
BLAT
LAGAN
SSAHA
Where we are in the course --We started with “bioinformatics databases” --We next covered pairwise alignment, then BLAST in which one sequence is compared to a database --Next we’ll describe multiple sequence alignment --We’ll then visualize multiple sequence alignments as phylogenetic trees That topic spans molecular evolution.
Lab exercises Self-Test Quiz P160-161 P161-162