Finding Sequence Similarities

Presentation transcript:

Finding Sequence Similarities

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG
CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA
GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG
GTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCA
CGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGC
AACGAA

There are many programs used to do this. They range from relatively slow programs that find the exact best-matching alignment to ones that take progressively less exact shortcuts to speed things up. Of this latter class, the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

BLAST can perform a number of similar tasks with different types of sequence:

BLASTn: nucleotide sequence vs. nucleotide sequence database. FAST
BLASTp: protein sequence vs. protein sequence database. FAST
BLASTx: nucleotide sequence vs. protein sequence database, translating the nucleotide sequence in all possible reading frames. SLOW
tBLASTn: protein sequence vs. nucleotide sequence database translated into all possible reading frames. SLOWER
tBLASTx: nucleotide sequence vs. nucleotide sequence database, translating both into all possible reading frames. EXCRUCIATINGLY SLOW!

The amino acid sequence based programs use a substitution matrix to allow some amino acids to count as effective matches with each other. These are the BLOSUM and PAM matrices you may see referred to from time to time.
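
To make "all possible reading frames" concrete, here is a minimal Python sketch of a six-frame translation (three offsets on each strand). It is an illustration only, not how BLAST implements translated searches; the function names are made up for this example, and it assumes the standard genetic code and plain A/C/G/T input.

```python
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Hypothetical helper: translate frames +1..+3 and -1..-3 of a DNA string."""
    dna = dna.upper()
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(opposite[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

A translated search has to compare every one of these frames, which is one way to see why BLASTx and tBLASTn are slower than BLASTn or BLASTp, and why tBLASTx (six frames on both sides) is slower still.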

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

[Figure: the query slid along the first database sequence at different offsets. Without gaps, each offset lines up only part of the sequences; with the single gap marked '=' both parts line up:]

query          CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
               ||||||||||||||||||||||||| |||||||||||||||||||||||||
database  CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

BLAST achieves its speed through two strategies:

It 'indexes' the database sequences so it knows where all the short subsequences are in each sequence, so it doesn't have to look all the way through each sequence each time, letter by letter.

It's 'word based': it only starts looking for possible extended alignments once it has found a seed alignment that is an exact match. The default seed lengths are 11 letters for BLASTn and 3 for BLASTp. This means that some good alignments are un-findable, e.g. a 50% protein match with exactly every second amino acid matching. It relies on these 'uniformly distributed' alignments being very rare occurrences.
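
The two strategies above can be sketched in a few lines of Python. This is a deliberately simplified toy that follows the slide's description (exact-match seed words looked up in a pre-built index); it is not BLAST's actual data structure, and the function names and toy sequences are invented for illustration.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence name, target position) for each exact word hit."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, tpos in index.get(word, []):
            seeds.append((qpos, name, tpos))
    return seeds


# Toy nucleotide database, indexed once and then reused for every query.
toy_database = {"t1": "GGGACGTACGTTT", "t2": "TTTTTTTT"}
toy_index = build_word_index(toy_database, word_size=4)
dna_seeds = find_seeds("ACGTAC", toy_index, word_size=4)

# A 50% protein match where exactly every second residue matches: no 3-letter seed exists.
protein_index = build_word_index({"p1": "MAMAMAMA"}, word_size=3)
protein_seeds = find_seeds("MKMKMKMK", protein_index, word_size=3)
```

Only the seed hits would be handed on to the (much more expensive) extension step, and the alternating-match protein pair produces no seeds at all, which is the 'un-findable' case described above.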

BLAST: Typical Output

INPUT:
>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCC
CCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAA
GAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT:
Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi| |ref|NP_ | similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691
Score = 133 bits (335)  Expect = 6e-31
Identities = 76/98 (77%)  Positives = 82/98 (83%)  Gaps = 4/98 (4%)
Frame = +2

Query 26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
           MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct 1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
           SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct 60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies that I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:
1.2e-35 = 1.2 x 10^-35
4.8e6 = 4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above. Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters.
There are 4 possible nucleotides: A, C, G, T.
How many matches do we expect to find by chance?

[Figure: a stretch of database sequence with each chance occurrence of the query highlighted]
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…..CTGATTCG' (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
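
The arithmetic on this slide can be checked with a few lines of Python. This is only the naive model used here (every letter equally likely, exact ungapped matches), not the statistics BLAST actually uses; the function name is made up for the example.

```python
def expected_chance_matches(db_letters, alphabet_size, query_length):
    """Naive expectation: database size divided by the number of possible words of that length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, as quoted above

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, 4, length)
    print(f"{length:>2}-mer: {expected:.2g} expected chance matches")

sixty_mer = expected_chance_matches(REFSEQ_MRNA_LETTERS, 4, 60)
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded answer for the 60-mer is about 3.8 x 10^-28; the 5.0e-28 quoted in these slides corresponds to rounding 4^60 down to 10^36. Either way it is the same order of magnitude.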

E-values In Practice

So if I take a 60 nt sequence:

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT:
>gi| |gb|BC | Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE: ), complete cds
Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT:
>gi| |gb|BC | Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE: ), complete cds
Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

The theoretical value for RefSeq mRNA was 5.0e-28!?

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters.
There are 4 possible nucleotides: A, C, G, T.
How many matches do we expect to find by chance?

[Figure: a stretch of database sequence with each chance occurrence of the query highlighted]
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…..CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

E-values: Effect of Database Size

The E-value scales directly with database size. BLAST the same sequence against each database:

Database   Size                 E-value
RefSeq     5.0 x 10^8 letters   5.0e-28
nr         1.4 x 10^10 letters  1.4e-26

The nr database is ~30 times bigger, and so the E-value is ~30 times bigger.
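
The same proportionality can be checked against the real BLAST results shown earlier (Expect = 2e-26 against RefSeq mRNA, 6e-25 against nr). The short Python sketch below simply rescales the observed RefSeq value by the ratio of the database sizes quoted in the database statistics slide; the variable names are invented for this example.

```python
# Database sizes in letters, as quoted in the statistics slide (23rd July 2005).
refseq_mrna_letters = 503_566_580
nr_letters = 14_601_814_750

size_ratio = nr_letters / refseq_mrna_letters  # roughly 29

# Expect value reported by BLAST for the 60-mer against RefSeq mRNA.
observed_refseq_e = 2e-26

# If the E-value is proportional to database size, this is what nr should give.
predicted_nr_e = observed_refseq_e * size_ratio

print(f"size ratio: {size_ratio:.1f}")
print(f"predicted nr E-value: {predicted_nr_e:.1e}")
```

The prediction comes out at about 5.8e-25, in line with the 6e-25 that BLAST reported against nr.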

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches:

ACGTAC?GTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
ATCGATCGCGCATCGATCGT
CTAGATCGATCGCTCGCTGT
GTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match! Does it mean we are any less sure that this match didn't occur by chance?

The E-value is simply dependent on match length.

Why not just use % identity?

At some levels this is a good question. But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
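
As a quick check that the two alignments above really do share the same percent identity despite their very different lengths, here is a small Python sketch. The helper name is made up for this example, and it assumes the two strings are already aligned to equal length.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same letter (gaps never count)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Query1 (60 nt) and its match, copied from the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

# Query2 (16 nt) and its match, copied from the slide.
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)  # 45 of 60 columns match
identity2 = percent_identity(query2, match2)  # 12 of 16 columns match
```

Both come out at 75%, which is exactly why percent identity alone cannot separate a convincing 60 nt match from a 16 nt match that is expected by chance.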

So what's the real problem?

Basically you are usually trying to answer the question: Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

And the difficulty is that BLAST does not set out to address questions like orthology. BLAST only tells you about sequence similarity, with some notion of how likely a similarity is to have arisen by chance, based on some general biological principles.

You will always have to add in your own knowledge of biology, of exactly what your query sequence was, and of how it is related to your matching sequences. In particular, consider whether the degree of similarity matches up to the supposed evolutionary distance between the two species.

You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences. And of course the size of the database.

Are there any useful guidelines though?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

[Scale of E-values, from larger/worse to smaller/better: fantasy -> borderline -> encouraging -> pretty good -> can't get better]

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then % identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because there are many different DNA sequences that can give exactly the same protein sequence.

Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347,895,532 letters in that database,

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI 62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Substitutes      Residue  Substitutes
A        S                N        DHS
C        (none)           Q        REHK
F        LWY              S        ANT
G        (none)           T        S
I        LMV              Y        HFW
L        IMFV             H        NQY
M        ILV              K        RQE
P        (none)           R        QK
V        ILM              D        NE
W        FY               E        DQK

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments.

On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9, which is much closer to the actual BLAST value.
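
The substitution argument can be worked through numerically. The Python sketch below encodes the substitution table as read from this slide (the pairing of residues with substitutes is my reading of the flattened table, so treat it as an assumption), checks the "about 2 others" average, and reruns the naive E-value estimate with an effective alphabet of 7 instead of 20. The function name is invented for the example.

```python
# Substitution groups as read from the slide (an assumed reading of the flattened table).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

# Average number of acceptable substitutes per residue ("about 2 others").
average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)


def expected_chance_matches(db_letters, effective_alphabet, query_length):
    """Naive expectation: database size over the number of distinguishable words."""
    return db_letters / effective_alphabet ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347,895,532 letters, rounded as in the slides

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 7, 20)

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact matches only:   {exact_only:.1e}")
print(f"with substitutions:   {with_substitutions:.1e}")
```

Allowing substitutions moves the estimate by about nine orders of magnitude, from roughly 3 x 10^-18 to roughly 4.4 x 10^-9, which is why it cannot be ignored when reading protein E-values.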

Exercises

Go to the file random-DNA-sequences.html, randomly select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database. Is there any general difference?