Introduction to Bioinformatics BLAST
Introduction –What is BLAST? –Query Sequence Formats –What does BLAST tell you? Choices –Variety of BLAST –BLAST Programs: Which One to Use? –Commonly Used BLAST programs –BLAST Databases: Which One to Search? Understanding the Output Database Search with BLAST Blast Steps – How It Works Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources Modules
What is BLAST? Basic Local Alignment Search Tool The Google™ of bioinformatics Query is a DNA or protein sequence, not a text term Character string comparison against all the sequences in the target database Rigorous statistics used to identify statistically significant matches
Query Sequence Formats Bare sequence –QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP FLFLIKHNPTNTIVYFGRYWSP – 1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn 61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek 121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels 181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp Identifiers –accession, accession.version or gi's –e.g., p01013, AAA , , gi| FASTA format
Query Sequence in FASTA Format FASTA definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence on a single line Up to 80 nucleotide bases or amino acids per line Blank lines not allowed in the middle Example –>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED) QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP FLFLIKHNPTNTIVYFGRYWSP Additional information
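The FASTA layout described above (one ">" definition line followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch, not NCBI's parser: the function name `parse_fasta` is made up for illustration, and skipping stray blank lines is a lenient choice of this sketch (the slide says blank lines are not allowed inside a record).

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # lenient: ignore blank lines instead of rejecting the input
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example record from the slide.
EXAMPLE = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

records = parse_fasta(EXAMPLE)
defline, sequence = records[0]
print(defline)
print(len(sequence), "residues")
```

The definition line keeps everything after the ">", so identifiers such as the accession can be split out of it afterwards if needed.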
What does BLAST tell you? Putative identity and function of your query sequence Helps to direct experimental design to prove the function Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene Compare complete genomes against each other to identify similarities and differences among organisms
Variety of BLASTs:
BLAST Programs: Which One to Use? Depends on: What type of query sequence you have (nucleotide or protein) What type of database you will search against (nucleotide or protein) BLAST program descriptions –brief list –BLAST program selection guide
Commonly Used BLAST Programs Examples of BLAST programs –BLASTN Nucleic acids against nucleic acids –BLASTP Protein query against protein database Usually preferable to nucleotide-nucleotide BLAST: because the genetic code is degenerate, blastn often gives less specific results than blastp. But what if we don't have a protein query sequence? What are our options? –BLASTX Translated nucleic acids against protein database One way to do a protein BLAST search if you have a nucleotide query sequence The BLAST program does the translating for you, in all 6 reading frames
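To make "all 6 reading frames" concrete, here is a small sketch of a six-frame translation: three offsets on the given strand and three on its reverse complement. It assumes the standard genetic code and is not the translation code BLASTX itself uses; the function names and the frame labels ("+1" to "-3") are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six protein strings would then be compared against the protein database, which is why a translated search costs more than a plain blastp search.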
BLAST Databases: Which One to Search? What type of data do you want to search against? For example: Characterized sequences? Specialized sequences? Complete genomes or chromosomes? BLAST database descriptions are available in the: –BLAST help document –BLAST program selection guide
Request ID: RID An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours. If you've saved RIDs from your recent searches, you can enter them directly on the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page.
Search Results: Understanding the Output Reference to BLAST paper Reminders about your specific query –RID –query sequence reminder (contains the information from your FASTA def line) –what database you searched against Graphical summary –shows where the hits aligned to your query –colors indicate score range –mouse over a colored bar to see info about that hit Text summary (GI numbers and Def lines) –GI links to complete record in Entrez –Score links to pairwise alignment between your query sequence and the hit Pairwise alignments BLAST statistics for your search
Database Search w/ BLAST Primary use of bioinformatics –Finding similar sequences –BLAST Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.
Database Search w/ BLAST Set up format options and hit the Format button Click button! RID
Database Search w/ BLAST Versions of BLAST –BLASTN Nucleic acids against nucleic acids –BLASTP Protein query against protein database –BLASTX Translated nucleic acids against protein database –TBLASTN Protein query against translated nucleic acid database –TBLASTX Translated nucleic acids against translated nucleic acids
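The choice among these programs follows from the query type and the database type, which can be written down as a small lookup. The sketch below is illustrative only: `choose_blast_program` and its `translate_both` flag are hypothetical names, not part of any BLAST tool.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database sequence types.

    query_type and database_type are 'nucleotide' or 'protein'.
    translate_both only matters for nucleotide vs. nucleotide searches,
    where it selects the translated comparison (tblastx) over blastn.
    """
    table = {
        ("nucleotide", "nucleotide"): "tblastx" if translate_both else "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "types must be 'nucleotide' or 'protein', got "
            f"{query_type!r} and {database_type!r}"
        ) from None


print(choose_blast_program("nucleotide", "protein"))
```

For example, a cDNA query searched against a protein database maps to blastx, matching the description of translated nucleic acids against a protein database.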
Database Search w/ BLAST
BLAST graphic result
Database Search w/ BLAST BLAST result Matching sequences w/ bit-score & E-value Hyperlinks to database entry for sequence Example gi| |gb|BH |BH e-36 gi| |gb|BH |BH e-34 gi| |gb|BH |BH e-25 gi| |gb|BH |BH e-21 gi| |gb|BH |BH e-21 gi| |gb|BH |BH e-21 Hyperlinks to sequences Bit Score E-value
BLAST – Statistical Evaluation E Value – The number of distinct alignments with scores equal to or better than the observed alignment score that are expected to occur by chance in a database search. – The lower the E value, the more significant the score.
BLAST – How It Works Find high scoring local alignments between query sequence and target database Assumption –True match alignments very likely to contain within them very high scoring matches Steps 1.Seeding 2.Searching 3.Extension 4.Evaluation
BLAST Steps 1. Seeding For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) that score at least threshold T when aligned with it (scores come from the scoring matrix) Defaults: w = 3 for protein, w = 11 for DNA
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query word: PQG
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13); words below it, such as PQA 12 and PQN 12, …, are not kept
This example uses BLOSUM 62.
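The neighborhood list for the query word PQG can be reproduced by scoring every possible 3-letter word against it and keeping those that reach T = 13. The sketch below is a brute-force illustration, not how BLAST builds its lookup table. To stay short it embeds only the three BLOSUM62 rows needed for P, Q and G (typed in from the standard matrix, so treat them as an assumption to check against a reference copy); the function name `neighborhood` is made up here.

```python
from itertools import product

# Column order of the BLOSUM62 rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Only the BLOSUM62 rows for P, Q and G, which is all the word "PQG" needs.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return {neighbor: score} for all words scoring >= threshold against word.

    Raises KeyError if the word contains a residue whose row is not embedded.
    """
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[residue])) for residue in word]
    neighbors = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(row[aa] for row, aa in zip(rows, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


neighbors = neighborhood("PQG", threshold=13)
for name, score in sorted(neighbors.items(), key=lambda item: (-item[1], item[0])):
    print(name, score)
```

Enumerating all 20^3 = 8000 candidates is cheap for w = 3; lowering T enlarges the neighborhood (PQA and PQN would enter at T = 12), which increases sensitivity at the cost of more word hits to process.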
BLOSUM 62
BLAST Steps 2. Searching Determine the locations of all common “words” between the query and the database (“word hits”) Identifies all word hits
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13); words below it, such as PQA 12 and PQN 12, …, are not kept
Hit: the neighborhood word PMG occurs in the subject opposite PQG in the query
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
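The searching step amounts to scanning the database sequence for any word from the neighborhood. The sketch below does this for a single query word using the neighborhood listed on the slide and the subject sequence shown there; a real search would index the neighborhoods of every query word at once. The function name `find_word_hits` is illustrative.

```python
def find_word_hits(subject, neighborhood, w=3):
    """Return (position, word) for every w-mer of subject found in neighborhood."""
    hits = []
    for pos in range(len(subject) - w + 1):
        word = subject[pos:pos + w]
        if word in neighborhood:
            hits.append((pos, word))
    return hits


# Neighborhood of the query word PQG at T = 13, as listed on the slide.
NEIGHBORHOOD = {"PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PHG", "PMG", "PSG"}

QUERY_SEGMENT = "SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA"
SUBJECT = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"

hits = find_word_hits(SUBJECT, NEIGHBORHOOD)
query_pos = QUERY_SEGMENT.find("PQG")  # 0-based position of the query word
print("query word PQG at", query_pos)
print("subject hits:", hits)
```

Storing the neighborhood in a set makes each lookup constant time, so the scan is linear in the length of the subject sequence.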
BLAST Steps 3. Extension Extend hits to find HSPs (high-scoring segment pairs) whose scores exceed a threshold Introduce gaps using dynamic programming Problem of extension: finding the highest-scoring extension is time-consuming Solution (heuristic): extend until the score drops by X below the best score seen so far Example (Match = 1, Mismatch = -1, X = 5; the slide tracks the running score and the drop-off from the best score): ABCDEFGHIJKLMNOPQRST |||||| ||||| | ABCDEFZYIJKLMXWVUTAB
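The drop-off heuristic can be sketched for the example strings above with Match = 1, Mismatch = -1 and X = 5. This is a simplified, ungapped, one-directional extension; BLAST extends in both directions from the word hit and uses a substitution matrix rather than fixed match and mismatch scores. The function name `extend_right` is illustrative.

```python
def extend_right(query, subject, x_drop=5, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from position 0.

    Returns (best_score, best_length): the highest running score seen and the
    number of aligned positions that achieve it. Extension stops once the
    running score has fallen x_drop below the best score seen so far.
    """
    score = best = 0
    best_length = 0
    for length, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_length = score, length
        elif best - score >= x_drop:
            break  # drop-off reached: give up on extending further
    return best, best_length


QUERY = "ABCDEFGHIJKLMNOPQRST"
SUBJECT = "ABCDEFZYIJKLMXWVUTAB"

best_score, best_length = extend_right(QUERY, SUBJECT)
print(best_score, QUERY[:best_length])
```

The two mismatches after ABCDEF only cost 2, so the extension recovers and continues through IJKLM; the later run of mismatches accumulates a drop of 5, which ends the extension and trims the alignment back to the best-scoring prefix.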
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13); words below it, such as PQA 12 and PQN 12, …, are not kept
Hit extended in both directions:
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
BLAST Steps 4. Evaluation Maximal segment pairs (MSPs) – maximum-scoring HSPs Evaluate the statistical significance of extended hits (HSPs) Report only those above the determined threshold (MSPs)
BLAST – Statistical Evaluation For local, ungapped alignments: m: size of query n: size of database E: expected number of HSPs with scores of at least S p: probability of finding at least one HSP with a score of at least S good tutorial at:
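The formulas relating these quantities are not reproduced in the slide text. The sketch below uses the standard Karlin-Altschul form for local, ungapped alignments, E = K·m·n·e^(−λS) and p = 1 − e^(−E), which is an assumption about what the slide showed. The parameters λ and K depend on the scoring system; the values used here (λ = 0.318, K = 0.134) are figures often quoted for ungapped BLOSUM62 and should be treated as illustrative, as should the query and database sizes.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Expected number of HSPs scoring at least raw_score: K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one such HSP: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lam * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative assumptions, not values taken from the slides.
LAMBDA, K = 0.318, 0.134
QUERY_LENGTH, DATABASE_LENGTH = 250, 50_000_000

for raw in (40, 60, 80):
    e = e_value(raw, QUERY_LENGTH, DATABASE_LENGTH, LAMBDA, K)
    bits = bit_score(raw, LAMBDA, K)
    print(f"S={raw}  bits={bits:.1f}  E={e:.3g}  p={p_value(e):.3g}")
```

Because E falls exponentially with the score, a modest increase in S changes E by orders of magnitude, and for small E the p value is nearly equal to E.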
Interpretations of Expected Value Expected value ranges –E < → very low, homologs or identical genes –E < → moderate, may be related genes –E > 1 → high, probably / may be unrelated –0 0.5 < E < 1 → ??? In the “twilight zone” Try a more detailed search If database search –Long list of gradually declining E values → large gene family –Long regions of moderate similarity → more significant than short regions of high identity Biological relevance –Statistical significance alone is not enough: biological significance still has to be determined!