Introduction to Bioinformatics BLAST. Introduction –What is BLAST? –Query Sequence Formats –What does BLAST tell you? Choices –Variety of BLAST –BLAST.

1 Introduction to Bioinformatics BLAST

2 Introduction
–What is BLAST?
–Query Sequence Formats
–What does BLAST tell you?
Choices
–Variety of BLAST
–BLAST Programs: Which One to Use?
–Commonly Used BLAST Programs
–BLAST Databases: Which One to Search?
Understanding the Output
Database Search with BLAST
BLAST Steps – How It Works
Acknowledgement: The presentation includes adaptations from NCBI's Introduction to Molecular Biology Information Resources modules.

3 What is BLAST?
Basic Local Alignment Search Tool
The Google™ of bioinformatics
Query is a DNA or protein sequence, not a text term
Character string comparison against all the sequences in the target database
Rigorous statistics used to identify statistically significant matches

4 Query Sequence Formats
Bare sequence
–QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
–1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
Identifiers
–accession, accession.version, or GI number
–e.g., p01013, AAA68881.1, 129295, gi|129295
FASTA format

5 Query Sequence in FASTA Format
A FASTA definition line ("def line") begins with a > and is followed by some text that briefly describes the query sequence on a single line
Up to 80 nucleotide bases or amino acids per line
Blank lines are not allowed in the middle
Example
–>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Additional information
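To make the FASTA rules concrete, here is a minimal Python sketch of a reader for the format described on this slide. The function name `parse_fasta` is an illustrative choice, not part of any BLAST tool, and the sketch is more lenient than the slide's rules: it simply skips blank lines instead of rejecting them.

```python
def parse_fasta(text):
    """Return a list of (def_line, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # lenient: skip blank lines rather than raising an error
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example record from the slide.
example = "\n".join([
    ">gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)",
    "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE",
    "KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS",
    "VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP",
    "FLFLIKHNPTNTIVYFGRYWSP",
])
records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line.split("|")[3], len(sequence))
```

The def line keeps its pipe-delimited identifiers, so splitting on `|` recovers the GI number and the accession.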

6 What does BLAST tell you? Putative identity and function of your query sequence Helps to direct experimental design to prove the function Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene Compare complete genomes against each other to identify similarities and differences among organisms

7 Variety of BLASTs: http://www.ncbi.nlm.nih.gov/BLAST/

8 BLAST Programs: Which One to Use?
Depends on:
–What type of query sequence you have (nucleotide or protein)
–What type of database you will search against (nucleotide or protein)
BLAST program descriptions
–brief list
–BLAST program selection guide

9 Commonly Used BLAST Programs
Examples of BLAST programs
–BLASTN: nucleic acids against nucleic acids
–BLASTP: protein query against protein database
Usually better to use than nucleotide-nucleotide BLAST
Since the genetic code is degenerate, blastn can often give less specific results than blastp
But what if we don't have a protein query sequence? What are our options?
–BLASTX: translated nucleic acids against protein database
One way to do a protein BLAST search if you have a nucleotide query sequence
The BLAST program does the translating for you, in all 6 reading frames
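As an illustration of what "all 6 reading frames" means, the sketch below translates a DNA string in three offsets on each strand using the standard genetic code. It is a toy, not how BLASTX is implemented: the function names are made up for this example, ambiguous or incomplete codons are rendered as `X`, and stop codons as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCAAATAG")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six translations is a separate protein-like query, which is why one nucleotide sequence can be searched against a protein database.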

10 BLAST Databases: Which One to Search?
What type of data do you want to search against? For example:
–Characterized sequences?
–Specialized sequences?
–Complete genomes or chromosomes?
BLAST database descriptions are available in the:
–BLAST help document
–BLAST program selection guide

11 Request ID: RID
An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.
If you've saved RIDs from your recent searches, you can enter the RIDs directly using the "Retrieve results with a Request ID" page, which is accessible from the bottom of the BLAST home page.

12 Search Results: Understanding the Output
Reference to BLAST paper
Reminders about your specific query
–RID
–query sequence reminder (contains the information from your FASTA def line)
–what database you searched against
Graphical summary
–shows where the hits aligned to your query
–colors indicate score range
–mouse over a colored bar to see info about that hit
Text summary (GI numbers and def lines)
–GI links to complete record in Entrez
–Score links to pairwise alignment between your query sequence and the hit
Pairwise alignments
BLAST statistics for your search

13 Database Search w/ BLAST Primary use of bioinformatics –Finding similar sequences –BLAST Acknowledgement: Slides 15 – 19 are adapted from lecture notes of Professor Chau-Wen Tseng of CS Department at the University of Maryland with permission.

14 Database Search w/ BLAST Set up format options and hit the Format button Click button! RID

15 Database Search w/ BLAST
Versions of BLAST
–BLASTN: nucleic acids against nucleic acids
–BLASTP: protein query against protein database
–BLASTX: translated nucleic acids against protein database
–TBLASTN: protein query against translated nucleic acid database
–TBLASTX: translated nucleic acids against translated nucleic acids
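The list above boils down to a lookup on two questions: what kind of query you have and what kind of database you search. A small Python sketch of that decision follows; the function name and the `translate_both` flag are hypothetical conveniences for this example, not options of any BLAST tool.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database sequence types.

    translate_both is a made-up flag for this sketch: it selects tblastx
    (both sides translated) when query and database are both nucleotide.
    """
    query = query_type.lower()
    database = database_type.lower()
    valid = {"nucleotide", "protein"}
    if query not in valid or database not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translate_both else "blastn"
    if translate_both:
        raise ValueError("translate_both only applies to nucleotide vs. nucleotide")
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    return table[(query, database)]


print(choose_blast_program("nucleotide", "protein"))
```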

16 Database Search w/ BLAST

17 BLAST graphic result

18 Database Search w/ BLAST
BLAST result
–Matching sequences w/ bit score & E-value
–Hyperlinks to database entry for sequence
Example (columns: hyperlink to sequence, bit score, E-value)
gi|17330420|gb|BH384278.1|BH384278... 153 3e-36
gi|17320126|gb|BH373984.1|BH373984... 140 9e-34
gi|17338337|gb|BH392196.1|BH392196... 112 8e-25
gi|20373967|gb|BH771010.1|BH771010... 105 1e-21
gi|17314411|gb|BH368367.1|BH368367... 104 2e-21
gi|17332712|gb|BH386570.1|BH386570... 64 3e-21
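Summary lines like these are easy to process once the three columns are separated. The Python sketch below assumes the layout shown on the slide (identifier, bit score, E-value, separated by whitespace) and uses an arbitrary cutoff of 1e-30 purely for illustration; real BLAST output offers structured formats that are better suited to parsing.

```python
HIT_LINES = [
    "gi|17330420|gb|BH384278.1|BH384278... 153 3e-36",
    "gi|17320126|gb|BH373984.1|BH373984... 140 9e-34",
    "gi|17338337|gb|BH392196.1|BH392196... 112 8e-25",
    "gi|20373967|gb|BH771010.1|BH771010... 105 1e-21",
]


def parse_hit_line(line):
    """Split one summary line into (accession, bit_score, e_value)."""
    # The last two whitespace-separated fields are the score and the E-value.
    identifier, bit_score, e_value = line.rsplit(None, 2)
    accession = identifier.split("|")[3]  # gi|<number>|gb|<accession>|<locus>
    return accession, float(bit_score), float(e_value)


hits = [parse_hit_line(line) for line in HIT_LINES]
E_VALUE_CUTOFF = 1e-30  # illustrative threshold, not a recommendation
strong_hits = [hit for hit in hits if hit[2] <= E_VALUE_CUTOFF]
for accession, bit_score, e_value in strong_hits:
    print(accession, bit_score, e_value)
```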

19 BLAST – Statistical Evaluation
E value
–The number of different alignments, with scores equal to or better than the score of this alignment, that are expected to occur in a database search by chance.
–The lower the E value, the more significant the score.
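To see how strongly the E value depends on the score, here is a small Python sketch using the standard relation between a bit score S' and the E value, E = m · n · 2^(-S'), where m is the query length and n the database length. The function names and the lengths in the usage lines are made-up illustrations, and real BLAST uses adjusted "effective" lengths, so these numbers will not reproduce a BLAST report exactly.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, p = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # expm1 keeps precision when E is tiny


# Hypothetical search: a 500-residue query against 70 million residues.
for score in (30, 50, 100):
    e = expect_value(score, 500, 70_000_000)
    print(score, e, p_value(e))
```

Every additional bit of score halves E, which is why modest score differences translate into E values that differ by many orders of magnitude.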

20 BLAST – How It Works
Find high-scoring local alignments between query sequence and target database
Assumption
–True match alignments are very likely to contain very high-scoring matches within them
Steps
1. Seeding
2. Searching
3. Extension
4. Evaluation

21 BLAST Steps 1. Seeding
For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) that score at least the threshold T against it (scores are determined by using the scoring matrix)
Default w = 3 for protein, w = 11 for DNA

22 Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query word: PQG
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
This example uses BLOSUM 62.
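The neighborhood list on this slide can be reproduced by brute force: score every possible 3-letter word against the query word and keep those at or above T. The Python sketch below does that for PQG. To stay short it includes only the three BLOSUM62 rows it needs (P, Q, G), transcribed by hand for this example, so check them against an authoritative copy of the matrix before reusing them; real BLAST implementations build the list far more efficiently.

```python
from itertools import product

# Column order for the rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only (hand-transcribed for this sketch).
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORE = {row: dict(zip(AMINO_ACIDS, values)) for row, values in BLOSUM62_ROWS.items()}


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs scoring >= threshold, best first."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORE[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # sorted() is stable, so ties keep the enumeration order above.
    return sorted(hits, key=lambda item: -item[1])


for neighbor, score in neighborhood("PQG", 13):
    print(neighbor, score)
```

Lowering T to 12 admits PQA and PQN as well, which shows how the threshold trades sensitivity against the number of words to look up.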

23 BLOSUM 62

24 BLAST Steps 2. Searching Determine the locations of all common “words” between the query and the database (“word hits”) Identifies all word hits

25 Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query word: PQG
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
Hit: the neighborhood word PMG occurs in the subject, opposite PQG in the query
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
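The searching step can be sketched as a scan of the subject for any word in the neighborhood list. The Python below hard-codes the nine neighborhood words from the slide and reports where they occur in the subject; the function name is illustrative, and a real implementation would index the words of every query position at once rather than handle one query word.

```python
QUERY = "GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI"
SUBJECT = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"

# Neighborhood of the query word PQG at T = 13, as listed on the slide.
NEIGHBORHOOD = {
    "PQG": 18, "PEG": 15, "PRG": 14, "PKG": 14, "PNG": 13,
    "PDG": 13, "PHG": 13, "PMG": 13, "PSG": 13,
}


def find_word_hits(subject, neighborhood, w=3):
    """Return (subject_position, word) for every w-mer found in the neighborhood."""
    return [
        (i, subject[i:i + w])
        for i in range(len(subject) - w + 1)
        if subject[i:i + w] in neighborhood
    ]


query_position = QUERY.find("PQG")  # 0-based position of the query word
word_hits = find_word_hits(SUBJECT, NEIGHBORHOOD)
for subject_position, word in word_hits:
    print(f"query {query_position} ~ subject {subject_position}: {word}")
```

Each reported pair of positions is a seed from which the extension step starts.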

26 BLAST Steps 3. Extension
Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold
Introduce gaps using dynamic programming
Problem of extension
–Time-consuming to find the highest score
Solution (heuristic)
–Extend until the score drops by X below the best score seen so far
Example (Match = 1, Mismatch = -1, X = 5)
ABCDEFGHIJKLMNOPQRST
||||||  |||||    |
ABCDEFZYIJKLMXWVUTAB
12345654567898765654 ← Score
00000012100001234345 ← Drop-off score
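The drop-off heuristic is short enough to sketch in Python. The version below extends in one direction only, without gaps, and stops as soon as the running score has fallen X below the best score seen; stopping at "greater than or equal to X" is an assumption read off the slide's drop-off row, and the function name is made up. With the two sequences exactly as printed, column 18 pairs R with T, so this sketch counts it as a mismatch: its trace agrees with the slide for the first 17 columns and then stops at column 18 rather than column 20.

```python
def xdrop_extend(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped, one-directional extension with an X-drop stopping rule.

    Returns (best_score, best_length, trace), where trace holds one
    (column, score, drop_off) tuple per extended column.
    """
    score = best_score = best_length = 0
    trace = []
    for column, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, column
        drop_off = best_score - score
        trace.append((column, score, drop_off))
        if drop_off >= x_drop:
            break  # the score has fallen X below the best: give up
    return best_score, best_length, trace


best_score, best_length, trace = xdrop_extend(
    "ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB", x_drop=5
)
print("best score", best_score, "after", best_length, "columns")
print("scores    ", [score for _, score, _ in trace])
print("drop-offs ", [drop for _, _, drop in trace])
```

The reported HSP is the prefix that ends at the best score, not the whole stretch that was examined before the rule fired.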

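27 Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Query word: PQG
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13)
Below the threshold: PQA 12, PQN 12, …
Hit, extended in both directions (middle line: identical residues shown as letters, positive-scoring substitutions as +)
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA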

28 BLAST Steps 4. Evaluation
Maximal segment pairs (MSPs): maximum-scoring HSPs
Evaluate the statistical significance of extended hits (HSPs)
Report only those above the determined threshold (MSPs)

29 BLAST – Statistical Evaluation
For local, ungapped alignments:
–m: size of query
–n: size of database
–E: expected # of HSPs with scores at least S
–p: probability of finding at least one HSP with score at least S
Good tutorial at: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
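The equations for this slide are not present in the transcript. The definitions of m, n, E, and p match the standard Karlin-Altschul statistics for local, ungapped alignments, which are presumably what the slide showed. In that framework K and λ are constants that depend on the scoring matrix and the background residue frequencies; neither symbol is defined in the transcript, so treat them as an assumption here.

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
p = 1 - e^{-E}
```

Rescaling the raw score S to a bit score absorbs both constants, which is why bit scores from different scoring systems can be compared directly:

```latex
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m \, n \, 2^{-S'}
```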

30 Interpretations of Expected Value
Expected value ranges
–E < 10^-100 → very low; homologs or identical genes
–E < 10^-3 → moderate; may be related genes
–E > 1 → high; probably or possibly unrelated
–0.5 < E < 1 → ??? In the "twilight zone"; try a more detailed search
If database search
–Long list of gradually declining E values → large gene family
–Long regions of moderate similarity → more significant than short regions of high identity
Biological relevance
–Still need to determine biological significance!

