Introduction to Bioinformatics DNA and Protein Database Searching BLAST: Basic local alignment search tool Xiaolong Wang College of Life Sciences Ocean.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Last lecture summary.
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Searching Sequence Databases
BLAST (Basic Local Alignment Search Tool) In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood. Once a word.
Heuristic alignment algorithms and cost matrices
BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
From Pairwise Alignment to Database Similarity Search.
Similar Sequence Similar Function Charles Yan Spring 2006.
From Pairwise Alignment to Database Similarity Search.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
From Pairwise Alignment to Database Similarity Search.
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
Sequence alignment, E-value & Extreme value distribution
BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.
Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
BLAST : Basic local alignment search tool B L A S T !
Biology 224 Tom Peavy Sept 20 & 22, 2010
1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karin-Altschul.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
FASTA and BLAST Chitta Baral. FASTA : Basic Steps Step 1: –Set a word size. (usually 6 for DNA and 2 for proteins) –Make a plot. –Find the long diagonals.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?
Biology 4900 Biocomputing.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
Sequence Alignment. MRCKTETGAR MRCGTETGAR % identity 90% CATTATGATA GTTTATGATT 70%
What is BLAST? Basic BLAST search What is BLAST?
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
3/15/20161 BLAST : Basic local alignment search tools.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Chap. 4: Multiple Sequence Alignment
Introduction to Bioinformatics
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
BLAST et BLAST avancé J.S. Bernardes/H. Richard
Courtesy of Jonathan Pevsner
Basic Local Alignment Sequence Tool (BLAST)
Blast Basic Local Alignment Search Tool
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Identifying templates for protein modeling:
Sequence alignment, Part 2
Johns Hopkins School of Medicine
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Introduction to Bioinformatics DNA and Protein Database Searching BLAST: Basic local alignment search tool Xiaolong Wang College of Life Sciences Ocean University of China

2 Outline Introduction to BLAST Practical guide to database searching The BLAST algorithm BLAST search strategies Advanced BLAST Searching

3 BLAST BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible. >100,000 searches a day.

4 Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include Identifying orthologs and paralogs --Review: Altenhoff AM, Dessimoz C (2009) Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods. PLoS Comput Biol 5(1): e Discovering new genes or proteins Discovering variants of genes or proteins Investigating expressed sequence tags (ESTs) Exploring protein structure and function

5 New Interface

Four components to a BLAST search (1) Select the BLAST program (2) Input the query sequence (3) Choose the database to search (4) Choose optional parameters Then click “BLAST”

8 Step 2: Choose your sequence Step 1: Choose the BLAST program

9 Example of the FASTA format for a BLAST query >gi| |ref|NM_ | Homo sapiens dopamine receptor D4 (DRD4), mRNA ATGGGGAACCGCAGCACCGCGGACGCGGACGGGCTGCTGGCTGGGCGCGGGCCGGCCGCGGGGGCA TCTGCGGGGGCATCTGCGGGGCTGGCTGGGCAGGGCGCGGCGGCGCTGGTGGGGGGCGTGCTGCTCA TCGGCGCGGTGCTCGCGGGGAACTCGCTCGTGTGCGTGAGCGTGGCCACCGAGCGCGCCCTGCAGAC GCCCACCAACTCCTTCATCGTGAGCCTGGCGGCCGCCGACCTCCTCCTCGCTCTCCTGGTGCTGCCGC TCTTCGTCTACTCCGAGGTCCAGGGTGGCGCGTGGCTGCTGAGCCCCCGCCTGTGCGACGCCCTCATG GCCATGGACGTCATGCTGTGCACCGCCTCCATCTTCAACCTGTGCGCCATCAGCGTGGACAGGTTCGT GGCCGTGGCCGTGCCGCTGCGCTACAACCGGCAGGGTGGGAGCCGCCGGCAGCTGCTGCTCATCGGC GCCACGTGGCTGCTGTCCGCGGCGGTGGCGGCGCCCGTACTGTGCGGCCTCAACGACGTGCGCGGCC GCGACCCCGCCGTGTGCCGCCTGGAGGACCGCGACTACGTGGTCTACTCGTCCGTGTGCTCCTTCTTC CTACCCTGCCCGCTCATGCTGCTGCTCTACTGGGCCACGTTCCGCGGCCTGCAGCGCTGGGAGGTGGC ACGTCGCGCCAAGCTGCACGGCCGCGCGCCCCGCCGACCCAGCGGCCCTGGCCCGCCTTCCCCCACG CCACCCGCGCCCCGCCTCCCCCAGGACCCCTGCGGCCCCGACTGTGCGCCCCCCGCGCCCGGCCTTC CCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTCCCCCAGGACCCCTGTGGCCC CGACTGTGCGCCCCCCGCGCCCGGCCTCCCCCCGGACCCCTGCGGCTCCAACTGTGCTCCCCCCGAC GCCGTCAGAGCCGCCGCGCTCCCACCCCAGACTCCACCGCAGACCCGCAGGAGGCGGCGTGCCAAG ATCACCGGCCGGGAGCGCAAGGCCATGAGGGTCCTGCCGGTGGTGGTCGGGGCCTTCCTGCTGTGCT GGACGCCCTTCTTCGTGGTGCACATCACGCAGGCGCTGTGTCCTGCCTGCTCCGTGCCCCCGCGGCTG GTCAGCGCCGTCACCTGGCTGGGCTACGTCAACAGCGCCCTCAACCCCGTCATCTACACTGTCTTCAA CGCCGAGTTCCGCAACGTCTTCCGCAAGGCCCTGCGTGCCTGCTGCTGAGCCGGGCACCCCCGGACG CCCCCCGGCCTGATGGCCAGGCCTCAGGGACCAAGGAGATGGGGAGGGCGCT TTTGTACGTTAATTAAACAAATTCCTTCCC

10 Step 1: Choose the BLAST program blastn (nucleotide BLAST) blastp (protein BLAST) tblastn (translated BLAST) blastx (translated BLAST) tblastx (translated BLAST)

11 Choose the BLAST program ProgramInputDatabase 1 blastnDNADNA 1 blastpproteinprotein 6 blastxDNAprotein 6 tblastnprotein DNA 36 tblastxDNA DNA

12 DNA potentially encodes six proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’ Blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx is often the first analysis performed with a newly determined nucleotide sequence and is used extensively in analyzing EST sequences. This search is more sensitive than nucleotide blast since the comparison is performed at the protein level.

13 Step 3: choose the database nr = non-redundant (most general database) dbest = database of expressed sequence tags dbsts = database of sequence tag sites gss = genomic survey sequences htgs = high throughput genomic sequence

14 Step 4: Select optional search parameters You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size

15

16

17 filtering

18 NCBI blast now offers masking as lowercase/colored

19 taxonomy page 97

20 BLAST format options

21 BLAST format options: multiple sequence alignment

22 BLAST format options: multiple sequence alignment

23 BLAST format options: Tree view

24 实战练习 找到一个基因的 DNA 序列及其蛋白质序 列; 用 BLAST 找到他的 5 个同源基因的 DNA 序列及其蛋白质序列; 将他们分别用 FASTA 格式保存; 找到该基因在染色体上的位置; 说明该基因功能是什么?

25 BLAST: background on sequence alignment There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).

26 BLAST: background on sequence alignment [2] local alignment (Smith & Waterman, 1980). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. S-W is guaranteed to find optimal alignments, but it is computationally expensive (requires O (n 2 ) time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only O (n 2 ) /k time; they examine only part of the search space.

27 How a BLAST search works “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

28 How the original BLAST algorithm works: three phases: Phase 1

29 1. Fig page 101 Initial Word: Expanded List ql: ql, qm, hl, zl ln: ln, lb nf: nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf fs: fs, fa, fn, fd, fg, fp, ft, fb, ys sa: no words score 8 or more, including the initial word sa ag: ag gw: gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw,bw, zw, xw The expanded list contains any word that scores >=8 when aligned with the initial word and scored with the PAM 120 similarity table.

30 In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix. Phase 1: compile a list of word pairs (w=3) above threshold T Example: human RBP query …FSGTWYA… (query word is in blue) A list of words (w=3) is: Initial Word: FSG SGT GTW TWY WAY Expand List: YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS ……… ……… ……… ……… ………

31 Phase 1: compile a list of words (w=3) …… Scores Total GTW 6,5,11 22 neighborhoodGSW 6,1,11 18 Word hitsATW 0,5,1116 > threshold NTW 0,5,1116 GTY 6,5,213 GNW10 neighborhood GAW9 words …… < threshold (Threshold T=11)

32 DNA Pairwise alignment Scores are determined using a scoring matrix such as:

33 Protein alignment Scores are determined using a scoring matrix such as Blosum62

34 How a BLAST search works: 3 phases Phase 2: Scan the database for entries that match the compiled list. This is fast and relatively easy. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit!

35 How a BLAST search works: 3 phases Phase 3: when you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix) Stop when the score drops below some cutoff. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit! extend

36

How a BLAST search works: phases 3 Phase 3: In the original (1990) implementation of BLAST, only one hit is required and hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly shorten the time required for a search. So, BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.

38 True positivesFalse positives False negativesTrue negatives homologous sequences non-homologous sequences Sensitivity: ability to find true positives Specificity: ability to minimize false positives Sequences reported as related Sequences reported as unrelated

better worse slower faster Sensitivity Search speed small w large w lower T higher T For DNA/RNA, default word size is 11 For proteins, default word size is 3. (This yields a more accurate result than 2.) Word Size Threshold Worse better Specificity

40 How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it, enter 12, 14 or 16 in the advanced options.

41 Changing E / threshold for blastp/nr/RBP Expect (Threshold) 10 (T=11) 1 (T=11) 10,000 (T=11) 10 (T=5) 10 (T=11) 10 (T=16) 10 (BL45) 10 (PAM70) Number of hits 129m 112m 386m129m Number of sequences 1,043,4551.0m 907, m Number of extensions 5.2m 508m4.5m73, m19.5m # of successful extensions 8,367 11,4847,2881,1479,08813,873 # of sequences better than E , # of HSPs >E (no gapping) 53466, # of HSPs gapped , X1, X2, X3 16 (7.4 bits) 38 (14.6 bits) 64 (24.7 bits)

42 How to interpret a BLAST search: expect value (E) It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.

43 x probability normal distribution

44 x probability extreme value distribution normal distribution The probability density function of the extreme value distribution (characteristic value u=0 and decay constant =1)

45 The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: E = Kmn e - S How to interpret a BLAST search: expect value

46 This equation is derived from a description of the extreme value distribution S = the score E = the expect value = the number of high- scoring segment pairs (HSPs) expected to occur by chance with a score of at least S m, n = the length of two sequences, K = Karlin Altschul statistics constant E = Kmn e - S

47 Some properties of the equation E = Kmn e - S The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The E value for aligning a pair of random sequences must be small! Otherwise, long random alignments would acquire great scores. Parameter K describes the search space (database size). For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly

48 From raw scores to bit scores There are two kinds of scores: - Raw scores (calculated from a substitution matrix) - Bit scores (normalized scores) Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes S’ (bit score) = ( S - lnK) / ln2 The E value corresponding to a given bit score is: E = mn 2 -S’ Bit scores allow you to compare results between different database searches, even using different scoring matrices.

49 The expect value E is the expect number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment – the probability of an alignment that it could occur by chance: p = 1 - e -  How to interpret BLAST: E values and p values

50 Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. Ep (about 0.1) (about 0.05) (about 0.001) How to interpret BLAST: E values and p values

51 threshold score = 11 EVD parameters  BLOSUM matrix  Effective search space = mn = length of query x length db 10.0 is the E value gap penalties cut-off parameters How to interpret BLAST: getting to the bottom

52 General concepts How to evaluate the significance of your results How to handle too many results How to handle too few results BLAST searching with HIV-1 pol, a multidomain protein BLAST searching with lipocalins using different matrices BLAST search strategies

53 Sometimes a real match has an E value > 1 …try a reciprocal BLAST to confirm

54 Sometimes a similar E value occurs for a short exact match and long less exact match

55 Assessing whether proteins are homologous is not as easy as one may expect Biological significance is the gold standard For example: RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins.

56 BLAST search with PAEP as a query finds many other lipocalins

57 tBLASTx search with P53 mRNA as a query finds too many results and process killed by server

58 Local Blast using Linux PC CLuster formatdb -i Ecoli.fas -p F -o T blastn -query./test.fas -db Ecoli.fas -out result.txt blastall -p blastn -d Ecoli.fas -i./test.fas -o result.txt -a 4 -e 10 Arguments: -p type of Blast (ie blastn, blastp, blastx, tblastn, tblastx) -d database(s) to be searched, without file extension.database(s) to be searched -i input filename in fasta formatfasta format -o output filename. Default = stdout -a number of processors to use. Default=1 -e expectation value. Default=10.0