Using BLAST for Genomic Sequence Annotation Jeremy Buhler For HHMI / BIO4342 Tutorial Workshop.

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Statistics Review – Part II Topics: – Hypothesis Testing – Paired Tests – Tests of variability 1.
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Bayesian Evolutionary Distance P. Agarwal and D.J. States. Bayesian evolutionary distance. Journal of Computational Biology 3(1):1— 17, 1996.
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
EVOLUTIONARY CHANGE IN DNA SEQUENCES - usually too slow to monitor directly… … so use comparative analysis of 2 sequences which share a common ancestor.
Random Walks and BLAST Marek Kimmel (Statistics, Rice)
Measuring the degree of similarity: PAM and blosum Matrix
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Phylogenetic reconstruction
Lecture 8 Alignment of pairs of sequence Local and global alignment
BIOE 109 Summer 2009 Lecture 4- Part II Phylogenetic Inference.
Sequence Similarity Searching Class 4 March 2010.
Heuristic alignment algorithms and cost matrices
Sequence similarity.
Similar Sequence Similar Function Charles Yan Spring 2006.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Sequence Alignment - III Chitta Baral. Scoring Model When comparing sequences –Looking for evidence that they have diverged from a common ancestor by.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Sequence alignment, E-value & Extreme value distribution
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Sequence Analysis Alignments dot-plots scoring scheme Substitution matrices Search algorithms (BLAST)
Pairwise & Multiple sequence alignments
© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 9. Hypothesis Testing I: The Six Steps of Statistical Inference.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Section 10.1 ~ t Distribution for Inferences about a Mean Introduction to Probability and Statistics Ms. Young.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.
Understanding the Variability of Your Data: Dependent Variable Two "Sources" of Variability in DV (Response Variable) –Independent (Predictor/Explanatory)
Hypothesis Testing: One Sample Cases. Outline: – The logic of hypothesis testing – The Five-Step Model – Hypothesis testing for single sample means (z.
The Argument for Using Statistics Weighing the Evidence Statistical Inference: An Overview Applying Statistical Inference: An Example Going Beyond Testing.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Hypothesis Testing. Statistical Inference – dealing with parameter and model uncertainty  Confidence Intervals (credible intervals)  Hypothesis Tests.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Sequence Alignment. Assignment Read Lesk, Problem: Given two sequences R and S of length n, how many alignments of R and S are possible? If you.
Tests of Significance We use test to determine whether a “prediction” is “true” or “false”. More precisely, a test of significance gets at the question.
Pairwise Sequence Alignment. Three modifications for local alignment The scoring system uses negative scores for mismatches The minimum score for.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Pairwise Sequence Alignment and Database Searching
Computer Applications and Bioinformatics
9.3 Hypothesis Tests for Population Proportions
The Making of the Fittest Evidence of Evolution youtube
Sequence comparison: Significance of similarity scores
Sequence comparison: Multiple testing correction
I. Statistical Tests: Why do we use them? What do they involve?
Chapter 20 Phylogenetic Trees.
Sequence comparison: Significance of similarity scores
Sequence Analysis Alan Christoffels
Presentation transcript:

Using BLAST for Genomic Sequence Annotation Jeremy Buhler For HHMI / BIO4342 Tutorial Workshop

2 Overview What is comparative annotation? How to measure similarity between biosequences How to decide whether two sequences are “similar enough”

3 Comparative Annotation Identify functional elements in DNA sequence Uses comparison to databases of sequences with known function Biosequence Database New DNA sequence human CFTR gene mouse CFTR gene dog CFTR gene Probably CFTR

4 Why Does It Work? Functional sequences are under negative selection  fewer mutations More conservation  greater similarity BLAST software recognizes similarity.

5 Caveats w/Similarity Evidence Similarity without conservation random chance Conservation without selective pressure slow mutation recent divergence Similar selective pressures, but seqs have two distinct functions

6 Overview What is comparative annotation? How to measure similarity between biosequences How to decide whether two sequences are “similar enough”

7 What is Similarity? How to measure similarity of two DNA seqs? Mutations happen… Measure should reflect desired evolutionary inference

8 Mutational Model Sequences change by series of events of (only) three types: Substitution of one base Insertion of one base Deletion of one base acgacgatgatg acgacag acgacgag

9 Sequence History (1/2) Suppose seqs S, T diverged from a common ancestral sequence U…

10 Sequence History (2/2) Draw lines between bases of S and T that come from same base of U. This is a “tree alignment” of S,T,U.

11 Sequence Alignment Now elide the ancestor… Result is correspondence between bases of S, T – a sequence alignment

12 Similarity Score of Alignment Fewer mutations  more conservation Give bonus for matches, penalties for substitutions and gaps

13 One Small Problem… Do you own a time machine? If not, how do you know ancestral sequence U? history of mutation? Hence, how to get correct alignment?

14 What We Do In Practice Guess an alignment that minimizes # of hypothesized mutations (more precisely, maximizes score)

15 Overview What is comparative annotation? How to measure similarity between biosequences How to decide whether two sequences are “similar enough”

16 Deciding What to Report Any two sequences can be aligned with some score. Higher scores are better… When is score high enough to be evidence of conservation?

17 Idea: Test a Null Hypothesis Suppose two DNA seqs S, T are completely unrelated. What is probability that best alignment between S, T has score at least  ? If score(S,T) is unlikely to occur by chance, then report (S,T)

18 Null Model Assumptions Bases of seqs S, T generated independently at random acagt tccga random process S random process T

19 P-values Given random seqs S’, T’ with same base distributions as S, T Karlin-Altschul theory tells us probability that S’, T’ align with score at least  If p(  ) is small, report alignment of S,T

20 E-values For computational reasons, BLAST reports not p(  ) but rather E(  ) E(  ) = expected # times alignment with score at least  happens by chance in current search If E(  ) < 1, then score  is interesting

21 Caveats about E-values Model from which E-values are com- puted is too simple for real bioseqs Large margin of safety is wise Be very skeptical of “matches” with E > 10 -5

22 Summary Comparative annotation with BLAST uses similarity as evidence for conserved function. Similarity score based on hypothesized evolutionary relations among sequences. E-values indicate whether scores are high enough to be real biological conservation.