K-tuple methods Statistics of alignments Phylogenetics

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
Searching Sequence Databases
Heuristic alignment algorithms and cost matrices
K-tuple methods Statistics of alignments
Introduction to bioinformatics
Sequence Analysis Tools
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Sequence Alignment and Database Searching.
Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.
BLAST Workshop Maya Schushan June 2009.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Introduction to Bioinformatics Algorithms BLAST: Basic Local Alignment Search Tool Excerpts by Winfried Just.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Significance in protein analysis
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise Sequence Alignment Exercise 2. || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
What is sequencing? Video: WlxM (Illumina video) WlxM.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence similarity, BLAST alignments & multiple sequence alignments
Approximate Match FASTA, BLAST, and friends
Lecture 3.1 BLAST.
Blast Basic Local Alignment Search Tool
Identifying templates for protein modeling:
Sequence alignment, Part 2
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
CSE 5290: Algorithms for Bioinformatics Fall 2009
Presentation transcript:

K-tuple methods Statistics of alignments Phylogenetics Sequence Alignment K-tuple methods Statistics of alignments Phylogenetics

Database searches What is the problem? Large number of sequences to search your query sequence against. Various indexing schemes and heuristics are used, one of which is BLAST. heuristic is a technique to solve a problem that ignores whether the solution can be proven to be correct, but usually produces a good solution, are intended to gain computational performance or conceptual simplicity potentially at the cost of accuracy or precision. http://en.wikipedia.org/wiki/Heuristics#Computer_science

Concepts of Sequence Similarity Searching The premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypothesis concerning relatives and function.

Important Terms for Sequence Similarity Searching with very different meanings The extent to which nucleotide or protein sequences are related. In BLAST similarity refers to a positive matrix score. Identity The extent to which two (nucleotide or amino acid) sequences are invariant. Homology Similarity attributed to descent from a common ancestor.

Sequence Similarity Searching: The Approach Sequence similarity searching involves the use of a set of algorithms (such as the BLAST programs) to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared.

Blast QUERY sequence(s) BLAST results BLAST program BLAST database

Topics: BLAST program There are different blast programs Understanding the BLAST algorithm Word size HSPs (High Scoring Pairs) Understanding BLAST statistics The alignment score (S) Scoring Matrices Dealing with gaps in an alignment The expectation value (E)

The BLAST algorithm The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 for optimal local alignments to a query. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402.

http://www.ncbi.nlm.nih.gov/BLAST blastp blastn blastx tblastn tblastx

Other BLAST programs BLAST 2 Sequences (bl2seq) Aligns two sequences of your choice Gives dot-plot like output

More BLAST programs BLAST against genomes Many available BLAST parameters pre-optimized Handy for mapping query to genome Search for short exact matches Great for checking probes and primers

How Does BLAST Work? The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of “T".

Picture used with permission from Chapter 11 of “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins”

Each BLAST “hit” generates an alignment that can contain one or more high scoring pairs (HSPs)

Each BLAST “hit” generates an alignment that can contain one or more high scoring pairs (HSPs)

Where does the score (S) come from? The quality of each pair-wise alignment is represented as a score and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). The alignment score will be the sum of the scores for each position.

What’s a scoring matrix? Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query.

PAM vs. BLOSUM scoring matrices BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

PAM vs BLOSUM scoring matrices The PAM Family PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1. The BLOSUM family BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with no less than 62% divergence. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

What happens if you have a gap in the alignment? A gap is a position in the alignment at which a letter is paired with a null Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap. Hence the gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap.

Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G mismatch indel 70% identical

BLAST algorithm Keyword search of all words of length w in the query of default length n in database of length m with score above threshold w = 11 for nucleotide queries, 3 for proteins Do local alignment extension for each hit of keyword search Extend result until longest match above threshold is achieved and output

BLAST algorithm (cont’d) keyword Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK 11 GEK 11 GDK 11 Neighborhood words neighborhood score threshold (T = 13) extension Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++K Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263 High-scoring Pair (HSP)

Original BLAST Dictionary All words of length w Alignment Ungapped extensions until score falls below statistical threshold T Output All local alignments with score > statistical threshold

Original BLAST: Example A C G A A G T A A G G T C C A G T w = 4, T = 4 Exact keyword match of GGTC Extend diagonals with mismatches until score falls below a threshold Output result GTAAGGTCC GTTAGGTCC C T G A T C C T G G A T T G C G A From lectures by Serafim Batzoglou (Stanford)

Gapped BLAST: Example Original BLAST exact keyword search, THEN: A C G A A G T A A G G T C C A G T Original BLAST exact keyword search, THEN: Extend with gaps in a zone around ends of exact match Output result GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A From lectures by Serafim Batzoglou (Stanford)

Gapped BLAST : Example (cont’d) A C G A A G T A A G G T C C A G T Original BLAST exact keyword search, THEN: Extend with gaps around ends of exact match until score <T, then merge nearby alignments Output result GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A From lectures by Serafim Batzoglou (Stanford)

Topics: BLAST databases The different blast databases provided by the NCBI Protein databases Nucleotide databases Genomic databases Considerations for choosing a BLAST database Custom databases for BLAST

BLAST protein databases available at through blastp web interface @ NCBI blastp db

Considerations for choosing a BLAST database First consider your research question: Are you looking for an ortholog in a particular species? BLAST against the genome of that species. Are you looking for additional members of a protein family across all species? BLAST against nr, if you can’t find hits check wgs, htgs, and the trace archives. Are you looking to annotate genes in your species of interest? BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

When choosing a database for BLAST… It is important to know your reagents. Changing your choice of database is changing your search space completely Database size affects the BLAST statistics record BLAST parameters, database choice, database size in your bioinformatics lab book, just as you would for your wet-bench experiments. Databases change rapidly and are updated frequently It may be necessary to repeat your analyses

Topics: BLAST results Choosing the right BLAST program Running a blastp search BLAST parameters and options to consider Viewing BLAST results Look at your alignments Using the BLAST taxonomy report Lecture portion ends – should be 1hr to here (34 slides) that leaves 20 min for the remaining ~25 demo slides

BLAST parameters and options to consider: conserved domains Entrez query E-value cutoff Word size

More BLAST parameters and options to consider: filtering gap penalities matrix

Run your BLAST search: BLAST

The BLAST Queue: click for more info Note your RID

Formatting and Retrieving your BLAST results: options

A graphical view of your BLAST results:

The BLAST “hit” list: Score E-Value GenBank alignment EntrezGene

The BLAST pairwise alignments Identity Similarity

Sample BLAST output Blast of human beta globin protein against zebra fish Score E Sequences producing significant alignments: (bits) Value gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44 gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44 gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44 gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43 ALIGNMENTS >gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] Length = 148 Score = 171 bits (434), Expect = 3e-44 Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%) Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60 MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK Sbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60 Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120 V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG Sbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120 Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147 + F VQ A+QK +A V +AL +YH Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

Sample BLAST output (cont’d) Blast of human beta globin DNA against human DNA Score E Sequences producing significant alignments: (bits) Value gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75 gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end 289 1e-75 gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72 gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66 gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34 gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33 ALIGNMENTS >gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11 Length = 81706 Score = 149 bits (75), Expect = 3e-33 Identities = 183/219 (83%) Strand = Plus / Plus Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326 || ||| | || | || | |||||| ||||| ||||||||||| |||||||| Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468 Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365 ||||||||| |||||||||| ||||| |||||||||||| Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

What do the Score and the e-value really mean? The quality of the alignment is represented by the Score. Score (S) The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically . The significance of each alignment is computed as an E value. E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

E value E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S expected to occur in a database search by chance. The lower the E value, the more significant the score.

Assessing sequence homology Need to know how strong an alignment can be expected from chance alone “Chance” is the comparison of Real but non-homologous sequences Real sequences that are shuffled to preserve compositional properties Sequences that are generated randomly based upon a DNA or protein sequence model (favored)

High Scoring Pairs (HSPs) All segment pairs whose scores can not be improved by extension or trimming Need to model a random sequence to analyze how high the score is in relation to chance

Expected number of HSPs Expected number of HSPs with score > S E-value E for the score S: E = Kmne-lS Given: Two sequences, length n and m The statistics of HSP scores are characterized by two parameters K and λ K: scale for the search space size λ: scale for the scoring system

BLAST statistics to record in your bioinformatics labbook Record the statistics that are found at bottom of your BLAST results page

Scoring matrices Amino acid substitution matrices PAM BLOSUM

Bit Scores Normalized score to be able to compare sequences Bit score S’ = lS – ln(K) ln(2) E-value of bit score E = mn2-S’

Assessing the significance of an alignment How to assess the significance of an alignment between the comparison of a protein of length m to a database containing many different proteins, of varying lengths? Calculate a "database search" E-value. Multiply the pairwise-comparison E-value by the number of sequences in the database N divided by the length of the sequence in the database n

Homology: Some Guidelines Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous Low complexity regions can be highly similar without being homologous Homologous sequences not always highly similar Need to bring class back together with 10 minutes to go in classroom time.

Homology: Some Guidelines Suggested BLAST Cutoffs (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins) For nucleotide based searches, one should look for hits with E-values of 10-6 or less and sequence identity of 70% or more For protein based searches, one should look for hits with E-values of 10-3 or less and sequence identity of 25% or more Need to bring class back together with 10 minutes to go in classroom time.

Contributors http://creativecommons.org/licenses/by-sa/2.0/

Odds score in sequence alignment The chance of an aligned amino acid pair being found in alignments of related sequences compared to the chance of that pair being found in random alignments of unrelated sequences.

Statistical significance of an alignment The probability that random or unrelated sequences could be aligned to produce the same score. Smaller the probability is the better.

Alignment Statistics: For two sequences of length n and m, n times m comparisons are being made; thus the longest length of the predicted match would be log1/p(mn).

Alignment Statistics: Expectation value or the mean longest match would be E(M) = log1/p(Kmn), where K is a constant that depends on amino acid or base composition and p is the probability of a match. This is only true for ungapped local alignments.

Distribution of alignment scores resembles Gumbel extreme value distribution.

Extreme Value Distribution

Alignment Statistics E(M)=log1/p(Kmn) means that match length gets bigger as the log of the product of sequence lengths. Amino acid substitution matrices will turn match lengths into alignment scores (S). More commonly  = ln(1/p) is used. Number of longest run HSP will be estimated E = Kmne-S How good a sequence score is evaluated based on how many HSPs (i.e. E value) one would expect for that score.

Alignment Statistics Two ways to get K and  : For 10000 random amino acid sequences with various gap penalties, K and lambda parameters have been tabulated. Calculation of the distribution for two sequences being aligned by keeping one of them fixed and scrambling the other, thus preserving both the sequence length and amino acid composition.

Alignment Statistics

Alignment Statistics

Alignment Statistics

Alignment Statistics

Gene Structure

Mutation Rates

Functional Constraint

Synonymous vs nonsynonymous substitutions

Synonymous vs nonsynonymous substitutions

Synonymous vs nonsynonymous substitutions

Mutation vs substitution

Estimating substitutions

Jukes-Cantor model

Transitions vs transversions

Kimura’s 2-parameter model

Kimura’s 2-parameter model

Kimura’s 2-parameter model

Functional Constraints

Molecular Clocks

Relative Rate

Distance based phylogenetics

Distance based phylogenetics

Distance based phylogenetics

Distance based phylogenetics

Distance based phylogenetics

Distance based phylogenetics

Phylogenetics Programs