Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?


Last lecture summary

Window size? Stringency? Color mapping? Frame shifts?

Limits of detection of alignment: homology, similarity, twilight zone

Statistical significance
Key question: does a given alignment constitute evidence for homology, or did it occur just by chance? The statistical significance of the alignment (i.e. of its score) can be tested by statistical hypothesis testing. What are H0 and Ha?
Significance of local alignment: gapless, gapped.
Significance of global alignment.

Gumbel distribution (figure source: wikipedia.org)

New stuff

Database similarity searching

BLAST
Basic Local Alignment Search Tool (BLAST): the Google of the sequence world. It compares a protein or DNA sequence to other sequences in various databases and is the main tool of NCBI.
Why search a database?
To determine what orthologs and paralogs are known for a particular sequence.
To determine what proteins or genes are present in a particular organism.
To determine the identity of a DNA or protein sequence.
To determine what variants have been described for a particular gene or protein.
To investigate ESTs.
To explore amino acid residues that are important for the function and/or structure of a protein (multiple alignment of BLAST results, conserved residues).

Database searching requirements I
Given a query sequence, perform pairwise alignments between the query and the whole database (the target). Typically, this means that millions of alignments are analyzed in a BLAST search, and only the most closely related matches are returned. We are usually more interested in identifying locally matching regions such as protein domains, so global alignment (Needleman-Wunsch) is not often used. Smith-Waterman is too computationally intensive for searching a whole database. Instead, a heuristic is used, which gives a significant speedup.

Database searching requirements II
Sensitivity: the ability to find as many correct hits (TP) as possible.
Selectivity (specificity): the ability to exclude incorrect hits (FP).
Speed.
Ideally: high sensitivity, high specificity, high speed.
In reality: an increase in sensitivity leads to a decrease in specificity, and an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
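As a small illustration of the two measures, the sketch below computes them from confusion-matrix counts. The counts are made up for the example, and the slide itself only names TP and FP; the use of FN and TN in the denominators follows the usual definitions.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the true homologs in the database that the search reported."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of the unrelated sequences that the search correctly left out."""
    return tn / (tn + fp)


# Hypothetical search result: 100 true homologs and 10,000 unrelated sequences.
tp, fn = 90, 10        # homologs found / homologs missed
tn, fp = 9800, 200     # unrelated sequences rejected / reported by mistake

print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```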

Types of algorithms
Exhaustive: uses a rigorous algorithm to find the best or exact solution for a particular problem by examining all mathematical combinations. Example: dynamic programming (DP).
Heuristic: a computational strategy to find an empirical or near-optimal solution by using rules of thumb. This type of algorithm takes shortcuts by reducing the search space according to some criteria. The shortcut strategy is not guaranteed to find the best or most accurate solution.
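To make the exhaustive case concrete, here is a minimal sketch of the Smith-Waterman local alignment score computed by dynamic programming. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example, not values from the lecture. Every cell of the table is filled, which is why the cost grows with the product of the two sequence lengths.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTAA", "GGACGTCC"))
```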

Heuristic algorithms
They perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. Currently, there are two major algorithms: FASTA and BLAST. They are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster than DP. The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists.

BLAST
Parts of the algorithm: list, scan, extend.
BLAST uses the word method for pairwise alignment: find short stretches of identical (or nearly identical) letters in two sequences, called words (similar to the window in a dot plot). Basic assumption: two related sequences must have at least one word in common. Once word matches are identified, a longer alignment can be obtained by extending the similarity regions outward from the words. Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.

BLAST - list
Compile a list of “words” of a fixed length w that are derived from the query sequence. For protein searches the word size is 3; for nucleic acid (NA) searches it is 11. A threshold value T is established for the score of aligned words (this holds for proteins; for NAs exact matches are used). Words scoring at or above the threshold are collected and used to identify database matches; words below the threshold are not pursued further. The threshold score T can be lowered to identify more initial pairwise alignments. This will increase the time required to perform the search and may increase the sensitivity.
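The list step can be sketched as follows. This is only an illustration: real protein BLAST scores word pairs with a substitution matrix such as BLOSUM62, whereas the `toy_score` function here (+5 for identity, -1 otherwise) is a stand-in assumed for brevity, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(x: str, y: str) -> int:
    # Assumed toy scoring, NOT a real substitution matrix.
    return 5 if x == y else -1


def word_score(u: str, v: str) -> int:
    """Score of two equal-length words aligned without gaps."""
    return sum(toy_score(x, y) for x, y in zip(u, v))


def query_words(query: str, w: int = 3) -> list[str]:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word: str, threshold: int) -> list[str]:
    """All words that score at or above the threshold T against `word`."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= threshold]


words = query_words("MATWL")
print(words)
# Lowering T enlarges the list, which costs time but may add sensitivity.
print(len(neighborhood("MAT", 15)), len(neighborhood("MAT", 9)))
```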

BLAST - scan
After compiling a list of word pairs at or above threshold T, the BLAST algorithm scans a database for hits. This requires BLAST to search an index of the database to find entries that correspond to words on the compiled list.
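A minimal sketch of the scan step, assuming the index is a plain dictionary from each word to its positions in the database. The sequence names and the tiny database are invented for the example, and the data layout is an assumption rather than how BLAST stores its index.

```python
from collections import defaultdict


def build_index(database: dict[str, str], w: int = 3) -> dict:
    """Map every word of length w to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def scan(word_list: list[tuple[int, str]], index: dict) -> list[tuple[int, str, int]]:
    """Look up each (query offset, word) and return hits as
    (query offset, sequence id, database offset)."""
    hits = []
    for q_off, word in word_list:
        for seq_id, db_off in index.get(word, []):
            hits.append((q_off, seq_id, db_off))
    return hits


database = {"seqA": "GGMATPL", "seqB": "KKKKK"}   # hypothetical entries
index = build_index(database)
word_list = [(0, "MAT"), (1, "ATW"), (2, "TWL")]  # words from the query MATWL
print(scan(word_list, index))
```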

BLAST - extend
Extend hits to find alignments called high-scoring segment pairs (HSPs). Extend in both directions (originally ungapped; gapped BLAST is newer) and keep track of the alignment score. Extension in a given direction is terminated when the running score falls a set amount below the best score reached so far.
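The ungapped extension can be sketched as below. The cutoff is modeled as a drop-off: extension in one direction stops once the running score has fallen more than `x_drop` below the best score seen. The scoring values and the `x_drop` default are assumptions for the example, and `extend_hit` is a hypothetical name.

```python
def extend_hit(query: str, subject: str, q_pos: int, s_pos: int, w: int = 3,
               match: int = 5, mismatch: int = -4, x_drop: int = 10):
    """Extend a word hit in both directions without gaps.

    Returns (score, q_start, q_end) with q_end exclusive.
    """
    def score(a: str, b: str) -> int:
        return match if a == b else mismatch

    def extend(q: int, s: int, step: int):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # fell too far below the best score: stop extending
            q += step
            s += step
        return best, best_len

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, r_len = extend(q_pos + w, s_pos + w, +1)
    left, l_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - l_len, q_pos + w + r_len


query, subject = "AAMATWLKK", "GGMATWLCC"
total, start, end = extend_hit(query, subject, q_pos=2, s_pos=2)
print(total, query[start:end])
```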

BLAST strategy
Compare a protein or DNA query sequence to each database entry and form pairwise alignments (HSPs). When the threshold parameter is raised, the speed of the search is increased, but fewer hits are registered, and so distantly related database matches may be missed. When the threshold parameter is lowered, the search proceeds more slowly, but many more word hits are evaluated, and thus sensitivity is increased.

Recent improvement: gapped BLAST.
Variants
BLASTN: nucleotide sequences.
BLASTP: protein sequences.
BLASTX: uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database.
TBLASTN: searches protein queries against a nucleotide sequence database whose sequences are translated in all six reading frames.
TBLASTX: uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all its sequences translated in six frames.
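Translation in all six reading frames, as used by the translated variants, can be sketched like this. The sketch assumes the standard genetic code, plain A/C/G/T input, and a frame-labeling scheme ("+1" to "-3") chosen for the example; codons it cannot look up are written as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Three frames on the given strand (+1..+3) and three on the reverse strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```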

Which sequence to search?
The choice of the type of sequence also influences the sensitivity of the search. There is a clear advantage in using protein sequences for detecting homologs. If the input sequence is a protein-encoding DNA sequence, use BLASTX (it translates the query in all six reading frames before the sequence comparisons). If you are looking for protein homologs encoded in newly sequenced genomes, you may use TBLASTN. This may help to identify protein-coding genes that have not yet been annotated. If a DNA sequence is to be used as the query, a protein-level comparison can be done with TBLASTX. TBLASTN and TBLASTX are very computationally intensive, and the search process can be very slow.

BLAST Statistics Gumbel distribution

E-value
Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87, 2264–2268, 1990.

Properties of E-value I
Decreases exponentially with increasing S. Thus, a high score corresponds to a low E-value. As E approaches zero, the probability that the alignment occurred by chance approaches zero. The expected score for aligning a random pair of amino acids must be negative. Otherwise, very long alignments of two sequences could accumulate large positive scores and appear to be significantly related when they are not. The size of the database that is searched influences the likelihood that particular alignments will occur by chance. Consider a result with E = 1. This value indicates that in a database of this particular size one match with a similar score is expected to occur by chance. If the database were twice as big, there would be twice the likelihood of finding a score equal to or greater than S by chance.
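Both properties can be checked numerically with the standard Karlin-Altschul form E = K·m·n·exp(-λS), where m is the query length and n the database length. The formula is not spelled out in this transcript, and the default values of λ and K below are illustrative assumptions (of the order used for ungapped protein scoring), as are the lengths in the example.

```python
import math


def e_value(score: float, m: int, n: int,
            lam: float = 0.318, k: float = 0.134) -> float:
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: total database length.
    lam and k are assumed, illustrative Karlin-Altschul parameters.
    """
    return k * m * n * math.exp(-lam * score)


m, n = 300, 1_000_000   # hypothetical query and database sizes
for s in (40, 60, 80):
    print(s, e_value(s, m, n))          # E drops exponentially as S grows

# Doubling the database doubles E for the same score.
print(e_value(60, m, 2 * n) / e_value(60, m, n))
```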

Properties of E-value II

Bit score
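The slide body is not preserved in this transcript. As a hedged reminder of the usual definition, the bit score S' = (λS - ln K) / ln 2 normalizes a raw score so that E = m·n·2^(-S') no longer depends on the scoring system. The sketch below checks that the two routes to E agree; the λ and K defaults and the lengths are illustrative assumptions.

```python
import math


def bit_score(raw: float, lam: float = 0.318, k: float = 0.134) -> float:
    """Normalized score in bits; lam and k are assumed, illustrative parameters."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E-value from a bit score: no lambda or K needed any more."""
    return m * n * 2.0 ** (-bits)


raw, m, n = 50, 300, 1_000_000   # hypothetical values
bits = bit_score(raw)
print(bits, e_value_from_bits(bits, m, n))
```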

Relation between E and p values I

Relation between E and p values II
While BLAST reports E values rather than p values, the two measures are nearly identical for the very small values associated with strong database matches. An advantage of using E values is that it is easier to think about E values of 5 versus 10 than about the corresponding p values, which both lie very close to 1. A p-value below 0.05 is usually used to define statistical significance (what does it mean?). Thus, an E value of 0.05 or less may be considered significant.
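The relation usually quoted for this conversion is p = 1 - exp(-E), the probability of seeing at least one chance hit when E are expected. The short sketch below assumes that relation (it is not shown in the surviving slide text) and prints p for small and large E.

```python
import math


def p_from_e(e: float) -> float:
    """p = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e)


for e in (1e-6, 0.001, 0.05, 1, 5, 10):
    print(e, p_from_e(e))
```

For small E the two numbers are almost the same, while E = 5 and E = 10 both map to p values just below 1.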

Multiple comparisons correction I

Multiple comparisons correction II

E-value interpretation
E < 10^-50: extremely high confidence that the database match is a result of a homologous relationship.
E between 10^-50 and 0.01: the match can be considered a result of homology.
E between 0.01 and 10: the match is considered not significant, but may hint at tentative remote homology.
E > 10: the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection of the current method.
The E-value is proportional to the database size: as the database grows, the E-value for a given sequence match increases. However, the evolutionary relationship between the two sequences remains constant. As the database grows, one may therefore lose previously detected homologs.
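These rules of thumb can be written as a small lookup. The lowest cutoff (10^-50) is read from the lower bound of the second range, and the treatment of values falling exactly on a boundary is an assumption, since the slide gives open intervals.

```python
def interpret_e_value(e: float) -> str:
    """Map an E-value to the rule-of-thumb categories from the slide."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant, possible remote homology"
    return "unrelated or below the limit of detection"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```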