The 5 Standard BLAST Programs ProgramDatabaseQueryTypical Uses BLASTNNucleotide Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying.

Slides:

Advertisements

Similar presentations

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.

Advertisements

Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪莊凱翔.

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf

Bioinformatics Tutorial I BLAST and Sequence Alignment.

BLAST Sequence alignment, E-value & Extreme value distribution.

Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.

Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.

Sequence alignment Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.

Searching Sequence Databases

Universiteit Utrecht BLAST CD Session 2 | Wednesday 4 May 2005 Bram Raats Lee Provoost.

. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.

Lecture outline Database searches

Database searching. Purposes of similarity search Function prediction by homology (in silico annotation) Function prediction by homology (in silico annotation)

. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.

1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Similar Sequence Similar Function Charles Yan Spring 2006.

Sequence Alignment III CIS 667 February 10, 2004.

Blast heuristics Morten Nielsen Department of Systems Biology, DTU.

Sequence alignment, E-value & Extreme value distribution

Sequence alignment Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.

Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.

Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 17 th, 2013.

Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,

Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Bioinformatics and BLAST

1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒黃尹柔田耕豪蕭逸嫻謝朝茂莊閔傑 2014/05/12 1.

An Introduction to Bioinformatics

BLAST What it does and what it means Steven Slater Adapted from pt.

NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)

Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.

Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.

Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.

Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,

Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?

BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.

CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.

1 P6a Extra Discussion Slides Part 1. 2 Section A.

BLAST Basic Local Alignment Search Tool (Altschul et al. 1990)

NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.

Comp. Genomics Recitation 3 The statistics of database searching.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Database search. Overview ： 1. FastA ： is suitable for protein sequence searching 2. BLAST ： is suitable for DNA, RNA, protein sequence searching.

Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Sequence Alignment.

David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.

Step 3: Tools Database Searching

Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.

What is BLAST? Basic BLAST search What is BLAST?

Sequence similarity, BLAST alignments & multiple sequence alignments

Blast Basic Local Alignment Search Tool

Basics of BLAST Basic BLAST Search - What is BLAST?

BLAST Anders Gorm Pedersen & Rasmus Wernersson.

Fast Sequence Alignments

Sequence alignment, Part 2

Basic Local Alignment Search Tool

Sequence alignment, E-value & Extreme value distribution

CSE 5290: Algorithms for Bioinformatics Fall 2009

Presentation transcript:

The 5 Standard BLAST Programs ProgramDatabaseQueryTypical Uses BLASTNNucleotide Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts. BLASTPProtein Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis. BLASTXProteinNucleotideFinding protein-coding genes in genomic DNA. TBLASTNNucleotideProteinIdentifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA. TBLASTXNucleotide Cross-species gene prediction. Searching for genes missed by traditional methods.

WU-BLAST vs. NCBI-BLAST faster (except for BLASTN) word size unlimited nucleotide matrices gapped lambda for BLASTN links, topcomboN, kap altscore no additional output formats no PSI-BLAST, PHI-BLAST, MegaBLAST

>gi| |ref|NP_ | (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis] Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

BLAST ALGORITHMBLAST STATISTCS Word Hit Heuristic Extension Heuristic Karlin-Altschul statistics: a general theory of alignment statistics Applicability goes well beyond BLAST TWO ASPECTS OF BLAST BLAST uses Karlin-Altschul Statistics to determine the statistical significance of the alignments it produces.

BLAST ALGORITHMBLAST STATISTCS Word Hit Heuristic Extension Heuristic Karlin-Altschul statistics: a general theory of alignment statistics Applicability goes well beyond BLAST TWO ASPECTS OF BLAST BLAST uses Karlin-Altschul Statistics to determine the statistical significance of the alignments it produces.

>gi| |ref|NP_ | (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis] Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

Alignment Overview Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as 2 sequences and somtimes as query x database. Global alignment vs. local alignment –BLAST is local Maximum scoring pair (MSP) vs. High-scoring pair (HSP) –BLAST finds HSPs (usually the MSP too) Gapped vs. ungapped –BLAST can do both

The BLAST Algorithm: Seeding (W and T) RGD17 KGD14 QGD13 RGE13 EGD12 HGD12 NGD12 RGN12 AGD11 MGD11 RAD11 RGQ11 RGS11 RND11 RSD11 SGD11 TGD11 BLOSUM62 neighborhood of RGD T=12 Speed gained by minimizing search space Alignments require word hits Neighborhood words W and T modulate speed and sensitivity

The BLAST Algorithm: 2-hit Seeding Alignments tend to have multiple word hits. Isolated word hits are frequently false leads. Most alignments have large ungapped regions. Requiring 2 word hits on the same diagonal (of 40 aa for example), greatly increases speed at a slight cost in sensitivity.

The BLAST Algorithm: Extension Alignments are extended from seeds in each direction. Extension is terminated when the maximum score drops below X. The quick brown fox jumps over the lazy dog. The quiet brown cat purrs when she sees him. Text example match +1 mismatch -1 no gaps

>gi| |ref|NP_ | (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis] Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

BLAST ALGORITHMBLAST STATISTCS Word Hit Heuristic Extension Heuristic Karlin-Altschul statistics: a general theory of alignment statistics Applicability goes well beyond BLAST TWO ASPECTS OF BLAST BLAST uses Karlin-Altschul Statistics to determine the statistical significance of the alignments it produces.

BLAST STATISTCS Karlin-Altschul statistics: a general theory of alignment statistics; applicability goes well beyond BLAST Notational issues Information theory: nats & bits How alignments are scored Hw scoring schemes are created λ, E & H

5 6 4 How many runs with a score of X do we expect to find?

my $total = 0; foreach my $k (keys %frequencies){ $total += $frequencies{$k}; } my %frequences; $frequencies{‘A’} = 0.25; $frequencies{‘T’} = 0.25; $frequencies{‘G’} = 0.25; $frequencies{‘C’} = 0.25; Understanding Gaussian sum notation

A little information theory

G=A=T=C=0.25 A=T=0.45; G=C=0.05

bits vs. nats

p M =0.01 p I =0.1 q MI =0.002 S MI =log 2 (.002/0.01*0.1) = +1 bits S MI =log e (.002/0.01*0.1) = nats p R =0.1 p L =0.1 q RL =0.002 S RL =log 2 (.002/0.1*0.1) = bits S RL =log e (.002/0.01*0.1) = nats

The BLOSUM MATRICES are int(log 2 *3) ‘munge’ factor

The BLOSUM MATRICES are int(log 2 *3) ‘munge’ factor Why do this?

Recall that : λ is the number that will convert the ‘munged’ S ij back into its ‘original’ q ij for purposes of further calculation. 2 Int(3* )

2 λ allows us to recover that original q ij for purposes of further calculation

λ is found by successive approximation using the Identity below

Further calculations you can do once you know lambda Expected score Relative entropy Target frequencies Convert a raw score to a nat/bit score

Expected score of the matrix Note must be negative for K-A stats to apply What is the expected score of a +1/-3 scoring scheme?

Relative Entropy of the matrix BLOSUM 42 < BLOSUM 62 < BLOSUM 80 ‘Think of Entropy in terms of degeneracy and promiscuity’ H = far from equilibrium H = near equilibrium, alignments contain little information

Every scoring scheme is implicitly an log-odds scoring scheme. Every scoring scheme has a set of target frequencies In other words, even a simple +1/-3 scoring scheme is implictly a log odds scheme. What data justify this scheme; what imaginary data Does the scheme imply? Target Frequencies

Further calculations you can do once you know lambda Every scoring scheme is implicitly a log odds scoring matrix; Every log odds matrix has an implicit set of target frequencies. This is quite profound insight.

Commercial break!

BLAST STATISTCS The basic operations: Actual vs. Effective lengths, Raw scores, Normalized scores e.g. nat and bit scores E & P

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

The Karlin-Altschul Equation A minor constant Expected number of alignments Length of query Length of database Search space Raw score Scaling factor Normalized score

The Karlin-Altschul Equation A minor constant Expected number of alignments Length of query Length of database Search space Raw score Scaling factor Normalized score

ACTUAL vs. EFFECTIVE LENGTHS

Recall that H is nats/aligned residue, thus The ‘expected HSP length’ Dependent on search space

ACGTGTGCGCAGTGTCGCGTGTGCACACTATAGCC Actual length (m) effective length(m’) = m –l effectve length (n’) = total length db – num_seqs*l What happens if m’ < 0 ?

The Karlin-Altschul Equation A minor constant Expected number of alignments Length of query Length of database Search space Raw score Scaling factor Normalized score ’’

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49 Converting a raw score to a bit score

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49 Converting a raw score or a bit score to an Expect

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49 Converting an Expect to a WU-BLAST P value

Note that E ~= P if either value < 1e -5 Converting an Expect to a WU-BLAST P value

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49 Review: where the parts of an HSP come from, and what they mean

Why use Karlin-Altschul statistics? Why not just stop with the raw score?

Why use Karlin-Altschul statistics? Why not just stop with the raw score? Scores is fine, if you are only interested In the top score… when to stop? How to compare scores produced using two different scoring schemes? Bit score provide a common currency for scores, i.e. 52 bits is 52 bits is 52 bits. Scores don’t reflect database size; Expects do. K-A stats is a bit like stoichiometry: Score ~ weight λ ~ Avogadro's’ number E ~ mass

WU-BLASTN

NCBI-BLASTN

NCBI ~ 15 WU-BLAST ~170 So how long would an oligo have to be to generate a score of 15 or 170?

l ncbi =16 l wu-BLAST=294

Sum Statistics

>gi| |ref|NP_ | (NC_004193) Length = 253 Score = 38.9 bits (89), Expect = 3e-05 Identities = 17/40 (42%), Positives = 26/40 (64%) Frame = -1 Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027 VTGA G+G+AI+ A +G + V D+N GA+ V++I Sbjct: 10 VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49 Review: where the parts of an HSP come from, and what they mean

What’s different about this BLAST Hit ?

Sum Statistics

BLAST uses two distinct methods to calculate an Expect

Sum Statistics Sum statistics increases the significance (decreases the E- value) for groups of consistent alignments.

Actual Vs. effective lengths for BLASTX etc Sum Stats are ‘pair-wise’ in their focus In other words, for the purposes of sum stat calculations n = the length of the sbjct sequence; not the length on the db!

Sum Statistics are based on a ‘sum score’; rather than the raw score of the alignments The sum score is not reported by BLAST!

Calculating a Sum score

Converting a Sum score to an Expect(n)

Expect = 3.7e -10 Expect = 2.6e -8 Sum Statistics take home: buyer beware Best to calculate the ‘Expect(1)’ for each hit. Which –hopefully– you now know how to do!

Enough BLAST for one day!