MCB 5472 Lecture #4: Probabilistic models of homology: Psi-BLAST and HMMs February 17, 2014.


From last week: BLASTp searches find homologs to a single sequence in a sequence database Highest score to sequences best matching the query Corollary: lower scores to distant sequences still matching the query

Finding all homologous sequences using BLASTp In an ideal world (e.g., highly conserved sequences) a simple BLASTp search would recover all homologs of a single query sequence Simply accept all sequences passing some E-value cutoff E-values can have a steep decline [Figure: sequence similarity plotted against sequences ranked by decreasing similarity]

Nice but… Assumes the query sequence perfectly represents all homologs False because: Substitutions may be from lineage-specific biases and not conserved in homologs more generally, biasing search towards closer relatives Conservation is always incomplete: the query may not contain conserved positions present in most other homologs Sequences close to a hard E-value cutoff can be easily excluded/included depending on search bias

Shown another way

Sensitivity vs specificity Trade off between minimizing false-negative detection (sensitivity) and false-positives (specificity) A common trade-off in bioinformatics BLASTp is designed to maximize sensitivity Specificity can be low – left to the user to cull out the false-positives
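The trade-off above can be made concrete with a small calculation. This is a minimal sketch: the counts are hypothetical and stand for the outcome of one search at one E-value cutoff, where the true homolog set is assumed to be known.

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that the search recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologs that the search correctly rejects."""
    return tn / (tn + fp)


# Hypothetical counts for a single search at a single E-value cutoff.
tp, fn = 90, 10      # homologs found / homologs missed
tn, fp = 9800, 200   # non-homologs rejected / non-homologs reported

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

Loosening the cutoff would typically move hits from fn to tp (higher sensitivity) while also moving sequences from tn to fp (lower specificity).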

Net result: Divergent homologs are hard to detect Because they are close to typical E-value cutoffs, any bias can easily lead to them being excluded Discriminating between false- and true- positives can be problematic Requires manual examination Finding deep homology is hard!

Better: use multiple queries representing all homologs Option 1: Run multiple individual BLASTps Still easy to bias – “unknown unknowns” Option 2: Make a statistical model of sequence conservation amongst all homologs and use that to find different relatives Diverse input sequences remove and average out lineage-specific biases

PSI-BLAST “Position-Specific Iterated BLAST” Works only for proteins Uses BLASTp to create a “position-specific score matrix” (PSSM) Smith-Waterman local alignments are an option for the command line but not the web (slower, more accurate) Uses matrix for subsequent database searches Matrix updated on each iteration Bias reduced each time Sensitivity increased towards distant homologs False-positives reduced by model refinement

PSI-BLAST: step #1 First iteration: standard BLASTp using a single sequence All homologs above a specified E-value threshold kept to make PSSM Can be specified via parameters, manually edited on NCBI website implementation

NCBI web implementation

NCBI web implementation Specific PSI-BLAST options

NCBI web implementation

Select to start using first hits for PSSM next iteration

PSI-BLAST: step #2 Makes (rough) multiple sequence alignment for the selected BLASTp results All hits aligned to the query Not a true multiple sequence alignment Possible to input an externally generated alignment via terminal version (but not web) Alternative at terminal: Smith-Waterman local pairwise alignments Not available for web Slower but more accurate

PSI-BLAST: step #3 Use sequence alignment to create Position-Specific Scoring Matrix (PSSM) PSSM: Unique substitution matrix for each sequence alignment column Extra column for gap penalty Matrix is 21 x [query length] vs. 20 x 20 for normal matrix Scores merge standard distance matrix with position-specific frequencies from the first iteration, weighted by sequence similarity
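The idea of one score column per alignment position can be sketched in a few lines. This is a deliberately simplified illustration and not PSI-BLAST's actual procedure: it assumes a uniform background frequency, add-one pseudocounts, no sequence weighting, and no blending with a standard substitution matrix. The toy alignment and the function name `build_pssm` are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Simplifying assumptions: uniform background (1/20), add-one style
    pseudocounts, gaps ('-') ignored, all sequences weighted equally.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment of hits against a four-residue query (all rows same length).
toy_alignment = ["MATW", "MATP", "MSTW", "M-TW"]
pssm = build_pssm(toy_alignment)

print(len(pssm), "columns")
print("column 1: M =", round(pssm[0]["M"], 2), " W =", round(pssm[0]["W"], 2))
```

Residues that dominate a column get positive scores at that position only, which is what distinguishes a PSSM from a single 20 x 20 matrix applied everywhere.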

Example PSSM Gribskov et al. (1987) PNAS 84: 4355–4358

PSI-BLAST: step #4 Query reference database using the PSSM Recall: BLASTp looks for 2-3 amino acid words similar to the query sequence above some threshold score calculated from the distance matrix An equivalent calculation can be performed using the PSSM; find possible words having a score > the same threshold Subsequent BLAST steps are the same: extend matching words, recalculate with gaps, calculate statistics E-values now reflect similarity to the query profile, not any individual sequence
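The word-seeding step with a PSSM might look like the sketch below. Everything here is illustrative: the four-column PSSM, the flat `MISMATCH` score for unlisted residues, the threshold of 11 and the function name `find_seeds` are assumptions chosen for the example, not values taken from BLAST.

```python
# Hypothetical PSSM: one dict of residue scores per query position.
# Residues not listed in a column get the flat MISMATCH score.
MISMATCH = -2
TOY_PSSM = [
    {"M": 6},
    {"A": 4, "S": 2},
    {"T": 5},
    {"W": 11, "P": 1},
]


def find_seeds(pssm, subject, word_size=3, threshold=11):
    """Return (query_pos, subject_pos, score) for each word hit.

    A word hit is a subject word whose summed PSSM score over
    word_size consecutive query columns reaches the threshold.
    """
    seeds = []
    for q in range(len(pssm) - word_size + 1):
        columns = pssm[q:q + word_size]
        for s in range(len(subject) - word_size + 1):
            word = subject[s:s + word_size]
            score = sum(col.get(res, MISMATCH) for col, res in zip(columns, word))
            if score >= threshold:
                seeds.append((q, s, score))
    return seeds


for seed in find_seeds(TOY_PSSM, "GGMSTWKK"):
    print(seed)
```

Both hits in this toy case lie on the same diagonal (subject position minus query position is 2), which is the kind of seed the extension step would then grow into an alignment.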

PSI-BLAST: step #5 Perform as many iterations as you like PSSM updated each time based on hits passing E-value threshold on the previous iteration Sequence-specific bias reduced each time as the PSSM is adjusted to reflect homologs in the entire input set

Example: CrtR protein, 1e-50 threshold [Table with columns: Iteration | Hits > 1e-50 | Notes. Surviving note: Query not top hit, top E-value !=]

Model corruption If a non-homologous sequence is included during model construction, it can bias the model away from true homologs With subsequent iterations, the model can be made completely useless Using a more stringent (lower) E-value cutoff can ameliorate this On web can examine results and limit selection Can’t do this in high-throughput at terminal

Command line PSI-BLAST Part of BLAST+ package so same basic parameters apply Additional flags: -num_iterations [number] -out_pssm [filename] -out_ascii_pssm [filename] -comp_based_stats 0 # required e.g., ~]$ psiblast -query test.faa -db all.faa -num_iterations 3 -out test_vs_all.psiblast -out_ascii_pssm pssm.out -comp_based_stats 0
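For high-throughput use, the same command can be assembled in a script. The sketch below only builds the argument list from the flags named on this slide; the helper name `psiblast_command` and its defaults are hypothetical, and actually running the command requires BLAST+ to be installed and a formatted database to exist.

```python
import shlex


def psiblast_command(query, db, out, iterations=3, ascii_pssm=None, pssm=None):
    """Build a psiblast argument list using only the flags shown on the slide."""
    cmd = [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(iterations),
        "-out", out,
        "-comp_based_stats", "0",
    ]
    if ascii_pssm:
        cmd += ["-out_ascii_pssm", ascii_pssm]  # human-readable matrix
    if pssm:
        cmd += ["-out_pssm", pssm]              # checkpoint reusable via -in_pssm
    return cmd


cmd = psiblast_command("test.faa", "all.faa", "test_vs_all.psiblast",
                       ascii_pssm="pssm.out")
print(shlex.join(cmd))

# To execute (assumes BLAST+ is on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file names contain spaces.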

Starting PSI-BLAST with pre-computed PSSM Create PSSM using PSI-BLAST with -out_pssm flag (not -out_ascii_pssm ) Use -in_pssm flag instead of -query e.g., ~]$ psiblast -in_pssm pssm.out -db all.faa -num_iterations 3 -out test_vs_all.psiblast -comp_based_stats 0

RPSBLAST PSI-BLAST queries a sequence database with an individual PSSM RPSBLAST does the opposite: queries a database of PSSMs with an individual sequence e.g., From NCBI’s Conserved Domain Database (CDD) to annotate sequences according to NCBI’s ortholog family descriptions In BLAST+, command is rpsblast (installed as rpsblast+ by some packages) and works similarly to other BLAST+ commands except -db is now PSSMs, not sequences Possible to make your own PSSM database but complicated (most people use HMMer instead)

Beyond BLAST Recall that all BLAST programs produce local alignments Trade-off between speed and accuracy FASTA: alternative package for database searching Heuristics like BLAST using word matching for initial sequence matching Final alignments use the Smith-Waterman local pairwise alignment method Advances in computer science and statistics on both fronts (i.e., better accuracy & better approximations)

HMMER HMMER is a software package similar to PSI-BLAST, i.e., searching databases with homology models Uses HMMs instead of PSSMs Advantages: More statistically explicit models HMMER3 as ~fast as BLAST Easy to use at command line Can make models for DNA, RNA, protein Disadvantages: Initial alignment is always a second step No NCBI interface (database-specific instead)

The purpose of HMMs To evaluate the probability of a sequence matching a model Assumes preexisting model Essentially a classification problem Given data, how well does it fit a model? Given data and multiple models, which fits best? e.g., Does a gene belong to a gene family?

Markov Chains “A finite number of states connected by transitions”

Transition probabilities The probability of moving from one state to another P(Yellow|Green) = 1 “The probability of transitioning to yellow given green” P(Red|Yellow) = 1 P(Green|Red) = 1 P(Green|Yellow) = 0 P(Yellow|Red) = 0 P(Red|Green) = 0
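The traffic-light chain can be written down directly as a transition table. This is a minimal sketch; the table layout and the helper names `simulate` and `path_probability` are choices made for the example. The self-transition probabilities are not listed on the slide and are set to 0 here, which follows from each state having one outgoing transition with probability 1.

```python
import random

# TRANSITIONS[current][next] = P(next | current)
TRANSITIONS = {
    "Green":  {"Green": 0.0, "Yellow": 1.0, "Red": 0.0},
    "Yellow": {"Green": 0.0, "Yellow": 0.0, "Red": 1.0},
    "Red":    {"Green": 1.0, "Yellow": 0.0, "Red": 0.0},
}


def simulate(start, steps, rng):
    """Walk the chain for a number of steps and return the visited states."""
    path = [start]
    for _ in range(steps):
        options = TRANSITIONS[path[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(nxt)
    return path


def path_probability(path):
    """Probability of a path: the product of its transition probabilities."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= TRANSITIONS[current][nxt]
    return prob


print(simulate("Green", 5, random.Random(0)))
print(path_probability(["Green", "Yellow", "Red"]))
print(path_probability(["Green", "Red"]))
```

Because every allowed transition here has probability 1, the simulation is deterministic; replacing some entries with values between 0 and 1 turns it into a genuinely random walk.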

Transition probabilities The probability of moving from one state to another P(C|B) = 1 P(D|C) = 1 P(A|D) = 1 P(A|B) = P(vehicles turning left) P(A|B) = P(no vehicles turning left) P(B|A) = 0 etc. [Figure: diagram of four states A, B, C and D]

Hidden Markov Models HMMs are like Markov Chains in that they comprise states connected by transitions Difference: each state does not comprise a single symbol but rather a distribution of them e.g., a column of a sequence alignment will contain some frequency of A, C, G and T Each state can “emit” a symbol with some probability

Hidden Markov Models Known: The number of states The transition probabilities The emission probabilities Question: how well does a sequence match the model? Evaluate the global probability by multiplying the probability of each step through the graph

A simplified model: identifying a 5’ splice site Eddy 2004 Nat. Biotechnol. 22:

Three possible states: exon (E), 5’ splice site (5), intron (I)
Sequence: CTTCATGTGAAAGCAGACGTAAGTCA
[Figure: the sequence divided into exon, 5’ splice site and intron, above the model Start → E → 5 → I → End]

Emission probabilities 5’ splice site nearly always G, occasionally A Exon sequence distributed uniformly Intron sequence is AT rich
Exon (E): P(A) = 0.25 P(C) = 0.25 P(G) = 0.25 P(T) = 0.25
5’ splice site (5): P(A) = 0.05 P(C) = 0 P(G) = 0.95 P(T) = 0
Intron (I): P(A) = 0.4 P(C) = 0.1 P(G) = 0.1 P(T) = 0.4
[Figure: model Start → E → 5 → I → End]

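Transition probabilities Exon & intron can be multiple bases long 5’ splice site only one base long Exact probabilities are flexible
Emissions, exon (E): P(A) = 0.25 P(C) = 0.25 P(G) = 0.25 P(T) = 0.25
Emissions, intron (I): P(A) = 0.4 P(C) = 0.1 P(G) = 0.1 P(T) = 0.4
Emissions, 5’ splice site (5): P(A) = 0.05 P(C) = 0 P(G) = 0.95 P(T) = 0
Transitions shown: Start → E: P = 1; E → E: P = 0.9; E → 5: P = 0.1; 5 → I: P = 1
[Figure: model Start → E → 5 → I → End]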

Discuss: given this sequence, where is the splice site?
CTTCATGTGAAAGCAGACGTAAGTCA
[Figure: the Start → E → 5 → I → End model with the same emission and transition probabilities as on the previous slide]

14 possibilities (the number of “A”s and “G”s that can be the splice site; the final “A” cannot, because at least one intron base must follow) “A”s are sufficiently improbable that we will not work them out here
CTTCATGTGAAAGCAGACGTAAGTCA
[Figure: the Start → E → 5 → I → End model with the same emission and transition probabilities as on the previous slides]

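Probability of the path that places the 5’ splice site at the first G (same model and probabilities as on the previous slides):
CTTCATGTGAAAGCAGACGTAAGTCA

$$\log P(1^{\text{st}}) = \log\left(1 \times (0.25^{6} \times 0.9^{5}) \times 0.1 \times 0.95 \times 1 \times \left[0.4^{11} \times 0.1^{8} \times 0.9^{18}\right] \times 0.1\right) \approx -43.90 \quad \text{(natural log)}$$

Factors in order: Start transition (1); exon emissions ($0.25^{6}$); exon-to-exon transitions ($0.9^{5}$); exon-to-splice-site transition (0.1); splice-site emission (0.95); splice-site-to-intron transition (1); intron emissions ($0.4^{11}$ for A/T, $0.1^{8}$ for C/G); intron-to-intron transitions ($0.9^{18}$); End transition (0.1).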
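The path calculation can be checked numerically. The sketch below is an illustration under stated assumptions rather than code from the lecture: the emission table uses the values from the slides, while the transition values for I → I (0.9) and I → End (0.1) are assumed from the toy model in the cited Eddy (2004) paper. Logs are natural logs, and the function name `path_log_prob` is made up for the example.

```python
import math

SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"

EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon: uniform
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},    # 5' splice site
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # intron: AT rich
}

# Assumed transition probabilities (I->I and I->End are not on the slides).
TRANS = {
    ("Start", "E"): 1.0, ("E", "E"): 0.9, ("E", "5"): 0.1,
    ("5", "I"): 1.0, ("I", "I"): 0.9, ("I", "End"): 0.1,
}


def path_log_prob(seq, splice_index):
    """ln P(seq, path) with the 5' splice site at 0-based splice_index."""
    states = ["E"] * splice_index + ["5"] + ["I"] * (len(seq) - splice_index - 1)
    logp = 0.0
    prev = "Start"
    for state, base in zip(states, seq):
        t = TRANS.get((prev, state), 0.0)
        e = EMIT[state][base]
        if t == 0.0 or e == 0.0:
            return float("-inf")  # impossible path
        logp += math.log(t) + math.log(e)
        prev = state
    end = TRANS.get((prev, "End"), 0.0)
    return logp + math.log(end) if end > 0.0 else float("-inf")


first_g = SEQ.index("G")  # 0-based index 6, i.e. the 7th base
print(round(path_log_prob(SEQ, first_g), 2))
print(round(path_log_prob(SEQ, 18), 2))  # splice site at the 19th base
```

Working in log space turns the long product into a sum, which avoids numerical underflow on longer sequences.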

HMMs Given a model and input data, we can calculate the likelihood of any given classification Because model is fully parameterized, significance of each path can be determined in a Bayesian statistical framework “Posterior decoding”

Posterior decoding
CTTCATGTGAAAGCAGACGTAAGTCA
[Figure: log(P) of candidate paths and posterior decoding of the splice site position; posterior probabilities shown: 11%, 46%, 28%]
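For this toy model the posterior can be computed by brute force, because every valid path contains exactly one splice-site state. The sketch below normalizes the path probabilities over all candidate positions. As assumptions, the transition values E → E = 0.9, E → 5 = 0.1, I → I = 0.9 and I → End = 0.1 follow the cited Eddy (2004) toy model, and the function names are made up for the example. Real HMM software uses the forward and backward algorithms instead of enumerating paths.

```python
import math

SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# Assumed transitions; Start->E and 5->I are 1 and contribute nothing in logs.
E_STAY, E_TO_5, I_STAY, I_TO_END = 0.9, 0.1, 0.9, 0.1


def path_log_prob(seq, k):
    """ln P(seq, path): exon = seq[:k], splice site = seq[k], intron = seq[k+1:]."""
    if k < 1 or k > len(seq) - 2 or EMIT["5"][seq[k]] == 0.0:
        return float("-inf")
    intron = seq[k + 1:]
    logp = sum(math.log(EMIT["E"][b]) for b in seq[:k])
    logp += (k - 1) * math.log(E_STAY) + math.log(E_TO_5)
    logp += math.log(EMIT["5"][seq[k]])
    logp += sum(math.log(EMIT["I"][b]) for b in intron)
    logp += (len(intron) - 1) * math.log(I_STAY) + math.log(I_TO_END)
    return logp


def splice_posteriors(seq):
    """Posterior probability of each possible splice position (0-based)."""
    logps = {k: path_log_prob(seq, k) for k in range(len(seq))}
    finite = {k: lp for k, lp in logps.items() if lp > float("-inf")}
    top = max(finite.values())  # subtract the max for numerical stability
    total = sum(math.exp(lp - top) for lp in finite.values())
    return {k: math.exp(lp - top) / total for k, lp in finite.items()}


posteriors = splice_posteriors(SEQ)
for k, p in sorted(posteriors.items(), key=lambda item: -item[1])[:3]:
    print(f"base {k + 1} ({SEQ[k]}): {p:.0%}")
```

Under these assumptions the most probable site is the G at base 19 with a posterior near 46%, and the G at base 23 comes second near 28%, in line with the figure.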

Note: many non-zero posterior decodings (From actual Nat. Biotechnol. paper)

Profile HMMs HMMs applied to multiple sequence alignments Each alignment column is a state Emission probabilities based on sequence conservation in the alignment weighted by sequence similarity A separate series of states exists for gaps Emission probabilities encompass gap extension Significance of sequence matching model calculated similarly to our much simpler splicing example

HMMER3 Software package for making and using profile HMMs Like BLAST+, can use own models or download from others Approximately as fast as BLAST+ (previous versions were not)

hmmbuild & hmmpress Input: multiple sequence alignment 2 steps: (i) make HMM; (ii) compress for database searching (like makeblastdb ) hmmbuild e.g., hmmbuild test.hmm test.muscle.faa hmmpress e.g., hmmpress test.hmm Creates: test.hmm.h3m test.hmm.h3i test.hmm.h3f test.hmm.h3p Requires predefined input set Requires sequence alignment No simple iterative model updating option like PSI-BLAST Have to rerun alignment and hmmbuild instead

hmmscan Actual search function (cf. rpsblast ) comparing sequences to profiles hmmscan -o [output file] [HMM database] [sequence file] e.g., hmmscan -o hmmscan.out test.hmm test.faa

HMMER long output Two outputs: one for complete sequence, one for domains E-value and score analogous to BLAST output “bias”: score adjustment based on compositional bias in the database Domains: i-Evalue: likelihood of hit in entire database c-Evalue: likelihood of hit in hits Domain boundaries Domain alignments

hmmscan domain table output format hmmscan --domtblout [output file] [HMM database] [sequence file] e.g., hmmscan --domtblout hmmscan.domtblout test.hmm test.faa
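The domain table is whitespace-delimited and easy to parse in a script. The sketch below assumes the HMMER3 column order (22 fixed columns followed by a free-text description) and should be checked against the header lines of your own output. The function name `parse_domtblout` and the sample line are made up for illustration.

```python
def parse_domtblout(lines):
    """Parse hmmscan --domtblout lines into a list of dicts (one per domain)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 fixed columns, then a description that may contain spaces.
        f = line.split(maxsplit=22)
        hits.append({
            "target": f[0],              # profile (HMM) name
            "query": f[3],               # sequence name
            "full_evalue": float(f[6]),  # full-sequence E-value
            "full_score": float(f[7]),
            "c_evalue": float(f[11]),    # conditional E-value for the domain
            "i_evalue": float(f[12]),    # independent E-value for the domain
            "dom_score": float(f[13]),
            "ali_from": int(f[17]),      # alignment start on the sequence
            "ali_to": int(f[18]),        # alignment end on the sequence
        })
    return hits


# Hypothetical example line, not real output.
sample = [
    "# target name  accession  tlen  query name ...",
    "Globin PF00042.21 110 seq1 - 147 1.2e-30 100.5 0.1 1 1 "
    "3.4e-34 1.2e-30 100.3 0.1 2 109 8 112 7 113 0.97 Globin family",
]

for hit in parse_domtblout(sample):
    if hit["i_evalue"] < 1e-5:
        print(hit["query"], hit["target"], hit["ali_from"], hit["ali_to"])
```

With a real file, the same function can be called as `parse_domtblout(open("hmmscan.domtblout"))`.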

Other useful HMMER functions (working similarly) hmmsearch : search models vs. sequences (cf. PSI-BLAST ) hmmalign : align sequences to HMM nhmmscan : align nucleotide sequences to nucleotide HMM hmmbuild autodetects input format or can be specified

Database sources Many different databases supply HMMs for various purposes Pfam: protein domains in sequences Rfam: RNA annotation in genomes Interpro: integration of different orthology methods Uses HMMs and simpler motif matching cf. regular expressions Each has its own website, not integrated like NCBI