Profile Searches Revised 07/11/06. Overview Introduction Motif representation Motif screening Motif Databases Exercise.

Presentation transcript:

Profile Searches Revised 07/11/06

Overview Introduction Motif representation Motif screening Motif Databases Exercise

Introduction Multiple sequence alignment: shows the features characteristic of the whole family. How to represent the characteristic features? Motif model: captures the family's characteristic features (regular expression, weight matrix, HMM profile)

Introduction Unaligned sequences -> multiple sequence alignment -> construct model -> scan new sequence with the model. Model: captures the family's characteristic features; used to detect remote homologs of a family

Overview Introduction Motif representation –String based representation Consensus Regular expression –Probabilistic representation PSSM HMM Profile Motif screening Motif Databases Exercise

HMM [Workflow figure, steps I to III: unaligned sequences, multiple sequence alignment, construct model, scan new sequence with the model]

String Based Representation Consensus sequence: –Reductionist representation of a motif –Most frequent instance is used as a representative –Loss of information Regular expression: –More complex representation allowing motif degeneracy

String Based Representation Consensus: CTTAATATTAACTTAAT Regular expression (degenerate nucleotide codes): CTTAAKRTTMAYTTAAT
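The degenerate string on this slide can be turned into an ordinary regular expression. The following is a minimal sketch, assuming the letters K, R, M and Y follow the standard IUPAC nucleotide codes; the function name `iupac_to_regex` and the two test sequences `variant` and `mismatch` are hypothetical and not part of the slides.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate IUPAC motif into a regular expression string."""
    return "".join(IUPAC[base] for base in motif)

pattern = re.compile(iupac_to_regex("CTTAAKRTTMAYTTAAT"))

consensus = "CTTAATATTAACTTAAT"   # consensus from the slide
variant = "CTTAAGGTTCACTTAAT"     # hypothetical instance using other allowed bases
mismatch = "CTTAACATTAACTTAAT"    # hypothetical: C at a position that allows only G or T

for name, seq in [("consensus", consensus), ("variant", variant), ("mismatch", mismatch)]:
    print(name, bool(pattern.fullmatch(seq)))
```

The consensus is just one of the strings the regular expression accepts, which is the "loss of information" the previous slide refers to.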

String Based Representation [Figure: DNA motifs. A cell signal acts on motifs shared by Gene 1, Gene 2, Gene 3 and Gene 4; chromosome, gene, transcription, mRNA, translation, protein]

String Based Representation Sequences involved in enzymatic reactions (PROSITE)

Overview Introduction Motif representation –String based representation Consensus Regular expression –Probabilistic representation PSSM HMM Profile Motif screening Motif Databases Exercise

Probabilistic PSSM Alignment (G A A T T C A T G T C A C T T C A T T G) -> frequency matrix -> add pseudocounts -> frequency matrix

Probabilistic PSSM Alignment (G A A T T C A T G T C A C T T C A T T G) -> convert into PSSM, with background p(A)=p(C)=p(G)=p(T)=0.25 -> motif logo

PSSM [Figure: multiple sequence alignment (MSA) with the derived regular expression, weight matrix and motif logo]

Motif Representation Consensus: CTTAATATTAACTTAAT Regular expression: CTTAAKRTTMAYTTAAT PSSM (motif logo)
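The conversion from alignment to PSSM can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the slide: the alignment `alignment` is a hypothetical toy example (the row layout of the slide's alignment is not recoverable), a pseudocount of 1 per base is assumed, and the score is taken to be the log2 ratio of the column frequency to the uniform background of 0.25.

```python
import math

def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    alphabet = "ACGT"
    if background is None:
        background = {base: 0.25 for base in alphabet}  # uniform background, as on the slide
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        scores = {}
        for base in alphabet:
            # frequency with pseudocounts, so unseen bases do not get probability 0
            freq = (column.count(base) + pseudocount) / (n_seqs + pseudocount * len(alphabet))
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm

alignment = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # hypothetical ungapped alignment
pssm = build_pssm(alignment)

for position, scores in enumerate(pssm, start=1):
    print(position, {base: round(score, 2) for base, score in scores.items()})
```

Positive scores mark bases that are more frequent in the motif than in the background; negative scores mark bases that are rarer.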

HMM Definition State sequence (path π): –Probability of a state depends only on the previous state –Transition probability a_kl from state l to state k –Emission probability e_k(b): probability that symbol b is seen when in state k A HIDDEN Markov model: it is not possible to tell what state the system is in by looking at the corresponding symbol. Finding the possible paths = decoding [Figure: profile HMM with begin, match (M_j), insert (I_j), delete (D_j) and end states]

Probabilistic model that represents the alignment of the family –Gapped multiple alignment –Distinct states connected by transition probabilities (i.e. the probability of moving from one state to the next) –The current state depends only on the previous state (first-order Markov process) –The sequence of states followed in the model is called the path π –Each state has a probability of emitting each symbol of the alphabet (A, C, G, T for DNA, or one of the 20 amino acids for proteins): the emission probability

HMM An HMM can model any possible sequence: it defines a probability distribution over the whole space of sequences. Training an HMM: search for the parametrisation that makes this distribution peak around members of the family. Parametrisation –Determine the model structure: length of the alignment, number of insert states –Determine the probability parameters

Training an HMM –Determine the structure of the model –Determine emission and transition probabilities
Alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
E.g. the first column: e_1(A) = 4/5; e_1(T) = 1/5; e_1(C) = 0; e_1(G) = 0
E.g. the second column: e_2(A) = 0; e_2(T) = 0; e_2(C) = 4/5; e_2(G) = 1/5
E.g. the third column: e_3(A) = 4/5; e_3(T) = 0; e_3(C) = 1/5; e_3(G) = 0
[Table: emission probabilities per state, e.g. A 0.8 and T 0.2 for the first column; PC (pseudocount) appears in place of zero counts]
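The column frequencies quoted on this slide can be reproduced by counting. The sketch below assumes that columns 1 to 3 and 7 to 9 of the alignment are match states and that the gapped columns 4 to 6 form a single insert region; that split, and the names `match_columns`, `insert_columns` and `emission`, are assumptions made for illustration.

```python
from collections import Counter

alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
match_columns = [0, 1, 2, 6, 7, 8]   # assumed match states (0-based column indices)
insert_columns = [3, 4, 5]           # assumed insert region

def emission(residues, pseudocount=0.0, alphabet="ACGT"):
    """Relative frequency of each base, optionally smoothed by a pseudocount."""
    counts = Counter(residues)  # gap characters are counted but never read back
    total = sum(counts[base] for base in alphabet) + pseudocount * len(alphabet)
    return {base: (counts[base] + pseudocount) / total for base in alphabet}

match_emissions = [emission([seq[c] for seq in alignment]) for c in match_columns]
insert_emission = emission(
    [seq[c] for seq in alignment for c in insert_columns if seq[c] != "-"]
)

print(match_emissions[0])   # first column: A 0.8, T 0.2
print(insert_emission)      # pooled over all residues in the insert region
print(emission([seq[0] for seq in alignment], pseudocount=1.0))  # same column, smoothed
```

With `pseudocount=1.0` no base keeps probability zero, which is the role of the PC entries in the slide's table.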

Profile representation Suppose I (amino acid a) is the ancestor. What is the probability of observing a T (amino acid b) in the first column (position p) of the alignment? This probability is reflected by the score M: M(p,a) = W(p,b) × Y(a,b), e.g. M(1,I) = W(1,T) × Y(I,T). M depends on: –the observed frequency of T in the first position of the alignment (W) –the probability of mutating I => T (according to PAM) (Y) [Figure: example alignment of the family]

Profile representation gaps

Overview Introduction Motif representation Motif screening Motif Databases Exercise

HMM [Workflow figure, steps I to III: unaligned sequences, multiple sequence alignment, construct profile HMM, scan new sequence with the profile]

Screening

Three starting situations: 1. The multiple alignment of the family is known (Clustal W). 2. The motif to be detected is known but the multiple alignment does not yet exist –Motifs already described in literature –Construct the multiple alignment, derive the model. 3. Neither the motif nor the multiple alignment exists –Probabilistic motif detection

Screening Genome-wide screening: the obtained motif model is used for genome-wide screening (Motif Scanner) to identify putative additional targets –Use a sliding window –Assign a score to each subsequence within the sliding window –Rank the hits by score and select the most promising candidates

Distinct methods differ in the motif representation and the scoring system used. Consensus sequence or regular expression (pattern match) –Very conservative –Does not allow mismatches PSSM / HMM: more complicated scoring schemes –Based on information content –Log likelihood –Less conservative –Difficult choice of threshold score –Tradeoff between sensitivity and selectivity

Screening –Sensitivity (true positive fraction = recall): TP/(TP+FN) –Specificity: TN/(TN+FP) (false positive fraction = 1 - specificity) –Precision: TP/(TP+FP) –FDR (1 - precision): FP/(TP+FP)

Screening E-value: the number of hits expected, by chance alone, to have a score equal to or better than the one observed.
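These four ratios follow directly from the counts of true and false positives and negatives. A minimal sketch; the function name `screening_metrics` and the counts used in the example are hypothetical.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the evaluation ratios listed on the slide from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive fraction, recall
        "specificity": tn / (tn + fp),   # 1 - false positive fraction
        "precision": tp / (tp + fp),
        "fdr": fp / (tp + fp),           # 1 - precision
    }

# Hypothetical screening result: 50 predicted hits, 40 of them real,
# and 10 real sites missed among 950 rejected windows.
metrics = screening_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Raising the score threshold typically lowers the FDR at the cost of sensitivity, which is the tradeoff mentioned for PSSM and HMM screening.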

Screening with Regular Expression Simple perl scripts
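The slide refers to Perl scripts that are not part of this transcript. As a stand-in, here is a short Python sketch of the same idea: report every, possibly overlapping, match of a regular expression on both DNA strands. The motif `TATA[AT]T`, the sequence and the function names are hypothetical.

```python
import re

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(sequence, regex):
    """Return sorted (start, strand, site) hits; start is 0-based on the forward strand."""
    # A lookahead matches at every position, so overlapping sites are not skipped.
    lookahead = re.compile(f"(?=({regex}))")
    hits = [(m.start(), "+", m.group(1)) for m in lookahead.finditer(sequence)]
    for m in lookahead.finditer(reverse_complement(sequence)):
        site = m.group(1)
        start = len(sequence) - m.start() - len(site)  # map back to forward coordinates
        hits.append((start, "-", site))
    return sorted(hits)

sequence = "GGTATAATCCATTATAGG"   # hypothetical sequence
hits = scan(sequence, "TATA[AT]T")
for start, strand, site in hits:
    print(start, strand, site)
```

A pattern match is all or nothing: a site with a single mismatch is not reported at all, which is why this approach is called very conservative.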

Screening with PSSM Background frequency of each of the four nucleotides: 0.25 –Slide a window of length W over the sequence –Calculate a log-odds score for each subsequence within the window –The highest-scoring positions correspond to the most likely locations of the motif Example (W = 6): log2((0.6*0.9*0.8*0.97*0.6*0.7)/(0.25^6)) = log2(720.9) ≈ 9.49

Screening with HMM Does a sequence belong to a family of proteins? Scoring a sequence with an HMM –aligning the sequence to the HMM –finding the hidden path that generates the sequence A sequence can be generated by different paths. Enumerate all paths and calculate for each path the probability that it generates the sequence. Viterbi path: most likely path. Total probability that the sequence is generated by the HMM = sum of the probabilities of all possible paths
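The sliding-window calculation can be sketched as below. The matrix `motif` is hypothetical: only the largest probability in each column (0.6, 0.9, 0.8, 0.97, 0.6, 0.7) is taken from the slide's example, the remaining entries and the scanned sequence are made up, and a uniform background of 0.25 is assumed.

```python
import math

BACKGROUND = 0.25  # assumed uniform background frequency

# Hypothetical position probability matrix, one dict per motif position.
motif = [
    {"A": 0.6,  "C": 0.2,  "G": 0.1,  "T": 0.1},
    {"A": 0.05, "C": 0.03, "G": 0.02, "T": 0.9},
    {"A": 0.1,  "C": 0.05, "G": 0.8,  "T": 0.05},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    {"A": 0.6,  "C": 0.1,  "G": 0.2,  "T": 0.1},
    {"A": 0.1,  "C": 0.1,  "G": 0.1,  "T": 0.7},
]

def log_odds(window):
    """Sum of log2(motif probability / background) over the window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(motif, window))

def scan(sequence):
    """Score every window of motif length and return (score, start) pairs, best first."""
    width = len(motif)
    scores = [(log_odds(sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return sorted(scores, reverse=True)

sequence = "CCGTATGCATTTGA"   # hypothetical; contains the best-scoring site ATGCAT
ranked = scan(sequence)
best_score, best_pos = ranked[0]
print(best_pos, round(best_score, 2))
```

Summing per-position log ratios is equivalent to taking the log of the product shown on the slide, and it avoids numerical underflow for long motifs.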
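For a very small model the enumeration of all paths can be carried out literally. The sketch below uses a hypothetical two-state HMM (an A-rich state `M` and a background state `B`), not a full profile HMM, and all parameters are invented for illustration. It also checks that the forward algorithm yields the same total probability without enumerating paths.

```python
from itertools import product

states = ["M", "B"]                       # hypothetical states
start = {"M": 0.5, "B": 0.5}              # probability of the first state
trans = {"M": {"M": 0.8, "B": 0.2},       # trans[k][l]: moving from state k to state l
         "B": {"M": 0.3, "B": 0.7}}
emit = {"M": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def path_probability(seq, path):
    """Joint probability of the sequence and one particular state path."""
    p = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p

def enumerate_paths(seq):
    """Probability of every possible state path (2**len(seq) of them)."""
    return {path: path_probability(seq, path)
            for path in product(states, repeat=len(seq))}

def forward(seq):
    """Total probability of the sequence, summed over all paths, in linear time."""
    f = {k: start[k] * emit[k][seq[0]] for k in states}
    for symbol in seq[1:]:
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())

seq = "ATCAGT"
paths = enumerate_paths(seq)
total = sum(paths.values())
best_path = max(paths, key=paths.get)     # most likely path by brute force
print(len(paths), "".join(best_path), total, forward(seq))
```

The number of paths doubles with every additional symbol, so enumeration is only feasible for toy examples; dynamic programming is what makes the calculation practical.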

Screening with HMM Example for 1 path ATCAGT

Screening with HMM Calculate the probability of the sequence being generated by the HMM profile of a protein family versus a random model = align the unknown sequence with the HMM –The sequence can be generated by different paths; enumerating all possibilities is infeasible –What is the most probable path? (Viterbi, backtracking) –What is the total probability? (Forward) Bit score [Figure: example alignments ATAT, A-A-, -T-T; ATT and TTC]

Screening with HMM The model is called a hidden Markov model because, when we observe a sequence, the path of states that the Markov model followed to generate it is unknown, or hidden. This hidden path contains the information on how the observed sequence should be aligned with the profile. Usually a sequence can be generated in multiple ways by the Markov model, so several hidden paths (corresponding to distinct alignments) are possible. Not all possible paths are equally probable: some transitions are unlikely (low transition probability). Usually the path with the highest probability (highest score = most likely path) corresponds to the best alignment.

Screening with HMM Detecting the underlying sequence of states uncovers the most probable path of transitions (decoding) –VITERBI algorithm: most probable path (backtracking) Start at the first position (state k) Move to the next most probable state l –V_k(i) is the probability of the most probable path ending in state k at position i –Calculate probability –The Viterbi algorithm finds the most probable path and the probability of this most probable path [Figure: profile HMM with begin, M_j, I_j, D_j and end states]
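A compact sketch of Viterbi decoding with backtracking follows. It assumes the usual recursion, V_l(i+1) = e_l(x_{i+1}) * max_k(V_k(i) * a(k to l)), computed in log space, and it runs on a hypothetical two-state HMM rather than a full profile HMM; all parameter values and the sequence are invented.

```python
import math

states = ["M", "B"]                       # hypothetical: A-rich state and background state
start = {"M": 0.5, "B": 0.5}
trans = {"M": {"M": 0.8, "B": 0.2},       # trans[k][l]: moving from state k to state l
         "B": {"M": 0.3, "B": 0.7}}
emit = {"M": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    """Return the most probable state path and its probability."""
    # V[i][k]: log probability of the best path that ends in state k at position i
    V = [{k: math.log(start[k]) + math.log(emit[k][seq[0]]) for k in states}]
    back = []                             # back[i][l]: best predecessor of state l
    for symbol in seq[1:]:
        scores, pointers = {}, {}
        for l in states:
            prev = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            pointers[l] = prev
            scores[l] = V[-1][prev] + math.log(trans[prev][l]) + math.log(emit[l][symbol])
        V.append(scores)
        back.append(pointers)
    # Backtracking: start from the best final state and follow the pointers.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), math.exp(V[-1][last])

path, prob = viterbi("CGAAAA")
print(path, prob)
```

Working with log probabilities turns products into sums and prevents underflow on long sequences; the exponentiation at the end is only for display.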

HMM [Figure: profile HMM with begin, M_j, I_j, D_j and end states] Alignment: -ACA---ATG -TCAACTATC -ACAC--AGC -AGA---ATC -ACCG--ATC Example sequences: ACAC, ACAAG. Calculate score, state 1 (first symbol A): S(1) = a(BM) + e(A); S(2) = a(BI) + e(A); S(3) = a(BD) (delete state, no emission). Maximal score, state M: S(1) = a(BM) + e(A); S(1) = a(BI) + e(A) + a(IM) + e(C)

Conclusion Distinct methods differ in the motif representation and the scoring system used. Consensus sequence or regular expression (pattern match) –Very conservative –Does not allow mismatches PSSM / HMM: more complicated scoring schemes –Based on information content –Log likelihood –Less conservative –Difficult choice of threshold score –Tradeoff between sensitivity and selectivity

Overview Introduction Motif representation Motif screening Motif Databases –Prosite –Blocks –Pfam Exercise

Pfam Pfam starts from a set of automatically generated domain alignments (generated by PSI-BLAST). From these alignments an HMM is calculated. Subsequently all sequences in the SwissProt database of proteins are classified into protein families –by scoring them with the representative HMMs –by ranking the sequences according to their score –by separating class members from the other sequences in the database based on a suitable threshold. Pfam 7.0 contains a total of 3360 families. Pfam contains multiple protein alignments and profile HMMs of these families.

Pfam

Full: alignment on which the Pfam HMM was based HMMs for global and fragment search

Pfam Screening a new sequence against the Pfam HMMs to classify the novel sequence

Pfam Scores in Pfam Each Pfam family has a "trusted cutoff" and a "noise cutoff" –TC1 is the lowest score for sequences included in the family –NC1 is the highest score for sequences not included in the Full alignment –Raw score (bit score): compares the probability that the sequence was generated by the HMM with the probability that the sequence was generated by a null model –E-value is the number of hits that would be expected to have a score equal or better than this by chance alone
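The relation between the two probabilities and the bit score can be written out as a base-2 log-odds ratio. The sketch below assumes that definition; the three-way use of the cutoffs in `classify`, the function names and all numbers are hypothetical illustrations, not Pfam's actual procedure.

```python
import math

def bit_score(p_hmm, p_null):
    """Log-odds score in bits: log2 of P(sequence | HMM) over P(sequence | null model)."""
    return math.log2(p_hmm / p_null)

def classify(score, trusted_cutoff, noise_cutoff):
    """Assumed reading of the cutoffs: at or above TC is a member, at or below NC is not."""
    if score >= trusted_cutoff:
        return "member"
    if score <= noise_cutoff:
        return "not a member"
    return "uncertain"

# Hypothetical probabilities and cutoffs.
score = bit_score(p_hmm=1e-20, p_null=1e-26)
print(round(score, 2), classify(score, trusted_cutoff=18.0, noise_cutoff=12.0))
```

Each additional bit means the sequence is twice as likely under the family model as under the null model.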

Pfam

PROSITE Patterns (regular expressions) (ScanProsite) –Shorter than Pfam –Enzyme catalytic sites –Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin, etc.) –Amino acids involved in binding a metal ion –Cysteines involved in disulfide bonds –Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein
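PROSITE patterns use their own notation (elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). A minimal sketch of a translator to Python regular expressions follows; it covers only these basic elements, the function name is hypothetical, and the pattern and protein fragment in the example are illustrative rather than quoted from the database.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")      # anchored at the N-terminus
        at_end = element.endswith(">")          # anchored at the C-terminus
        element = element.strip("<>")
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # "[...]" and single letters are already valid regex syntax
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)

# Illustrative zinc-finger-like pattern and a hypothetical protein fragment.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
fragment = "MKCAACAAALAAAAAAAAHAAAHK"
print(regex, bool(re.search(regex, fragment)))
```

Special forms such as an end anchor inside square brackets are not handled by this sketch.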

PROSITE Profiles (Profile representation)

PROSITE Aminael renew

BLOCKS Database of ungapped alignments Motif models represented as PSSMs

Example sequence Scan the sequence with Prosite, Pfam, Blocks
>gi| |pir||B54759 ba-type ubiquinol oxidase Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVRGFA
DAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVGGAVITM
ASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWTSFCANILIV
ASFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLFGYSSMVYATVCITVL
SYLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTFVIGGMTGVLLAVPPADFVL
HNSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPLYILGLMGVTRRLRVFDDPDLRI
WFAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPAYNFAFNPISHGLDTWWEMKQQGATR
PTGGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTFNYNRDYYIPVSEIEATEDARTRQLAQGV

PSI-BLAST

Overview Query sequence unknown –BLAST the sequence to search for close homologs (BLAST) –Search Pfam and PROSITE for conserved motifs (Pfam, PROSITE) –Once homology with an annotated protein family is detected: make a multiple sequence alignment (ClustalW), generate a profile or HMM (HMMER, PSSM), search the database for remote homologs (profile search, PSI-BLAST)

exercises

Bit score (log-odds score, Bayesian interpretation) –Posterior: given the sequence, is this a globin domain (HMM model M)? –Likelihood: probability of the sequence being generated by the HMM –Prior probability: p(model) –Bayes: compares the model M with the random model R
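The Bayesian reading of the bit score can be made concrete. The sketch below assumes only two hypotheses, the family model M and the random model R with prior 1 - p(M), and a bit score defined as log2 of P(sequence | M) / P(sequence | R); under those assumptions Bayes' theorem gives the posterior directly from the score and the prior. The function name and the numbers are hypothetical.

```python
def posterior_from_bits(bits, prior):
    """Posterior P(M | sequence) from a log2-odds score and the prior P(M)."""
    # posterior odds = likelihood ratio * prior odds
    odds = 2.0 ** bits * prior / (1.0 - prior)
    return odds / (1.0 + odds)

# Hypothetical example: a domain expected in about 1 of every 1025 sequences.
prior = 1.0 / 1025.0
for bits in (0, 10, 20):
    print(bits, round(posterior_from_bits(bits, prior), 4))
```

With this prior, a score of 10 bits only brings the posterior to one half, which illustrates why the threshold has to grow with the size of the database being searched.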