Situations where generic scoring matrix is not suitable Short exact match Specific patterns.

Slides:



Advertisements
Similar presentations
Secondary structure prediction from amino acid sequence.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
Profiles for Sequences
Psi-BLAST, Prosite, UCSC Genome Browser Lecture 3.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
© Wiley Publishing All Rights Reserved. Analyzing Protein Sequences.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
Tutorial 5 Motif discovery.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.
Protein Modules An Introduction to Bioinformatics.
Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)
Making Sense of DNA and protein sequence analysis tools (course #2) Dave Baumler Genome Center of Wisconsin,
Pattern databasesPattern databasesPattern databasesPattern databases Gopalan Vivek.
Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
C OMPUTATIONAL BIOLOGY. O UTLINE Proteins DNA RNA Genetics and evolution The Sequence Matching Problem RNA Sequence Matching Complexity of the Algorithms.
Protein Sequence Alignment and Database Searching.
Hidden Markov Models for Sequence Analysis 4
Proteins Secondary Structure Predictions Structural Bioinformatics.
Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Generic substitution matrix -based sequence similarity evaluation Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Multiple Alignments Motifs/Profiles What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile? BIO520.
Secondary structure prediction
Profile Searches Revised 07/11/06. Overview Introduction Motif representation Motif screening Motif Databases Exercise.
Construction of Substitution Matrices
Protein structure prediction May 26, 2011 HW #8 due today Quiz #3 on Tuesday, May 31 Learning objectives-Understand the biochemical basis of secondary.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Hidden Markov Models A first-order Hidden Markov Model is completely defined by: A set of states. An alphabet of symbols. A transition probability matrix.
Motif discovery and Protein Databases Tutorial 5.
Manually Adjusting Multiple Alignments Chris Wilton.
Finding Patterns Gopalan Vivek Lee Teck Kwong Bernett.
Generic substitution matrix based sequence comparison Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W T V A. Total:
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Copyright OpenHelix. No use or reproduction without express written consent1.
Protein Domain Database
How do we represent the position specific preference ? BID_MOUSE I A R H L A Q I G D E M BAD_MOUSE Y G R E L R R M S D E F BAK_MOUSE V G R Q L A L I G.
PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.
Generic substitution matrix based sequence comparison Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W T V A. Total:
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Local Multiple Sequence Alignment Sequence Motifs
Exercises Pairwise alignment Homology search (BLAST) Multiple alignment (CLUSTAL W) Iterative Profile Search: Profile Search –Pfam –Prosite –PSI-BLAST.
Construction of Substitution matrices
2016/1/27Summer Course1 Pattern Search Problems Part I: Fundament Concept.
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
(H)MMs in gene prediction and similarity searches.
Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window,
Protein motif /domain Structural unit Functional unit Signature of protein family How are they defined?
Free for Academic Use. Jianlin Cheng.
Protein Families, Motifs & Domains.
Learning Sequence Motif Models Using Expectation Maximization (EM)
Genome Center of Wisconsin, UW-Madison
Sequence Based Analysis Tutorial
Sequence Based Analysis Tutorial
Presentation transcript:

Situations where generic scoring matrix is not suitable Short exact match Specific patterns

After Blast – which one is “real”?

Position –specific information about conserved domains is IGNORED in single sequence –initiated search BID_MOUSE SESQEEIIHN IARHLAQIGDEM DHNIQPTLVR BAD_MOUSE APPNLWAAQR YGRELRRMSDEF EGSFKGLPRP BAK_MOUSE PLEPNSILGQ VGRQLALIGDDI NRRYDTEFQN BAXB_HUMAN PVPQDASTKK LSECLKRIGDEL DSNMELQRMI BimS EPEDLRPEIR IAQELRRIGDEF NETYTRRVFA HRK_HUMAN LGLRSSAAQL TAARLKALGDEL HQRTMWRRRA Egl-1 DSEISSIGYE IGSKLAAMCDDF DAQMMSYSAH BID_MOUSE SESQEEIIHN IARHLAQIGDEM DHNIQPTLVR sequence X SESSSELLHN SAGHAAQLFDSM RLDIGSTAHR sequence Y PGLKSSAANI LSQQLKGIGDDL HQRMMSYSAH Why a BLAST match is refused by the family ?

What determine a protein family? Structural similarity Functional conservation

Specific patterns 1.DNA pattern – Transcription factor binding site. 2.Short protein pattern – enzyme recognition sites. 3.Protein motif/signature.

Binary patterns for protein and DNA Caspase recognition site: [EDQN] X [^RKH] D [ASP] Examples: Observe: Search for potential caspase recognition sites with BaGua

Searching for binary (string) patterns Seq: A G G G C T C A T G A C A G R C W G A C A G T G R C W G A C A G T G R C W G A C A G T Positive match

Does binary pattern conveys all the information ? Weighted matrix / profile HMM model For searching protein domains

What determine a protein family? Structural similarity Functional conservation

Practice: motif analysis of protein sequence using ScanProsite and Pfam 1.Open two taps for Pfam, input one of the Blast hits and one candidate TNF to each data window. 2.Compare the results

Scan protein for identified motifs A service provided by major motif databases such as Prosite,, Pfam, Block, etc. Protein family signature motif often indicates structural and function property. High frequency motifs may only have suggestive value.

What is the possible function of my protein? Which family my protein belongs to? -- Profile databases Pfam ( ) Prosite ( ) IntePro ( ) Prints ( TS.html ) TS.html

Protein motif /domain Structural unit Functional unit Signature of protein family How are they defined?

How do we represent the position specific preference ? BID_MOUSE I A R H L A Q I G D E M BAD_MOUSE Y G R E L R R M S D E F BAK_MOUSE V G R Q L A L I G D D I BAXB_HUMAN L S E C L K R I G D E L BimS I A Q E L R R I G D E F HRK_HUMAN T A A R L K A L G D E L Egl-1 I G S K L A A M C D D F Binary pattern: L [GSC] [HEQCRK] X [^ILMFV] Basic concept of motif identification 2.

How do we represent the position specific preference ? BID_MOUSE I A R H L A Q I G D E M BAD_MOUSE Y G R E L R R M S D E F BAK_MOUSE V G R Q L A L I G D D I BAXB_HUMAN L S E C L K R I G D E L BimS I A Q E L R R I G D E F HRK_HUMAN T A A R L K A L G D E L Egl-1 I G S K L A A M C D D F Statistical representation G: 5 -> 71% S: 1 -> 14 % C: 1 -> 14 % Basic concept of motif identification 2.

Scoring sequence based on Model Seq: A S L D E L G D E A C D ….... position_1 = s(A/1) + s(S/2) + s(L/3) + s(D / 4) An example of position specific matrixexample

Representation of positional information in specific motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R. Binary patterns: Positional matrix:

Hidden Markov Model A specification on “how events will happen” so that statistical assessment can be readily made Used for Speech recognition and characterizing sequence patterns

Hidden Markov Model Position 1Position 2 Deletion Insertion Position 3 1.) possible states of events or “routes” --Transition Probability

Hidden Markov Model Position 1Position 2 2.) possible AA at a given position -- emission probability ACDEFGHIACDEFGHI

Hidden Markov Model 3.) That’s all. Let’s give it a try and you will know what it is about

Hidden Markov Model How to make a HMM for my petty motif Collect related sequences MEME : bin/meme.cgi *Selection of sequences determines the model*

Identifying motifs using MEME -Multiple EM for Motif Elicitation EM: Expectation maximization (P ). Identifies statistically significant motif(s) in a set of sequences. Outline the occurrence of the motifs at the end of the report

Practice: Identify conserved motifs using MEME 1.) Input your own address. 2.) Load the file of multiple Fasta format sequences. 3.) You can change other options based on your needs.

Two search examples  The outcome of the search is dependent on the inputting set of sequences.  Compose the inputting set based on your research needs. Set1: Mammalian P53 plus mosquito hits Set2: Diverse set of P53 plus mosquito hits

Two search examples Set1: Mammalian P53 plus mosquito hits Set2: Diverse set of P53 plus mosquito hits

Secondary structure prediction Predict the likelihood of amino acid x to be in each of the three (four) types of secondary structure configuration Helix Sheet Turn Coil Coiled-coil is two helices tangled together

Secondary structure prediction - different strategies and algorithms Chou-Fasman / Garnier Method -- based on AA composition Nearest Neighbor / Levin Method -- based on sequence similarity Neural Network / PHD SOPM, DPM, DSC, etc.

Results are given at single amino acid level

Secondary structure prediction --Interpretation of result Seq: - D G S L A D E R K Pre: - B B B H H B B T T What is the likelihood of helix formation here ?

Secondary structure prediction -Accuracy At the amino acid level -- ~ 75% based on testing set IF: Seq: - D G I L A V A S M I V Pre: - B B H H H H H H H H H Length > 9 > 90% chance a helix formation around this region

Secondary structure prediction -Programs There are over a dozen web sites provide 2 nd structure predication service – (tools) AntheWin has a good sample of different approaches and has other associated tools

Practice: using AnathePro to analyze protein secondary structure Open the sequence file. Chose and run a secondary structure predication method from the “Methods” menu LEFT click the left boundary of an alpha helix and then RIGHT the right boundary, perform “helical Wheel”

Helix wheel to discern helix subtype HydrophilicHydrophobicOthers AmphipathicHydrophobicHydrophilic