Generic substitution matrix based sequence comparison

Presentation transcript:

Generic substitution matrix based sequence comparison
Q:   M  A  T  W  L  I
A:   M  A  -  W  T  V
Scr: 5  4 -? 11 -1  3
Total = 22 - ?
Q:   M  A  T  W  L  I
A:   M  A  W  T  V  A
Scr: 5  4 -2 -2  1 -1
Total: 5
Blosum 62: Gap opening: -6 ~ -15; Gap extension: -2 ~ -6
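The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not a general alignment scorer: `BLOSUM62_SUBSET` holds only the BLOSUM62 entries needed for these two alignments, and the function name `score_alignment` and the convention that the first gap position costs `gap_open` while each further position costs `gap_extend` are assumptions made for illustration.

```python
# Only the BLOSUM62 entries needed for the two alignments on the slide.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3,
    ("T", "W"): -2, ("L", "V"): 1, ("A", "I"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def score_alignment(query, subject, gap_open=-10, gap_extend=-2):
    """Score two aligned strings of equal length; '-' marks a gap.

    Assumed convention: the first position of a gap costs gap_open,
    every following position of the same gap costs gap_extend.
    """
    assert len(query) == len(subject)
    total = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(q, s)
            in_gap = False
    return total


gapped = score_alignment("MATWLI", "MA-WTV", gap_open=-10)
ungapped = score_alignment("MATWLI", "MAWTVA")
print(gapped, ungapped)
```

With any gap opening penalty in the quoted range (-6 to -15), the gapped alignment still scores higher than the ungapped one, which is why the choice of penalty matters most when the two totals are close.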

Blast output

Position-specific information about conserved domains is IGNORED in a single-sequence-initiated search
BID_MOUSE   SESQEEIIHN IARHLAQIGDEM DHNIQPTLVR
BAD_MOUSE   APPNLWAAQR YGRELRRMSDEF EGSFKGLPRP
BAK_MOUSE   PLEPNSILGQ VGRQLALIGDDI NRRYDTEFQN
BAXB_HUMAN  PVPQDASTKK LSECLKRIGDEL DSNMELQRMI
BimS        EPEDLRPEIR IAQELRRIGDEF NETYTRRVFA
HRK_HUMAN   LGLRSSAAQL TAARLKALGDEL HQRTMWRRRA
Egl-1       DSEISSIGYE IGSKLAAMCDDF DAQMMSYSAH
BID_MOUSE   SESQEEIIHN IARHLAQIGDEM DHNIQPTLVR
sequence X  SESSSELLHN SAGHAAQLFDSM RLDIGSTAHR
sequence Y  PGLKSSAANI LSQQLKGIGDDL HQRMMSYSAH
Why is a BLAST match rejected by the family?
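One way to see the point of this slide is to score the central 12-residue block of sequence X and sequence Y twice: once by plain identity to BID_MOUSE alone, and once against a position-specific matrix built from all seven family members. The sketch below is illustrative only. The pseudocount of 1, the uniform background of 1/20, and the function names are assumptions, not the method of any particular tool.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Central conserved block of each family member, copied from the slide.
FAMILY_MOTIFS = {
    "BID_MOUSE": "IARHLAQIGDEM",
    "BAD_MOUSE": "YGRELRRMSDEF",
    "BAK_MOUSE": "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS": "IAQELRRIGDEF",
    "HRK_HUMAN": "TAARLKALGDEL",
    "Egl-1": "IGSKLAAMCDDF",
}
SEQUENCE_X = "SAGHAAQLFDSM"
SEQUENCE_Y = "LSQQLKGIGDDL"


def build_log_odds(motifs, pseudocount=1.0):
    """Per-position log2 odds of each residue versus a uniform background."""
    denom = len(motifs) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(len(motifs[0])):
        column = [m[pos] for m in motifs]
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            row[aa] = math.log2(freq * len(AMINO_ACIDS))
        matrix.append(row)
    return matrix


def profile_score(matrix, segment):
    """Sum the position-specific scores along an ungapped segment."""
    return sum(row[aa] for row, aa in zip(matrix, segment))


def identities(a, b):
    """Count identical positions between two equal-length segments."""
    return sum(x == y for x, y in zip(a, b))


matrix = build_log_odds(list(FAMILY_MOTIFS.values()))
for name, seg in (("sequence X", SEQUENCE_X), ("sequence Y", SEQUENCE_Y)):
    print(name,
          "identities to BID:", identities(seg, FAMILY_MOTIFS["BID_MOUSE"]),
          "profile score:", round(profile_score(matrix, seg), 2))
```

Sequence X shares more identical positions with BID_MOUSE in this block, yet sequence Y scores higher against the family matrix because it keeps the residues that are conserved across the whole family (the L at position 5 and the G-D pair).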

Representation of positional information in a specific motif
Binary patterns: M-C-N-S-S-C-[MV]-G-G-M-N-R-R.
Positional matrix: columns A C D E F G H I K L M N …., one row per position (Pos)
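A binary pattern of this kind either matches a sequence or it does not, so it maps directly onto a regular expression. The following sketch converts a PROSITE-style pattern into a Python regex. The helper name `prosite_to_regex` is hypothetical, and only the common elements are handled (single residues, `[..]` alternatives, `{..}` exclusions, `x` wildcards and `(n)` or `(n,m)` repeats); terminal anchors are not.

```python
import re

# One pattern element: a residue, [allowed], {forbidden} or x,
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("M-C-N-S-S-C-[MV]-G-G-M-N-R-R.")
print(regex)
print(bool(re.search(regex, "AAMCNSSCVGGMNRRPI")))
```

The all-or-nothing behaviour is the main limitation: a single unexpected residue at any position removes the hit, which is what the positional matrix representation is meant to soften.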

DNA Pattern - Transcription factor binding site
Practice: identify potential transcription factor binding sites on a promoter sequence, using TESS (Transcription Element Search System).

TESS result

Why are there many false positives in a TF binding site scan?
- Contextual dependency is not considered.
- Stringency of the matrices.

DNA Pattern - Transcription factor binding site
Pattern strings and matrices are extracted from known binding sequences.
Core vs. whole.
Short and/or ambiguous patterns will have many hits.

Stringency of the matrices
P53_01, consensus 10 bp: N G R C W T G Y C Y
P53_02, consensus 20 bp: G G A C A T G C C C G G G C A T G T C Y
Each matrix row gives the A, C, G and T counts for one position, followed by the consensus base.
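The difference in stringency between the 10 bp and the 20 bp consensus can be made concrete by estimating how often each would match by chance. The sketch below treats the consensus as a strict IUPAC pattern and assumes independent, equally frequent bases, which real promoters do not satisfy, so the numbers are order-of-magnitude guides only. The function names are illustrative.

```python
# IUPAC nucleotide codes and the bases each one allows.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(consensus):
    """Chance that a random window matches, assuming uniform bases."""
    p = 1.0
    for symbol in consensus:
        p *= len(IUPAC_DNA[symbol]) / 4
    return p


def expected_random_hits(consensus, sequence_length, both_strands=True):
    """Expected number of chance matches in a sequence of the given length."""
    windows = max(sequence_length - len(consensus) + 1, 0)
    strands = 2 if both_strands else 1
    return strands * windows * match_probability(consensus)


for name, consensus in (("10 bp", "NGRCWTGYCY"),
                        ("20 bp", "GGACATGCCCGGGCATGTCY")):
    print(name, expected_random_hits(consensus, 1_000_000))
```

Under these assumptions the degenerate 10 bp consensus is expected to match by chance on the order of a hundred times per megabase, while the 20 bp consensus essentially never does, which fits the observation that short or ambiguous patterns produce many hits.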

How are the motif matrices derived? Example: Hidden Markov Model
A specification of “how events will happen”, so that a statistical assessment can be readily made.
Used for speech recognition and for characterizing sequence patterns.

Hidden Markov Model
1.) Possible states of events, or “routes”: transition probability
Diagram states: Position 1, Position 2, Position 3, Deletion, Insertion

Hidden Markov Model
2.) Possible AA at a given position: emission probability
Diagram: Position 1 and Position 2, each listing A C D E F G H I
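The two ingredients, transition and emission probabilities, are enough to assign a probability to a sequence travelling along a given route. The toy model below is a sketch with three match positions, one insertion state and one deletion state. All state names and probability values are made up for illustration and are not taken from any trained model.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission(preferred):
    """Give the listed residues their stated probability and spread
    the remaining probability evenly over all other residues."""
    rest = (1.0 - sum(preferred.values())) / (len(AMINO_ACIDS) - len(preferred))
    return {aa: preferred.get(aa, rest) for aa in AMINO_ACIDS}


# 1.) Routes between states (illustrative numbers).
TRANSITIONS = {
    "Start": {"M1": 1.0},
    "M1": {"M2": 0.8, "I1": 0.1, "D2": 0.1},
    "I1": {"I1": 0.3, "M2": 0.7},
    "M2": {"M3": 1.0},
    "D2": {"M3": 1.0},
    "M3": {"End": 1.0},
}

# 2.) Residue preferences per emitting state; D2 is silent.
EMISSIONS = {
    "M1": emission({"L": 0.9}),
    "M2": emission({"G": 0.7, "S": 0.1, "C": 0.1}),
    "M3": emission({"D": 0.9}),
    "I1": emission({}),
}


def path_probability(path, sequence):
    """Probability of emitting `sequence` while following `path`."""
    prob = 1.0
    previous = "Start"
    symbols = iter(sequence)
    for state in list(path) + ["End"]:
        prob *= TRANSITIONS[previous].get(state, 0.0)
        if state in EMISSIONS:
            prob *= EMISSIONS[state][next(symbols)]
        previous = state
    return prob


print(path_probability(["M1", "M2", "M3"], "LGD"))        # all match states
print(path_probability(["M1", "D2", "M3"], "LD"))         # position 2 deleted
print(path_probability(["M1", "I1", "M2", "M3"], "LAGD"))  # one inserted residue
```

A full profile HMM search would sum or maximise over all possible routes (the forward and Viterbi algorithms) instead of scoring one hand-picked path.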

Hidden Markov Model 3.) That’s all. Let’s give it a try and you will know what it is about

Hidden Markov Model
How do I make an HMM for my motif?
Collect related sequences.
MEME: bin/meme.cgi
*Selection of sequences determines the model*

Identifying motifs using MEME - Multiple EM for Motif Elicitation
EM: Expectation maximization (P ).
Identifies statistically significant motif(s) in a set of sequences.
Outlines the occurrences of the motifs at the end of the report.
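The expectation maximization idea can be sketched in a few dozen lines. This is not MEME itself: it assumes exactly one motif occurrence per sequence, a uniform background, a single user-supplied starting guess (`seed`), and it computes no significance statistics. The toy DNA sequences and all names below are invented for illustration.

```python
BASES = "ACGT"

# Toy input: the same 8-mer is planted once in each sequence.
TOY_SEQUENCES = [
    "ACACACACTGACGTCACACACA",
    "GGTTGGTGACGTCATTGGTTGGTT",
    "CCAACCAACCAATGACGTCACC",
    "TGACGTCAGTGTGTGTGTGT",
]


def em_motif(sequences, seed, iterations=20, pseudocount=0.1):
    """Refine a position weight matrix by EM, one site per sequence."""
    width = len(seed)
    background = 0.25
    # Starting matrix: favour the seed base at every position.
    pwm = [{b: (0.7 if b == s else 0.1) for b in BASES} for s in seed]
    posteriors = []
    for _ in range(iterations):
        # E-step: probability of each start position in each sequence.
        posteriors = []
        for seq in sequences:
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= pwm[j][seq[start + j]] / background
                weights.append(ratio)
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate the matrix from the weighted windows.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq, post in zip(sequences, posteriors):
            for start, p in enumerate(post):
                for j in range(width):
                    counts[j][seq[start + j]] += p
        pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    consensus = "".join(max(col, key=col.get) for col in pwm)
    return pwm, best_starts, consensus


# The seed deliberately differs from the planted motif at one position.
pwm, starts, consensus = em_motif(TOY_SEQUENCES, seed="TTACGTCA")
print(starts, consensus)
```

Because the result depends on which sequences are supplied and on where the search starts, changing the input set can change the motif that is reported.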

Practice: Identify conserved motifs using MEME
1.) Input your own address.
2.) Load the file of multiple FASTA-format sequences.
3.) You can change other options based on your needs.

Two search examples
- The outcome of the search depends on the input set of sequences.
- Compose the input set based on your research needs.
Set1: Mammalian P53 plus mosquito hits
Set2: Diverse set of P53 plus mosquito hits

Two search examples Set1: Mammalian P53 plus mosquito hits Set2: Diverse set of P53 plus mosquito hits

Secondary structure prediction
Predict the likelihood of amino acid x being in each of the three (four) types of secondary structure configuration:
Helix, Sheet, Turn, Coil
A coiled-coil is two helices tangled together.

Secondary structure prediction - different strategies and algorithms
Chou-Fasman / Garnier Method -- based on AA composition
Nearest Neighbor / Levin Method -- based on sequence similarity
Neural Network / PHD
SOPM, DPM, DSC, etc.

Results are given at the single amino acid level

Secondary structure prediction -- Interpretation of result
Seq: - D G S L A D E R K
Pre: - B B B H H B B T T
What is the likelihood of helix formation here?

Secondary structure prediction - Accuracy
At the amino acid level: ~75%, based on the testing set.
IF:
Seq: - D G I L A V A S M I V
Pre: - B B H H H H H H H H H
Length > 9: > 90% chance of helix formation around this region
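Since a long uninterrupted stretch of predicted helix is more trustworthy than any single residue call, a simple post-processing step is to list the helix runs in a prediction string. The sketch below is illustrative. The function name is hypothetical, and the threshold is left as a parameter (defaulting to 9) because the slide's "Length > 9" sits next to an example whose run is nine residues long.

```python
from itertools import groupby


def helix_runs(prediction, min_length=9):
    """Return (start, end, length) for every run of 'H' that is at least
    min_length long. Positions are 0-based and end is exclusive."""
    runs = []
    position = 0
    for state, group in groupby(prediction):
        length = len(list(group))
        if state == "H" and length >= min_length:
            runs.append((position, position + length, length))
        position += length
    return runs


# Per-residue predictions from the two example slides, spaces removed.
print(helix_runs("BBBHHBBTT"))     # short helix call only
print(helix_runs("BBHHHHHHHHH"))   # long helix call
```

The first string contains only a two-residue helix call, so nothing is reported, while the second yields a single run covering residues 3 to 11.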

Secondary structure prediction - Programs
There are over a dozen web sites that provide secondary structure prediction services (tools).
AntheWin has a good sample of different approaches and has other associated tools.

Practice: using AnathePro to analyze protein secondary structure
Open the sequence file.
Choose and run a secondary structure prediction method from the “Methods” menu.
LEFT click the left boundary of an alpha helix, then RIGHT click the right boundary, and perform “Helical Wheel”.

Helical wheel to discern helix subtype
Residue classes: Hydrophilic, Hydrophobic, Others
Helix subtypes: Amphipathic, Hydrophobic, Hydrophilic
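What the helical wheel shows by eye can also be put into numbers. In an ideal alpha helix each residue is rotated about 100 degrees from the previous one (3.6 residues per turn), so projecting hydrophobicity values onto those angles gives a hydrophobic moment: large for an amphipathic helix, near zero when hydrophobic residues are spread evenly around the wheel. The sketch below uses the Kyte-Doolittle hydropathy scale. The function names are illustrative, and no classification cut-offs are given because none appear in the slides.

```python
import math

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def wheel_angles(length, degrees_per_residue=100.0):
    """Angular position of each residue on the helical wheel."""
    return [(i * degrees_per_residue) % 360.0 for i in range(length)]


def mean_hydrophobicity(sequence):
    """Average hydropathy, ignoring where residues sit on the wheel."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def hydrophobic_moment(sequence, degrees_per_residue=100.0):
    """Length of the summed hydropathy vectors, per residue."""
    x = y = 0.0
    for i, aa in enumerate(sequence):
        angle = math.radians(i * degrees_per_residue)
        x += KYTE_DOOLITTLE[aa] * math.cos(angle)
        y += KYTE_DOOLITTLE[aa] * math.sin(angle)
    return math.hypot(x, y) / len(sequence)


# Artificial 18-residue helices: Leu on one face and Lys on the other,
# compared with Leu everywhere.
amphipathic = "".join(
    "L" if math.cos(math.radians(a)) > 0 else "K" for a in wheel_angles(18)
)
for seq in (amphipathic, "L" * 18):
    print(seq, round(mean_hydrophobicity(seq), 2),
          round(hydrophobic_moment(seq), 2))
```

A high moment with a moderate mean suggests an amphipathic helix, while a high mean with a low moment suggests a uniformly hydrophobic one.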