Generic substitution matrix based sequence comparison

Presentation transcript:

Generic substitution matrix based sequence comparison

Alignment with a gap:
Q:    M   A   T   W   L   I
A:    M   A   -   W   T   V
Scr:  5   4  -?  11  -1   3     Total = 22 - ?

Alignment without a gap:
Q:    M   A   T   W   L   I
A:    M   A   W   T   V   A
Scr:  5   4  -2  -2   1  -1     Total: 5

Blosum 62: Gap opening: -6 ~ -15; Gap extension: -2 ~ -6
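The totals on this slide can be checked with a few lines of Python. This is a minimal sketch, not a full aligner: the dictionary holds only the BLOSUM62 entries needed for these two alignments, the function names are made up for illustration, and the gap convention (the first gap position costs gap_open, each further position costs gap_extend) is an assumption, since programs differ on this point.

```python
# Only the BLOSUM62 entries needed for the two alignments on the slide.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3, ("T", "W"): -2,
    ("L", "V"): 1, ("I", "A"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, subject, gap_open=-10, gap_extend=-2):
    """Sum column scores of two pre-aligned, equal-length strings.

    Assumed convention: the first gap position costs gap_open and each
    following gap position costs gap_extend.
    """
    total = 0
    in_gap = False
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


gapped = alignment_score("MATWLI", "MA-WTV", gap_open=-10)
ungapped = alignment_score("MATWLI", "MAWTVA")
print("gapped:", gapped, "ungapped:", ungapped)
```

With any gap opening penalty in the -6 to -15 range quoted on the slide, the gapped alignment (22 minus the penalty) still scores higher than the ungapped one.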

Blast output

Questions after the Blast search. Questions: Is this an expressed gene in the Aedes mosquito? - Gene prediction & gene structure. Is this the true ortholog of TNF? - Fundamentals of sequence comparison - protein domains/motifs.

Position-specific information about conserved domains is IGNORED in a single-sequence-initiated search

BID_MOUSE    SESQEEIIHN  IARHLAQIGDEM  DHNIQPTLVR
BAD_MOUSE    APPNLWAAQR  YGRELRRMSDEF  EGSFKGLPRP
BAK_MOUSE    PLEPNSILGQ  VGRQLALIGDDI  NRRYDTEFQN
BAXB_HUMAN   PVPQDASTKK  LSECLKRIGDEL  DSNMELQRMI
BimS         EPEDLRPEIR  IAQELRRIGDEF  NETYTRRVFA
HRK_HUMAN    LGLRSSAAQL  TAARLKALGDEL  HQRTMWRRRA
Egl-1        DSEISSIGYE  IGSKLAAMCDDF  DAQMMSYSAH

BID_MOUSE    SESQEEIIHN  IARHLAQIGDEM  DHNIQPTLVR
sequence X   SESSSELLHN  SAGHAAQLFDSM  RLDIGSTAHR
sequence Y   PGLKSSAANI  LSQQLKGIGDDL  HQRMMSYSAH

Why is a BLAST match rejected by the family?

Situations where a generic scoring matrix is not suitable: short exact matches; specific patterns

1. DNA pattern – transcription factor binding site. 2. Short protein pattern – enzyme recognition sites. 3. Protein motif/signature.

Binary patterns for protein and DNA Caspase recognition site: [EDQN] X [^RKH] D [ASP] Examples: Observe: Search for potential caspase recognition sites with BaGua
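A bracketed pattern like the caspase site above maps directly onto a regular expression. The sketch below is only an illustration: the helper names and the test peptide are invented, and it assumes that X means "any residue" and that tokens are separated by spaces as on the slide.

```python
import re


def pattern_to_regex(pattern):
    """Turn a space-separated pattern such as '[EDQN] X [^RKH] D [ASP]'
    into a regular expression. 'X' is assumed to mean any residue."""
    tokens = pattern.split()
    return "".join("." if tok.upper() == "X" else tok for tok in tokens)


def find_pattern(pattern, sequence):
    """Return (start, matched_text) for every match, overlaps included."""
    # A lookahead is used so that overlapping sites are all reported.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


CASPASE_SITE = "[EDQN] X [^RKH] D [ASP]"
peptide = "MKDEVDSLL"  # hypothetical peptide, for illustration only
print(pattern_to_regex(CASPASE_SITE))
print(find_pattern(CASPASE_SITE, peptide))
```

Start positions are 0-based. A match is all-or-nothing, which is why this kind of pattern is called binary.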

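Searching for binary (string) patterns
Seq: A G G G C T C A T G A C A G
[Figure: a degenerate DNA pattern containing the ambiguity codes R and W is slid along the sequence one position at a time; a window in which every position agrees is a positive match.]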
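The sliding search can be sketched with the standard IUPAC nucleotide ambiguity codes (R = A or G, W = A or T, and so on). The pattern and sequence used below are hypothetical and are not the ones from the slide figure; the function names are also made up.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Expand each IUPAC letter into the set of bases it stands for."""
    return "".join(IUPAC[base] for base in pattern.upper())


def find_sites(pattern, sequence):
    """Return (start, matched_text) for every window that fits the pattern."""
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


# Hypothetical pattern and sequence, for illustration only.
print(find_sites("RCWG", "TTACAGGCTGAA"))
```

Only the given strand is scanned here; a real binding site search would normally also scan the reverse complement.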

Does a binary pattern convey all the information? Weighted matrix / profile HMM model: for searching protein domains

What determines a protein family? Structural similarity. Functional conservation.

Practice: motif analysis of protein sequences using ScanProsite and Pfam. 1. Open two tabs for Pfam; input one of the Blast hits in one data window and one candidate TNF in the other. 2. Compare the results.

Scan protein for identified motifs. A service provided by major motif databases such as Prosite, Pfam, Blocks, etc. A protein family signature motif often indicates structural and functional properties. High-frequency motifs may only have suggestive value.

What is the possible function of my protein? Which family does my protein belong to? -- Profile databases: Pfam ( ) Prosite ( ) InterPro ( ) Prints ( TS.html )

Protein motif /domain Structural unit Functional unit Signature of protein family How are they defined?

How do we represent the position-specific preference?
BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F
Binary pattern: L [GSC] [HEQCRK] X [^ILMFV]
Basic concept of motif identification 2.

How do we represent the position-specific preference?
BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F
Statistical representation (column 9 of the 7 sequences): G: 5 -> 71%; S: 1 -> 14%; C: 1 -> 14%
Basic concept of motif identification 2.
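The percentages on this slide are plain column counts. A minimal sketch, using the seven aligned motif segments from the slide (the function name is made up for illustration):

```python
from collections import Counter

# The seven aligned 12-residue segments shown on the slide.
ALIGNMENT = [
    "IARHLAQIGDEM",  # BID_MOUSE
    "YGRELRRMSDEF",  # BAD_MOUSE
    "VGRQLALIGDDI",  # BAK_MOUSE
    "LSECLKRIGDEL",  # BAXB_HUMAN
    "IAQELRRIGDEF",  # BimS
    "TAARLKALGDEL",  # HRK_HUMAN
    "IGSKLAAMCDDF",  # Egl-1
]


def column_counts(alignment, column):
    """Count the residues found in one alignment column (0-based index)."""
    return Counter(seq[column] for seq in alignment)


counts = column_counts(ALIGNMENT, 8)  # ninth column, the G/S/C position
for residue, n in counts.most_common():
    print(f"{residue}: {n} -> {100 * n / len(ALIGNMENT):.0f}%")
```

Repeating this for every column gives the frequency table from which a weighted matrix is built.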

Scoring sequence based on Model
Seq: A S L D E L G D E A C D …....
position_1 = s(A/1) + s(S/2) + s(L/3) + s(D/4)
An example of position specific matrix
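The window score s(residue/position) summed along the sequence can be sketched as follows. This is one common way to build such a matrix, not necessarily the one used in the lecture: the log-odds scores assume a uniform background of 1/20 per amino acid and a pseudocount of 1, and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Aligned motif segments from the earlier slide.
ALIGNMENT = [
    "IARHLAQIGDEM", "YGRELRRMSDEF", "VGRQLALIGDDI", "LSECLKRIGDEL",
    "IAQELRRIGDEF", "TAARLKALGDEL", "IGSKLAAMCDDF",
]


def build_pssm(alignment, pseudocount=1.0, background=1.0 / 20):
    """Build log2-odds scores per column (assumed uniform background)."""
    pssm = []
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """score = s(res_1 / 1) + s(res_2 / 2) + ... over one window."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def scan(pssm, sequence):
    """Score every window of motif length along the sequence."""
    width = len(pssm)
    return [(i, score_window(pssm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


pssm = build_pssm(ALIGNMENT)
bid_mouse = "SESQEEIIHNIARHLAQIGDEMDHNIQPTLVR"  # BID_MOUSE segment from the slides
best_start, best_score = max(scan(pssm, bid_mouse), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Unlike a binary pattern, every window gets a graded score, so a cutoff has to be chosen to decide what counts as a hit.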

Representation of positional information in specific motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R. Binary patterns: Positional matrix:

Hidden Markov Model A specification on “how events will happen” so that statistical assessment can be readily made Used for Speech recognition and characterizing sequence patterns

Hidden Markov Model 1.) Possible states of events or "routes" -- transition probability [Diagram: Position 1, Position 2 and Position 3, with Insertion and Deletion states between them]

Hidden Markov Model 2.) Possible AA at a given position -- emission probability [Diagram: Position 1 and Position 2, each with a column of amino acids A C D E F G H I ...]
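The two ingredients, transition and emission probabilities, multiply along a route through the model. The toy model below is entirely hypothetical: the state names (M1 to M3 for positions, I1 for an insertion, D2 for a deletion of position 2) and every probability are invented for illustration.

```python
import math

# Hypothetical transition probabilities between states.
TRANSITIONS = {
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "I1"): 0.2, ("I1", "M2"): 0.8,
    ("M2", "M3"): 1.0,
    ("D2", "M3"): 1.0,
}

# Hypothetical emission probabilities; the deletion state D2 emits nothing.
EMISSIONS = {
    "M1": {"L": 0.8, "I": 0.1, "V": 0.1},
    "M2": {"G": 0.7, "S": 0.15, "C": 0.15},
    "M3": {"D": 1.0},
    "I1": {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"},
}


def path_log_prob(path, residues):
    """Natural-log probability of emitting residues along one state path."""
    log_p = 0.0
    pos = 0
    previous = None
    for state in path:
        if previous is not None:
            p = TRANSITIONS.get((previous, state), 0.0)
            if p == 0.0:
                return float("-inf")
            log_p += math.log(p)
        if state in EMISSIONS:  # match and insert states consume a residue
            if pos >= len(residues):
                return float("-inf")
            p = EMISSIONS[state].get(residues[pos], 0.0)
            if p == 0.0:
                return float("-inf")
            log_p += math.log(p)
            pos += 1
        previous = state
    # Every residue must be accounted for by some emitting state.
    return log_p if pos == len(residues) else float("-inf")


print(math.exp(path_log_prob(["M1", "M2", "M3"], "LGD")))
print(math.exp(path_log_prob(["M1", "D2", "M3"], "LD")))
print(math.exp(path_log_prob(["M1", "I1", "M2", "M3"], "LAGD")))
```

A real profile HMM search sums or maximizes over all possible routes (forward or Viterbi algorithm) instead of scoring one hand-picked path.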

Hidden Markov Model 3.) That’s all. Let’s give it a try and you will know what it is about

Hidden Markov Model: How to make an HMM for my petty motif. Collect related sequences. MEME : bin/meme.cgi *Selection of sequences determines the model*

Identifying motifs using MEME - Multiple EM for Motif Elicitation. EM: Expectation maximization (P ). Identifies statistically significant motif(s) in a set of sequences. Outlines the occurrences of the motifs at the end of the report.
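The expectation-maximization idea can be illustrated with a stripped-down sketch. This is not MEME itself: it assumes exactly one motif occurrence per sequence, a fixed motif width, a DNA alphabet, and a user-supplied starting guess (the seed), and it reports no statistical significance. The toy sequences and all names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def em_motif(seqs, seed, iterations=20, pseudocount=0.5):
    """Toy EM: one motif occurrence per sequence, width = len(seed)."""
    width = len(seed)
    # Background frequencies from overall composition (add-one smoothing).
    comp = Counter("".join(seqs))
    total = sum(comp[c] for c in ALPHABET) + len(ALPHABET)
    background = {c: (comp[c] + 1) / total for c in ALPHABET}
    # Start from a weight matrix that favours the seed word.
    pwm = [{c: (0.7 if c == s else 0.1) for c in ALPHABET} for s in seed]
    posteriors = []
    for _ in range(iterations):
        # E-step: probability that the motif starts at each position.
        posteriors = []
        for seq in seqs:
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    c = seq[start + k]
                    w *= pwm[k][c] / background[c]
                weights.append(w)
            norm = sum(weights)
            posteriors.append([w / norm for w in weights])
        # M-step: re-estimate the matrix from the weighted windows.
        pwm = []
        for k in range(width):
            counts = {c: pseudocount for c in ALPHABET}
            for seq, post in zip(seqs, posteriors):
                for start, p in enumerate(post):
                    counts[seq[start + k]] += p
            norm = sum(counts.values())
            pwm.append({c: counts[c] / norm for c in ALPHABET})
    return pwm, posteriors


# Hypothetical input: the word CACGTG is planted once in each sequence.
sequences = ["AAACACGTGAAA", "TCACGTGTTTTT", "GGGGGGCACGTG"]
pwm, posteriors = em_motif(sequences, seed="CACGAG")  # seed has one mismatch
consensus = "".join(max(col, key=col.get) for col in pwm)
starts = [post.index(max(post)) for post in posteriors]
print(consensus, starts)
```

The result depends on both the input set and the starting guess, which is the practical meaning of "selection of sequences determines the model".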

Practice: Identify conserved motifs using MEME 1.) Input your own address. 2.) Load the file of multiple Fasta format sequences. 3.) You can change other options based on your needs.

Two search examples. - The outcome of the search depends on the input set of sequences. - Compose the input set based on your research needs. Set1: Mammalian P53 plus mosquito hits. Set2: Diverse set of P53 plus mosquito hits.

Two search examples Set1: Mammalian P53 plus mosquito hits Set2: Diverse set of P53 plus mosquito hits

Secondary structure prediction Predict the likelihood of amino acid x to be in each of the three (four) types of secondary structure configuration Helix Sheet Turn Coil Coiled-coil is two helices tangled together

Secondary structure prediction - different strategies and algorithms Chou-Fasman / Garnier Method -- based on AA composition Nearest Neighbor / Levin Method -- based on sequence similarity Neural Network / PHD SOPM, DPM, DSC, etc.

Results are given at single amino acid level

Secondary structure prediction --Interpretation of result Seq: - D G S L A D E R K Pre: - B B B H H B B T T What is the likelihood of helix formation here ?

Secondary structure prediction - Accuracy. At the amino acid level: ~75% based on the testing set. IF: Seq: - D G I L A V A S M I V Pre: - B B H H H H H H H H H Length > 9: > 90% chance of helix formation around this region
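The "length > 9" rule of thumb can be applied to a per-residue prediction string with a short scan for long runs. A minimal sketch; the function name, the one-letter state code H for helix, and the example prediction string are assumptions for illustration.

```python
import re


def long_runs(prediction, state="H", min_length=10):
    """Return (start, end) of each run of `state` with at least min_length
    residues; start is 0-based and end is exclusive."""
    pattern = re.compile(re.escape(state) + "{" + str(min_length) + ",}")
    return [(m.start(), m.end()) for m in pattern.finditer(prediction)]


# Hypothetical per-residue prediction (B = sheet, H = helix, T = turn).
predicted = "BBHHHHHHHHHHTTBBHHHHTT"
print(long_runs(predicted))
```

Short helix runs, such as the four-residue run in the example, are dropped because single-residue accuracy alone does not make them trustworthy.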

Secondary structure prediction - Programs. Over a dozen web sites provide secondary structure prediction services (tools). AntheWin has a good sample of different approaches and has other associated tools.

Practice: using AnathePro to analyze protein secondary structure. Open the sequence file. Choose and run a secondary structure prediction method from the "Methods" menu. LEFT click the left boundary of an alpha helix, then RIGHT click the right boundary, and perform "Helical Wheel".

Helix wheel to discern helix subtype. [Figure: three helical wheels labeled Amphipathic, Hydrophobic and Hydrophilic; residues are marked as Hydrophilic, Hydrophobic or Others.]
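A helical wheel places successive residues 100 degrees apart, because an alpha helix has about 3.6 residues per turn. One way to turn the picture into a number is the hydrophobic moment, which is a technique added here for illustration and not something named on the slide. The sketch assumes a crude two-value hydrophobicity scale and an arbitrary choice of hydrophobic residues.

```python
import math

HYDROPHOBIC = set("AILMFVWC")  # assumed, simplified residue classification
DEGREES_PER_RESIDUE = 100      # about 3.6 residues per alpha-helix turn


def wheel_angles(helix):
    """Angular position of each residue on the helical wheel, in degrees."""
    return [(aa, (i * DEGREES_PER_RESIDUE) % 360) for i, aa in enumerate(helix)]


def hydrophobic_moment(helix):
    """Length of the mean hydrophobicity vector (+1 hydrophobic, -1 other).

    Near 0: hydrophobic residues are spread evenly around the wheel.
    Larger values: they cluster on one face (amphipathic helix)."""
    x = y = 0.0
    for i, aa in enumerate(helix):
        h = 1.0 if aa in HYDROPHOBIC else -1.0
        angle = math.radians(i * DEGREES_PER_RESIDUE)
        x += h * math.cos(angle)
        y += h * math.sin(angle)
    return math.hypot(x, y) / len(helix)


# Hypothetical 18-residue helices (18 residues = 5 full turns).
all_hydrophobic = "L" * 18
amphipathic = "".join(
    "L" if math.cos(math.radians(i * DEGREES_PER_RESIDUE)) > 0 else "K"
    for i in range(18)
)
print(amphipathic)
print(round(hydrophobic_moment(all_hydrophobic), 3))
print(round(hydrophobic_moment(amphipathic), 3))
```

Uniformly hydrophobic and uniformly hydrophilic helices both give a moment near zero, so the average hydrophobicity is still needed to tell those two subtypes apart.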

Practice: identify potential transcription factor binding sites on a promoter sequence. Using TESS : Transcription Element Search System

TESS result

Why are there many false positives in a TF binding site scan? - Contextual dependency is not considered. - Stringency of the matrices.

Stringency of the matrices. [Table: nucleotide count matrices with columns A, C, G, T and a consensus column for P53_01 (consensus, 10 bp) and P53_02 (consensus, 20 bp); the individual values are not recoverable from the transcript.]

DNA Pattern – Transcription factor binding site. Pattern strings / matrices are extracted from known binding sequences. Core vs whole. Some short and/or ambiguous patterns will have many hits.
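How many hits a short or ambiguous pattern collects by chance can be estimated with simple arithmetic. The sketch below assumes a random sequence with equal base frequencies and scans a single strand; both are simplifications, and the function name and example patterns are made up.

```python
# Number of bases allowed by each IUPAC nucleotide code.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def expected_random_hits(pattern, sequence_length):
    """Expected chance matches on one strand of a random sequence,
    assuming each base occurs with probability 1/4."""
    windows = sequence_length - len(pattern) + 1
    if windows <= 0:
        return 0.0
    p_match = 1.0
    for code in pattern.upper():
        p_match *= DEGENERACY[code] / 4
    return windows * p_match


# Hypothetical patterns scanned against a 10 kb promoter region.
for pattern in ["GACAGT", "RCWG", "RRRCWWGYYY"]:
    print(pattern, round(expected_random_hits(pattern, 10_000), 2))
```

Shorter and more degenerate patterns have a higher per-window match probability, so most of their hits in a long sequence are expected to be chance matches.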