Bioinformatics: Motif Detection (revised 27/10/06)


Bioinformatics Motif Detection Revised 27/10/06

Overview
Introduction
Multiple Alignments
Multiple alignment based on HMM
Motif Finding
– Motif representation
– Algorithm
– Search space
– Word counting methods
– Probabilistic methods
Profile Searches
– Introduction
Exercises

Introduction
Global multiple alignment (ClustalW)
– Proteins, nucleotides
– Long stretches of conservation essential
– Identification of protein family profiles
– Gaps are scored
Local multiple alignments (motif detection)
– Proteins, nucleotides
– Short stretches of conservation (12 nt, 6 aa)
– Identification of regulatory motifs (DNA, protein)
– No explicit gap scoring
– Explicit use of a profile


HMM


Transcriptional regulation
[Figure: a cell signal, a sigma factor and a shared motif (marked "?") upstream of Gene 1 to Gene 4. Labels: chromosome, gene, transcription, mRNA, translation, protein.]

Motif Representation
Consensus sequence:
– Reductionist representation of a motif
– The most frequent instance is used as a representative
– Loss of information
Regular expression:
– More complex representation that allows motif degeneracy
Position-specific scoring matrix (PSSM):
– Probabilistic representation
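The step from aligned motif instances to a consensus and to a probabilistic matrix can be sketched as follows. The four sites are invented for illustration, and the pseudocount of 1 is an assumption, not something the slides specify.

```python
from collections import Counter

# Hypothetical aligned instances of one motif (illustrative data only).
sites = ["CTTAAT", "CTTAAG", "CTTCAT", "ATTAAT"]

def count_matrix(sites):
    """Per-position nucleotide counts for equally long, aligned sites."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]

def consensus(counts):
    """Most frequent nucleotide per position (ties broken alphabetically)."""
    return "".join(min(col, key=lambda b: (-col[b], b)) for col in counts)

def frequency_matrix(counts, pseudocount=1.0):
    """Position-specific probabilities; the pseudocount avoids zero entries."""
    matrix = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        matrix.append({b: (col[b] + pseudocount) / total for b in "ACGT"})
    return matrix

counts = count_matrix(sites)
pssm = frequency_matrix(counts)
print(consensus(counts))  # the consensus discards the minority bases
print(pssm[0])            # the matrix keeps them as probabilities
```

The consensus keeps one letter per column, whereas the matrix retains how often each alternative base was seen, which is the information the consensus loses.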

Consensus: CTTAATATTAACTTAAT
Regular expression: CTTAAKRTTMAYTTAAT
PSSM: (motif logo)
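The degenerate letters in the regular-expression form are standard IUPAC nucleotide codes (K = G or T, R = A or G, M = A or C, Y = C or T). A minimal sketch of how such a pattern could be expanded and searched with Python's `re` module; the test sequence is made up for illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate motif into an ordinary regular expression."""
    return "".join(IUPAC[symbol] for symbol in motif)

pattern = re.compile(iupac_to_regex("CTTAAKRTTMAYTTAAT"))

# Hypothetical sequence: the consensus instance flanked by GG and CC.
sequence = "GGCTTAATATTAACTTAATCC"
hits = [match.start() for match in pattern.finditer(sequence)]
print(hits)
```

Note that `finditer` reports non-overlapping matches only; overlapping sites would need a lookahead pattern.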

Motif Representation

Overview Algorithms
Search for motifs that are present more frequently in a set of sequences than in a set of unrelated sequences.
Methods based on word counting (regular expression)
NP problems: heuristic methods, clever algorithms
– Motif of width w = 8: 4^8 = 65,536 possible DNA words
– Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
Probabilistic methods (weight matrix)
Multiple alignment by locally aligning small conserved regions in a set of unaligned sequences
Motif model represented by a probability matrix
EM, Gibbs sampler (optimization algorithms)
– AlignACE
– BioProspector
– Motif Sampler

Search space
When are motifs statistically overrepresented?
Set of coexpressed (coregulated) sequences
– Literature searches
– Microarrays, expression profiling
Set of orthologous sequences (phylogenetic footprinting)
– Comparative genomics
– Orthologous sequences share an ancestral origin => similar mechanism of transcriptional regulation

Search space: coexpression
[Figure: workflow diagram. Labels: cDNA arrays, preprocessing of the data, clustering, EMBL, BLAST, upstream regions, motif finding, Gibbs sampling.]

Search space: phylogenetic footprinting
PhoPQ is a ubiquitous system: Salmonella, Escherichia, Yersinia, Vibrio, Pseudomonas, Providencia, Pectobacterium
PhoPQ is autoregulated

Search space

Overview Algorithms
Methods based on word counting
NP problems: heuristic methods, clever algorithms
– Motif of width w = 8: 4^8 = 65,536 possible DNA words
– Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
Probabilistic methods
Optimisation problems, self-learning, AI
Motif model represented by a probability matrix
Bayesian, Gibbs sampler
– AlignACE
– BioProspector
– Motif Sampler

Word Counting
Monad frequencies: single word counts (RSA tools; J. van Helden et al., 1998, J. Mol. Biol.)
– Enumerate all oligonucleotides
– Count the number of occurrences of all oligonucleotides of a selected size in a set of coregulated genes
– Compare the number of occurrences with the expected value in the background
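The enumerate, count and compare steps can be sketched as below. This is not the RSA tools implementation: the two sequences are invented, and the background is assumed to be a simple independent single-nucleotide model.

```python
from collections import Counter
from itertools import product

def count_words(sequences, k):
    """Count every overlapping word of length k in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def expected_count(word, base_freqs, n_positions):
    """Expected occurrences under an independent single-nucleotide background."""
    probability = 1.0
    for base in word:
        probability *= base_freqs[base]
    return probability * n_positions

# Hypothetical upstream regions of coregulated genes.
sequences = ["ACGTACGT", "TTACGTTT"]
k = 4
all_words = ["".join(p) for p in product("ACGT", repeat=k)]  # 4**k words
counts = count_words(sequences, k)
n_positions = sum(len(s) - k + 1 for s in sequences)
uniform = {b: 0.25 for b in "ACGT"}

observed = counts["ACGT"]
expected = expected_count("ACGT", uniform, n_positions)
print(len(all_words), observed, expected)
```

Enumerating `4**k` words is what keeps exhaustive counting practical only for short motifs.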

Word Counting
Relevance of the motifs detected
– Expected number of occurrences in the background
– Statistical significance
– p-value and sig score (string-based methods)
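One common way to turn observed and expected counts into a significance value is a binomial tail probability. The sketch below assumes that model, and assumes the sig score is the negative log10 of an E-value (p-value multiplied by the number of words tested); the slides do not give the formula, so both are labelled assumptions.

```python
import math

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k occurrences."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )

def sig_score(p_value, n_words):
    """Assumed definition: -log10 of the E-value (p-value x words tested)."""
    e_value = p_value * n_words
    return -math.log10(e_value)

# Hypothetical numbers: a word seen 3 times in 10 positions,
# with per-position probability 1/256, among 256 possible 4-mers.
p_value = binomial_tail(10, 1 / 256, 3)
print(p_value, sig_score(p_value, 256))
```

Under this definition a positive sig score means fewer than one word would be expected to reach that p-value by chance.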

Probabilistic Algorithms
Methods based on word counting
NP problems: heuristic methods, clever algorithms
– Motif of width w = 8: 4^8 = 65,536 possible DNA words
– Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
Probabilistic methods
Optimisation problems, self-learning, AI
Motif model represented by a probability matrix
Bayesian, Gibbs sampler
– AlignACE
– BioProspector
– Motif Sampler

Probabilistic Algorithms
Find common motifs that represent regulatory elements in the region upstream of the translation start in a set of co-expressed DNA sequences.
Motifs are hidden in background sequence.

Probabilistic Algorithms
Motif representation: probability matrix (PSSM)
Background model:
– Single nucleotide frequencies
– Described by an m-th order Markov process, which can be represented by a transition matrix

Probabilistic Algorithms
Step 1: Initialization of the alignment vector A (predictive update)
Step 2: Calculate the motif model for all sequences except one
[Figure: sequences 1..n with the current motif position in each; axes i, j; example letters G A A T T C A T G T C A C T T C A T T G.]
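Step 2 can be sketched as building a position probability matrix from the sites currently named in the alignment vector, skipping one sequence. The sequences, start positions and pseudocount below are illustrative assumptions.

```python
def motif_model(sequences, positions, width, leave_out, pseudocount=1.0):
    """Probability matrix from the current sites of all sequences but one.

    positions[i] is the assumed motif start in sequences[i] (the alignment
    vector); the sequence at index leave_out is excluded from the counts.
    """
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(sequences, positions)):
        if index == leave_out:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    model = []
    for column in counts:
        total = sum(column.values())
        model.append({b: column[b] / total for b in "ACGT"})
    return model

# Hypothetical data: three sequences and an initial alignment vector.
sequences = ["GAATTC", "ATGTCA", "CTTCAT"]
positions = [1, 0, 2]
model = motif_model(sequences, positions, width=3, leave_out=0)
print(model[0])
```

Here the model is built from the sites "ATG" and "TCA" only; the left-out first sequence is the one that gets rescanned in the next step.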

Probabilistic Algorithms
Step 3 (expectation): Select the remaining sequence. For each window (site), calculate the probability that the sequence in the window is generated by the motif model versus the probability that it is generated by the background model. Assign a weight based on this score to this site.
[Figure: example sequence GAATTATCGTGAATGCGTGGT with window i over positions 1..n; P(S|M) and P(S|B) are each shown as a product of per-position probabilities, but the numeric values did not survive.]
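A minimal sketch of the window scoring in step 3, assuming a single-nucleotide background and weights defined as the normalised ratio P(S|M) / P(S|B). The two-column motif model and the sequence are made up.

```python
def window_weights(seq, model, background):
    """Normalised likelihood ratio P(S|M) / P(S|B) for every window of seq."""
    width = len(model)
    ratios = []
    for start in range(len(seq) - width + 1):
        p_motif = 1.0
        p_background = 1.0
        for j in range(width):
            base = seq[start + j]
            p_motif *= model[j][base]
            p_background *= background[base]
        ratios.append(p_motif / p_background)
    total = sum(ratios)
    return [ratio / total for ratio in ratios]

# Hypothetical motif model favouring "AT", and a uniform background.
model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
background = {b: 0.25 for b in "ACGT"}
weights = window_weights("GGATGG", model, background)
print(weights)
```

The window starting at index 2 ("AT") receives most of the weight because it is far more probable under the motif model than under the background.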

Probabilistic Algorithms
Step 4 (maximization):
– Re-estimate new positions based on the weights calculated in step 3
Go to step 1
Re-iterate until stable motifs are found

Probabilistic Algorithms
Local optima
– EM: update the alignment vector by selecting the positions with the highest score. The output is deterministic, but it may be a local optimum rather than the global optimum.
– Gibbs sampling: select positions according to the probability distribution. The output is stochastic:
  – i.e. the result differs each time the algorithm runs
  – this makes it possible to detect stable motifs
  – statistical analysis describes the quality of the motif detected
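The difference between the two update rules comes down to how a position is chosen from the site weights. A sketch, assuming the weights are already normalised; the example weights and the seed are arbitrary.

```python
import random

def select_em(weights):
    """EM-style update as described above: always take the best-scoring site."""
    return max(range(len(weights)), key=weights.__getitem__)

def select_gibbs(weights, rng):
    """Gibbs-style update: draw a site in proportion to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

weights = [0.1, 0.6, 0.3]  # hypothetical site weights for one sequence
rng = random.Random(0)     # fixed seed only so the sketch is repeatable

em_choice = select_em(weights)
gibbs_choices = [select_gibbs(weights, rng) for _ in range(1000)]
print(em_choice, {i: gibbs_choices.count(i) for i in range(3)})
```

The deterministic rule returns the same site on every run, while the sampled rule occasionally picks a lower-weight site, which is what lets the sampler leave a local optimum.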

Probabilistic Algorithms
Influence of the background model, e.g. for a second-order model:
P(ATCGT | B_m) = P(AT) P(C | AT) P(G | TC) P(T | CG)
– Compensates for motifs that occur frequently because of the general background composition
– Makes the outcome of the algorithm more robust
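The factorisation above can be reproduced with a small higher-order Markov background. The sketch below estimates the model from a made-up training sequence with a pseudocount of 1 (both assumptions) and then evaluates ATCGT under a second-order model.

```python
import itertools
from collections import Counter, defaultdict

def train_markov(sequences, order, pseudocount=1.0):
    """Estimate P(prefix) and P(next base | previous `order` bases)."""
    prefix_counts = Counter()
    transition_counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order + 1):
            prefix_counts[seq[i:i + order]] += 1
        for i in range(order, len(seq)):
            transition_counts[seq[i - order:i]][seq[i]] += 1
    contexts = ["".join(p) for p in itertools.product("ACGT", repeat=order)]
    total = sum(prefix_counts.values()) + pseudocount * len(contexts)
    prefix = {c: (prefix_counts[c] + pseudocount) / total for c in contexts}
    transition = {}
    for c in contexts:
        row_total = sum(transition_counts[c].values()) + 4 * pseudocount
        transition[c] = {
            b: (transition_counts[c][b] + pseudocount) / row_total
            for b in "ACGT"
        }
    return prefix, transition

def background_probability(seq, prefix, transition, order):
    """P(seq | B_m): prefix probability times the chain of transitions."""
    probability = prefix[seq[:order]]
    for i in range(order, len(seq)):
        probability *= transition[seq[i - order:i]][seq[i]]
    return probability

# Hypothetical background sequence; real models use whole intergenic sets.
prefix, transition = train_markov(["ATCGTATCGATCGT"], order=2)
p = background_probability("ATCGT", prefix, transition, order=2)
print(p)
```

Each row of `transition` is one line of the transition matrix mentioned for the m-th order Markov process.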

Probabilistic Algorithms
[Figure, two panels: two organisms with a similar background model; two organisms with a different background model.]

Probabilistic Algorithms
Motif scores for probabilistic motif finding algorithms
– Information content (consensus score)
– Log likelihood
– Relative entropy (information content)
– Entropy
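The formulas behind these scores are not recoverable from the slide, so the sketch below uses the textbook definitions in bits (log base 2) as an assumption: per-column entropy, per-column relative entropy against a background, and a log-likelihood ratio summed over the aligned sites.

```python
import math

def entropy(column):
    """Shannon entropy of one motif column, in bits."""
    return -sum(p * math.log2(p) for p in column.values() if p > 0)

def relative_entropy(column, background):
    """Kullback-Leibler divergence of a column from the background, in bits."""
    return sum(
        p * math.log2(p / background[b]) for b, p in column.items() if p > 0
    )

def log_likelihood_ratio(sites, model, background):
    """Sum over sites of log2 P(site | motif) / P(site | background)."""
    total = 0.0
    for site in sites:
        for j, base in enumerate(site):
            total += math.log2(model[j][base] / background[base])
    return total

background = {b: 0.25 for b in "ACGT"}
uniform = {b: 0.25 for b in "ACGT"}                    # no conservation
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}   # perfectly conserved

print(entropy(uniform), entropy(conserved))
print(relative_entropy(uniform, background),
      relative_entropy(conserved, background))
```

Entropy and relative entropy depend only on the matrix, while the log-likelihood also grows with the number of sites included.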

Probabilistic Algorithms
Result: bacterial O2-responsive element FNR
[Figure annotations on the motif scores:]
– Takes the background model into account
– Takes only the degree of conservation into account
– Trade-off between the degree of conservation and the number of occurrences


Profile Search

Profile Search
[Figure: workflow diagram. Inputs: EXPERIMENTAL (high-throughput measurements, literature) and GENOMICS (genomic sequence data). Stages: 1. Microarray datamining (preprocessing, clustering); 2. Sequence datamining (motif detection); 3. Comparative genomics (genomewide screening, phylogenetic footprinting). Other labels: clusters of coexpressed genes, summarized information, target identification, novel targets, novel conditions.]