Finding regulatory modules

Presentation transcript:

Finding regulatory modules. December 17, 2003. CSE527 Computational Biology, project presentation. Raphael Hoffmann.

Agenda: Importance of module discovery; MCAST; My implementation; Testing on simulated data; Testing on real data; Discussion.

Importance of module discovery: A full understanding of gene function requires understanding the regulatory machinery. Transcription factor binding sites (TFBSs) are usually short and frequently appear by chance. True binding sites appear in clusters.
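A rough calculation shows why short sites appear frequently by chance. The numbers are assumptions for illustration and do not come from the slides: a fixed site of width $w = 6$, a uniform independent background (each base with probability 1/4), and a single strand of $N = 16{,}000$ bp.

```latex
\[
  E[\text{chance matches}]
  = (N - w + 1)\left(\tfrac{1}{4}\right)^{w}
  = \frac{15995}{4096}
  \approx 3.9
\]
```

Under these assumptions a single exact 6-mer is expected about four times in the sequence without any biological function, which is why clusters of sites are more informative than individual hits.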

MCAST (part of MetaMEME). [Diagram; labels: motifs, PSPMs, occurrences, DNA sequences, modules, HMM, Viterbi & scoring function, EM, MCAST]

MCAST – new ideas: Sophisticated scoring function; EM; HMM.

My Implementation. [Diagram; labels: MEME, motifs, PSPMs, DNA sequences, occurrences, modules, linear HMM, Viterbi, DNA sequences, modules]

My Implementation: Written in Java. Includes a random sequence generator (for simulation data according to an HMM) and linear HMMs. [Diagram of a linear HMM; state labels: Begin, Motif +1, Motif +1, Motif +2, End]

Testing on simulated data: On real datasets the results depend on a large number of factors, so it is advisable to test on simulated data first. I generated a sequence of 16,000 base pairs distributed according to a simple HMM and a Drosophila background model (described later).
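A minimal sketch of how such a simulated sequence could be generated. This is not the Java generator from the project: the function names, the background frequencies, the toy PSPM, and the entry probability `p_enter_motif` are all hypothetical placeholders, and the model is reduced to one background state plus one motif.

```python
import random

BASES = "ACGT"

# Placeholder values for illustration only; not the Drosophila model from the slides.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
PSPM = [  # one dict of base probabilities per motif column (hypothetical motif)
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]

def sample_base(probs, rng):
    """Draw one base according to a dict of base probabilities."""
    return rng.choices(BASES, weights=[probs[b] for b in BASES], k=1)[0]

def generate_sequence(length, background, pspm, p_enter_motif, rng):
    """Emit exactly `length` bases from a simple background-plus-motif model.

    At each step the model either starts a motif occurrence (one base per
    PSPM column) with probability p_enter_motif, or emits one background base.
    Returns the sequence and the start positions of the planted motifs.
    """
    seq, motif_starts = [], []
    while len(seq) < length:
        fits = len(seq) + len(pspm) <= length
        if fits and rng.random() < p_enter_motif:
            motif_starts.append(len(seq))
            seq.extend(sample_base(column, rng) for column in pspm)
        else:
            seq.append(sample_base(background, rng))
    return "".join(seq), motif_starts

if __name__ == "__main__":
    sequence, planted = generate_sequence(16000, BACKGROUND, PSPM, 0.01, random.Random(0))
    print(len(sequence), len(planted))
```

Returning the planted start positions alongside the sequence makes it possible to check later whether a detector recovers them.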

Testing on simulated data: [Plot: log-odds score of sliding window; labels: Background Model, HMM Model]
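One way a sliding-window log-odds score could be computed is sketched below. It assumes a simplified model (an independent background plus a single motif that is entered with probability `p_enter_motif`), which is not necessarily the HMM used for the plot; all function and parameter names are hypothetical. The recursion tracks the likelihood ratio of the model to the background directly, which avoids numerical underflow on long windows.

```python
import math

def window_log_odds(window, background, pspm, p_enter_motif):
    """Log of P(window | motif model) / P(window | background).

    ratio[i] is the likelihood ratio for the first i bases. A prefix ends
    either with one background base or with a complete motif occurrence.
    """
    w = len(pspm)
    ratio = [1.0] + [0.0] * len(window)
    for i in range(1, len(window) + 1):
        # Last base emitted by the background: the emission odds cancel to 1.
        ratio[i] = ratio[i - 1] * (1.0 - p_enter_motif)
        if i >= w:
            # Last w bases emitted by the motif: product of column/background odds.
            odds = 1.0
            for j, column in enumerate(pspm):
                base = window[i - w + j]
                odds *= column[base] / background[base]
            ratio[i] += ratio[i - w] * p_enter_motif * odds
    return math.log(ratio[-1])

def sliding_log_odds(seq, width, step, background, pspm, p_enter_motif):
    """Return (start, log-odds) for each window of `width` bases."""
    return [
        (start, window_log_odds(seq[start:start + width], background, pspm, p_enter_motif))
        for start in range(0, len(seq) - width + 1, step)
    ]
```

Windows with a positive score are better explained by the motif model than by the background alone; windows with a negative score are the reverse.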

Testing on real data: Available datasets that contain known modules: Drosophila and human genome. Motivation: Berman et al. used a sliding window of 700 bp and counted motif occurrences in Drosophila sequences (successful, though simple).
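The window-counting idea can be sketched as follows. This is an illustration of the general approach, not Berman et al.'s actual procedure: it assumes the site start positions have already been found by some motif scanner, and the function name and the `step` parameter are hypothetical.

```python
from bisect import bisect_left

def window_site_counts(site_starts, seq_length, width=700, step=100):
    """Count predicted binding sites per sliding window.

    site_starts: start positions of predicted sites (any order).
    Returns (window_start, count) pairs, counting sites whose start
    lies in [window_start, window_start + width).
    """
    starts = sorted(site_starts)
    counts = []
    for left in range(0, max(seq_length - width, 0) + 1, step):
        lo = bisect_left(starts, left)
        hi = bisect_left(starts, left + width)
        counts.append((left, hi - lo))
    return counts

if __name__ == "__main__":
    for left, n in window_site_counts([10, 50, 650, 720, 2000], 3000):
        if n >= 2:
            print(left, n)
```

Windows whose count exceeds a chosen threshold would be reported as candidate modules; the threshold itself has to be picked separately.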

Testing on real data: Drosophila: ~20 modules are known upstream of the even-skipped gene. Many contain transcription factor binding sites for Bcd, Cad, Hb, Kr, and Kni. This data has been used by many researchers.

Testing on real data: MEME could not correctly identify the transcription factor binding motifs. I used PSPMs that had been identified by a research team and aligned them to the known modules using MAST.

Testing on real data: [Figure: MAST alignment of known modules and their transcription factor binding motifs]

Testing on real data: Hardly any common pattern, but Hb sites often appear in pairs. Created a linear HMM.

Testing on real data

Discussion: Modules were not reliably detectable. Why? Possible reasons: only motifs on one strand were considered; complete-topology or star-topology HMM instead of linear HMM; geometric distribution of spacer lengths; models not accurate; log-odds score.
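The point about spacer lengths follows from how an HMM models the gap between two motifs. As a sketch, assume the spacer is a single state with self-transition probability $q$ (a symbol introduced here, not taken from the slides). The spacer length $L$ is then geometrically distributed:

```latex
\[
  P(L = k) = q^{\,k-1}(1 - q), \quad k = 1, 2, \dots
  \qquad
  E[L] = \frac{1}{1 - q}
\]
```

The probability decays monotonically with $k$, so under this assumption the most likely spacer is always the shortest one. The model cannot express a preferred spacing between sites unless the spacer is given more states.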

Discussion: Average log-odds scores were higher on the real data than on the simulated data, across the whole sequence. Why? The background model, which assumes independence, might be too simple.
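One common alternative to an independence background is a first-order Markov background, in which each base depends on the one before it. The sketch below is an assumption about what a richer background could look like, not something taken from the project; the function names and the pseudocount are hypothetical, and sequences are assumed to be non-empty and to contain only A, C, G, T.

```python
import math
from collections import Counter

BASES = "ACGT"

def fit_markov1(seq, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    trans = {}
    for a in BASES:
        total = sum(pair_counts[(a, b)] for b in BASES) + 4 * pseudocount
        trans[a] = {b: (pair_counts[(a, b)] + pseudocount) / total for b in BASES}
    return trans

def markov1_log_likelihood(seq, trans, initial):
    """Log-likelihood of seq under the first-order background."""
    ll = math.log(initial[seq[0]])
    for a, b in zip(seq, seq[1:]):
        ll += math.log(trans[a][b])
    return ll

if __name__ == "__main__":
    uniform = {b: 0.25 for b in BASES}
    model = fit_markov1("ATATATATATGCGC")
    print(markov1_log_likelihood("ATAT", model, uniform))
```

Replacing the independent background in the denominator of the log-odds score with such a model would lower the scores of windows that merely reflect local dinucleotide composition.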

Future Work: Use EM to train the parameters of the HMM (transition probabilities).