Transcription factor binding motifs (part II) 10/22/07.

Information from negative control
Motivation: combine information from TF-binding and non-binding sequences to identify discriminative information.
Methods:
– REDUCE (Bussemaker et al. 2001)
– Motif Regressor (Conlon et al. 2003)

Motif Regressor Algorithm
– Rank all genes by expression and obtain their upstream sequences.
– Use MDscan to find motifs from the most induced and most repressed genes.
– Score each upstream sequence for matches to each MDscan-reported motif.
– Perform simple linear regression between motif-matching score and gene expression to remove insignificant motifs.
– Perform stepwise regression on the significant motifs to find groups acting together to affect expression.

Motif matching score
Extract the upstream sequence X_mg (e.g. 800 bp) from each gene.
Define a score, summed over sliding windows of the upstream sequence, that measures the overall enrichment of a motif.
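A minimal Python sketch of a sliding-window motif-match score, under stated assumptions: the motif is given as a position weight matrix `pwm` (one dict of base probabilities per position), the background is a simple zero-order base-frequency model, and each window contributes its motif-versus-background likelihood ratio, with log2 of the sum taken as the score. This form is assumed to be in the spirit of Conlon et al. 2003; the published method may use a richer background model, and all names below are illustrative rather than taken from the slides.

```python
import math

def motif_match_score(seq, pwm, background):
    """log2 of the likelihood ratio summed over all sliding windows of seq."""
    w = len(pwm)
    if len(seq) < w:
        raise ValueError("sequence shorter than motif width")
    total = 0.0
    for start in range(len(seq) - w + 1):
        ratio = 1.0
        for pos, base in enumerate(seq[start:start + w]):
            # motif probability vs. background probability for this base
            ratio *= pwm[pos][base] / background[base]
        total += ratio
    return math.log2(total)

# Toy inputs (hypothetical): a 2-bp motif preferring "AC", uniform background.
uniform = {b: 0.25 for b in "ACGT"}
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
enriched = motif_match_score("ACAC", toy_pwm, uniform)  # contains the motif
depleted = motif_match_score("GTGT", toy_pwm, uniform)  # does not
```

A sequence containing good matches scores above zero and one without matches scores below zero, which is what makes the score usable as a regression covariate.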

Motif Regressor Approach
– Look at one expression experiment.
– Look for candidate motifs.
– Refine motifs.
– Regress between upstream motif-match score and downstream expression.
[Figure labels: Expression log ratio; Genes; MDscan]

Motif Regressor Linear Regression

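Multiple regression model: expression explained as the sum of motifs’ effects
Y_g = \alpha + \sum_m \beta_m S_{mg} + \epsilon_g
– Y_g: expression of gene g
– \alpha: baseline expression
– \beta_m: regression coefficient of motif m
– S_{mg}: upstream motif-match score of motif m for gene g
– \epsilon_g: error term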

Further motif selection by stepwise regression
Stepwise regression is used to further select significant motifs.
– Step 1: Include only the intercept.
– Step 2: Sequentially add new motifs that give the largest reduction in error.
– Step 3: Sequentially remove motifs that give the smallest increase in error.
– Repeat Steps 2 and 3 until convergence.
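The three steps above can be sketched in plain Python. This is only an illustration under assumptions the slides do not state: "error" is taken to be the residual sum of squares (RSS) of an ordinary least-squares fit, and the entry and removal thresholds `enter` and `leave` (fractions of the intercept-only RSS) are hypothetical stopping rules. The helper names are invented for this example, and collinear score columns are not handled.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rss(scores, y, motifs):
    """RSS of the least-squares fit y ~ intercept + scores of `motifs`."""
    X = [[1.0] + [scores[m][g] for m in motifs] for g in range(len(y))]
    k = len(motifs) + 1
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yg for row, yg in zip(X, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yg - sum(bc * xc for bc, xc in zip(beta, row))) ** 2
               for row, yg in zip(X, y))

def stepwise(scores, y, enter=0.05, leave=0.01):
    """Alternate forward addition and backward removal until nothing changes."""
    selected = []                       # Step 1: intercept only
    total = rss(scores, y, selected)    # intercept-only error, used as the scale
    current = total
    changed = True
    while changed:
        changed = False
        # Step 2: add the motif giving the largest reduction in error
        candidates = [m for m in scores if m not in selected]
        if candidates:
            best = min(candidates, key=lambda m: rss(scores, y, selected + [m]))
            new = rss(scores, y, selected + [best])
            if current - new > enter * total:
                selected.append(best)
                current, changed = new, True
        # Step 3: remove the motif whose removal increases error the least
        if selected:
            def without(m):
                return [s for s in selected if s != m]
            worst = min(selected, key=lambda m: rss(scores, y, without(m)))
            new = rss(scores, y, without(worst))
            if new - current < leave * total:
                selected.remove(worst)
                current, changed = new, True
    return selected

# Toy data (hypothetical): expression depends on m1 and m2 but not m3.
scores = {
    "m1": [0, 1, 2, 3, 4, 5],
    "m2": [1, 0, 1, 0, 1, 0],
    "m3": [2, 1, 0, 3, 1, 2],
}
y = [1 + 2 * a - 3 * b for a, b in zip(scores["m1"], scores["m2"])]
selected = stepwise(scores, y)
```

Keeping `leave` smaller than `enter` means a motif that was just added cannot be removed in the same pass, so the loop terminates.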

Application
– Yeast cells were grown under amino acid starvation.
– Gene expression (~6000 genes) was measured 30 minutes after amino acid starvation.
– Motif Regressor was applied to identify sequence motifs.

Comparative genomics
– Evolutionary tree
– Darwin’s principle from evolution
– Cross-species sequence alignment
– Conservation of genes
– Conservation of regulatory sequence
– Quantifying sequence conservation
Methods
– MCS score (Kellis)
– PhyloCon
Results
– Yeast (Kellis)
Advantage: no requirement for prior functional information
Drawback: species-specific motifs may not be learned (Fraenkel)

Non-uniform conservation rates
– Genes are typically conserved.
– Intergenic regions are typically not conserved.
– Why?

Motif finding by using multiple genomes
Basic assumption: functional sequences evolve more slowly than non-functional sequences, as they are subject to selection pressure.
Basic approach:
– Identify conserved regions with sequence alignment algorithms.
– Restrict motif finding to the conserved regions.
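The second step of the basic approach can be illustrated with a small Python sketch. It assumes the orthologous sequences are already aligned to equal length with `-` for gaps, and it uses perfect identity across all species as a deliberately crude stand-in for a real conservation measure; the function name and the `min_len` parameter are hypothetical.

```python
def conserved_regions(aligned, min_len=6):
    """Return half-open (start, end) column ranges identical in all sequences."""
    length = len(aligned[0])
    regions, start = [], None
    for i in range(length + 1):
        same = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if same and start is None:
            start = i                      # a conserved run begins
        elif not same and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                regions.append((start, i))
            start = None
    return regions

# Toy alignment of three species (hypothetical sequences).
alignment = [
    "AAACGGTTACCGTT",
    "CTACGGTTACCGAA",
    "GGACGGTTACCGCA",
]
regions = conserved_regions(alignment)
search_space = [alignment[0][s:e] for s, e in regions]
```

A motif finder would then be run only on `search_space` rather than on the full upstream sequences.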

Motif: Gal4 – CGGNNNNNNNNNNNCCG
The Gal4 motif is highly conserved.
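As a small illustration, the consensus above (CGG, eleven unconstrained bases, CCG) can be scanned for with a regular expression. This is a sketch only: it matches the exact consensus with no mismatches allowed, and the function name and test sequence are made up for the example.

```python
import re

# Lookahead so that overlapping matches are also reported.
GAL4 = re.compile(r"(?=(CGG[ACGT]{11}CCG))")

def find_gal4_sites(seq):
    """Return (start, site) for every exact match to CGG-N11-CCG."""
    return [(m.start(), m.group(1)) for m in GAL4.finditer(seq.upper())]

# Hypothetical sequence with one site starting at position 2.
example = "TT" + "CGG" + "AGGACTGTCCT" + "CCG" + "AA"
sites = find_gal4_sites(example)
```

Because the reverse complement of CGG-N11-CCG has the same form, scanning one strand is enough for this particular consensus.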

Methods
– Wasserman et al
– MCS (Kellis et al. 2003; Xie et al. 2005)
– PhyloCon (Wang and Stormo 2003)
– EMnEM (Moses et al. 2004)
– OrthoMEME (Prakash et al. 2004)
– PhyME (Sinha et al. 2004)
– CompareProspector (Liu et al. 2004)
– PhyloGibbs (Siddharthan et al. 2005)
– Ortholog Sampler (Li and Wong 2005)
– MultiModule (Zhou and Wong 2005)

MCS
Basic idea: select those highly conserved motifs, p_obs >> p_0 (Xie et al. 2005).
[Figure: frequency vs. conservation rate, with p_obs and p_0 marked]

MCS
Definition of MCS: computed from the total number of occurrences, the expected frequency p_0, and the observed frequency p_obs.
p_0 is estimated by random sampling.
Choose cutoff at MCS = 6 (Xie et al. 2005).
[Figure: frequency vs. conservation rate, with p_obs and p_0 marked]
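The formula itself is not preserved in this transcript. One natural way to combine the three quantities named on the slide is a binomial z-score, i.e. the number of standard deviations by which the observed conservation rate exceeds the expected one. The Python sketch below assumes that form; it is offered as an illustration and not as the exact definition from Xie et al. 2005, and the function name and numbers are hypothetical.

```python
import math

def motif_conservation_score(n_total, n_conserved, p0):
    """Binomial z-score of the observed conservation rate against p0 (assumed form)."""
    p_obs = n_conserved / n_total
    return (p_obs - p0) / math.sqrt(p0 * (1.0 - p0) / n_total)

# Hypothetical motif: 400 occurrences, 100 of them conserved, expected rate 10%.
mcs = motif_conservation_score(n_total=400, n_conserved=100, p0=0.10)
passes_cutoff = mcs >= 6   # cutoff from the slide
```

Under this form, a motif with few occurrences needs a much larger gap between p_obs and p_0 to reach the cutoff than a motif with many occurrences.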

Application to human regulatory motifs

Results

Tissue specificity of detected motifs

PhyloCon
Basic idea (Wang and Stormo 2003):
– Both sequence conservation and gene co-regulation information are used for motif finding.
– Orthologous regions are viewed as sequence profiles.
– Sequence profiles are aligned instead of sequences.
[Figure: sequences from species 1, 2, 3 and 4 combined into a profile]

PhyloCon

Profile Comparison
Compare two columns first.
– f_b = {f_A, f_C, f_G, f_T}: a column of the profile
– p_b = {p_A, p_C, p_G, p_T}: background base frequencies
– n_b = {n_A, n_C, n_G, n_T}: observed counts at the specified position
Likelihood ratio: LR = \prod_b (f_b / p_b)^{n_b}
Log-likelihood ratio: LLR = \sum_b n_b \ln(f_b / p_b)

Profile Comparison
Compare two columns first.
– The ALLR (average log-likelihood ratio) measures the similarity between two columns; it is computed from the column frequencies, the total counts, and the background.
– Sum the ALLR over all positions to get a score comparing two profiles.
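A short Python sketch of the column comparison, assuming the ALLR form described by Wang and Stormo 2003: the counts of each column are scored against the frequencies of the other column relative to background, and the two log-likelihood ratios are averaged over the total counts. The pseudocount used to avoid log(0) and all function names are assumptions added for this example.

```python
import math

BASES = "ACGT"

def column_freqs(counts, pseudocount=0.5):
    """Base frequencies of one profile column (pseudocount is an assumption)."""
    total = sum(counts[b] for b in BASES) + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in BASES}

def allr(counts_i, counts_j, background):
    """Average log-likelihood ratio between two profile columns."""
    f_i, f_j = column_freqs(counts_i), column_freqs(counts_j)
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    return numerator / sum(counts_i[b] + counts_j[b] for b in BASES)

def profile_score(profile_a, profile_b, background):
    """Sum of ALLR over aligned columns of two equal-length profiles."""
    return sum(allr(a, b, background) for a, b in zip(profile_a, profile_b))

uniform = {b: 0.25 for b in BASES}
all_a = {"A": 10, "C": 0, "G": 0, "T": 0}
all_c = {"A": 0, "C": 10, "G": 0, "T": 0}
similar = allr(all_a, all_a, uniform)      # two columns that agree
dissimilar = allr(all_a, all_c, uniform)   # two columns that disagree
```

Columns that agree on a non-background base get a positive score, and columns that disagree get a negative one, so summing over positions rewards profiles that are conserved in the same way.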

Profile merging
Iteratively merge non-orthologous groups that have high ALLR scores.

Sampling motifs on phylogenetic trees
Motivation: the alignment-based method does not work well if the species are distant.
Basic idea:
– Avoid aligning multiple species to gather orthologous gene information.
– Directly model the evolution of the genomic sequences.
– Assume that motifs evolve more slowly than background sequences.

An evolution model

Evolution model Probability of a nucleotide change

Main Algorithm
Step 1: Build an evolution model.
– Motif evolution is modeled by decreasing branch length by a fixed rate, say 50%.
Step 2: Infer model parameters by using a Gibbs sampler.
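The effect of shortening branch lengths for motif sites can be illustrated numerically. The evolution model on the slides is not preserved in this transcript, so the Python sketch below substitutes the Jukes-Cantor model as a stand-in: a site differs after a branch of length t (expected substitutions per site) with probability 3/4 (1 - exp(-4t/3)). The 50% scaling comes from the slide; the branch length and names are hypothetical.

```python
import math

def jc_change_prob(branch_length):
    """Jukes-Cantor probability that a site differs after the given branch length."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

MOTIF_SCALE = 0.5        # "say 50%" from the slide
background_t = 0.4       # hypothetical background branch length

p_background = jc_change_prob(background_t)
p_motif = jc_change_prob(MOTIF_SCALE * background_t)  # shortened branch for motif sites
```

With the same tree topology, motif sites are therefore less likely to differ between species than background sites, which is the signal the sampler exploits.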

Limitation of the comparative genomics approach
Species-specific motifs cannot be learned from this approach.

Divergence of TF binding Borneman et al. 2007

Divergence of TF binding
Divergent binding can be caused by:
– divergence of TF motifs (e.g. Ste12), or
– some unknown mechanism (e.g. Tec1)
Borneman et al. 2007

Other directions
– Combining multiple motif finding algorithms (e.g. Harbison et al. 2004, Jensen and Liu 2005).
– Directly identify TF binding sites through experiments (ChIP-chip), then apply motif finding algorithms to the experimental binding data (e.g. MDscan).

Challenge of Specificity
– A 7-mer is expected to occur once every 4^7 = 16,384 base pairs by chance.
– In human, this means 3 × 10^9 / 16,384 ≈ 180,000 sites in total.
– Total number of genes: ~25,000.
– Most predicted binding sites are false positives!
– Other restrictive information is needed to reduce false positives.
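The arithmetic on this slide can be checked in a few lines of Python. The sketch assumes equal base frequencies and counts one strand only, as the slide's estimate does; the function name is made up for the example.

```python
def expected_chance_sites(k, genome_size):
    """Expected number of chance matches to one fixed k-mer, uniform base composition."""
    return genome_size / 4 ** k

spacing = 4 ** 7                                  # one match every 16,384 bp
human_sites = expected_chance_sites(7, 3e9)       # about 1.8 x 10^5
sites_per_gene = human_sites / 25_000             # roughly 7 chance sites per gene
```

With several chance matches per gene for a single 7-mer, sequence matching alone cannot separate functional sites from background.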

Some Biological Notes
TF binding does not mean it is functional.
– Some TFs always bind to DNA, but they are functional only if they are phosphorylated.
Motif sites contain a large number of false positives.
– Motifs are short DNA elements (~10 bp). Higher eukaryotes have large genomes, and these short elements may occur frequently by chance.
Epigenetic factors also play an important role in the regulation of TF binding.
– Chromatin structure, histone modifications, DNA methylation, etc.

Reading list
Conlon et al
– Proposed Motif Regressor. Filters out motifs that are unassociated with gene expression changes.
Xie et al
– MCS. Uses a comparative approach to identify human regulatory motifs. Highly biological.
Wang and Stormo 2003
– PhyloCon. An elegant “multi-gene, multi-species” approach for motif finding.

Acknowledgements X.S.Liu