MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers
Pengyu Hong
10/06/2005

Motivation
Understand transcriptional regulation. [Figure labels: TF, Gene X, mRNA transcript.]
Model transcriptional regulatory networks. [Figure labels: binding sites, regulators, genes.]

Motivation
Previous work on motif finding:
AlignACE (Hughes et al 2000)
ANN-Spec (Workman et al 2000)
BioProspector (Liu et al 2001)
Consensus (Hertz et al 1999)
Gibbs Motif Sampler (Lawrence et al 1993)
LogicMotif (Keles et al 2004)
MDScan (Liu et al 2002)
MEME (Bailey and Elkan 1995)
Motif Regressor (Conlon et al 2003)
…

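Motivation
A widely used model: the motif weight matrix (Stormo et al 1982).
[Figure: a weight matrix with rows A, C, G, T applied to the example site AACATCCG. The score of the site is the sum of the matrix entries selected by its bases, and it is compared against a threshold.]
A sequence is a target if it contains a binding site (score > threshold).
Computational << Molecular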
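
The weight-matrix rule on this slide can be sketched in a few lines of Python. This is only an illustration: the matrix values (+1 for the consensus base, -1 otherwise), the threshold, and the helper names `make_matrix`, `site_score`, `best_site` and `is_target` are all assumptions, not taken from the slides, and only the forward strand is scanned.

```python
BASES = "ACGT"

def make_matrix(consensus, match=1.0, mismatch=-1.0):
    """Build a toy weight matrix: one {base: weight} column per position."""
    return [{b: (match if b == c else mismatch) for b in BASES}
            for c in consensus]

def site_score(matrix, site):
    """Sum the matrix entries selected by the bases of one site."""
    return sum(column[base] for column, base in zip(matrix, site))

def best_site(matrix, sequence):
    """Slide the matrix along the sequence; return (best score, offset).

    Assumes the sequence is at least as long as the matrix is wide.
    """
    width = len(matrix)
    scored = [(site_score(matrix, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)

def is_target(matrix, sequence, threshold):
    """Call a sequence a target if some site scores above the threshold."""
    score, _offset = best_site(matrix, sequence)
    return score > threshold

# Toy matrix built around the example site shown on the slide.
matrix = make_matrix("AACATCCG")
print(site_score(matrix, "AACATCCG"))
print(best_site(matrix, "TTAACATCCGTT"))
print(is_target(matrix, "TTAACATGCGTT", threshold=5.0))
```

Because the score is a plain sum over positions, each position contributes independently of the others, which is the limitation the next slide turns to.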

Motivation
Non-linear binding effects, e.g., different binding modes.
Consensus: CA C/T CC A/G TACAT
[Figure: four binding modes (Mode 1 to Mode 4) of the consensus, CACCCATACAT, CATCCGTACAT, CACCCGTACAT and CATCCATACAT, divided into preferred binding and non-preferred binding.]
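
A short check of why an additive weight matrix cannot express such modes. The four sequences differ only at positions 3 and 6, so for any position-independent matrix the first two scores always sum to the same value as the last two. The sketch assumes that the preferred/non-preferred split puts the first two sequences on one side and the last two on the other, which the slide order suggests but the transcript does not confirm. The random matrices and helper names are illustrative.

```python
import random

BASES = "ACGT"
# The four sequences from the slide, in slide order.
MODES = ["CACCCATACAT", "CATCCGTACAT", "CACCCGTACAT", "CATCCATACAT"]

def random_matrix(width, rng):
    """Draw an arbitrary weight matrix: one {base: weight} column per position."""
    return [{b: rng.uniform(-2.0, 2.0) for b in BASES} for _ in range(width)]

def site_score(matrix, site):
    """Additive score: sum of the entries selected by the site's bases."""
    return sum(column[base] for column, base in zip(matrix, site))

rng = random.Random(0)
max_gap = 0.0      # largest |(s1 + s2) - (s3 + s4)| seen, up to rounding
separations = 0    # matrices ranking both of the first pair above the second
for _ in range(1000):
    matrix = random_matrix(len(MODES[0]), rng)
    s1, s2, s3, s4 = (site_score(matrix, s) for s in MODES)
    max_gap = max(max_gap, abs((s1 + s2) - (s3 + s4)))
    if min(s1, s2) > max(s3, s4):
        separations += 1

print(max_gap)
print(separations)
```

Since the two pair sums are equal, no threshold on an additive score can accept both sequences of one pair while rejecting both of the other.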

Modeling
Model a TF-DNA binding classifier as an ensemble model: a weighted combination of base classifiers.
[Equation not recovered; its annotations were: base classifier, weight, ensemble model.]
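
The equation on this slide did not survive extraction. The usual form of a boosted ensemble, given here as an assumption rather than as the slide's exact formula, is a weighted sum of base classifiers. The symbols F, \alpha_m (weight of the mth base classifier) and M (number of base classifiers) are chosen for illustration; q_m(S_i) is the base-classifier output named on the following slides.

```latex
F(S_i) = \sum_{m=1}^{M} \alpha_m \, q_m(S_i),
\qquad
\hat{y}_i = \operatorname{sign} F(S_i)
```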

Modeling
The mth base classifier: q_m(S_i).
Sequence scoring function h_m(S_i): [equation not recovered]
f_m(s_ik) is a site scoring function (weight matrix + threshold).
The scoring function considers (a) the number of matching sites and (b) the degree of matching.
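
The exact formulas for h_m and q_m are missing from the transcript. The sketch below uses stand-ins that merely have the two stated properties: a log-sum-exp over site scores grows both with the number of matching sites and with how well each one matches, and tanh squashes the result into a confidence. Both choices, and the function names, are assumptions and should not be read as the author's definitions.

```python
import math

def sequence_score(site_scores):
    """Stand-in for h_m(S_i): log(sum_k exp(f_m(s_ik))), computed stably.

    site_scores holds f_m(s_ik) for every site k of one sequence, i.e. the
    weight-matrix score of the site minus the threshold.
    """
    top = max(site_scores)
    return top + math.log(sum(math.exp(f - top) for f in site_scores))

def base_classifier(site_scores):
    """Stand-in for q_m(S_i): squash the sequence score with tanh."""
    return math.tanh(sequence_score(site_scores))

one_site = [3.0, -5.0, -5.0]      # one matching site
two_sites = [3.0, 3.0, -5.0]      # more matching sites
stronger_site = [4.0, -5.0, -5.0] # a better-matching site

print(sequence_score(one_site))
print(sequence_score(two_sites))
print(sequence_score(stronger_site))
print(base_classifier([-4.0, -5.0, -6.0]))
```

With these stand-ins a second matching site or a stronger match both raise the score, and a sequence with no site above the threshold gets a negative confidence.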

Training – Boosting
Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models:
(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.

Why Boosting?
Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error.
[Figure labels: training error, margin of training samples, generalization error (Schapire et al. 1998).]
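
The bound referred to here is not shown in the transcript. In the standard analysis of confidence-rated boosting it takes the form below, assuming initial sample weights d_i^1 that sum to one, an ensemble F(S_i) = \sum_m \alpha_m q_m(S_i), and the multiplicative weight update normalized by Z_m. The symbols F, \alpha_m, M and Z_m are assumptions introduced for this sketch.

```latex
\sum_i d_i^{1}\,\mathbf{1}\!\left[\operatorname{sign} F(S_i) \neq y_i\right]
\;\le\; \sum_i d_i^{1}\, e^{-y_i F(S_i)}
\;=\; \prod_{m=1}^{M} Z_m,
\qquad
Z_m = \sum_i d_i^{m}\, e^{-\alpha_m y_i q_m(S_i)}
```

Each round therefore shrinks the bound by choosing the base classifier and its weight to make Z_m as small as possible.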

Challenges
Training data: positive sequences (targets of a TF) and negative sequences.
1. Sequences are labeled, but not the sites in the sequences.
2. The two classes cannot be well separated by the weight matrix model (linear).
3. Number of negative sequences >> number of positive sequences.

Boosting
Initialization.
Total weight of the positive samples == total weight of the negative samples.
Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W_0.
[Figure: positive and negative samples.]
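
The class-balanced initialization can be sketched as follows. The slide only requires equal total weight per class. Normalizing the weights so that they sum to one, the +1/-1 label coding, and the function name `initial_weights` are assumptions made for the example.

```python
def initial_weights(labels):
    """Give each class a total weight of 0.5, spread evenly within the class.

    labels: +1 for positive sequences, -1 for negative sequences.
    """
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = sum(1 for y in labels if y == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    return [0.5 / n_pos if y == +1 else 0.5 / n_neg for y in labels]

# One positive and four negative sequences: the single positive carries
# as much weight as all the negatives together.
labels = [+1, -1, -1, -1, -1]
weights = initial_weights(labels)
print(weights)
```

This keeps the much larger negative set from dominating the first round of training.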

Boosting
Train a base classifier (BC).
Use the seed matrix W_0 + [symbol lost in extraction] to initialize the mth base classifier q_m(·), and set the weight of the mth base classifier to 1.
Refine the weight of the mth base classifier and the parameters of q_m(·) to minimize [objective not recovered], where y_i is the label of S_i and d_i^m is the weight of S_i in the mth round.
Negative information is explicitly used to train q_m(·) and its weight.
[Figure: positive and negative samples with BC 1.]
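
The objective itself is missing from the transcript. In standard confidence-rated boosting the quantity minimized in round m is the weighted exponential loss, the sum over i of d_i^m * exp(-alpha * y_i * q_m(S_i)); the sketch below assumes that form. It tunes only the base-classifier weight (called `alpha` here, since the slide's symbol is lost) for fixed classifier outputs, whereas the slide also refines the parameters of q_m. The ternary search and the function names are illustrative choices.

```python
import math

def crb_loss(alpha, weights, labels, confidences):
    """Weighted exponential loss: sum_i d_i * exp(-alpha * y_i * q_i)."""
    return sum(d * math.exp(-alpha * y * q)
               for d, y, q in zip(weights, labels, confidences))

def best_alpha(weights, labels, confidences, lo=0.0, hi=10.0, iters=100):
    """Minimize the loss over alpha by ternary search (it is convex in alpha)."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if crb_loss(a, weights, labels, confidences) < crb_loss(b, weights, labels, confidences):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0

# Toy round: four equally weighted samples, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
confidences = [+1.0, +1.0, -1.0, +1.0]   # hypothetical outputs q_m(S_i)
alpha = best_alpha(weights, labels, confidences)
print(alpha)
```

For outputs restricted to +1 and -1, this minimizer reduces to the familiar AdaBoost weight 0.5 * ln((1 - error) / error).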

Boosting
Adjust the sample weights, giving higher weights to previously misclassified samples.
[Weight-update equation not recovered.]
y_i is the label of S_i.
d_i^m is the weight of S_i in the mth round.
d_i^(m+1) is the new weight of S_i.
[Figure: positive and negative samples with BC 1.]
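
The update rule is missing from the transcript. The standard confidence-rated boosting update multiplies each weight by exp(-alpha * y_i * q_m(S_i)) and renormalizes, and the sketch below assumes that form. The name `alpha` for the base-classifier weight and the toy numbers are assumptions.

```python
import math

def update_weights(weights, labels, confidences, alpha):
    """Return d_i^(m+1): reweight by exp(-alpha * y_i * q_i), then normalize."""
    raw = [d * math.exp(-alpha * y * q)
           for d, y, q in zip(weights, labels, confidences)]
    total = sum(raw)
    return [r / total for r in raw]

# Four equally weighted samples; the last one is misclassified by BC 1.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
confidences = [+1.0, +1.0, -1.0, +1.0]   # hypothetical outputs q_1(S_i)
alpha = 0.5 * math.log(3.0)              # loss-minimizing weight for this toy round

new_weights = update_weights(weights, labels, confidences, alpha)
print(new_weights)
```

After the update the misclassified sample holds half of the total weight, so the next base classifier is pushed to get it right.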

Boosting
Add a new base classifier.
[Figure: positive and negative samples with BC 1 and BC 2.]

Boosting
Add a new base classifier.
[Figure: positive and negative samples with the resulting decision boundary.]

Boosting
Adjust the sample weights again.
[Figure: positive and negative samples with the decision boundary.]

Boosting
Add one more base classifier.
[Figure: positive and negative samples with BC 3.]

Boosting
Add one more base classifier.
[Figure: positive and negative samples with the updated decision boundary.]

Boosting
Stop if the result is perfect or the performance on the internal validation sequences drops.
[Figure: positive and negative samples with the final decision boundary.]
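
The stopping rule can be sketched as a loop around a hypothetical `step` callable that adds one base classifier and reports the training error and a validation score. The callable, the score values, and the choice to discard the round in which validation performance drops are all assumptions. The slide only says when to stop.

```python
def boost_until_stop(step, max_rounds=50):
    """Add base classifiers until the fit is perfect or validation drops.

    step(m) is assumed to train base classifier m, add it to the ensemble,
    and return (training_error, validation_score) for the ensemble so far.
    Returns the number of base classifiers kept.
    """
    best_validation = float("-inf")
    kept = 0
    for m in range(1, max_rounds + 1):
        training_error, validation = step(m)
        if validation < best_validation:
            break                      # validation performance dropped
        best_validation, kept = validation, m
        if training_error == 0.0:
            break                      # perfect result on the training set
    return kept

# Simulated training history: validation peaks at round 3, then drops.
history = {1: (0.20, 0.70), 2: (0.12, 0.78), 3: (0.05, 0.81), 4: (0.02, 0.79)}
rounds = boost_until_stop(lambda m: history[m])
print(rounds)
```

This also answers step (a) of the training slide: the number of base classifiers is whatever the loop has kept when it stops.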

Results
Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al )
Positive sequences: p-value < [value missing]; number of positive sequences [comparison symbol missing] 25.
Negative sequences: p-value [comparison symbol missing] 0.05 and ratio [comparison symbol missing] 1.
Got 40 TFs.

Results
Boosted models vs. seed weight matrices: leave-one-out test results.
[Figure: horizontal axis, TFs; vertical axis, improvement in specificity.]
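
For reference, specificity is the fraction of negative sequences that a model correctly rejects. The sketch below shows how an improvement of the kind plotted could be computed for one TF. The predictions are made up, and taking a plain difference of specificities is an assumption, since the transcript does not say how the improvement was measured.

```python
def specificity(labels, predictions):
    """True negatives divided by all negatives (labels coded +1 / -1)."""
    tn = sum(1 for y, p in zip(labels, predictions) if y == -1 and p == -1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == -1 and p == +1)
    return tn / (tn + fp)

# Hypothetical leave-one-out predictions for one TF.
labels  = [+1, +1, -1, -1, -1, -1]
seed    = [+1, +1, +1, +1, -1, -1]   # seed weight matrix
boosted = [+1, +1, +1, -1, -1, -1]   # boosted model

improvement = specificity(labels, boosted) - specificity(labels, seed)
print(improvement)
```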

Results: RAP1
[Figure: the weight matrix compared with the boosted model, which consists of base classifiers 1, 2 and 3.]
The boosted model captures position correlation.

Results: REB1
[Figure: the weight matrix compared with the boosted model, which consists of base classifiers 1 and 2.]
The boosted model captures position correlation.