1
MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers Pengyu Hong 10/06/2005
2
Motivation
Understand transcriptional regulation. (Diagram: a TF binds Gene X, which is transcribed into an mRNA transcript.)
Model transcriptional regulatory networks. (Diagram: a network linking regulators to genes through binding sites.)
3
Motivation
Previous work on motif finding: AlignACE (Hughes et al. 2000), ANN-Spec (Workman et al. 2000), BioProspector (Liu et al. 2001), Consensus (Hertz et al. 1999), Gibbs Motif Sampler (Lawrence et al. 1993), LogicMotif (Keles et al. 2004), MDScan (Liu et al. 2002), MEME (Bailey and Elkan 1995), Motif Regressor (Conlon et al. 2003), ...
4
Motivation
A widely used model: the motif weight matrix (Stormo et al. 1982)

Position    1      2      3      4      5      6      7      8
A         0.19   1.11  -0.17   1.65  -2.65  -2.66  -1.98   0.92
C        -0.14  -0.49   1.89  -1.81   1.70   2.32   2.14  -2.07
G        -1.39   0.25  -1.22  -1.07  -2.07  -2.07  -2.07   1.13
T         0.86  -1.39  -2.65  -2.65   0.41  -2.65  -1.16  -1.80

Score of the site AACATCCG = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84
A sequence is a target if it contains a binding site with score > threshold.
Computational << Molecular
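The weight-matrix scoring rule can be sketched in Python. The matrix values are taken from the slide; the threshold in the usage line and all function names are illustrative.

```python
# Sketch of weight-matrix scoring as on this slide; matrix values come
# from the slide, the threshold used below is illustrative.

PWM = {  # rows: base; columns: positions 1..8
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}

def site_score(site):
    """Sum the matrix entry for the base observed at each position."""
    return sum(PWM[base][pos] for pos, base in enumerate(site))

def is_target(sequence, threshold):
    """A sequence is a target if some window scores above the threshold."""
    w = len(PWM["A"])
    return any(site_score(sequence[i:i + w]) > threshold
               for i in range(len(sequence) - w + 1))

print(round(site_score("AACATCCG"), 2))  # 10.84, matching the slide
```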
5
Motivation
Non-linear binding effects, e.g., four different binding modes of the pattern CA[C/T]CC[A/G]TACAT:
Preferred binding: CACCCATACAT, CATCCGTACAT
Non-preferred binding: CACCCGTACAT, CATCCATACAT
A weight matrix scores positions independently, so it cannot distinguish the preferred combinations from the non-preferred ones.
6
Modeling
Model a TF-DNA binding classifier as an ensemble model: a weighted sum of base classifiers, F(S) = Σ_m α_m h_m(S), where h_m is the mth base classifier and α_m its weight.
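A minimal sketch of the ensemble vote. The two stub base classifiers and their weights are illustrative stand-ins, not trained models.

```python
# Sketch of the ensemble classifier F(S) = sum_m alpha_m * h_m(S);
# stubs below are illustrative, not trained base classifiers.

def ensemble_predict(sequence, base_classifiers, alphas):
    """Return +1/-1 by the sign of the weighted vote."""
    total = sum(a * h(sequence) for h, a in zip(base_classifiers, alphas))
    return 1 if total > 0 else -1

h1 = lambda s: 1 if "CACCC" in s else -1         # stub base classifier
h2 = lambda s: 1 if s.endswith("TACAT") else -1  # another stub
print(ensemble_predict("CACCCATACAT", [h1, h2], [0.7, 0.3]))  # prints 1
```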
7
Modeling
The mth base classifier is built from a sequence scoring function q_m(S_i), where f_m(s_ik) is a site scoring function (a weight matrix plus a threshold).
The scoring function considers (a) the number of matching sites and (b) the degree of matching.
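The slide does not give the exact way (a) and (b) are combined, so the sketch below assumes the sequence score counts matching windows and sums how far each exceeds the threshold; the site scorer here is a toy consensus-matching stub.

```python
# Assumed form of the sequence scoring function: count the windows that
# clear the threshold (a) and sum their margins (b). The `toy` site
# scorer just counts matches to a consensus; it is not the real f_m.

def sequence_score(sequence, site_score, threshold, width):
    """Return (number of matching sites, total degree of matching)."""
    margins = [site_score(sequence[i:i + width]) - threshold
               for i in range(len(sequence) - width + 1)]
    matches = [m for m in margins if m > 0]
    return len(matches), sum(matches)

toy = lambda s: sum(c == t for c, t in zip(s, "AACATCCG"))
n_sites, strength = sequence_score("AACATCCGAACATCCG", toy, 6.5, 8)
# two perfect occurrences of the consensus: n_sites == 2, strength == 3.0
```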
8
Training – Boosting
Modify the confidence-rated boosting (CRB) algorithm (Schapire and Singer 1999) to train ensemble models:
(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.
9
Why Boosting?
Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error.
(Figure: training error, margin of the training samples, and generalization error; Schapire et al. 1998.)
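The upper-bound argument can be illustrated numerically: for any margin m = y·F(S), exp(-m) ≥ [m ≤ 0] pointwise, so the average exponential loss bounds the 0/1 training error. The margins below are made up for the demonstration.

```python
import math

# Toy illustration (made-up margins): the exponential loss exp(-m)
# upper-bounds the 0/1 error indicator [m <= 0] for every sample, so
# driving the exponential loss down also drives down training error.

margins = [2.1, 0.4, -0.3, 1.2, -1.5]   # y_i * F(S_i) for five toy samples
zero_one = sum(m <= 0 for m in margins) / len(margins)        # training error
exp_loss = sum(math.exp(-m) for m in margins) / len(margins)  # upper bound
assert exp_loss >= zero_one   # holds pointwise, hence on average
```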
10
Challenges
Positive sequences are the targets of a TF; all other sequences are negative.
1. Sequences are labeled, but the sites within the sequences are not.
2. The two classes cannot be well separated by the (linear) weight matrix model.
3. The number of negative sequences is much larger than the number of positive sequences.
11
Boosting
Initialization: set sample weights so that the total weight of the positive samples equals the total weight of the negative samples.
Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W_0.
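A sketch of this class-balanced initialization (function and variable names are ours): each class receives half of the total weight regardless of how many sequences it contains, countering the class imbalance.

```python
# Sketch of the initialization: positive and negative classes each get
# half the total weight. Names are illustrative.

def init_weights(labels):
    """labels: +1/-1 per sequence; returns per-sample weights summing
    to 1, split equally between the two classes."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]

labels = [+1, +1, -1, -1, -1, -1]
weights = init_weights(labels)  # positives: 0.25 each; negatives: 0.125 each
```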
12
Boosting
Train a base classifier (BC 1): use the seed matrix W_0 to initialize the mth base classifier q_m(·) and let α_m = 1.
Refine α_m and the parameters of q_m(·) to minimize the weighted exponential loss Σ_i d_i^m exp(-y_i α_m q_m(S_i)), where y_i is the label of S_i and d_i^m is the weight of S_i in the mth round.
Negative information is explicitly used to train q_m(·) and α_m.
13
Boosting
Adjust sample weights, giving higher weight to previously misclassified samples: d_i^{m+1} ∝ d_i^m exp(-y_i α_m q_m(S_i)), where y_i is the label of S_i, d_i^m is its weight in the mth round, and d_i^{m+1} is its new weight.
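A sketch of this confidence-rated reweighting step, following the update d_i^{m+1} ∝ d_i^m exp(-α_m y_i q_m(S_i)) with renormalization; all names are illustrative.

```python
import math

# Sketch of the reweighting step: samples where the label y_i and the
# base-classifier score q_i disagree gain weight in the next round.

def update_weights(weights, labels, scores, alpha):
    """d_i^{m+1} proportional to d_i^m * exp(-alpha * y_i * q_i)."""
    new = [d * math.exp(-alpha * y * q)
           for d, y, q in zip(weights, labels, scores)]
    z = sum(new)                    # normalization constant
    return [d / z for d in new]

d_next = update_weights([0.25] * 4, [+1, +1, -1, -1], [+1, -1, -1, +1], 0.5)
# samples 1 and 3 were misclassified and now carry more weight
```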
14
Boosting
Add a new base classifier (BC 2).
15
Boosting
Add a new base classifier; the decision boundary is updated.
16
Boosting
Adjust sample weights again.
17
Boosting
Add one more base classifier (BC 3).
18
Boosting
Add one more base classifier; the decision boundary is refined.
19
Boosting
Stop if the result is perfect or the performance on the internal validation sequences drops.
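This stopping rule can be sketched as early stopping on a held-out validation set; `train_round` and the two error callbacks are hypothetical placeholders for the real training and evaluation procedures.

```python
# Sketch of the stopping rule: keep adding base classifiers until training
# is perfect or validation performance degrades.

def boost_with_early_stopping(train_round, train_error, val_error, max_rounds=50):
    best_val, model = float("inf"), []
    for m in range(max_rounds):
        model.append(train_round(m))   # fit the mth base classifier
        if train_error(model) == 0:
            break                      # perfect on training data
        v = val_error(model)
        if v > best_val:
            model.pop()                # validation dropped: undo and stop
            break
        best_val = v
    return model

# Demo with stubbed callbacks: validation error improves, then worsens.
vals = iter([0.4, 0.3, 0.35])
model = boost_with_early_stopping(lambda m: m, lambda mod: 1, lambda mod: next(vals))
# model == [0, 1]: the third classifier hurt validation and was discarded
```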
20
Results
Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002).
Positive sequences: binding p-value < 0.001; at least 25 positive sequences per TF.
Negative sequences: binding p-value ≥ 0.05 and binding ratio ≤ 1.
Got 40 TFs.
21
Results
Leave-one-out test results: boosted models vs. seed weight matrices.
Horizontal axis: TFs. Vertical axis: improvement in specificity.
22
Results: RAP1
Weight matrix vs. boosting (base classifiers 1, 2, and 3).
Boosting captures position correlation.
23
Results: REB1
Weight matrix vs. boosting (base classifiers 1 and 2).
Boosting captures position correlation.