1 A combining approach to statistical methods for p >> n problems
Shinto Eguchi, Workshop on Statistical Genetics, Nov 9, 2004, at ISM
2 Microarray data
cDNA microarray
3 Prediction from gene expressions
Feature vector: dimension p = number of genes; the p components are the quantities of gene expression. Class label: disease, adverse effect. A classification machine is built from the training dataset.
4 Leukemic diseases (Golub et al.)
http://www.broad.mit.edu/cgi-bin/cancer/publications/
5 Web microarray data (p >> n)

Dataset  |    p | n  | y = +1 | y = -1
ALLAML   | 7129 | 72 |   37   |   35
Colon    | 2000 | 62 |   40   |   22
Estrogen | 7129 | 49 |   25   |   24

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
6 Genomic data

            | SNPs (Genome)   | Proteome (Protein) | Microarray (mRNA)
dimension p | 1,000 ~ 100,000 | function           | 5,000 ~ 20,000
data size n | 100 ~ 1000      | 5 ~ 20             | 20 ~ 100
7 Problem: p >> n
A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, ...), while n is the number of individuals, limited by informed consent, institutional protocols, and other bioethics constraints.
8 Current paradigm
Biomarker space:
- SNPs: haplotype block (Fujisawa)
- Microarray: model-based clustering
- Proteome: peak data reduction (Miyata), GroupBoost (Takenouchi)
- Network gene model
- Haplotype & adverse effects (Matsuura)
9 An approach by combining
Let B be a biomarker space. With the rapid expansion of genomic data, suppose there are K experimental facilities, each producing its own dataset over B.
10 Bridge study?
CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG), ... → result
11 CAMDA 2003: 4 datasets for lung cancer

Harvard  | PNAS, 2001       | Affymetrix
Michigan | Nature Med, 2002 | Affymetrix
Stanford | PNAS, 2001       | cDNA
Ontario  | Cancer Res, 2001 | cDNA

http://www.camda.duke.edu/camda03/datasets/
12 Some problems
1. Heterogeneity in feature space: cDNA vs. Affymetrix; differences in covariates and medical diagnosis; uncertainty in microarray experiments
2. Heterogeneous class-labeling
3. Heterogeneous generalization powers
4. Publication bias: a vast number of unpublished studies
13 Machine learning
Learnability: can weak learners be boosted? AdaBoost (Freund & Schapire, 1997) combines weak classifiers stagewise into a strong classifier.
14 AdaBoost
15 One-gene classifier
Let x_j be the expression of the j-th gene; a one-gene classifier thresholds x_j alone. [Figure: error counts of the candidate one-gene classifiers, e.g. 4, 5, 5, 6, ...]
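The one-gene classifier above is a decision stump on a single gene's expression. A minimal sketch (names and toy data are illustrative, not from the slides): search all threshold/sign combinations on gene j for the smallest weighted error.

```python
import numpy as np

def one_gene_classifier(X, y, j, weights=None):
    """Best stump on gene j: h(x) = s * (+1 if x_j > t else -1),
    minimizing the (weighted) misclassification error."""
    n = len(y)
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    xs = np.sort(X[:, j])
    best = (np.inf, None, 1)              # (error, threshold, sign)
    for t in (xs[:-1] + xs[1:]) / 2:      # midpoints as candidate thresholds
        for s in (+1, -1):
            pred = s * np.where(X[:, j] > t, 1, -1)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, t, s)
    return best

# toy data: 6 samples x 3 genes; gene 0 separates the two classes
X = np.array([[0.1, 5.0, 2.0], [0.2, 3.0, 1.0], [0.3, 4.0, 9.0],
              [0.9, 5.0, 2.0], [1.1, 3.0, 8.0], [1.2, 4.0, 1.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
err, t, s = one_gene_classifier(X, y, j=0)   # perfect split at t = 0.6
```

Such stumps are the "weak classifiers" that boosting combines in the following slides.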
16 The second training
Update the weights: misclassified examples are weighted up (x2), correctly classified ones down (x0.5). [Figure: weighted error counts of the candidate classifiers after reweighting, e.g. 4, 4.5, 5.5, 6, ...]
17 Learning algorithm
Final machine: F(x) = sign( sum_{t=1}^{T} alpha_t f_t(x) ), the sign of the weighted vote of the stagewise weak classifiers.
18 Exponential loss
L_exp(F) = sum_i exp( -y_i F(x_i) ). Update: w_i <- w_i exp( -alpha_t y_i f_t(x_i) ), the standard AdaBoost reweighting.
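The stagewise AdaBoost loop with the exponential-loss weight update can be sketched as follows (a minimal illustration with one-gene stumps as weak learners; function names and toy data are assumptions, not the slides' notation):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted-error-minimizing one-gene stump: (error, gene j, threshold t, sign s)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        xs = np.sort(X[:, j])
        for t in (xs[:-1] + xs[1:]) / 2:
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, T=5):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    machines = []
    for _ in range(T):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # stage coefficient
        pred = s * np.where(X[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)         # exponential-loss update
        w = w / w.sum()
        machines.append((alpha, j, t, s))
    return machines

def predict(machines, X):
    """Final machine F(x) = sign(sum_t alpha_t f_t(x))."""
    F = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in machines)
    return np.sign(F)

# toy data: gene 0 separates the classes
X = np.array([[0.1, 5.0], [0.2, 3.0], [0.3, 4.0],
              [0.9, 5.0], [1.1, 3.0], [1.2, 4.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
machines = adaboost(X, y, T=3)
```

Each stage greedily reduces the exponential loss; the reweighting makes the next stump focus on the examples the current machine still misclassifies.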
19 Different datasets
Normalization: each dataset is reduced to expression vectors of the same genes, with labels for the same clinical item.
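One concrete way to carry out this alignment step (a sketch under assumptions; the slides do not specify the normalization, so per-gene z-scoring within each dataset is an illustrative choice):

```python
import numpy as np

def align_and_normalize(datasets, gene_lists):
    """Restrict each study to the genes common to all studies, in one fixed
    order, and z-score each gene within each dataset (illustrative choice).

    datasets: list of (n_k x p_k) arrays; gene_lists: matching gene-name lists.
    Returns (list of (n_k x m) arrays over the m shared genes, shared gene list).
    """
    common = sorted(set.intersection(*(set(g) for g in gene_lists)))
    out = []
    for X, genes in zip(datasets, gene_lists):
        idx = [genes.index(g) for g in common]   # per-study column order
        Z = X[:, idx]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
        out.append(Z)
    return out, common

# toy example: two studies with overlapping gene sets
X1 = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]); genes1 = ["g1", "g2", "g3"]
X2 = np.array([[2.0, 0.0], [4.0, 2.0]]);           genes2 = ["g3", "g1"]
(Z1, Z2), common = align_and_normalize([X1, X2], [genes1, genes2])
```

After this step every study lives in the same feature space, which is what lets a single weak classifier be evaluated on all K datasets at once.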
20 Weighted errors
The k-th weighted error: eps_k = sum_i w_i^(k) I( f(x_i^(k)) != y_i^(k) ). The combined weighted error aggregates eps_1, ..., eps_K over the K datasets.
21 BridgeBoost
22 Learning
From stage t to stage t+1.
23 Mean exponential loss
The mean exponential loss averages the exponential losses over the K datasets. Note: the exponential loss is convex, so the mean exponential loss is convex as well.
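A sketch of the mean-exponential-loss idea (inferred from the slides' "combined weighted error" and "mean exponential loss" — this is NOT guaranteed to be the authors' exact BridgeBoost algorithm): each of the K studies keeps its own weights, each stage picks the weak classifier minimizing the mean of the K weighted errors, and all K weight vectors are updated with the shared stage coefficient. Candidate stumps here are deliberately coarse (sign of x_j with either orientation) to keep the sketch short.

```python
import numpy as np

def bridge_boost(datasets, T=5):
    """Combined boosting over K studies on a shared gene space.

    datasets: list of (X_k, y_k) pairs with identical column (gene) order.
    Each stage minimizes the combined (mean) weighted error across studies.
    """
    ws = [np.full(len(y), 1.0 / len(y)) for _, y in datasets]
    p = datasets[0][0].shape[1]
    machines = []
    for _ in range(T):
        best = (np.inf, 0, 1)                     # (combined error, gene j, sign s)
        for j in range(p):
            for s in (+1, -1):
                comb = np.mean([w[(s * np.where(X[:, j] > 0, 1, -1)) != y].sum()
                                for (X, y), w in zip(datasets, ws)])
                if comb < best[0]:
                    best = (comb, j, s)
        comb, j, s = best
        comb = np.clip(comb, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - comb) / comb)   # shared stage coefficient
        machines.append((alpha, j, s))
        for k, (X, y) in enumerate(datasets):     # update every study's weights
            pred = s * np.where(X[:, j] > 0, 1, -1)
            wk = ws[k] * np.exp(-alpha * y * pred)
            ws[k] = wk / wk.sum()
    return machines

def bb_predict(machines, X):
    F = sum(a * s * np.where(X[:, j] > 0, 1, -1) for a, j, s in machines)
    return np.sign(F)

# toy example: two studies in which the sign of gene 0 predicts the label
X1 = np.array([[-1.0, 0.5], [1.0, -0.2]]); y1 = np.array([-1, 1])
X2 = np.array([[-2.0, 1.0], [2.0, 3.0]]);  y2 = np.array([-1, 1])
machines = bridge_boost([(X1, y1), (X2, y2)], T=3)
```

Because each per-study exponential loss is convex, their mean is convex too, so the same stagewise greedy minimization used by AdaBoost applies to the combined objective.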
24 Meta-learning
Separate learning vs. meta-learning.
25 Simulation
Three datasets, plus a collapsed (pooled) dataset; training and test errors are compared. Ideal test error: 0 for data 1 and data 2, 0.5 for data 3.
26 Comparison
[Figure: training- and test-error curves for Separate AdaBoost vs. BridgeBoost]
27 Test errors
[Figure: test-error curves for Collapsed AdaBoost, Separate AdaBoost, and BridgeBoost; minima shown: 15%, 4%, 43%, 3%, 4%]
28 Conclusion
From separate learning to meta-learning: combining the studies' results yields the final result.
29 Unsolved problems
1. Which datasets should be joined or deleted in BridgeBoost?
2. How should the class label be predicted for a given new x?
3. How can the information in unmatched genes be used when combining datasets?
4. Heterogeneity is OK, but what about publication bias?
30 Publication bias? (Copas & Shi, 2001)
[Figure: funnel plot of the means and s.d.'s of 37 studies of passive smoking vs. lung cancer, showing heterogeneity and publication bias]
31 References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo 884, July 2003.
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J. Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).