1 A combining approach to statistical methods for p >> n problems
Shinto Eguchi
Workshop on Statistical Genetics, Nov 9, 2004, at ISM
2 Microarray data
cDNA microarray
3 Prediction from gene expressions
Feature vector: dimension p = number of genes; components = gene expression levels.
Class label: disease status, adverse effect.
A classification machine is built from the training dataset.
4 Leukemic diseases (Golub et al.)
5 Web microarray data
Public datasets (ALL/AML, Colon, Estrogen), each listed with dimension p, sample size n, and class counts (y = +1 / y = -1); in every case p >> n. micro/work/
6 Genomic data
                  SNPs              Proteome    Microarray
dimension p       1,000 ~ 100,000   function    5,000 ~ 20,000
data size n       100 ~             ~ 20        20 ~ 100
source            genome            protein     mRNA
7 Problem: p >> n
A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, ...), while n is the number of individuals (limited by informed consent, institutional protocols, ... bioethics).
8 Current paradigm: biomarker space
SNPs: haplotype block (Fujisawa)
Microarray: model-based clustering
Proteome: peak data reduction (Miyata)
GroupBoost (Takenouchi)
Network gene model
Haplotype & adverse effects (Matsuura)
9 An approach by combining
Let B be a biomarker space. With the rapid expansion of genomic data, suppose there are K experimental facilities.
10 Bridge study?
CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG), ...
11 CAMDA datasets for lung cancer
Harvard    PNAS, 2001          Affymetrix
Michigan   Nature Med, 2002    Affymetrix
Stanford   PNAS, 2001          cDNA
Ontario    Cancer Res, 2001    cDNA
12 Some problems
1. Heterogeneity in feature space (cDNA vs. Affymetrix; differences in covariates and medical diagnosis; uncertainty in microarray experiments)
2. Heterogeneous class-labeling
3. Heterogeneous generalization power
4. Publication bias (a vast number of unpublished studies)
13 Machine learning
Learnability: can weak learners be boosted?
AdaBoost (Freund & Schapire, 1997): weak classifiers are combined stagewise into a strong classifier.
14 AdaBoost
15 One-gene classifier
Given the expressions of the j-th gene across the samples, a one-gene classifier thresholds them and is scored by its (weighted) error count.
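A one-gene classifier of this kind can be sketched as a decision stump thresholding a single gene's expression (a minimal illustration; the function and variable names are mine, not from the slides):

```python
import numpy as np

def one_gene_classifier(x_j, y, w):
    """Best threshold and sign for the j-th gene's expressions x_j,
    minimizing the weighted error under sample weights w."""
    best = (np.inf, None, None)  # (weighted error, threshold, sign)
    for theta in np.unique(x_j):
        for s in (+1, -1):
            pred = s * np.where(x_j > theta, 1, -1)
            err = np.sum(w * (pred != y))
            if err < best[0]:
                best = (err, theta, s)
    return best

# toy example: expressions of one gene for 6 samples
x_j = np.array([0.2, 0.5, 0.9, 1.4, 1.8, 2.3])
y = np.array([-1, -1, -1, 1, 1, 1])
w = np.full(6, 1 / 6)           # uniform initial weights
err, theta, s = one_gene_classifier(x_j, y, w)
```

Here the best stump separates the two classes perfectly (weighted error 0) at any threshold between 0.9 and 1.4.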
16 The second training
Error count: 4.5. Update the weights: misclassified samples are weighted up (x2), correctly classified samples are weighted down (x0.5).
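The x2 / x0.5 factors match the standard AdaBoost weight update (reconstructed here, since the slide's formulas did not survive extraction): with weighted error $\epsilon_t$, the weak classifier $f_t$ receives the coefficient $\alpha_t$, and each weight is multiplied by $e^{\alpha_t}$ if misclassified and $e^{-\alpha_t}$ if correct:

```latex
\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t},
\qquad
w_i^{(t+1)} \propto w_i^{(t)}\, e^{-\alpha_t\, y_i f_t(x_i)} .
```

For example, $e^{\alpha_t} = 2$ (misclassified weights double, correct weights halve) exactly when $\epsilon_t = 1/5$.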
17 Learning algorithm Final machine
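The stagewise learning algorithm and final machine on this slide (whose formulas were lost in extraction) might be sketched as follows, using one-gene stumps as weak learners; this is a reconstruction, and all names are illustrative:

```python
import numpy as np

def adaboost(X, y, T=10):
    """Minimal AdaBoost with one-gene decision stumps as weak learners.
    X: (n, p) expression matrix; y: labels in {-1, +1}."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)
    machines = []  # (alpha, gene j, threshold, sign)
    for _ in range(T):
        # weak learning step: the stump minimizing the weighted error
        best = None
        for j in range(p):
            for theta in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > theta, 1, -1)
                    err = w @ (pred != y)
                    if best is None or err < best[0]:
                        best = (err, j, theta, s, pred)
        err, j, theta, s, pred = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # coefficient of the weak learner
        machines.append((alpha, j, theta, s))
        # exponential-loss weight update, then renormalize
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
    return machines

def final_machine(machines, X):
    """Sign of the weighted vote of the selected stumps."""
    F = np.zeros(len(X))
    for alpha, j, theta, s in machines:
        F += alpha * s * np.where(X[:, j] > theta, 1, -1)
    return np.sign(F)
```

The exhaustive stump search is O(n p) per candidate threshold and is only meant to make the stagewise structure explicit, not to be efficient for p >> n.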
18 Exponential loss Update :
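In standard AdaBoost notation (a reconstruction; the slide's own formulas were lost in extraction), the exponential loss and the stagewise update of the combined machine $F$ are:

```latex
L_{\exp}(F) = \sum_{i=1}^{n} \exp\bigl(-y_i F(x_i)\bigr),
\qquad
F_{t+1} = F_t + \alpha_t f_t .
```

Minimizing $L_{\exp}(F_t + \alpha f_t)$ over $\alpha$ for a fixed weak classifier $f_t$ recovers $\alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$.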
19 Different datasets
Normalization: each dataset contributes expression vectors of the same genes and labels of the same clinical item.
20 Weighted Errors The k-th weighted error The combined weighted error
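Written out (my notation, since the slide's formulas were lost): with weights $w_i^{(k)}$ on the $n_k$ examples of the k-th dataset,

```latex
\epsilon_k(f) = \sum_{i=1}^{n_k} w_i^{(k)}\,
  \mathbb{I}\bigl(f(x_i^{(k)}) \neq y_i^{(k)}\bigr),
\qquad
\bar{\epsilon}(f) = \frac{1}{K}\sum_{k=1}^{K} \epsilon_k(f),
```

where taking the combined error as the unweighted mean over the K datasets is an assumption; other combination weights are possible.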
21 BridgeBoost
22 Learning: from stage t to stage t+1
23 Mean exponential loss
Exponential loss per dataset; mean exponential loss over the datasets. Note: convexity of the exponential loss.
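With per-dataset losses $L_k$, the mean exponential loss can be written as (a reconstruction in my notation):

```latex
L_k(F) = \frac{1}{n_k}\sum_{i=1}^{n_k}
  \exp\bigl(-y_i^{(k)} F(x_i^{(k)})\bigr),
\qquad
\bar{L}(F) = \frac{1}{K}\sum_{k=1}^{K} L_k(F).
```

Since each $L_k$ is convex in $F$, the mean $\bar{L}$ is convex as well, so the stagewise minimization is well behaved.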
24 Meta-learning
Separate learning vs. meta-learning
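The meta-learning step might be sketched as follows: each dataset keeps its own weight vector, but a single common weak classifier (a one-gene stump) is selected per stage by minimizing the combined weighted error. This is my reconstruction from the slides, not Eguchi's actual BridgeBoost implementation; all names and details (e.g. the unweighted mean over datasets) are illustrative assumptions:

```python
import numpy as np

def bridge_boost(datasets, T=10):
    """Sketch of combined boosting over K datasets measured on the same
    genes. datasets: list of (X_k, y_k) pairs with y in {-1, +1}."""
    weights = [np.full(len(y), 1.0 / len(y)) for _, y in datasets]
    p = datasets[0][0].shape[1]
    machines = []  # (alpha, gene j, threshold, sign)
    for _ in range(T):
        best = None
        for j in range(p):
            thetas = np.unique(np.concatenate([X[:, j] for X, _ in datasets]))
            for theta in thetas:
                for s in (+1, -1):
                    # combined weighted error = mean over the K datasets
                    err = np.mean([
                        w @ ((s * np.where(X[:, j] > theta, 1, -1)) != y)
                        for (X, y), w in zip(datasets, weights)
                    ])
                    if best is None or err < best[0]:
                        best = (err, j, theta, s)
        err, j, theta, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        machines.append((alpha, j, theta, s))
        # AdaBoost-style weight update, separately within each dataset
        for k, (X, y) in enumerate(datasets):
            pred = s * np.where(X[:, j] > theta, 1, -1)
            w = weights[k] * np.exp(-alpha * y * pred)
            weights[k] = w / w.sum()
    return machines
```

The contrast with separate learning is that the stump is forced to be common across facilities, while the weight vectors (and hence the per-dataset errors) remain separate.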
25 Simulation
Three datasets (data 1, data 2, data 3) and their collapsed (pooled) dataset. Training-error and test-error curves; ideal test error 0 for data 1 and data 2, 0.5 for data 3.
26 Comparison
Separate AdaBoost vs. BridgeBoost: training-error and test-error curves.
27 Test errors
Panels: Collapsed AdaBoost, Separate AdaBoost, BridgeBoost. Minimum test errors shown in the plots: 15%, 4%, 43%, 3%, 4%.
28 Conclusion
Separate learning vs. meta-learning
29 Unsolved problems
1. Which datasets should be joined or deleted in BridgeBoost?
2. How to predict the class label for a given new x?
3. How to use the information in unmatched genes when combining datasets?
4. Heterogeneity is OK, but what about publication bias?
30 Publication bias? (Copas & Shi, 2001)
Funnel plot: mean and s.d. of 37 studies of passive smoking vs. lung cancer, showing heterogeneity and publication bias.
31 References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo 884, July 2003.
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63 (2001).
[7] J. Copas and J. Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232 (2000).