1 A combining approach to statistical methods for p >> n problems
Shinto Eguchi, Workshop on Statistical Genetics, Nov 9, 2004, at ISM
2 Microarray data
cDNA microarray
3 Prediction from gene expressions
Feature vector: dimension p = number of genes; the p components are the quantities of gene expression. Class label: disease, adverse effect. A classification machine is built from the training dataset.
4 Leukemic diseases (Golub et al.)
http://www.broad.mit.edu/cgi-bin/cancer/publications/
5 Web microarray data (p >> n)

Dataset  |    p | n  | y = +1 | y = -1
ALLAML   | 7129 | 72 |   37   |   35
Colon    | 2000 | 62 |   40   |   22
Estrogen | 7129 | 49 |   25   |   24

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
6 Genomic data

            | SNPs (Genome)   | Proteome (Protein) | Microarray (mRNA)
dimension p | 1,000 ~ 100,000 | function           | 5,000 ~ 20,000
data size n | 100 ~ 1000      | 5 ~ 20             | 20 ~ 100
7 Problem: p >> n
A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, ...), while n is the number of individuals, limited by informed consent, institutional protocols, and other bioethics constraints.
8 Current paradigm
Biomarker space:
- SNPs: haplotype block (Fujisawa)
- Microarray: model-based clustering
- Proteome: peak data reduction (Miyata), GroupBoost (Takenouchi)
- Network gene model
- Haplotype & adverse effects (Matsuura)
9 An approach by combining
Let B be a biomarker space. With the rapid expansion of genomic data, suppose there are K experimental facilities, each producing its own dataset over B.
10 Bridge study?
CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG), ... → result
11 CAMDA 2003: 4 datasets for lung cancer

Harvard  | PNAS, 2001       | Affymetrix
Michigan | Nature Med, 2002 | Affymetrix
Stanford | PNAS, 2001       | cDNA
Ontario  | Cancer Res, 2001 | cDNA

http://www.camda.duke.edu/camda03/datasets/
12 Some problems
1. Heterogeneity in feature space: cDNA vs. Affymetrix; differences in covariates and medical diagnosis; uncertainty in microarray experiments
2. Heterogeneous class-labeling
3. Heterogeneous generalization powers
4. Publication bias: a vast number of unpublished studies
13 Machine learning
Learnability: can weak learners be boosted? AdaBoost (Freund & Schapire, 1997) combines weak classifiers stagewise into a strong classifier.
14 AdaBoost
15 One-gene classifier
Let x_j be the expression of the j-th gene; a one-gene classifier thresholds x_j alone. [Figure: error counts of the candidate one-gene classifiers, e.g. 4, 5, 5, 6, ...]
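The one-gene classifier above is a decision stump on a single gene's expression. A minimal sketch (names and toy data are illustrative, not from the slides): search all threshold/sign combinations on gene j for the smallest weighted error.

```python
import numpy as np

def one_gene_classifier(X, y, j, weights=None):
    """Best stump on gene j: h(x) = s * (+1 if x_j > t else -1),
    minimizing the (weighted) misclassification error."""
    n = len(y)
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    xs = np.sort(X[:, j])
    best = (np.inf, None, 1)              # (error, threshold, sign)
    for t in (xs[:-1] + xs[1:]) / 2:      # midpoints as candidate thresholds
        for s in (+1, -1):
            pred = s * np.where(X[:, j] > t, 1, -1)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, t, s)
    return best

# toy data: 6 samples x 3 genes; gene 0 separates the two classes
X = np.array([[0.1, 5.0, 2.0], [0.2, 3.0, 1.0], [0.3, 4.0, 9.0],
              [0.9, 5.0, 2.0], [1.1, 3.0, 8.0], [1.2, 4.0, 1.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
err, t, s = one_gene_classifier(X, y, j=0)   # perfect split at t = 0.6
```

Such stumps are the "weak classifiers" that boosting combines in the following slides.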
16 The second training
Update the weights: misclassified examples are weighted up (x2), correctly classified ones down (x0.5). [Figure: weighted error counts of the candidate classifiers after reweighting, e.g. 4, 4.5, 5.5, 6, ...]
17 Learning algorithm
Final machine: F(x) = sign( sum_{t=1}^{T} alpha_t f_t(x) ), the sign of the weighted vote of the stagewise weak classifiers.
18 Exponential loss
L_exp(F) = sum_i exp( -y_i F(x_i) ). Update: w_i <- w_i exp( -alpha_t y_i f_t(x_i) ), the standard AdaBoost reweighting.
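The stagewise AdaBoost loop with the exponential-loss weight update can be sketched as follows (a minimal illustration with one-gene stumps as weak learners; function names and toy data are assumptions, not the slides' notation):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted-error-minimizing one-gene stump: (error, gene j, threshold t, sign s)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        xs = np.sort(X[:, j])
        for t in (xs[:-1] + xs[1:]) / 2:
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, T=5):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    machines = []
    for _ in range(T):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # stage coefficient
        pred = s * np.where(X[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)         # exponential-loss update
        w = w / w.sum()
        machines.append((alpha, j, t, s))
    return machines

def predict(machines, X):
    """Final machine F(x) = sign(sum_t alpha_t f_t(x))."""
    F = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in machines)
    return np.sign(F)

# toy data: gene 0 separates the classes
X = np.array([[0.1, 5.0], [0.2, 3.0], [0.3, 4.0],
              [0.9, 5.0], [1.1, 3.0], [1.2, 4.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
machines = adaboost(X, y, T=3)
```

Each stage greedily reduces the exponential loss; the reweighting makes the next stump focus on the examples the current machine still misclassifies.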
19 Different datasets
Normalization: each dataset is reduced to expression vectors of the same genes, with labels for the same clinical item.
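One concrete way to carry out this alignment step (a sketch under assumptions; the slides do not specify the normalization, so per-gene z-scoring within each dataset is an illustrative choice):

```python
import numpy as np

def align_and_normalize(datasets, gene_lists):
    """Restrict each study to the genes common to all studies, in one fixed
    order, and z-score each gene within each dataset (illustrative choice).

    datasets: list of (n_k x p_k) arrays; gene_lists: matching gene-name lists.
    Returns (list of (n_k x m) arrays over the m shared genes, shared gene list).
    """
    common = sorted(set.intersection(*(set(g) for g in gene_lists)))
    out = []
    for X, genes in zip(datasets, gene_lists):
        idx = [genes.index(g) for g in common]   # per-study column order
        Z = X[:, idx]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
        out.append(Z)
    return out, common

# toy example: two studies with overlapping gene sets
X1 = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]); genes1 = ["g1", "g2", "g3"]
X2 = np.array([[2.0, 0.0], [4.0, 2.0]]);           genes2 = ["g3", "g1"]
(Z1, Z2), common = align_and_normalize([X1, X2], [genes1, genes2])
```

After this step every study lives in the same feature space, which is what lets a single weak classifier be evaluated on all K datasets at once.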
20 Weighted errors
The k-th weighted error: eps_k = sum_i w_i^(k) I( f(x_i^(k)) != y_i^(k) ). The combined weighted error aggregates eps_1, ..., eps_K over the K datasets.
21 BridgeBoost
22 Learning
From stage t to stage t+1.
23 Mean exponential loss
The mean exponential loss averages the exponential losses over the K datasets. Note: the exponential loss is convex, so the mean exponential loss is convex as well.
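A sketch of the mean-exponential-loss idea (inferred from the slides' "combined weighted error" and "mean exponential loss" — this is NOT guaranteed to be the authors' exact BridgeBoost algorithm): each of the K studies keeps its own weights, each stage picks the weak classifier minimizing the mean of the K weighted errors, and all K weight vectors are updated with the shared stage coefficient. Candidate stumps here are deliberately coarse (sign of x_j with either orientation) to keep the sketch short.

```python
import numpy as np

def bridge_boost(datasets, T=5):
    """Combined boosting over K studies on a shared gene space.

    datasets: list of (X_k, y_k) pairs with identical column (gene) order.
    Each stage minimizes the combined (mean) weighted error across studies.
    """
    ws = [np.full(len(y), 1.0 / len(y)) for _, y in datasets]
    p = datasets[0][0].shape[1]
    machines = []
    for _ in range(T):
        best = (np.inf, 0, 1)                     # (combined error, gene j, sign s)
        for j in range(p):
            for s in (+1, -1):
                comb = np.mean([w[(s * np.where(X[:, j] > 0, 1, -1)) != y].sum()
                                for (X, y), w in zip(datasets, ws)])
                if comb < best[0]:
                    best = (comb, j, s)
        comb, j, s = best
        comb = np.clip(comb, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - comb) / comb)   # shared stage coefficient
        machines.append((alpha, j, s))
        for k, (X, y) in enumerate(datasets):     # update every study's weights
            pred = s * np.where(X[:, j] > 0, 1, -1)
            wk = ws[k] * np.exp(-alpha * y * pred)
            ws[k] = wk / wk.sum()
    return machines

def bb_predict(machines, X):
    F = sum(a * s * np.where(X[:, j] > 0, 1, -1) for a, j, s in machines)
    return np.sign(F)

# toy example: two studies in which the sign of gene 0 predicts the label
X1 = np.array([[-1.0, 0.5], [1.0, -0.2]]); y1 = np.array([-1, 1])
X2 = np.array([[-2.0, 1.0], [2.0, 3.0]]);  y2 = np.array([-1, 1])
machines = bridge_boost([(X1, y1), (X2, y2)], T=3)
```

Because each per-study exponential loss is convex, their mean is convex too, so the same stagewise greedy minimization used by AdaBoost applies to the combined objective.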
24 Meta-learning
Separate learning vs. meta-learning.
25 Simulation
Three datasets, plus a collapsed (pooled) dataset; training and test errors are compared. Ideal test error: 0 for data 1 and data 2, 0.5 for data 3.
26 Comparison
[Figure: training- and test-error curves for Separate AdaBoost vs. BridgeBoost]
27 Test errors
[Figure: test-error curves for Collapsed AdaBoost, Separate AdaBoost, and BridgeBoost; minima shown: 15%, 4%, 43%, 3%, 4%]
28 Conclusion
From separate learning to meta-learning: combining the studies' results yields the final result.
29 Unsolved problems
1. Which datasets should be joined or deleted in BridgeBoost?
2. How should the class label be predicted for a given new x?
3. How can the information in unmatched genes be used when combining datasets?
4. Heterogeneity is OK, but what about publication bias?
30 Publication bias? (Copas & Shi, 2001)
[Figure: funnel plot of the means and s.d.'s of 37 studies of passive smoking vs. lung cancer, showing heterogeneity and publication bias]
31 References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo 884, July 2003.
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J. Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).