1 A combining approach to statistical methods for p >> n problems
Shinto Eguchi
Workshop on Statistical Genetics, Nov 9, 2004, at ISM
2 Microarray data
cDNA microarray
3 Prediction from gene expressions
Feature vector: dimension p = number of genes; components = gene expression levels.
Class label: disease status, adverse effect.
A classification machine is built from the training dataset.
4 Leukemic diseases (Golub et al.)
5 Web microarray data
Public datasets (ALL/AML, Colon, Estrogen), each listed with dimension p, sample size n, and class counts (y = +1 / y = -1); in every case p >> n. micro/work/
6 Genomic data
                  SNPs              Proteome    Microarray
dimension p       1,000 ~ 100,000   function    5,000 ~ 20,000
data size n       100 ~             ~ 20        20 ~ 100
source            genome            protein     mRNA
7 Problem: p >> n
A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, ...), while n is the number of individuals (limited by informed consent, institutional protocols, ... bioethics).
8 Current paradigm: biomarker space
SNPs: haplotype block (Fujisawa)
Microarray: model-based clustering
Proteome: peak data reduction (Miyata)
GroupBoost (Takenouchi)
Network gene model
Haplotype & adverse effects (Matsuura)
9 An approach by combining
Let B be a biomarker space. With the rapid expansion of genomic data, suppose there are K experimental facilities.
10 Bridge study?
CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG), ...
11 CAMDA datasets for lung cancer
Harvard    PNAS, 2001          Affymetrix
Michigan   Nature Med, 2002    Affymetrix
Stanford   PNAS, 2001          cDNA
Ontario    Cancer Res, 2001    cDNA
12 Some problems
1. Heterogeneity in feature space (cDNA vs. Affymetrix; differences in covariates and medical diagnosis; uncertainty in microarray experiments)
2. Heterogeneous class-labeling
3. Heterogeneous generalization power
4. Publication bias (a vast number of unpublished studies)
13 Machine learning
Learnability: can weak learners be boosted?
AdaBoost (Freund & Schapire, 1997): weak classifiers are combined stagewise into a strong classifier.
14 AdaBoost
15 One-gene classifier
Given the expressions of the j-th gene across the samples, a one-gene classifier thresholds them and is scored by its (weighted) error count.
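A one-gene classifier of this kind can be sketched as a decision stump thresholding a single gene's expression (a minimal illustration; the function and variable names are mine, not from the slides):

```python
import numpy as np

def one_gene_classifier(x_j, y, w):
    """Best threshold and sign for the j-th gene's expressions x_j,
    minimizing the weighted error under sample weights w."""
    best = (np.inf, None, None)  # (weighted error, threshold, sign)
    for theta in np.unique(x_j):
        for s in (+1, -1):
            pred = s * np.where(x_j > theta, 1, -1)
            err = np.sum(w * (pred != y))
            if err < best[0]:
                best = (err, theta, s)
    return best

# toy example: expressions of one gene for 6 samples
x_j = np.array([0.2, 0.5, 0.9, 1.4, 1.8, 2.3])
y = np.array([-1, -1, -1, 1, 1, 1])
w = np.full(6, 1 / 6)           # uniform initial weights
err, theta, s = one_gene_classifier(x_j, y, w)
```

Here the best stump separates the two classes perfectly (weighted error 0) at any threshold between 0.9 and 1.4.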
16 The second training
Error count: 4.5. Update the weights: misclassified samples are weighted up (x2), correctly classified samples are weighted down (x0.5).
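The x2 / x0.5 factors match the standard AdaBoost weight update (reconstructed here, since the slide's formulas did not survive extraction): with weighted error $\epsilon_t$, the weak classifier $f_t$ receives the coefficient $\alpha_t$, and each weight is multiplied by $e^{\alpha_t}$ if misclassified and $e^{-\alpha_t}$ if correct:

```latex
\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t},
\qquad
w_i^{(t+1)} \propto w_i^{(t)}\, e^{-\alpha_t\, y_i f_t(x_i)} .
```

For example, $e^{\alpha_t} = 2$ (misclassified weights double, correct weights halve) exactly when $\epsilon_t = 1/5$.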
17 Learning algorithm Final machine
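The stagewise learning algorithm and final machine on this slide (whose formulas were lost in extraction) might be sketched as follows, using one-gene stumps as weak learners; this is a reconstruction, and all names are illustrative:

```python
import numpy as np

def adaboost(X, y, T=10):
    """Minimal AdaBoost with one-gene decision stumps as weak learners.
    X: (n, p) expression matrix; y: labels in {-1, +1}."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)
    machines = []  # (alpha, gene j, threshold, sign)
    for _ in range(T):
        # weak learning step: the stump minimizing the weighted error
        best = None
        for j in range(p):
            for theta in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > theta, 1, -1)
                    err = w @ (pred != y)
                    if best is None or err < best[0]:
                        best = (err, j, theta, s, pred)
        err, j, theta, s, pred = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # coefficient of the weak learner
        machines.append((alpha, j, theta, s))
        # exponential-loss weight update, then renormalize
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
    return machines

def final_machine(machines, X):
    """Sign of the weighted vote of the selected stumps."""
    F = np.zeros(len(X))
    for alpha, j, theta, s in machines:
        F += alpha * s * np.where(X[:, j] > theta, 1, -1)
    return np.sign(F)
```

The exhaustive stump search is O(n p) per candidate threshold and is only meant to make the stagewise structure explicit, not to be efficient for p >> n.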
18 Exponential loss Update :
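In standard AdaBoost notation (a reconstruction; the slide's own formulas were lost in extraction), the exponential loss and the stagewise update of the combined machine $F$ are:

```latex
L_{\exp}(F) = \sum_{i=1}^{n} \exp\bigl(-y_i F(x_i)\bigr),
\qquad
F_{t+1} = F_t + \alpha_t f_t .
```

Minimizing $L_{\exp}(F_t + \alpha f_t)$ over $\alpha$ for a fixed weak classifier $f_t$ recovers $\alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$.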
19 Different datasets
Normalization: each dataset contributes expression vectors of the same genes and labels of the same clinical item.
20 Weighted Errors The k-th weighted error The combined weighted error
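Written out (my notation, since the slide's formulas were lost): with weights $w_i^{(k)}$ on the $n_k$ examples of the k-th dataset,

```latex
\epsilon_k(f) = \sum_{i=1}^{n_k} w_i^{(k)}\,
  \mathbb{I}\bigl(f(x_i^{(k)}) \neq y_i^{(k)}\bigr),
\qquad
\bar{\epsilon}(f) = \frac{1}{K}\sum_{k=1}^{K} \epsilon_k(f),
```

where taking the combined error as the unweighted mean over the K datasets is an assumption; other combination weights are possible.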
21 BridgeBoost
22 Learning: from stage t to stage t+1
23 Mean exponential loss
Exponential loss per dataset; mean exponential loss over the datasets. Note: convexity of the exponential loss.
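With per-dataset losses $L_k$, the mean exponential loss can be written as (a reconstruction in my notation):

```latex
L_k(F) = \frac{1}{n_k}\sum_{i=1}^{n_k}
  \exp\bigl(-y_i^{(k)} F(x_i^{(k)})\bigr),
\qquad
\bar{L}(F) = \frac{1}{K}\sum_{k=1}^{K} L_k(F).
```

Since each $L_k$ is convex in $F$, the mean $\bar{L}$ is convex as well, so the stagewise minimization is well behaved.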
24 Meta-learning
Separate learning vs. meta-learning
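The meta-learning step might be sketched as follows: each dataset keeps its own weight vector, but a single common weak classifier (a one-gene stump) is selected per stage by minimizing the combined weighted error. This is my reconstruction from the slides, not Eguchi's actual BridgeBoost implementation; all names and details (e.g. the unweighted mean over datasets) are illustrative assumptions:

```python
import numpy as np

def bridge_boost(datasets, T=10):
    """Sketch of combined boosting over K datasets measured on the same
    genes. datasets: list of (X_k, y_k) pairs with y in {-1, +1}."""
    weights = [np.full(len(y), 1.0 / len(y)) for _, y in datasets]
    p = datasets[0][0].shape[1]
    machines = []  # (alpha, gene j, threshold, sign)
    for _ in range(T):
        best = None
        for j in range(p):
            thetas = np.unique(np.concatenate([X[:, j] for X, _ in datasets]))
            for theta in thetas:
                for s in (+1, -1):
                    # combined weighted error = mean over the K datasets
                    err = np.mean([
                        w @ ((s * np.where(X[:, j] > theta, 1, -1)) != y)
                        for (X, y), w in zip(datasets, weights)
                    ])
                    if best is None or err < best[0]:
                        best = (err, j, theta, s)
        err, j, theta, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        machines.append((alpha, j, theta, s))
        # AdaBoost-style weight update, separately within each dataset
        for k, (X, y) in enumerate(datasets):
            pred = s * np.where(X[:, j] > theta, 1, -1)
            w = weights[k] * np.exp(-alpha * y * pred)
            weights[k] = w / w.sum()
    return machines
```

The contrast with separate learning is that the stump is forced to be common across facilities, while the weight vectors (and hence the per-dataset errors) remain separate.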
25 Simulation
Three datasets (data 1, data 2, data 3) and their collapsed (pooled) dataset. Training-error and test-error curves; ideal test error 0 for data 1 and data 2, 0.5 for data 3.
26 Comparison
Separate AdaBoost vs. BridgeBoost: training-error and test-error curves.
27 Test errors
Panels: Collapsed AdaBoost, Separate AdaBoost, BridgeBoost. Minimum test errors shown in the plots: 15%, 4%, 43%, 3%, 4%.
28 Conclusion
Separate learning vs. meta-learning
29 Unsolved problems
1. Which datasets should be joined or deleted in BridgeBoost?
2. How to predict the class label for a given new x?
3. How to use the information in unmatched genes when combining datasets?
4. Heterogeneity is OK, but what about publication bias?
30 Publication bias? (Copas & Shi, 2001)
Funnel plot: mean and s.d. of 37 studies of passive smoking vs. lung cancer, showing heterogeneity and publication bias.
31 References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo 884, July 2003.
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63 (2001).
[7] J. Copas and J. Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232 (2000).