2015-6-30 DIMACS Workshop on Machine Learning Techniques in Bioinformatics: Cancer Classification with Data-dependent Kernels, Anne Ya Zhang (with Xue-wen Chen & Huilin Xiong)

DIMACS Workshop on Machine Learning Techniques in Bioinformatics
Cancer Classification with Data-dependent Kernels
Anne Ya Zhang (with Xue-wen Chen & Huilin Xiong)
EECS & ITTC, University of Kansas

Outline
- Introduction (current section)
- Data-dependent Kernel
- Results
- Conclusion

Cancer facts
- Cancer is a group of many related diseases.
  - Cells continue to grow and divide and do not die when they should.
  - It involves changes in the genes that control normal cell growth and death.
- Cancer is the second leading cause of death in the United States.
  - Cancer causes 1 of every 4 deaths.
  - The NIH estimated the overall cost of cancer in 2004 at $189.8 billion ($64.9 billion in direct medical costs).
- Cancer types: breast cancer, lung cancer, colon cancer, …
  - Death rates vary greatly by cancer type and stage at diagnosis.

Motivation: why do we need to classify cancers?
- The general way of treating cancer is to:
  - categorize the cancers into different classes;
  - use a specific treatment for each class.
- Traditional ways to classify cancers:
  - Morphological appearance. Not accurate!
  - Enzyme-based histochemical analyses.
  - Immunophenotyping.
  - Cytogenetic analysis.
  - Complicated, and needs highly specialized laboratories.

Motivation: why are the traditional ways not enough?
- Some tumors in the same class have completely different clinical courses.
  - A more accurate classification may be needed.
- Assigning new tumors to known cancer classes is not easy,
  - e.g. assigning an acute leukemia tumor to one of:
    - AML (acute myeloid leukemia);
    - ALL (acute lymphoblastic leukemia).

DNA microarray-based cancer diagnosis
- Cancer is caused by changes in the genes that control normal cell growth and death.
- Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification.
- These tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
- Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.

Microarray
- A microarray experiment monitors the expression levels of thousands of genes simultaneously.
- Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, and hence to a more reliable classification.
[Figure: expression heat map of genes G1 to G7 across conditions C1 to C7, with a color scale running from low through zero to high.]

Microarray
- Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
- From a machine learning point of view, the data form a gene-by-experiment matrix:

  Gene \ Experiment   ex-1   ex-2   ……   ex-m
  g-1
  g-2
  ……
  g-n

- The large volume of data requires computational aid in analyzing the expression data.

Machine learning tasks in cancer classification
- There are three main types of machine learning problems associated with cancer classification:
  1. the identification of new cancer classes using gene expression profiles;
  2. the classification of cancer into known classes;
  3. the identification of "marker" genes that characterize the different cancer classes.
- In this presentation, we focus on the second type of problem.

Project goals
- To develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.
- Use an initial collection of samples belonging to the known classes of cancer to create a "class predictor" for new, unknown samples.

Challenges in cancer classification
- Gene expression data are typically characterized by:
  - high dimensionality (i.e. a large number of genes);
  - small sample size.
- Curse of dimensionality!
- Methods:
  - kernel techniques;
  - data resampling;
  - gene selection.

Outline
- Introduction
- Data-dependent Kernel (current section)
- Results
- Conclusion

Data-dependent kernel model
- [Formula not recovered; slide annotation: "data dependent".]
- Optimizing the data-dependent kernel means choosing the coefficient vector.
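
The slide's formula did not survive in this transcript. As a hedged sketch only, the code below follows the form used in the Xiong, Swamy, and Ahmad (2005) paper listed in the references, as far as it can be recalled here: k(x, y) = q(x) q(y) k0(x, y), with q(x) = α0 + Σ αi k1(x, ai). The Gaussian choice for k1, the parameter name `gamma1`, the `anchors` argument (the points ai), and all function names are assumptions made for illustration, not details taken from the slides.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2) on two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def factor(x, alpha, anchors, gamma1):
    """Factor function q(x) = alpha[0] + sum_i alpha[i] * k1(x, anchors[i-1]).

    alpha is the coefficient vector that kernel optimization would choose.
    """
    return alpha[0] + sum(
        a_i * gaussian_kernel(x, anchor, gamma1)
        for a_i, anchor in zip(alpha[1:], anchors)
    )


def data_dependent_kernel(x, y, alpha, anchors, gamma0, gamma1):
    """k(x, y) = q(x) * q(y) * k0(x, y), with a Gaussian basic kernel k0."""
    return (
        factor(x, alpha, anchors, gamma1)
        * factor(y, alpha, anchors, gamma1)
        * gaussian_kernel(x, y, gamma0)
    )
```

With `alpha = [1.0, 0.0, ..., 0.0]` the factor function is identically 1, so the sketch reduces to the basic kernel k0; any other coefficient vector rescales the kernel around the anchor points.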

Optimizing the kernel
- Criterion for kernel optimization: maximum class separability of the training data in the kernel-induced feature space.

The kernel optimization
- In practice, the matrix N_0 is usually singular.
- α: the eigenvector corresponding to the largest eigenvalue.
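
One possible reading of this slide, offered as a sketch rather than the authors' procedure: if the separability criterion is a ratio of quadratic forms in α, with a numerator matrix called `M0` here (an assumed name; only N_0 appears in the transcript), then a singular N_0 can be handled by adding a small ridge `eps` to its diagonal before extracting the leading eigenvector. The ridge term, the use of power iteration, and every function name below are assumptions. Power iteration finds the eigenvalue of largest magnitude, which coincides with the largest eigenvalue when `M0` is positive semidefinite and the regularized N_0 is positive definite.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def leading_alpha(M0, N0, eps=1e-3, iters=500):
    """Unit-norm leading eigenvector of (N0 + eps*I)^-1 M0 via power iteration."""
    n = len(M0)
    # Ridge regularization (an assumption) makes the singular N0 invertible.
    N_reg = [[N0[i][j] + (eps if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    alpha = [1.0] * n
    for _ in range(iters):
        Ma = [sum(M0[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        nxt = solve(N_reg, Ma)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("iteration collapsed to the zero vector")
        alpha = [v / norm for v in nxt]
    return alpha
```

For matrices of realistic size a library eigen-solver would be the usual choice; the hand-written version is used here only to keep the sketch dependency-free.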

Kernel optimization
[Figure: training data and test data, before and after kernel optimization.]

Distributed resampling
- Original training data: [formula not recovered]
- Training data with resampling: [formula not recovered]

Gene selection
- A filter method: class separability.
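
The transcript does not give the exact separability score used for filtering. The sketch below assumes one common choice, the ratio of between-class to within-class scatter computed independently for each gene, and keeps the `k` highest-scoring genes. The function names and the score itself are illustrative assumptions, not the authors' definition.

```python
def separability_score(values, labels):
    """Between-class scatter divided by within-class scatter for one gene."""
    overall = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == c]
        mean_c = sum(group) / len(group)
        between += len(group) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    if within == 0.0:
        # A constant gene carries no information; a perfectly separated
        # gene with zero within-class spread gets the top score.
        return 0.0 if between == 0.0 else float("inf")
    return between / within


def select_genes(X, labels, k):
    """Return indices of the k genes with the highest separability score.

    X is a list of samples; each sample is a list of gene expression values.
    """
    n_genes = len(X[0])
    scores = [
        separability_score([sample[g] for sample in X], labels)
        for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k]
```

Because each gene is scored on its own, this is a filter method in the usual sense: the ranking does not depend on the classifier applied afterwards.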

Outline
- Introduction
- Data-dependent Kernel
- Results (current section)
- Conclusion

Comparison with other methods
- k-nearest neighbor (kNN)
- Diagonal linear discriminant analysis (DLDA)
- Uncorrelated linear discriminant analysis (ULDA)
- Support vector machines (SVM)

Data sets (classification task for each data set)
- Subtypes: ALL vs. AML
- Status of estrogen receptor
- Status of lymph node
- Outcome of treatment
- Tumor vs. healthy tissue
- Subtypes: MPM vs. ADCA
- Different lymphoma cells
- Cancer vs. non-cancer
- Tumor vs. healthy tissue

Experimental setup
- Data normalization: zero mean and unit variance for each gene.
- Randomly partition the data into two disjoint subsets of equal size: training data and test data.
- Repeat each experiment 100 times.
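
The protocol above could be sketched roughly as follows. This is an illustration under stated assumptions: normalization is computed once over all samples (the slide lists it before the partition, but does not say so explicitly), an odd sample count gives halves that differ by one, and `evaluate` is a hypothetical callback that trains on one half and returns the test error on the other.

```python
import random
import statistics


def standardize_genes(X):
    """Scale each gene (column) to zero mean and unit variance across samples."""
    n_genes = len(X[0])
    columns = []
    for g in range(n_genes):
        col = [sample[g] for sample in X]
        mean = statistics.mean(col)
        std = statistics.pstdev(col)
        # Constant genes are mapped to zero instead of dividing by zero.
        columns.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*columns)]


def random_half_split(n_samples, rng):
    """Return two disjoint index lists of (nearly) equal size."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]


def repeated_evaluation(X, y, evaluate, repeats=100, seed=0):
    """Average the error reported by `evaluate` over repeated random splits."""
    rng = random.Random(seed)
    Z = standardize_genes(X)
    errors = []
    for _ in range(repeats):
        train, test = random_half_split(len(Z), rng)
        errors.append(evaluate(
            [Z[i] for i in train], [y[i] for i in train],
            [Z[i] for i in test], [y[i] for i in test],
        ))
    return statistics.mean(errors)
```

Fixing `seed` makes the 100 partitions reproducible, so different classifiers can be compared on identical splits.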

Parameters
- DLDA: no parameters.
- kNN: Euclidean distance, K = 3.
- ULDA: K = 3.
- SVM: Gaussian kernel; leave-one-out on the training data is used to tune the parameters.
- KerNN: Gaussian kernel for the basic kernel k_0; γ_0 and σ are set empirically; leave-one-out on the training data is used to tune the remaining parameters; kNN is used for classification.
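
To make the kNN and leave-one-out steps concrete, here is a minimal sketch of nearest-neighbor classification with a kernel-induced distance, d²(x, y) = k(x, x) − 2k(x, y) + k(y, y). With a linear kernel this is the squared Euclidean distance, which corresponds to the plain kNN baseline. The function names are hypothetical, and the slides do not state that KerNN is implemented this way.

```python
from collections import Counter


def linear_kernel(x, y):
    """Plain dot product; yields the ordinary Euclidean distance below."""
    return sum(a * b for a, b in zip(x, y))


def kernel_distance_sq(x, y, kernel):
    """Squared distance between x and y in the kernel-induced feature space."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def kernel_knn_predict(x, train_X, train_y, kernel, k=3):
    """Majority vote among the k training samples closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: kernel_distance_sq(x, train_X[i], kernel),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error(X, y, kernel, k=3):
    """Leave-one-out error rate: each sample is predicted from all the others."""
    wrong = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if kernel_knn_predict(X[i], rest_X, rest_y, kernel, k) != y[i]:
            wrong += 1
    return wrong / len(X)
```

A parameter search would call `loo_error` on the training half for each candidate kernel setting and keep the setting with the lowest error.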

Effect of data resampling
[Figure panels: Prostate (102 samples); Lung (181 samples).]

Effect of gene selection
[Figures, one per slide: ALL-AML; Colon; Prostate.]

Comparison results
[Figure panels: ALL-AML, BreastER, BreastLN, Colon, CNS, Lung, Ovarian, Prostate.]

Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion (current section)

Conclusion
- By maximizing the class separability of the training data, the data-dependent kernel is also able to increase the separability of the test data.
- The kernel method is robust to high-dimensional microarray data.
- The distributed resampling strategy helps to alleviate the problem of overfitting.

Conclusion
- The classifier assigns samples more accurately than the other approaches, so treatment can be better matched to each class.
- The method can be used for clarifying unusual cases, e.g. a patient who was diagnosed with AML but has atypical morphology.
- The method can be applied to distinctions relating to future clinical outcomes.

Future work
- How to estimate the parameters
- Study the genes selected

References
- H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:
- H. Xiong, Y. Zhang, and X. Chen. Data-dependent Kernels for Cancer Classification. Under review.
- A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:
- S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination method for the classification of tumor using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87
- T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:
- J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:

Thanks! Questions?