Selection of Patient Samples and Genes for Disease Prognosis Limsoon Wong Institute for Infocomm Research Joint work with Jinyan Li & Huiqing Liu.

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

Instance-based Classification Examine the training samples each time a new query instance is given. The relationship between the new query instance and.
Casulo C et al. Proc ASH 2013;Abstract 510.
4 th NETTAB Workshop Camerino, 5 th -7 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
Reduced Support Vector Machine
Cancer classification using Machine Learning Techniques on Microarray Data Yongjin Park 1 and Ming-Chi Tsai 2 1 Department of Biology, Computational Biology.
. Differentially Expressed Genes, Class Discovery & Classification.
Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.
Microarray-based Disease Prognosis using Gene Annotation Signatures Michael Kovshilovsky Swapna Annavarapu SoCalBSI 2005.
CIBB-WIRN 2004 Perugia, 14 th -17 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini Feature.
Supervised gene expression data analysis using SVMs and MLPs Giorgio Valentini
Introduction to Survival Analysis PROC LIFETEST and Survival Curves.
Guidelines on Statistical Analysis and Reporting of DNA Microarray Studies of Clinical Outcome Richard Simon, D.Sc. Chief, Biometric Research Branch National.
Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.
JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001
References 1.Salazar R, Roepman P, Capella G et al. Gene expression signature to improve prognosis prediction of stage II and III colorectal cancer. J.
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai.
Exciting Bioinformatics Adventures Limsoon Wong Institute for Infocomm Research.
Gene expression profiling identifies molecular subtypes of gliomas
CZ5225: Modeling and Simulation in Biology Lecture 6, Microarray Cancer Classification Prof. Chen Yu Zong Tel:
A 14-gene prognosis signature predicts metastasis risk in node-negative, estrogen receptor-positive, Tamoxifen-treated breast cancer in different ethnogeographic.
Knowledge Discovery in Biomedicine Limsoon Wong Institute for Infocomm Research.
Chapter 7 Essential Concepts in Molecular Pathology Companion site for Molecular Pathology Author: William B. Coleman and Gregory J. Tsongalis.
Analysis of Molecular and Clinical Data at PolyomX Adrian Driga 1, Kathryn Graham 1, 2, Sambasivarao Damaraju 1, 2, Jennifer Listgarten 3, Russ Greiner.
Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya Winter 2006 Dalhousie University.
Supplemental Figures Loss of circadian clock gene expression is associated with tumor progression in breast cancer Cristina Cadenas 1*, Leonie van de Sandt.
University of Washington Institute of Technology Tacoma, WA, USA Ecole des Hautes Etudes en Santé Publique Département Infobiostat Rennes, France Isabelle.
Sgroi DC et al. Proc SABCS 2012;Abstract S1-9.
1 Classifying Lymphoma Dataset Using Multi-class Support Vector Machines INFS-795 Advanced Data Mining Prof. Domeniconi Presented by Hong Chai.
MANAGEMENT OF MANTLE CELL LYMPHOMA IN TUNISIA R BEN LAKHAL, L KAMMOUN, K ZAHRA, S KEFI Sousse 25 MAY 2012.
1.DATABASE construction  n=1,715  Median OS=40.0 months, age: 64+/-10 yrs  Histology (adeno/squamous/large): 50% / 45% / 5%  Stage 1/2/3/4: 63% / 27%
It is only the beginning: Putting microarrays into context Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany.
The Broad Institute of MIT and Harvard Classification / Prediction.
INCREASED EXPRESSION OF PROTEIN KINASE CK2  SUBUNIT IN HUMAN GASTRIC CARCINOMA Kai-Yuan Lin 1 and Yih-Huei Uen 1,2,3 1 Department of Medical Research,
Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore.
Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.
Tumor clock protein PER2 as a determinant of survival in patients (pts) receiving oxaliplatin-5-FU- leucovorin as 1st line chemotherapy for metastatic.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
CHFR METHYLATION AS AN EPIGENETIC MARKER FOR RECURRENCE OF COLON CANCER M. D. Anderson Cancer Center, Houston, Texas Motofumi Tanaka, Salil Sethi, Donghui.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Guest lecture: Feature Selection Alan Qi Dec 2, 2004.
Gray Zone Lymphoma (GZL) with Features Intermediate between Classical Hodgkin Lymphoma (cHL) and Diffuse Large B-Cell Lymphoma (DLBCL): A Large Retrospective.
A comparative study of survival models for breast cancer prognostication based on microarray data: a single gene beat them all? B. Haibe-Kains, C. Desmedt,
FREEDOM FROM PROGRESSION FOR PATIENTS RECEIVING I 125 VERSUS Pd 103 FOR PROSTATE BRACHYTHERAPY Jane Cho, Carol Morgenstern, Barbara Napolitano, Lee Richstone,
Satistics 2621 Statistics 262: Intermediate Biostatistics Jonathan Taylor and Kristin Cobb April 20, 2004: Introduction to Survival Analysis.
Ki-67 index cutoff value of 1% is a valuable prognostic biomarker for pulmonary carcinoids based on this large cohort. Our data also provide strong evidence.
Cetuximab plus FOLFIRI in the treatment of metastatic colorectal cancer: the influence of KRAS and BRAF biomarkers on outcome: updated data from the CRYSTAL.
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
Scott Kopetz, MD, PhD Department of Gastrointestinal Medical Oncology
Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.
R-CHOP with Iodine-131 Tositumomab Consolidation for Advanced Stage Diffuse Large B-Cell Lymphoma (DLBCL): Southwest Oncology Group Protocol S0433 Friedberg.
What Factors Predict Outcome At Relapse After Previous Esophagectomy And Adjuvant Therapy in High-Risk Esophageal Cancer? Edward Yu 1, Patricia Tai 5,
Date of download: 6/18/2016 Copyright © 2016 American Medical Association. All rights reserved. From: Association of BRCA1 and BRCA2 Mutations With Survival,
BENEFIT OF CONSOLIDATIVE RADIATION THERAPY IN PATIENTS WITH DIFFUSE LARGE B-CELL LYMPHOMA TREATED WITH R-CHOP CHEMOTHERAPY JACK PHAN, ALI MAZLOOM, L. JEFFREY.
IDENTIFYING CANCER SUBTYPES BASED ON SOMATIC MUTATION PROFILE BIOINFORMATICS SEMINAR 2016 SPRING YOUJIN SHIN.
Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.
Kelci J. Miclaus, PhD Advanced Analytics R&D Manager JMP Life Sciences
Trees, bagging, boosting, and stacking
Focus on lymphomas Cancer Cell
Volume 7, Issue 4, Pages (April 2005)
Shidan Wang, BS, Lin Yang, MD, Bo Ci, BS, Matthew Maclean, BS, David E
Arjun Pennathur, MD, Liqiang Xi, MD, Virginia R. Litle, MD, William E
Molecular prognostication of liver cancer: End of the beginning
Figure 1. Identification of three tumour molecular subtypes in CIT and TCGA cohorts. We used CIT multi-omics data ( Figure 1. Identification of.
by Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou
MYC–HOXB7–HER2 predicts clinical outcome in breast cancer patients treated with tamoxifen. MYC–HOXB7–HER2 predicts clinical outcome in breast cancer patients.
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

Selection of Patient Samples and Genes for Disease Prognosis Limsoon Wong Institute for Infocomm Research Joint work with Jinyan Li & Huiqing Liu

Copyright © 2004 by Limsoon Wong Plan Introduction Techniques – Extreme sample selection – ERCOF gene identification – Scoring function construction Results –Diffuse large-B-cell lymphoma –Lung adenocarcinoma Discussions and conclusions

Introduction

Copyright © 2004 by Limsoon Wong Gene Expression Profile + Clinical Data  Outcome Prediction Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002 ) Fuzzy neural network (Ando et al 2002) Partial least squares regression (Park et al 2002) Weighted voting algorithm (Shipp et al 2002) Gene index and “reference gene” (LeBlanc et al 2003) ……

Our Approach “extreme” sample selection ERCOF Copyright © 2004 by Limsoon Wong

Techniques Extreme Sample Selection ERCOF Gene Identification Scoring Function Construction

Copyright © 2004 by Limsoon Wong Short-term Survivors v.s. Long-term Survivors T: sample F(T): follow-up time E(T): status (1:unfavorable; 0: favorable) c 1 and c 2 : thresholds of survival time Short-term survivors who died within a short period F(T) < c 1 and E(T) = 1 Long-term survivors who were alive after a long follow-up time F(T) > c 2 Extreme Sample Selection

ERCOF Entropy- Based Rank Sum Test & Correlation Filtering Copyright © 2004 by Limsoon Wong

ERCOF Phase I: Entropy Measure Apply a discretization algorithm (Fayyad & Irani 1993) Remove genes with expression values w/o cut point found (cannot be discretized)

Copyright © 2004 by Limsoon Wong ERCOF Phase II: Wilcoxon Rank Sum Test Calculate Wilcoxon rank sum w(x) for gene x Remove gene x if w(x)  [c lower, c upper ]

ERCOF Phase III: Pearson Correlation Group features by Pearson Correlation For each group, retain the top 50% Copyright © 2004 by Limsoon Wong

Linear Kernel SVM regression function T: test sample, x(i): support vector, y i : class label (1: short-term survivors; -1: long-term survivors) Transformation function (posterior probability) S(T): risk score of sample T Risk Score Construction

Copyright © 2004 by Limsoon Wong Predicting Survival of Patients with DLBC Lymphoma Image credit: Rosenwald et al, 2002

Copyright © 2004 by Limsoon Wong Diffuse Large B-Cell Lymphoma DLBC lymphoma is the most common type of lymphoma in adults Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients  DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy Intl Prognostic Index (IPI) –age, “Eastern Cooperative Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease,... Not good for stratifying DLBC lymphoma patients for therapeutic trials  Use gene-expression profiles to predict outcome of chemotherapy?

Copyright © 2004 by Limsoon Wong Rosenwald et al., NEJM data samples –160 in preliminary group –80 in validation group –each sample described by 7399 microarray features Rosenwald et al.’s approach –identify gene: Cox proportional-hazards model –cluster identified genes into four gene signatures –calculate for each sample an outcome-predictor score –divide patients into quartiles according to score

Knowledge Discovery from Gene Expression of “Extreme” Samples “extreme” sample selection: 8 yrs knowledge discovery from gene expression 240 samples 80 samples 26 long- term survivors 47 short- term survivors 7399 genes 84 genes T is long-term if S(T) < 0.3 T is short-term if S(T) > 0.7 Copyright © 2004 by Limsoon Wong

p-value of log-rank test: < Risk score thresholds: 0.7, 0.3 Kaplan-Meier Plot for 80 Test Cases Copyright © 2004 by Limsoon Wong

(A) IPI low, p-value = (B) IPI intermediate, p-value = Improvement Over IPI Copyright © 2004 by Limsoon Wong

(A) W/o sample selection (p =0.38) (B) With sample selection (p=0.009) No clear difference on the overall survival of the 80 samples in the validation group of DLBCL study, if no training sample selection conducted Merit of “Extreme” Samples Copyright © 2004 by Limsoon Wong

Predicting Survival of Patients with Lung Adenocarcinoma Image credit: Beer et al, 2002

Copyright © 2004 by Limsoon Wong Beer et al., Nat Med data samples –67 are stage I –19 are stage III –each described by 7129 microarray features Beer et al.’s approach –identify genes: univariate Cox analysis –derive risk index using selected genes –validation methods equal sized training and testing sample leave-one-out cross validation

Knowledge Discovery from Gene Expression of “Extreme” Samples “extreme” sample selection: 5 yrs knowledge discovery from gene expression 67 stage I 19 stage III 55 samples 21 long- term survivors 10 short- term survivors 7129 genes 591 genes T is long-term if S(T) < 0.25 T is short-term if S(T) > 0.73 Copyright © 2004 by Limsoon Wong

p-value of log-rank test: (A) , (B) < (A) Test cases (55)(B) All cases (86) Kaplan-Meier Plots for Lung Adenocarcinoma Study Copyright © 2004 by Limsoon Wong

Discussions

Copyright © 2004 by Limsoon Wong (*)Informative OriginalLung (*)Informative OriginalDLBCL AliveDead TotalStatusData setApplication Number of samples in original data and selected informative training set. (*): Number of samples whose corresponding patient was dead at the end of follow-up time, but selected as a long-term survivor. Discussions: Sample Selection

Copyright © 2004 by Limsoon Wong 591(8.3%)84(1.7%)Phase II 884(12.4%)132(2.7%)Phase I (*)Original LungDLBCLGene selection Number of genes left after feature filtering for each phase. (*): number of genes after removing those genes who were absent in more than 10% of the experiments. Discussions: Gene Identification

Discussions: Kaplan-Meier plots on Test Cases Using Diff No. of Genes Copyright © 2004 by Limsoon Wong

Conclusion I Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information

Copyright © 2004 by Limsoon Wong Is ERCOF Useful?

Copyright © 2004 by Limsoon Wong Expts Feature selection methods considered –All use all features –All-entropy select features whose value range can be partitioned by Fayyad & Irani’s entropy method –Mean-entropy select features whose entropy is better than the mean entropy –Top-number-entropy select the top 20, 50, 100, 200 genes by their entropy –ERCOF at 5% significant level for Wilcoxon rank sum test and 0.99 Pearson correlation coeff threshold Data sets considered –Colon tumor –Prostate cancer –Lung cancer –Ovarian cancer –DLBC lymphoma –ALL-AML –Childhood ALL Learning methods considered –C4.5 –Bagging, Boosting, CS4 –SVM, 3-NN

ERCOF vs All-Entropy Copyright © 2004 by Limsoon Wong All-entropy wins 4 times ERCOF wins 60 times

ERCOF vs Mean-Entropy Copyright © 2004 by Limsoon Wong Mean-entropy wins 18 times ERCOF wins 42 times

Effectiveness of ERCOF Copyright © 2004 by Limsoon Wong 22 Total wins

Copyright © 2004 by Limsoon Wong Conclusion 2 ERCOF is very suitable for SVM, 3-NN, CS4, Random Forest, as it gives these learning algos highest no. of wins ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors  ERCOF is a systematic feature selection method that is very useful

Copyright © 2004 by Limsoon Wong Any Question?

Copyright © 2004 by Limsoon Wong References H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages Beer et al, “Gene-expression profiles predict survival of patients with lung adenocarcinoma.” Nat. Med, 8(8): , 2002 Rosenwald et al, “The use of molecular profiling to predict survival after chemotherapy for diffuse large- B-cell lymphoma”, NEJM, 346(25): , 2002