Presentation transcript:

Isabelle Bichindaritz. University of Washington Institute of Technology, Tacoma, WA, USA; Ecole des Hautes Etudes en Santé Publique, Département Infobiostat, Rennes, France

Purpose of this Talk
Once upon a time … there was biology (~1800), and there were computers (~1920). Of their common interests was born bioinformatics (~1979) …
Question: How can CBR contribute to bioinformatics research? An example in microarray data analysis.

[Figure: NCBI, 2004]

Bioinformatics Challenges
Frequent tasks in bioinformatics:
Similarity search in genetic sequences
Microarray data analysis
Macromolecule shape prediction
Evolutionary tree construction
Gene regulatory network mining

Bioinformatics Challenges: Microarray data analysis
Microarrays are made from a collection of purified DNAs. A drop of each type of DNA in solution is placed onto a specially prepared glass microscope slide by an arraying machine.
Please note that:
the human genome contains about 30,000 genes.
a microarray can contain thousands or tens of thousands of relatively short nucleotide probes of known sequence.

Bioinformatics Challenges
The end product of a comparative hybridization experiment is a scanned array image.

Microarray applications
Determine relative DNA levels associated with a huge number of known and predicted genes in a single experiment.
The most attractive application of microarrays is the study of differential gene expression in disease. The up- or down-regulation of gene activity can either be the cause of the pathophysiology or the result of the disease.
Accurate measurement of every single gene can be assessed.
Sensitivity: very high; can detect the presence of one transcript in one-tenth of a cell.

Data mining challenges
Volume of data (gigabytes, number of features)
Characteristics of data (specific constraints)
Domain-specific knowledge (expert interpretation)

BMA-CBR System
Gene expression level dataset, then application of a feature selection algorithm, followed by one of two branches:
Discrete sample output: supervised machine learning and model construction through classification (diagnosis)
Continuous sample output: supervised machine learning and model construction through prediction (survival analysis)

BMA-CBR System
The BMA-CBR system performs feature selection through BMA before using CBR for microarray data classification and prediction (survival analysis).
Outline:
Introduction and motivation of variable selection
What is Bayesian Model Averaging (BMA)?
One approach: the iterative BMA algorithm
Application 1: Chronic Myeloid Leukemia (CML)
Application 2: Survival analysis
Presentation of CBR

Feature selection
Used to select a subset of relevant features for building robust learning models in machine learning. Often used in supervised learning:
Select relevant features from the training set (for which class labels are known).
Apply the selected features in the test set.

Feature selection
Goal: a minimal set of relevant genes for future prediction or assay development.

Typical variable selection methods: one variable at a time
Examples: t-test; between-group to within-group sum of squares (BSS/WSS) [Dudoit et al. 2001]
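
As an illustration of this kind of univariate ranking, here is a minimal sketch (not from the talk) of the per-gene BSS/WSS ratio; the array shapes and variable names are assumptions.

```python
import numpy as np

def bss_wss_scores(X, y):
    """Rank genes by between-group / within-group sum of squares.

    X: (n_samples, n_genes) expression matrix, y: (n_samples,) class labels.
    Higher scores indicate genes that separate the classes better.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)                        # per-gene grand mean
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        class_mean = Xc.mean(axis=0)
        bss += len(Xc) * (class_mean - overall_mean) ** 2  # between-group sum of squares
        wss += ((Xc - class_mean) ** 2).sum(axis=0)        # within-group sum of squares
    return bss / (wss + 1e-12)                             # small constant avoids division by zero

# Example usage: rank genes and keep the top 30 for downstream BMA
# scores = bss_wss_scores(X_train, y_train)
# top_genes = np.argsort(scores)[::-1][:30]
```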

Multivariate gene selection
Our goal: consider multiple genes simultaneously, to exploit the interdependence between genes and reduce the number of relevant genes.

Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999]
A multivariate variable selection technique. Typical model selection approaches select a single model and then proceed as if that model had generated the data, which leads to overconfident inferences.
Advantages of BMA:
Fewer selected genes
Can be generalized to any number of classes
Posterior probabilities for selected genes and selected models

BMA
Average over predictions from several models. What do we need?
Prediction with a given model k: logistic regression
How to choose a set of “good” models? Variable selection
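
A minimal sketch (an assumption, not the talk's exact implementation) of the averaging step: each candidate model is a logistic regression on a small gene subset, and its prediction is weighted by its posterior model probability. The names `gene_subsets` and `posterior_probs` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bma_predict_proba(X_train, y_train, X_test, gene_subsets, posterior_probs):
    """Average class-1 probabilities over several logistic regression models.

    gene_subsets: list of gene-index arrays, one subset per model.
    posterior_probs: posterior probability of each model (assumed to sum to 1).
    """
    averaged = np.zeros(len(X_test))
    for genes, weight in zip(gene_subsets, posterior_probs):
        clf = LogisticRegression(max_iter=1000).fit(X_train[:, genes], y_train)
        averaged += weight * clf.predict_proba(X_test[:, genes])[:, 1]
    return averaged  # P(class = 1), averaged over the model set
```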

What models to average over?
All possible models: way too many! E.g. 2^30 ≈ 1 billion, 2^50 ≈ 10^15, etc.
The BMA solution:
1. “Leaps and bounds” [Furnival and Wilson 1974]: when the number of variables (genes) is at most 30, we can efficiently produce a reduced set of good models (branch and bound).
2. Cut down the number of models: discard models that are much less likely than the best model.
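
Step 2 is commonly implemented as an Occam's window style cutoff. The sketch below (an assumption, not necessarily the talk's exact procedure) keeps only models whose posterior probability is within a factor `cutoff` of the best model and renormalizes the remaining weights.

```python
import numpy as np

def occams_window(posterior_probs, cutoff=20.0):
    """Keep models whose posterior is within `cutoff` of the best model's posterior."""
    p = np.asarray(posterior_probs, dtype=float)
    keep = p >= p.max() / cutoff          # discard models far less likely than the best
    kept = p[keep]
    return keep, kept / kept.sum()        # boolean mask and renormalized model weights
```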

Iterative BMA algorithm [Yeung, Bumgarner, Raftery 2005]
Pre-processing step: rank genes using the BSS/WSS ratio.
Initial step:
Repeat until all genes are processed:
Output: selected genes and models with their posterior probabilities.
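
The bodies of the "initial step" and "repeat" bullets are not spelled out on this slide. As a hedged sketch based on the published iterative BMA algorithm (Yeung, Bumgarner and Raftery 2005), the loop roughly works as below; the window size, the posterior-probability threshold, and the helper `bma_on_window` are assumptions for illustration.

```python
def iterative_bma(ranked_genes, bma_on_window, window_size=30, prob_threshold=1.0):
    """Sketch of the iterative BMA loop over a BSS/WSS-ranked gene list.

    bma_on_window(genes) is assumed to return a dict mapping each gene in the
    current window to its posterior inclusion probability (in percent).
    """
    window = list(ranked_genes[:window_size])       # initial step: top-ranked genes
    remaining = list(ranked_genes[window_size:])
    probs = bma_on_window(window)
    while remaining:
        kept = [g for g in window if probs[g] >= prob_threshold]
        if len(kept) == len(window):
            break                                   # nothing to swap out; stop early
        window = kept
        while remaining and len(window) < window_size:
            window.append(remaining.pop(0))         # refill window from the ranked list
        probs = bma_on_window(window)               # rerun BMA on the updated window
    return {g: probs[g] for g in window}            # selected genes with posterior probabilities
```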

Application 1: Classification of progression of chronic myeloid leukemia (CML)
Motivation: new candidates for prognostic studies in CML

Progression of CML
CML usually presents in chronic phase (CP), but in the absence of effective therapy, CP CML invariably transforms to accelerated phase (AP) disease, and then to an acute leukemia, blast crisis (BC). BC is highly resistant to treatment, and all treatments are more successful when administered during CP. Imatinib is most effective in early CP patients, with excellent survival (86% at 7 years). Currently there are limited clinical markers and no molecular tests that can predict the “clock” of CML progression for individual patients at the time of CP diagnosis, making it difficult to adapt therapy to the risk level of each patient.

Why predictors for CML progression?

Identification of CML progression biomarkers

Genes associated with CML progression

BMA selected genes using microarray data
Selected 6 genes over 21 models.
Repeated cross-validation 100 times:
Average Brier score = 0.21
Average prediction accuracy = 99.17%
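
For reference, the Brier score reported here is the mean squared difference between the predicted class probability and the observed binary outcome. A small sketch with assumed variable names, not taken from the slides:

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and binary outcomes (0/1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# Example: brier_score([1, 0, 1], [0.9, 0.2, 0.7]) -> ~0.05 (lower is better)
```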

PCR data: CP-early vs CP-late

Summary: CML data
BMA applied to microarray data consisting of patient samples in different phases of CML identified 6 signature genes (ART4, DDX47, IGSF2, LTB4R, SCARB1, SLC25A3).
Results validated the gene signature using quantitative PCR: the 6-gene signature is highly predictive of CP-early vs CP-late.
What is next? To identify biologically meaningful biomarkers for CML progression and response to therapy: biomarkers that are functionally related (connected in an underlying network) to known reference genes.

Application 2: Survival analysis

Results: Breast cancer data

Results: Breast cancer data [Annest, Bumgarner, Raftery, Yeung. BMC Bioinformatics 2009]

CBR Classification task
Similarity measure: weights provided by BMA for the selected features

CBR Classification task
Choose the class for which the average similarity score is highest.
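
Putting these two CBR slides together, here is a minimal sketch (assumed details, not the authors' exact implementation) of retrieval with BMA-derived feature weights, followed by choosing the class with the highest average similarity.

```python
import numpy as np

def weighted_similarity(query, case, weights):
    """Similarity between two expression profiles over the selected genes,
    weighting each gene by its BMA posterior probability (inverse weighted distance)."""
    dist = np.sqrt(np.sum(weights * (query - case) ** 2))
    return 1.0 / (1.0 + dist)

def cbr_classify(query, case_base, case_labels, weights):
    """Assign the class whose cases have the highest average similarity to the query."""
    labels = np.asarray(case_labels)
    sims = np.array([weighted_similarity(query, c, weights) for c in case_base])
    avg_sim = {cls: sims[labels == cls].mean() for cls in np.unique(labels)}
    return max(avg_sim, key=avg_sim.get)
```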

CBR Survival analysis task
Similarity measure: weights provided by BMA for the selected features

CBR Survival analysis task
Choose the class for which the average similarity score is highest.

Evaluation / Classification
Dataset summary (total number of samples, # training samples, # validation samples, number of genes): Leukemia 2, Leukemia 3.

Dataset | # classes | BMA-CBR | iterativeBMA | Other published results
Leukemia 2 | 2 | 20 genes, 1/34 errors | 20 genes, 2/34 errors | 5 genes, 1/34 errors
Leukemia 3 | 3 | 15 genes, 1/34 errors | 15 genes, 1/34 errors | ~40 genes, 1/34 errors

Evaluation / Prediction
Dataset summary (total number of samples, # training samples, # validation samples, number of genes): DLBCL (…,399 genes), Breast cancer (…,919 genes).

Dataset | BMA-CBR | iterativeBMA | Best other published results
DLBCL | 25 genes, p-value = … | 25 genes, p-value = … | 17 genes, p-value = …
Breast cancer | 15 genes, p-value = 2.14e-10 | 15 genes, p-value = 3.38e-10 | 5 genes, p-value = 3.12e-05

Conclusion
The combination of BMA and CBR provides excellent classification and prediction results. These results are promising for the application of CBR to bioinformatics tasks and data.

Conclusion: Future developments
Refine risk classes into more than two risk groups.
Refine the CBR algorithm.
Test on additional datasets.
Provide automatic interpretation of the classification / prediction, both for gene selection and for case-based reasoning.