Three lessons I learned

Slides:



Advertisements
Similar presentations
Endemic or Outbreak? Differentiating recent transmission of an historic tuberculosis strain in New York City IUATLD-NAR 16 th Annual Meeting February 23-25,
Advertisements

My name is Dustin Boswell and I will be presenting: Ensemble Methods in Machine Learning by Thomas G. Dietterich Oregon State University, Corvallis, Oregon.
1 Harvard Medical School Mapping Transcription Mechanisms from Multimodal Genomic Data Hsun-Hsien Chang, Michael McGeachie, and Marco F. Ramoni Children.
Data Mining Classification: Alternative Techniques
Evaluating the Use of Outbreak Detection Algorithms to Detect Tuberculosis Outbreaks in Scotland Ben Tait Dr Janet Stevenson.
Paper presentation for CSI5388 PENGCHENG XI Mar. 23, 2005
Enterprise systems infrastructure and architecture DT211 4
Analysis of Molecular and Clinical Data at PolyomX Adrian Driga 1, Kathryn Graham 1, 2, Sambasivarao Damaraju 1, 2, Jennifer Listgarten 3, Russ Greiner.
Cost-Sensitive Bayesian Network algorithm Introduction: Machine learning algorithms are becoming an increasingly important area for research and application.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
LOGO Ensemble Learning Lecturer: Dr. Bo Yuan
The Broad Institute of MIT and Harvard Classification / Prediction.
Benk Erika Kelemen Zsolt
1 Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting Authors: A. Dupuy and R.M. Simon.
Wang Y 1,2, Damaraju S 1,3,4, Cass CE 1,3,4, Murray D 3,4, Fallone G 3,4, Parliament M 3,4 and Greiner R 1,2 PolyomX Program 1, Department.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Ensemble Methods: Bagging and Boosting
Introduction Use machine learning and various classifying techniques to be able to create an algorithm that can decipher between spam and ham s. .
CISC Machine Learning for Solving Systems Problems Presented by: Ashwani Rao Dept of Computer & Information Sciences University of Delaware Learning.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
Feature (Gene) Selection MethodsSample Classification Methods Gene filtering: Variance (SD/Mean) Principal Component Analysis Regression using variable.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Team Dogecoin: An Experience in Predicting Hospital Readmissions Acknowledgements The Problem Hospitals in the UK must keep track of which patients, once.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
Classification Ensemble Methods 1
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
1 TB Data Visualization and correlations in TB Patient Networks.
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
Machine Learning Usman Roshan Dept. of Computer Science NJIT.
Usman Roshan Dept. of Computer Science NJIT
Simple-Sequence Length Polymorphisms
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
David Amar, Tom Hait, and Ron Shamir
External multicentric validation of a COPD detection questionnaire.
Classification with Gene Expression Data
Predicting Azure Consumption using Ensemble Learning
Zaman Faisal Kyushu Institute of Technology Fukuoka, JAPAN
Machine Learning Logistic Regression
An Artificial Intelligence Approach to Precision Oncology
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods
Boosting and Additive Trees
Reading: Pedro Domingos: A Few Useful Things to Know about Machine Learning source: /cacm12.pdf reading.
Source: Procedia Computer Science(2015)70:
COMP61011 : Machine Learning Ensemble Models
Molecular study of two types of mutations in promoters of IL-2 and IL-10 genes in Iraqi patients with Tuberculosis Mazin S.Salman Awatif.
Machine Learning Logistic Regression
National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention
Review – TB transmission
Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study  Dr Timothy M Walker, MRCP, Camilla LC.
Genotypic analysis of genes associated with transmission and drug resistance in the Beijing lineage of Mycobacterium tuberculosis  J-R. Chang, C-H. Lin,
Machine Learning 101 Intro to AI, ML, Deep Learning
W. Sougakoff  Clinical Microbiology and Infection 
Machine learning Empirical Performance Analysis
MIS2502: Data Analytics Clustering and Segmentation
Somi Jacob and Christian Bach
S. Godreuil, F. Renaud, M. Choisy, J. J. Depina, E. Garnotel, M
Model generalization Brief summary of methods
Parametric Methods Berlin Chen, 2005 References:
Single Cell Regulatory Variation
Contact investigations for outbreaks of Mycobacterium tuberculosis: advances through whole genome sequencing  T.M. Walker, P. Monk, E. Grace Smith, T.E.A.
Genetic diversity of Mycobacterium tuberculosis isolates from foreign-born and Japan- born residents in Tokyo  M. Kato-Miyazawa, T. Miyoshi-Akiyama, Y.
Machine Learning with Clinical Data
Communicable Disease Surveillance Centre, London
Sofia Pediaditaki and Mahesh Marina University of Edinburgh
BCG Vaccination Dr Lika Nehaul CCDC / NPHS TB Programme Lead
Using Bayesian Network in the Construction of a Bi-level Multi-classifier. A Case Study Using Intensive Care Unit Patients Data B. Sierra, N. Serrano,
Laura Lane, Epidemiologist
Cytokine profiles can predict severe CRS
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

Three lessons I learned from machine learning Leonid Chindelevitch School of Computing Science Simon Fraser University

Tuberculosis Tuberculosis is a bacterial infection caused by Mycobacterium tuberculosis. Usually attacks lungs but can attack any kidney, …, etc.

Tuberculosis Tuberculosis is the most common infectious disease killer today. Its annual toll is 9 million new cases and 1.5 million deaths. Nearly every third person has latent TB! Maps the prevalence of TB per 100,000 people. Note that 1/3 of world’s population has been infected by M. Tuberculosis but most do not cause the disease.

MIRU-VNTR Mycobacterium interspersed repetitive units – variable number tandem repeats is a technique for fingerprinting strains of TB at up to 24 genetic loci.

Infection ≈ branching process

Evaluating a TB program: UK Every reported case is genotyped: MIRU-VNTR After seeing multiple cases having the same type, a “cluster investigation” can be initiated This leads to detecting further clustered cases Question: does this alter cluster growth rates?

The data 1064 cases in a total of just over 100 clusters For each one, we know: the strain lineage (more on that soon), the cluster level (national/regional/local), the patient’s origin (within or outside the UK), sex and age group, the date of diagnosis, location of the TB (in the lungs or outside), and whether it’s been discovered pre or post cluster investigation

Growth of selected clusters

A linear regression model Use all the available variables to model delay Basic regression: Tj – Tj-1 ~ strain + cluster level + + origin[j] + sex[j] + age[j] + lungs[j] + after[j]

A linear regression model Use all the available variables to model delay Basic regression: Tj – Tj-1 ~ strain + cluster level + + origin[j] + sex[j] + age[j] + lungs[j] + after[j] New regression: Tj – Tj-1 ~ strain + cluster level + + origin[j] + sex[j] + age[j] + lungs[j] + after[j] + j Conclusion: the investigation had no clear effect Mears et al. Evaluation of the National TB Strain Typing Service. Internal report.

Lesson n. 1 Even when you think your model includes all the information you have available, make sure you are not leaving anything out by mistake!

TB lineages vary by geography The location of initial infection. Also linked to virulence. Read et al. Major Mycobacterium tuberculosis Lineages Associate with Patient Country of Origin. Journal of Clinical Microbiology (2009), vol. 47 n. 4, 1119-1128

The data Two sources: MIRU-VNTRplus: 186 strains (chosen for their diversity) NYC DOHMH: 2060 strains (collected over 10+ years) Each strain is labelled with one of six lineages: East African-Indian East Asian Euro-American Indo-Oceanic M. Africanum M. Bovis

Current classifiers MIRU-VNTRplus Up to 89% accuracy using k-nearest neighbours TB-Insight Up to 94% accuracy with a Bayesian network

Our approach: StackTB Objective: To achieve a higher classification accuracy through the application of more sophisticated Machine Learning methods Strategy: Ensemble together a collection of machine learning algorithms through stacking

Decision trees

Decision trees

Random forests

Gradient boosted machines Figure from http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf Source: http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

k – Nearest neighbours

Multinomial logistic regression

Stacking for the final model Inputs Random Forest Gradient Boosted Machine k Nearest Neighbours Multinomial Regression Outputs

Results Ensemble methods led to an improvement in classification accuracy over existing methods

But simple rules also do well! combine threshold-based rules in a specific order we achieve nearly 95% accuracy (cf. StackTB, 96%) trade-offs between accuracy and interpretability

Boolean compressed sensing Source: Malioutov, D and Varshney, K. Exact Rule Learning via Boolean Compressed Sensing Proceedings of ICML-13 (2013), pp. 765-773.

Lesson n. 2 Do not neglect the opportunity to create a more interpretable predictor if the resulting accuracy loss is within an acceptable range.

Pediatric vasculitis A rare childhood disease; vessel inflammation Patient presentation frequently looks like this: Source: http://ilearnpeds.com/Assets/MCH/Images/Media/4413-media.jpg

The data 148 observations, 39 of them have the disease Over 63,000 variables measured for each one (gene expression count data – using RNA-Seq) Some additional clinical information for each patient (such as the severity of the disease)

The approach Use univariate association to link probes with the phenotype (presence or absence of PV) Using penalized ridge/lasso regression identify the best combinations of these predictors Correlate the resulting gene sets with known gene sets or pathways (enrichment analysis)

Lesson n. 3 Do not promise to find anything significant in a dataset that contains too few positive cases!

Conclusion: the three lessons Even when you think your model includes all the information you have available, make sure you are not leaving anything out by mistake! Do not neglect the opportunity to create a more interpretable predictor if the resulting accuracy loss is within an acceptable range. Do not promise to find anything significant in a dataset that contains too few positive cases!

Acknowledgments