Analysis of the Full Ewing EWS/FLI Screen
Ken Ross
10/22/10
The Broad Institute of MIT and Harvard

Outline
– Review of analysis pipeline
– Analysis of Ewing EWS/FLI screen
  – Screen overview
  – June 2010 screen issues and bad plates
    – Bad plates selected to be repeated:
      – Plates with known technical problems
      – Plates with low fraction of genes changing in the correct direction
      – Plates with poor summed score and weighted summed score z-factors
      – Plates with high hit rates
  – Combined screen
    – Good plates from the June 2010 screen
    – Plates repeated from the June screen
    – Pilot screen data
  – Viability prediction
  – Hits
– Summary and conclusions

Screen Analysis Methodology
Data processing includes:
– Filtering
– Well scaling by forming ratios to reference genes
– Scaling and normalization
– Compounds scored based upon a combination of methods:
  – Summed score
  – Weighted summed score
  – Naïve Bayes
  – KNN classifier
  – SVM classifier
Trade-offs among analysis parameters were based in part upon maximizing the Z-factor (Zhang et al.; see the definition below):
– Z-factor = 1 would be ideal
– Z-factors > 0 are good: more than 3 standard deviations separate the control means
– Typically we see Z-factors > 0.5 for EWS/FLI plates
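For reference, the Z-factor of Zhang et al. (1999) compares the separation of the positive- and negative-control score distributions; this is the standard definition, with EWS/FLI knockdown wells as positive controls and DMSO wells as negative controls:

```latex
Z = 1 - \frac{3\,(\sigma_{p} + \sigma_{n})}{\lvert \mu_{p} - \mu_{n} \rvert}
```

Here \(\mu_p, \sigma_p\) and \(\mu_n, \sigma_n\) are the mean and standard deviation of the positive- and negative-control scores. Z > 0 exactly when the control means are separated by more than three standard deviations of each control, matching the rule of thumb above.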

Naïve Bayes Classifier
– Based upon Bayes' rule and "naively" assumes feature independence:
  Pr[h|E] = (Pr[E1|h] × Pr[E2|h] × … × Pr[EN|h] × Pr[h]) / Pr[E]
  where Ei is the evidence for the hypothesis (in this case, gene ratios as evidence for the cells transforming)
– Probabilities for continuous variables (like gene expression ratios) are modeled as independent Gaussian or kernel distributions
– Naïve Bayes often works even when the independence assumption does not hold
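A minimal Gaussian naive Bayes sketch in NumPy, assuming class-conditional Gaussians as described above (function and variable names are hypothetical; the screen's implementation may use kernel density estimates instead):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class feature means/SDs and class priors Pr[h]."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_proba(X, params):
    """Posterior Pr[h|E] ∝ Pr[E1|h]···Pr[EN|h]·Pr[h], computed in log space."""
    log_post = []
    for mu, sd, prior in params.values():
        ll = -0.5 * (((X - mu) / sd) ** 2 + np.log(2 * np.pi * sd ** 2))
        log_post.append(ll.sum(axis=1) + np.log(prior))
    log_post = np.column_stack(log_post)
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_post)
    return p / p.sum(axis=1, keepdims=True)          # divides out Pr[E]
```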

K-nn classifier example: K=5, 2 genes, 2 classes
[Figure: samples projected into gene space; axes: gene 1 vs. gene 2; two classes: orange and black]
– Project the training samples into gene space
– Project the unknown sample (?) into the same space
– "Consult" its 5 closest neighbors: 3 black and 2 orange, so the majority vote assigns the unknown sample to the black class
– Distance measures: Euclidean distance, 1 − Pearson correlation, KL divergence, …
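A minimal sketch of this vote using Euclidean distance (names are hypothetical; as the slide notes, other distance measures could be swapped in):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training samples."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to x
    nearest = np.argsort(dist)[:k]               # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # e.g., 3 black vs. 2 orange -> black
```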

Support Vector Machine (SVM) Prediction
– An SVM maps input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed
– Parallel hyperplanes are constructed on each side of the hyperplane that separates the data
– The separating hyperplane is the one that maximizes the distance between the parallel hyperplanes
– The assumption is that a larger margin (distance between the parallel hyperplanes) yields a classifier with lower generalization error
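In the standard linear formulation, this margin maximization reads as the following optimization (the textbook hard-margin form, not anything specific to this screen):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,N
```

The parallel hyperplanes \(w \cdot x + b = \pm 1\) bound a margin of width \(2/\lVert w \rVert\), so minimizing \(\lVert w \rVert\) maximizes the separation.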

Ewing EWS/FLI Screen Overview
– 36 chemical library plates (31 in June 2010 screen and 5 in pilot screen), screened in duplicate:
  – DOS libraries (25 plates – 8000 compounds)
  – Natural products (4 plates – 1280 compounds)
  – Commercial compounds (2 plates – 640 compounds)
  – Bioactive compounds (6 plates – 1920 compounds)
– Positive control: EWS/FLI knockdown (32 per plate)
– Negative control: DMSO (32 per plate)
– LMA plates generated and detected by the GAP
– 138-gene signature for readout (134 in pilot):
  – 6 reference genes
  – 89 genes up-regulated by EWS/FLI knockdown
  – 49 genes down-regulated by EWS/FLI knockdown (45 in pilot)

June 2010 Screen Quality Overview
– 8 plates were obviously bad and needed to be redone:
  – 6 had bad PCR
  – 1 flipped plate (poor performance when un-flipped; possibly flipped back and forth during wash)
  – 1 plate with an incorrect detector program
– Remaining plates were processed in many batches, with obvious batch effects
– Problems were evident in some plates with:
  – Low fraction of genes changing in the expected direction: plates with a fraction of good genes changing in the expected direction < 0.7 were considered bad (6 plates)
  – Poor z-factors for summed score: summed score z-factors < -0.5 were considered bad (good genes; 5 plates)
  – Poor z-factors for weighted summed score: weighted summed score z-factors < 0.2 were considered bad (6 plates); platewise weights were calculated with good genes because of plate-to-plate variation
  – A large group of wells without beads / overall low bead count
  – An excessive number of hits: a plate was considered to have a high number of hits if SS hits > 50, WSS hits > 50, Naïve Bayes hits > 60, or KNN hits > 10
(these thresholds are consolidated in the sketch below)
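A sketch consolidating the numeric QC thresholds above into a single check (the function and argument names are hypothetical; the actual pipeline applied these criteria separately):

```python
def flag_bad_plate(frac_expected_dir, ss_zfactor, wss_zfactor,
                   ss_hits, wss_hits, nb_hits, knn_hits):
    """Return the list of reasons a plate fails QC (empty list = pass)."""
    reasons = []
    if frac_expected_dir < 0.7:
        reasons.append("low fraction of good genes changing in expected direction")
    if ss_zfactor < -0.5:
        reasons.append("poor summed-score z-factor")
    if wss_zfactor < 0.2:
        reasons.append("poor weighted-summed-score z-factor")
    if ss_hits > 50 or wss_hits > 50 or nb_hits > 60 or knn_hits > 10:
        reasons.append("excessive hit count")
    return reasons
```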

Summary of Problematic Plates from June 2010
– … plates total
– 2 high-hit-count plates are replicates of the same chemical plate – the hits might be real
– 2 other chemical plates have problems with both replicates
– 20 plates repeated in September 2010
– One low-bead-count plate (BR ) was OK after redetection
– One plate with a marginal WSS z-factor (0.1) and marginally low bead counts was kept (BR )
– Two plates with moderately high hit rates in only SS and Naïve Bayes in both replicates were kept (BR and BR )

EWS/FLI Screen Analysis Approach
– Screen consists of 42 plates from June 2010, 20 plates repeated in September 2010, and 10 plates from the pilot screen
– Each plate analyzed separately:
  – Positive control: EWS/FLI knockdown (32 per plate; pilot has 16)
  – Negative control: DMSO (32 per plate; pilot has 16)
– Filtering:
  – Reference gene: GAPDH below the EF3-2 median minus 4 median absolute deviations (with a minimum level of 6000)
  – Bead count: more than 10 probes with 6 or fewer beads
– Well normalization: marker genes ratioed to the mean of 3 reference genes (ACTB, HINT1, and TUBB) – see the sketch below
– Each plate analyzed twice:
  – All genes
  – Good genes: genes are considered 'good' if they change in the expected direction and have z-factors > -30
– Five methods are used to evaluate hits on each plate, and the hit lists are then combined together:
  – Summed score
  – Weighted summed score
  – Naïve Bayes
  – KNN
  – SVM
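A minimal sketch of the well normalization step, assuming each marker-gene measurement in a well is divided by the mean of the three reference genes from the same well (the gene names come from the slide; the function name is hypothetical):

```python
import numpy as np

REFERENCE_GENES = ["ACTB", "HINT1", "TUBB"]

def normalize_well(expression, genes):
    """expression: 1-D array of intensities for one well;
    genes: list of gene names aligned with `expression`."""
    ref_idx = [genes.index(g) for g in REFERENCE_GENES]
    ref_mean = np.mean(expression[ref_idx])
    return expression / ref_mean  # marker genes become ratios to the reference mean
```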

Summed Scores (All Genes)
[Figure: per-well summed scores by plate; legend: DMSO, EF3-2 knockdown, Compounds, Luciferase; plates grouped as June 2010, Sept 2010, and pilot]
– Batch effects are obvious here
– The recent batch looks much better

Summed Scores (Good Genes)
[Figure: per-well summed scores by plate; legend: DMSO, EF3-2 knockdown, Compounds, Luciferase; plates grouped as June 2010, Sept 2010, and pilot]
– Batch effects are obvious here
– Different good genes on each plate exaggerate plate-to-plate differences

Z-Score of Summed Scores (All Genes)
[Figure: per-well summed score z-scores by plate; legend: DMSO, EF3-2 knockdown, Compounds, Luciferase; plates grouped as June 2010, Sept 2010, and pilot]
– The z-score helps make scores comparable among plates
– The recent batch still looks better

Z-Score of Summed Scores (Good Genes)
[Figure: per-well summed score z-scores by plate; legend: DMSO, EF3-2 knockdown, Compounds, Luciferase; plates grouped as June 2010, Sept 2010, and pilot]
– The z-score helps make scores comparable among plates
– Even with the different-sized signatures, scores seem comparable

Heatmap for EWS/FLI Screen Plate Means
[Figure: heatmap of plate-mean expression; rows split into down genes and up genes; columns: DMSO, EF3-2 knockdown, Compounds, Luciferase]

EWS/FLI Screen Plate Z-Factors (All Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; all genes
– Mean Z-factors:
  – Summed score mean z-factor = 0.36
  – Weighted summed score mean z-factor = 0.45
– All z-factors are better than the thresholds set after the June 2010 screen

EWS/FLI Screen Plate Z-Factors (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes
– Mean Z-factors:
  – Summed score mean z-factor = 0.35
  – Weighted summed score mean z-factor = 0.51
– All z-factors are better than the thresholds set after the June 2010 screen

Fraction of Genes Changing in Expected Direction
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes
– Mean fraction of genes changing in the expected direction = 0.74
– All plates have 65% or more of all genes changing in the expected direction

Fraction of Genes Changing in Expected Direction
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes
– Mean fraction of genes changing in the expected direction:
  – Up genes = 0.84
  – Down genes = 0.56
  – All genes = 0.74
– The balance of up and down genes changing varies considerably (see the sketch below)
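A sketch of how this fraction can be computed, assuming the "expected direction" of each gene is judged from the mean knockdown-vs-DMSO difference (array names and the aggregation by plate means are assumptions):

```python
import numpy as np

def fraction_expected_direction(knockdown, dmso, expected_up):
    """knockdown, dmso: (wells x genes) arrays of normalized ratios;
    expected_up: boolean array, True where EWS/FLI knockdown should
    raise the gene and False where it should lower it."""
    delta = knockdown.mean(axis=0) - dmso.mean(axis=0)
    correct = np.where(expected_up, delta > 0, delta < 0)
    return correct.mean()  # fraction of genes moving as expected
```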

Histogram of Z-Score of Summed Scores (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes
– Summed score z-score normalized after calculation so plate data can be compared
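A sketch of the plate-wise z-scoring that makes scores comparable across plates; the slide does not specify whether each plate is centered against its DMSO wells or all wells, so the DMSO-based version shown here is an assumption:

```python
import numpy as np

def zscore_to_dmso(scores, dmso_scores):
    """Z-score a plate's summed scores against that plate's DMSO wells."""
    mu, sd = dmso_scores.mean(), dmso_scores.std(ddof=1)
    return (scores - mu) / sd
```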

Histogram of Z-Score of Weighted Summed Scores (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes; plate-specific weights
– Weighted summed score z-score normalized after calculation so plate data can be compared

Hit Distribution (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: ratios with the 3 best reference genes; good genes; plate-specific weights
– 5 methods for calling hits (combined as sketched below):
  – Summed score probability > 0.5
  – Weighted summed score probability > 0.5
  – Naïve Bayes > 0.5
  – KNN
  – SVM
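The slides say the per-method hit lists are "combined together" without giving the rule; a minimal sketch of one plausible combination (the min_methods knob and all names are hypothetical, not the pipeline's actual rule):

```python
def call_hit(ss_prob, wss_prob, nb_prob, knn_call, svm_call, min_methods=1):
    """Combine the five per-method calls for one compound well."""
    calls = [ss_prob > 0.5, wss_prob > 0.5, nb_prob > 0.5,
             bool(knn_call), bool(svm_call)]
    return sum(calls) >= min_methods  # hit if enough methods agree
```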

Summed Scores Hit Distribution (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes
– The relationship between score and probability is not linear, but the best probabilities have the highest scores

Weighted Summed Scores Hit Distribution (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: no scaling; ratios with the 3 best reference genes; good genes; plate-specific weights
– The relationship between score and probability is not linear, but the best probabilities have the highest scores

Hit Distribution with Scores (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: ratios with the 3 best reference genes; good genes; plate-specific weights
– 7 methods for calling hits:
  – Summed score probability > 0.5
  – Summed score z-score > 3
  – Weighted summed score probability > 0.5
  – Weighted summed score z-score > 3
  – Naïve Bayes > 0.5
  – KNN
  – SVM

Hit Distribution Versus Plate (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: ratios with the 3 best reference genes; good genes; plate-specific weights
– 7 methods for calling hits:
  – Summed score probability > 0.5
  – Summed score z-score > 3
  – Weighted summed score probability > 0.5
  – Weighted summed score z-score > 3
  – Naïve Bayes > 0.5
  – KNN
  – SVM

Hit Distribution Versus Plate Pairs (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: ratios with the 3 best reference genes; good genes; plate-specific weights
– 7 methods for calling hits:
  – Summed score probability > 0.5
  – Summed score z-score > 3
  – Weighted summed score probability > 0.5
  – Weighted summed score z-score > 3
  – Naïve Bayes > 0.5
  – KNN
  – SVM

Paired Plate Hit Distributions (Good Genes)
[Figure panels: Summed Score Probability > 0.5 Hits; Z-Score of Summed Score > 3 Hits; Weighted Summed Score Probability > 0.5 Hits; Z-Score of Weighted Summed Score > 3 Hits; Naïve Bayes Probability > 0.5 Hits; KNN Hits; SVM Hits]

Viability Prediction in Ewing Data
– Viability prediction model developed with data from 3 development plates and the pilot screen:
  – Cytotoxic compound plate
  – Compound plate from the Kung lab with compounds predicted active in Ewing sarcoma
  – Phenothiazine compound plate
  – EWS/FLI pilot screen data (5 chemical plates in duplicate from ChemBiology)
– Viability prediction with KNN:
  – Sample relative viability defines cells in a well as either alive (>= 0.2) or dead (< 0.2)
  – Development data split into train and test, with 25% of the combined data from the 3 development plates and pilot screen for training and the remainder for testing
  – Data median-range-scaled together, and each sample normalized (subtract median and divide by median absolute deviation)

Classifiers for Viability Prediction (25% of All Data for Training, KNN, Normalized)
– Ewing development plates: 25% of the data from the 3 development plates and pilot screen used to train a viability model
– Used day-2 viability relative to the DMSO mean of the day-2 relative to day-0 viability ratio
– Sample relative viability defines cells in a well as either alive (>= 0.2) or dead (< 0.2)
– Median-range-scaled data
– Samples normalized by subtracting the median and dividing by the median absolute deviation
– KNN classifier (see the sketch below): K = 3; cosine distance; distance weighting; 10 features selected by signal-to-noise
– Performance with normalization is slightly better
[Figure panels: 25% combined training – LOOCV results; held-out 75% combined testing]
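A sketch of the described classifier using scikit-learn, assuming Golub-style signal-to-noise, (μ_alive − μ_dead) / (σ_alive + σ_dead), for the feature ranking; the function names and epsilon guards are assumptions, not the pipeline's actual code:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def signal_to_noise(X, y):
    """Rank features by |mu1 - mu0| / (sd1 + sd0) between classes 1 and 0."""
    X1, X0 = X[y == 1], X[y == 0]
    return np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-9)

def train_viability_model(X, y, n_features=10):
    # Per-sample normalization: subtract median, divide by MAD.
    med = np.median(X, axis=1, keepdims=True)
    mad = np.median(np.abs(X - med), axis=1, keepdims=True) + 1e-9
    Xn = (X - med) / mad
    top = np.argsort(signal_to_noise(Xn, y))[-n_features:]  # best 10 features
    clf = KNeighborsClassifier(n_neighbors=3, metric="cosine",
                               weights="distance")
    clf.fit(Xn[:, top], y)
    return clf, top
```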

Applying Viability Prediction to the EWS/FLI Screen
– Ewing development plates: 25% of the data from the 3 development plates and pilot screen used to train a viability model
  – Used day-2 viability relative to the DMSO mean of the day-2 relative to day-0 viability ratio
  – Sample relative viability defines cells in a well as either alive (>= 0.2) or dead (< 0.2)
  – Median-range-scaled data
  – Samples normalized by subtracting the median and dividing by the median absolute deviation
  – KNN classifier: K = 3; cosine distance; distance weighting; 10 features selected by signal-to-noise
– Seems to have some success:
  – Some known toxic compounds are predicted dead
  – Probably many of the 389 compound wells that were filtered would also be predicted dead

Hit Distribution with Viability (Good Genes)
– 72 plates (42 from June 2010, 20 from September 2010, and 10 from the pilot screen)
– Each plate analyzed separately: ratios with the 3 best reference genes; good genes
– KNN model used to predict viability, trained on the pilot screen and development plates
– 5 methods for calling hits:
  – Summed score probability
  – Weighted summed score probability
  – Naïve Bayes
  – KNN
  – SVM

Viability and Subsignature Prediction Conclusions / Future Work
– Viability prediction seems to be working reasonably well:
  – A reasonable number of wells that made it through reference gene filtering are being classified as "dead"
  – Many of the hits appear to be in "dead" wells
  – Evaluate with the secondary screen, where there will be viability measurements
– More work with viability prediction:
  – Need to further explore methods for working across different batches of data
  – How much training data is needed? Would more training data improve the models?
  – What kind of training samples are needed? Can we use several standard test plates?
  – Further evaluation of data scaling, use of log expression ratios, and other model types
  – Try other types of classifiers, e.g., SVM and Naïve Bayes
– Other work with subsignatures in GE-HTS data

Conclusions / Future Work
– Quality of the screen data is not ideal but workable:
  – Repeat plates and pilot screen plates have the best quality
  – Key to salvaging poor data is the ability to follow up a sufficiently large number of hits
– Analysis methods needed to accommodate data obtained in many batches and to be robust to batch variations:
  – Analyzed plates separately and then collapsed the results:
    – Used plate-specific models for hit selection
    – Each plate used its own set of 'good' genes (plate dependent)
    – Fortunately there are 32 positive and 32 negative controls on each plate
  – SVM seems to suffer the most with plate-by-plate analysis because the wide separation between controls allows divergent models
  – Overlap of hits between replicates suggests that the batch effect problems have been at least partially dealt with
– Can the predicted viability be used to avoid 'hits' that just kill the cells?