JMP Discovery Summit 2016 Janet Alvarado


Novel Multivariate Approach for the Assessment of Product Comparability
JMP Discovery Summit 2016
Janet Alvarado (janet_alvarado@merck.com), Center for Mathematical Sciences

Outline
- Introduction
  - What is comparability? Regulations regarding establishing comparability; requirements
  - Current methods used to establish comparability
- Overview of suggested approach: Random Forest (RF) and the proximity matrix
- Case studies
- Demonstration
- Future work

Definition of Comparability
Comparable: "used to say that two or more things are very similar and can be compared to each other" (Merriam-Webster dictionary, online).
ICH Q5E (Biotechnological/Biological Products):
"The goal of the comparability exercise is to ascertain that pre- and post-change drug product is comparable in terms of quality, safety, and efficacy."
"The demonstration of comparability does not necessarily mean that the quality attributes of the pre-change and post-change products are identical; but that they are highly similar and that the existing knowledge is sufficiently predictive to ensure that any differences in quality attributes have no adverse impact upon safety or efficacy of the drug product."

Current Pharmaceutical Industry Practices to Determine Comparability
Univariate methods: equivalence, differences, intervals
Multivariate methods: PLS, PCA, cluster analysis

Suggested Approach Overview
A multivariate approach, suggested as a multivariate exploratory tool for the assessment of product comparability.
A combination of well-known multivariate methods: Random Forest (RF) -> Principal Coordinate Analysis (PCoA).
Accommodates continuous and categorical predictors, responses with two or more levels, and a wide range of ratios of number of cases to number of predictors.
Observations from different groups that lie close together in a plot of the principal coordinates are, for the most part, indistinguishable from one another.
Variable importance from the RF analysis can be used to identify which single variables contribute the most to the separation between groups; this can then be used to determine whether a more focused approach is needed. What is the impact of these variables on product safety? On product efficacy?

Algorithm Basis
1. Model the data using the Random Forest (RF) algorithm (Breiman & Cutler, 2001, as implemented by Liaw & Wiener in R).
2. Obtain a dissimilarity matrix from the RF proximity matrix: dissimilarity matrix d = 1 - proximity matrix.
3. Perform classical multidimensional scaling, also known as Principal Coordinate Analysis (Gower, 1966), on the dissimilarity matrix. Distances between pairs of points are Euclidean.
4. Calculate density ellipses for each group on a plot of the first two principal coordinates. Each density ellipse is based on a bivariate normal distribution.
5. Make a statement about similarity based on the amount of overlap between pairs of ellipses.

Comparison to Other Multivariate Approaches

Predictor variables. Current: both continuous and categorical. DA and PLS: continuous only (JMP).
Response variable. Current: categorical. DA: categorical, single response. PLS: continuous only, single or multiple.
Data structure. Current: accommodates any kind of relationship among variables. DA: requires assumptions with respect to within-group covariance. PLS: modeled relationships are linear.
Sample size N vs. number of predictors k. Current: handles k >> N. DA: requires N > k.
Group size. Current: increase in Type I error with small group sizes. DA: efficacy decreases with larger differences in group sizes. PLS: irrelevant.
Determination of variable importance. Current: model is fit to maximize node purity. DA: model is fit to maximize 'separation' of groups. PLS: model is fit to find the direction of maximum correlation among predictors that explains the maximum variance in the response space.
Outliers. Current: robust. DA: greatly affected. PLS: variants exist that are robust.

Advantages of Random Forest Models
Handle k >> N.
Do not assume linear features, or even features that interact linearly.
Handle both continuous and categorical variables.
Handle missing values.
Robust to outliers.
Robust against overfitting.
Built-in cross-validation (via the out-of-bag samples).

Disadvantages
Limited ability to extract linear combinations of features.
Limited interpretability.

Random Forest Model
[Figure: two example classification trees, each grown on its own bootstrap sample, splitting on X1 and X2 into terminal nodes labeled A, B, and C.]
Each tree is built on a bootstrap sample of the data (its training set), with no pruning.

Random Forest Model
[Figure: the same two trees, with counts of out-of-bag (OOB) cases shown at the terminal nodes.]
Classification error is based on the OOB samples: each case is classified only by the trees whose bootstrap sample did not include it.
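The OOB error estimate can be reproduced with any random forest implementation. A minimal sketch using scikit-learn (a stand-in for the R randomForest package cited in this talk, not the presenters' actual code), run on the iris data from Case Study 1:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# With oob_score=True, each case is scored using only the trees
# whose bootstrap sample did not contain that case.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_  # OOB classification error
print(f"OOB classification error: {oob_error:.3f}")
```

Because the OOB cases were never seen by the trees that score them, this error behaves like a built-in cross-validation estimate, with no separate hold-out set required.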

RF Model and Proximity Matrix
[Figure: an n x n proximity matrix, where n = number of cases; entries are marked "in same node", "not in same node", or "X: not in the OOB sample".]
Proximity(i, j): the total number of times that cases i and j ended up in the same terminal node of a tree, normalized by the total number of trees in the forest.
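The proximity definition above can be sketched directly from the terminal-node assignments. This simplified version (an illustration, not the authors' script) counts co-occurrence over all trees; Breiman's formulation can instead restrict the count to trees where both cases are out-of-bag:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)

# leaves[i, t] = index of the terminal node that case i reaches in tree t
leaves = rf.apply(X)                          # shape: (n_cases, n_trees)
n_cases, n_trees = leaves.shape

# proximity[i, j] = fraction of trees in which cases i and j fall
# into the same terminal node
proximity = np.zeros((n_cases, n_cases))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= n_trees
```

Since every case shares a terminal node with itself in every tree, the diagonal of the proximity matrix is always 1, and the matrix is symmetric with entries in [0, 1].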

RF Model and Dissimilarity Matrix
The dissimilarity matrix d records how often cases end up in separate terminal nodes of the RF: d(i, j) = 1 - proximity(i, j).
[Figure: Principal Coordinates plot with group density ellipses; when the ellipses are highly overlapped, the groups can be declared "similar".]
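Classical multidimensional scaling of the dissimilarity matrix amounts to double-centering the squared dissimilarities and taking an eigendecomposition. A minimal NumPy sketch, assuming d has already been computed (as 1 - proximity, or any other symmetric dissimilarity):

```python
import numpy as np

def pcoa(d, k=2):
    """Classical MDS / Principal Coordinate Analysis (Gower, 1966).

    d: (n, n) symmetric matrix of dissimilarities.
    Returns the first k principal coordinates and the fraction of the
    total positive eigenvalue mass ("variance explained") they carry.
    """
    n = d.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (d ** 2) @ J           # double-centered squared dissimilarities
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]       # sort eigenvalues, largest first
    evals, evecs = evals[order], evecs[:, order]
    coords = evecs[:, :k] * np.sqrt(np.clip(evals[:k], 0.0, None))
    explained = evals[:k].sum() / evals[evals > 0.0].sum()
    return coords, explained

# Sanity check: PCoA of Euclidean distances between 2-D points recovers
# the configuration up to rotation/reflection, so pairwise distances match.
rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords, explained = pcoa(d, k=2)
```

The `explained` ratio is the quantity reported in the case studies as "total variance explained by the first 2 PCoA's".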

Case Study 1: Iris dataset

Case Study 1: Iris dataset – Analysis Results
No overlap: the groups can be said to be different.
Total variance explained by the first two principal coordinates = 99.0%.

Case Study 1: Iris dataset – Analysis Results

Case Study 1: Iris dataset - Comparison to DA and PLS

Case Study 2: Site Product comparison

Case Study 2: Site Product comparison – Analysis Results
L2 and L3 can be said to be similar; L1 is different from both L2 and L3.
Total variance explained by the first two principal coordinates = 70.1%.

Case Study 2: Site Product comparison – Analysis Results

Case Study 2: Site Product comparison - Comparison to DA and PLS

Case Study 3: Raw Material Lot comparison

Case Study 3: Raw Material Lot comparison – Analysis Results
No overlap: the groups can be said to be different.
Total variance explained by the first two principal coordinates = 84.7%.

Case Study 3: Raw Material Lot comparison – Analysis Results

Case Study 3: Raw Material Lot comparison - Comparison to DA and PLS

Live Demonstration
JMP Script link (placeholder)

Summary
Suggested as a multivariate exploratory tool for the assessment of comparability.
A versatile tool.
Can be used to identify which variables contribute the most to the separation between groups, helping to prioritize resource allocation.
A good alternative to Orthogonal PLS-DA.

Future Work
Sensitivity analyses.
Calculate the amount of overlap between the ellipses.
Develop 'similarity' criteria based on the amount of ellipse overlap (ranging from highly similar to different).
Mitigate risk by defining weights for important variables according to their potential to impact safety and/or efficacy.
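One way the proposed ellipse-overlap calculation could be prototyped (a hypothetical Monte Carlo sketch, not the authors' method): fit a 95% bivariate-normal density ellipse to each group's first two principal coordinates, then estimate by rejection sampling what fraction of one ellipse's area also lies inside the other:

```python
import numpy as np
from scipy.stats import chi2

def in_ellipse(pts, mean, cov, q=0.95):
    """True for points inside the q-level bivariate-normal density ellipse."""
    diff = pts - mean
    md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return md2 <= chi2.ppf(q, df=2)

def overlap_fraction(mean1, cov1, mean2, cov2, n=200_000, seed=0):
    """Estimate the fraction of ellipse 1's area that lies inside ellipse 2."""
    rng = np.random.default_rng(seed)
    # Half-width of a square that fully encloses ellipse 1
    # (its largest semi-axis is sqrt(chi2 quantile * largest eigenvalue)).
    r = np.sqrt(chi2.ppf(0.95, df=2) * np.linalg.eigvalsh(cov1).max())
    pts = mean1 + rng.uniform(-r, r, size=(n, 2))
    in1 = in_ellipse(pts, mean1, cov1)
    in2 = in_ellipse(pts, mean2, cov2)
    return (in1 & in2).sum() / in1.sum()

cov = np.array([[1.0, 0.3], [0.3, 0.5]])
identical = overlap_fraction(np.zeros(2), cov, np.zeros(2), cov)             # coincident groups
separated = overlap_fraction(np.zeros(2), cov, np.array([10.0, 10.0]), cov)  # far-apart groups
```

A 'similarity' criterion could then map this fraction onto the proposed scale from highly similar to different.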

Acknowledgments
Nelson L. Afanador, PhD
Andy Liaw, PhD, and Matt Wiener, PhD: Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
CMS colleagues

Questions?