The Impact of Functional Redundancy on Molecular Signatures
Alexandru G. Floares
SAIA - Solutions of Artificial Intelligence Applications, Cluj, Romania
Signature Uniqueness: Wishful Thinking?
Discovering signatures from big omics data using machine learning leads to precision medicine.
Prevailing conception: for a given biomedical condition (e.g., cancer) and class of biomolecules (e.g., miRNA), there should be a unique signature.
However, many classifiers proposed for similar data and biomedical problems have almost non-overlapping lists of molecules.
Alexandru Floares - Rome, October, 2016
The Fundamental Non-Uniqueness of Molecular Signatures
If we want the signature to be the minimal list of the most relevant biomarkers, it cannot be unique.
This is not a technical problem but a fundamental one: evolution seems to favor functional redundancy, which is at the foundation of the robustness of complex systems.
The most obvious functional redundancy for miRNA: many miRNAs regulate the same mRNA, and mRNAs are functionally redundant, too.
This fundamental functional redundancy leads to equivalent but different minimal signatures.
Functional Redundancy Implications
Many intriguing omics facts might be simple reflections of functional redundancy.
E.g., cancer classifiers built on similar data may have almost non-overlapping lists of molecules.
The unexpected heterogeneity of cancer mutations: similar phenotypes but different mutation patterns.
Redundancy manifests itself in all normal and pathological functions, in diagnosis, prognosis, treatment, and so on.
Accuracy, Robustness and Transparency
What are realistic goals if uniqueness is impossible? Predictive models should be:
Accurate: e.g., > 95%
Robust: generalize well to new, unseen cases
Transparent: e.g., trees/rules
We will illustrate some aspects of this vision using the TCGA miRNA datasets.
The TCGA miRNA Datasets
The biggest NGS miRNA dataset: 9 cancers (C) and normal tissue (N).
Focus on two binary classifications: C vs N and BRCA vs N.
Details in these papers (ResearchGate, Alexandru Floares):
Bigger Data is Better for Molecular Diagnosis Tests Based on Decision Trees (Best Paper Award, DMBD 2016)
Exploring the Functional Redundancy of miRNA in Cancer with Computational Intelligence
Maximum Relevance, minimum Redundancy
Using ML, we can build powerful predictive biomedical models.
Strong bias toward the highest accuracy and the minimum number of variables.
Justified: we want accurate and cheap omics tests.
This follows the general principle of Maximum Relevance and minimum Redundancy (MRmR).
This strategy alone is inadequate for exploring, understanding, and pragmatically exploiting redundancy.
For redundant systems, we should develop redundant models!
Unifying the MRmR and MRMR Principles
We developed ML methodologies capable of producing both:
The best omics tests (most accurate, minimum number of variables), based on the Maximum Relevance, minimum Redundancy (MRmR) principle, and
The best redundant models (most accurate, maximum number of relevant variables), based on the Maximum Relevance, Maximum Redundancy (MRMR) principle.
The Ideal & Best Classifier
Highly performant with minimal data preprocessing: resistant to outliers, to a reasonable amount of missing values, to correlated variables, and to unbalanced classes.
Capable of generating large sets of highly accurate models, both of the MRmR and of the MRMR class.
Resistant to overfitting.
We tested RF, boosted C5 DT, SVM, GP, DL NN, and SGB; SGB was the best.
Hyperparameter Optimization (Grid Search) I
Not especially useful for miRNAs (AUC 0.99), but useful for other omics data.
Hyperparameter optimization and ensemble methods increase the size of the model pool from which the MRmR and MRMR models are selected.
Even if accuracy is not much increased by parameter tuning, other aspects of the models can be improved.
E.g., a larger ensemble of shorter trees could generalize better than a smaller ensemble of larger trees.
Hyperparameter Optimization (Grid Search) II
Number of folds for CV. Values: 5, 10, 20, 50. Best: 10-fold CV.
Learn Rate. Values: 0.001, 0.01, 0.1. Best: 0.01 (test AUC = 1).
Max Nodes. Values: 2, 4, 6, 8, 9. Best: 2 (AUC 1.000). Interesting: this indicates that miRNA interactions are not essential, the ensemble being composed of decision stumps.
Min Child. Values: 1, 2, 5, 10, 25, 50, 100, 200. Best: 5.
Subsample. Values: 0.1, 0.2, 0.25, 0.3, 0.5, 0.75, and 0.9. Best: 0.5, which gives the best accuracy and prevents overfitting.
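The grid search above boils down to an exhaustive loop over all hyperparameter combinations. A minimal sketch in Python, where `toy_score` is a hypothetical stand-in for training SGB with 10-fold CV and returning the test AUC (the real scoring function is the model training described on these slides):

```python
from itertools import product

# Hyperparameter grid from the slide (stochastic gradient boosting).
GRID = {
    "learn_rate": [0.001, 0.01, 0.1],
    "max_nodes": [2, 4, 6, 8, 9],
    "min_child": [1, 2, 5, 10, 25, 50, 100, 200],
    "subsample": [0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 0.9],
}

def grid_search(score_fn):
    """Score every configuration; return (best_score, best_config)."""
    best_score, best_cfg = float("-inf"), None
    keys = sorted(GRID)
    for values in product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_score, best_cfg

# Hypothetical stand-in for "train SGB + 10-fold CV -> test AUC";
# it simply rewards the slide-reported optima so the loop is testable.
def toy_score(cfg):
    target = {"learn_rate": 0.01, "max_nodes": 2, "min_child": 5, "subsample": 0.5}
    return sum(cfg[k] == v for k, v in target.items())

best_score, best_cfg = grid_search(toy_score)
```

In practice the scoring function dominates the cost, so the same loop is usually parallelized over configurations.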
miRNA Importance: A Redundancy Signature?
Long Tail & Redundancy
A few variables with high importance and a very long tail of variables with slowly decreasing importance.
For non-redundant systems, either there is no such long tail, or it represents mainly noise.
Functional analysis shows that the long tail contains cancer-related miRNAs, not noise.
Thus, for redundant systems, a long tail could be a redundancy mark.
Long Tail & Redundancy: Univariate Models
First, build a model with all 644 features.
Select the features with importance > 3.5 (136 features; 21%).
Build a univariate model for each of the 136 selected features.
The misclassification error is surprisingly good for all univariate models: it ranges from 0.1068 to 0.2705, with a mean of 0.2206.
This is further evidence of the underlying redundancy.
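The univariate-model protocol can be sketched as a short screening loop. The importance scores and the single-feature error function below are toy stand-ins, not the actual 644-feature TCGA values:

```python
def screen_features(importances, threshold=3.5):
    """Indices of features whose importance exceeds the threshold,
    sorted from most to least important."""
    keep = [i for i, imp in enumerate(importances) if imp > threshold]
    return sorted(keep, key=lambda i: importances[i], reverse=True)

def univariate_errors(selected, fit_and_score):
    """Fit one single-feature model per selected feature; map each
    feature to its misclassification error."""
    return {f: fit_and_score([f]) for f in selected}

# Toy data: 10 features, 5 of them above the 3.5 importance cutoff.
importances = [9.0, 0.5, 4.2, 1.1, 7.3, 0.2, 3.6, 2.9, 0.9, 5.0]
selected = screen_features(importances)
# Hypothetical error function standing in for training a real model.
errors = univariate_errors(selected, lambda f: 0.1 + 0.01 * f[0])
mean_error = sum(errors.values()) / len(errors)
```

On the real data the same loop yields the 136 univariate models and the 0.1068-0.2705 error range reported above.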
Long Tail & Redundancy: In Silico Knockdown of miRNAs
Eliminate one miRNA at a time from the bottom/top of the importance list.
The results are typical for a redundant system: more than 100 miRNAs can be removed from the 136 with importance > 3.5 (73%), and the AUC remains unchanged at 1.00.
MRmR model: max AUC 1.00 with a minimum of 7 miRNAs.
Candidates for omics Dx tests (AUC > 0.95): AUC 0.96 with 4 miRNAs, and AUC 0.96 with 3 miRNAs.
MRMR model: max AUC 1.00 with a maximum of 136 miRNAs.
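The in silico knockdown reduces to a greedy loop that drops the least important miRNA while the AUC holds. Here `toy_auc` is a hypothetical stand-in for retraining the classifier on the reduced feature list, chosen to mimic the slide's 136 → 7 result:

```python
def knockdown(features, auc_of, tol=1e-9):
    """Greedily remove features from the bottom of the importance-sorted
    list while the AUC does not drop; returns the minimal list (MRmR)."""
    kept = list(features)
    base = auc_of(kept)
    while len(kept) > 1:
        trial = kept[:-1]              # knock down the least important miRNA
        if auc_of(trial) >= base - tol:
            kept = trial
        else:
            break                      # AUC dropped: keep the current list
    return kept

# Toy AUC: stays at 1.00 down to 7 features, then degrades (mimicking
# the reported result that 129 of 136 miRNAs are removable at AUC 1.00).
toy_auc = lambda feats: 1.00 if len(feats) >= 7 else 0.96
minimal = knockdown(list(range(136)), toy_auc)
```

Knocking down from the top of the list instead probes how well the long tail alone can classify, which is the redundancy question.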
Learning Curves BRCA vs NORM I
Training size increases from 22 to 865 in steps of 9.
Class proportions of the original dataset are preserved.
Each dataset was partitioned into a 75% training set and a 25% fresh test set.
Only the training set was used for 3-fold CV.
CV was repeated 100 times.
Learning Curves BRCA vs NORM II
By repeating CV 100 times, we mimic 100 studies with various numbers of different patients.
All remaining data were used again to test generalization capability, especially for small-sample-size simulated studies.
For example, for a 22-patient dataset: ~16 patients (75%) for CV, 6 (25%) for the test set, and 843 (865 total − 22) for generalization testing.
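The bookkeeping for one simulated study can be checked in a few lines; the numbers reproduce the 22-patient example above, assuming the 75% CV share is rounded to the nearest patient:

```python
def split_counts(n_total, n_study, train_frac=0.75):
    """Sizes of (CV set, fresh test set, generalization pool) for one
    simulated study drawn from the full cohort."""
    n_cv = round(n_study * train_frac)   # patients used for 3-fold CV
    n_test = n_study - n_cv              # fresh 25% test set
    n_general = n_total - n_study        # all remaining patients
    return n_cv, n_test, n_general

# The slide's example: a 22-patient study out of 865 total.
counts = split_counts(865, 22)
```

The same function applies to every training size on the 22-to-865 grid, so the generalization pool shrinks as the simulated studies grow.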
Algorithms for Dx Tests I
We deliberately chose simple and transparent algorithms, to be useful to the biomedical community.
We used the C5 and CART decision tree (DT) algorithms; their advantages make them one of the best choices for omics tests:
They implicitly perform feature selection.
They discover nonlinear relationships and interactions.
Algorithms for Dx Tests II
They require relatively little data-preparation effort from users: DT do not need variable scaling, can deal with a reasonable amount of missing values, and are not affected by outliers.
They are easy to interpret and explain.
They can generate rules, helping experts formalize their knowledge.
Usually, we use ensemble methods and hyperparameter optimization, with boosted C5, Random Forests, XGBoost, and Deep Learning.
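Because the tuned ensembles turned out to consist of stumps (Max Nodes = 2), the heart of a CART split fits in a few lines. A minimal Gini-based stump finder, run on toy data rather than the TCGA features:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump(x, y):
    """Find the single-feature threshold minimizing the weighted Gini
    impurity, i.e. the root split CART would choose for one miRNA."""
    best_impurity, best_threshold = float("inf"), None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best_impurity:
            best_impurity, best_threshold = w, t
    return best_impurity, best_threshold

# Toy expression values that separate the classes perfectly at x <= 3.
stump = best_stump([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

A boosted ensemble of such stumps is exactly the Max Nodes = 2 model selected by the grid search, which is why the resulting tests stay transparent.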
C5 AUC vs Sample Size & Fitted Power Law
AUC increases with the sample size: faster for small data sizes, slower for bigger data sizes.
Min 0.8523, Mean 0.9646, Max 0.9873.
Best-fit power law: AUC = −0.5636 · X^(−0.5461) + 0.9931.
Goodness of fit: SSE = 0.0018, R-squared = 0.968.
C5 AUC vs Number of Predictors & Fitted Power Law
AUC increases with the number of predictors: Min 1, Mean 6, Max 11.
Best-fit power law: AUC = −0.07035 · X^(−1.053) + 0.9845.
Goodness of fit: SSE = 0.002932, R-squared = 0.9502.
CART AUC vs Sample Size: Results & Fitted Power Law
AUC increases with the sample size; the number of miRNAs stays constant at 6.
AUC is approximately constant for ≥ 100 patients.
Best-fit power law: AUC = −1078 · X^(−2.514) + 0.954.
Goodness of fit: SSE = 0.00189, R-squared = 0.9933.
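The fitted power laws are plain functions of the sample size, so they can be used to extrapolate the expected AUC for a planned study. A sketch using the C5 sample-size coefficients reported above:

```python
def c5_auc(n, a=-0.5636, b=-0.5461, c=0.9931):
    """C5 learning curve from the fitted power law: AUC(n) = a * n**b + c.
    The asymptote c is the AUC expected with unlimited training data."""
    return a * n ** b + c

# The curve rises quickly for small studies and saturates near the
# asymptote 0.9931 as the sample size grows.
small, large = c5_auc(50), c5_auc(865)
```

Swapping in the CART coefficients (a = −1078, b = −2.514, c = 0.954) gives the much steeper saturation seen for ≥ 100 patients.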
Signatures Depend on the Sample Size and on the Classification Algorithm
Accuracy increases in various ways with the sample size.
The number of predictors either increases (usually) with the sample size or remains constant.
The list of relevant biomarkers changes with the sample size, partially due to functional redundancy.
Robustness: all C5 and CART classifiers generalize well on the remaining data, despite using different signatures, which appear equivalent.
Conclusion
Functional redundancy is a fundamental, but scarcely investigated, property of living systems, related to their amazing robust complexity.
We proposed the first ML methodology capable of developing models ranging from the best Dx tests to the best redundancy explorers.
It unifies two general principles: Maximum Relevance & minimum Redundancy, and Maximum Relevance & Maximum Redundancy.
Signatures need not be unique, but the models can be highly accurate, robust, and transparent.