The Impact of Functional Redundancy on Molecular Signatures
Alexandru G. Floares, SAIA - Solutions of Artificial Intelligence Applications, Cluj, Romania (add footnote with the conference name, etc.)
Signature Uniqueness: Wishful Thinking?
Discovering signatures from big omics data using machine learning leads to precision medicine. The prevailing conception is that, for a given biomedical condition (e.g., cancer) and class of biomolecules (e.g., miRNA), there should be a unique signature. However, many proposed classifiers, built for similar data and biomedical problems, have almost non-overlapping lists of molecules.
Alexandru Floares - Rome, October, 2016
The Fundamental Non-Uniqueness of Molecular Signatures
If we want the signature to be the minimal list of the most relevant biomarkers, it cannot be unique. This is not a technical problem but a fundamental one: evolution seems to favor functional redundancy, which is at the foundation of complex robustness. The most obvious functional redundancy for miRNA: many miRNAs regulate the same mRNA, and mRNAs are functionally redundant, too. This fundamental functional redundancy leads to equivalent but different minimal signatures.
Functional Redundancy Implications
Many intriguing omics facts might be simple reflections of functional redundancy. For example, cancer classifiers built on similar data may have almost non-overlapping lists of molecules, and cancer mutations show unexpected heterogeneity: similar phenotypes but different mutation patterns. Redundancy manifests itself in all normal and pathological functions, in diagnosis, prognosis, treatment, and so on.
Accuracy, Robustness and Transparency
What are realistic goals if uniqueness is impossible? Predictive models should be: accurate (e.g., > 95%), robust (generalize well to new, unseen cases), and transparent (e.g., trees/rules). We will illustrate some aspects of this vision using the TCGA miRNA datasets.
The TCGA miRNA Datasets
The biggest NGS miRNA dataset: 9 cancers (C) and normal (N). We focus on two binary classifications: C vs N and BRCA vs N. Details are in these papers (ResearchGate, Alexandru Floares): Bigger Data is Better for Molecular Diagnosis Tests Based on Decision Trees (Best Paper Award, DMBD 2016), and Exploring the Functional Redundancy of miRNA in Cancer with Computational Intelligence.
Maximum Relevance, minimum Redundancy
Using ML we can build powerful predictive biomedical models, with a strong bias toward the highest accuracy and the minimum number of variables. This is justified: we want accurate and cheap omics tests, and it follows the general principle of Maximum Relevance and minimum Redundancy (MRmR). This strategy alone, however, is inadequate for exploring, understanding, and pragmatically exploiting redundancy. For redundant systems we should develop redundant models!
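The MRmR idea can be sketched as a greedy selector. This is an illustrative implementation, not the authors' actual method: it assumes scikit-learn's mutual information as the relevance score and mean absolute correlation with already-selected features as the redundancy penalty, on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X, y, k=5, seed=0):
    """Greedy MRmR sketch: at each step pick the feature maximizing
    relevance (mutual information with y) minus mean absolute
    correlation with the features already selected."""
    relevance = mutual_info_classif(X, y, random_state=seed)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Synthetic stand-in for an omics matrix (samples x features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr_select(X, y, k=5))
```

Swapping the sign of the redundancy term turns the same skeleton into an MRMR-style explorer that deliberately keeps correlated, functionally redundant features.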
Unifying the MRmR and MRMR Principles
We developed ML methodologies capable of producing both: the best omics tests (most accurate, minimum number of variables), based on the Maximum Relevance minimum Redundancy (MRmR) principle, and the best redundant models (most accurate, maximum number of relevant variables), based on the Maximum Relevance Maximum Redundancy (MRMR) principle.
The Ideal and the Best Classifier
The ideal classifier is highly performant with minimal data preprocessing: resistant to outliers, to a reasonable amount of missing values, to correlated variables, and to unbalanced classes. It is capable of generating large sets of highly accurate models, both of the MRmR and of the MRMR class, and is resistant to overfitting. We tested RF, boosted C5 DT, SVM, GP, DL NN, and SGB; SGB was the best.
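A comparison of this kind can be sketched with cross-validated AUC on synthetic data; the models below are scikit-learn stand-ins (RandomForestClassifier, GradientBoostingClassifier as SGB, SVC), not the exact implementations used in the talk, and the data is not TCGA.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for an omics matrix (samples x features).
X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           random_state=0)

results = {}
for name, clf in [("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("SGB", GradientBoostingClassifier(random_state=0)),
                  ("SVM", SVC(random_state=0))]:
    # roc_auc uses decision_function for SVC, predict_proba for the others.
    results[name] = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
print({k: round(v, 3) for k, v in results.items()})
```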
Hyperparameter Optimization (Grid Search) I
Hyperparameter optimization is not especially useful for miRNAs (AUC 0.99), but it is for other omics data. Together with ensemble methods, it increases the size of the model pool used to select the MRmR and MRMR models. Even if parameter tuning does not increase accuracy much, other aspects of the models can be improved. For example, a larger ensemble of shorter trees could generalize better than a smaller ensemble of larger trees.
Hyperparameter Optimization (Grid Search) II
Optimal number of folds for CV. Values: 5, 10, 20, 50. Best: 10-fold CV.
Learn rate. Values: 0.001, 0.01, 0.1. Best: 0.01 (test AUC = 1).
Max nodes. Values: 2, 4, 6, 8, 9. Best: 2 (AUC 1.000). Interesting: this indicates that miRNA interactions are not essential, the ensemble being composed of stump trees.
Min child. Values: 1, 2, 5, 10, 25, 50, 100, 200. Best: 5.
Subsample. Values: 0.1, 0.2, 0.25, 0.3, 0.5, 0.75, and 0.9. Best: 0.5, which gives the best accuracy and prevents overfitting.
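A grid search of this shape can be sketched with scikit-learn's GridSearchCV over stochastic gradient boosting. The talk used a different SGB implementation, so the parameter names here are scikit-learn analogues only: learning_rate for learn rate, max_depth for max nodes (a depth-1 tree has 2 terminal nodes, i.e., a stump), min_samples_leaf for min child, subsample as-is; the grid is a reduced subset and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the miRNA expression matrix.
X, y = make_classification(n_samples=400, n_features=50, n_informative=8,
                           random_state=0)

# Reduced grid mirroring the slide's search space.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [1, 2],          # depth 1 = stump (2 terminal nodes)
    "min_samples_leaf": [5, 25],
    "subsample": [0.5, 0.9],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid, cv=10, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```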
miRNA Importance: A Redundancy Signature?
Long Tail & Redundancy
There are a few variables with high importance and a very long tail of variables with slowly decreasing importance. For non-redundant systems, either there is no such long tail or it represents mainly noise. Functional analysis shows that the long tail contains cancer-related miRNAs, not noise. Thus, for redundant systems, a long tail could be a mark of redundancy.
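A long-tailed importance profile like this can be reproduced on synthetic data with many redundant features. This sketch uses scikit-learn's gradient boosting importances; the feature counts and split point are illustrative, not TCGA values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Many informative plus redundant features, mimicking a redundant omics system.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_redundant=40, random_state=0)
model = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

imp = np.sort(model.feature_importances_)[::-1]  # descending importance
head = imp[:5].sum()   # importance mass in the few top features
tail = imp[5:].sum()   # mass spread over the long tail
print(round(head, 2), round(tail, 2))
```

Plotting `imp` would show the profile described on the slide: a steep head followed by a slowly decaying tail.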
Long Tail & Redundancy: Univariate Models
First, build a model with all 644 features. Select the features with importance > 3.5 (136 features; 21%). Then build a univariate model for each of the 136 selected features. The misclassification error is surprisingly good for all univariate models, ranging from … to …, with a mean of …. This is another piece of evidence of the underlying redundancy.
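The univariate screen can be sketched as follows, with synthetic data and a shallow decision tree standing in for the actual pipeline; the feature counts and the printed error statistics are illustrative, not the slide's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; the real screen used 136 of 644 miRNA features.
X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                           n_redundant=10, random_state=0)

# One model per feature; misclassification error = 1 - CV accuracy.
errors = []
for j in range(X.shape[1]):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                          X[:, [j]], y, cv=5).mean()
    errors.append(1 - acc)
errors = np.array(errors)
print(round(errors.min(), 3), round(errors.mean(), 3), round(errors.max(), 3))
```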
Long Tail & Redundancy: In Silico Knockdown of miRNAs
We eliminate one miR at a time from the bottom/top of the importance list. The results are also typical of a redundant system: more than 100 miRNAs can be removed from the 136 with importance > 3.5 (73%), and the AUC remains unchanged at 1.00. MRmR model: max AUC 1.00 with a minimum of 7 miRNAs. Candidates for an omics Dx test (AUC > 0.95): AUC 0.96 with 4 miRNAs, and AUC 0.96 with 3 miRNAs. MRMR model: max AUC 1.00 with the maximum of 136 miRNAs.
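The in silico knockdown loop, removing one feature at a time from the bottom of the importance list and retraining, might look like this sketch on synthetic data (sample and feature counts are arbitrary, not the TCGA numbers).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           n_redundant=20, random_state=0)
full = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)
order = np.argsort(full.feature_importances_)  # least important first

# Knock down features one at a time from the bottom of the importance
# list and track the CV AUC of a model retrained on the survivors.
surviving = list(range(X.shape[1]))
aucs = []
for drop in order[:20]:
    surviving.remove(drop)
    auc = cross_val_score(
        GradientBoostingClassifier(n_estimators=30, random_state=0),
        X[:, surviving], y, cv=3, scoring="roc_auc").mean()
    aucs.append(auc)
print(round(aucs[0], 3), round(aucs[-1], 3))
```

On a redundant dataset the AUC trace stays flat for many knockdowns, which is the behavior the slide reports for the 136 selected miRNAs.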
Learning Curves BRCA vs NORM I
Training size increases from 22 to 865 in steps of 9, preserving the class proportions of the original dataset. Each dataset was partitioned into a 75% training set and a 25% fresh test set. Only the training set was used for 3-fold CV, and CV was repeated 100 times.
Learning Curves BRCA vs NORM II
Repeating CV 100 times, we mimic 100 studies with various numbers of different patients. All remaining data were used again for testing the generalization capability, especially for the small-sample-size simulated studies. For example, for a dataset of 22 patients: ~16 (75%) for CV, 6 (25%) for test, and 843 (865 total − 22) for generalization testing.
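The simulated-study protocol above (stratified subsample of n patients, 75/25 split, 3-fold CV on the training part, generalization check on all leftover patients) can be sketched like this; the data is synthetic and `simulate_study` is an illustrative helper, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 865-patient BRCA vs NORM dataset.
X, y = make_classification(n_samples=865, n_features=30, n_informative=6,
                           random_state=0)

def simulate_study(n, seed):
    """One simulated study of n patients: stratified subsample, 75/25
    train/test split, 3-fold CV on the training part, and a
    generalization check on all leftover patients."""
    idx, rest = train_test_split(np.arange(len(y)), train_size=n,
                                 stratify=y, random_state=seed)
    Xtr, Xte, ytr, yte = train_test_split(X[idx], y[idx], test_size=0.25,
                                          stratify=y[idx], random_state=seed)
    cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                             Xtr, ytr, cv=3).mean()
    tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
    gen_acc = tree.score(X[rest], y[rest])  # all remaining patients
    return cv_acc, gen_acc

cv_acc, gen_acc = simulate_study(n=100, seed=1)
print(round(cv_acc, 3), round(gen_acc, 3))
```

Repeating `simulate_study` with 100 seeds per sample size reproduces the "100 studies" design described on the slide.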
Algorithms for Dx Tests I
We deliberately chose simple and transparent algorithms, to be useful for the biomedical community. We used the C5 and CART decision tree (DT) algorithms; their advantages make them among the best choices for omics tests: they implicitly perform feature selection, and they discover nonlinear relationships and interactions.
Algorithms for Dx Tests II
DTs require relatively little data-preparation effort from users: they do not need variable scaling, they can deal with a reasonable amount of missing values, and they are not affected by outliers. They are easy to interpret and explain, and can generate rules that help experts formalize their knowledge. Usually, we use ensemble methods and hyperparameter optimization, with boosted C5, Random Forests, XGBoost, and Deep Learning.
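The rule-generation property can be illustrated with scikit-learn's `export_text` on a public breast-cancer dataset, used here only as a stand-in for the TCGA BRCA miRNA data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small CART-style tree on a public breast-cancer dataset.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Turn the fitted tree into human-readable if/then rules.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

The printed rules are exactly the kind of transparent output the slide argues for: a clinician can read each path as a threshold-based decision rule.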
C5 AUC vs Sample Size & Fitted Power Law
AUC increases with the sample size: faster for small data sizes, slower for bigger ones. Min …, mean …, max ….
Best-fit power law: AUC = −0.5636·X^(−…).
Goodness of fit: SSE = …, R-square = 0.968.
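Fitting a saturating power law of AUC against sample size can be sketched with SciPy's `curve_fit`. Since the slide's coefficients did not fully survive extraction, the data points and recovered parameters below are purely illustrative, generated under the assumed form AUC = a − b·X^(−c).

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (sample size, AUC) pairs generated from a saturating
# power law plus small noise; these are NOT the slide's measurements.
n = np.array([25, 50, 100, 200, 400, 800], dtype=float)
auc = 1.0 - 0.5 * n ** -0.6 + np.random.RandomState(0).normal(0, 0.002, n.size)

def power_law(x, a, b, c):
    return a - b * x ** -c   # AUC saturates toward a as x grows

params, _ = curve_fit(power_law, n, auc, p0=[1.0, 0.5, 0.5])
residuals = auc - power_law(n, *params)
sse = float(np.sum(residuals ** 2))
r2 = 1 - sse / float(np.sum((auc - auc.mean()) ** 2))
print(np.round(params, 3), round(r2, 3))
```

The same fitting routine applies unchanged to the AUC-vs-number-of-predictors curves on the following slides.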
C5 AUC vs Number of Predictors & Fitted Power Law
AUC increases with the number of predictors. Min 1, mean 6, max 11.
Best-fit power law: AUC = … − …·X^(−…).
Goodness of fit: SSE = …, R-square = ….
CART AUC vs Sample Size: Results & Fitted Power Law
CART AUC increases with the sample size. The number of miRs is constant: 6! AUC is approximately constant for ≥ 100 patients.
Best-fit power law: AUC = −1078·X^(−…).
Goodness of fit: SSE = …, R-square = ….
Signatures Depend on the Sample Size and on the Classification Algorithm
Accuracy increases, in various ways, with the sample size. The number of predictors either increases with the sample size (usually) or remains constant. The list of relevant biomarkers changes with the sample size, partially due to functional redundancy. Robustness: all C5 and CART classifiers generalize well on the remaining data, despite using different signatures, which look equivalent.
Conclusion
Functional redundancy is a fundamental, but scarcely investigated, property of living systems, related to their amazing robust complexity. We proposed the first ML methodology capable of developing the full range of models, from the best Dx tests to the best redundancy explorers. It unifies two general principles: Maximum Relevance & minimum Redundancy, and Maximum Relevance & Maximum Redundancy. The signatures should not be expected to be unique, but the models can be highly accurate, robust, and transparent.