Presentation on theme: "Outlines Introduction & Objectives Methodology & Workflow"— Presentation transcript:

1

2 Outline
Introduction & Objectives
Methodology & Workflow
Simulation Work
Conclusion & Future Work

3 Introduction & Objectives
Differentiating patients is challenging when the data are heavily overlapped among treatment groups or disease subgroups. In addition, the class imbalance and small sample sizes commonly observed in trials impose additional challenges.
Objectives: Develop a machine learning framework that improves classification performance under these different scenarios. More specifically, various machine learning approaches and resampling methods will be compared.

4 Methodology Consideration: Resampling
Random Under Sampling (RUS): Randomly remove samples from the majority class.
Random Over Sampling (ROS): Randomly replicate samples from the minority class.
Synthetic Minority Over-Sampling Technique (SMOTE): Simultaneously create synthetic samples for the minority class and under-sample the majority class; synthetic samples are generated from the feature vector of a minority-class sample and its nearest neighbors.
Cluster-Based Under Sampling (ClusBUS): Divide samples into clusters (based on distance and density) or noise points using DBSCAN; remove all majority points in clusters whose majority-class percentage is below a threshold.
(A short code sketch of the first three methods follows below.)
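The first three methods are available in the Python package imbalanced-learn; the sketch below is a rough illustration on an assumed toy data set (ClusBUS has no off-the-shelf implementation there, so it is omitted). Note that imbalanced-learn's plain SMOTE only over-samples the minority class; the combined over/under-sampling described above would need an extra under-sampling step.

```python
# A rough sketch of RUS, ROS, and SMOTE using the imbalanced-learn package
# (ClusBUS is not part of the package; the toy data set below is illustrative).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy imbalanced data: 120 samples, roughly 25% minority class (assumed sizes).
X, y = make_classification(n_samples=120, n_features=5, n_informative=3,
                           weights=[0.75, 0.25], random_state=0)
print("original:", Counter(y))

samplers = {
    "RUS": RandomUnderSampler(random_state=0),  # randomly drop majority samples
    "ROS": RandomOverSampler(random_state=0),   # randomly replicate minority samples
    "SMOTE": SMOTE(random_state=0),             # synthesize minority samples from nearest neighbors
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```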

5 Methodology Consideration: Classifiers & Evaluation Criterion
Classifiers:
Logistic Regression (GLM)
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Decision Tree (DT)
Random Forest (RF)
Support Vector Machine (SVM)
Fuzzy C-means (FCM): soft clustering in which each data point belongs to multiple clusters with membership grades; an adjustment is made to conduct classification.
Evaluation Criterion:
G-mean: geometric mean of specificity TN/(FP+TN) and sensitivity TP/(TP+FN) (see the sketch below).
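As a concrete illustration, G-mean can be computed directly from the confusion matrix; the sketch below assumes binary labels coded 0/1 (imbalanced-learn also provides a ready-made geometric_mean_score).

```python
# G-mean = sqrt(sensitivity * specificity); a minimal illustration with 0/1 labels.
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity TP/(TP+FN) and specificity TN/(FP+TN)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)

# A classifier that always predicts the majority class has 75% accuracy here
# but zero sensitivity, hence G-mean 0 -- which is why G-mean suits imbalanced data.
y_true = [0] * 90 + [1] * 30
y_pred = [0] * 120
print(g_mean(y_true, y_pred))  # 0.0
```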

6 Workflow
Step 1 (method selection): On the modeling data, compare the classification + resampling methods using nested CV / repeated CV and select the best-performing method.
Step 2 (final model): Tune the hyper-parameters of the selected method, train the final model on the modeling data, and make predictions on the testing data.
(A simplified code sketch of both steps follows below.)
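A simplified sketch of the two-step workflow, assuming imbalanced-learn pipelines, repeated stratified CV, and G-mean scoring; the candidate methods, CV settings, and hyper-parameter grid are illustrative placeholders rather than the study's actual configuration, and the inner nesting of Step 1 is omitted for brevity.

```python
# Step 1: compare classifier + resampling candidates on the modeling data with
# repeated CV scored by G-mean; Step 2: tune the selected method, refit on the
# modeling data, and predict on the testing data.  Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

# Placeholder modeling/testing data (n = 120 and n = 30, about 25% responders).
X_model, y_model = make_classification(n_samples=120, n_features=5, n_informative=3,
                                        weights=[0.75, 0.25], random_state=1)
X_test, y_test = make_classification(n_samples=30, n_features=5, n_informative=3,
                                      weights=[0.75, 0.25], random_state=2)

gmean = make_scorer(geometric_mean_score)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

candidates = {
    "SVM+RUS": Pipeline([("resample", RandomUnderSampler(random_state=0)), ("clf", SVC())]),
    "GLM+SMOTE": Pipeline([("resample", SMOTE(random_state=0)),
                           ("clf", LogisticRegression(max_iter=1000))]),
}

# Step 1: method selection on the modeling data.
scores = {name: cross_val_score(pipe, X_model, y_model, cv=cv, scoring=gmean).mean()
          for name, pipe in candidates.items()}
best = max(scores, key=scores.get)

# Step 2: hyper-parameter tuning of the selected method, final training, prediction.
grid = GridSearchCV(candidates[best], {"clf__C": [0.1, 1, 10]}, cv=cv, scoring=gmean)
grid.fit(X_model, y_model)
y_pred = grid.predict(X_test)
print(best, geometric_mean_score(y_test, y_pred))
```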

7 Simulation-Null
4 scenarios (1,000 data sets for each scenario)
Modeling data: n = 120
Testing data: n = 30
Outcome: responder / non-responder
Scenarios (Equal Mean / Equal Variance / Responding Rate):
Scenario 1: ×, 50%
Scenario 2: 25%
Scenario 3
Scenario 4
(A data-generation sketch follows below.)
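A minimal sketch of how one null data set might be generated is below; only the sample sizes and the responding rate come from the slide, while the feature dimension and the standard-normal features (identical for responders and non-responders, i.e. equal mean and variance) are illustrative assumptions.

```python
# One simulated null data set: features follow the same distribution for
# responders and non-responders, so a valid G-mean estimate should sit near 0.5
# (as on the results slide).  Feature dimension and distribution are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features = 5
responding_rate = 0.25          # e.g. Scenario 2; 0.50 for Scenario 1

def simulate_null(n):
    y = rng.binomial(1, responding_rate, size=n)   # responder indicator
    X = rng.normal(size=(n, n_features))           # same distribution in both classes
    return X, y

X_model, y_model = simulate_null(120)   # modeling data
X_test, y_test = simulate_null(30)      # testing data
```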

8 Results-Null, scenario 2 (G-Mean)
[Figure: G-mean of the validated methods, by resampling (no resampling, ClusBUS, ROS, RUS, SMOTE) and classifier (DT, FCM, GLM, LDA, QDA, RF, SVM)]
RUS is preferred (smallest variance; G-mean closest to 0.5 for all classifiers).
ClusBUS results are similar to the results with no resampling.

9 Summary-Null
Resampling: RUS is preferred and needed for small imbalanced data; it reduces the variance of G-mean and provides valid G-mean estimation. ClusBUS results are similar to the results with no resampling.
Classifiers: FCM provides valid results on small imbalanced data even without resampling. Some resampling + classifier combinations provide biased estimates of G-mean; they need to be removed for the alternative simulation and the real data analysis. SVM+SMOTE is not recommended due to high variation of G-mean.
Performance Measurement: G-mean balances the prediction accuracy between both classes and is a robust performance measure.

10 Simulation-Alternative
3 scenarios (1,000 data sets for each scenario)
Modeling data: n = 120 (1:3)
Testing data: n = 30 (1:3)
Outcome: responder / non-responder
Scenario 1: target overlapping rate 80%; simulated overlapping rate mean 78.85%, median 79.05%, range (57%, 94%)
Scenario 2: target overlapping rate 65%; simulated overlapping rate mean 64.27%, median 65.32%, range (22%, 95%)
Scenario 3: target overlapping rate 50%; simulated overlapping rate mean 49.37%, median 50.22%, range (15%, 85%)

11 Results-Alternative: Selected Methods
By G-Mean (the green dashed line in the figure represents SVM+RUS):
SVM and FCM are the most frequently selected classifiers.
The probabilities of being chosen as the best method are quite similar among the top 5 selected methods and SVM+RUS.

12 Results-Alternative: G-Mean
Selected method vs. top 5 selected methods (on testing data, N = 30)
Overlapping rate 80%: G-mean mean (SD) = 61% (11.08%)
Overlapping rate 65%: G-mean mean (SD) = 63% (11.0%)
Overlapping rate 50%: G-mean mean (SD) = 66% (9.74%)
Performance of the different methods did not differ very much.

13 Summary-Alternative
When the overlapping rate is high (≥50%), the true signal is weak:
G-mean performance (mean and variance) degrades slightly as the overlapping rate increases.
SVM and FCM have slightly higher chances of being selected as the best model.
Performance of the different methods did not differ very much.

14 Conclusion & Future Work
Conclusion
Because the nested CV (Step 1) is time-consuming, it may be skipped; SVM+RUS could be applied directly instead under the current scenarios (a sketch follows below).
Future Work
Simulation with more extreme scenarios:
Extremely imbalanced data (responder rate < 25%)
Smaller testing data sample size (N < 30)
Simulation with a different data structure:
Testing data set imbalance ratio (3:1) differing from that of the modeling data set (1:3)
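A minimal sketch of applying SVM+RUS directly (skipping Step 1) is below, assuming an imbalanced-learn pipeline with default SVM settings and reusing the X_model / X_test placeholders from the earlier sketches; the study's actual tuning is not reproduced.

```python
# Direct SVM+RUS fit, skipping the Step 1 method-selection loop.
# Default SVM settings; X_model/y_model and X_test/y_test are the placeholder
# modeling and testing sets defined in the earlier sketches.
from sklearn.svm import SVC
from imblearn.metrics import geometric_mean_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

svm_rus = Pipeline([
    ("rus", RandomUnderSampler(random_state=0)),  # balance classes before fitting
    ("svm", SVC()),
])
svm_rus.fit(X_model, y_model)                                  # modeling data (n = 120)
print(geometric_mean_score(y_test, svm_rus.predict(X_test)))   # testing data (n = 30)
```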

15 Thank You!

