Experimental data sub-challenge (SC1A): no definitive "gold standard" causal networks exist, so a novel held-out validation approach is used that emphasizes the causal aspect of the challenge.
Data split: all data (N treatments) are divided into training data (4 treatments: DMSO, FGFR1/3i, AKTi, AKTi+MEKi) and test data (N-4 treatments: Test1, Test2, ..., Test(N-4)).
Participants infer 32 networks using the training data; the inferred networks are assessed using the test data.
Assessment: how well do the inferred causal networks agree with effects observed under inhibition in the test data?
Step 1: identify a "gold standard" list of affected phosphoproteins using a paired t-test comparing DMSO and each test inhibitor, for each phosphoprotein and cell line/stimulus regime (a sketch follows below).
[Figure: example time courses for UACC812/Serum under Test1. phospho1 (a.u.) differs significantly between DMSO and Test1 (p-value = 3.2 x 10^-5) and enters the "gold standard" list; phospho2 (a.u.) does not (p-value = 0.45).]
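A minimal sketch of Step 1 in Python. The data layout, the significance cutoff (alpha = 0.05) and the absence of a multiple-testing correction are assumptions for illustration; they are not specified in the slides.

```python
import numpy as np
from scipy.stats import ttest_rel

def gold_standard_effects(dmso, inhibitor, alpha=0.05):
    """Identify phosphoproteins whose time courses differ between DMSO and a test
    inhibitor for one cell line/stimulus regime.

    dmso, inhibitor: dicts mapping phosphoprotein name -> array of measurements at
    matched time points (hypothetical layout). Returns the "gold standard" set.
    """
    affected = set()
    for protein, control in dmso.items():
        _, p_value = ttest_rel(control, inhibitor[protein])  # paired t-test over time points
        if p_value < alpha:
            affected.add(protein)
    return affected
```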
Step 2: score submissions.
For a single cell line/stimulus regime, start from the submitted matrix of predicted edge scores. Apply a threshold τ to obtain a network and take the protein descendants downstream of the test inhibitor target. Comparing these descendants with the "gold standard" list of observed effects in the held-out data gives the true and false positive counts #TP(τ) and #FP(τ). Varying the threshold τ traces out an ROC curve, yielding an AUROC score for that regime and test inhibitor (e.g. AUROC for Test1).
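A sketch of Step 2 under simplifying assumptions: edge scores are held in a square matrix indexed by protein, the ROC curve is closed at (1, 1), and the treatment of the inhibitor target itself is glossed over; none of these details are specified in the slides.

```python
import numpy as np

def descendants(adj, source):
    """Nodes reachable from `source` in adj (dict: node -> set of children)."""
    seen, stack = set(), [source]
    while stack:
        for child in adj.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def descendant_auroc(scores, proteins, target, gold_standard):
    """scores[i, j]: predicted confidence for edge proteins[i] -> proteins[j];
    target: protein inhibited in the held-out data;
    gold_standard: set of significantly affected phosphoproteins (from Step 1)."""
    positives = set(gold_standard)
    negatives = set(proteins) - positives
    tpr, fpr = [0.0], [0.0]
    for tau in np.unique(scores)[::-1]:                  # sweep threshold from high to low
        adj = {proteins[i]: {proteins[j] for j in np.flatnonzero(scores[i] >= tau)}
               for i in range(len(proteins))}
        desc = descendants(adj, target)
        tpr.append(len(desc & positives) / max(len(positives), 1))   # #TP(tau) / #positives
        fpr.append(len(desc & negatives) / max(len(negatives), 1))   # #FP(tau) / #negatives
    tpr.append(1.0)
    fpr.append(1.0)                                      # close the ROC curve at (1, 1)
    return np.trapz(tpr, fpr)                            # area under the ROC curve
```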
74 final submissions. Each submission has 32 AUROC scores (one for each cell line/stimulus regime).
[Figure: AUROC scores across submissions, with significant and non-significant AUROC values distinguished and the best performer highlighted.]
Scoring procedure (a sketch follows below):
1. For each submission and each cell line/stimulus pair, compute the AUROC score.
2. Rank submissions within each cell line/stimulus pair.
3. For each submission, calculate the mean rank across the 32 cell line/stimulus pairs.
4. Rank submissions according to mean rank to obtain the final ranking.
[Diagram: submissions x 32 cell line/stimulus pairs matrix of AUROC scores -> matrix of AUROC ranks -> mean rank per submission -> final rank per submission.]
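A compact sketch of steps 2-4, assuming the AUROC scores are arranged in a submissions x 32 matrix (an assumed layout) and that higher AUROC is better.

```python
import numpy as np
from scipy.stats import rankdata

def final_ranking(auroc_matrix):
    """auroc_matrix: submissions x cell line/stimulus pairs (32 columns).
    Returns the final rank of each submission (1 = best)."""
    # Rank submissions within each regime; negate so the largest AUROC gets rank 1.
    ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, auroc_matrix)
    mean_rank = ranks.mean(axis=1)        # mean rank across the 32 regimes
    return rankdata(mean_rank)            # rank by mean rank (lowest mean rank = rank 1)
```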
Verify that the final ranking is robust. Procedure (a sketch follows below):
1. Mask 50% of phosphoproteins in each AUROC calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: distribution of final ranks for the top 10 teams across the 100 masked re-calculations.]
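A sketch of the robustness procedure, reusing final_ranking from above. The callback compute_auroc_matrix, which recomputes the submissions x regimes AUROC matrix from a subset of phosphoproteins, is hypothetical.

```python
import numpy as np

def ranking_robustness(compute_auroc_matrix, phosphoproteins, n_iter=100, seed=0):
    """Distribution of final ranks when half of the phosphoproteins are masked."""
    rng = np.random.default_rng(seed)
    rankings = []
    for _ in range(n_iter):
        kept = rng.choice(phosphoproteins, size=len(phosphoproteins) // 2, replace=False)
        rankings.append(final_ranking(compute_auroc_matrix(kept)))
    return np.array(rankings)             # n_iter x submissions matrix of final ranks
```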
In silico data sub-challenge (SC1B): a gold standard is available, namely the data-generating causal network.
Participants submitted a single set of edge scores. Edge scores were compared against the gold standard to give an AUROC score, and participants were ranked on this AUROC score.
[Figure: AUROC scores across submissions; 51 non-significant and 14 significant AUROC values, with the best performer highlighted.]
Robustness analysis:
1. Mask 50% of edges in the calculation of AUROC.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: distribution of final ranks for the top 10 teams across the 100 masked re-calculations.]
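For SC1B the comparison reduces to a standard edge-wise ROC analysis. A sketch, assuming the score and truth matrices share the same node ordering and that self-edges are excluded (an assumption).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sc1b_auroc(edge_scores, true_network):
    """edge_scores: predicted confidence for each directed edge (n x n matrix);
    true_network: 0/1 adjacency matrix of the data-generating causal network."""
    off_diagonal = ~np.eye(true_network.shape[0], dtype=bool)   # ignore self-edges
    return roc_auc_score(true_network[off_diagonal], edge_scores[off_diagonal])
```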
Combined SC1 ranking: 59 teams participated in both SC1A and SC1B. To reward consistently good performance across both parts of SC1, the SC1A and SC1B ranks were averaged for each team. The top team ranked robustly first.
SC2A: time-course prediction from experimental data. The data split mirrors SC1A: all data (N treatments) are divided into training data (4 treatments: DMSO, FGFR1/3i, AKTi, AKTi+MEKi) and test data (N-4 treatments: Test1, Test2, ..., Test(N-4)).
Participants build dynamical models using the training data and predict phosphoprotein trajectories under inhibitions not present in the training data; predictions are assessed using the test data.
Participants made predictions for all phosphoproteins, for each cell line/stimulus pair, under inhibition of each of 5 test inhibitors.
Assessment: how well do the predicted trajectories agree with the corresponding trajectories in the test data?
Scoring metric: root-mean-squared error (RMSE), calculated for each cell line/phosphoprotein/test inhibitor combination (e.g. UACC812, Phospho1, Test1); a sketch follows below.
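The RMSE for one combination, written out for clarity (the array layout is an assumption; each score is computed over that combination's matched time points).

```python
import numpy as np

def rmse(predicted, observed):
    """Root-mean-squared error between a predicted and an observed trajectory,
    evaluated at matched time points for one cell line/phosphoprotein/inhibitor."""
    predicted, observed = np.asarray(predicted, float), np.asarray(observed, float)
    return np.sqrt(np.mean((predicted - observed) ** 2))
```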
14 final submissions.
Final ranking: analogously to SC1A, submissions were ranked for each regime and the mean rank was calculated.
[Figure: scores across submissions, with significant and non-significant values distinguished and the best performer highlighted.]
Verify that the final ranking is robust. Procedure (a sketch follows below):
1. Mask 50% of data points in each RMSE calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: distribution of final ranks for the top 10 teams across the 100 masked re-calculations; the 2 best performers are highlighted and one incomplete submission is noted.]
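One iteration of the masking step for a single RMSE calculation, reusing the rmse sketch above (the random-half masking of time points is as stated in the procedure; everything else is illustrative).

```python
import numpy as np

def masked_rmse(predicted, observed, rng):
    """Recompute the RMSE for one cell line/phosphoprotein/inhibitor combination
    using a random 50% of its data points."""
    predicted, observed = np.asarray(predicted, float), np.asarray(observed, float)
    keep = rng.choice(len(observed), size=len(observed) // 2, replace=False)
    return rmse(predicted[keep], observed[keep])
```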
In silico data sub-challenge (SC2B): participants made predictions for all phosphoproteins, for each stimulus regime, under inhibition of each phosphoprotein in turn.
The scoring metric is RMSE and the procedure follows that of SC2A.
Robustness analysis:
1. Mask 50% of data points in each RMSE calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: scores across submissions with the best performer highlighted; distribution of final ranks for the top 10 teams across the 100 masked re-calculations; one incomplete submission is noted.]
Combined SC2 ranking: 10 teams participated in both SC2A and SC2B. To reward consistently good performance across both parts of SC2, the SC2A and SC2B ranks were averaged for each team. The top team ranked robustly first.
14 submissions. 36 HPN-DREAM participants voted, each assigning ranks 1 to 3. Final score = mean rank, with unranked submissions assigned rank 4 (a sketch follows below).
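A small sketch of the vote-based score. The vote data structure (one dict per voter, mapping a submission to its rank 1-3) is an assumption for illustration.

```python
import numpy as np

def vote_score(votes, submissions):
    """votes: list of dicts, one per voter, mapping a submission to its rank (1-3).
    Submissions a voter did not rank count as rank 4; lower mean rank is better."""
    return {s: float(np.mean([voter.get(s, 4) for voter in votes])) for s in submissions}
```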
Submissions were rigorously assessed using held-out test data.
SC1A: a novel procedure was used to assess network inference performance in a setting with no true "gold standard".
Many statistically significant predictions were submitted.
For further investigation:
Explore why some regimes (e.g. cell line/stimulus pairs) are easier to predict than others.
Determine why different teams performed well in the experimental and in silico challenges.
Identify the methods/approaches that yield the best predictions.
Wisdom of crowds: does aggregating submissions improve performance and lead to the discovery of biological insights?