DOMINO: Using Machine Learning to Predict Genes Associated with Dominant Disorders
Mathieu Quinodoz, Beryl Royer-Bertrand, Katarina Cisarova, Silvio Alessandro Di Gioia, Andrea Superti-Furga, Carlo Rivolta
The American Journal of Human Genetics, Volume 101, Issue 4, Pages 623-629 (October 2017). DOI: 10.1016/j.ajhg.2017.09.001. Copyright © 2017 American Society of Human Genetics
Figure 1 Rationale and General Design of DOMINO
(A) A typical exome analysis identifies 20,000 variants relative to the human reference genome. After filtering by rarity in the general population (minor allele frequency, or MAF, < 1%) and by the functional impact of each variant, approximately 400 DNA changes remain. These affect 300–400 genes in the heterozygous state (red dots) and 5–10 genes as homozygous or compound heterozygous variants (blue dots).
(B) Workflow of the DOMINO methodology, showing the successive steps of gene selection, annotation, and scoring.
(C) Details of the LDA algorithm. Relevant features are first preselected and then iteratively removed from, replaced in, or added to the model, subject to specific acceptance criteria. 10 × 10-fold cross-validation is performed at each iteration.
(D) Performance of the model as a function of the number of iterations performed. The AUCs of the training, testing, and validation sets are shown, together with the number of features at each iteration. The cut-off retained corresponds to the 14th iteration and a set of 8 features. The model converges from the 36th iteration onward.
(E) ROC curves for the complete training, testing, and validation sets, with AUC values of 0.912, 0.908, and 0.920, respectively.
(F) Features composing the selected model. Average values for AD and AR genes of the training set are shown, along with their relative weights.
Units are as follows: for STRING entries, number of interactions [17]; for ExAC-pRec, probability of being intolerant to homozygous but not heterozygous loss-of-function variants [18]; for ExAC-missense Z score, value with respect to a distribution of the expected number of missense variants [18]; for PhyloP, average PhyloP score over a 1,000-bp window centered on the TSS [19]; for ExAC-don./syn., number of variants at the donor splice site, normalized to the number of synonymous variants in the coding sequence [20]; for mRNA half-life, 0 if ≤ 10 hr or 1 if > 10 hr [21].
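The evaluation described in panels C to E can be illustrated with a minimal sketch. It is not the authors' code: it assumes that "10 × 10-fold cross-validation" means 10 independent repeats of a 10-fold split, and it computes the AUC as the probability that a randomly chosen AD gene scores higher than a randomly chosen AR gene. The function names `auc` and `repeated_kfold` and the example scores are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed pairwise.

    Equals the fraction of (positive, negative) pairs in which the
    positive item (e.g. an AD gene) has the higher score; ties count
    as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def repeated_kfold(n_items, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Assumption: each repeat reshuffles the items and splits them into
    k folds, so 10 repeats of 10 folds give 100 train/test splits.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        # Fold i takes every k-th shuffled index starting at position i.
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


# Hypothetical scores for three AD genes and two AR genes.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])
splits = list(repeated_kfold(25))
```

In a feature-selection loop of the kind panel C describes, a candidate model would be fitted on each `train` split and its AUC on the matching `test` split averaged over all splits before deciding whether to accept the change.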
Figure 2 Distributions of LDA Scores and Probabilities of Being Dominant, P(AD), for Genes in the Training and Validation Sets (A) Density plots of the LDA score for AD (red) and AR (blue) genes of the training set. Continuous lines refer to raw values, whereas dashed lines refer to their normal approximations. (B–F) Histograms of P(AD) for (B) AD genes of the training set, (C) AR genes of the training set, (D) AD genes of the validation set, (E) AR genes of the validation set, and (F) genes known to behave as false positives in NGS experiments because they contain rare, non-pathogenic variants.
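One plausible way to turn an LDA score into P(AD) from the normal approximations in panel A is Bayes' rule over the two fitted densities. The caption does not state how P(AD) is actually derived, so the sketch below is an assumption, and the means, standard deviations, and prior used in the example are hypothetical placeholders rather than values from the paper.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def p_ad(score, ad_mean, ad_sd, ar_mean, ar_sd, prior_ad=0.5):
    """Posterior probability that a gene is AD given its LDA score.

    Assumption: the AD and AR score distributions are each approximated
    by a normal density, and prior_ad is the prior proportion of AD genes.
    For scores extremely far from both means, both densities can underflow
    to zero; this sketch does not guard against that case.
    """
    like_ad = prior_ad * normal_pdf(score, ad_mean, ad_sd)
    like_ar = (1.0 - prior_ad) * normal_pdf(score, ar_mean, ar_sd)
    return like_ad / (like_ad + like_ar)


# Hypothetical parameters: AD scores ~ N(1, 1), AR scores ~ N(-1, 1).
midpoint = p_ad(0.0, 1.0, 1.0, -1.0, 1.0)
at_ad_mean = p_ad(1.0, 1.0, 1.0, -1.0, 1.0)
```

With equal standard deviations and equal priors, a score midway between the two means gives P(AD) = 0.5, and the probability rises toward 1 as the score moves toward the AD mean.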
Figure 3 Distributions of P(AD) for Genes with at Least Two De Novo Mutations in Different Individuals with Intellectual Disability or Epilepsy Histograms of P(AD) for (A) 82 genes carrying de novo mutations in 1,010 individuals with intellectual disability or (B) 19 genes carrying de novo mutations in 532 individuals with epilepsy, as extracted from denovo-db.