Supervised Machine Learning for Population Genetics: A New Paradigm

Daniel R. Schrider, Andrew D. Kern. Trends in Genetics, Volume 34, Issue 4, Pages 301-312 (April 2018). DOI: 10.1016/j.tig.2017.12.005. Copyright © 2017 The Author(s).

Figure I. An Imaginary Training Set of Two Types of Fruit, Oranges (Orange Filled Points) and Apples (Green Filled Points), Where Two Measurements Were Made for Each Fruit. With a training set in hand, we can use supervised ML to learn a function that differentiates between the classes (broken line), so that the unknown class of new datapoints (the unlabeled points) can be predicted.
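The idea in Figure I can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: it uses a nearest-centroid rule, which yields a straight-line boundary between two classes, and every measurement below is invented. The names `train`, `fit_centroids`, and `predict` are hypothetical.

```python
from math import dist

# Hypothetical training set: two measurements per fruit plus a known label.
# All values are invented for illustration.
train = [
    ((2.0, 6.0), "orange"), ((2.5, 7.0), "orange"),
    ((3.0, 6.5), "orange"), ((2.2, 7.4), "orange"),
    ((6.0, 2.0), "apple"), ((6.5, 3.0), "apple"),
    ((7.0, 2.5), "apple"), ((5.8, 3.2), "apple"),
]

def fit_centroids(samples):
    """Learn one mean point (centroid) per class from labeled samples."""
    sums = {}
    for (x1, x2), label in samples:
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += x1
        s[1] += x2
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict(centroids, point):
    """Assign a new, unlabeled point to the class with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

centroids = fit_centroids(train)

# Two unlabeled datapoints whose class we want to predict.
for new_point in [(3.0, 6.0), (6.0, 3.0)]:
    print(new_point, "->", predict(centroids, new_point))
```

With two classes, the set of points equidistant from both centroids is a straight line, which plays the role of the broken line in the figure. Other classifiers would learn a different boundary from the same training set.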

Figure II. An Example Application of Supervised ML to Demographic Model Selection. In this example, population samples experiencing constant population size (equilibrium), a recent instantaneous population decline (contraction), or a recent instantaneous expansion (growth) were simulated. A variant of a random forest classifier [51], which is an ensemble of semi-randomly generated decision trees, was trained to discriminate between these three models on the basis of a feature vector consisting of two population genetic summary statistics [34,74]. (A) The decision surface: red points represent the growth scenario, dark-blue points represent equilibrium, and light-blue points represent contraction. The shaded areas in the background show how additional datapoints would be classified; note the non-linear decision surface separating these three classes. (B) The confusion matrix obtained from measuring classification accuracy on an independent test set. Data were simulated using ms [75], and classification was performed via scikit-learn [76]. All code used to create these figures can be found in a collection of Jupyter notebooks that demonstrate some simple examples of using supervised ML for population genetic inference: https://github.com/kern-lab/popGenMachineLearningExamples.
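The workflow in Figure II (train on labeled simulations, then score an independent test set with a confusion matrix) might look roughly like the following scikit-learn sketch. It rests on several assumptions. The feature values are random draws from invented Gaussian clusters, not ms simulations. The two columns merely stand in for the two summary statistics. A plain `RandomForestClassifier` is used because the caption does not name the exact variant. The authors' actual code is in the notebooks linked above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
classes = ["equilibrium", "contraction", "growth"]

# Hypothetical class means for two summary statistics (placeholders only;
# real feature vectors would be computed from simulated population samples).
means = {
    "equilibrium": (0.0, 0.0),
    "contraction": (1.0, 0.5),
    "growth": (-1.0, -0.5),
}
n_per_class = 300

# Build the labeled data: one row per simulated sample, two features each.
X = np.vstack(
    [rng.normal(loc=means[c], scale=0.5, size=(n_per_class, 2)) for c in classes]
)
y = np.repeat(classes, n_per_class)

# Hold out an independent test set that the classifier never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of randomized decision trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rows are true classes and columns are predicted classes, in `classes` order.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=classes)
print(cm)
print("accuracy:", cm.trace() / cm.sum())
```

Off-diagonal entries of `cm` count test samples assigned to the wrong demographic model, so the matrix shows which pairs of models are confused with one another and not only the overall accuracy.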

Figure III. A Visualization of the S/HIC Feature Vector and Classes. The S/HIC feature vector consists of π [77], θ̂w [74], θ̂H [34], the number (#) of distinct haplotypes, average haplotype homozygosity, H12 and H2/H1 [78,79], ZnS [37], and the maximum value of ω [48]. The expected values of these statistics are shown for genomic regions containing hard and soft sweeps (as estimated from simulated data). Fay and Wu's H [34] and Tajima's D [39] are also shown, though these may be omitted from the vector because they are redundant with π, θ̂w, and θ̂H. To classify a given region, the spatial patterns of these statistics are examined across a genomic window to infer whether the center of the window contains a hard selective sweep (blue shaded area on the left, using statistics calculated within the larger blue window), is linked to a hard sweep (purple shaded area and larger window, left), contains a soft sweep (red, on the right), is linked to a soft sweep (orange, right), or is evolving neutrally (not shown).
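To make the redundancy noted in Figure III concrete, the sketch below computes π, θ̂w, and θ̂H from a small haplotype matrix using their standard textbook definitions, then derives Fay and Wu's H and Tajima's D from them. It is an illustration only, not the S/HIC implementation. The haplotype matrix is invented, the function name `summary_stats` is hypothetical, the derived allele is assumed known at every site, and at least two haplotypes are assumed.

```python
from math import sqrt

def summary_stats(haplotypes):
    """Compute basic diversity statistics from 0/1 haplotypes
    (rows = sampled chromosomes, columns = sites, 1 = derived allele)."""
    n = len(haplotypes)
    # Derived-allele count per site, keeping only segregating sites.
    counts = [sum(col) for col in zip(*haplotypes)]
    counts = [i for i in counts if 0 < i < n]
    s = len(counts)
    pairs = n * (n - 1)

    pi = sum(2 * i * (n - i) for i in counts) / pairs   # nucleotide diversity
    theta_h = sum(2 * i * i for i in counts) / pairs    # Fay and Wu's estimator
    a1 = sum(1 / k for k in range(1, n))
    a2 = sum(1 / k**2 for k in range(1, n))
    theta_w = s / a1                                    # Watterson's estimator

    # Fay and Wu's H is simply a difference of two vector entries.
    fay_wu_h = pi - theta_h

    # Tajima's D is (pi - theta_w) scaled by its estimated standard deviation.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    var = (c1 / a1) * s + (c2 / (a1**2 + a2)) * s * (s - 1)
    tajima_d = (pi - theta_w) / sqrt(var) if var > 0 else float("nan")

    return {
        "pi": pi, "theta_w": theta_w, "theta_h": theta_h,
        "fay_wu_h": fay_wu_h, "tajima_d": tajima_d,
        "n_haplotypes": len({tuple(h) for h in haplotypes}),
    }

# Invented example: 4 haplotypes, 5 segregating sites.
haplotypes = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
stats = summary_stats(haplotypes)
for name, value in stats.items():
    print(name, value)
```

Because H and D are computed entirely from π, θ̂w, θ̂H, and quantities fixed by the sample size and number of segregating sites, a classifier that already receives those three statistics can in principle recover the same signal without H and D as separate features.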