Supervised Machine Learning for Population Genetics: A New Paradigm

Slides:

Advertisements

Similar presentations

Bayesian Networks. Male brain wiring Female brain wiring.

Advertisements

Overview of Supervised Learning Overview of Supervised Learning2 Outline Linear Regression and Nearest Neighbors method Statistical Decision.

1 E. Fatemizadeh Statistical Pattern Recognition.

BAYESIAN LEARNING. 2 Bayesian Classifiers Bayesian classifiers are statistical classifiers, and are based on Bayes theorem They can calculate the probability.

Results for all features Results for the reduced set of features

Signatures of Selection

The Elements of Statistical Learning

Basic machine learning background with Python scikit-learn

Soyoun Kim, Jaewon Hwang, Daeyeol Lee Neuron

Evaluating classifiers for disease gene discovery

Overview of Supervised Learning

Attention Narrows Position Tuning of Population Responses in V1

Computational Tools for Stem Cell Biology

Joshua J. Foster, Emma M. Bsales, Russell J. Jaffe, Edward Awh

Varodom Charoensawan, Derek Wilson, Sarah A. Teichmann

Araceli Ramirez-Cardenas, Maria Moskaleva, Andreas Nieder

Volume 28, Issue 7, Pages e5 (April 2018)

Caspar M. Schwiedrzik, Winrich A. Freiwald Neuron

Alicia R. Martin, Christopher R. Gignoux, Raymond K

Optimal Degrees of Synaptic Connectivity

Mismatch Receptive Fields in Mouse Visual Cortex

Henry R. Johnston, David J. Cutler

A twin approach to unraveling epigenetics

Summary of the population genetics measures used in this study.

Brian K. Maples, Simon Gravel, Eimear E. Kenny, Carlos D. Bustamante

Volume 25, Issue 20, Pages (October 2015)

Differential Impact of Behavioral Relevance on Quantity Coding in Primate Frontal and Parietal Neurons Pooja Viswanathan, Andreas Nieder Current Biology

Yukiyasu Kamitani, Frank Tong Current Biology

Signatures of Purifying and Local Positive Selection in Human miRNAs

Biased Gene Conversion Skews Allele Frequencies in Human Populations, Increasing the Disease Burden of Recessive Alleles Joseph Lachance, Sarah A. Tishkoff

CA3 Retrieves Coherent Representations from Degraded Input: Direct Evidence for CA3 Pattern Completion and Dentate Gyrus Pattern Separation Joshua P.

A Switching Observer for Human Perceptual Estimation

Caspar M. Schwiedrzik, Winrich A. Freiwald Neuron

Deciphering Cortical Number Coding from Human Brain Activity Patterns

Volume 24, Issue 13, Pages (July 2014)

Dynamic Coding for Cognitive Control in Prefrontal Cortex

Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data Gao T. Wang, Bo Peng, Suzanne M. Leal The.

Human Evolution: Thrifty Genes and the Dairy Queen

Volume 10, Issue 4, Pages (October 2011)

Parametric Methods Berlin Chen, 2005 References:

Guifen Chen, Daniel Manson, Francesca Cacucci, Thomas Joseph Wills

A Switching Observer for Human Perceptual Estimation

Intratumoral Heterogeneity of the Epigenome

Redundancy in the Population Code of the Retina

Origin and Function of Tuning Diversity in Macaque Visual Cortex

Volume 84, Issue 2, Pages (October 2014)

Sriram Sankararaman, Swapan Mallick, Nick Patterson, David Reich

Mosquitoes Use Vision to Associate Odor Plumes with Thermal Targets

Fast Sequences of Non-spatial State Representations in Humans

Volume 89, Issue 6, Pages (March 2016)

Amy E. Skerry, Rebecca Saxe Current Biology

Jeffrey A. Fawcett, Hideki Innan Trends in Genetics

Complex Signatures of Natural Selection at the Duffy Blood Group Locus

Molecular Analysis of the β-Globin Gene Cluster in the Niokholo Mandenka Population Reveals a Recent Origin of the βS Senegal Mutation Mathias Currat,

Timescales of Inference in Visual Adaptation

Volume 23, Issue 3, Pages (April 2018)

Jonathan K. Pritchard, Joseph K. Pickrell, Graham Coop Current Biology

Pier Francesco Palamara, Todd Lencz, Ariel Darvasi, Itsik Pe’er

Sébastien Marti, Jean-Rémi King, Stanislas Dehaene Neuron

Encoding of Stimulus Probability in Macaque Inferior Temporal Cortex

Genotype distance distribution summaries for soft and hard sweeps.

Volume 23, Issue 21, Pages (November 2013)

Sriram Sankararaman, Swapan Mallick, Nick Patterson, David Reich

Volume 74, Issue 1, Pages (April 2012)

Computational Tools for Stem Cell Biology

Volume 28, Issue 7, Pages e5 (April 2018)

Yafei Mao, Evan P. Economo, Noriyuki Satoh Current Biology

Evaluating the Effects of Imputation on the Power, Coverage, and Cost Efficiency of Genome-wide SNP Platforms Carl A. Anderson, Fredrik H. Pettersson,

Volume 27, Issue 20, Pages e4 (October 2017)

Volume 28, Issue 10, Pages e7 (September 2019)

Presentation transcript:

Supervised Machine Learning for Population Genetics: A New Paradigm Daniel R. Schrider, Andrew D. Kern Trends in Genetics Volume 34, Issue 4, Pages 301-312 (April 2018) DOI: 10.1016/j.tig.2017.12.005 Copyright © 2017 The Author(s) Terms and Conditions

Figure I An Imaginary Training Set of Two Types of Fruit, Oranges (Orange Filled Points) and Apples (Green Filled Points), Where Two Measurements Were Made for Each Fruit. With a training set in hand we can use supervised ML to learn a function that can differentiate between classes (broken line) such that the unknown class of new datapoints (unlabeled points above) can be predicted. Trends in Genetics 2018 34, 301-312DOI: (10.1016/j.tig.2017.12.005) Copyright © 2017 The Author(s) Terms and Conditions

Figure II An Example Application of Supervised ML to Demographic Model Selection. In this example population samples experiencing constant population size (equilibrium), a recent instantaneous population decline (contraction), or recent instantaneous expansion (growth) were simulated. A variant of a random forest classifier [51] was trained, which is an ensemble of semi-randomly generated decision trees, to discriminate between these three models on the basis of a feature vector consisting of two population genetic summary statistics [34,74]. (A) The decision surface: red points represent the growth scenario, dark-blue points represent equilibrium, and light-blue points represent contraction. The shaded areas in the background show how additional datapoints would be classified – note the non-linear decision surface separating these three classes. (B) The confusion matrix obtained from measuring classification accuracy on an independent test set. Data were simulated using ms [75], and classification was performed via scikit-learn [76]. All code used to create these figures can be found in a collection of Jupyter notebooks that demonstrate some simple examples of using supervised ML for population genetic inference provided here: https://github.com/kern-lab/popGenMachineLearningExamples. Trends in Genetics 2018 34, 301-312DOI: (10.1016/j.tig.2017.12.005) Copyright © 2017 The Author(s) Terms and Conditions

Figure III A Visualization of S/HIC Feature Vector and Classes. The S/HIC feature vector consists of π [77], θˆw [74], θˆH [34], the number (#) of distinct haplotypes, average haplotype homozygosity, H12 and H2/H1 [78,79], ZnS [37], and the maximum value of ω [48]. The expected values of these statistics are shown for genomic regions containing hard and soft sweeps (as estimated from simulated data). Fay and Wu’s H [34] and Tajima’s D [39] are also shown, though these may be omitted from the vector because they are redundant with π, θˆw, and θˆH. To classify a given region the spatial patterns of these statistics are examined across a genomic window to infer whether the center of the window contains a hard selective sweep (blue shaded area on the left, using statistics calculated within the larger blue window), is linked to a hard sweep (purple shaded area and larger window, left), contains a soft sweep (red, on the right), is linked to soft sweep (orange, right), or is evolving neutrally (not shown). Trends in Genetics 2018 34, 301-312DOI: (10.1016/j.tig.2017.12.005) Copyright © 2017 The Author(s) Terms and Conditions