A tutorial for Tractor Simon Gravel.



Tractor goal
Find the gene flow models that best fit observed patterns of local ancestry. More specifically, model the distribution of ancestry tract lengths.

Background
Most individuals derive a substantial proportion of their recent ancestry from two or more statistically distinct populations. When the populations are distinct enough, it is possible to infer the local ancestry along the genome.
Available methods: HapMix, Lamp, PCAdmix, Saber, SupportMix, …

Typical setup for local ancestry inference
Panel individuals are proxies for the source populations. The panel individuals are likely to be admixed themselves, and there is no clear cutoff. In the following, “Admixed” simply means the samples for which we are attempting the local ancestry inference.
[Figure: panel individuals and “Admixed” individuals]

PCAdmix: local ancestry assignment using PCA by window + HMM
Best-case scenario: the panels are well separated and the sample clusters with one of them.
[Figure: Panel 1, Panel 2, Panel 3, and the sample]
More typical case (if we’re lucky):
[Figure: Panel 1, Panel 2, Panel 3, and the sample]
Kidd*, Gravel* et al (in Review)

Modeling the admixture process
Kidd*, Gravel* et al (in Review)

Tractor assumptions
Local ancestry assignments are accurate hard calls. In PCAdmix, this means using a Viterbi decoding algorithm.
The “admixed” population is panmictic, without population structure.
Recombination is uniform across populations.
There has been little drift since admixture began.

Recombination model in Tractor
Tractor uses a simplified Markovian model of recombination. This is the approximation of least concern.

Modeling ancestry tracts using a Markov model: migration pulse
[Figure: a simulated chromosome with local ancestry assignments; label T1]
Each recombination occurs independently, giving rise to a Markov model.
Gravel (in Review)
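The Markov picture for a single pulse can be illustrated with a small simulation. This is only a sketch, not Tractor's implementation: the function name `simulate_tracts`, the breakpoint rate of 10 per Morgan, and the 80/20 ancestry proportions are all hypothetical choices made for illustration. It assumes breakpoints fall independently along the chromosome and that the ancestry after each breakpoint is drawn independently of what came before.

```python
import random

def simulate_tracts(length_morgans, rate_per_morgan, proportions, rng):
    """Simulate ancestry tracts along one chromosome as a Markov chain.

    Breakpoints are separated by exponentially distributed distances with
    rate `rate_per_morgan` (an assumed value), and the ancestry after each
    breakpoint is drawn independently from `proportions`. Adjacent segments
    that draw the same ancestry are merged into a single tract.
    """
    labels = list(proportions)
    weights = [proportions[k] for k in labels]
    tracts = []  # list of (ancestry, start, end), positions in Morgans
    pos = 0.0
    while pos < length_morgans:
        ancestry = rng.choices(labels, weights=weights)[0]
        end = min(pos + rng.expovariate(rate_per_morgan), length_morgans)
        if tracts and tracts[-1][0] == ancestry:
            # Same ancestry on both sides of the breakpoint: extend the tract.
            tracts[-1] = (ancestry, tracts[-1][1], end)
        else:
            tracts.append((ancestry, pos, end))
        pos = end
    return tracts

# Hypothetical example: a 2 Morgan chromosome with 80% YRI and 20% CEU ancestry.
tracts = simulate_tracts(2.0, 10.0, {"YRI": 0.8, "CEU": 0.2}, random.Random(1))
for ancestry, start, end in tracts:
    print(f"{ancestry}\t{start:.3f}\t{end:.3f}")
```

Because segments with the same ancestry are merged, the observed tracts of the majority ancestry tend to be longer than those of the minority ancestry, which is the kind of signal the tract length distribution carries.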

More complex demographic histories can be modeled via a multiple-state Markov model
The entire demographic history is contained in the transition matrix. Tractor calculates it for you.

Markov model vs simulation
Gravel (in Review)

The goal is now to use real data, generate these histograms, and fit some demographic models.

Assuming you have already run a local ancestry inference method
The day starts with bed files containing the local ancestry calls:

chrom   begin       end         assignment   cmBegin   cmEnd
chrX    0           2717733     UNKNOWN      0.0       20.95
chrX    2717733     152359442   YRI          20.95     200.66
chrX    152359442   154913754   UNKNOWN      200.66    202.23
chr13   0           18110261    UNKNOWN      0.0       0.19
chr13   18110261    28539742    YRI          0.19      22.193
chr13   28539742    28540421    UNKNOWN      22.193    22.193
chr13   28540421    91255067    CEU          22.193    84.7013
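To make the file layout concrete, here is a minimal sketch of reading such calls and computing tract lengths in centimorgans. It is not part of Tractor: the `Tract` record and `parse_bed_lines` are hypothetical names, and the sketch assumes whitespace-separated columns in the order shown in the header, with an optional header row.

```python
from collections import namedtuple

# Hypothetical record type mirroring the six columns of the bed file.
Tract = namedtuple("Tract", "chrom begin end assignment cm_begin cm_end")

def parse_bed_lines(lines):
    """Parse whitespace-separated local ancestry calls, skipping the header."""
    tracts = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "chrom":
            continue  # blank line or header row
        chrom, begin, end, assignment, cm_begin, cm_end = fields[:6]
        tracts.append(Tract(chrom, int(begin), int(end), assignment,
                            float(cm_begin), float(cm_end)))
    return tracts

example = """\
chrom begin end assignment cmBegin cmEnd
chr13 0 18110261 UNKNOWN 0.0 0.19
chr13 18110261 28539742 YRI 0.19 22.193
chr13 28539742 28540421 UNKNOWN 22.193 22.193
chr13 28540421 91255067 CEU 22.193 84.7013
"""

tracts = parse_bed_lines(example.splitlines())

# Tract lengths in cM for segments with a known ancestry assignment.
known = [(t.assignment, t.cm_end - t.cm_begin)
         for t in tracts if t.assignment != "UNKNOWN"]
print(known)
```

The genetic positions (cmBegin, cmEnd) rather than the physical ones are used for the lengths, since the tract length distribution is modeled in terms of recombination.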

Organizing files in a directory
We suppose that genomes are phased. One way to organize this is to have two bed files per individual (_A and _B), and to keep all individuals in one directory.

Tractor is object-oriented
Definitions are in tractor.py: tract < chrom < chropair < indiv < population
Import a complete population and calculate statistics:
    pop = tractor.population(names=names, fname=(directory, "", ".viterbi.bed.cm"), selectchrom=chroms)
    (bins, data) = pop.get_global_tractlengths(npts=50)
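For intuition about what a binned tract length summary looks like, the sketch below builds one with NumPy. It does not reproduce `get_global_tractlengths`: the helper `tract_length_histogram`, the interpretation of `npts` as the number of equal-width bins, and the example lengths are assumptions made for illustration.

```python
import numpy as np

def tract_length_histogram(lengths_cm, max_cm, npts=50):
    """Count tracts in `npts` equal-width length bins between 0 and `max_cm`.

    Treating `npts` as a number of bins is an assumption of this sketch.
    """
    bins = np.linspace(0.0, max_cm, npts + 1)  # npts bins need npts + 1 edges
    counts, _ = np.histogram(lengths_cm, bins=bins)
    return bins, counts

# Hypothetical tract lengths in cM for one ancestry.
lengths = [22.003, 62.5083, 5.0, 5.5]
bins, counts = tract_length_histogram(lengths, max_cm=100.0, npts=50)

# Print only the occupied bins.
for left, right, n in zip(bins[:-1], bins[1:], counts):
    if n:
        print(f"{left:5.1f} to {right:5.1f} cM: {n}")
```

In practice one such histogram per source population would be compared against the counts predicted by a demographic model.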

Defining a model
Tractor can take arbitrary time-dependent migration rates m from K populations. Migration rates are organized as an array with entries m_tk, indexed by generation t (out of T) and population k (out of K).
Way too many parameters to optimize!!

Defining a model
We need to choose a model with a short vector of parameters a, and define two functions:
    def f(a):        returns the KxT migration array
    def control(a):  returns a value < 0 if the parameters are outside the allowed range
Tons of 2- and 3-pop models are predefined, and I’m happy to help with model-building.
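As an illustration of such a pair of functions, here is a minimal sketch of a two-population, single-pulse model. The names `pulse_model` and `pulse_control`, the parameter vector `a = (m0, T)`, and the array layout (one row per generation, one column per population, with the founding generation in the last row) are assumptions of this sketch rather than Tractor's documented convention; transpose the array if your version expects the other orientation.

```python
import numpy as np

def pulse_model(a):
    """Migration array for a single founding pulse from two populations.

    a = (m0, T): m0 is the founding fraction from population 0, and T is the
    time of the pulse in generations. Row t holds the migration rates t
    generations ago (assumed layout); all rows except the last are zero.
    """
    m0, T = a
    T = int(round(T))
    mig = np.zeros((T + 1, 2))
    mig[-1] = [m0, 1.0 - m0]  # the population is founded T generations ago
    return mig

def pulse_control(a):
    """Return a negative value if the parameters are outside their range."""
    m0, T = a
    # Each term is negative when its constraint is violated:
    # 0 <= m0 <= 1 and T >= 1.
    return min(m0, 1.0 - m0, T - 1.0)

mig = pulse_model((0.8, 10))
print(mig.shape, mig[-1], pulse_control((0.8, 10)))
```

Two parameters describe the whole history here, which is the point of choosing a model: the optimizer searches over `a` rather than over every entry of the migration array.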

Optimization steps
Decide on the starting values for the parameters:
    startparams = numpy.array([0.897887, 0.172344, 0.922907, 0.120098, 0.111489, 0.05883])
Decide how many bins of short tracts to ignore (the cutoff is typically 1 or 2).
You’re all set:
    xopt = tractor.optimize_cob(startparams, bins, Ls, data, nind, func, outofbounds_fun=bound, cutoff=1, epsilon=1e-2)
Hopefully, you get something like:

If optimization fails to reliably converge
Use the improved optimizer: optimize_cob_fracs
Restart with different starting parameters…

Comparing different models
Use nested models and perform a likelihood ratio test.
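A likelihood ratio test between two nested models can be sketched as follows. This is a generic illustration, not a Tractor function: `likelihood_ratio_test` and `chi2_sf` are hypothetical helpers, the log-likelihood values are made up, and the p-value relies on the usual asymptotic chi-square approximation with degrees of freedom equal to the number of extra parameters in the fuller model, which may not hold in every setting.

```python
import math

def chi2_sf(x, df):
    """Survival function of the chi-square distribution for integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    z = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))  # df = 1 base case
    else:
        a, q = 1.0, math.exp(-z)             # df = 2 base case
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_nested, loglik_full, extra_params):
    """Return the test statistic and its asymptotic chi-square p-value."""
    stat = 2.0 * (loglik_full - loglik_nested)
    return stat, chi2_sf(max(stat, 0.0), extra_params)

# Hypothetical log-likelihoods: the fuller model has two extra parameters.
stat, p = likelihood_ratio_test(-1234.5, -1230.0, extra_params=2)
print(f"statistic = {stat:.2f}, p = {p:.4f}")
```

A small p-value suggests that the extra parameters of the fuller model improve the fit by more than would be expected by chance.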