Targets of recent positive selection in Indian populations Irene Gallego Romero Leverhulme Centre for Human Evolutionary Studies Department of Biological.

Presentation transcript:

Targets of recent positive selection in Indian populations Irene Gallego Romero Leverhulme Centre for Human Evolutionary Studies Department of Biological Anthropology

The Indian subcontinent
Probably inhabited by H. sapiens since ~50,000 YBP (coastal route out of Africa; mtDNA and Y-chromosome data)
Drastic population expansion ~35,000 YBP
Decidedly not a single panmictic population: highly stratified and fragmented
  – linguistics, geography, sociocultural practices
Very high incidence of T2D and obesity (predicted highest worldwide by 2030)
Underrepresented in genomic diversity panels

All of which means…
There has been ample time for ‘recent’ evolutionary adaptations to arise
These adaptations have generally gone unexamined
  – Most Indian work to date has examined Indian population history, and has been carried out on mtDNA and the Y chromosome

Allelic trajectories under selection Bamshad & Wooding, Nat Rev Gen, 2003

Selective sweeps and haplotypes Nielsen et al, Nat Rev Gen, 2007

Selective sweeps and haplotypes
Bamshad & Wooding, Nat Rev Gen, 2003
All we are looking for is haplotypes that are uncommonly long for their frequency in the sample.

Quantifying selective sweeps
EHH (extended haplotype homozygosity): probability that two chromosomes in a sample carrying the same allele at a chosen ‘core’ SNP are identical, as a function of distance from that SNP
Other related metrics:
  – iHS: based on the integral under the EHH curve, computed separately for the ancestral and derived alleles, so it is sensitive to allelic ancestry
  – XP-EHH: cross-population EHH, compares population pairs, detects the action of selection in one population but not the other
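The EHH definition can be made concrete with a minimal Python sketch. It assumes phased haplotypes stored as equal-length strings of allele codes; the function name `ehh` and the toy data are illustrative and are not taken from the talk's C++ code.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, site):
    """EHH at `site` for chromosomes carrying `allele` at the `core` SNP.

    Returns the proportion of chromosome pairs that are identical at every
    SNP between the core and `site` (inclusive).
    """
    lo, hi = min(core, site), max(core, site)
    # Keep only carriers of the core allele, trimmed to the core..site span.
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    counts = Counter(carriers)
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


# Toy sample: five phased chromosomes typed at four SNPs.
haps = ["1111", "1111", "1110", "0101", "0010"]
decay = [ehh(haps, core=0, allele="1", site=s) for s in range(4)]
```

In this toy sample EHH stays at 1 until the last SNP, where one carrier diverges. iHS and XP-EHH are built from the area under such decay curves.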

Sample composition
156 Indian samples
  – 31 populations
836 further samples (HGDP-CEPH, our data)
  – Old World, Oceania
  – Split into 8 geographic groups/40 populations
Illumina 650K, 610K chips (~550,000 autosomal SNPs)

India in a global context: F_ST

Computational challenges
Phasing:
  – Inferring haplotypes from genotypes
Calculating test statistics:
  – iHS and XP-EHH
Data post-processing:
  – ~550,000 data points per population per statistic
  – SNPs to genes/genomic regions

Phasing
Likelihood-based methods
550,000 SNPs per individual, ~1,000 individuals
Phasing chromosome 2 (densest, ~50,000 SNPs) can take over a week
Computationally intensive, and requires a lot of disk space for storing iterations, so cannot use CamGrid
  – use elephant.bio.cam.ac.uk, simultaneously run multiple chromosomes
  – < 2 weeks to phase all autosomal chromosomes
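Why phasing is expensive can be seen from a small Python sketch that enumerates every haplotype pair consistent with one unphased genotype. This is not the likelihood-based method used in the talk, only an illustration of the search space such methods must weigh. It assumes genotypes are coded as 0, 1 or 2 copies of the alternate allele, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    `genotype` holds 0/1/2 alternate-allele counts per SNP. Homozygous
    sites are fixed; each heterozygous site can be phased two ways.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # 0 -> 0, 2 -> 1, het sites filled below
    pairs = []
    # Pin the first heterozygous site so mirrored pairs are not counted twice.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        assignment = (0,) + bits if het else ()
        a, b = base[:], base[:]
        for i, bit in zip(het, assignment):
            a[i], b[i] = bit, 1 - bit
        pairs.append((tuple(a), tuple(b)))
    return pairs


pairs = compatible_haplotype_pairs([1, 2, 1, 0])
```

A genotype with k heterozygous sites has 2^(k-1) possible resolutions. With tens of thousands of SNPs per chromosome this is far too many to enumerate, which is why statistical phasing methods and long run times are needed.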

Computing XP-EHH and iHS
Compute a value for each statistic for each SNP for each population or population pair (~10 per test)
  – >5,000,000 data points for each statistic
Not computationally intensive, small files
  – easily run on CamGrid (each chromosome separately)
  – 4-5 hours to analyse a single population
C++ code

Data processing
Data sets this big suffer from high false discovery rates
Multiple testing corrections can be too stringent
Need to reduce the number of data points
  – windowing approach: break the genome into non-overlapping, contiguous 200 kb windows, test significance at that level
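The windowing step might look like the following Python sketch (the talk used hand-written R). The per-window summary, the fraction of SNPs with an absolute score above 2, is an assumption borrowed from common practice; the slides do not say which window-level statistic was tested.

```python
from collections import defaultdict

WINDOW = 200_000  # window size in base pairs, as on the slide


def window_scores(snps, threshold=2.0):
    """Summarise per-SNP scores in non-overlapping 200 kb windows.

    `snps` is an iterable of (chromosome, position, score) tuples.
    Returns {(chromosome, window_index): (n_snps, fraction_extreme)}.
    The |score| > threshold summary is an assumed choice, not the talk's.
    """
    tally = defaultdict(lambda: [0, 0])  # [SNP count, extreme SNP count]
    for chrom, pos, score in snps:
        key = (chrom, pos // WINDOW)
        tally[key][0] += 1
        if abs(score) > threshold:
            tally[key][1] += 1
    return {key: (n, extreme / n) for key, (n, extreme) in tally.items()}


snps = [("1", 150_000, 2.5), ("1", 199_999, 0.1),
        ("1", 200_000, -3.0), ("2", 10, 0.0)]
windows = window_scores(snps)
```

Windows can then be ranked by their summary value, so that significance is assessed over roughly thirteen thousand windows rather than over 550,000 individual SNPs.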

Windowing
Done using R
  – Hand-written code, no extra packages
  – Requires large amounts of RAM (> 10 GB), so not suitable for CamGrid
  – Again, use elephant
  – Roughly 2 hours per population
From 550,000 SNPs to 13,274 windows
  – Spanning ~20,000 genes
  – How to tease out biological meaningfulness?

Separate signals in North and South India

From SNPs to genes and beyond
Selection acts on phenotypes, not genes
Mining of ontologies and other databases
  – Gene Ontology terms, Mammalian Phenotype terms, other annotations
  – (not actually done by high-throughput methods, but I know better by now)
  – Although it still requires a lot of manual curation
Map biological function to windows, test for overrepresentation of categories relative to expectations
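One common way to test for overrepresentation is a one-sided hypergeometric test, sketched below in Python. The slides do not state which test was used, so this choice and the function name are assumptions.

```python
from math import comb


def hypergeom_upper_tail(total, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` windows without replacement.

    total     -- number of windows in the genome
    annotated -- windows carrying the category of interest
    selected  -- windows flagged as candidate targets of selection
    overlap   -- flagged windows that also carry the category
    """
    denominator = comb(total, selected)
    numerator = sum(
        comb(annotated, i) * comb(total - annotated, selected - i)
        for i in range(overlap, min(annotated, selected) + 1)
    )
    return numerator / denominator


# Toy numbers: 10 windows, 4 annotated, 3 flagged, 2 of those annotated.
p_value = hypergeom_upper_tail(total=10, annotated=4, selected=3, overlap=2)
```

Because many categories are tested at once, the resulting p-values would still need a multiple-testing correction.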

A lot of hours later…

Acknowledgements
Toomas Kivisild, Katie Siddle (LCHES)
Jenny Barna
Mait Metspalu, Georgi Hudjashov, Gyaneshwer Chaubey (University of Tartu)
Joe Pickrell (University of Chicago)
Richard Lempicki (NIH)

Other genome-wide statistics
Genome-wide F_ST and H_S are both computed with simple R scripts
  – Hand-written code
  – ~5 minutes per population
  – The slowest bit is reading the data in
  – Use elephant.bio.cam.ac.uk
AAF spectrum slopes are a bit more involved
  – To correct for sample size effects, resample every locus 1,000 times from its own allelic distribution
  – ~1 hour per population, requires high RAM, use R
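For a single biallelic SNP, H_S and F_ST can be computed from subpopulation allele frequencies as in the Python sketch below (the talk used R). It uses the textbook definition F_ST = (H_T - H_S) / H_T with equal weighting of subpopulations. The slides do not say which estimator or weighting was used, so both are assumptions.

```python
def hs_ht_fst(freqs):
    """Return (H_S, H_T, F_ST) for one biallelic SNP.

    `freqs` lists the frequency of one allele in each subpopulation;
    subpopulations are weighted equally (an assumed simplification).
    """
    k = len(freqs)
    # H_S: mean expected heterozygosity within subpopulations.
    hs = sum(2 * p * (1 - p) for p in freqs) / k
    # H_T: expected heterozygosity of the pooled population.
    p_bar = sum(freqs) / k
    ht = 2 * p_bar * (1 - p_bar)
    fst = (ht - hs) / ht if ht > 0 else 0.0
    return hs, ht, fst


hs, ht, fst = hs_ht_fst([0.2, 0.8])
```

A genome-wide value can be obtained by summing H_S and H_T over loci before taking the ratio, which is generally preferred to averaging per-SNP F_ST values.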

Ancestral allele frequency slopes