The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.

Slides:



Advertisements
Similar presentations
Imputation for GWAS 6 December 2012.
Advertisements

Analysis of imputed rare variants
1000G Phase 1 Release chr20 call sets Ryan Poplin Genome Sequencing and Analysis Medical and Population Genetics January 25, 2011.
Considerations for Analyzing Targeted NGS Data HLA
Julia Krushkal 4/11/2017 The International HapMap Project: A Rich Resource of Genetic Information Julia Krushkal Lecture in Bioinformatics 04/15/2010.
Corso di Genomica lezione laurea magistrale Biotecnologia Industriale Giovedì 9 dicembre 2010 aula 6 orario : Martedì ore
Shifting the Paradigm for Genetic Sample Resources J. Claiborne Stephens, CEO, Genomics GPS Ron Bromley, Ena Bromley, BSSI Ted Kalbfleisch, Intrepid BioInformatics.
Genome-wide Association Study Focus on association between SNPs and traits Tendency – Larger and larger sample size – Use of more narrowly defined phenotypes(blood.
Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data Kai Ye
Understanding GWAS Chip Design – Linkage Disequilibrium and HapMap Peter Castaldi January 29, 2013.
Lessons learnt from the 1000 Genomes Project about sequencing in populations Gil McVean Wellcome Trust Centre for Human Genetics and Department of Statistics,
Toward a unified view of human genetic variation Gabor Marth Boston College Biology Department on behalf of the International 1000 Genomes Project.
Mark de Pristo But 1-2% of 3 billion is still a lot! What fraction of human genetic variation has now been described?
1000G Pilot 3 Progress in silico analysis and comparison to experimental validation Gabor Marth (Boston College) + A + L Kiran Garimella (Broad Institute)
Variant discovery Different approaches: With or without a reference? With a reference – Limiting factors are CPU time and memory required – Crossbow –
Biology and Bioinformatics Gabor T. Marth Department of Biology, Boston College BI820 – Seminar in Quantitative and Computational Problems.
Computational Tools for Finding and Interpreting Genetic Variations Gabor T. Marth Department of Biology, Boston College
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
SNP Resources: Finding SNPs, Databases and Data Extraction Debbie Nickerson NIEHS SNPs Workshop.
Genomewide Association Studies.  1. History –Linkage vs. Association –Power/Sample Size  2. Human Genetic Variation: SNPs  3. Direct vs. Indirect Association.
Polymorphisms – SNP, InDel, Transposon BMI/IBGP 730 Victor Jin, Ph.D. (Slides from Dr. Kun Huang) Department of Biomedical Informatics Ohio State University.
Course Overview Personalized Medicine: Understanding Your Own Genome Fall 2014.
Towards Personal Genomics Tools for Navigating the Genome of an Individual Saul A. Kravitz J. Craig Venter Institute Rockville, MD Bio-IT World 2008.
The Phase 1 Variant Set and Future Developments
Whole Exome Sequencing for Variant Discovery and Prioritisation
GeVab: Genome Variation Analysis Browsing Server Korean BioInformation Center, KRIBB InCoB2009 KRIBB
HapMap: application in the design and interpretation of association studies Mark J. Daly, PhD on behalf of The International HapMap Consortium.
Computational research for medical discovery at Boston College Biology Gabor T. Marth Boston College Department of Biology
Medical variations Gabor T. Marth Boston College Biology Department BI543 Fall 2013 February 5, 2013.
Loss-of-co-Homozygosity mapping and exome sequencing of a Syrian pedigree identified the candidate causal mutation associated with rheumatoid arthritis.
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
Molecular & Genetic Epi 217 Association Studies
A Genome-wide association study of Copy number variation in schizophrenia Andrés Ingason CNS Division, deCODE Genetics. Research Institute of Biological.
The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.
Association mapping: finding genetic variants for common traits & diseases Manuel Ferreira Queensland Institute of Medical Research Brisbane Genetic Epidemiology.
Jeff O’ConnellInterbull annual meeting, Orlando, FL, July 2015 (1) J. R. O’Connell 1 and P. M. VanRaden 2 1 University of Maryland School of Medicine,
E XOME SEQUENCING AND COMPLEX DISEASE : practical aspects of rare variant association studies Alice Bouchoms Amaury Vanvinckenroye Maxime Legrand 1.
Recombination based population genomics Jaume Bertranpetit Marta Melé Francesc Calafell Asif Javed Laxmi Parida.
California Pacific Medical Center
HW2: exome sequencing and complex disease Jacquemin Jonathan de Bournonville Sébastien.
The International Consortium. The International HapMap Project.
Motivations to study human genetic variation
PanMap Mapping Genomic Variation in Western Chimpanzees
Copyright OpenHelix. No use or reproduction without express written consent1.
Resources at HapMap.Org HapMap3 Tutorial Marcela K. Tello-Ruiz Cold Spring Harbor Laboratory.
Variant calling: number of individuals vs. depth of read coverage Gabor T. Marth Boston College Biology Department 1000 Genomes Meeting Cold Spring Harbor.
Computational Biology and Genomics at Boston College Biology Gabor T. Marth Department of Biology, Boston College
Current Data And Future Analysis Thomas Wieland, Thomas Schwarzmayr and Tim M Strom Helmholtz Zentrum München Institute of Human Genetics Geneva, 16/04/12.
Signals of natural selection in the HapMap project data The International HapMap Consortium Gil McVean Department of Statistics, Oxford University.
Analysis of Next Generation Sequence Data BIOST /06/2015.
Canadian Bioinformatics Workshops
A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.
Recent Advances in Genomic Science Julian Sampson Institute of Medical Genetics, Cardiff.
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
1 Finding disease genes: A challenge for Medicine, Mathematics and Computer Science Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics.
WHI Imputation. Target GWAS data WHIMS +, ~5,000-6,000 samples, Illumina Omni express GRANET, ~5,000 samples, Illumina Omni Hipfx, ~4,000-5,000 samples,
Inferences on human demographic history using computational Population Genetic models Gabor T. Marth Department of Biology Boston College Chestnut Hill,
From Reads to Results Exome-seq analysis at CCBR
Interpreting exomes and genomes: a beginner’s guide
Gil McVean Department of Statistics
The Genome Diversity in Africa Project
Week 5 Theory and application for setting up an RNA-Seq pipeline
Deep Whole-Genome Sequencing of 100 Southeast Asian Malays
Population Genetic Inference from Personal Genome Data: Impact of Ancestry and Admixture on Human Genomic Variation  Jeffrey M. Kidd, Simon Gravel, Jake.
Catarina D. Campbell, Nick Sampas, Anya Tsalenko, Peter H
Haplotypes When the presence of two or more polymorphisms on a single chromosome is statistically correlated in a population, this is a haplotype Example.
Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium  Christopher S. Carlson,
Trevor J. Pemberton, Chaolong Wang, Jun Z. Li, Noah A. Rosenberg 
Rare Variant Association Tests for Multiple Ancestries Using Common Controls Megan Sorenson July 29, 2019.
Presentation transcript:

The 1000 Genomes Project Gil McVean Department of Statistics, Oxford

What is the 1000 Genomes Project? A catalogue of all types of genetic variation, including rare variants (c. 1% frequency) obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest A large international collaboration –UK, USA, China, Germany An exploration of the use of next-generation technologies for population-scale genome sequencing A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies

Samples for the main project UK FIN TSI ESP CEU JPT CHB CHS DAI KVT GMB GHN YRI MLW LWK Major population groups comprised of subpopulations of c. 100 each MXL ASW new CMB PRO

Population-scale genome sequencing Haplotypes 2x 10x

Pilot experiments Pilot 1 –Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and CHB+JPT Pilot 2 –High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI) Pilot 3 –Exons from 1000 genes to 20x in c samples (largely European) Complete!

The 1000G Low Coverage Pilot 185 individuals from 4 populations – CEU (63), CHB (30), JPT (30), YRI (62) PopulationTechnologyN IndividualsMapped Bases (billions) Mean Coverage / Individual CEUSLX SOLiD CHBSLX JPTSLX YRISLX SOLiD Combined1851,

Even still, at lot of data isn’t much In the Pilot 1 sample 1 tera-basepairs leaves the CEU with… –6% of genotypes with 0 reads –16% of genotypes with < 2 reads –29% of genotypes with < 3 reads

ftp.1000genomes.ebi.ac.uk Pilot release expected Nov/Dec 2009 ftp-trace.ncbi.nih.gov/1000genomes/ftp

What has the project already generated?

Over 9 millions novel SNPs Total 17.2 M SNPs called Previously ~12M SNPs “known” (dbSNP 129) –7.9M confirmed –9.2M novel CEUYRI CHB+JPT CEU YRI CHB+JPT Total SNPsNovel SNPs Le Quang

A near complete record of common SNPs Durbin, Le Quang

A set of accurate genotypes This is about where simulations suggest we should be with 2-4x on 60 samples Note this quality is much much better than if calls were made marginally Durbin, Le Quang

Many novel indels and larger structural variants

Zam Iqbal Up to 50kb Novel sequence from de novo assembly

Some interesting biology - variation in SNP density

Some more interesting biology – high Fst SNPs Ryan Hernandez, Adam Auton

Even more interesting biology – loss of function mutations Daniel MacArthur

A robust and modular pipeline for analysis of population-scale sequence data

An efficient format for storing aligned reads and a set of tools to manipulate and view the files SAM/BAM format for storing (aligned) reads Bioinformatics (2009)

An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files

Using the 1000G data now

IMPUTE Genotypes in additional samples from standard product Reference panel (1000G) Imputation … … … … … … … … … … … … Imputed genotypes

Imputation performance across SNP types from P1 (CEU) from Affy 500k Annotation# SNPsInfo measure All414, MAF < 5%102, MAF > 5%312, UCSC Genes6, Depth < 1003,153 (0.7%)0.611 SimpRpts25, SimpRpts + Depth < 1001,652 (6.5%)0.671 SegDups24, SegDups + Depth < (2.7%)0.388 Jonathan Marchini

Looking forward... Already have data generated for c. 200 more Europeans –Data generation largely complete by mid 2010 Much work still to be done on accurate inference of all types of variation from NGS data Data already proven useful for a number of projects – please use it

Thanks to the many...