The 1000 Genomes Project
Gil McVean, Department of Statistics, Oxford



Questions
- Why do we need a comprehensive map of human genetic variation?
- How will data from the 1000G project be used in medical genetics?
- How can we start accessing the hard part of the genome?
- What has the 1000G project told us about the genetic structure of populations?

The role of the 1000G Project in medical genetics
- A representation of ‘normal’ variation
  - 95% of variants at 1% frequency in populations of interest
- A resource for increasing the power of existing genome-wide association studies
- A development platform for sequencing / statistical / computational technologies
- A forum for establishing best practice and standards in genome sequencing

Samples for the 1000 Genomes Project
Major population groups, each made up of subpopulations of c. 100 samples:
GBR, FIN, TSI, IBS, CEU, JPT, CHB, CHS, CDX, KHV, GWB, GHN, YRI, MAB, LWK, MXL, CLM, ASW, AJM, ACB, PEL, PUR
Samples from S. Asia

Three key components to the 1000G Project design

1. Population-scale genome sequencing
[Figure: haplotypes; coverage labels 2x and 10x]

2. Capturing diversity by sequencing related populations

3. Integrating data types
- Low-coverage genome-wide data
- Exome
- SNP genotype
- Array CGH
- Genome sequence

Open and unrestricted access

What did we learn from the pilot?

270 genomes in the pilot project
TSI*, CEU, JPT, CHB, CHS*, YRI, LWK*
*Exon pilot only

The 1000G pilots
[Figure: coverage labels 40x, 2-4x, 50x]

Lesson 1. The low-coverage model works for variant discovery

A near-complete record of common variants (CEU)

The low-coverage model works almost as well for discovery as exome sequencing
[Figure: number of sites found in exome data and number of sites in low-coverage data, plotted against the count of the alternate allele in low coverage (in 688 shared samples)]

Low-coverage sequencing can detect structural variants

Lesson 2. The low-coverage model works for SNP genotyping

A set of accurate genotypes/haplotypes (CEU)

[Figure panels: Marginal calling; Joint calling]

Lesson 3. The genome has a large grey area where variant calling is hard

Lesson 4. Joint calling of different variant types substantially improves the quality of calls

Where is the project now?

Phase 1: GBR, FIN, TSI, CEU, JPT, CHB, CHS, YRI, LWK, MXL, CLM, ASW, PUR, IBS

New data types

Data type               Pilot                Phase 1 (now)
Deep genomes            6                    -
Low coverage genomes    179                  1,094
Deep exonic             697 (1,000 genes)    977 (full exomes)
Chip genotypes          -                    1,542 (OMNI2.5)

New variants

Variant         Pilot     Phase 1 (now)
Total SNP       15.2M     38.9M
Known SNP       6.8M      8.5M
Novel SNP       8.4M      30.4M
Short INDELs    1.3M      4.7M**

**Estimated from chromosome 20
ftp://ftp.1000genomes.ebi.ac.uk
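The SNP counts above are internally consistent, and the share of novel SNPs rose between the pilot and Phase 1. The short Python sketch below checks both points using only the figures quoted on the slide; the dictionary layout and the function name are illustrative choices, not part of the project's tooling.

```python
# SNP counts in millions, copied from the "New variants" slide.
snp_counts = {
    "Pilot": {"total": 15.2, "known": 6.8, "novel": 8.4},
    "Phase 1": {"total": 38.9, "known": 8.5, "novel": 30.4},
}


def novel_fraction(counts):
    """Return the share of SNPs reported as novel."""
    return counts["novel"] / counts["total"]


for phase, counts in snp_counts.items():
    # Known and novel SNPs should add up to the reported total
    # (tolerance covers rounding to one decimal place).
    assert abs(counts["known"] + counts["novel"] - counts["total"]) < 0.05
    print(f"{phase}: {novel_fraction(counts):.0%} of SNPs novel")
```

Run as written, this prints 55% for the pilot and 78% for Phase 1.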

Genotype accuracy is improving

              HomRef    Het      HomAlt    Overall
Error rate    0.16%     0.76%    0.39%     0.37%

We are beginning to tackle variant integration
[Figure labels: deletion; SNPs (from LC, EX, OMNI); indels]

How will people use the 1000G project data?

1. Screening for functional variants

Fraction of variant sites present in an individual that are NOT already represented in dbSNP

Date                        Fraction not in dbSNP
February, [year missing]    [value missing]%
February, [year missing]    [value missing]%
April, [year missing]       [value missing]%
February, 2011              2%
May 2011                    1%

Rates of individual genome variant ‘rediscovery’
- c. 250 LOF / person
- c. 75 HGMD DM

USH2A
- Mutations cause Usher syndrome
- 66 missense variants in dbSNP
- 2/3 detected in 1000 Genomes Pilot
- One HGMD ‘disease-causing’ variant homozygous in 3 YRI
  - Other reports indicate this is not a real disease-causing variant

2. Imputation into existing GWAS studies

[Figure: IMPUTE. Genotypes in additional samples from a standard product are combined with the reference panel (1000G); imputation yields imputed genotypes.]
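The idea behind imputation can be illustrated with a toy example: a sample typed at only some sites is matched against fully sequenced reference haplotypes, and the untyped sites are copied from the best match. The sketch below is a deliberately simplified nearest-haplotype copy, not the probabilistic model that IMPUTE uses; the function name, the toy panel and the 0/1 allele coding are all assumptions made for illustration.

```python
def impute(target, reference_panel):
    """Fill untyped sites (None) in `target` from the closest reference haplotype.

    target: list of 0/1 alleles, with None at sites the array did not type.
    reference_panel: list of fully typed haplotypes (lists of 0/1).
    """
    typed = [i for i, allele in enumerate(target) if allele is not None]

    def mismatches(haplotype):
        # Count disagreements at the sites the target was actually typed at.
        return sum(haplotype[i] != target[i] for i in typed)

    best = min(reference_panel, key=mismatches)
    # Keep observed alleles; copy the rest from the best-matching haplotype.
    return [best[i] if allele is None else allele
            for i, allele in enumerate(target)]


# Hypothetical reference panel of three haplotypes over six sites.
reference_panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
]
target = [1, None, 0, 1, None, 0]  # typed at sites 0, 2, 3 and 5 only
print(impute(target, reference_panel))  # [1, 1, 0, 1, 0, 0]
```

A larger and more diverse reference panel gives each target sample a closer match, which is why a resource such as 1000G can increase the power of existing association studies.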

3. Designing new genome-wide genotyping platforms

Illumina

How can we access the hard part of the genome?
- The paradigm of mapping reads to a reference fails when the genome contains sequence highly diverged from, or absent in, the reference.
- We have been developing de novo assembly algorithms using coloured de Bruijn graphs to identify complex variants and genotype them.
- For example, we can type classical HLA alleles from WGS data without read alignment.

ACGCGTC ACGTGTC
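The two sequences above differ at a single base, which makes them a convenient toy input for a coloured de Bruijn graph. The sketch below is a minimal illustration rather than the assembler described in the talk: it records only the k-mers (the graph nodes) and the set of colours that contain each one, leaving edges implied by the overlap between consecutive k-mers. The colour names `reference` and `sample`, the choice of k = 3 and the function names are assumptions.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every k-mer of `sequence` in order."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def coloured_de_bruijn(samples, k):
    """Map each k-mer to the set of sample names (colours) containing it."""
    graph = defaultdict(set)
    for colour, sequence in samples.items():
        for kmer in kmers(sequence, k):
            graph[kmer].add(colour)
    return graph


# Hypothetical labels: the slide does not say which sequence is the reference.
samples = {"reference": "ACGCGTC", "sample": "ACGTGTC"}
graph = coloured_de_bruijn(samples, k=3)

# k-mers carrying both colours flank the variant; single-colour k-mers
# form the two diverging branches of the "bubble".
shared = sorted(kmer for kmer, colours in graph.items() if len(colours) == 2)
private = {
    colour: sorted(kmer for kmer, colours in graph.items() if colours == {colour})
    for colour in samples
}
print("shared:", shared)    # ['ACG', 'CGT', 'GTC']
print("private:", private)  # reference: CGC, GCG; sample: GTG, TGT
```

Because the variant shows up as a pair of differently coloured branches, it can be found by comparing colours in the graph, with no alignment of reads to a reference.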

3501/5703 from lab typing
Credit: Zam Iqbal

Lessons learnt about related populations

GBR, FIN, TSI, CEU, IBS

Closely related populations can have substantially different rare variants

Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants
Credit: Iain Mathieson

Thanks to the many...
- Steering committee
  - Co-chairs: Richard Durbin and David Altshuler
- Samples and ELSI Committee
  - Co-chairs: Aravinda Chakravarti and Leena Peltonen
- Data Production Group
  - Co-chairs: Elaine Mardis and Stacey Gabriel
- Analysis Group
  - Co-chairs: Gil McVean and Goncalo Abecasis
  - Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski)
- Structural Variation Group
  - Co-chairs: Matt Hurles, Charles Lee and Evan Eichler
- DCC
  - Co-chairs: Paul Flicek and Steve Sherry