
The Phase 1 Variant Set and Future Developments


1 The Phase 1 Variant Set and Future Developments
The 1000 Genomes Project The Phase 1 Variant Set and Future Developments Laura Clarke 16th October 2012

2 Glossary
Pilot: the 1000 Genomes Project ran a pilot study between 2008 and 2010
Phase 1: the initial round of exome and low-coverage sequencing of 1,000 individuals
Phase 2: expanded sequencing of 1,700 individuals and method improvement
Phase 3: sequencing of 2,500 individuals and a new variation catalogue
SAM/BAM: Sequence Alignment/Map format, an alignment format
VCF: Variant Call Format, a variant format
This glossary may explain some terms you see in this presentation.

3 The 1000 Genomes Project: Overview
International project to construct a foundational data set for human genetics
Discover virtually all common human variation by investigating many genomes at the base-pair level
Consortium with multiple centers, platforms and funders
Aims:
Discover population-level human genetic variation of all types (95% of variation > 1% frequency)
Define haplotype structure in the human genome
Develop sequence analysis methods, tools and other reagents that can be transferred to other sequencing projects

4 Phase 1 populations

5 Phase 2/3 populations
Barbados, Pakistan, Ghana, Bangladesh, Peru, Nigeria, India, Vietnam, Sierra Leone, Sri Lanka, USA

6 Hapmap, The Pilot Project and The Main Project
HapMap: started in 2002; its last release contained ~3 million SNPs across 1,400 individuals from 11 populations, using high-throughput genotyping chips.
1000 Genomes Pilot: started in 2008; the paper release contained ~14 million SNPs from 179 individuals in 4 populations, using low-coverage next-generation sequencing.
1000 Genomes Phase 1: started in 2009; the Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions from 1,094 individuals in 14 populations, using low-coverage and exome next-generation sequencing.
1000 Genomes Phase 2: started in 2011; 1,722 individuals from 19 populations, low-coverage and exome next-generation sequencing.
This slide gives an idea of how things have changed since the 1000 Genomes Project's predecessor, HapMap. HapMap started in 2002 in the wake of the Human Genome Project. It used high-throughput genotyping chips to rapidly assess variation in large numbers of individuals, and by its finish in 2007 it had released around 3 million variants across 1,400 individuals from 11 different populations. As second-generation sequencing technologies started to bear out their promise, it became clear it would be feasible to sequence large numbers of individuals, and the 1000 Genomes Project was born. We have moved from a pilot phase, where nearly 200 individuals were sequenced, to now, where we are aiming for a total of 2,500 individuals and hope to capture almost all shared variation that currently exists in the populations we study.

7 Timeline
September 2007: 1000 Genomes Project formally proposed in Cambridge, UK
April 2008: First submission of data to the Short Read Archive
May 2008: First public data release
October 2008: SAM/BAM format defined
December 2008: First high-coverage variants released
December 2008: First 1000 Genomes browser released
May 2009: First indel calls released
July 2009: VCF format defined
August 2009: First large-scale deletions released
December 2009: First main project sequence data released
March 2010: Low-coverage pilot variant release made
July 2010: Phased genotypes for 159 individuals released
October 2010: "A Map of Human Variation from Population-Scale Sequencing" published in Nature
January 2011: Final Phase 1 low-coverage alignments released
May 2011: The project appears on Twitter; first variant release made on more than 1,000 individuals
October 2011: Phase 1 integrated variant release made
March 2012: Phase 2 alignment release
November 2012: "An Integrated Map of Genetic Variation from 1,092 Human Genomes" published in Nature
This gives you an idea of how the project has progressed and some of the major milestones we have seen.

8 Fraction of variant sites present in an individual that are NOT already represented in dbSNP
Date and fraction not in dbSNP:
February 2000: 98%
February 2001: 80%
April 2008: 10%
February 2011: 2%
Now: <1%
Over the last decade the quantity of known variation has increased massively. At the end of the Human Genome Project only 20% of human variation had been discovered. With the help of HapMap and other similar projects we had almost 90% of human variation. Now less than 1% of the variant sites present in an individual are not already represented in dbSNP. Ryan Poplin, David Altshuler
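The "fraction not in dbSNP" figure is just a set difference between an individual's variant sites and the known catalogue. A minimal sketch, using hypothetical (chromosome, position, alt allele) tuples rather than real dbSNP records:

```python
# Sketch: novelty rate as a set difference. The site tuples below are
# illustrative placeholders, not real dbSNP or sample data.

def novel_fraction(sample_sites, dbsnp_sites):
    """Fraction of a sample's variant sites absent from dbSNP."""
    sample = set(sample_sites)
    novel = sample - set(dbsnp_sites)
    return len(novel) / len(sample)

dbsnp = {("1", 100, "A"), ("1", 200, "T"), ("2", 50, "G")}
sample = {("1", 100, "A"), ("1", 200, "T"), ("2", 50, "G"), ("3", 7, "C")}
print(novel_fraction(sample, dbsnp))  # 1 of 4 sites is novel -> 0.25
```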

9 Sequencing Data Evolution
The project contains data from 3 different providers and multiple platforms. Min and max read lengths (bp):
454 Roche GS FLX Titanium: 70–400
Illumina GA: 30–81
Illumina GA II: 26–160
Illumina HiSeq: 50–102
ABI SOLiD System 2.0: 25–35
ABI SOLiD System 2.5
ABI SOLiD System 3.0
As the project has been running since 2008, its data shows how the machines have changed over time and covers most of the platforms that exist from the three main providers.

10 Pipelines for data processing and variant calling
Tens of analysis groups have contributed. Individual pipelines and component tools vary. Typical main steps:
Read mapping
Duplicate filtering
Base quality value recalibration
INDEL realignment
Variant site discovery
Individual genotype assignment (sometimes part of site discovery)
Variant filtering / call set refinement
Variant reporting
Our analysis process has been built up over time into a system of alignment and variant calling. Tens of groups contribute to this process along the way. At the current point in time our basic analysis pipeline contains about 8 steps: read mapping, duplicate filtering, base quality recalibration, indel realignment, site discovery (variant calling), genotype assignment, variant filtering and the final reporting stage.

11 3 pilot coverage strategies
In the pilot phase, 3 different strategies were used: high-coverage sequencing of family trios, to get true phasing of the variants detected; low-coverage sequencing of many individuals, to allow broader detection of variants but requiring statistical phasing; and sequencing of specific exon targets in a larger number of individuals, to allow detection of rare variants, though these would remain unphased.

12 Phase 1 analysis goal: an integrated view of human variation
Integrated haplotype map of 1,092 human genomes
14 populations from Europe, Africa, East Asia and the Americas
38 million SNPs, 2 million indels, 14 thousand larger deletions
~99% of per-individual variants, >95% of SNPs at 1% frequency

13 Alignment Data The project has made more than 10 releases of alignment data.
Pilot project: aligned to NCBI36; MAQ and Corona; base quality recalibration done
Phase 1: aligned to GRCh37; BWA and BFAST; indel realignment
Phase 2: aligned to extended GRCh37; improvements to base quality recalibration
As the project has grown and moved forward our analysis methods have also changed. We have made more than 10 releases of alignment data, and both our reference sequence and alignment processes have changed over time. For the pilot project we used NCBI36 as the reference, and MAQ and Corona were the main algorithms. We developed base quality recalibration to improve the base quality scores coming from 9 different sequencing centres and tens of different sequencing machines. In Phase 1 we moved to GRCh37 and used BWA and BFAST for alignment. We introduced indel realignment around known indels to reduce the false-positive mappings around indels which had introduced false-positive variant calls. As we move into Phase 2 we are now aligning to an extended GRCh37 reference which also includes the alternative haplotypes and novel sequence, both from Venter and NA We also hope to improve base quality recalibration so it handles error models around indels as well as around SNPs.

14 Short Variant Calling Early call sets used a single variant caller
Intersect approach developed during the pilot
Variant Quality Score Recalibration (VQSR) developed for Phase 1
Genotype likelihoods assigned to help with genotype calling
Integrated genotype calling based on individual variant call sets
Phase 2 looks to improve site discovery and improve integration
During the pilot study it was discovered that by using intersections of different callers, rather than a single caller, we were able to improve the accuracy of the calls, creating intersection call sets where only variants discovered by two or more callers were included. This process was improved for Phase 1 by using methods like the Broad's Variant Quality Score Recalibration to score variant sites against known truths like dbSNP and high-throughput chip data. We also produced improved genotype likelihoods to help with assigning genotypes, and produced an integrated call set which presents SNPs, indels and large-scale deletions together. Phase 2 looks to improve integration and enable us to better detect and genotype multi-allelic variants and more complex structural variants.
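The intersect idea described above can be sketched in a few lines: keep only sites reported by at least two individual callers. The caller names and variant tuples below are illustrative placeholders, not the project's actual call sets.

```python
# Sketch of a pilot-style intersect call set: a site survives only if
# at least `min_callers` independent call sets report it.
from collections import Counter

def intersect_calls(callsets, min_callers=2):
    """Return sites called by at least `min_callers` call sets."""
    counts = Counter(site for sites in callsets.values() for site in set(sites))
    return {site for site, n in counts.items() if n >= min_callers}

callsets = {
    "caller_a": [("1", 1000, "A", "G"), ("1", 2000, "C", "T")],
    "caller_b": [("1", 1000, "A", "G"), ("1", 3000, "G", "A")],
    "caller_c": [("1", 1000, "A", "G"), ("1", 2000, "C", "T")],
}
print(sorted(intersect_calls(callsets)))  # sites seen by >= 2 callers
```

The singleton at position 3000 is dropped, which is exactly the false-positive filtering the slide describes.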

15 Summary of Variants
(columns: Autosomes | Chromosome X | GENCODE regions)
Sequence data, mean mapped depth (x): 5.1 | 3.9 | 80.3
Total raw bases (Gb): 19049 | 804 | 327
SNPs, no. sites overall: 36.7 M | 1.3 M | 498 K
SNPs, novelty rate: 58% | 77% | 50%
SNPs, no. syn/non-syn/nonsense: NA | 4.7 / 6.5 / k | 199 / 293 / 6.3 k
SNPs, avg. no. per sample: 3.60 M | 105 K | 24.0 K
Indels, no. sites overall: 1.38 M | 59 K | 1867
Indels, novelty rate: 62% | 73% | 54%
Indels, no. in-frame/frameshift: | 19 / 14 | 719 / 1066
Indels, avg. no. per sample: 344 K | 13 K | 440
Large deletions, no. sites overall: 13.8 K | 432 | 847
Large deletions, avg. variants per sample: 717 | 26 | 39

16 Power to Detect Site Detection Genotype Concordance

17 Quality of the Phase 1 Integrated Genotypes
(columns: Region | Type | Eval set | # Eval samples (overlaps) | # Markers | Overall concord. | Het concord.)
Genome | SNP | Omni2.5M | 1,092 | 2.1M | 99.09% | 99.65%
Genome | SNP | CGI | 34 | 13M | 98.63% | 99.60%
Exome | SNP | | 765 | 50K | 99.73% | 99.95%
Exome | SNP | | | | 99.52% | 99.83%
| INDEL | | | 820K | 95.64% | 98.01%
| SV | Conrad | 248 | 1.1K | 99.01% | 99.82%

18 Imputation Accuracy

19 Ancestry Deconvolution in Ad-Mixed Populations
Blue=European Grey=African Red=Native American Black=Unassigned

20 New for Phase 2 and 3
Empirical indel error modeling
De novo assembly variant calling
Haplotype-based variant calling
Multi-allelic sites
MNP and complex substitution calling
Integrated genotyping for structural variation

21 Data Availability
FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ (raw data files)
AWS Amazon cloud: FTP mirror
Web site: release announcements and documentation
Ensembl-style browser: browse 1000 Genomes variants in genomic context; Variant Effect Predictor; Data Slicer; other tools
Our data is available from our FTP site and in the browser, and there are documentation and release announcements on the web site.

22 FTP Site
Two mirrored FTP sites:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
The NCBI site is a direct mirror of the EBI site and can be up to 24 hours out of date. Both are also accessible using Aspera, and the EBI site has an HTTP mirror.

23 ftp://ftp.1000genomes.ebi.ac.uk ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
Documentation, Raw Data, Phase 1 Data, Pilot Data, Release Data, Technical Data
We have two mirrored FTP sites, at the EBI and the NCBI. The EBI site is the master copy, but the NCBI site is rarely more than 24 hours behind. The basic structure of the FTP site is as follows: we have several READMEs which describe different aspects of the data; the raw alignment and sequence files are stored in the data directory; the Phase 1 BAM set is in the phase 1 directory; the pilot data set is in the pilot data directory; the release directory contains our official releases; and the technical directory contains project data and reference data sets.

24 The FTP Site: Data Sample Level Files sequence_read alignment
The data directory contains one directory per individual, and each individual directory in turn has subdirectories (sequence_read, alignment) which contain FASTQ files or alignment files.
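Given that layout, a sample's files can be located by building the path mechanically. A small sketch; the sample name NA12878 is just an illustrative example of the naming scheme:

```python
# Sketch: building per-sample FTP paths for the data/ directory layout
# described above. The sample name is illustrative.
from posixpath import join

FTP_ROOT = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp"

def sample_dir(sample, kind):
    """Build the FTP path for a sample's fastq or alignment directory."""
    assert kind in ("sequence_read", "alignment")
    return join(FTP_ROOT, "data", sample, kind)

print(sample_dir("NA12878", "alignment"))
```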

25 Alternative Alignments
FTP Site: Technical — Alternative Alignments, Reference Data Sets, Experimental Data. The technical directory contains reference data sets and internal project material: alternative alignments, the reference directory and experimental data.

26 FTP Site: Release Older Release Dirs Date Format YYYYMMDD
Sequence Index Dates. The release directories are generally named by date. Originally these were the date of the release itself; now they follow the date of the original sequence release the data set is based on, which you can trace back to the sequence index files that describe a particular sequence release. Please note our date format should always be YYYYMMDD; this is true in both file and directory names.
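Because the directory names follow a strict YYYYMMDD convention, they can be validated and parsed with the standard library alone. A small sketch; the directory name used is illustrative:

```python
# Sketch: parsing a YYYYMMDD-named release directory. strptime both
# validates the format and yields a date object for sorting releases.
from datetime import datetime

def parse_release_dir(name):
    """Parse a YYYYMMDD-named release directory into a date."""
    return datetime.strptime(name, "%Y%m%d").date()

d = parse_release_dir("20101123")
print(d.isoformat())  # 2010-11-23
```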

27 FTP Site: Phase1 Data Analysis Results

28 Finding Data FTP search http://www.1000genomes.org/ftpsearch
The search runs over the current.tree file and provides full FTP paths and MD5 checksums. Every page also has a website search box. There is a web front end to this index which allows you to search on the filename and reports URLs for either the EBI or NCBI FTP site. The box can take * as a wildcard character, but no other search characters work. The website itself also has a search box which lets you search the contents of the website and our documentation.
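The *-wildcard filename search over a current.tree-style index can be reproduced locally with `fnmatch`. The tree lines below are illustrative; real current.tree lines are tab-separated path records:

```python
# Sketch of the FTP-search behaviour: match filenames in a tree index
# against a *-wildcard pattern. Example tree lines are made up.
from fnmatch import fnmatch

def search_tree(lines, pattern):
    """Return paths whose filename matches a *-wildcard pattern."""
    paths = [line.split("\t")[0] for line in lines]
    return [p for p in paths if fnmatch(p.split("/")[-1], pattern)]

tree = [
    "release/20101123/ALL.sites.vcf.gz\tfile",
    "data/NA12878/alignment/NA12878.bam\tfile",
    "data/NA12878/sequence_read/SRR001.fastq.gz\tfile",
]
print(search_tree(tree, "*.vcf.gz"))  # only the release VCF matches
```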

29 This is the front page of the 1000 Genomes website. The majority of the page is announcements, but you can also see the headline links, which give you access to the browser and other information, and the FAQ linked on the right-hand side.

30 The browser front page is more straightforward than the standard Ensembl front page. We describe which release the browser is based on and give links to the tutorial and standard entry points into the browser, as well as a search box where you can enter gene names, coordinates or SNP identifiers.

31 Genes and SNPs Intron Coding UTR Line indicates number of SNPS
Here is a description of what the gene and variation tracks look like in the browser. For genes, open boxes represent UTRs, closed boxes represent coding exons and the downward-pointed lines represent introns. For SNPs there are two options: the density track displays a histogram where the line indicates the number of SNPs in a particular location, and the compact view draws one line per SNP, with colours indicating the most severe functional consequence of that variant.

32 Region in Detail This is the location view, the page that Region in Detail brings you to. The top box is an overview of the chromosome. Below the Region in Detail title is a smaller-scale summary; the red box within it surrounds the more detailed view you can see below it. The last box is the most detailed view, where you can see the structure of the genes and other genomic features. The left-hand menu gives you various configuration options and the ability to switch to the main Ensembl site, which also gives you functionality like BioMart and BLAST. The location is also specified in a tab in the bar at the top of the page. If you navigate to a feature from this page, this tab should remain in the top menu bar to allow you to get back to this page.

33 Turning on Tracks If you wish to turn different annotation tracks on or off in the detailed box, you need to use the 'Configure this page' button in the left-hand menu. The 'Configure this page' menu contains options for many different types of annotation. As mentioned earlier, for SNPs there are two options: the density view, which gives you a histogram indicating how many SNPs there are, and the compact view, which has a line for each SNP, coloured by variation consequence.

34 File upload to view with 1000 Genomes data
Manage your data: we also support upload of various file types, including BAM, BED and VCF. A VCF file must be remotely accessible and tabix-indexed to work. Supported file types: BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF*, WIG (* VCF must be indexed).

35 Uploaded VCF Example: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/ /ALL.wgs.phase1_release_v snps_indels_sv.sites.vcf.gz The VCF file is displayed in the same way as a normal variant track, with the option of either a density plot or compact view. If you hover over a SNP from a VCF file you should be presented with information about that SNP from the file.

36 Gene View Gene Tab Download as csv Get in vcf format
Click a gene, then 'Variation Table' or 'Variation Image'. When on a gene view page you should always see a 'Get in VCF format' button, which takes you to our Data Slicer with the 1000 Genomes VCF file this release is based on and the location of that gene. All tables should be downloadable as CSV files and filterable on their contents.

37 Variation Image Gene variation zoom
If you click on the Variation Image link in the left-hand menu you get an image showing how different transcripts look with respect to the variation in the region. You can zoom in and out of this image, as some genes have a large amount of variation which can be difficult to see when looking at the whole gene.

38 Transcript Tab: Variations
Effect on protein: SIFT, PolyPhen. If you look at a particular transcript (by clicking on an ENST identifier) you can see the consequences of all the variation in that transcript. For non-synonymous variants we also provide SIFT and PolyPhen scores, which state whether the change is deleterious or tolerated.

39 Variation Pages

40 The web front end for the tools can be accessed from the Tools link in the top right-hand corner of every browser page.

41 Tools page Here you can see the Variant Effect Predictor, Data Slicer, Variation Pattern Finder and VCF to PED converter, all available. All have an online version, and all but the Data Slicer have an API script. To do what the Data Slicer does on the command line you need to use tabix and samtools, as you were shown earlier.

42 Data Slicing The web front end accepts a URL for any indexed BAM, or tabix-indexed VCF, BED or other position-based file. You must supply a remote file location and a genomic position. This tool works best with regions smaller than 0.5 Mb. If you have a VCF file you can also supply a panel file which maps individuals to populations, and then filter by either individual or population. You can clear the boxes which start with example URLs in them. There is also more extensive documentation available on our website.
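The core of what the Data Slicer does is extract the records falling in one genomic region. In real use tabix provides the random access; the linear filter below is only a sketch of the slicing logic itself, over made-up VCF lines:

```python
# Sketch of Data-Slicer-style region extraction from VCF-format lines.
# A real slicer uses a tabix index; this just shows the filter logic.

def slice_vcf(lines, chrom, start, end):
    """Yield VCF body lines whose POS falls in [start, end] on chrom."""
    for line in lines:
        if line.startswith("#"):  # skip meta and header lines
            continue
        fields = line.split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\trs1\tA\tG\t.\tPASS\t.",
    "1\t5000\trs2\tC\tT\t.\tPASS\t.",
    "2\t1500\trs3\tG\tA\t.\tPASS\t.",
]
print(list(slice_vcf(vcf, "1", 900, 2000)))  # only the rs1 record
```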

43

44 Variant Effect Predictor
Predicts functional consequences of variants
Both web front end and API script
Can provide SIFT/PolyPhen/Condel consequences
RefSeq gene names
HGVS output
Can run from a cache as well as the database
Converts from one input format to another
Script available for download from: ftp://ftp.ensembl.org/pub/misc-scripts/Variant_effect_predictor/
The Variant Effect Predictor predicts the functional consequences of variants on the basis of the current Ensembl annotation. It can provide SIFT, PolyPhen and Condel scores and predictions, and also both RefSeq and HGVS names if desired. There is a cache which can allow it to run without database access, and it is also able to convert from, say, VCF to Ensembl input format or pileup format. The script is available for download, and the web interface can be accessed using this link.
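The format conversion mentioned above (VCF to Ensembl input) can be sketched for the simple SNP case. This assumes the Ensembl default coordinate format of "chromosome start end ref/alt strand"; only single-base substitutions are handled, and the input line is illustrative:

```python
# Sketch: converting a VCF SNP line to Ensembl default input format
# (chrom start end ref/alt strand). Assumes SNPs only; indels need
# different coordinate handling and are not covered here.

def vcf_snp_to_ensembl(line):
    """Convert one VCF SNP record to an Ensembl-default-format string."""
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    assert len(ref) == 1 and len(alt) == 1, "SNPs only in this sketch"
    return f"{chrom} {pos} {pos} {ref}/{alt} +"

print(vcf_snp_to_ensembl("1\t182712\trs123\tA\tC\t.\tPASS\t."))
```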

45 The web front end allows you to query up to 750 variants at a time
The web front end allows you to query up to 750 variants at a time. You can use 3 different input formats (Ensembl, VCF or pileup) and request information like HGVS names and SIFT and PolyPhen predictions and scores. You can also filter by minor allele frequency in the 1000 Genomes pilot populations.

46

47 Variation Pattern Finder
Remote or local tabix indexed VCF input Discovers patterns of Shared Inheritance Variants with functional consequences considered by default Web output with CSV and Excel downloads The Variation Pattern Finder is a tool for discovering patterns of shared inheritance between individuals in the same vcf file.

48 Variation Pattern Finder
The Variation Pattern Finder takes a remote indexed VCF file, a remote sample-to-population mapping file (tab-delimited: sample, population) and a region, and generates its output from these.
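The underlying idea of finding shared inheritance patterns can be sketched as grouping samples by their genotype vector across a set of variants. The sample names and genotypes below are illustrative, not the tool's actual algorithm or data:

```python
# Sketch of the pattern-finder idea: samples sharing an identical
# genotype pattern across the selected variants are grouped together.
# Genotype coding: 0 = hom-ref, 1 = het, 2 = hom-alt.
from collections import defaultdict

def shared_patterns(genotypes):
    """Map each distinct genotype pattern to the samples carrying it."""
    groups = defaultdict(list)
    for sample, gts in genotypes.items():
        groups[tuple(gts)].append(sample)
    return dict(groups)

genotypes = {
    "HG00096": [0, 1, 1],
    "HG00097": [0, 1, 1],
    "NA18525": [1, 0, 2],
}
for pattern, samples in shared_patterns(genotypes).items():
    print(pattern, samples)
```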

49

50 VCF to PED LD Visualization tools like Haploview require PED files
VCF to PED converts VCF to PED and will divide a file by individual or population. PED files are useful for visualization and calculating haplotype information. This tool converts from VCF to PED format and allows you to select not only a particular region but also particular individuals or populations.

51 VCF to PED Like the Variation Pattern Finder, the VCF to PED converter needs a remote VCF file, a remote sample-to-population mapping file and a region.

52

53 Haploview Once the PED and info files exist you can use Haploview (aliased on these machines as haploview) to view the LD pattern shared by these variants in these individuals.

54 Access to backend Ensembl databases
Public MySQL database at mysql-db.1000genomes.org, port 4272
Full programmatic access with the Ensembl API
The 1000 Genomes Pilot uses Ensembl v60 databases and the NCBI36 assembly (this is frozen)
The 1000 Genomes main project currently uses Ensembl v68 databases
We also have a public MySQL instance, like the Ensembl public instance, to give complete programmatic access to these databases.

55 More Information Please get in touch with any questions.

56 Announcements http://1000genomes.org 1000announce@1000genomes.org
There are various means to discover new data from the 1000 Genomes Project. All announcements are made on our website. There is also an announcements mailing list, an RSS feed and a Twitter feed.

57 Thanks The 1000 Genomes Project Consortium Paul Flicek Richard Smith
Holly Zheng Bradley Ian Streeter

