Capture / Resequencing Data Handling and Analysis

Presentation transcript:

Capture / Resequencing Data Handling and Analysis
MGL Users Group

Designing and ordering a targeted exome probe set
We use Agilent SureSelect hybridization probes: 120-base biotinylated RNA probes.
Process of design:
  Choose your genes of interest.
  Submit them to the SureDesign website.
Some considerations:
  Price breaks at 0.5, 3, 6, 12, and 24 Mb (see next slide).
  For Karel, the size of the target set is <500 kb.
    This is ~1% of the whole exome.
    This is ~0.015% of the whole genome.

Example of scaling of costs for SureSelect probes
These are costs per sample. For example, for 96 samples and ~130 genes: 96 x $260 = $24,960.

Example of SureDesign report

Targeted vs whole exome sequencing (TES vs WES)
Cost of WES is ~$120 for pulldown probes.
Can run many more samples per lane for TES.
WES uses an off-the-shelf probe kit, so ordering time is shorter.
Less “extraneous” data with TES; put the other way, more “free” data with WES.

Process of hybridization and library preparation
We use the Agilent SureSelectXT Target Enrichment kit.
  Need 5 µg of high-quality genomic DNA to start.
  Probes are RNA, so be sure the DNA is RNase-free.
Shear the DNA, size select, ligate adaptors, amplify the library.
Hybridize to custom probes and pull down.
Add barcodes, pool samples for sequencing.

Sequencing
ABI SOLiD 5500xl
Optimum density is 160 million beads per lane (one DNA fragment per bead).
Nominally 110 bases read per fragment = 16.2 billion bases per lane.
Significant losses due to filtering and off-target reads.

Understanding Data from the Sequencer
Each fragment can produce one or two reads, from the forward and/or reverse ends.
For re-sequencing projects we commonly want to maximize both coverage and call reliability, so paired-end reads of the longest length the sequencer can produce are desirable.
Data come as individual base calls, each with an associated quality.
Multiple filtering steps are desirable to reduce possible artifacts.

Colorspace Compared to FASTQ
Colorspace is similar to FASTQ, but a layer of encoding makes it not immediately interpretable.
Both have calls and qualities.
Because each color call samples two adjacent bases, call error actually goes down in colorspace data, making it a bit more reliable for re-sequencing.
A tradeoff is that reads are a bit shorter, meaning more independent fragments must be read to achieve similar coverage.
[Figure: two-base color encoding table, indexed by 1st Base and 2nd Base]
csqual file with associated call qualities.
XSQ is a compressed binary format combining both calls and qualities.
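To make the "layer of encoding" concrete, here is a minimal sketch of two-base color encoding and decoding. It assumes the usual formulation in which bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes, and it assumes a leading primer base of T. The function names are illustrative and are not part of any SOLiD tool.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of two adjacent base codes


def encode_colors(seq, primer="T"):
    """Encode a base sequence as a string of colors (0-3), starting from a known primer base."""
    prev = BASES.index(primer)
    colors = []
    for base in seq:
        cur = BASES.index(base)
        colors.append(str(prev ^ cur))
        prev = cur
    return "".join(colors)


def decode_colors(colors, primer="T"):
    """Decode colors back to bases; each base depends on every color before it."""
    prev = BASES.index(primer)
    bases = []
    for color in colors:
        prev ^= int(color)
        bases.append(BASES[prev])
    return "".join(bases)


reference = "ACGTAC"
with_snp = "ACTTAC"  # single base change at the third position

ref_colors = encode_colors(reference)
snp_colors = encode_colors(with_snp)
changed = sum(1 for a, b in zip(ref_colors, snp_colors) if a != b)

print(ref_colors, snp_colors, changed)
# A single miscalled color, by contrast, corrupts every base after it when decoded naively:
print(decode_colors("311131"))
```

A real single-base change alters two adjacent colors, while a lone color change is more likely a measurement error. This is the property behind the reliability claim above, and it is also why colorspace reads are normally aligned in colorspace rather than decoded first.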

You WILL have variants
The human reference genome (hg19) is assembled from 13 people, and any given portion represents only a fraction of those individuals.
Builds prior to the most recent one (which the vast majority of tools have not yet adopted) contain many rare alleles.
dbSNP (build 141) reports 62 million common variants (from 260 million submissions), 29.9 million of which occur within genes. These are mainly synonymous and ‘non-impactful’ mutations.
The goal of many re-sequencing projects is to distill meaningful mutations from all of this common genetic variation.

Considerations with Capture data
Exome or targeted capture is an excellent tool for reducing the amount of ‘irrelevant’ data for a study, but it does introduce some caveats.
Capture is never 100% enrichment. Both in our hands and in data evaluated from NISC, exome capture tends to give ~50% on-target vs off-target bases, with the target defined explicitly by the capture design (exons +/- 10 bp). Product literature usually extends the capture regions by a further 100 bp, which pads that figure.
Because capture relies on complex hybridization, there is a LOT of variability in how well some sequences are captured vs others. Some regions may have low or no coverage while others may be heavily covered.
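As a rough planning aid, the on-target fraction can be turned into an expected mean coverage per sample. The sketch below is a back-of-the-envelope estimate, not a figure from the slides. The function name, the 96 samples per lane, and the 500 kb target are illustrative assumptions. The 16.2 billion bases per lane is the nominal yield quoted on the Sequencing slide, and losses from read filtering are ignored.

```python
def expected_mean_coverage(lane_bases, on_target_fraction, target_bases, samples_per_lane):
    """Rough mean on-target depth per sample.

    Ignores filtering losses and the uneven capture efficiency described above,
    so treat the result as an upper-bound planning number.
    """
    if target_bases <= 0 or samples_per_lane <= 0:
        raise ValueError("target_bases and samples_per_lane must be positive")
    on_target_bases = lane_bases * on_target_fraction
    return on_target_bases / (target_bases * samples_per_lane)


# Assumed inputs: nominal lane yield, ~50% on target, 500 kb target, 96 pooled samples.
coverage = expected_mean_coverage(16.2e9, 0.5, 500_000, 96)
print(f"{coverage:.0f}x mean coverage per sample")
```

Because capture efficiency varies so much between regions, a mean like this says little about the worst-covered bases, which usually decide how many samples can share a lane.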

Distribution of Coverage in Capture
“Average” coverage is 228x overall for capture bases, but note the range, and the presence of a terribly captured fraction!

Falloff of coverage in targeted regions
[Figure labels: 80% of bases, 50% of bases, 20% of bases]
We can track what fraction of bases is covered at a certain level. This can be adjusted by how much sequencing is done.
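Tracking the fraction of bases covered at a given level can be sketched in a few lines. The example assumes per-base depths for the target have already been extracted into a plain list by some depth-reporting tool. The depth values and function names here are made up for illustration.

```python
import math


def fraction_covered(depths, thresholds=(1, 20, 100)):
    """Fraction of target bases with depth at or above each threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    total = len(depths)
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


def depth_at_fraction(depths, fraction):
    """Depth that at least `fraction` of the bases reach or exceed."""
    ordered = sorted(depths, reverse=True)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


# Hypothetical per-base depths for a tiny 10-base target.
depths = [0, 2, 15, 40, 120, 250, 300, 310, 500, 900]

summary = fraction_covered(depths)
falloff = {f: depth_at_fraction(depths, f) for f in (0.8, 0.5, 0.2)}
print(summary)
print(falloff)
```

The second function answers the question the other way round: it reports the depth reached by 80%, 50%, and 20% of bases.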

Capture coverage scales fairly linearly with input, but low coverage bases do not scale well!
High coverage bases and low coverage bases scale differently, depending on how well they can be hybridized.

Pre-filtering of data
Reads are evaluated and trimmed based on their contents BEFORE any form of mapping.
This is important because “bad” reads may map and result in variant calls!
It is generally important for any form of project, not just re-sequencing, but especially critical here.
A variety of tools exist to perform this. I prefer Trimmomatic for this task.
Two main tasks for Trimmomatic:
  Remove adapter or problematic sequences (poly-A, etc.).
  Clip or trim read sequences at low-quality positions.
Reads that end up below a minimum threshold length are discarded.
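The quality-trimming and length-filtering idea can be sketched as follows. This is a simplified illustration of a sliding-window trim, not Trimmomatic's implementation. It assumes Phred+33 quality encoding, and the window size, quality threshold, and minimum length are arbitrary example values.

```python
def phred33(qual):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def sliding_window_trim(seq, qual, window=4, min_avg_q=15, min_len=36):
    """Cut the read at the first window whose mean quality drops below min_avg_q.

    Returns (seq, qual) for the kept prefix, or None if the read ends up
    shorter than min_len and should be discarded.
    """
    scores = phred33(qual)
    keep = len(seq)
    for start in range(max(len(scores) - window + 1, 0)):
        if sum(scores[start:start + window]) / window < min_avg_q:
            keep = start
            break
    trimmed_seq, trimmed_qual = seq[:keep], qual[:keep]
    if len(trimmed_seq) < min_len:
        return None
    return trimmed_seq, trimmed_qual


# Toy read: six high-quality calls ('I' = Q40) followed by four poor ones ('#' = Q2).
kept = sliding_window_trim("ACGTACGTAC", "IIIIII####", min_len=5)
dropped = sliding_window_trim("ACGTACGTAC", "IIIIII####", min_len=6)
print(kept, dropped)
```

Adapter removal is a separate step that needs the adapter sequences used in library preparation, so it is left out of this sketch.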

Alignment of Data
This is actually a critical choice. Which aligner you use will determine the reliability of your downstream results!
Alignment algorithms may change depending on task/project.
Generally three types of aligners:
  Seed & Extend
  Reference Indexing
  Prefix/Suffix matching (Burrows-Wheeler Transform)
Computational time and accuracy vary.
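To give a feel for the prefix/suffix matching family, here is a toy Burrows-Wheeler index with exact-match backward search. It is a teaching sketch only. It builds the suffix array naively, handles exact matches only, and says nothing about how any particular aligner is implemented. All names are illustrative.

```python
def build_index(text):
    """Build a naive suffix array, BWT, and the lookup tables needed for backward search."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort strictly before c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, bwt, first, occ


def find_exact(pattern, index):
    """Return sorted 0-based positions where pattern occurs exactly in the indexed text."""
    suffix_array, bwt, first, occ = index
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):  # extend the match one character at a time, right to left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_index("GATTACAGATTACA")
hits = find_exact("TTAC", index)
misses = find_exact("GGG", index)
print(hits, misses)
```

The search cost depends on the read length rather than the reference length. That is the main appeal of this approach for mapping many short reads to a large genome.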

Benchmarking of Common Aligners
For Illumina and some colorspace mapping I prefer to use Novocraft (Novoalign). It’s less commonly used as it’s not free.
[Figure: benchmark on simulated data run through actual aligners. Oliver GR. F1000Research 2012]

Benchmarking Indel Detection
Indels are a bit trickier to detect, particularly for some alignment strategies.
Oliver GR. F1000Research 2012

Post alignment Workflow
GATK best practices: continually updated tools and recommendations for handling of sequencing data from the Broad Institute.
Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics. 43:11.10.1-11.10.33.

Final portable data format
VCF (Variant Call Format): tab-delimited text.
Each line represents the position of a variant, then describes the genotype and the underlying data & reliability for each sample.
Extendable with annotations and additional information.
Common and readable by many current third-party tools.

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3
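Because VCF is plain tab-delimited text, a single record can be picked apart with nothing but the standard library. The sketch below parses the first data line of the example above. The function name and the shape of the returned dictionary are illustrative choices. For real projects a dedicated VCF library is the safer route, since this sketch skips header parsing and type conversion of INFO values.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO entries, and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags such as DB carry no value

    samples = {}
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, raw in zip(sample_names, fields[9:]):
            # trailing sub-fields may be omitted, so zip stops at the shorter list
            samples[name] = dict(zip(keys, raw.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
        "samples": samples,
    }


line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_record(line, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["alt"], record["samples"]["NA00002"]["GT"])
```

In the genotype field, `|` marks a phased call and `/` an unphased one, so NA00002 here is a phased heterozygote and NA00003 an unphased homozygous alternate.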

Additional handling
Varies significantly by project & goals:
  Association testing with disease phenotypes
    Modifiers
  Identification of mutations segregating with disease among families
    Causative mutation(s)
  Copy Number Variation (CNV)
The amount of data needed to perform these sorts of tests and analyses will vary depending on the characterization and type of study.
Filtering, visualization, and manipulation can be done with many third-party tools: VarSifter, Golden Helix, IGV, Galaxy, and MANY more.
http://nihlibrary.nih.gov/Services/Bioinformatics/Pages/bioanalysis.aspx