NGS Cancer Systems Biology Workshop Variant Calling and Structural Variants from Exomes/WGS Ramesh Nair May 30, 2014.

Presentation transcript:


Outline
- Types of genetic variation
- Framework for variant discovery
- Variant calling methods and variant callers
- Filtering of variants
- Structural variants
5/30/2014 Variant Calling

Why call variants?
TCGA Program Overview: “There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA – its genome – and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.”

Types of Genetic Variation
Cancer is driven by genomic alterations like:
- Single Nucleotide Aberrations
  - Single Nucleotide Polymorphisms (SNPs) - mutations shared amongst a population
  - Single Nucleotide Variations (SNVs) - private mutations
- Short Insertions or Deletions (indels)
- Copy Number Variations (CNVs)
- Larger Structural Variations (SVs)

SNPs vs. SNVs
Both are aberrations at a single nucleotide.
SNP
- Aberration expected at the position for any member in the species (well characterized)
- Occurs in the population at some frequency, so expected at a given locus
- Validated in population
- Catalogued in dbSNP (http://www.ncbi.nlm.nih.gov/snp)
SNV
- Aberration seen in only one individual (not well characterized)
- Occurs at low frequency, so not common
- Not validated in population
Really a matter of frequency of occurrence.

SNVs of interest
Non-synonymous mutations
- Result in amino acid change
- Impact protein sequence
- Missense, nonsense, stop gained/lost mutations
Somatic mutations in cancer
- Tumor-specific mutations

Catalogs of human genetic variation
- The 1000 Genomes Project (http://www.1000genomes.org/): SNPs and structural variants from 2500 individuals from about 25 populations
- HapMap (http://hapmap.ncbi.nlm.nih.gov/): identifies and catalogs genetic similarities and differences
- dbSNP (http://www.ncbi.nlm.nih.gov/snp/): database of SNPs and multiple small-scale variations
- COSMIC (http://www.sanger.ac.uk/genetics/CGP/cosmic/): Catalog of Somatic Mutations in Cancer
- TCGA (http://cancergenome.nih.gov/): The Cancer Genome Atlas researchers are mapping the genetic changes in 20 selected cancers
- ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/): aggregates information about sequence variation and its relationship to human health

Challenges of accurate somatic variant calling
Not as simple as identifying sites with a variant allele in the tumor that is not present in the normal:
- Artifacts from PCR amplification or targeted (exome) capture
- Machine sequencing errors
- Incorrect local alignment of reads
- Tumor heterogeneity
- Tumor-normal cross-contamination

A framework for variation discovery
Phase 1: Mapping
- Place reads with an initial alignment on the reference genome using mapping algorithms
- Refine initial alignments
  - local realignment around indels
  - molecular duplicates are eliminated
- Generate the technology-independent SAM/BAM alignment map format
Accurate mapping is crucial for variation discovery.
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

A framework for variation discovery
Phase 2: Discovery of raw variants
- Analysis-ready SAM/BAM files are analyzed to discover all sites with statistical evidence for an alternate allele present among the samples
- SNPs, SNVs, short indels, and SVs
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

A framework for variation discovery
Phase 3: Discovery of analysis-ready variants
- Technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium, and family and population structure are integrated with the raw variant calls from Phase 2 to separate true polymorphic sites from machine artifacts
- At these sites, high-quality genotypes are determined for all samples
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

A framework for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

Variant calling methods
More than 15 different algorithms, in three categories:
- Allele counting
- Probabilistic methods, e.g. Bayesian model
  - Quantify statistical uncertainty
  - Assign priors based on observed allele frequency of multiple samples
- Heuristic approach
  - Based on thresholds for read depth, base quality, variant allele frequency, statistical significance
Figure: SNP variant. Ref: A; Ind1: G/G; Ind2: A/G
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list
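The probabilistic category can be illustrated with a minimal sketch. It assumes a simple binomial read model, a fixed per-base error rate of 0.01, and flat priors over three diploid genotypes; none of these values come from the slides, and real callers use richer models (base qualities, mapping qualities, population allele frequencies). The function name and parameters are hypothetical.

```python
from math import comb

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype given allele counts.

    Assumptions (not from the slides): binomial sampling of reads,
    a single fixed error rate, and flat priors unless others are given.
    """
    if priors is None:
        priors = {"ref/ref": 1 / 3, "ref/alt": 1 / 3, "alt/alt": 1 / 3}
    depth = ref_count + alt_count
    # Probability that one read shows the ALT allele under each genotype.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1 - error_rate}
    joint = {}
    for genotype, p in p_alt.items():
        likelihood = comb(depth, alt_count) * p**alt_count * (1 - p)**ref_count
        joint[genotype] = priors[genotype] * likelihood
    total = sum(joint.values())
    return {genotype: value / total for genotype, value in joint.items()}

# Allele counts in the style of a tumor sample with 15 REF and 16 ALT reads.
posteriors = genotype_posteriors(15, 16)
best_call = max(posteriors, key=posteriors.get)
print(best_call, posteriors)
```

Replacing the flat priors with frequencies observed across multiple samples corresponds to the "assign priors" bullet above.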

Some variant callers

Name | Category | Tumor/Normal Pairs | Metric | Reference
JointSNVMix (Fisher) | Allele Counting | Yes | Somatic probability | Roth, A. et al. (2012)
SomaticSniper | Heuristic | Yes | Somatic Score | Larson, D.E. et al. (2012)
VarScan2 | Heuristic with allele counting | Yes | Somatic p-value | Koboldt, D. et al. (2012)
GATK UnifiedGenotyper | Bayesian | No | Phred QUAL | DePristo, M.A. et al. (2011)
Strelka | | Yes | | Saunders, C.T. et al. (2012)
MuTect | | Yes | Log odds score (LOD) | Cibulskis, K. et al. (2013)

VCF (Variant Call Format) is a standard file format for representing variant calls.

Roth, A. et al. JointSNVMix: A Probabilistic Model For Accurate Detection Of Somatic Mutations In Normal/Tumour Paired Next Generation Sequencing Data. Bioinformatics (2012).
Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311-7 (2012).
Koboldt, D. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22(3):568-76. doi: 10.1101/gr.129684.111 (2012).
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).
Saunders, C.T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28(14):1811-7. doi: 10.1093/bioinformatics/bts271 (2012).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 31(3):213-9. doi: 10.1038/nbt.2514 (2013).

Allele Counting Example
JointSNVMix & VarScan2 (Fisher’s Exact Test)
- Allele count data from the normal and tumor are compared using a two-tailed Fisher’s exact test
- If the counts are significantly different, the position is labeled as a variant position (e.g., p-value < 0.001)

2x2 Contingency Table (G6PC2 hg19 chr2:169764377 A>G Asn286Asp)

Sample | REF allele | ALT allele | Total
Tumor | 15 | 16 | 31
Normal | 25 | 0 | 25
Totals | 40 | 16 | 56

The two-tailed p-value for Fisher’s exact test is < 0.0001. The association between rows (groups) and columns (outcomes) is considered to be extremely statistically significant.
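The test on this contingency table can be reproduced with a short standard-library sketch. The function name is hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention, assumed rather than taken from either caller's source code.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(x):
        # Unnormalized hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    low = max(0, col1 - row2)
    high = min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    extreme = sum(weight(x) for x in range(low, high + 1) if weight(x) <= observed)
    return extreme / comb(total, col1)

# Tumor: REF=15, ALT=16; Normal: REF=25, ALT=0 (counts from the slide).
p_value = fisher_exact_two_sided(15, 16, 25, 0)
print(f"two-sided p-value = {p_value:.2e}")
```

Working with integer weights and dividing once at the end avoids floating-point ties when deciding which tables count as "at least as extreme".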

G6PC2 hg19 chr2:169764377 A>G Asn286Asp
- Normal: Depth=25 REF=25 ALT=0
- Tumor: Depth=31 REF=15 ALT=16

VarScan2 Variant Calling Algorithm
VarScan2 calls somatic variants (SNPs and indels) using a heuristic method and a statistical test based on the number of aligned reads supporting each allele.

    If tumor matches normal:
        If tumor and normal match the reference
            → Call Reference
        Else tumor and normal do not match the reference
            → Call Germline
    Else tumor does not match normal:
        Calculate significance of allele frequency difference by Fisher's Exact Test
        If difference is significant (p-value < threshold):
            If normal matches reference
                → Call Somatic
            Else If normal is heterozygous
                → Call LOH
            Else normal and tumor are variant, but different
                → Call Unknown
        Else difference is not significant:
            Combine tumor and normal read counts for each allele. Recalculate p-value.
                → Call Germline

http://varscan.sourceforge.net/index.html
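The decision tree above can be sketched in Python as follows. This is an illustration of the logic as written on the slide, not VarScan2's actual code: the genotype rule (variant allele frequency cutoffs of 0.08 and 0.75), the p-value threshold of 0.05, and all function names are assumptions, and the final branch returns the Germline label without recomputing the combined p-value.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    weight = lambda x: comb(row1, x) * comb(row2, col1 - x)
    observed = weight(a)
    cells = range(max(0, col1 - row2), min(row1, col1) + 1)
    extreme = sum(weight(x) for x in cells if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)

def call_genotype(ref, alt, min_vaf=0.08, hom_vaf=0.75):
    """Hypothetical genotype rule based only on variant allele frequency."""
    depth = ref + alt
    vaf = alt / depth if depth else 0.0
    if vaf < min_vaf:
        return "ref"
    return "hom" if vaf >= hom_vaf else "het"

def classify_site(normal_ref, normal_alt, tumor_ref, tumor_alt, p_threshold=0.05):
    """Label a site Reference, Germline, Somatic, LOH, or Unknown."""
    normal_gt = call_genotype(normal_ref, normal_alt)
    tumor_gt = call_genotype(tumor_ref, tumor_alt)
    if tumor_gt == normal_gt:
        return "Reference" if tumor_gt == "ref" else "Germline"
    p_value = fisher_exact_two_sided(tumor_ref, tumor_alt, normal_ref, normal_alt)
    if p_value < p_threshold:
        if normal_gt == "ref":
            return "Somatic"
        if normal_gt == "het":
            return "LOH"
        return "Unknown"
    # Difference not significant: the slide pools the counts and recalculates
    # a p-value; this sketch skips that step and only returns the label.
    return "Germline"

print(classify_site(normal_ref=25, normal_alt=0, tumor_ref=15, tumor_alt=16))
```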

SNV Filtering
Pre-processing in the mapping phase and SNV filtering help minimize false positives. Filters:
- Absent in dbSNP
- Exclude LOH events
- Retain non-synonymous coding SNVs
- Tumor total reads (≥ 3) and variant reads
- Variant allele frequency in tumor and normal
- Mapping quality (≥ 40) and SNV quality (≥ 20)
- Max SNV calls (< 3) within a given window (10 bp) around the site
- SNV farther than a given distance (10 bp) from a predicted indel of a certain quality (≥ 50)
- Strand balance/bias
- Concordance across various SNV callers
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
Larson, D.E. et al. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics Advance Access (2011).
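A few of these filters can be sketched as a single function over one candidate SNV. The thresholds are the ones listed on the slide; the record layout (a plain dictionary) and every field name are hypothetical, and the sketch assumes that `distance_to_indel` already refers to the nearest predicted indel of quality ≥ 50, or is `None` when there is none.

```python
def failed_filters(snv, min_tumor_reads=3, min_mapq=40, min_snv_qual=20,
                   max_snvs_in_window=3, min_indel_distance=10):
    """Return the names of the filters that a candidate SNV fails."""
    failed = []
    if snv["in_dbsnp"]:
        failed.append("present in dbSNP")
    if snv["is_loh"]:
        failed.append("LOH event")
    if not snv["nonsynonymous"]:
        failed.append("not a non-synonymous coding SNV")
    if snv["tumor_depth"] < min_tumor_reads:
        failed.append("tumor depth")
    if snv["mapping_quality"] < min_mapq:
        failed.append("mapping quality")
    if snv["snv_quality"] < min_snv_qual:
        failed.append("SNV quality")
    # SNV calls within the 10 bp window around the site must stay below 3.
    if snv["snvs_in_window"] >= max_snvs_in_window:
        failed.append("too many SNVs in window")
    distance = snv["distance_to_indel"]
    if distance is not None and distance <= min_indel_distance:
        failed.append("too close to indel")
    return failed

# Illustrative record; only the tumor depth echoes the G6PC2 example.
candidate = {
    "in_dbsnp": False, "is_loh": False, "nonsynonymous": True,
    "tumor_depth": 31, "mapping_quality": 60, "snv_quality": 45,
    "snvs_in_window": 1, "distance_to_indel": None,
}
print(failed_filters(candidate))
```

Returning the list of failed filters, rather than a single pass/fail flag, makes it easier to see which filter removes most calls when thresholds are tuned.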

Which variant caller to use?
- Substantial discrepancies exist among the calls from different callers.
- Callers appear to be less concordant for calling somatic SNVs than germline SNPs.
- Sensitivity and specificity vary not only across callers but also along the genome within any caller.
  - They depend on factors like depth of sequence coverage in the tumor and matched normal, the local sequencing error rate, the allelic fraction of the mutation, and the evidence thresholds used to declare a mutation.
- MuTect claims to be more sensitive than other methods for low-allelic-fraction and low-read-support events while remaining highly specific.
- Multiple variant callers are needed in a pipeline (e.g., to reduce false negatives).
Pabinger, S. et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief. Bioinform. doi: 10.1093/bib/bbs086 (2013).
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine, 5:28 doi:10.1186/gm432 (2013).
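One simple way to combine several callers is to count how many of them support each site. The sketch below assumes each caller's output has already been reduced to a set of (chromosome, position, REF, ALT) tuples; the call sets are made up for illustration (only the G6PC2 site comes from the slides), and the function names are hypothetical.

```python
from collections import Counter

def support_counts(calls_by_caller):
    """Count, for every site, how many callers reported it."""
    counts = Counter()
    for calls in calls_by_caller.values():
        counts.update(set(calls))
    return counts

def consensus(calls_by_caller, min_callers):
    """Sites reported by at least min_callers callers."""
    counts = support_counts(calls_by_caller)
    return {site for site, n in counts.items() if n >= min_callers}

# Made-up call sets for three callers.
calls = {
    "VarScan2": {("chr2", 169764377, "A", "G"), ("chr1", 1000, "C", "T")},
    "MuTect": {("chr2", 169764377, "A", "G"), ("chr3", 2000, "G", "A")},
    "Strelka": {("chr2", 169764377, "A", "G"), ("chr1", 1000, "C", "T")},
}
union = consensus(calls, min_callers=1)        # fewest false negatives
majority = consensus(calls, min_callers=2)     # compromise
intersection = consensus(calls, min_callers=len(calls))  # fewest false positives
print(len(union), len(majority), len(intersection))
```

Taking the union favors sensitivity, while requiring agreement between callers favors specificity; which trade-off is appropriate depends on the downstream validation budget.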

Variant Annotation
SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation/)
- annotates SNVs and small indels, both known and novel
- includes dbSNP, gene names and accession numbers, variation functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
Oncotator (http://www.broadinstitute.org/cancer/cga/oncotator)
- annotates human genomic point mutations and indels with data relevant to cancer researchers
- aggregates genomic, protein, and cancer annotations
Annovar (http://www.openbioinformatics.org/annovar/)
- annotates genetic variants detected from diverse genomes including human genome
- provides gene, region, and filter based annotations

Why study Structural Variation (SV)
- Common in “normal” human genomes - major cause of phenotypic variation
- Common in certain diseases, particularly cancer
- Now showing up in rare diseases: autism, schizophrenia
Zang, Z.J. et al. Genetic and Structural Variation in the Gastric Cancer Kinome Revealed through Targeted Deep Sequencing. Cancer Res 71(1):29 (2011).
Shibayama, A. et al. MECP2 Structural and 3′-UTR Variants in Schizophrenia, Autism and Other Psychiatric Diseases: A Possible Association With Autism. American Journal of Medical Genetics Part B (Neuropsychiatric Genetics) 128B:50–53 (2004).

Classes of structural variation
Alkan, C. et al. Genome structural variation discovery and genotyping. Nature Reviews Genetics 12, 363-376 (2011).

Software Tools

Name | Detects | Strategy | Reference
BreakDancer | indels, inversions, translocations | read-pair mapping | Chen, K. et al. (2009)
Pindel | indels | split-read analysis | Ye, K. et al. (2009)
CNVnator | CNVs | read-depth analysis | Abyzov, A. et al. (2011)
BreakSeq | | junction mapping | Lam, H.Y.K. et al. (2010)

Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677-681 (2009).
Ye, K. et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25(21):2865-2871 (2009).
Abyzov, A. et al. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21:974-984 (2011).
Lam, H.Y.K. et al. Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nature Biotechnology 28, 47–55 (2010).

BreakDancer
BreakDancerMax
- Detects anomalous read pairs indicative of deletions, insertions, inversions, intrachromosomal and interchromosomal translocations
BreakDancerMini
- Focuses on detecting small indels (typically 10–100 bp) that are not routinely detected by BreakDancerMax
Figure legend: a pair of arrows represents the location and the orientation of a read pair; a dotted line represents a chromosome in the analyzed genome; a solid line represents a chromosome in the reference genome.
Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677-681 (2009).
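The idea of an "anomalous read pair" can be illustrated with a toy classifier. This is a simplified sketch of read-pair mapping in general, not BreakDancer's implementation: it assumes a library where concordant pairs map to the same chromosome on opposite strands with an insert size within 3 standard deviations of the library mean, and it does not try to recognize intrachromosomal translocations. The function name, arguments, and the 3 SD cutoff are all assumptions.

```python
def classify_read_pair(chrom1, strand1, chrom2, strand2, insert_size,
                       mean_insert, sd_insert, n_sd=3.0):
    """Label a mapped read pair by the structural variant it may indicate."""
    if chrom1 != chrom2:
        return "interchromosomal translocation"
    if strand1 == strand2:
        # Both reads on the same strand: orientation is inconsistent.
        return "inversion"
    upper = mean_insert + n_sd * sd_insert
    lower = mean_insert - n_sd * sd_insert
    if insert_size > upper:
        # Reads map farther apart on the reference than the library allows.
        return "deletion"
    if insert_size < lower:
        # Reads map closer together than expected.
        return "insertion"
    return "normal"

# Hypothetical library: mean insert size 300 bp, standard deviation 30 bp.
print(classify_read_pair("chr2", "+", "chr2", "-", 1200, 300, 30))
```

A real caller would only report a variant when several anomalous pairs of the same type cluster at the same location, which this sketch leaves out.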

BreakDancerMax Workflow
Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677-681 (2009).

Summary
- Accurate mapping and processing of NGS data are critical for analysis-ready reads and for downstream variant calling.
- Variant filtering is needed to reduce false positives.
- Multiple variant callers are needed in a pipeline to reduce false negatives.
- Variant annotation helps determine biologically relevant variants.
- A variant calling pipeline should include the right set of tools and filters for the job.