Ruibin Xi Peking University School of Mathematical Sciences


Biostatistics Lecture 17
Single nucleotide polymorphism detection: an introduction

SNPs vs. SNVs
The distinction is really a matter of frequency of occurrence. Both are concerned with aberrations at a single nucleotide.
SNP (Single Nucleotide Polymorphism)
  Aberration expected at the position for any member of the species (well characterized)
  Occurs in the population at some frequency, so it is expected at a given locus
  Catalogued in dbSNP (http://www.ncbi.nlm.nih.gov/snp)
SNV (Single Nucleotide Variant)
  Aberration seen in only a few individuals (not well characterized)
  Occurs at low frequency, so it is not common
  May be associated with certain diseases

SNV types of interest
Non-synonymous mutations
  Impact on protein sequence
  Result in an amino acid change
  Missense and nonsense mutations
Somatic mutations in cancer
  Tumor-specific mutations in tumor-normal pairs
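The distinction between synonymous, missense, and nonsense changes can be illustrated with a minimal sketch that translates the reference and variant codons using the standard genetic code. The function name `classify_change` and the three labels it returns are illustrative choices, not part of the lecture; the sketch also treats any change away from a stop codon as missense for simplicity.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, no protein change
    if alt_aa == "*":
        return "nonsense"     # new premature stop codon
    return "missense"         # different amino acid (simplified: includes stop loss)


if __name__ == "__main__":
    print(classify_change("AAC", "GAC"))  # Asn -> Asp
    print(classify_change("TAC", "TAA"))  # Tyr -> stop
    print(classify_change("CTT", "CTC"))  # Leu -> Leu
```

An A>G change in the first position of an Asn codon (AAC to GAC) gives Asp, which is the kind of non-synonymous change the slide refers to.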

Catalogs of human genetic variation
The 1000 Genomes Project (http://www.1000genomes.org/)
  SNPs and structural variants
  Genomes of about 2,500 unidentified people from about 25 populations around the world will be sequenced using NGS technologies
HapMap (http://hapmap.ncbi.nlm.nih.gov/)
  Identifies and catalogs genetic similarities and differences
dbSNP (http://www.ncbi.nlm.nih.gov/snp/)
  Database of SNPs and multiple small-scale variations, including indels, microsatellites, and non-polymorphic variants
COSMIC (http://www.sanger.ac.uk/genetics/CGP/cosmic/)
  Catalog of Somatic Mutations in Cancer

A framework for variation discovery DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

A framework for variation discovery
Phase 1: Mapping
  Place reads with an initial alignment on the reference genome using mapping algorithms
  Refine the initial alignments
    Local realignment around indels
    Molecular duplicates are eliminated
  Generate the technology-independent SAM/BAM alignment map format
  Accurate mapping is crucial for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

Remove duplicates
  Remove potential PCR duplicates, which arise from the PCR amplification step in library preparation
  If multiple read pairs have identical external coordinates, retain only the pair with the highest mapping quality
  Duplicates manifest themselves as high read depth support, which impacts variant calling
  Software: SAMtools (rmdup) or Picard tools (MarkDuplicates)
Figure: Human HapMap individual NA12005, chr20:8660-8790, showing a false SNP
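The retention rule on this slide can be sketched in a few lines. This is a simplified illustration of the rule as stated (identical external coordinates, keep the highest mapping quality), not the actual logic of SAMtools rmdup or Picard MarkDuplicates, which also take details such as orientation into account. The `ReadPair` record and its fields are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost external coordinate of the pair
    end: int     # rightmost external coordinate of the pair
    mapq: int    # mapping quality used to rank duplicates


def remove_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep one pair per (chrom, start, end), preferring the highest MAPQ."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end)
        if key not in best or pair.mapq > best[key].mapq:
            best[key] = pair
    return sorted(best.values(), key=lambda p: (p.chrom, p.start, p.end))


if __name__ == "__main__":
    pairs = [
        ReadPair("r1", "chr20", 8660, 8790, 30),
        ReadPair("r2", "chr20", 8660, 8790, 60),  # duplicate of r1, higher MAPQ
        ReadPair("r3", "chr20", 8700, 8850, 40),
    ]
    print([p.name for p in remove_duplicates(pairs)])
```

Here r1 and r2 share external coordinates, so only r2 (the higher mapping quality) is kept alongside r3.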

Local Realignment


Local Realignment
  Create local haplotypes
  For each haplotype Hi, align the reads to Hi and score the alignment
  Find the best haplotype Hi, then realign all reads against just Hi and H0 (the reference haplotype)
  All reads are realigned if the log LR is > 5
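The scoring formula for this slide is not preserved in the transcript. The sketch below shows one way such a test could work, under stated assumptions: each read is placed without gaps on a haplotype, a matching base contributes 1 - e and a mismatching base contributes e (with e derived from the Phred base quality), and the log likelihood ratio compares "each read picks the better of H0 and Hi" against "H0 only". The log base (10), the ungapped placement, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def read_log_likelihood(read: str, quals: List[int], hap: str, offset: int) -> float:
    """log10 likelihood of a read placed without gaps at `offset` on `hap`."""
    total = 0.0
    for k, (base, q) in enumerate(zip(read, quals)):
        e = phred_to_error(q)  # assumes q > 0 so that 1 - e is positive
        total += math.log10(1 - e) if hap[offset + k] == base else math.log10(e)
    return total


def best_log_likelihood(read: str, quals: List[int], hap: str) -> float:
    """Best ungapped placement of the read anywhere on the haplotype."""
    n_offsets = len(hap) - len(read) + 1
    return max(read_log_likelihood(read, quals, hap, o) for o in range(n_offsets))


def should_realign(reads: List[Tuple[str, List[int]]], h0: str, hi: str,
                   threshold: float = 5.0) -> Tuple[bool, float]:
    """Return (decision, log LR) for the two-haplotype model versus H0 alone."""
    log_lr = 0.0
    for read, quals in reads:
        l0 = best_log_likelihood(read, quals, h0)
        li = best_log_likelihood(read, quals, hi)
        log_lr += max(l0, li) - l0
    return log_lr > threshold, log_lr


if __name__ == "__main__":
    h0 = "ACGTACGTTTGACCA"   # reference haplotype
    hi = "ACGTACGTGACCA"     # alternate haplotype with a 2 bp deletion
    reads = [("CGTGACC", [30] * 7), ("CGTGACC", [30] * 7)]
    print(should_realign(reads, h0, hi))
```

Reads spanning the deletion match Hi exactly but need mismatches to fit H0, so the ratio exceeds the threshold and the reads would be realigned.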

A framework for variation discovery
Phase 2: Discovery of raw variants
  Analysis-ready SAM/BAM files are analyzed to discover all sites with statistical evidence for an alternate allele present among the samples
  SNPs, SNVs, short indels, and SVs
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

A framework for variation discovery
Phase 3: Discovery of analysis-ready variants
  Technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium, and family and population structure are integrated with the raw variant calls from Phase 2 to separate true polymorphic sites from machine artifacts
  At these sites, high-quality genotypes are determined for all samples
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

SNV Filtering
  Sufficient depth of read coverage
  SNV present in a given number of reads
  High mapping and SNV quality
  SNV density in a given bp window
  SNV greater than a given number of bp from a predicted indel
  Strand balance/bias
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
Larson, D.E. et al. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics Advance Access (2011).

SNV filtering

SomaticSniper: somatic detection filter
Filter using SAMtools (Li et al., 2009) calls from the tumor. Sites are retained if they meet all of the following rules:
  Site is more than 10 bp from a predicted indel of quality ≥ 50
  Maximum mapping quality at the site is ≥ 40
  Fewer than 3 SNV calls in a 10 bp window around the site
  Site is covered by ≥ 3 reads
  Consensus quality ≥ 20
  SNP quality ≥ 20
SomaticSniper predictions passing the filters are then intersected with calls from dbSNP, and sites matching both the position and allele of known dbSNP entries are removed.
Li, H. et al. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9 (2009).
Larson, D.E. et al. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics Advance Access (2011).
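The rules above translate directly into a predicate over per-site summary values. The sketch below is an illustration only: the `Site` record and its field names are hypothetical and do not correspond to the actual SomaticSniper or SAMtools output columns. The thresholds are those listed on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    dist_to_indel: Optional[int]  # bp to nearest predicted indel of quality >= 50, None if no such indel
    max_mapq: int                 # maximum mapping quality at the site
    snvs_in_window: int           # SNV calls in a 10 bp window around the site
    depth: int                    # number of reads covering the site
    consensus_quality: int
    snp_quality: int
    in_dbsnp: bool                # position and allele both match a known dbSNP entry


def passes_filter(site: Site) -> bool:
    """Apply the retention rules from the slide, then the dbSNP removal step."""
    far_from_indel = site.dist_to_indel is None or site.dist_to_indel > 10
    return (
        far_from_indel
        and site.max_mapq >= 40
        and site.snvs_in_window < 3
        and site.depth >= 3
        and site.consensus_quality >= 20
        and site.snp_quality >= 20
        and not site.in_dbsnp
    )


if __name__ == "__main__":
    good = Site(None, 60, 1, 31, 40, 45, False)
    near_indel = Site(4, 60, 1, 31, 40, 45, False)
    print(passes_filter(good), passes_filter(near_indel))
```

Because every rule must hold, a single failing criterion (here, an indel 4 bp away) is enough to discard a site.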

Variant calling methods
More than 15 different algorithms, in two categories:
  Heuristic approach
    Based on thresholds for read depth, base quality, variant allele frequency, and statistical significance
  Probabilistic methods, e.g. a Bayesian model
    Quantify statistical uncertainty
    Assign priors based on the observed allele frequency of multiple samples
Example of a SNP variant: Ref A; Ind1 G/G; Ind2 A/G
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list

Variant callers

Name                  Category           Tumor/Normal pairs  Metric               Reference
SOAPsnp               Bayesian           No                  Phred QUAL           Li et al. (2009)
JointSNVMix (Fisher)  Probability model  Yes                 Somatic probability  Roth, A. et al. (2012)
SomaticSniper         Heuristic                              Somatic Score        Larson, D.E. et al. (2012)
VarScan 2                                                    p-value              Koboldt, D. et al. (2012)
GATK                                                                              DePristo, M.A. et al. (2011)

Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132 (2009).
Roth, A. et al. JointSNVMix: A Probabilistic Model For Accurate Detection Of Somatic Mutations In Normal/Tumour Paired Next Generation Sequencing Data. Bioinformatics (2012).
Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 28(3):311-7 (2012).
Koboldt, D. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research (2012).
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).

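Algorithm: SOAPsnp
Given the observed data D at a locus, the posterior probability of a genotype Ti follows from Bayes' rule:
  P(Ti | D) = P(Ti) P(D | Ti) / sum over x of P(Tx) P(D | Tx)
For a haploid genome, Ti is one of the four alleles.
For a diploid genome, Ti is one of the ten unordered allele pairs, and the likelihood is evaluated over the set of observed alleles at the locus.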
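A minimal sketch of Bayesian genotype calling in this spirit is given below. It is not the SOAPsnp implementation: the error model (a miscalled base is equally likely to be any of the other three bases), the uniform default prior, and the assumption that each observed base comes from either allele of a diploid genotype with equal probability are all simplifications chosen for illustration. The function names are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Optional, Tuple

BASES = "ACGT"
Genotype = Tuple[str, ...]


def base_likelihood(observed: str, allele: str, q: int) -> float:
    """P(observed base | true allele) under a simple symmetric error model."""
    e = 10 ** (-q / 10)
    return 1 - e if observed == allele else e / 3


def genotype_posteriors(observations: List[Tuple[str, int]], ploidy: int = 2,
                        priors: Optional[Dict[Genotype, float]] = None
                        ) -> Dict[Genotype, float]:
    """Posterior over genotypes: prior times likelihood, normalized over all genotypes."""
    genotypes = list(combinations_with_replacement(BASES, ploidy))
    if priors is None:
        # Assumption: uniform prior; real callers use informed priors.
        priors = {g: 1 / len(genotypes) for g in genotypes}
    joint = {}
    for g in genotypes:
        likelihood = 1.0
        for observed, q in observations:
            # Each read base is drawn from one of the genotype's alleles with equal probability.
            likelihood *= sum(base_likelihood(observed, a, q) for a in g) / len(g)
        joint[g] = priors[g] * likelihood
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}


if __name__ == "__main__":
    het = genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)
    hom = genotype_posteriors([("G", 30)] * 10)
    print(max(het, key=het.get), max(hom, key=hom.get))
```

With five A and five G observations the heterozygous A/G genotype has the highest posterior, while ten G observations favor G/G. Setting `ploidy=1` restricts the genotypes to the four single alleles.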

Algorithm: JointSNVMix
JointSNVMix (Fisher's Exact Test)
  Allele count data from the normal and tumor are compared using a two-tailed Fisher's exact test
  If the counts are significantly different (e.g., p-value < 0.001), the position is labeled as a variant position

2x2 contingency table for G6PC2, hg19 chr2:169764377 A>G, Asn286Asp:

          REF allele  ALT allele  Total
  Tumor   15          16          31
  Normal  25          0           25
  Totals  40          16          56

The two-tailed P value for Fisher's exact test is < 0.0001. The association between rows (groups) and columns (outcomes) is considered to be extremely statistically significant.
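The test on this slide can be reproduced with a short standard-library sketch. It computes the two-tailed Fisher's exact test by summing the hypergeometric probabilities of all tables (with the same margins) that are no more likely than the observed one. The function name is illustrative; in practice a library routine such as `scipy.stats.fisher_exact` would normally be used.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table at least as extreme as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_observed * (1 + 1e-9))


if __name__ == "__main__":
    # Tumor: REF=15, ALT=16; Normal: REF=25, ALT=0
    p = fisher_exact_two_tailed(15, 16, 25, 0)
    print(p, p < 0.001)
```

For the counts on the slide the p-value falls below the 0.001 threshold, so the position would be labeled as a variant.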

Variant Calling
G6PC2, hg19 chr2:169764377 A>G, Asn286Asp
  Normal: Depth=25, REF=25, ALT=0
  Tumor: Depth=31, REF=15, ALT=16

How many variants will I find?
Samples compared to the reference genome:
  HiSeq: whole genome; mean coverage 60; HapMap individual NA12878
  Exome: Agilent capture; mean coverage 20; HapMap individual NA12878
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Variant Annotation
SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation/)
  Annotation of known and novel SNPs
  Includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
Annovar (http://www.openbioinformatics.org/annovar/)
  Gene-based annotation
  Region-based annotation
  Filter-based annotation