> cd ~
> cp -R /media/sf_shared/BioNGS/GenomicVar/* .

Identifying and interpreting genomic variation using NGS
Katherine Fawcett, CGAT, University of Oxford

Organisation of sessions
- Lecture on calling, annotating and filtering genomic variation (~45 minutes)
- Practical session 1: Variant calling, annotation and filtering from exome data
- Practical session 2: Integromics – allele-specific expression

What is genomic variation?

Why study genomic variation?
- Important source of biological variation
- Changes in gene expression
- Changes in protein function
- Contributes to phenotypic variation (disease)
- Understanding and describing human evolution

Types of genomic variation
- SNVs (single nucleotide variants)
  - Most common
  - Single base substitutions
  - Allele frequency of at least 1% = SNP (single nucleotide polymorphism)
- Indels
  - Short insertions or deletions (up to 100bp)
- CNVs (copy number variants)
  - Duplication or loss of large genomic regions
- SVs (structural variants)
  - Balanced translocations or inversions
  - No copy number change
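The size-based categories above can be sketched in Python. This is a minimal illustration only: the function names, the "MNV" label and the "large" prefix are assumptions made for the example, and balanced structural variants cannot be recognised from allele lengths alone.

```python
def classify_variant(ref: str, alt: str, max_indel_len: int = 100) -> str:
    """Classify a variant from its reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same-length multi-base substitution; not a category on the slide.
        return "MNV"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    # Up to 100bp counts as a short indel; larger events are CNV/SV territory.
    return kind if size <= max_indel_len else "large " + kind


def is_snp(allele_frequency: float, threshold: float = 0.01) -> bool:
    """An SNV is conventionally called a SNP at 1% population frequency or more."""
    return allele_frequency >= threshold
```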

Methods for studying genomic variation
- Karyotyping
- ArrayCGH
- SNP microarrays
- DNA sequencing

Advantages of DNA Sequencing
Pros:
- Identify all types of variants
- High resolution
- Unbiased
- Rare variants
- Causal variants
Cons:
- Expensive
- Large datasets
- Complex analysis
- Difficult to interpret

DNA sequencing in human disease
- Rare causal variants in Mendelian disorders
- Small families
- Stuck positional cloning projects
- De novo dominants
- Rare variants in common, complex disease
- Somatic mutations in cancer
- Diagnostics

Whole Exome Sequencing
Pros:
- 10-fold cheaper
- Greater sample numbers
- Smaller datasets
- Easier interpretation
Cons:
- Protein coding exons only
- SNVs and indels only
- Not all genes/exons captured
- Reference bias
- Less uniform coverage

Sequence capture - whole exome

Sequence capture – whole exome
[Figure: IGV screenshot showing coverage of exons vs introns]

Analysis workflow
File format   Analysis step        Analysis tools
FASTQ         QC of FASTQ          FASTQC
SAM/BAM       Mapping reads        BWA, IGV
BED           Processing & QC      Picard
VCF           Variant calling      GATK
              Variant annotation   SnpEff/SnpSift
              Variant filtering    GATK

Step 1 - QC of Raw Sequencing Data
[Workflow diagram repeated]

FASTQ Format
1. Unique sequence ID
2. Nucleotide sequence
3. Separator line ("+")
4. Per-base quality score (Phred-scaled)
Different platforms use different symbols (ASCII characters) to encode the quality scores
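A minimal Python sketch of reading the four-line FASTQ record described above. It assumes the Phred+33 ASCII encoding (older platforms used other offsets); the names `FastqRecord` and `parse_fastq` are made up for this example.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (ID, sequence, '+', qualities)."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each quality character encodes a Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


example = ["@read1", "GATTACA", "+", "IIII5I#"]
record = next(parse_fastq(example))
print(record.seq_id, record.sequence, record.qualities)
```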

Quality Control using FASTQC
- FASTQ files
- Traffic light overview
- Graphical summaries
- HTML report
- Galaxy integration
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Per-base Sequence Quality
Distribution of quality scores for each base position across all reads
- Quality score ≥ 20 (Q20): error rate ≤ 1% (yellow)
- Quality score ≥ 30 (Q30): error rate ≤ 0.1% (green)
- Quality score 40 (Q40): error rate 0.01% (max)
[Figure: quality score (0-40) plotted for base 1 to base 100]
Possible to 'trim' reads by quality
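The relationship between Phred score and error rate is P = 10^(-Q/10). The sketch below shows that conversion and a deliberately simple 3' quality trim; the trimming rule is an illustrative assumption, not the algorithm of any particular trimming tool.

```python
from typing import List, Tuple


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(sequence: str, qualities: List[int],
                          min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove bases from the 3' end until the last base has quality >= min_q."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_q:
        end -= 1
    return sequence[:end], qualities[:end]


for q in (20, 30, 40):
    print(f"Q{q}: error rate {phred_to_error(q):.4%}")
```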

Sequence Quality Distribution
Average quality across all bases in a read
[Figure: frequency against average sequence quality, two examples]
- Average sequence quality = 37
- Average sequence quality = 31, with an additional population of low-quality sequences at 17
Low quality reads can be removed by filtering

Exonic sequence has higher GC content than genomic background
[Figure: GC count per read against % GC, with the theoretical distribution overlaid]

Step 2 - Mapping Reads to a Reference Genome
[Workflow diagram repeated; this version adds VEP, Gene Annotation and DAVID]

Mapping Reads to a Reference Genome
Find the position(s) in the reference genome where each short read sequence aligns with the fewest mismatches
- Must be fast (millions of short reads)
- Must allow small differences (sequencing errors or polymorphisms)
- String matching problem

Reference sequence  GCTGATGTGCCGCCTCACTCCGGTGG
Short reads         GCTGATGTGCCACCTCA
                      TGATGTGCCGCCTCACT
                       GATGTGCCACCTCACTC
                          GTGCCGGCTCACTCCTG
                                 CTCACTCCTGTGG
                                   CACTCCTGTGG
                                     CTCCTGTGG

The mismatches in the reads illustrate a sequencing error (seen in one read only), a heterozygous SNP (seen in some of the reads covering the site) and a homozygous SNP (seen in all reads covering the site).

Use the g1k reference provided by 1000 Genomes, as it includes the revised Cambridge Reference Sequence for mitochondria, supercontigs, decoy sequences and Human herpesvirus 4 type 1.
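The "fewest mismatches" idea can be illustrated with a brute-force search over the reference and reads from this slide. This is only a sketch of the string matching problem: it tries every offset, allows no gaps, and is far too slow for real data, which is why aligners such as BWA use an index instead.

```python
from typing import Tuple

REFERENCE = "GCTGATGTGCCGCCTCACTCCGGTGG"


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(reference: str, read: str) -> Tuple[int, int]:
    """Return (0-based offset, mismatches) of the best ungapped placement."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    candidates = (
        (count_mismatches(reference[i:i + len(read)], read), i)
        for i in range(len(reference) - len(read) + 1)
    )
    mismatches, offset = min(candidates)  # ties resolve to the leftmost offset
    return offset, mismatches


for read in ("GCTGATGTGCCACCTCA", "TGATGTGCCGCCTCACT", "CTCCTGTGG"):
    print(read, best_alignment(REFERENCE, read))
```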

Burrows-Wheeler Aligner (BWA)
- Part of the Broad Institute's best practice pipeline
- Gapped alignment (enables indel calling)
- Aligns short sequences against a long reference genome
- Fast (if not too many errors)
- Takes FASTQ as input, produces SAM as output
- Key parameters:
  - Number of mismatches allowed (default is read length dependent)
  - Number of gaps allowed (default 1)
  - Number of alternative hits to report (default 3)
http://www.ncbi.nlm.nih.gov/pubmed/19451168
http://bio-bwa.sourceforge.net/bwa.shtml

SAM Format
- Sequence Alignment Map
- Standardised text file format
- Contains alignment information
Header:
- SAM format version
- Sort order
- Reference sequence dictionary
- Reference sequence name
- Reference sequence length
http://samtools.sourceforge.net/SAM1.pdf

SAM Format
1. QNAME: Query template NAME
2. FLAG: bitwise FLAG
3. RNAME: Reference sequence NAME
4. POS: 1-based leftmost mapping POSition
5. MAPQ: MAPping Quality
6. CIGAR: CIGAR string
7. RNEXT: Ref. name of the mate/next segment
8. PNEXT: Position of the mate/next segment
9. TLEN: observed Template LENgth
10. SEQ: segment SEQuence
11. QUAL: ASCII of Phred-scaled base QUALity+33
http://samtools.sourceforge.net/SAM1.pdf
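As a rough illustration of the eleven mandatory fields, the sketch below splits one tab-separated alignment line into named fields and decodes a few of the FLAG bits defined in the SAM specification. The record shown is invented for the example, and optional tags after column 11 are ignored.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# A subset of the FLAG bits from the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x400: "duplicate",
}


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory columns of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> List[str]:
    """List the names of the known bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "read1\t99\tchr1\t1000\t60\t7M\t=\t1200\t207\tGATTACA\tIIII5I#"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, decode_flag(rec.flag))
```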

Step 3 - QC of Mapped Reads
[Workflow diagram repeated]

Processing steps (Picard)
Input: SAM
1. Convert SAM to BAM
2. Sort BAM by position
3. Reorder contigs
4. Add read groups
5. Remove duplicates
Output: BAM

Post-alignment QC (Picard)
- Alignment summary metrics: BAM -> QC statistics
- Insert size distribution: BAM -> QC statistics
- Hybrid summary metrics: BAM -> QC statistics
Can also run FASTQC on BAM files

Alignment Summary Metrics
- PF_READS_ALIGNED: number of reads aligned to the reference sequence.
- PF_ALIGNED_BASES: number of bases aligned to the reference sequence.
- PF_MISMATCH_RATE: the rate of bases mismatching the reference.
- PF_INDEL_RATE: number of indels per 100 aligned bases.
- READS_ALIGNED_IN_PAIRS: number of aligned reads whose pair was also aligned to the reference.
- STRAND_BALANCE: number of reads aligned to the positive strand of the genome divided by the number of reads aligned to the genome.
- PCT_CHIMERAS: percentage of reads that map outside of a maximum insert size (usually 100kb) or have two ends mapping to different chromosomes.
- PCT_ADAPTER: percentage of reads that are unaligned and match a known adapter sequence right from the start of the read.
Expect >90% alignment to the reference genome

Insert Size Distribution
Look for a tight distribution that peaks at around the theoretical insert size set by the size-selection step of sequencing library preparation

Hybrid Summary Metrics
- ON_BAIT_BASES: aligned bases that mapped to a baited region of the genome.
- NEAR_BAIT_BASES: aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
- OFF_BAIT_BASES: aligned bases that mapped neither on nor near a bait.
- MEAN_BAIT_COVERAGE: the mean coverage of all baits in the experiment.
- FOLD_ENRICHMENT: the fold by which the baited region has been amplified above genomic background.
- ZERO_CVG_TARGETS_PCT: the number of targets that did not reach coverage=2 over any base.
- PCT_TARGET_BASES_20X: the percentage of ALL target bases achieving 20X or greater coverage.
Expect > 80% bases covered at 20x
picard.sourceforge.net/picard-metric-definitions.shtml
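To make two of these metrics concrete, here is a hedged Python sketch computed from plain numbers rather than from a BAM file. The function names are invented, and the fold-enrichment formula (on-bait fraction of aligned bases divided by the bait fraction of the genome) is one reading of the definition above; check the Picard documentation before relying on it.

```python
from typing import Sequence


def pct_target_bases_at(depths: Sequence[int], min_depth: int = 20) -> float:
    """Percentage of target bases whose coverage is at least min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    return 100.0 * sum(d >= min_depth for d in depths) / len(depths)


def fold_enrichment(on_bait_bases: int, total_aligned_bases: int,
                    bait_territory: int, genome_size: int) -> float:
    """On-bait fraction of aligned bases relative to the bait fraction of the genome."""
    return (on_bait_bases / total_aligned_bases) / (bait_territory / genome_size)


# Toy per-base depths over a five-base target.
print(pct_target_bases_at([25, 30, 10, 20, 5]))
```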

Step 4 - Variant Calling
[Workflow diagram repeated]

Variant calling steps (GATK)
1. Local realignment around indels (BAM)
2. Base quality score recalibration (BAM)
3. Variant calling (HaplotypeCaller) (BAM)
4. Variant quality score recalibration (VCF)
Notes:
- Call families together to get a matrix of all sites by all samples. Otherwise filtering is dominated by missing data: you need to know how confident you are that a site is reference.
- Case-control studies: call cases and controls together, which is better than intersecting with publicly available data.

Local Realignment around Indels
- During mapping each read is aligned against the reference genome separately
- This can lead to incorrect mapping around indels and false positive SNVs
- Two steps:
  1. Identify target regions for realignment
  2. Realign all reads in this region to produce the most parsimonious alignment across all reads
- Alters the original CIGAR string and adds a tag
- Depth dependent: can't realign at 4X
- Handles indels up to 20bp (HaplotypeCaller up to 200bp)
- Pindel also does this

Indel Realigned BAM
[Figure: alignments before and after indel realignment]

Base quality score recalibration
Important because downstream analysis is based on quality scores (weighing the evidence in a Bayesian framework)
DePristo et al., 2011. Nat Genet. 43(5), 491-8

Base Quality Score Recalibration
- Take all mismatches from the reference in the BAM file (excluding sites of known genetic variation) and assume them to be errors
- Examine the empirical error rate in each bin (dinucleotide context/machine run/reported quality) and use this to adjust the reported error rate
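The binning idea can be sketched as follows. This is a simplified illustration, not GATK's actual model: each observation is a base call already assigned to a bin, known variant sites are assumed to have been excluded beforehand, and the +1/+2 pseudocounts are an assumption added here to avoid infinite qualities in bins with no mismatches.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def empirical_qualities(
    observations: Iterable[Tuple[Hashable, bool]]
) -> Dict[Hashable, float]:
    """Map each bin to an empirical Phred score from its observed mismatch rate.

    observations: (bin_key, is_mismatch) pairs, where bin_key could be e.g.
    (dinucleotide context, reported quality).
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [bases observed, mismatches]
    for key, is_mismatch in observations:
        counts[key][0] += 1
        counts[key][1] += int(is_mismatch)
    return {
        key: -10 * math.log10((mismatches + 1) / (total + 2))
        for key, (total, mismatches) in counts.items()
    }


# Bases reported as Q30 in context "AC" that mismatch about 1% of the time
# come out close to Q20 empirically.
obs = [(("AC", 30), False)] * 991 + [(("AC", 30), True)] * 9
print(empirical_qualities(obs))
```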

Base Quality Score Recalibration DePristo et al., 2011. Nat Genet. 43(5), 491-8

Variant Calling
Many tools, including Samtools, Platypus, Cortex etc.
GATK HaplotypeCaller:
- Calls SNVs and indels simultaneously
- Performs local de novo assembly of variable regions to produce candidate haplotypes (de Bruijn graph, each edge weighted by the number of supporting reads)
- Calculates the likelihood of each read-haplotype pairing (using pairHMM)
- Assigns a genotype by calculating the likelihood of each possible genotype given the scores for each read-haplotype pair (using Bayes' theorem)
- Outputs a VCF file
- Still need to indel realign, as BQSR relies on this realignment
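The final Bayesian genotyping step can be illustrated with a much simpler model than HaplotypeCaller's: a single biallelic site, independent reads, one uniform error rate, and a flat prior. All of those are assumptions made for this sketch, which skips the assembly and pairHMM stages entirely.

```python
import math
from typing import Dict, Optional

# Expected fraction of alternate-allele reads under each diploid genotype.
GENOTYPES = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def genotype_posteriors(n_ref: int, n_alt: int, error_rate: float = 0.001,
                        priors: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Posterior probability of each genotype given ref/alt read counts."""
    priors = priors or {g: 1 / 3 for g in GENOTYPES}
    log_post = {}
    for genotype, alt_fraction in GENOTYPES.items():
        # Probability that a single read shows the alternate allele.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        log_likelihood = n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
        log_post[genotype] = log_likelihood + math.log(priors[genotype])
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


print(genotype_posteriors(n_ref=10, n_alt=10))
```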

VCF File Format
[Figure: annotated VCF file showing headers and entries (location, alleles, info, format)]
http://www.1000genomes.org/node/101

Variant Quality Score Recalibration
- HaplotypeCaller is designed to be sensitive
- High false positive rate
- Learn how to filter from the data itself
- Models the distribution of known (true) variation relative to specified variant annotations
- Use the model to evaluate novel variants

Variant Quality Score Recalibration

Fitting and Applying the Model
- Also good to check the Ts/Tv ratio, the concordance of calls with genotyping chips, and the novel variant rate.
- Really hard to assign quality to indels (number of indels still in dispute/multiple alleles/repetitive regions).
- Out-of-frame vs in-frame
DePristo et al., 2011. Nat Genet. 43(5), 491-8
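The Ts/Tv (transition/transversion) check mentioned above is straightforward to compute from a list of SNVs. A minimal sketch, assuming the input holds only single-base substitutions given as (ref, alt) pairs:

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions among (ref, alt) SNV pairs."""
    pairs = [(ref.upper(), alt.upper()) for ref, alt in snvs]
    transitions = sum(pair in TRANSITIONS for pair in pairs)
    transversions = len(pairs) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))
```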

Step 5: Variant Annotation
[Workflow diagram repeated; this version also lists Galaxy]

Variant annotation
Variant context:
- Is it in a gene, coding region?
- Protein consequence?
- Is it conserved?
- Is it a known variant?
- Does it have known clinical significance?
- Pathway
Who else has it?
- Presence/frequency in population or disease cohorts (e.g. dbSNP, ClinSeq, 1000 Genomes, NHLBI GO Exome Sequencing Project)

Variant Annotation Tools
Examples of variant annotation tools include:
- Ensembl Variant Effect Predictor (VEP)
- Annovar
- SnpEff
- ...and others

SnpEff & SnpSift
- 20,000 reference genomes supported
- Wide variety of data sources integrated
- Pre-computed variant effect predictions from multiple sources (SIFT, Polyphen2, LRT, GERP…)
- Regulatory & non-coding annotations
- Add your own annotations
- Integrates ENCODE data

Variant Effect Prioritisation
- Variants can have different effects on different transcripts of the same gene
- We want to identify the isoform with the most deleterious effect
- Tools: GATK VariantAnnotator and newer versions of snpEff

Step 6: Variant Filtering
[Workflow diagram repeated]

Filtering strategies
- De novo
- Recessive
- Dominant
- X-linked

De novo filters
[Figure: trio pedigree with genotypes RR and RA]
Total variants = 131760 (example trio)
1. Inheritance model: use phred-scaled likelihoods of the genotypes. Variants = 104
2. Read depth in parents >= 10, alternate allele depth in proband >= 3. Variants = 79
3. Alternate allele read depth / total read depth > 0.25 in proband and < 0.05 in parents. Variants = 9
4. Functional impact of variant predicted high or moderate, i.e. variant is nonsense, missense, splice site, or coding indel. Variants = 3
5. MAF in population-based cohorts < 0.001. Variants = 2
(Between 1 and 4 candidate mutations in other trios!)
Epi4K Consortium & Epilepsy Phenome/Genome Project, Nature 501, 217–221 (12 Sep 2013)
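Filters 2 to 5 of this cascade could be expressed as a single predicate, sketched below. The dictionary layout and key names (`father`, `mother`, `proband`, `depth`, `alt_depth`, `impact`, `population_maf`) are hypothetical, chosen only for this example; the genotype-likelihood inheritance filter (step 1) is assumed to have been applied already.

```python
from typing import Any, Dict


def alt_fraction(sample: Dict[str, int]) -> float:
    """Alternate allele read depth divided by total read depth."""
    return sample["alt_depth"] / sample["depth"] if sample["depth"] else 0.0


def passes_de_novo_filters(variant: Dict[str, Any]) -> bool:
    """Apply the depth, allele balance, impact and frequency filters."""
    parents = (variant["father"], variant["mother"])
    proband = variant["proband"]
    return (
        all(p["depth"] >= 10 for p in parents)
        and proband["alt_depth"] >= 3
        and alt_fraction(proband) > 0.25
        and all(alt_fraction(p) < 0.05 for p in parents)
        and variant["impact"] in {"HIGH", "MODERATE"}
        and variant["population_maf"] < 0.001
    )


candidate = {
    "father": {"depth": 30, "alt_depth": 0},
    "mother": {"depth": 25, "alt_depth": 1},
    "proband": {"depth": 40, "alt_depth": 18},
    "impact": "MODERATE",
    "population_maf": 0.0,
}
print(passes_de_novo_filters(candidate))
```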

Filtering tools
- Pass JEXL expressions to GATK SelectVariants
- Other tools include VAAST, VarMD, VarSifter, SnpSift filter

Filtering using JEXL Expressions
- Java Expression Language
- Syntax for constructing logic queries
vc.getGenotype("NA12891").getDP()>=10 && vc.getGenotype("NA12892").getDP()>=10
Parts of the expression:
- Select sample: getGenotype("NA12891")
- Select annotation: getDP()
- Operator: >=
- Value: 10
- Logical operator: &&
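For readers more comfortable with Python than JEXL, the same test (both parents have read depth of at least 10) can be sketched directly on a VCF record. This is not how GATK evaluates the expression internally; the record below is invented, and treating a missing DP as 0 is a choice made for this example.

```python
from typing import Dict, Sequence


def sample_depths(header_line: str, record_line: str) -> Dict[str, int]:
    """Return the per-sample DP values of one VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
    depths = {}
    for name, value in zip(samples, fields[9:]):
        entry = dict(zip(keys, value.split(":")))
        dp = entry.get("DP", ".")
        depths[name] = int(dp) if dp != "." else 0  # missing depth counted as 0
    return depths


def parents_have_depth(depths: Dict[str, int],
                       parents: Sequence[str] = ("NA12891", "NA12892"),
                       min_dp: int = 10) -> bool:
    """Python equivalent of the two getDP() >= 10 clauses joined by &&."""
    return all(depths.get(p, 0) >= min_dp for p in parents)


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA12891\tNA12892"
record = "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:25\t0/0:14\t0/0:9"
print(sample_depths(header, record), parents_have_depth(sample_depths(header, record)))
```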

Visualisation
- Samtools tview
- UCSC
- IGV
  - Viewing options, e.g. zooming
  - Highlight reads for more information
  - Designed to be integrative

Further analysis
- Lower the stringency of filters
- Check coverage over candidate genes
- CNVs/SVs
- Consider whole-genome sequencing?

Practical session 1
Variant calling, annotation and filtering from exome data

HapMap CEU trio
[Figure: pedigree with parents NA12891 and NA12892 and child NA12878]

Practical hand-out
1. Introduction
2. Calling variants from NGS data: you can read this section, but do not download the data or perform the steps outlined, as they are too computationally expensive (VCF section)
3 and 4. Variant annotation and filtering: read through these sections and carry out the exercises. Have a go at the advanced exercises if you have time

Practical session 2
Integromics – Allele Specific Expression

Allele specific expression http://www.botanik.uni-koeln.de/1274.html

Allele specific expression
Captures biological phenomena:
- Effects of cis-regulatory variants
- Nonsense-mediated decay
- Imprinting
Single individual: no need to normalise

Why have DNA and RNAseq?
- Accuracy of variant calling (especially indels and structural variants) is not as good from functional genomics data
- Allele-specific expression may lead to false reference homozygotes
- RNA editing

Reference bias (A) Construction of a personal genome by vcf2diploid tool is made by incorporating personal variants into the reference genome. Personal variants may require additional pre‐processing, that is, filtering, genotyping, and/or phasing. The output is the two (paternal and maternal) haplotypes of personal genome. During the construction step, the reference genome is represented as an array of nucleotides with each cell representing a single base. Iteratively, the nucleotides in the array are being modified to reflect personal variations. Once all the variations have been applied, a personal haplotype is constructed by reading through the array. Simultaneously, equivalence map (MAP‐file format—see Supplementary Figure 1) between personal haplotypes and reference genome is being constructed. This can similarly be done for a personal transcriptome. (B) AlleleSeq pipeline for determining allele‐specific binding (ASB) and allele‐specific expression (ASE) aligning reads against the personal diploid genome sequence as well as a diploid‐aware gene annotation file (including splice‐junction library). ©2011 by European Molecular Biology Organization Joel Rozowsky et al. Mol Syst Biol 2011;7:522

AlleleSeq pipeline
[Figure: panels (A) and (B), with the same caption as the "Reference bias" slide]
©2011 by European Molecular Biology Organization Joel Rozowsky et al. Mol Syst Biol 2011;7:522

False discovery rate
- Correct for multiple hypothesis testing using FDR
- Simulates the number of false positives given no allele-specific events, by permuting the allele labels of each mapped read at hetSNV loci
- For a given P value threshold (binomial test), number of false positives / total observed positives = FDR
- Default: FDR = 10%
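The procedure above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the AlleleSeq implementation: permuting allele labels is modelled as assigning each read to either allele with probability 0.5, the exact binomial test is computed naively (suitable for modest read depths only), and the function names are invented.

```python
import random
from math import comb
from typing import Sequence, Tuple


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def estimate_fdr(sites: Sequence[Tuple[int, int]], p_threshold: float,
                 n_simulations: int = 100, seed: int = 0) -> float:
    """FDR = simulated false positives / observed positives at a P value threshold.

    sites: (ref_reads, alt_reads) counts at heterozygous SNV loci.
    """
    rng = random.Random(seed)
    observed = sum(binomial_two_sided_p(ref, ref + alt) <= p_threshold
                   for ref, alt in sites)
    if observed == 0:
        return 0.0
    false_positives = 0
    for _ in range(n_simulations):
        for ref, alt in sites:
            n = ref + alt
            # Null model: every read is equally likely to carry either allele.
            k = sum(rng.random() < 0.5 for _ in range(n))
            false_positives += binomial_two_sided_p(k, n) <= p_threshold
    return min(1.0, (false_positives / n_simulations) / observed)


sites = [(20, 0), (10, 10), (18, 2)]
print(estimate_fdr(sites, p_threshold=0.01))
```

In practice the P value threshold would be lowered until the estimated FDR drops to the chosen level, such as the 10% default mentioned above.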

HapMap CEU trio
[Figure: trio pedigree (NA12891, NA12892, NA12878) with an RNA-seq label]

Other tools...
- MBASED
- ASEQ
- GATK ASEReadCounter