> cd ~
> cp -R /media/sf_shared/BioNGS/GenomicVar/* .
Identifying and interpreting genomic variation using NGS Katherine Fawcett CGAT, University of Oxford
Organisation of sessions
- Lecture on calling, annotating and filtering genomic variation (~45 minutes)
- Practical session 1: Variant calling, annotation and filtering from exome data
- Practical session 2: Integromics – allele-specific expression
What is genomic variation?
Why study genomic variation? Important source of biological variation Changes in gene expression Changes in protein function Contributes to phenotypic variation (disease) Understanding and describing human evolution
Types of genomic variation
- SNVs (single nucleotide variants): most common; single base substitutions; an SNV with an allele frequency of at least 1% is termed a SNP
- Indels: short insertions or deletions (up to 100bp)
- CNVs (copy number variants): duplication or loss of large genomic regions
- SVs (structural variants): balanced translocations or inversions; no copy number change
Methods for studying genomic variation Karyotyping ArrayCGH SNP microarrays DNA sequencing
Advantages of DNA Sequencing
Pros:
- Identify all types of variants
- High resolution
- Unbiased
- Rare variants
- Causal variants
Cons:
- Expensive
- Large datasets
- Complex analysis
- Difficult to interpret
DNA sequencing in human disease Rare causal variants in Mendelian disorders Small families Stuck positional cloning projects De novo dominants Rare variants in common, complex disease Somatic mutations in cancer Diagnostics
Whole Exome Sequencing
Pros:
- 10-fold cheaper
- Greater sample numbers
- Smaller datasets
- Easier interpretation
Cons:
- Protein coding exons only
- SNVs and indels only
- Not all genes/exons captured
- Reference bias
- Less uniform coverage
Sequence capture - whole exome
Sequence capture – whole exome IGV screenshot showing coverage of exons vs introns
Analysis workflow
File Format | Analysis Step | Analysis Tools
FASTQ | QC of FASTQ | FASTQC
SAM/BAM | Mapping Reads | BWA, IGV
BED | Processing & QC | Picard
VCF | Variant Calling | GATK
VCF | Variant Annotation | SnpEff/SnpSift
VCF | Variant Filtering | GATK
Step 1 - QC of Raw Sequencing Data
FASTQ Format
1. Unique sequence ID (line begins with "@")
2. Nucleotide sequence
3. Separator line ("+")
4. Per-base quality score (Phred-scaled)
Different platforms use different symbols (ASCII characters) to encode the quality scores
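To make the four-line record concrete, here is a minimal Python sketch of reading FASTQ records and decoding the quality string. It assumes the Phred+33 encoding (other platforms may use a different offset), and the function name and example read are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) from FASTQ text lines.

    Assumes exactly four lines per record and Phred+33 quality encoding.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each ASCII character encodes one Phred score: score = ord(char) - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Illustrative record (not from the course data)
example = ["@read1", "GATTACA", "+", "IIII5+!"]
records = list(parse_fastq(example))
```

With this encoding "I" decodes to 40, "5" to 20, "+" to 10 and "!" to 0.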
Quality Control using FASTQC FASTQ files Traffic light overview Graphical summaries HTML report Galaxy integration http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Per-base Sequence Quality
Distribution of quality scores for each base position across all reads (x-axis: base 1 to base 100; y-axis: quality score, 0-40)
- Q20 = 1% error rate (scores above 20: yellow)
- Q30 = 0.1% error rate (scores above 30: green)
- Q40 = 0.01% error rate (maximum score)
Possible to ‘trim’ reads by quality
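The relationship between a Phred score Q and the error probability is P = 10^(-Q/10). The sketch below shows that conversion together with one simple way of trimming a read by quality: removing trailing bases below a threshold. The function names and the default threshold of 20 are assumptions for illustration; dedicated trimming tools use more elaborate strategies.

```python
from typing import List, Tuple


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove trailing bases whose quality is below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


trimmed_seq, trimmed_quals = trim_3prime("GATTACA", [40, 40, 40, 40, 20, 10, 0])
```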
Sequence Quality Distribution
Average quality across all bases in a read, plotted against frequency
Two example plots:
- Average sequence quality = 37
- Average sequence quality = 31, with an additional population of low-quality sequences at 17
Low quality reads can be removed by filtering
Exonic sequence has higher GC content than genomic background
Figure: GC count per read plotted against % GC, compared with the theoretical distribution
Step 2 - Mapping Reads to a Reference Genome
File Format | Analysis Step | Analysis Tools
FASTQ | QC of FASTQ | FASTQC
SAM/BAM | Mapping Reads | BWA, IGV
BED | Processing & QC | Picard
VCF | Variant Calling | GATK
VCF | Variant Annotation | SnpEff/SnpSift, VEP
VCF | Variant Filtering | GATK
 | Gene Annotation | DAVID
Mapping Reads to a Reference Genome
- Find the position(s) in the reference genome where each short read sequence aligns with the fewest mismatches
- Must be fast (millions of short reads)
- Must allow small differences (sequencing errors or polymorphisms)
- String matching problem
Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG
Short reads: CACTCCTGTGG, CTCACTCCTGTGG, GCTGATGTGCCACCTCA, GATGTGCCACCTCACTC, GTGCCGGCTCACTCCTG, CTCCTGTGG, TGATGTGCCGCCTCACT
Figure labels: sequencing error, heterozygous SNP, homozygous SNP
Use the g1k reference provided by 1000 Genomes, as it includes the revised Cambridge Reference Sequence for mitochondria, supercontigs, decoy sequences and Human herpesvirus 4 type 1
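The string matching problem can be illustrated with a brute-force search that tries every offset and keeps the one with the fewest mismatches. This is only a sketch of the problem statement using the reference and reads from the slide. Real aligners such as BWA use an index of the reference rather than scanning it, and also allow gaps, which this example does not.

```python
from typing import Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(reference: str, read: str) -> Tuple[int, int]:
    """Return (0-based offset, mismatches) of the leftmost best ungapped placement."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


reference = "GCTGATGTGCCGCCTCACTCCGGTGG"
exact = best_hit(reference, "TGATGTGCCGCCTCACT")    # read matching the reference
with_snp = best_hit(reference, "CTCCTGTGG")         # read carrying one difference
```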
Burrows-Wheeler Aligner (BWA) Part of the Broad Institute’s best practice pipeline Gapped alignment (enables indel calling) Aligns short sequences against long reference genome Fast (if not too many errors) http://www.ncbi.nlm.nih.gov/pubmed/19451168 Takes FASTQ as input, produces SAM as output Key parameters: Number of mismatches allowed (default read length dependent) Number of gaps allowed (default 1) Number of alternative hits to report (default 3) http://bio-bwa.sourceforge.net/bwa.shtml
SAM Format
Sequence Alignment Map
Standardised text file format
Contains alignment information
Header:
- SAM format version
- Sort order
- Reference sequence dictionary (reference sequence name, reference sequence length)
http://samtools.sourceforge.net/SAM1.pdf
SAM Format
1. QNAME: Query template NAME
2. FLAG: bitwise FLAG
3. RNAME: Reference sequence NAME
4. POS: 1-based leftmost mapping POSition
5. MAPQ: MAPping Quality
6. CIGAR: CIGAR string
7. RNEXT: Ref. name of the mate/next segment
8. PNEXT: Position of the mate/next segment
9. TLEN: observed Template LENgth
10. SEQ: segment SEQuence
11. QUAL: ASCII of Phred-scaled base QUALity+33
http://samtools.sourceforge.net/SAM1.pdf
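The eleven mandatory fields can be read with a few lines of Python. The sketch below splits one tab-separated alignment line, decodes the reverse-strand bit of the FLAG (0x10) and breaks the CIGAR string into operations. The record shown is invented for illustration; in practice a library such as pysam would normally be used instead.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5M1I4M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def is_reverse(flag: int) -> bool:
    """True if the 0x10 bit (read mapped to the reverse strand) is set."""
    return bool(flag & 0x10)


# Invented example record
line = "read1\t99\tchr1\t100\t60\t5M1I4M\t=\t300\t210\tGATTACAGAT\tIIIIIIIIII"
record = parse_sam_line(line)
```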
Step 3 - QC of Mapped Reads
Processing steps (Picard)
Input: SAM
1. Convert SAM to BAM
2. Sort BAM by position
3. Reorder contigs
4. Add read groups
5. Remove duplicates
Output: BAM
Post-alignment QC (Picard)
- BAM -> Alignment summary metrics -> QC statistics
- BAM -> Insert size distribution -> QC statistics
- BAM -> Hybrid summary metrics -> QC statistics
Can also run FASTQC on BAM files
Alignment Summary Metrics
- PF_READS_ALIGNED: # reads aligned to reference sequence.
- PF_ALIGNED_BASES: # bases aligned to reference sequence.
- PF_MISMATCH_RATE: The rate of bases mismatching the reference.
- PF_INDEL_RATE: # indels per 100 aligned bases.
- READS_ALIGNED_IN_PAIRS: # aligned reads whose pair was aligned to reference.
- STRAND_BALANCE: # reads aligned to the positive strand of the genome divided by the number of reads aligned to the genome.
- PCT_CHIMERAS: % reads that map outside of a maximum insert size (usually 100kb) or have two ends mapping to different chromosomes.
- PCT_ADAPTER: The percentage of reads that are unaligned and match to a known adapter sequence right from the start of the read.
Expect >90% alignment to reference genome
Insert Size Distribution
Look for a tight distribution that peaks at around the theoretical insert size set by the size-selection step of sequencing library preparation
Hybrid Summary Metrics
- ON_BAIT_BASES: aligned bases that mapped to a baited region of the genome.
- NEAR_BAIT_BASES: aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
- OFF_BAIT_BASES: aligned bases that mapped neither on nor near a bait.
- MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
- FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
- ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.
- PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
Expect > 80% bases covered at 20x
picard.sourceforge.net/picard-metric-definitions.shtml
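To show what a metric like PCT_TARGET_BASES_20X measures, the sketch below computes mean coverage and the percentage of bases at or above a depth threshold from a list of per-base depths. This is not Picard's implementation; the function name and the depth values are invented for illustration.

```python
from typing import List, Tuple


def target_coverage_summary(depths: List[int], threshold: int = 20) -> Tuple[float, float]:
    """Return (mean depth, % of bases with depth >= threshold).

    `depths` holds one read-depth value per target base.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    mean_depth = sum(depths) / len(depths)
    pct_at_threshold = 100.0 * sum(d >= threshold for d in depths) / len(depths)
    return mean_depth, pct_at_threshold


# Invented depths for ten target bases
depths = [0, 5, 20, 25, 30, 40, 50, 10, 20, 100]
mean_depth, pct_20x = target_coverage_summary(depths)
```

For these ten bases the mean depth is 30 and 70% of bases reach 20x, which would fall short of the > 80% expectation quoted on the slide.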
Step 4 - Variant Calling
Variant calling steps (GATK)
1. Local realignment around indels (BAM)
2. Base quality score recalibration (BAM)
3. Variant calling (HaplotypeCaller)
4. Variant quality score recalibration (VCF)
Notes:
- Call families together to get a matrix of all sites by all samples; otherwise filtering is dominated by missing data. You need to know how confident you are that a site is reference.
- Case-control: call together, which is better than intersecting with publicly available data
Local Realignment around Indels
During mapping each read is aligned against the reference genome separately
This can lead to incorrect mapping around indels and false positive SNVs
Two steps:
1. Identify target regions for realignment
2. Realign all reads in this region to produce the most parsimonious alignment across all reads
Notes:
- Alters the original CIGAR string and adds a tag
- Depth dependent: cannot realign at 4X
- Handles indels up to 20bp (HaplotypeCaller up to 200bp)
- Pindel also does this
Indel Realigned BAM
Figure: alignments before and after realignment
Base quality score recalibration Important because downstream analysis is based on quality scores – weighing the evidence in a Bayesian framework DePristo et al., 2011. Nat Genet. 43(5), 491-8
Base Quality Score Recalibration
Take all mismatches from the reference in the BAM file (excluding sites of known genetic variation) and assume them to be errors
Examine the empirical error rate in each bin (dinucleotide context/machine run/reported quality) and use this to adjust the reported error rate
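The binning idea can be sketched as follows: count mismatches and total observations per bin, then convert the empirical error rate back to a Phred score. This is a simplified illustration, not GATK's implementation. The add-one smoothing, the bin key layout and the counts are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def empirical_qualities(observations: Iterable[Tuple[Hashable, bool]]) -> Dict[Hashable, float]:
    """Return an empirical Phred score per bin.

    `observations` yields (bin_key, is_mismatch) for every base, with sites of
    known variation assumed to have been excluded beforehand.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for key, is_mismatch in observations:
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1
    # Add-one smoothing (an assumption here) avoids log10(0) in bins with no mismatches
    return {key: -10 * math.log10((mm + 1) / (n + 2)) for key, (mm, n) in counts.items()}


# Invented bin: (machine run, reported quality, dinucleotide context)
bin_key = ("run1", 30, "AC")
observations = [(bin_key, True)] * 9 + [(bin_key, False)] * 989
recalibrated = empirical_qualities(observations)
```

In this invented bin the bases were reported as Q30, but 9 mismatches in 998 observations give an empirical quality of about Q20, so the reported scores would be adjusted downwards.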
Base Quality Score Recalibration DePristo et al., 2011. Nat Genet. 43(5), 491-8
Variant Calling
Many tools including Samtools, Platypus, Cortex etc.
GATK HaplotypeCaller:
- Calls SNVs and indels simultaneously
- Performs local de novo assembly of variable regions to produce candidate haplotypes (de Bruijn graph, each edge weighted by number of supporting reads)
- Calculates likelihood for each read-haplotype pairing (using pairHMM)
- Assigns genotype by calculating the likelihood of each possible genotype given the scores for each read-haplotype pair (using Bayes' theorem)
- Outputs a VCF file
Still need to indel realign, as BQSR relies on this realignment
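The Bayesian genotyping step can be illustrated with a much simpler model than the one HaplotypeCaller uses. The sketch below works on a pileup of bases at a single biallelic site rather than on read-haplotype likelihoods from a pairHMM, and assumes a flat prior over the three genotypes unless one is supplied. All names and values are illustrative.

```python
import math
from typing import Dict, List, Optional


def genotype_posteriors(bases: str, quals: List[int], ref: str, alt: str,
                        priors: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Posterior probability of RR, RA and AA given pileup bases and Phred qualities."""
    genotypes = {"RR": (ref, ref), "RA": (ref, alt), "AA": (alt, alt)}
    if priors is None:
        priors = {g: 1 / 3 for g in genotypes}  # flat prior: an assumption
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        total = math.log(priors[name])
        for base, q in zip(bases, quals):
            # Cap the error rate so a Q0 base never yields a zero likelihood
            e = min(10 ** (-q / 10), 0.75)
            p1 = 1 - e if base == a1 else e / 3
            p2 = 1 - e if base == a2 else e / 3
            # A read is equally likely to come from either chromosome copy
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_post[name] = total
    # Normalise in log space for numerical stability
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    norm = sum(unnorm.values())
    return {g: u / norm for g, u in unnorm.items()}


het = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
hom_ref = genotype_posteriors("AAAAAAAA", [30] * 8, ref="A", alt="G")
```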
VCF File Format Headers Entries Location Alleles Info Format http://www.1000genomes.org/node/101
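As a rough illustration of how an entry breaks down into location, alleles, INFO and per-sample FORMAT fields, the sketch below parses one tab-separated VCF data line. The example line is illustrative only, and real analyses would normally rely on a dedicated VCF library rather than hand-written parsing.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse one VCF data line (not a '#' header line) into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    info_dict: Dict[str, Any] = {}
    for item in info.split(";"):
        if item == ".":
            continue  # "." means no INFO annotations
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flags carry no value
    record: Dict[str, Any] = {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": filt, "info": info_dict,
    }
    if len(fields) > 9:
        # The FORMAT column names the colon-separated keys used in each sample column
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


# Illustrative entry with two samples
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"
vcf_record = parse_vcf_line(line)
```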
Variant Quality Score Recalibration HaplotypeCaller is designed to be sensitive High false positive rate Learn how to filter from the data itself Models the distribution of known (true) variation relative to specified variant annotations Use model to evaluate novel variants
Variant Quality Score Recalibration
Fitting and Applying the Model
Also good to check the Ts/Tv ratio, concordance of calls with genotyping chips, and the novel variant rate.
It is really hard to assign quality to indels (number of indels still in dispute, multiple alleles, repetitive regions). Out-of-frame vs in-frame.
DePristo et al., 2011. Nat Genet. 43(5), 491-8
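The Ts/Tv check mentioned above is straightforward to compute from a list of SNVs: transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T), and every other substitution is a transversion. The sketch below uses invented variants, and the function name is an assumption for illustration.

```python
from typing import List, Tuple

# Transitions: purine <-> purine (A, G) or pyrimidine <-> pyrimidine (C, T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs: List[Tuple[str, str]]) -> float:
    """Return the transition/transversion ratio for (ref, alt) single-base changes."""
    ts = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")


# Invented call set: four transitions and two transversions
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
ratio = ts_tv_ratio(snvs)
```

A call set whose ratio is well below what is expected for the data type may contain an excess of false positives, since random errors tend to pull the ratio down.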
Step 5: Variant Annotation
Variant annotation
Variant context:
- Is it in a gene, coding region?
- Protein consequence?
- Is it conserved?
- Is it a known variant?
- Does it have known clinical significance?
- Pathway
Who else has it?
- Presence/frequency in population or disease cohorts (e.g. dbSNP, ClinSeq, 1000 Genomes, NHLBI GO Exome Sequencing Project)
Variant Annotation Tools Examples of Variant Annotation Tools include: Ensembl Variant Effect Predictor (VEP) Annovar SnpEff ...and others
SnpEff & SnpSift
- 20,000 reference genomes supported
- Wide variety of data sources integrated
- Pre-computed variant effect predictions from multiple sources (SIFT, Polyphen2, LRT, GERP…)
- Regulatory & non-coding annotations
- Add your own annotations
- Integrates ENCODE data
Variant Effect Prioritisation
Variants can have different effects on different transcripts of the same gene
We want to identify the isoform with the most deleterious effect
GATK VariantAnnotator and newer versions of snpEff
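Picking the most deleterious effect amounts to ranking the per-transcript annotations and taking the top one. The sketch below assumes each annotation carries a SnpEff-style impact category (HIGH, MODERATE, LOW, MODIFIER). The numeric ranking, transcript names and effects are hypothetical.

```python
from typing import Dict, List

# Assumed ordering of impact categories, most severe first
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def most_deleterious(annotations: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the annotation with the highest impact (first one wins on ties)."""
    return max(annotations, key=lambda a: IMPACT_RANK[a["impact"]])


# Hypothetical annotations of one variant against three transcripts
annotations = [
    {"transcript": "TX1", "effect": "synonymous", "impact": "LOW"},
    {"transcript": "TX2", "effect": "stop_gained", "impact": "HIGH"},
    {"transcript": "TX3", "effect": "intronic", "impact": "MODIFIER"},
]
worst = most_deleterious(annotations)
```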
Step 6: Variant Filtering
Filtering strategies De novo Recessive Dominant X-linked
De novo filters
Total variants = 131760 (example trio)
1. Inheritance model (parents RR, proband RA), using Phred-scaled likelihoods of the genotypes: variants = 104
2. Read depth in parents >=10, alternate allele depth in proband >=3: variants = 79
3. Alternate allele read depth/total read depth > 0.25 in proband and < 0.05 in parents: variants = 9
4. Functional impact of variant predicted high or moderate, i.e. variant is nonsense, missense, splice site, or coding indel: variants = 3
5. MAF in population-based cohorts < 0.001: variants = 2
(Between 1 and 4 candidate mutations in other trios!)
Epi4K Consortium & Epilepsy Phenome/Genome Project, Nature 501, 217–221 (12 Sep 2013)
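The filter cascade can be written as a single predicate over the trio's calls. The sketch below uses the thresholds from the slide but simplifies the inheritance step to a comparison of called genotypes rather than Phred-scaled genotype likelihoods. The class, field names and example values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SampleCall:
    genotype: str   # "RR", "RA" or "AA"
    ref_depth: int
    alt_depth: int

    @property
    def depth(self) -> int:
        return self.ref_depth + self.alt_depth

    @property
    def alt_fraction(self) -> float:
        return self.alt_depth / self.depth if self.depth else 0.0


def is_candidate_de_novo(proband: SampleCall, mother: SampleCall, father: SampleCall,
                         impact: str, maf: float) -> bool:
    """Apply the de novo filters in sequence; all must pass."""
    parents = (mother, father)
    return (
        proband.genotype == "RA"
        and all(p.genotype == "RR" for p in parents)       # inheritance model
        and all(p.depth >= 10 for p in parents)            # parental read depth
        and proband.alt_depth >= 3                         # alt allele depth in proband
        and proband.alt_fraction > 0.25                    # allele balance in proband
        and all(p.alt_fraction < 0.05 for p in parents)    # no alt evidence in parents
        and impact in {"HIGH", "MODERATE"}                 # predicted functional impact
        and maf < 0.001                                    # rare in population cohorts
    )


# Invented calls
proband = SampleCall("RA", ref_depth=12, alt_depth=10)
mother = SampleCall("RR", ref_depth=30, alt_depth=0)
clean_father = SampleCall("RR", ref_depth=25, alt_depth=0)
noisy_father = SampleCall("RR", ref_depth=25, alt_depth=3)
passes = is_candidate_de_novo(proband, mother, clean_father, "MODERATE", 0.0)
fails = is_candidate_de_novo(proband, mother, noisy_father, "MODERATE", 0.0)
```

The second call fails because 3 of the father's 28 reads carry the alternate allele, which exceeds the 0.05 limit and suggests the variant may be inherited rather than de novo.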
Filtering tools
Pass JEXL expressions to GATK SelectVariants
Other tools include VAAST, VarMD, VarSifter, SnpSift filter
Filtering using JEXL Expressions
Java Expression Language
Syntax for constructing logic queries
Example: vc.getGenotype("NA12891").getDP()>=10 && vc.getGenotype("NA12892").getDP()>=10
- Select sample: vc.getGenotype("NA12891")
- Select annotation: .getDP()
- Operator: >=
- Value: 10
- Logical operator: &&
Visualisation Samtools tview UCSC IGV Viewing options eg. zooming Highlight reads for more information Designed to be integrative
Further analysis Lower the stringency of filters Check coverage over candidate genes CNVs/SVs Consider whole-genome sequencing?
Practical session 1: Variant calling, annotation and filtering from exome data
HapMap CEU trio NA12891 NA12892 NA12878
Practical hand-out 1. Introduction 2. Calling variants from NGS data – you can read this section but do not download the data or perform the steps outlined as they are too computationally expensive (VCF section) 3 and 4. Variant annotation and filtering – read through these sections and carry out the exercises. Have a go at the advanced exercises if you have time
Practical session 2: Integromics – Allele Specific Expression
Allele specific expression http://www.botanik.uni-koeln.de/1274.html
Allele specific expression Capture biological phenomena: Effects of cis-regulatory variants Nonsense-mediated decay Imprinting Single individual – no need to normalise
Why have DNA and RNAseq? Accuracy of variant calling (especially indels and structural variants) not as good from functional genomics data Allele-specific expression may lead to false reference homozygotes RNA editing
Reference bias (A) Construction of a personal genome by vcf2diploid tool is made by incorporating personal variants into the reference genome. Personal variants may require additional pre‐processing, that is, filtering, genotyping, and/or phasing. The output is the two (paternal and maternal) haplotypes of personal genome. During the construction step, the reference genome is represented as an array of nucleotides with each cell representing a single base. Iteratively, the nucleotides in the array are being modified to reflect personal variations. Once all the variations have been applied, a personal haplotype is constructed by reading through the array. Simultaneously, equivalence map (MAP‐file format—see Supplementary Figure 1) between personal haplotypes and reference genome is being constructed. This can similarly be done for a personal transcriptome. (B) AlleleSeq pipeline for determining allele‐specific binding (ASB) and allele‐specific expression (ASE) aligning reads against the personal diploid genome sequence as well as a diploid‐aware gene annotation file (including splice‐junction library). ©2011 by European Molecular Biology Organization Joel Rozowsky et al. Mol Syst Biol 2011;7:522
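The construction step described in the caption can be sketched for the simplest case. The example below copies the reference into two arrays and applies phased SNVs to each. It is not the vcf2diploid implementation: it handles substitutions only, so no coordinate map between haplotype and reference is needed, and the function name, input layout and variants are invented.

```python
from typing import List, Tuple


def build_haplotypes(reference: str,
                     variants: List[Tuple[int, str, str, str]]) -> Tuple[str, str]:
    """Apply phased SNVs to a reference and return the two haplotype sequences.

    Each variant is (1-based position, ref allele, alt allele, phased genotype),
    for example (2, "A", "C", "0|1"). Indels are not supported in this sketch.
    """
    haplotypes = [list(reference), list(reference)]  # one array per chromosome copy
    for pos, ref, alt, genotype in variants:
        if reference[pos - 1] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        for hap, allele in zip(haplotypes, genotype.split("|")):
            if allele == "1":
                hap[pos - 1] = alt
    return "".join(haplotypes[0]), "".join(haplotypes[1])


# Invented reference and variants: one heterozygous and one homozygous SNV
hap1, hap2 = build_haplotypes("GATTACA", [(2, "A", "C", "0|1"), (5, "A", "G", "1|1")])
```

Mapping reads against both haplotypes rather than the reference alone is what reduces reference bias at heterozygous sites.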
AlleleSeq pipeline
Joel Rozowsky et al. Mol Syst Biol 2011;7:522
False discovery rate Correct for multiple hypothesis testing using FDR Simulates number of false positives given no allele-specific events by permuting allele labels of each mapped read at hetSNV loci For a given P value threshold (binomial test), number of false positives/total observed positives = FDR Default: FDR = 10%
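A minimal sketch of this procedure follows, assuming each heterozygous SNV is summarised by its reference and alternate read counts. It uses an exact two-sided binomial test against an expected ratio of 0.5 and estimates false positives by randomly reassigning the allele of every read. The function names, counts, number of simulations and threshold are illustrative, and the details may well differ from the AlleleSeq implementation.

```python
import math
import random
from typing import List, Tuple


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    probs = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[k]
    # Sum outcomes at least as unlikely as the observed one (small tolerance for ties)
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def estimate_fdr(site_counts: List[Tuple[int, int]], p_threshold: float,
                 n_sims: int = 100, seed: int = 0) -> float:
    """Estimate FDR as mean simulated false positives / observed positives."""
    rng = random.Random(seed)
    observed_pos = sum(binom_two_sided_p(ref, ref + alt) <= p_threshold
                       for ref, alt in site_counts)
    if observed_pos == 0:
        return 0.0
    false_pos = 0
    for _ in range(n_sims):
        for ref, alt in site_counts:
            n = ref + alt
            # Null model: each read is assigned to either allele with equal probability
            k = sum(rng.random() < 0.5 for _ in range(n))
            if binom_two_sided_p(k, n) <= p_threshold:
                false_pos += 1
    return (false_pos / n_sims) / observed_pos


# Invented (ref, alt) read counts at three heterozygous SNVs
site_counts = [(20, 0), (10, 10), (9, 11)]
fdr = estimate_fdr(site_counts, p_threshold=0.01)
```

In practice the P value threshold would be adjusted until the estimated FDR reaches the desired level, such as the 10% default quoted on the slide.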
HapMap CEU trio NA12891 NA12892 RNA-seq NA12878
Other tools... MBASED ASEQ GATK ASEReadCounter