> cd ~
> cp -R /media/sf_shared/BioNGS/GenomicVar/* .

Identifying and interpreting genomic variation using NGS
Katherine Fawcett CGAT, University of Oxford

Organisation of sessions
Lecture on calling, annotating and filtering genomic variation (~45 minutes)
Practical session 1: Variant calling, annotation and filtering from exome data
Practical session 2: Integromics: allele-specific expression

What is genomic variation?

Why study genomic variation?
Important source of biological variation
Changes in gene expression
Changes in protein function
Contributes to phenotypic variation (disease)
Understanding and describing human evolution

Types of genomic variation
SNVs (single nucleotide variants)
  Most common
  Single base substitutions
  Allele frequency of 1% or more = SNP (single nucleotide polymorphism)
Indels
  Short insertions or deletions (up to 100bp)
CNVs (copy number variants)
  Duplication or loss of large genomic regions
SVs (structural variants)
  Balanced translocations or inversions
  No copy number change

Methods for studying genomic variation
Karyotyping
ArrayCGH
SNP microarrays
DNA sequencing

Advantages of DNA Sequencing
Pros
  Identify all types of variants
  High resolution
  Unbiased
  Rare variants
  Causal variants
Cons
  Expensive
  Large datasets
  Complex analysis
  Difficult to interpret

DNA sequencing in human disease
Rare causal variants in Mendelian disorders
  Small families
  Stuck positional cloning projects
  De novo dominants
Rare variants in common, complex disease
Somatic mutations in cancer
Diagnostics

Whole Exome Sequencing
Pros
  10-fold cheaper
  Greater sample numbers
  Smaller datasets
  Easier interpretation
Cons
  Protein coding exons only
  SNVs and indels only
  Not all genes/exons captured
  Reference bias
  Less uniform coverage

Sequence capture - whole exome

Sequence capture – whole exome
IGV screenshot showing coverage of exons vs introns

Analysis workflow
File Format   Analysis Step        Analysis Tools
FASTQ         QC of FASTQ          FASTQC
SAM/BAM       Mapping Reads        BWA, IGV
BED           Processing & QC      Picard
VCF           Variant Calling      GATK
              Variant Annotation   SnpEff/SnpSift
              Variant Filtering    GATK

Step 1 - QC of Raw Sequencing Data

FASTQ Format
1. Unique sequence ID
2. Nucleotide sequence
3. Separator line ("+")
4. Per-base quality score (Phred-scaled)
Different platforms use different symbols (ASCII characters) to encode the quality scores
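The four-line record layout can be illustrated with a few lines of Python. This is a minimal sketch, assuming strictly four lines per record and the Phred+33 quality encoding. The function name `read_fastq` and the example read are made up for illustration.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # separator line ('+'), ignored
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so subtract 33 from each ASCII code
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores

# Hypothetical record, for illustration only
example = StringIO("@read1\nGATTACA\n+\nIIII5+!\n")
records = list(read_fastq(example))
print(records)
```

Files written with a different quality offset would need the constant 33 changed accordingly.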

Quality Control using FASTQC
Input: FASTQ files
Traffic light overview
Graphical summaries
HTML report
Galaxy integration

Per-base Sequence Quality
Distribution of quality scores for each base position across all reads
Quality score of 20 (Q20) = 1% error rate (yellow)
Quality score of 30 (Q30) = 0.1% error rate (green)
Quality score of 40 (Q40) = 0.01% error rate (max)
Plot axes: base position in read (up to 100) vs quality score (0-40)
Possible to 'trim' reads by quality
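The error rates quoted for Q20, Q30 and Q40 follow from the Phred definition, P = 10^(-Q/10). The sketch below shows that conversion together with a very simple 3' trimming rule. The trimming rule and the name `trim_3prime` are illustrative assumptions, not the algorithm of any particular trimming tool.

```python
def phred_to_error(q):
    """Error probability for a Phred-scaled quality score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until the last base has quality >= min_q.

    Assumption: a deliberately simple rule chosen for illustration.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

for q in (20, 30, 40):
    print(q, phred_to_error(q))
print(trim_3prime("GATTACA", [40, 40, 40, 40, 20, 10, 0]))
```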

Sequence Quality Distribution
Plot: average quality across all bases in a read vs frequency
Example 1: average sequence quality = 37
Example 2: average sequence quality = 31, with an additional population of low-quality sequences at 17
Low-quality reads can be removed by filtering
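Filtering reads on their average quality can be sketched as follows. The threshold of 20 and the function names are assumptions for illustration. Real filtering tools offer more options, such as minimum length or the fraction of low-quality bases.

```python
def mean_quality(quals):
    """Arithmetic mean of the Phred scores of one read."""
    return sum(quals) / len(quals)

def filter_reads(reads, min_mean_q=20):
    """Keep (name, quals) pairs whose mean quality is at least min_mean_q."""
    return [(name, q) for name, q in reads if q and mean_quality(q) >= min_mean_q]

# Hypothetical reads: one high-quality, one from a low-quality population
reads = [("good", [37] * 5), ("bad", [17] * 5)]
kept = filter_reads(reads)
print([name for name, _ in kept])
```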

Exonic sequence has higher GC content than genomic background
Plot: GC count per read against % GC, with the theoretical distribution overlaid

Step 2 - Mapping Reads to a Reference Genome
File Format   Analysis Step        Analysis Tools
FASTQ         QC of FASTQ          FASTQC
SAM/BAM       Mapping Reads        BWA, IGV
BED           Processing & QC      Picard
VCF           Variant Calling      GATK
              Variant Annotation   SnpEff/SnpSift, VEP
              Gene Annotation      DAVID
              Variant Filtering    GATK

Mapping Reads to a Reference Genome
Find the position(s) in the reference genome where each short read sequence aligns with the fewest mismatches
Must be fast (millions of short reads)
Must allow small differences (sequencing errors or polymorphisms)
String matching problem
GCTGATGTGCCGCCTCACTCCGGTGG   Reference sequence
               CACTCCTGTGG   Short reads
             CTCACTCCTGTGG
GCTGATGTGCCACCTCA
   GATGTGCCACCTCACTC
      GTGCCGGCTCACTCCTG
                 CTCCTGTGG
  TGATGTGCCGCCTCACT
Mismatches labelled in the figure: sequencing error, heterozygous SNP, homozygous SNP
Use the g1k reference provided by the 1000 Genomes Project, as it includes the revised Cambridge Reference Sequence for mitochondria, supercontigs, decoy sequences and Human herpesvirus 4 type 1
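The string matching problem can be made concrete with a brute-force search that slides each read along the reference and keeps the offset with the fewest mismatches. This is only a sketch of the idea, using the reference and reads from the slide. It allows no gaps and would be far too slow for real data, where aligners such as BWA use an index of the reference. The name `best_hit` is hypothetical.

```python
REFERENCE = "GCTGATGTGCCGCCTCACTCCGGTGG"
READS = [
    "CACTCCTGTGG",
    "CTCACTCCTGTGG",
    "GCTGATGTGCCACCTCA",
    "GATGTGCCACCTCACTC",
    "GTGCCGGCTCACTCCTG",
    "CTCCTGTGG",
    "TGATGTGCCGCCTCACT",
]

def best_hit(read, reference):
    """Return (offset, mismatches) of the first ungapped placement with the fewest mismatches."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

print(REFERENCE)
for read in READS:
    pos, mm = best_hit(read, REFERENCE)
    print(" " * pos + read, f"(offset {pos}, mismatches {mm})")
```

Positions where every covering read disagrees with the reference suggest a homozygous variant, positions where about half disagree suggest a heterozygous variant, and an isolated mismatch in one read is more likely a sequencing error.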

Burrows-Wheeler Aligner (BWA)
Part of the Broad Institute's best practice pipeline
Gapped alignment (enables indel calling)
Aligns short sequences against long reference genome
Fast (if not too many errors)
Takes FASTQ as input, produces SAM as output
Key parameters:
  Number of mismatches allowed (default read length dependent)
  Number of gaps allowed (default 1)
  Number of alternative hits to report (default 3)

SAM Format
Sequence Alignment Map
Standardised text file format
Contains alignment information
Header
  SAM Format version
  Sort order
  Reference sequence dictionary
    Reference sequence name
    Reference sequence length

SAM Format
1. QNAME: Query template NAME
2. FLAG: bitwise FLAG
3. RNAME: Reference sequence NAME
4. POS: 1-based leftmost mapping POSition
5. MAPQ: MAPping Quality
6. CIGAR: CIGAR string
7. RNEXT: Ref. name of the mate/next segment
8. PNEXT: Position of the mate/next segment
9. TLEN: observed Template LENgth
10. SEQ: segment SEQuence
11. QUAL: ASCII of Phred-scaled base QUALity+33
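The eleven mandatory fields can be split out of a tab-separated alignment line, and the bitwise FLAG decoded, as in this sketch. The alignment line is invented for illustration and the helper names are hypothetical. In practice a library such as pysam would normally be used rather than hand-written parsing.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit meanings as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]
    return record

def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# Hypothetical alignment line, for illustration only
line = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]))
```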

Step 3 - QC of Mapped Reads

Processing steps (Picard)
Input: SAM
1. Convert SAM to BAM
2. Sort BAM by position
3. Reorder contigs
4. Add read groups
5. Remove duplicates
Output: BAM

Post-alignment QC (Picard)
Alignment summary metrics: BAM to QC statistics
Insert size distribution: BAM to QC statistics
Hybrid summary metrics: BAM to QC statistics
Can also run FASTQC on BAM files

Alignment Summary Metrics
PF_READS_ALIGNED: # reads aligned to reference sequence.
PF_ALIGNED_BASES: # bases aligned to reference sequence.
PF_MISMATCH_RATE: The rate of bases mismatching the reference.
PF_INDEL_RATE: # indels per 100 aligned bases.
READS_ALIGNED_IN_PAIRS: # aligned reads whose pair was aligned to reference.
STRAND_BALANCE: # reads aligned to the positive strand of the genome divided by the number of reads aligned to the genome.
PCT_CHIMERAS: % reads that map outside of a maximum insert size (usually 100kb) or have two ends mapping to different chromosomes.
PCT_ADAPTER: The percentage of reads that are unaligned and match to a known adapter sequence right from the start of the read.
Expect >90% alignment to reference genome

Insert Size Distribution
Look for a tight distribution that peaks at around theoretical insert size from the sequencing library production size selection step

Hybrid Summary Metrics
ON_BAIT_BASES: aligned bases that mapped to a baited region of the genome.
NEAR_BAIT_BASES: aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
OFF_BAIT_BASES: aligned bases that mapped neither on nor near a bait.
MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
Expect > 80% of bases covered at 20x
Source: picard.sourceforge.net/picard-metric-definitions.shtml
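To make the coverage metrics concrete, the sketch below computes a mean coverage and the percentage of bases at or above 20X from a list of per-base depths. It only illustrates the arithmetic and is not Picard's implementation. The function name, dictionary keys and depth values are assumptions.

```python
def coverage_summary(depths, threshold=20):
    """Summarise per-base read depths over the target bases."""
    n = len(depths)
    return {
        "mean_coverage": sum(depths) / n,
        "pct_bases_at_threshold": 100.0 * sum(d >= threshold for d in depths) / n,
        "pct_zero_coverage": 100.0 * sum(d == 0 for d in depths) / n,
    }

# Hypothetical per-base depths across five target bases
summary = coverage_summary([0, 10, 20, 30, 40])
print(summary)
```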

Step 4 - Variant Calling

Variant calling steps (GATK)
Input: BAM
1. Local realignment around indels (BAM)
2. Base quality score recalibration (BAM)
3. Variant calling (HaplotypeCaller)
4. Variant quality score recalibration (VCF)
Call families together to get a matrix of all sites by all samples. Otherwise filtering is dominated by missing data, because you need to know how confident you are that a site is reference.
Case-control: call samples together, which is better than intersecting with publicly available data.

Local Realignment around Indels
During mapping, each read is aligned against the reference genome separately
This can lead to incorrect mapping around indels and false positive SNVs
Two steps
  Identify target regions for realignment
  Realign all reads in this region to produce the most parsimonious alignment across all reads
Alters the original CIGAR string and adds a tag
Depth dependent: cannot realign at 4X
Handles indels up to 20bp (HaplotypeCaller up to 200bp)
Pindel also does this

Indel Realigned BAM
Figure: alignment before and after realignment

Base quality score recalibration
Important because downstream analysis is based on quality scores: weighing the evidence in a Bayesian framework
DePristo et al., Nat Genet. 43(5), 491-8

Base Quality Score Recalibration
Take all mismatches from the reference in the BAM file (excluding sites of known genetic variation) and assume them to be errors
Examine the empirical error rate in each bin (dinucleotide context/machine run/reported quality) and use this to adjust the reported error rate
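The binning idea can be sketched as below: count mismatches and total observations per bin, then convert the empirical error rate back to a Phred-scaled quality. This is a simplified illustration, not GATK's implementation. The pseudocounts, the choice of bin key (reported quality and dinucleotide context only) and the function name are assumptions.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """Map each (reported_q, dinucleotide) bin to an empirical Phred quality.

    observations: iterable of (reported_q, dinucleotide, is_mismatch) tuples,
    taken from bases that do not overlap known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, context, is_mismatch in observations:
        key = (reported_q, context)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Assumption: +1/+2 pseudocounts keep the rate away from 0 and 1
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table

# Hypothetical bin: bases reported as Q30 in context "AC", 1 mismatch in 98 bases
obs = [(30, "AC", True)] + [(30, "AC", False)] * 97
table = empirical_qualities(obs)
print(round(table[(30, "AC")], 2))
```

In this made-up bin the bases were reported as Q30 but the observed mismatches support a quality of only about Q17, so the reported scores would be adjusted downwards.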

Base Quality Score Recalibration
DePristo et al., Nat Genet. 43(5), 491-8

Variant Calling
Many tools including Samtools, Platypus, Cortex etc.
GATK HaplotypeCaller
  Calls SNVs and indels simultaneously
  Performs local de novo assembly of variable regions to produce candidate haplotypes (de Bruijn graph, each edge weighted by the number of supporting reads)
  Calculates likelihood for each read-haplotype pairing (using pairHMM)
  Assigns genotype by calculating the likelihood of each possible genotype given the scores for each read-haplotype pair (using Bayes' theorem)
  Outputs a VCF file
Indel realignment is still needed, as BQSR relies on this realignment
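The genotype assignment step can be illustrated with a much simpler model than the one HaplotypeCaller uses. The sketch below skips assembly and pairHMM entirely and treats the reads at a single site as independent draws, then applies Bayes' theorem with a flat prior over the three diploid genotypes. The error rate, the flat prior and the function name are assumptions made for illustration.

```python
from math import comb

def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01):
    """Posterior probability of RR, RA and AA given allele read counts.

    Assumptions: independent reads, a fixed per-read error rate, flat prior.
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alternate allele under each genotype
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1 - error_rate}
    likelihood = {
        g: comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        for g, p in p_alt.items()
    }
    total = sum(likelihood.values())
    return {g: value / total for g, value in likelihood.items()}

for counts in [(10, 0), (5, 5), (0, 10)]:
    post = genotype_posteriors(*counts)
    print(counts, max(post, key=post.get))
```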

VCF File Format
Headers
Entries: Location, Alleles, Info, Format

Variant Quality Score Recalibration
HaplotypeCaller is designed to be sensitive
  High false positive rate
Learn how to filter from the data itself
Models the distribution of known (true) variation relative to specified variant annotations
Use model to evaluate novel variants

Variant Quality Score Recalibration

Fitting and Applying the Model
Also good to check Ts/Tv ratio, compare calls to chips, and check the novel variant rate.
Really hard to assign quality to indels (number of indels still in dispute/multiple alleles/repetitive regions).
Out-of-frame vs in-frame
DePristo et al., Nat Genet. 43(5), 491-8
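The Ts/Tv check is straightforward to compute from a list of SNVs: transitions are A<->G and C<->T changes, and every other substitution is a transversion. A minimal sketch, with made-up variants and a hypothetical function name:

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions for a list of (ref, alt) base pairs."""
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")

# Hypothetical call set: two transitions and one transversion
calls = [("A", "G"), ("C", "T"), ("A", "C")]
print(ts_tv_ratio(calls))
```

A ratio well below what is usual for the type of data suggests that the call set contains many false positives, since random errors are expected to produce transversions more often than transitions.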

Step 5: Variant Annotation

Variant annotation
Variant context
  Is it in a gene, coding region?
  Protein consequence?
  Is it conserved?
  Is it a known variant?
  Does it have known clinical significance?
  Pathway
Who else has it?
  Presence/frequency in population or disease cohorts (e.g. dbSNP, ClinSeq, 1000 Genomes, NHLBI GO Exome Sequencing Project)

Variant Annotation Tools
Examples of variant annotation tools include:
  Ensembl Variant Effect Predictor (VEP)
  Annovar
  SnpEff
  ...and others

SnpEff & SnpSift
20,000 reference genomes supported
Wide variety of data sources integrated
Pre-computed variant effect predictions from multiple sources (SIFT, Polyphen2, LRT, GERP…)
Regulatory & non-coding annotations
Add your own annotations
Integrates ENCODE data

Variant Effect Prioritisation
Variants can have different effects on different transcripts of the same gene
We want to identify the isoform with the most deleterious effect
GATK VariantAnnotator and newer versions of snpEff
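Choosing the most deleterious effect across transcripts amounts to ranking the per-transcript annotations and keeping the top one. The sketch below ranks by the four impact categories HIGH, MODERATE, LOW and MODIFIER. The transcript identifiers, the tuple layout and the function name are hypothetical, and real tools apply a finer-grained ordering of effects.

```python
# Higher number = more severe
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

def most_severe(annotations):
    """Return the (transcript, effect, impact) tuple with the highest impact."""
    return max(annotations, key=lambda a: IMPACT_RANK[a[2]])

# Hypothetical annotations of one variant on three transcripts of a gene
annotations = [
    ("tx1", "synonymous_variant", "LOW"),
    ("tx2", "stop_gained", "HIGH"),
    ("tx3", "intron_variant", "MODIFIER"),
]
print(most_severe(annotations))
```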

Step 6: Variant Filtering

Filtering strategies
De novo
Recessive
Dominant
X-linked

De novo filters
Total variants = 131760 (example trio; figure genotype labels: RR, RA)
1. Inheritance model: use Phred-scaled likelihoods of the genotypes. Variants = 104
2. Read depth in parents >=10, alternate allele depth in proband >=3. Variants = 79
3. Alternate allele read depth/total read depth > 0.25 in proband and < 0.05 in parents. Variants = 9
4. Functional impact of variant predicted high or moderate, i.e. variant is nonsense, missense, splice site, or coding indel. Variants = 3
5. MAF in population-based cohorts < 0.001. Variants = 2
(Between 1 and 4 candidate mutations in other trios!)
Epi4K Consortium & Epilepsy Phenome/Genome Project, Nature 501, 217–221 (12 Sep 2013)
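The depth, allele balance, impact and frequency filters listed for the de novo model can be expressed as one predicate per variant. This is a sketch only: the dictionary layout and key names are invented for illustration, and the first step (the genotype-likelihood-based inheritance model) is assumed to have been applied already.

```python
def alt_fraction(sample):
    """Alternate allele read depth divided by total read depth."""
    return sample["alt_depth"] / sample["depth"] if sample["depth"] else 0.0

def is_candidate_de_novo(variant):
    """Apply the read depth, allele balance, impact and MAF filters to one variant."""
    parents = (variant["father"], variant["mother"])
    proband = variant["proband"]
    return (
        all(p["depth"] >= 10 for p in parents)
        and proband["alt_depth"] >= 3
        and alt_fraction(proband) > 0.25
        and all(alt_fraction(p) < 0.05 for p in parents)
        and variant["impact"] in ("HIGH", "MODERATE")
        and variant["maf"] < 0.001
    )

# Hypothetical variants, for illustration only
good = {
    "father": {"depth": 30, "alt_depth": 0},
    "mother": {"depth": 25, "alt_depth": 0},
    "proband": {"depth": 40, "alt_depth": 18},
    "impact": "HIGH",
    "maf": 0.0,
}
inherited = dict(good, father={"depth": 20, "alt_depth": 3})
print(is_candidate_de_novo(good), is_candidate_de_novo(inherited))
```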

Filtering tools
Pass JEXL expressions to GATK SelectVariants
Other tools include VAAST, VarMD, VarSifter, SnpSift filter

Filtering using JEXL Expressions
Java Expression Language
Syntax for constructing logic queries
vc.getGenotype("NA12891").getDP()>=10 && vc.getGenotype("NA12892").getDP()>=10
Parts of the expression: select sample (getGenotype("NA12891")), select annotation (getDP()), operator (>=), value (10), logical operator (&&)

Visualisation
Samtools tview
UCSC
IGV
  Viewing options, e.g. zooming
  Highlight reads for more information
  Designed to be integrative

Further analysis
Lower the stringency of filters
Check coverage over candidate genes
CNVs/SVs
Consider whole-genome sequencing?

Practical session 1
Variant calling, annotation and filtering from exome data

HapMap CEU trio: NA12891 and NA12892 (parents), NA12878 (daughter)

Practical hand-out
1. Introduction
2. Calling variants from NGS data: you can read this section, but do not download the data or perform the steps outlined, as they are too computationally expensive (VCF section)
3 and 4. Variant annotation and filtering: read through these sections and carry out the exercises. Have a go at the advanced exercises if you have time

Practical session 2
Integromics: Allele Specific Expression

Allele specific expression

Allele specific expression
Capture biological phenomena:
  Effects of cis-regulatory variants
  Nonsense-mediated decay
  Imprinting
Single individual: no need to normalise

Why have DNA and RNAseq?
Accuracy of variant calling (especially indels and structural variants) not as good from functional genomics data
Allele-specific expression may lead to false reference homozygotes
RNA editing

Reference bias
(A) Construction of a personal genome by vcf2diploid tool is made by incorporating personal variants into the reference genome. Personal variants may require additional pre-processing, that is, filtering, genotyping, and/or phasing. The output is the two (paternal and maternal) haplotypes of personal genome. During the construction step, the reference genome is represented as an array of nucleotides with each cell representing a single base. Iteratively, the nucleotides in the array are being modified to reflect personal variations. Once all the variations have been applied, a personal haplotype is constructed by reading through the array. Simultaneously, equivalence map (MAP-file format, see Supplementary Figure 1) between personal haplotypes and reference genome is being constructed. This can similarly be done for a personal transcriptome.
(B) AlleleSeq pipeline for determining allele-specific binding (ASB) and allele-specific expression (ASE) aligning reads against the personal diploid genome sequence as well as a diploid-aware gene annotation file (including splice-junction library).
©2011 by European Molecular Biology Organization
Joel Rozowsky et al. Mol Syst Biol 2011;7:522

AlleleSeq pipeline
Joel Rozowsky et al. Mol Syst Biol 2011;7:522

False discovery rate
Correct for multiple hypothesis testing using FDR
Simulates number of false positives given no allele-specific events by permuting allele labels of each mapped read at hetSNV loci
For a given P value threshold (binomial test), number of false positives/total observed positives = FDR
Default: FDR = 10%
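The FDR calculation can be sketched as follows: test each heterozygous SNV with a two-sided binomial test against a 50:50 allele ratio, then simulate the no-allele-specific-event case by assigning every read to an allele at random and count how many sites pass the same P value threshold. This is an illustrative reimplementation under stated assumptions, not the AlleleSeq code. The function names, the number of simulations and the read counts are made up.

```python
import random
from math import comb

def binomial_p(ref, alt):
    """Two-sided exact binomial test against a 50:50 allele ratio."""
    n = ref + alt
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[ref]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))

def estimate_fdr(sites, threshold, n_sim=200, seed=0):
    """Estimated FDR = mean simulated false positives / observed positives."""
    rng = random.Random(seed)
    observed = sum(binomial_p(r, a) <= threshold for r, a in sites)
    if observed == 0:
        return 0.0
    false_pos = 0
    for _ in range(n_sim):
        for r, a in sites:
            n = r + a
            # Null model: each read is assigned to either allele with equal probability
            sim_ref = sum(rng.random() < 0.5 for _ in range(n))
            false_pos += binomial_p(sim_ref, n - sim_ref) <= threshold
    return (false_pos / n_sim) / observed

# Hypothetical (ref, alt) read counts at four heterozygous SNVs
sites = [(30, 2), (15, 14), (20, 20), (3, 25)]
fdr = estimate_fdr(sites, threshold=0.01)
print(fdr)
```

In practice the P value threshold would be lowered or raised until the estimated FDR reaches the chosen target, such as the 10% default mentioned above.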

HapMap CEU trio: NA12891, NA12892 and NA12878, with RNA-seq for NA12878

Other tools: MBASED, ASEQ, GATK ASEReadCounter
