1 > cd ~
> cp -R /media/sf_shared/BioNGS/GenomicVar/* .

2 Identifying and interpreting genomic variation using NGS
Katherine Fawcett, CGAT, University of Oxford

3 Organisation of sessions
Lecture on calling, annotating and filtering genomic variation (~45 minutes)
Practical session 1: Variant calling, annotation and filtering from exome data
Practical session 2: Integromics – allele-specific expression

4 What is genomic variation?

5 Why study genomic variation?
Important source of biological variation
Changes in gene expression
Changes in protein function
Contributes to phenotypic variation (disease)
Understanding and describing human evolution

6 Types of genomic variation
SNVs (single nucleotide variants)
  Most common
  Single base substitutions
  ≥1% allele frequency = SNP
Indels
  Short insertions or deletions (up to 100bp)
CNVs (copy number variants)
  Duplication or loss of large genomic regions
SVs (structural variants)
  Balanced translocations or inversions
  No copy number change

7 Methods for studying genomic variation
Karyotyping
ArrayCGH
SNP microarrays
DNA sequencing

8 Advantages of DNA Sequencing
Pros: Identify all types of variants; High resolution; Unbiased; Rare variants; Causal variants
Cons: Expensive; Large datasets; Complex analysis; Difficult to interpret

9 DNA sequencing in human disease
Rare causal variants in Mendelian disorders
  Small families
  Stuck positional cloning projects
  De novo dominants
Rare variants in common, complex disease
Somatic mutations in cancer
Diagnostics

10 Whole Exome Sequencing
Pros: 10-fold cheaper; Greater sample numbers; Smaller datasets; Easier interpretation
Cons: Protein coding exons only; SNVs and indels only; Not all genes/exons captured; Reference bias; Less uniform coverage

11 Sequence capture - whole exome

12 Sequence capture – whole exome
IGV screenshot showing coverage of exons vs introns

13 Analysis workflow
File Format | Analysis Step      | Analysis Tools
FASTQ       | QC of FASTQ        | FASTQC
SAM/BAM     | Mapping Reads      | BWA, IGV
BED         | Processing & QC    | Picard
VCF         | Variant Calling    | GATK
            | Variant Annotation | SnpEff/SnpSift
            | Variant Filtering  | GATK

14 Step 1 - QC of Raw Sequencing Data

15 FASTQ Format
1. Unique sequence ID
2. Nucleotide sequence
3. Separator line ('+')
4. Per-base quality score (Phred-scaled)
Different platforms use different symbols (ASCII characters) to encode the quality scores
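The four-line record layout can be sketched in a few lines of Python. This is a minimal illustration that assumes the Phred+33 encoding; the record, the `FastqRecord` type and the `parse_fastq` helper are invented for the example and are not part of the practical.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, four lines per record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes a Phred score as ord(char) - offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Invented record, assuming the Phred+33 offset.
example = ["@read1", "GATTACA", "+", "IIII5+!"]
records = list(parse_fastq(example))
print(records[0])
```

Older platforms that use a different offset would need a different `offset` value, which is the point of the "different symbols" remark on the slide.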

16 Quality Control using FASTQC
Input: FASTQ files
Traffic light overview
Graphical summaries
HTML report
Galaxy integration

17 Per-base Sequence Quality
Distribution of quality scores for each base position across all reads
Quality score > 20: error rate < 1% (Q20 = 1%) (yellow)
Quality score > 30: error rate < 0.1% (Q30 = 0.1%) (green)
Quality score 40: error rate 0.01% (Q40, max)
Plot axes: base position (up to base 100) against quality score (0-40)
Possible to ‘trim’ reads by quality
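The relationship between a Phred score and its error rate is p = 10^(-Q/10). The sketch below shows that conversion together with a deliberately simple 3' quality trim; the threshold of 20 and the helper names are assumptions for illustration, and dedicated trimming tools use more refined rules.

```python
from typing import List


def phred_to_error(q: float) -> float:
    """Convert a Phred score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> str:
    """Drop trailing bases whose quality is below min_q (simple 3' trim)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


for q in (20, 30, 40):
    print(f"Q{q}: error rate {phred_to_error(q):.2%}")

# Invented read whose last two bases fall below Q20.
print(trim_3prime("GATTACA", [40, 40, 38, 35, 30, 12, 5]))
```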

18 Sequence Quality Distribution
Average quality across all bases in a read, plotted against frequency
Example 1: average sequence quality = 37
Example 2: average sequence quality = 31, with an additional group of low-quality sequences at 17
Low quality reads can be removed by filtering

19 Exonic sequence has higher GC content than genomic background
[Plot: GC count per read against the theoretical distribution; x-axis: % GC]

20 Step 2 - Mapping Reads to a Reference Genome
File Format | Analysis Step      | Analysis Tools
FASTQ       | QC of FASTQ        | FASTQC
SAM/BAM     | Mapping Reads      | BWA, IGV
BED         | Processing & QC    | Picard
VCF         | Variant Calling    | GATK
            | Variant Annotation | SnpEff/SnpSift, VEP
            | Gene Annotation    | DAVID
            | Variant Filtering  | GATK

21 Mapping Reads to a Reference Genome
Find the position(s) in the reference genome where each short read sequence aligns with the fewest mismatches
Must be fast (millions of short reads)
Must allow small differences (sequencing errors or polymorphisms)
String matching problem
Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG
Short reads: CACTCCTGTGG, CTCACTCCTGTGG, GCTGATGTGCCACCTCA, GATGTGCCACCTCACTC, GTGCCGGCTCACTCCTG, CTCCTGTGG, TGATGTGCCGCCTCACT
Mismatches in the reads are labelled as: sequencing error, heterozygous SNP, homozygous SNP
Use the g1k reference provided by 1000 Genomes, as it includes the revised Cambridge Reference Sequence for mitochondria, supercontigs, decoy sequences and Human herpesvirus 4 type 1
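The "fewest mismatches" idea can be illustrated with a brute-force scan over the slide's own reference and reads. This is only a sketch: it is ungapped and far too slow for real data, and it is not how BWA works internally (BWA uses an index based on the Burrows-Wheeler transform). The function names are made up for the example.

```python
from typing import List, Tuple

REFERENCE = "GCTGATGTGCCGCCTCACTCCGGTGG"
READS = [
    "TGATGTGCCGCCTCACT",
    "GCTGATGTGCCACCTCA",
    "GATGTGCCACCTCACTC",
    "GTGCCGGCTCACTCCTG",
    "CTCACTCCTGTGG",
    "CACTCCTGTGG",
    "CTCCTGTGG",
]


def best_hit(reference: str, read: str) -> Tuple[int, int]:
    """Return (offset, mismatches) of the ungapped placement with fewest mismatches."""
    candidates = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        candidates.append((mismatches, offset))
    mismatches, offset = min(candidates)  # ties resolve to the leftmost offset
    return offset, mismatches


def mismatch_positions(reference: str, read: str, offset: int) -> List[int]:
    """0-based reference positions where the placed read differs."""
    return [offset + i for i, base in enumerate(read) if reference[offset + i] != base]


for read in READS:
    offset, n = best_hit(REFERENCE, read)
    # Indent each read by its offset so it lines up under the reference.
    print(" " * offset + read, n, mismatch_positions(REFERENCE, read, offset))
```

Positions where every overlapping read disagrees with the reference, where only some do, and where a single read does, correspond to the three labels on the slide.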

22 Burrows-Wheeler Aligner (BWA)
Part of the Broad Institute’s best practice pipeline
Gapped alignment (enables indel calling)
Aligns short sequences against long reference genome
Fast (if not too many errors)
Takes FASTQ as input, produces SAM as output
Key parameters:
  Number of mismatches allowed (default read length dependent)
  Number of gaps allowed (default 1)
  Number of alternative hits to report (default 3)

23 SAM Format
Sequence Alignment Map
Standardised text file format
Contains alignment information
Header:
  SAM Format version
  Sort order
  Reference sequence dictionary (reference sequence name, reference sequence length)

24 SAM Format
1. QNAME: Query template NAME
2. FLAG: bitwise FLAG
3. RNAME: Reference sequence NAME
4. POS: 1-based leftmost mapping POSition
5. MAPQ: MAPping Quality
6. CIGAR: CIGAR string
7. RNEXT: Ref. name of the mate/next segment
8. PNEXT: Position of the mate/next segment
9. TLEN: observed Template LENgth
10. SEQ: segment SEQuence
11. QUAL: ASCII of Phred-scaled base QUALity+33
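To make the eleven fields concrete, here is a small sketch that splits one alignment line, decodes a few FLAG bits and breaks up the CIGAR string. The record is invented, only a subset of the FLAG bits is listed, and in practice a library such as pysam or the samtools command line would normally be used instead of hand-rolled parsing.

```python
import re
from typing import Any, Dict, List, Tuple

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

# A few of the bits defined for the FLAG field in the SAM specification.
FLAG_BITS = {0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
             0x10: "reverse_strand", 0x400: "duplicate"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one alignment line into the 11 mandatory fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record: Dict[str, Any] = dict(zip(FIELDS, values[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


def decode_flag(flag: int) -> List[str]:
    """Names of the listed FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Invented example record; read name and coordinates are placeholders.
line = "read1\t99\tchr1\t100\t60\t5M1I4M\t=\t250\t160\tGATTACAGAT\tIIIIIIIIII"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]), parse_cigar(rec["CIGAR"]))
```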

25 Step 3 - QC of Mapped Reads

26 Processing steps (Picard)
SAM -> Convert SAM to BAM -> Sort BAM by position -> Reorder contigs -> Add read groups -> Remove duplicates -> BAM

27 Post-alignment QC (Picard)
BAM -> Alignment summary metrics -> QC statistics
BAM -> Insert size distribution -> QC statistics
BAM -> Hybrid summary metrics -> QC statistics
Can also run FASTQC on BAM files

28 Alignment Summary Metrics
PF_READS_ALIGNED: # reads aligned to reference sequence.
PF_ALIGNED_BASES: # bases aligned to reference sequence.
PF_MISMATCH_RATE: The rate of bases mismatching the reference.
PF_INDEL_RATE: # indels per 100 aligned bases.
READS_ALIGNED_IN_PAIRS: # aligned reads whose pair was aligned to reference.
STRAND_BALANCE: # reads aligned to the positive strand of the genome divided by the number of reads aligned to the genome.
PCT_CHIMERAS: % reads that map outside of a maximum insert size (usually 100kb) or have two ends mapping to different chromosomes.
PCT_ADAPTER: The percentage of reads that are unaligned and match to a known adapter sequence right from the start of the read.
Expect >90% alignment to reference genome

29 Insert Size Distribution
Look for a tight distribution that peaks at around the theoretical insert size set by the size-selection step of sequencing library production

30 Hybrid Summary Metrics
ON_BAIT_BASES: aligned bases that mapped to a baited region of the genome.
NEAR_BAIT_BASES: aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
OFF_BAIT_BASES: aligned bases that mapped neither on nor near a bait.
MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.
PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
Expect > 80% bases covered at 20x
picard.sourceforge.net/picard-metric-definitions.shtml
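As a rough illustration of what the coverage metrics summarise, the arithmetic behind a mean coverage and a "percentage of target bases at 20X" figure looks like this. The depths are invented and the `coverage_summary` helper is hypothetical; Picard derives its metrics from the BAM file and the bait/target interval lists, not from a plain list.

```python
from typing import Dict, List


def coverage_summary(depths: List[int], threshold: int = 20) -> Dict[str, float]:
    """Mean depth and percentage of target bases at or above a threshold."""
    if not depths:
        raise ValueError("no target bases supplied")
    covered = sum(d >= threshold for d in depths)
    return {
        "mean_coverage": sum(depths) / len(depths),
        "pct_bases_at_threshold": 100.0 * covered / len(depths),
    }


# Invented per-base depths over a ten-base target.
summary = coverage_summary([35, 40, 22, 18, 0, 27, 31, 45, 20, 12])
print(summary)
```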

31 Step 4 - Variant Calling

32 Variant calling steps (GATK)
1. Local realignment around indels (BAM)
2. Base quality score recalibration (BAM)
3. Variant calling (HaplotypeCaller) (BAM)
4. Variant quality score recalibration (VCF)
Call families together to get a matrix of all sites by all samples; otherwise filtering is dominated by missing data, and you need to know how confident you are that a site is reference.
Case-control: call together, which is better than intersecting with publicly available data.

34 Local Realignment around Indels
During mapping each read is aligned against the reference genome separately
This can lead to incorrect mapping around indels and false positive SNVs
Two steps:
  1. Identify target regions for realignment
  2. Realign all reads in this region to produce the most parsimonious alignment across all reads
Alters original CIGAR string and adds tag
Depth dependent: can’t realign with 4X
Up to 20bp indels (HC up to 200bp)
Pindel also does this

35 Indel Realigned BAM
[Figure: alignments before and after indel realignment]

37 Base quality score recalibration
Important because downstream analysis is based on quality scores: weighing the evidence in a Bayesian framework
DePristo et al., Nat Genet. 43(5), 491-8

38 Base Quality Score Recalibration
Take all mismatches from the reference in the BAM file (excluding sites of known genetic variation) and assume them to be errors
Examine the empirical error rate in each bin (dinucleotide context/machine run/reported quality) and use this to adjust the reported error rate
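The core of the recalibration idea can be sketched as counting mismatches per bin and converting the empirical error rate back to a Phred score. This simplified version bins by reported quality only and uses add-one smoothing, both assumptions of the sketch; it is not GATK's implementation, which also uses the other covariates listed above.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One observation is (reported_quality, is_mismatch) for a base at a site
# that is NOT a known variant, so a mismatch is treated as an error.
Observation = Tuple[int, bool]


def empirical_qualities(observations: Iterable[Observation]) -> Dict[int, float]:
    """Map each reported quality to the Phred-scaled empirical error rate."""
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)
    # Add-one smoothing (an assumption of this sketch) avoids log10(0).
    return {q: -10 * math.log10((errors[q] + 1) / (totals[q] + 2)) for q in totals}


# Invented data: 998 bases reported as Q30, 9 of which mismatch the reference.
obs = [(30, True)] * 9 + [(30, False)] * 989
recalibrated = empirical_qualities(obs)
print(recalibrated)
```

In this made-up bin the bases were reported as Q30 but behave like Q20, so the reported qualities would be adjusted downwards.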

39 Base Quality Score Recalibration
DePristo et al., Nat Genet. 43(5), 491-8

41 Variant Calling
Many tools including Samtools, Platypus, Cortex etc.
GATK HaplotypeCaller
  Calls SNVs and indels simultaneously
  Performs local de novo assembly of variable regions to produce candidate haplotypes (de Bruijn graph)
  Calculates likelihood for each read-haplotype pairing (using pairHMM)
  Assigns genotype by calculating the likelihood of each possible genotype given the scores for each read-haplotype pair (using Bayes' theorem)
  Output VCF file
Still need to indel realign as BQSR relies on this realignment
Constructs de Bruijn assembly, each edge weighted by number of supporting reads
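The Bayesian genotyping step can be illustrated with a much simpler per-base model than the one HaplotypeCaller uses. The sketch below scores the three diploid genotypes at a single biallelic site directly from base calls and base qualities; it is a stand-in for illustration only (no assembly, no pairHMM), and the priors and pileup are invented.

```python
import math
from typing import Dict, List, Tuple

Pileup = List[Tuple[str, int]]  # (observed base, Phred base quality)


def base_likelihood(observed: str, allele: str, phred: int) -> float:
    """P(observed base | true allele), spreading errors over the 3 other bases."""
    error = 10 ** (-phred / 10)
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(pileup: Pileup, ref: str, alt: str,
                        priors: Dict[str, float]) -> Dict[str, float]:
    """Posterior for ref/ref (RR), ref/alt (RA) and alt/alt (AA) via Bayes' theorem."""
    genotypes = {"RR": (ref, ref), "RA": (ref, alt), "AA": (alt, alt)}
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        # Each read is assumed to come from either chromosome with probability 0.5.
        log_lik = sum(
            math.log(0.5 * base_likelihood(b, a1, q) + 0.5 * base_likelihood(b, a2, q))
            for b, q in pileup
        )
        log_post[name] = math.log(priors[name]) + log_lik
    # Normalise in log space for numerical stability.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Invented pileup: 6 reference-supporting and 5 alternate-supporting bases at Q30.
pileup = [("G", 30)] * 6 + [("A", 30)] * 5
priors = {"RR": 0.998, "RA": 0.0015, "AA": 0.0005}  # illustrative priors only
post = genotype_posteriors(pileup, ref="G", alt="A", priors=priors)
print(max(post, key=post.get), post)
```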

42 VCF File Format
Headers
Entries: Location, Alleles, Info, Format
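A VCF data line is tab-separated, with the location and alleles first, then the INFO column, then a FORMAT column describing the per-sample fields. The following sketch parses one invented line; the `parse_vcf_line` helper is hypothetical, the sample names are those of the trio used in the practical, and a dedicated library would normally be preferable for real files.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, samples: List[str]) -> Dict[str, Any]:
    """Parse one VCF data line into location, alleles, INFO and per-sample fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = cols[:8]
    info_dict: Dict[str, Any] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flags carry no value
    record: Dict[str, Any] = {
        "CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": flt, "INFO": info_dict,
    }
    if len(cols) > 9:
        # The FORMAT column names the colon-separated fields of each sample column.
        keys = cols[8].split(":")
        record["SAMPLES"] = {
            name: dict(zip(keys, field.split(":")))
            for name, field in zip(samples, cols[9:])
        }
    return record


# Invented line; the coordinates and values are placeholders.
line = "chr1\t12345\t.\tG\tA\t50.0\tPASS\tDP=30;DB\tGT:DP\t0/1:12\t0/0:9\t0/0:9"
rec = parse_vcf_line(line, ["NA12878", "NA12891", "NA12892"])
print(rec["POS"], rec["ALT"], rec["INFO"], rec["SAMPLES"]["NA12878"]["GT"])
```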

44 Variant Quality Score Recalibration
HaplotypeCaller is designed to be sensitive
  High false positive rate
Learn how to filter from the data itself
  Models the distribution of known (true) variation relative to specified variant annotations
  Use model to evaluate novel variants

45 Variant Quality Score Recalibration

46 Fitting and Applying the Model
Also good to check the Ts/Tv ratio and the novel variant rate, and to compare calls to chips.
Really hard to assign quality to indels (number of indels still in dispute/multiple alleles/repetitive regions). Out-of-frame vs in-frame
DePristo et al., Nat Genet. 43(5), 491-8
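The Ts/Tv check is simple to compute from a list of SNV calls: transitions are A<->G or C<->T changes, and everything else is a transversion. A minimal sketch with an invented call set follows; which ratio to expect depends on the data (exome versus whole genome), so no target value is assumed here.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines (A<->G) or pyrimidines (C<->T)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = tv = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed")
    return ts / tv


# Invented call set: 6 transitions and 2 transversions.
calls = [("A", "G"), ("G", "A"), ("C", "T"), ("T", "C"),
         ("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(calls))
```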

47 Step 5: Variant Annotation
File Format | Analysis Step      | Analysis Tools
FASTQ       | QC of FASTQ        | FASTQC
SAM/BAM     | Mapping Reads      | BWA, Galaxy, IGV
BED         | Processing & QC    | Picard
VCF         | Variant Calling    | GATK
            | Variant Annotation | SnpEff/SnpSift, VEP
            | Variant Filtering  | GATK
            | Gene Annotation    | DAVID

48 Variant annotation
Variant context
  Is it in a gene, coding region?
  Protein consequence?
  Is it conserved?
  Is it a known variant?
  Does it have known clinical significance?
  Pathway
Who else has it?
  Presence/frequency in population or disease cohorts (e.g. dbSNP, ClinSeq, 1000 Genomes, NHLBI GO Exome Sequencing Project)

49 Variant Annotation Tools
Examples of Variant Annotation Tools include:
Ensembl Variant Effect Predictor (VEP)
Annovar
SnpEff
...and others

50 SnpEff & SnpSift
20,000 reference genomes supported
Wide variety of data sources integrated
  Pre-computed variant effect predictions from multiple sources (SIFT, Polyphen2, LRT, GERP…)
  Regulatory & non-coding annotations
  Add your own annotations
  Integrates ENCODE data

51 Variant Effect Prioritisation
Variants can have different effects on different transcripts of the same gene
We want to identify the isoform with the most deleterious effect
GATK VariantAnnotator and newer versions of snpEff

52 Step 6: Variant Filtering

53 Filtering strategies
De novo
Recessive
Dominant
X-linked

54 De novo filters
Genotype labels on the trio diagram: RR, RA
Total variants = 131760 (example trio)
Inheritance model: use phred-scaled likelihoods of the genotypes
  Variants = 104
Read depth in parents >=10, alternate allele depth in proband >=3
  Variants = 79
Alternate allele read depth/total read depth > 0.25 in proband and <0.05 in parents
  Variants = 9
Functional impact of variant predicted high or moderate, i.e. variant is nonsense, missense, splice site, or coding indel
  Variants = 3
MAF in population-based cohorts < 0.001
  Variants = 2
(Between 1 and 4 candidate mutations in other trios!)
Epi4K Consortium & Epilepsy Phenome/Genome Project, Nature 501, 217–221 (12 Sep 2013)
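The depth, allele-balance, impact and frequency filters above can be expressed as a single predicate. This sketch uses an invented dictionary layout for each variant and skips the first step (the genotype-likelihood inheritance model); the thresholds are the ones on the slide, while the field names and effect labels are assumptions.

```python
from typing import Any, Dict

Variant = Dict[str, Any]

# Effect labels are placeholders for whatever the annotation tool reports.
HIGH_OR_MODERATE = {"nonsense", "missense", "splice_site", "coding_indel"}


def passes_de_novo_filters(v: Variant) -> bool:
    """Apply the depth, allele-balance, impact and frequency filters in turn."""
    parents = [v["father"], v["mother"]]
    child = v["proband"]
    # Read depth in parents >= 10, alternate allele depth in proband >= 3.
    if any(p["dp"] < 10 for p in parents) or child["alt_dp"] < 3:
        return False
    # Alternate allele fraction > 0.25 in proband and < 0.05 in parents.
    if child["alt_dp"] / child["dp"] <= 0.25:
        return False
    if any(p["alt_dp"] / p["dp"] >= 0.05 for p in parents):
        return False
    # Predicted functional impact and population allele frequency.
    return v["effect"] in HIGH_OR_MODERATE and v["maf"] < 0.001


# Invented candidates: var2 fails because the father carries alternate reads.
candidates = [
    {"id": "var1", "father": {"dp": 30, "alt_dp": 0}, "mother": {"dp": 25, "alt_dp": 0},
     "proband": {"dp": 40, "alt_dp": 18}, "effect": "missense", "maf": 0.0},
    {"id": "var2", "father": {"dp": 30, "alt_dp": 4}, "mother": {"dp": 25, "alt_dp": 0},
     "proband": {"dp": 40, "alt_dp": 18}, "effect": "missense", "maf": 0.0},
]
kept = [v["id"] for v in candidates if passes_de_novo_filters(v)]
print(kept)
```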

55 Filtering tools
Pass JEXL expressions to GATK SelectVariants
Other tools include VAAST, VarMD, VarSifter, SnpSift filter

56 Filtering using JEXL Expressions
Java Expression Language
Syntax for constructing logic queries
vc.getGenotype("NA12891").getDP()>=10 && vc.getGenotype("NA12892").getDP()>=10
Parts of the expression: select sample (getGenotype("NA12891")), select annotation (getDP()), operator (>=), value (10), logical operator (&&)

57 Visualisation
Samtools tview
UCSC
IGV
  Viewing options eg. zooming
  Highlight reads for more information
  Designed to be integrative

58 Further analysis
Lower the stringency of filters
Check coverage over candidate genes
CNVs/SVs
Consider whole-genome sequencing?

59 Practical session 1
Variant calling, annotation and filtering from exome data

60 HapMap CEU trio
Parents: NA12891 and NA12892; child: NA12878

61 Practical hand-out
1. Introduction
2. Calling variants from NGS data: you can read this section, but do not download the data or perform the steps outlined as they are too computationally expensive (VCF section)
3 and 4. Variant annotation and filtering: read through these sections and carry out the exercises. Have a go at the advanced exercises if you have time

62 Practical session 2
Integromics – Allele Specific Expression

63 Allele specific expression

64 Allele specific expression
Capture biological phenomena:
  Effects of cis-regulatory variants
  Nonsense-mediated decay
  Imprinting
Single individual: no need to normalise

65 Why have DNA and RNAseq?
Accuracy of variant calling (especially indels and structural variants) not as good from functional genomics data
Allele-specific expression may lead to false reference homozygotes
RNA editing

66 Reference bias
(A) Construction of a personal genome by vcf2diploid tool is made by incorporating personal variants into the reference genome. Personal variants may require additional pre-processing, that is, filtering, genotyping, and/or phasing. The output is the two (paternal and maternal) haplotypes of personal genome. During the construction step, the reference genome is represented as an array of nucleotides with each cell representing a single base. Iteratively, the nucleotides in the array are being modified to reflect personal variations. Once all the variations have been applied, a personal haplotype is constructed by reading through the array. Simultaneously, equivalence map (MAP-file format; see Supplementary Figure 1) between personal haplotypes and reference genome is being constructed. This can similarly be done for a personal transcriptome.
(B) AlleleSeq pipeline for determining allele-specific binding (ASB) and allele-specific expression (ASE) aligning reads against the personal diploid genome sequence as well as a diploid-aware gene annotation file (including splice-junction library).
©2011 by European Molecular Biology Organization
Joel Rozowsky et al. Mol Syst Biol 2011;7:522

67 AlleleSeq pipeline
(B) AlleleSeq pipeline for determining allele-specific binding (ASB) and allele-specific expression (ASE) aligning reads against the personal diploid genome sequence as well as a diploid-aware gene annotation file (including splice-junction library).
©2011 by European Molecular Biology Organization
Joel Rozowsky et al. Mol Syst Biol 2011;7:522

68 False discovery rate
Correct for multiple hypothesis testing using FDR
Simulates number of false positives given no allele-specific events by permuting allele labels of each mapped read at hetSNV loci
For a given P value threshold (binomial test), FDR = number of false positives/total observed positives
Default: FDR = 10%
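The logic of this FDR estimate can be sketched as follows: test each heterozygous site against a 50:50 allelic ratio with an exact binomial test, then simulate null data by assigning each read a random allele label and count how many sites would pass the same threshold. This is an illustration of the idea rather than the AlleleSeq code; the counts, the number of simulations and the function names are assumptions.

```python
import math
import random
from typing import List, Tuple


def binomial_two_sided_p(ref_count: int, alt_count: int) -> float:
    """Exact two-sided binomial test against a 50:50 allelic ratio."""
    n = ref_count + alt_count
    k = min(ref_count, alt_count)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def estimate_fdr(counts: List[Tuple[int, int]], threshold: float,
                 n_sim: int = 200, seed: int = 0) -> float:
    """Mean simulated false positives under the null divided by observed positives."""
    rng = random.Random(seed)
    observed = sum(binomial_two_sided_p(r, a) <= threshold for r, a in counts)
    if observed == 0:
        return 0.0
    false_pos = 0
    for _ in range(n_sim):
        for r, a in counts:
            n = r + a
            # Null model: each read at a hetSNV gets a random allele label.
            sim_ref = sum(rng.random() < 0.5 for _ in range(n))
            false_pos += binomial_two_sided_p(sim_ref, n - sim_ref) <= threshold
    return (false_pos / n_sim) / observed


# Invented (ref, alt) read counts at four heterozygous SNVs.
counts = [(20, 0), (10, 10), (15, 5), (3, 17)]
fdr = estimate_fdr(counts, threshold=0.05)
print(f"estimated FDR at p <= 0.05: {fdr:.3f}")
```

In practice the P value threshold would be tuned until the estimated FDR reaches the desired level, such as the 10% default mentioned above.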

69 HapMap CEU trio
NA12891, NA12892, NA12878 (RNA-seq)

70 Other tools...
MBASED, ASEQ, GATK ASEReadCounter

