Introduction to Variant Analysis of Exome- and Amplicon Sequencing Data. Lecture by: Dr. Christian Rausch. Date: 29 May 2015.


Introduction to Variant Analysis of Exome- and Amplicon sequencing data
Lecture by: Dr. Christian Rausch
Date: 29 May 2015
Training: TraIT Galaxy Training
Extended version see:

Focus of lecture and practical part
- Lecture: from NGS data to variant analysis
- Hands-on training: we will analyze NGS read data of a panel of cancer genes (Illumina TruSeq Amplicon - Cancer Panel) of the prostate cancer cell line VCaP. Analysis software tools will be run interactively through Galaxy, “a web-based platform for data intensive biomedical research”.
Image source: A survey of tools for variant analysis of next-generation genome sequencing data, Pabinger et al., Brief Bioinform (2013) doi: /bib/bbs086

Workflow: QC & Mapping reads
1. Input reads (fastq files)
2. Quality check with FastQC. Not OK? Apply quality- & adapter-trimming. OK? Continue.
3. Map reads to the reference genome using e.g. BWA or Bowtie2
4. Sort by coordinates using SAMtools sort or PicardTools SortSam
5. Output: sorted BAM file (binary SAM; SAM = Sequence Alignment/Map)

Variant Calling & Annotation pipeline
1. Input: reads mapped to the reference genome (SAM or BAM file)
2. SAMtools Mpileup: analyze mismatches & compute likelihoods of SNPs etc.
3. Varscan2 does the actual calling. Output: VCF with various statistics on the quality of each variant (read depth etc.), homozygous/heterozygous etc.
4. Filter variant allele frequency: discard variants with a variant allele frequency below the threshold
5. Slice VCF: cut the VCF file to retain only the regions that were enriched for sequencing (= discard regions covered by off-target reads)
6. ANNOVAR: annotate SNPs with statistics (dbSNP, 1000 genomes etc.) and predictions (SIFT, PolyPhen etc.)
7. DGIdb (Drug Gene Interaction Database): find drugs for diseases arising from gene mutations

Target enrichment used for selecting exomes Image source: Agilent

Selecting parts of the genome for sequencing
The Illumina TruSeq Amplicon - Cancer Panel uses multiplex PCR to amplify a selected part of the genome (a selection of the exons of 48 genes is targeted with 212 amplicons).

Properties of Reads (Illumina)
- Typical read length: 50 … 100 … 150 … 200 … 300 bp
- Paired reads: insert size 200 – 500 bp
- Mate pairs: insert size several kbp
- Depending on which Illumina platform is used, the read quality drops after 100, 150 or 200 bp
- Errors in Illumina reads are typically substitution errors
Source: evomics.org/2014/01/alignment-methods/
Image source: Mate Pair v2 Sample Prep Guide For 2-5 kb Libraries

Quality Measure: Phred Score
- Phred score = quality score originally developed by the base calling program Phred, used with Sanger sequencing data
- The Phred quality score Q is logarithmically related to the base-calling error probability P: $P = 10^{-Q/10}$, equivalently $Q = -10 \log_{10} P$
- Example: Phred score 30 = error rate 0.001 = 1 base in 1000 will be wrong
- Illumina’s ‘Q score’ = Phred score
- The base calling programs that convert raw data to sequence data (the ‘base callers’) need to be ‘trained’ to give realistic quality values
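The relation between Phred score and error probability can be checked with a few lines of Python. This is only a minimal sketch of the formula on the slide; the function names are chosen here for illustration and do not come from any sequencing tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the expected number of bases per error for some common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, about 1 error in {round(1 / p)} bases")
```

Every step of 10 in the Phred score corresponds to a tenfold lower error probability, which is why Q20 (1 %) and Q30 (0.1 %) are common quality thresholds.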

FASTA and FASTQ format
FASTA: standard format to store sequence data (DNA and protein seq.)
- Line starting with ‘>’: FASTA header, often contains unique identifiers and optional descriptions of the sequence
- Following lines: the actual DNA sequence
FASTQ: standard format to store sequencing reads with quality information (‘Q’ stands for Quality)
- Line starting with ‘@’: header with the read identifier
- The actual DNA sequence
- ‘+’ separator, optionally followed by a description
- The quality values of the sequence (one character per nucleotide)
More info: see Wikipedia, ‘FASTQ_format’
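To make the four-line FASTQ layout concrete, here is a small Python sketch that reads records and turns the quality characters into Phred scores. It assumes unwrapped four-line records and the Phred+33 encoding; the names `FastqRecord` and `parse_fastq` are made up for this illustration, and the example read is invented.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per nucleotide


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumption: no wrapped sequences)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # skip blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if (not header.startswith("@") or quality is None
                or not separator.startswith("+")):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes Phred score = ASCII code - offset.
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


example = ["@read1\n", "GATTACA\n", "+\n", "IIII5I#\n"]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

In practice an established parser (for example the one in Biopython) would be preferable; the sketch only shows how little structure a FASTQ record has.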

Quality control with FastQC
- The quality of the reads needs to be checked before further analysis
- The program FastQC is the quasi-standard
- Sequencing platform companies also provide their own tools for quality control

Quality control: FastQC
- The encoding tells which set of characters stands for which Phred scores. There is also the encoding Illumina 1.5, and others.
- Other programs might not automatically recognize the encoding
- In Galaxy, the encoding of a FastQ file can be set via the ‘pen’ symbol
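Why the encoding matters can be seen from a rough heuristic that guesses the ASCII offset from the characters that occur in the quality strings. The sketch below assumes only the two common offsets, 33 (Sanger, Illumina 1.8+) and 64 (e.g. Illumina 1.5); the cut-off values are a simplification chosen for this example, and the function name is hypothetical.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the quality encoding offset; return None if the data are ambiguous."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters given")
    if min(codes) < 59:
        # Characters this low cannot occur in the 64-based encodings.
        return 33
    if max(codes) > 74:
        # Above 'J' (Q41 with offset 33) is unusual for Illumina Phred+33 data,
        # so offset 64 is the more plausible guess (an assumption, not a proof).
        return 64
    return None


print(guess_phred_offset(["IIII5I#"]))   # contains '#', a low character
print(guess_phred_offset(["hhhhfdB"]))   # only high characters
```

A file that contains only high-quality bases can be ambiguous, which is one reason why tools sometimes need the encoding to be set explicitly.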

Read Quality Control with FastQC
Examples of per base sequence quality of all reads:
- Historical example of the very first Solexa reads, 2006 (Solexa was acquired by Illumina in 2007)
- Not so good, might still be usable, depending on application
- 50 bp

Examples of other quality measures in FastQC
Upper 4 graphs from the data set of the practical course:
- Many reads are repeated
- Apparently not uniformly distributed over the whole genome
- Overrepresented sequences: the sequenced fragment was too short and the sequencing reaction ran into the adapter/PCR primer

“Mapping reads to the reference” is finding where their sequence occurs in the genome
Source: Wikimedia, file:Mapping Reads.png
Figure labels: 100 bp identified; 200 – 500 bp unknown sequence

“Mapping reads to the reference”: naïve text search algorithms are too slow
- Naïve approach: compare each read with every position in the genome
  - Takes too long, will not find sequences with mismatches
- Search programs typically create an index of the reference sequence (or text) and store the reference sequence (text) in an advanced data structure for fast searching
- An index is basically like a phone book (with addresses) → quickly find the address (location) of a person
- Example of an algorithm using ‘indexed seed tables’ to quickly find locations of exact parts of a read
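The idea of an indexed seed table can be sketched in Python as follows: all k-mers of the reference are stored in a dictionary, exact seeds from the read are looked up, and only the candidate positions are verified base by base. This is a toy illustration under simplifying assumptions (substitutions only, no indels, short reference held in memory), not the algorithm of any particular mapper; all names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_seed_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for each placement within the mismatch limit."""
    candidates = set()
    # Look up non-overlapping seeds; an exact seed hit proposes a read start.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_seed_index(reference, k=4)
print(map_read("TAGCCGAA", reference, index, k=4))  # read with one substitution
```

The index is built once per reference and reused for every read, which is what makes the lookup so much faster than scanning the genome for each read.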

“Mapping reads to the reference”: frequently used programs
- BLAST, the most famous bioinformatics program since 1990, is used to find similar sequences in DNA and protein databases
  - sec to find a result
  - Mapping 60 million reads would take ~2 months on one CPU [1] → too slow for NGS
- Popular tools for read mapping [1]:
  - Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST/mrFAST
  - CLCbio read mapper (commercial)
- No single tool is the best under all conditions
  - Differences in speed
  - Differently optimized for mismatches, gap models, insertions & deletions, taking into account read base quality, local realignment of matches etc.
[1] Hatem et al. BMC Bioinformatics 2013, 14:

Read Mappers: BWA and Bowtie2
- Are based on the Burrows-Wheeler Transformation (BWT)
- BWT: special sorting of all letters in the text (sequence)
  - Similar suffixes (word ends) will be close to each other
  - Easier to compress
  - Good for approximate string matching (sequence alignment)
- Index (FM index) for finding the locations of matched strings (sequences) in the genome
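A toy version of the Burrows-Wheeler transformation and of the backward search used with an FM index can be written in a few lines of Python. The sketch below sorts all rotations explicitly and counts characters naively, so it only works for short strings; real mappers use suffix arrays and precomputed rank tables. It assumes the text does not contain the end marker `$`, and the function names are illustrative.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations of text + '$'."""
    text += "$"  # end marker, sorts before all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT."""
    first_column = sorted(bwt_text)
    # first_row[c]: first row of the sorted rotations that starts with c
    first_row = {}
    for row, char in enumerate(first_column):
        first_row.setdefault(char, row)
    low, high = 0, len(bwt_text)  # current range of matching rows [low, high)
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        # Naive rank query: occurrences of char in the BWT before a given row.
        low = first_row[char] + bwt_text[:low].count(char)
        high = first_row[char] + bwt_text[:high].count(char)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed)
print(count_occurrences(transformed, "ana"))
```

The pattern is processed from its last letter to its first, and each step narrows the range of sorted rotations that begin with the part of the pattern matched so far.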

Read Mapping: General problems
- A read can match equally well at more than one location (e.g. repeats, pseudo-genes)
- A read may even fit less well at its actual position, e.g. if it carries a break point, insertions and/or deletions

SAM and BAM files
- SAM = Sequence Alignment/Map
- BAM = Binary SAM = compressed SAM
- The Sequence Alignment/Map format contains information about how sequence reads map to a reference genome
- Requires ~1 byte per input base to store sequences, qualities and meta information
- Supports paired-end reads and color space from SOLiD
- Is produced by bowtie, BWA and other mapping tools
Partly from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf

Example from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf

Harvesting Information from SAM
Query name: QNAME (SAM) / read_name (BAM).
FLAG provides the following information:
- are there multiple fragments?
- are all fragments properly aligned?
- is this fragment unmapped?
- is the next fragment unmapped?
- is this query the reverse strand?
- is the next fragment the reverse strand?
- is this the last fragment?
- is this a secondary alignment?
- did this read fail quality controls?
- is this read a PCR or optical duplicate?
Source:
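The FLAG column stores these yes/no answers as bits of a single integer. The Python sketch below decodes a FLAG value using the bit values defined in the SAM specification; the short labels are shorthand chosen for this example and are not official field names.

```python
from typing import List

# (bit value, label) pairs; bit values as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "paired"),                  # template has multiple fragments
    (0x2, "proper_pair"),             # all fragments properly aligned
    (0x4, "unmapped"),                # this fragment is unmapped
    (0x8, "mate_unmapped"),           # next fragment is unmapped
    (0x10, "reverse_strand"),         # this fragment maps to the reverse strand
    (0x20, "mate_reverse_strand"),    # next fragment maps to the reverse strand
    (0x40, "first_in_pair"),          # first fragment of the template
    (0x80, "last_in_pair"),           # last fragment of the template
    (0x100, "secondary"),             # secondary alignment
    (0x200, "qc_fail"),               # failed quality controls
    (0x400, "duplicate"),             # PCR or optical duplicate
    (0x800, "supplementary"),         # supplementary alignment
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


for flag in (99, 147, 4):
    print(flag, decode_flag(flag))
```

FLAG values 99 and 147 are a typical combination for the two reads of a properly aligned pair.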

Variant Calling & Annotation

Possible reasons for a mismatch
- True SNP
- Error generated in library preparation
- Base calling error
  - May be reduced by better base calling methods, but cannot be eliminated
- Misalignment (mapping error)
  - Local re-alignment to improve mapping
- Error in the reference genome sequence
Partly from

Variant Calling: Principles
Naïve approach (used in early NGS studies):
- Filter base calls according to quality
- Filter by frequency
- Typically, a quality filter of Phred Q20 was used (i.e., probability of error 1 %)
- Then, frequency thresholds were applied to the frequency of the non-ref base, f(b)
- The frequency heuristic works well if the sequencing depth is high, so that the probability of a heterozygous nucleotide falling outside of the 20% - 80% region is low
- Problems with the frequency heuristic:
  - For low sequencing depth, it leads to undercalling of heterozygous genotypes
  - Use of a quality threshold leads to loss of information on individual read/base qualities
  - Does not provide a measure of confidence in the call
In parts from: compbio.charite.de/contao/index.php/genomics.html
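The naïve approach can be sketched in Python as a quality filter followed by fixed frequency cut-offs. The 20 % and 80 % boundaries below are an assumption taken from the “20% - 80% region” mentioned on the slide, and the function name and return labels are invented for this illustration.

```python
from typing import List, Optional, Tuple


def naive_genotype(bases: str, quals: List[int], ref_base: str,
                   min_qual: int = 20, low: float = 0.2,
                   high: float = 0.8) -> Optional[Tuple[str, float]]:
    """Call a genotype at one position from the observed bases and Phred scores.

    Returns (genotype, non-reference frequency), or None if no base passes
    the quality filter.
    """
    kept = [base for base, qual in zip(bases, quals) if qual >= min_qual]
    if not kept:
        return None
    freq = sum(base != ref_base for base in kept) / len(kept)
    if freq < low:
        return ("homozygous reference", freq)
    if freq <= high:
        return ("heterozygous", freq)
    return ("homozygous non-reference", freq)


# Ten reads at Q30, half of them showing the non-reference base G.
print(naive_genotype("AAAAAGGGGG", [30] * 10, ref_base="A"))
# Only four reads: a truly heterozygous site can easily show 0 of 4 G reads.
print(naive_genotype("AAAA", [30] * 4, ref_base="A"))
```

The second call shows the weakness listed above: at low depth the observed frequency of a heterozygous site often falls outside the window by chance, and the result carries no confidence measure.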

Variant Calling: Principles
- Today’s variant callers rely on probability calculations
- Use of Bayes’ Theorem
- E.g. MAQ: one of the first widely used read mappers and variant callers
  - Takes into account a quality score for the whole read alignment & the quality of the base at the individual position
  - Calls the most likely genotype given the observed substitutions
  - A reliability score can be calculated
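A strongly simplified Bayesian genotype call can illustrate the principle: the posterior of a genotype G given the observed bases D is proportional to P(D | G) · P(G). The Python sketch below is not MAQ's actual model; it assumes a diploid site with one alternative allele, independent reads, that every base error turns one allele into the other, error probabilities strictly between 0 and 1, and an illustrative prior for heterozygous sites.

```python
import math
from typing import Dict, List


def genotype_posteriors(base_is_alt: List[bool], error_probs: List[float],
                        het_prior: float = 0.001) -> Dict[str, float]:
    """Posterior probability of each genotype given per-read observations."""
    # Prior probabilities (assumed values for illustration only).
    priors = {
        "ref/ref": 1 - 1.5 * het_prior,
        "ref/alt": het_prior,
        "alt/alt": het_prior / 2,
    }
    # Fraction of alt alleles carried by each genotype.
    alt_fraction = {"ref/ref": 0.0, "ref/alt": 0.5, "alt/alt": 1.0}
    log_post = {}
    for genotype, fraction in alt_fraction.items():
        log_p = math.log(priors[genotype])
        for is_alt, err in zip(base_is_alt, error_probs):
            # Alt is observed if an alt allele is read correctly
            # or a ref allele is misread.
            p_alt = fraction * (1 - err) + (1 - fraction) * err
            log_p += math.log(p_alt if is_alt else 1 - p_alt)
        log_post[genotype] = log_p
    # Normalise in log space to avoid numerical underflow.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# 20 reads at Q30 (error probability 0.001), 10 of them showing the alt base.
observations = [False] * 10 + [True] * 10
posteriors = genotype_posteriors(observations, [0.001] * 20)
print(max(posteriors, key=posteriors.get), posteriors)
```

Unlike the frequency heuristic, every base contributes with its own error probability, and the posterior of the called genotype can serve as a reliability score.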

Variant Calling & Annotation: Popular Tools
- SAMtools (Mpileup & Bcftools)
- GATK
- Varscan2
- Freebayes
- MAQ

VCF = Variant Call Format
BCF = binary version of VCF
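A VCF data line is tab-separated and starts with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The following Python sketch splits one such line into a dictionary; the example record is made up, and for real work an established library would be the safer choice.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line (not a header line)."""
    fields = line.rstrip("\r\n").split("\t")
    if line.startswith("#") or len(fields) < 8:
        raise ValueError("expected a VCF data line with at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict: Dict[str, Any] = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, has_value, value = entry.partition("=")
        # INFO entries are either KEY=VALUE pairs or flags without a value.
        info_dict[key] = value if has_value else True
    return {
        "chrom": chrom,
        "pos": int(pos),                       # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),                 # several ALT alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Invented example record, for illustration only.
record = parse_vcf_line("chr1\t12345\t.\tA\tT,G\t50\tPASS\tDP=100;DB")
print(record)
```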

dbSNP and snpEff
- dbSNP = the Single Nucleotide Polymorphism database at NCBI
- Different collections of SNPs are available: ‘all humans’, human subpopulations, different clinical significance
- snpEff is a program that can annotate a collection of SNVs according to information available in dbSNP and information extracted from the location of the SNV (exon, intron, silent/non-sense mutation etc.)

ANNOVAR: a Swiss army knife to annotate genetic variants (SNPs and CNVs)
Input:
- variants as VCF file
- various databases with statistical and predictive annotations: dbSNP, 1000 genomes, …
Output:
- In coding region? Which gene? How frequently observed in the 1000 genomes project? (and more statistics). According to coordinates from RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes or others, etc.
- In non-coding region? Conserved region? According to conservation in 44 species, transcription factor binding sites, GWAS hits, etc.
- Predicted effect? Score of SIFT, PolyPhen-2, GERP++

Annotation with DGIdb: mining the druggable genome
Drug Gene Interaction Database: matching disease genes with potential drugs. It searches genes against a compendium of drug-gene interactions and identifies potentially 'druggable' genes.

Practical part

Import Workflow into Galaxy
Import the workflow ‘Goecks Exome pipeline hg19_noDedup’ to your workflows by either
- clicking on ‘Shared Data’ → ‘Published Workflows’ → ‘Goecks Exome pipeline hg19_noDedup’ → green plus symbol to import
- or clicking here: goecks-exome-pipeline-hg19
This workflow has been imported with small modifications from: Ref: Goecks et al., Cancer Med

Import data set
Import data by either
- clicking on: http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/h/vcap-variant-analysis
- or going to ‘Shared Data’ → ‘Published Histories’ → ‘VCaP Variant Analysis’
- or downloading the data from http://tinyurl.com/pfmshlu, unzipping it and uploading it to Galaxy

Run Goecks’ Exome analysis pipeline on your data
- Choose the parameter ‘Variant allele frequency’: e.g. 10%
- Choose the name of the data
- The genomic regions (BED file) contain the locations of the exons / amplicons
- Overview of the workflow: see next slide
- Because the whole workflow runs for about 20 min, you can import all results from: exome-pipeline-hg19nodedupvcap
- Before looking at the results of the variant analysis pipeline, check that our sequencing data has good quality, using the program FastQC (logically, that would be the first thing one would do…)

Overview of the Variant Analysis workflow Legend: see next slide

Overview of the Variant Analysis workflow
1. Input: 2 fastq files, of the forward and reverse reads. Make sure that sequencing adapters and parts of reads with low quality have been removed.
2. Mapping of the reads to the reference genome hg19, output format: BAM
3. Sorting: reads are sorted according to their coordinate position on the genome
4. Marking and removing of ‘duplicate reads’: reads with the identical position on the genome are likely duplicates created during the PCR amplification step. Exome sequencing typically relies on hybridization-based selection of sheared genomic DNA fragments in the so-called target regions. Because the DNA shearing step is (assumed to be) a random process, reads starting at exactly the same position are more likely to be due to PCR amplification than to originate from two independent fragments starting at the same position. Note: this step is omitted when processing amplicon sequencing reads, because reads of a given amplicon all start at the same position but can be copies of different original templates.
5. Summary of alignment statistics
6. MPileup: variant counts per position and statistics (1st step of variant calling)
7. Varscan2: variant calling with the program Varscan 2. Check: different analysis types are possible. In this course, “Analysis type: single nucleotide variation” is selected. Output format: VCF (Variant Call Format).
8. Label and filter out variants with a low variant allele frequency.
9. Slice VCF: discard all genomic regions except the exons (defined in the input BED file).
10. ANNOVAR: filter and annotate variants.
11. Annotate with DGIdb (Drug Gene Interaction database).
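Steps 8 and 9 of the workflow can be sketched in Python. The sketch below is not the code of the Galaxy tools; it assumes that each variant is a dictionary with the hypothetical keys `chrom`, `pos`, `ref_reads` and `alt_reads`, and that the variant allele frequency is computed as alt reads divided by the sum of ref and alt reads. BED intervals are 0-based and half-open, while VCF positions are 1-based, which the overlap test takes into account.

```python
from typing import Dict, Iterable, List, Tuple

Regions = Dict[str, List[Tuple[int, int]]]


def parse_bed(lines: Iterable[str]) -> Regions:
    """Read chromosome, start and end from BED lines (0-based, half-open)."""
    regions: Regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_target(chrom: str, pos: int, regions: Regions) -> bool:
    """True if the 1-based position lies inside any BED interval."""
    return any(start < pos <= end for start, end in regions.get(chrom, ()))


def filter_variants(variants: Iterable[dict], regions: Regions,
                    min_vaf: float = 0.10) -> List[dict]:
    """Keep variants with sufficient allele frequency inside target regions."""
    kept = []
    for variant in variants:
        depth = variant["ref_reads"] + variant["alt_reads"]
        if depth == 0:
            continue
        vaf = variant["alt_reads"] / depth
        if vaf >= min_vaf and in_target(variant["chrom"], variant["pos"], regions):
            kept.append(variant)
    return kept


# Invented toy data, for illustration only.
targets = parse_bed(["chr1\t100\t200\tamplicon_1"])
calls = [
    {"chrom": "chr1", "pos": 150, "ref_reads": 60, "alt_reads": 40},  # kept
    {"chrom": "chr1", "pos": 160, "ref_reads": 98, "alt_reads": 2},   # low VAF
    {"chrom": "chr1", "pos": 500, "ref_reads": 10, "alt_reads": 10},  # off target
]
print(filter_variants(calls, targets))
```

The linear scan over the intervals is fine for a small amplicon panel; for whole exomes a sorted interval lookup would be the more sensible design.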

Variant filtering and visualization
- Open the ANNOVAR output in a new tab
- Move the mouse over the lines of the VCF file and open the file in Trackster (click on the bar chart symbol)
- In Trackster, load the sorted BAM file as an additional track
- Choose one mutation that is annotated by the last annotation step, DGI. What was done in the last filter step before DGI was run?
- Read all annotations of the mutation you have selected
- Browse to this genomic location in Trackster. What is the coverage? Is this variant reliably covered?
