Tools for Targeted Sequencing and NGS analysis O. Harismendy, PhD BIOM262 – W2016.

Slides:



Advertisements
Similar presentations
A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Exploiting SNP polymorphism data Formation Bio-informatique, 9 au 13 février 2015.
Advertisements

SCHOOL OF COMPUTING ANDREW MAXWELL 9/11/2013 SEQUENCE ALIGNMENT AND COMPARISON BETWEEN BLAST AND BWA-MEM.
DNAseq analysis Bioinformatics Analysis Team
Variant Calling Workshop Chris Fields Variant Calling Workshop v2 | Chris Fields1 Powerpoint by Casey Hanson.
Ruibin Xi Peking University School of Mathematical Sciences
High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.
Introduction to Short Read Sequencing Analysis
Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Bioinformatics Tips NGS data processing and pipeline writing
NGS Analysis Using Galaxy
NGS Workshop Variant Calling
Steve Newhouse 28 Jan  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &
Whole Exome Sequencing for Variant Discovery and Prioritisation
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
Genome & Exome Sequencing Read Mapping Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.
Variant Calling Workshop Chris Fields Variant Calling Workshop | Chris Fields | PowerPoint by Casey Hanson.
Introduction to Short Read Sequencing Analysis
MES Genome Informatics I - Lecture V. Short Read Alignment
File formats Wrapping your data in the right package Deanna M. Church
Advanced ChIPseq Identification of consensus binding sites for the LEAFY transcription factor.
DAY 1. GENERAL ASPECTS FOR GENETIC MAP CONSTRUCTION SANGREA SHIM.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.
Alexis DereeperCIBA courses – Brasil 2011 Detection and analysis of SNP polymorphisms.
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Variant Calling Workshop.
Bioinformatics trainings, Vietnam Hanoi, November, 2015
Tutorial 6 High Throughput Sequencing. HTS tools and analysis Review of resequencing pipeline Visualization - IGV Analysis platform – Galaxy Tuning up.
IGV tools. Pipeline Download genome from Ensembl bacteria database Export the mapping reads file (SAM) Map reads to genome by CLC Using the mapping.
Current Data And Future Analysis Thomas Wieland, Thomas Schwarzmayr and Tim M Strom Helmholtz Zentrum München Institute of Human Genetics Geneva, 16/04/12.
Personalized genomics
Calling Somatic Mutations using VarScan
Computing on TSCC Make a folder for the class and move into it –mkdir –p /oasis/tscc/scratch/username/biom262_harismendy –cd /oasis/tscc/scratch/username/biom262_harismendy.
Canadian Bioinformatics Workshops
Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Inheritance Model testing Andrew Stubbs Dept. Bioinformatics.
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
DAY 2. GETTING FAMILIAR WITH NGS SANGREA SHIM. INDEX  Day 2  Get familiar with NGS  Understanding of NGS raw read file  Quality issue  Alignment/Mapping.
Canadian Bioinformatics Workshops
Konstantin Okonechnikov Qualimap v2: advanced quality control of
Using command line tools to process sequencing data
Day 5 Mapping and Visualization
> cd ~ > cp –R /media/sf_shared/BioNGS/GenomicVar/* .
Cancer Genomics Core Lab
Next Generation Sequencing Analysis
NGS Analysis Using Galaxy
Variant Calling Workshop
VCF format: variants c.f. S. Brown NYU
First Bite of Variant Calling in NGS/MPS Precourse materials
MiSeq Validation Pipeline
EMC Galaxy Course November 24-25, 2014
Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng
BF528 - Biological Data Formats
Next-generation sequencing - Mapping short reads
Next Gen. Sequencing Files and pysam
Maximize read usage through mapping strategies
BIOINFORMATICS Fast Alignment
Next Gen. Sequencing Files and pysam
Next Gen. Sequencing Files and pysam
Next-generation sequencing - Mapping short reads
CS 6293 Advanced Topics: Translational Bioinformatics
Canadian Bioinformatics Workshops
Alignment of Next-Generation Sequencing Data
Computational Pipeline Strategies
Variant Calling Chris Fields
The Variant Call Format
Presentation transcript:

Tools for Targeted Sequencing and NGS analysis O. Harismendy, PhD BIOM262 – W2016

Multi-Step analysis Quality Control Read Alignment Alignment Refinement Variant Calling Variant Annotation Visualization – Alignment – Variants

Alignment and Variant Calling Pipeline Process from HiSeq AlignRefineCall Variants HiSeq BCL files FASTQ format CASAVA 1.8 FASTQ format BWA Mapped BAM Samtools Mapped BAM Align Sort Reformat -Unmapped GATK Realign Picard -Duplicates GATK Recalibrate Cleaned BAM GATK Call Left Align Indels Filter Evaluate Filtered VCF

File Sizes Assume 100 bp read length

INSTRUMENT QC

Instrument QC

MiSeq specifications

Instrument QC

FASTQ

@IL31_4368:1:1:996:8507/2 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA + CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG + E?=EECE CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT + ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT (...) FASTQ Instrument serial # Lane Swath X coord Y coord Read direction

FASTQ read ID 1:Y:18:ATCACG

PHRED-like quality Phred=-10log10(error)

ASCII-Code

FASTQ QC

FASTQC

FASTQC Adapter Contamination

ALIGNMENT Using BWA

Suffix Array GTCCGAAGCTCCGG$ Fast but too much memory requirement for the whole genome

Burrows-Wheeler Store entire reference genome. Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution.

Burrows-Wheeler Transform (BWT) acaacg$ $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac gc$aaac BWT

Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

Burrows-Wheeler Transform Langmead et al (2009)

BWA

BWA mem

SAM/BAM FORMAT

Utility

BAM format Bioinformatics Aug 15;25(16): Epub 2009 Jun 8. The Sequence Alignment/Map format and SAMtools.

BAM example (field 1-9)

CIGAR string Before alignment After alignment

BAM example

SAM FLAGS

Extra Info (custom)

BAM/SAM format flexible : compatible with multiple alignment programs simple: easy to generate and convert compact in file size works on a stream : low memory footprint Indexable for efficiency SAM is human readable

SAMTOOLS & PICARD SAM/BAM manipulation

SAMTOOLS Manipulate alignments in the BAM format – imports from and exports to the SAM (Sequence Alignment/Map) format – sorting – merging – indexing – retrieve reads in any regions Works on a stream. Commands combined with Unix pipes. Able to open a BAM on a remote FTP or HTTP server

SAMTOOLS examples

SAMTOOLS view Usage: samtools view [options] | [region1 [...]] Options: -b output BAM -h print header for the SAM output -H print header only (no alignments) -S input is SAM -u uncompressed BAM output (force -b) -1 fast compression (force -b) -x output FLAG in HEX (samtools-C specific) -X output FLAG in string (samtools-C specific) -c print only the count of matching records -B collapse the backward CIGAR operation INT number of BAM compression threads [0] -L FILE output alignments overlapping the input BED FILE [null] -t FILE list of reference names and lengths (force -S) [null] -T FILE reference sequence file (force -S) [null] -o FILE output file name [stdout] -R FILE list of read groups to be outputted [null] -f INT required flag, 0 for unset [0] -F INT filtering flag, 0 for unset [0] -q INT minimum mapping quality [0] -l STR only output reads in library STR [null] -r STR only output reads in read group STR [null] -s FLOAT fraction of templates to subsample; integer part as seed [-1] -? longer help

Chrom Position Ref Coverage Read basesQualities SAMTOOLS Pileup

Picard – the other tool box

BED AND BEDTOOLS Interval Manipulation

BED track

BED format Header (space separated) [optional] Chr [mandatory] Start [mandatory] Stop [mandatory] Other (optional) – Name – Strand – Color – Type A BAM alignment can be condensed into a BED file: Information loss !

BEDTOOLS

IntersectBed

Intersect BED

coverageBed

GATK BAM postprocessing

GATK Best Practice

INDEL Realignment The problem

INDEL Realignment The problem

INDEL Realignment The solution

Base Quality Recalibration DePristo et al (2011)

ALIGNMENT VISUALISATION

/ Integrative Genome Viewer

/ Integrative Genome Viewer

VARIANT CALLING GATK/VARSCAN

Variant Calling Bayesian Statistics P(genotype|data)  P(data|genotype).P(genotype) – P(genotype) : prior probability for the genotype – P(data|genotype): likelihood for observed (called) allele Likelihood P(data|genotype): What’s known to affect base calling: Error rate increases as cycle numbers increase Error rate depends on substitution type Error rate depends on local sequence

GATK Callers

Somatic Variant Calling Compare Normal and Tumor directly Results – Germline: Present in both – Somatic: Present in Tumor only – Loss of Heterozygosity: Missed Het in the tumor Fisher Exact test Kobolt et al. (2012)

Heuristic Filters

VARIANT ANALYSIS VCF / VCFTOOLS

VCF format

VCFLib

VARIANT ANNOTATION

Variant Annotation Tools SNPEff (Broad) – VEP (Sanger) – ex.html ANNOVAR – Variant Tools (database / Manipulation) – McCarty et al:

ANNOVAR exonic_variant_functoin

McCarty et al:

Annotation Challenges Compensating mutations DNP or TNP need to be annotated together Phase of Indels on the same haplotype Partial effects Mutation affecting one transcript only Same gene Unknown Loss of Function mutations affecting known cancer genes Adapted from MacArthur et al, 2012

Useful Annotations dbSNP dbNSFP – SIFT – MutationTaster COSMIC ESP data 1000G data ExAC

Variant Effect Prediction SIFT Ng et al

Summary and Resources Formats – FASTQ – SAM/BAM (CRAM) – VCF Alignment and tools – BWA – IGV – Samtools Post-Alignment Processing and QC – GATK IndelRealigner – GATK BaseRecalibrator – Picard MarkDuplicate – Picard HS-Metrics – Samtools Variant Calling – GATK UnifiedGenotyper – GATK HaplotypeCaller – FreeBayes – VarScan (somatic) – Mutect (somatic) Variant Annotations – SNPEff – ANNOVAR – Vtools – VEP

Other Resources/Tools BaseSpace: the illumina app store – cBioportal: A repository of cancer genomics data – UCSC Cancer Genomic: A repository/browser of cancer genomics data – MAGI: A tool for Mutation Annotation and Genomic Interpretation –