Read Processing and Mapping: From Raw to Analysis-ready Reads

Read Processing and Mapping: From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 31 MAY 2013

From Raw to Analysis-ready Reads
Workflow: Raw reads -> Read assessment and prep -> Mapping -> Duplicate marking -> Local realignment -> Base quality recalibration -> Analysis-ready reads
Session Topics
- Overview of high-throughput sequencing platforms
- Understand read data formats and quality scores
- Identify and fix some common read data problems
- Find a genomic reference for mapping
- Map reads to a reference genome
- Understand alignment output
- Sort, merge, and index alignments for further analysis
- Mark/eliminate duplicate reads
- Locally realign at indels
- Recalibrate base quality scores
- How to get started

Sample to Raw Reads
Sample Preparation -> Library Construction -> QC and Quantification -> Sequencing -> Raw Reads

Sequence Data: Instrument Output
- Illumina MiSeq, Illumina HiSeq: Images (.tiff), Cluster intensity file (.cif), Base call file (.bcl)
- Ion PGM, Ion Proton: Standard flowgram file (.sff)
- Pacific Biosciences RS: Movie, Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
All instrument output is converted to Sequence Data (FASTQ Format)

Sequencing Platforms at a Glance

Solid Phase Amplification (V3 HiSeq)
Library DNA binds to oligos immobilized on the glass flowcell surface
Sequencing Steps
- Clusters are linearized
- Sequencing primer annealed
- All four dNTPs added at each cycle, each with a different fluorescent tag
- Intensity of different tags -> base call
Error Profile: substitutions

FASTQ Format (Illumina Example)
Each read record has four lines:
  1. Read record header
  2. Read bases
  3. Separator (with optional repeated header)
  4. Read quality scores
Header fields labeled for the first record: flow cell ID (D17DBACXX), lane (2), tile (1101), tile coordinates (12432:5554), barcode (AGTCAA)
  @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
  CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
  +
  BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
  @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
  AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
  +
  @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
  @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
  CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
  +
  CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
  @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
  GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
  +
  CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads
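To make the four-line record layout concrete, here is a minimal Python sketch that reads FASTQ records and splits an Illumina-style header into its fields. The function names (read_fastq, parse_illumina_header) are made up for this illustration. The slide labels only the flow cell ID, lane, tile, tile coordinates and barcode; the names used for the remaining header fields follow the usual Illumina convention and are an assumption here.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str  # header line without the leading "@"
    bases: str
    quals: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quals):
            raise ValueError(f"bases/qualities length mismatch in {header!r}")
        yield FastqRecord(header[1:], bases, quals)


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split 'instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode'."""
    name, _, comment = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, is_filtered, control, barcode = comment.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read, "is_filtered": is_filtered,
        "control": control, "barcode": barcode,
    }


# First record from the slide.
EXAMPLE = """@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
"""

record = next(read_fastq(io.StringIO(EXAMPLE)))
fields = parse_illumina_header(record.header)
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
```

The reader counts lines instead of searching for "@" because a quality string can itself begin with "@", as the second record on the slide does.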

Base Call Quality: Phred Quality Scores
Phred* quality score Q with base-calling error probability P:
  Q = -10 log10(P)
* Name of the first program to assign accurate base quality scores. From the Human Genome Project.
  Q score   Probability of base error   Base confidence   Sanger-encoded (Q score + 33) ASCII character
  10        0.1                         90%               "+"
  20        0.01                        99%               "5"
  30        0.001                       99.9%             "?"
  40        0.0001                      99.99%            "I"
Quality encodings by ASCII code:
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
  33                        59   64       73                             104                   126
  S - Sanger        Phred+33  range: 0 to 40
  I - Illumina 1.3+ Phred+64  range: 0 to 40
  L - Illumina 1.8+ Phred+33  range: 0 to 41
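The relationship between Q, P and the ASCII encoding can be checked with a few lines of Python. This is a small sketch using only the formula and offsets given on the slide; the function names are chosen for this illustration.

```python
import math
from typing import List


def phred_from_error_prob(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse of the formula above: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def encode_phred(q: int, offset: int = 33) -> str:
    """ASCII character for a score; offset 33 = Sanger / Illumina 1.8+, 64 = Illumina 1.3+."""
    return chr(q + offset)


def decode_quality_string(quals: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string back to integer Phred scores."""
    return [ord(c) - offset for c in quals]


for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q), encode_phred(q))
```

Decoding with the wrong offset shifts every score by 31, which is why older Illumina files (offset 64) need to be identified before any quality filtering.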

Initial Read Assessment and Processing
Common problems that can affect analysis
- Low confidence base calls
  - typically toward ends of reads
  - criteria vary by application
- Presence of adapter sequence in reads
  - poor fragment size selection
  - protocol execution or artifacts
- Over-abundant sequence duplicates
- Library contamination

Quick Read Assessment: FastQC
- Free download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
- Samples reads (200K default): fast, low resource use

Read Assessment Example (Cont’d)
- Trim leading bases (library artifact)
- Trim for base quality or adapters (run or library issue)
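The two trimming operations above can be sketched in Python as follows. This is a deliberately simple illustration, not the algorithm of any particular tool: it removes a fixed number of leading bases and then drops trailing bases whose quality is below a threshold. The threshold of 20 and the Phred+33 offset are assumptions chosen for the example.

```python
from typing import Tuple


def trim_leading(bases: str, quals: str, n: int) -> Tuple[str, str]:
    """Remove the first n bases (and their qualities) from a read."""
    return bases[n:], quals[n:]


def trim_3prime_by_quality(bases: str, quals: str,
                           min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until one reaches min_q."""
    end = len(bases)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


# Toy read: the last two bases have Phred scores 2 ("#") and 0 ("!").
bases, quals = trim_3prime_by_quality("ACGTACGT", "IIIII5#!")
print(bases, quals)
```

Real trimmers typically use a sliding window or a running-sum criterion so that a single good base near the 3' end does not stop the trimming early.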

Read Assessment Example (Cont’d)
TruSeq Adapter, Index 9
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
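As a rough illustration of adapter clipping, the sketch below removes the adapter shown above when it occurs in a read, either in full or as a prefix that runs off the 3' end. It only handles exact matches; production tools tolerate mismatches and indels. The function name and the minimum overlap of 8 bases are assumptions for this example.

```python
# TruSeq Adapter, Index 9, as shown on the slide.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG"


def clip_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 8) -> str:
    """Return the read with the adapter (and everything after it) removed.

    Scans left to right for the first position where the rest of the read,
    up to the adapter length, exactly matches the start of the adapter.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(read) - start, len(adapter))
        if read[start:start + overlap] == adapter[:overlap]:
            return read[:start]
    return read


insert = "ACGTTGCAACGT"
print(clip_adapter(insert + ADAPTER[:12]))  # adapter prefix at the 3' end
print(clip_adapter(insert))                 # no adapter present
```

Requiring a minimum overlap avoids clipping the last few bases of a read just because they happen to match the start of the adapter.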

Comprehensive Read Assessment: Prinseq http://prinseq.sourceforge.net/

Selected Tools to Process Reads
Fastx toolkit* (partial list) http://hannonlab.cshl.edu/fastx_toolkit/
- FASTQ Information: chart quality statistics and nucleotide distribution
- FASTQ Trimmer: shortens FASTQ/FASTA reads (removing barcodes or noise)
- FASTQ Clipper: removes sequencing adapters
- FASTQ Quality Filter: filters sequences based on quality
- FASTQ Quality Trimmer: trims (cuts) sequences based on quality
- FASTQ Masker: masks nucleotides with 'N' (or other character) based on quality
*Defaults to old Illumina FASTQ (ASCII offset 64). Use the -Q33 option.
SeqPrep https://github.com/jstjohn/SeqPrep
- Adapter trimming
- Merges overlapping paired-end reads
Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html (for Python programmers)
- Especially useful for implementing custom/complex sequence analysis/manipulation
Galaxy http://galaxy.psu.edu
- Great for beginners: upload data, point and click
- Just about everything you’ll see in today’s presentations
SolexaQA2 http://solexaqa.sourceforge.net
- Dynamic trimming
- Length sorting (resembles read grouping of Prinseq)

Many Analysis Pipelines Start with Read Mapping http://www.broadinstitute.org/gatk/guide/topic?name=best-practices http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Read Mapping
http://www.broadinstitute.org/igv/

Sequence References and Annotations
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
- Comprehensive reference information
http://hgdownload.cse.ucsc.edu/downloads.html
- Comprehensive reference, annotation, and translation information
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
- References and SNP data provided by GATK
- Human only
http://cufflinks.cbcb.umd.edu/igenomes.html
- Pre-indexed references and gene annotations for Tuxedo suite
- Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast
http://www.repeatmasker.org

Fasta Sequence Format
- One or more sequences per file
- “>” denotes beginning of sequence or contig
- Subsequent lines up to the next “>” define the sequence
- Lowercase base denotes repeat-masked base
- Contig ID may have comments delimited by “|”
  >chr1
  …
  TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
  agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
  attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
  aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
  taaccatatttgggatatactggattttaaaaatatatcactaatttcat
  >chr2
  >chr3
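The rules above translate directly into a small reader. The following Python sketch yields (name, sequence) pairs and reports the fraction of lowercase, repeat-masked bases; the function names and the short chr2 sequence are illustrative only.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for each record in a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(seq: str) -> float:
    """Fraction of bases written in lowercase (soft-masked repeats)."""
    return sum(c.islower() for c in seq) / len(seq) if seq else 0.0


# First sequence line from the slide plus a made-up second contig.
EXAMPLE = """>chr1
TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
>chr2
ACGT
"""

for name, seq in read_fasta(io.StringIO(EXAMPLE)):
    print(name, len(seq), round(masked_fraction(seq), 2))
```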

Read Mapping
Aligners compared: Novoalign (3.0), SOAP3 (version 91), BWA (0.7.4), Bowtie2 (2.1.0), Tophat2 (2.0.8b), STAR (2.3.0e)
Rows (values in slide order):
- License: Commercial; GPL v3; Artistic
- Mismatch allowed: up to 8; up to 3; user specified, max is function of read length and error rate; user specified; uses Bowtie2
- Alignments reported per read: random/all/none; user selected
- Gapped alignment: up to 7bp; 1-3bp gap; yes; splice junctions; introns
- Paired-end reads
- Best alignment: highest alignment score; minimal number of mismatches
- Trim bases: 3’ end; 3’ and 5’ end
- Comments: At one time, best performance and alignment quality; Element of Broad’s “best practices” genotyping workflow; Smith-Waterman quality alignments, currently fastest; Currently most popular RNA-seq aligner; Very fast, uses memory to achieve performance

Read Mapping: BWA
BWA Features
- Uses Burrows-Wheeler Transform
  - fast
  - modest memory footprint (<4GB)
- Accurate
- Tolerates base mismatches
  - increased sensitivity
  - reduces allele bias
- Gapped alignment for both single- and paired-end reads
- Automatically adjusts parameters based on read lengths and error rates
- Native BAM/SAM output (the de facto standard)
- Large installed base, well-supported
- Open-source (no charge)
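For readers curious how a Burrows-Wheeler Transform supports fast lookup, here is a toy Python sketch of exact matching by backward search over a BWT index. It is a textbook illustration only: it builds the suffix array naively and ignores mismatches and gaps, so it should not be read as BWA's actual implementation. All names are made up for this example.

```python
from typing import Dict, List, Tuple


def bwt_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build a suffix array, C table and occurrence table for text + '$'."""
    text += "$"  # sentinel, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)
    counts: Dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find(pattern: str, sa: List[int], c_table: Dict[str, int],
         occ: Dict[str, List[int]]) -> List[int]:
    """Return sorted 0-based positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = bwt_index("GATTACAGATTACA")
print(find("ATTA", sa, c_table, occ))
```

Each pattern character narrows the suffix-array range with two table lookups, so the search time depends on the read length rather than on the size of the reference.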

Read Mapping: Bowtie 2
Bowtie2
- Uses dynamic programming (edit distance scoring)
  - Eliminates need for realignment around indels
  - Can be tuned for different sequencing technologies
- Multi-seed search: adjustable sensitivity
- Input read length limited only by available memory
- Fasta or Fastq input
Caveats
- Longer input reads require much more memory
- Trade-off parallelism with memory requirement
[Figure: Dynamic Programming Illustration]
http://bowtie-bio.sourceforge.net/bowtie2
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359
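As a stand-in for the dynamic programming illustration, the sketch below computes the textbook unit-cost edit distance between a read and a reference segment. It is meant only to show the fill-the-table idea; Bowtie 2's own scoring scheme is more elaborate, and the function name is an assumption for this example.

```python
def edit_distance(read: str, ref: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning read into ref."""
    # prev[j] holds the distance between the read prefix seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, a in enumerate(read, 1):
        cur = [i]
        for j, b in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete a base from the read
                cur[j - 1] + 1,          # insert a base into the read
                prev[j - 1] + (a != b),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("GATTACA", "GATACA"))
```

Only two rows of the table are kept at a time, which keeps memory proportional to the reference segment length; recovering the alignment itself would require the full table or a traceback structure.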

SAM (BAM) Format
Sequence Alignment/Map format
- Universal standard
- Human-readable (SAM) and compact (BAM) forms
Structure
- Header: version, sort order, reference sequences, read groups, program/processing history
- Alignment records

SAM/BAM Format: Header
Use samtools to view the BAM header:
  [benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
  @HD VN:1.0 GO:none SO:coordinate
  @SQ SN:chrM LN:16571
  @SQ SN:chr1 LN:249250621
  @SQ SN:chr2 LN:243199373
  @SQ SN:chr3 LN:198022430
  …
  @SQ SN:chr19 LN:59128983
  @SQ SN:chr20 LN:63025520
  @SQ SN:chr21 LN:48129895
  @SQ SN:chr22 LN:51304566
  @SQ SN:chrX LN:155270560
  @SQ SN:chrY LN:59373566
  @RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
  @RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
  @RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
  @RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
  @RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
  @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
  @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
  @PG ID:bwa PN:bwa VN:0.6.2-r126
Header contents
- @HD: sort order
- @SQ: reference sequence names with lengths
- @RG: read groups with platform, library and sample information
- @PG: program (analysis) history

SAM/BAM Format: Alignment Records
  [benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
  HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191
  HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023
Columns: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS, 5 MAPQ, 6 CIGAR, 7 RNEXT, 8 PNEXT, 9 TLEN, 10 SEQ, 11 QUAL
* Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
http://samtools.sourceforge.net/SAM1.pdf
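A small Python sketch of how the eleven mandatory columns and the optional tags of an alignment record might be pulled apart. The field names follow the SAM specification; the class and function names are invented for this illustration. The example line reuses the first nine columns of the first record above, with SEQ and QUAL replaced by "*" (which SAM allows for absent values) to keep the example short.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str  # "=" means the mate maps to the same reference
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line (not a header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for item in f[11:]:
        key, _type, value = item.split(":", 2)  # e.g. RG:Z:86-191
        tags[key] = value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tags)


line = "\t".join([
    "HW-ST605:127:B0568ABXX:2:1201:10933:3739", "147", "chr1", "27675", "60",
    "101M", "=", "27588", "-188", "*", "*", "RG:Z:86-191",
])
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, rec.tags["RG"])
```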

Preparing for Next Steps
- Subsequent steps require sorted and indexed BAMs
  - Sort orders: karyotypic, lexicographical
  - Indexing improves analysis performance
- Picard tools: fast, portable, free
  http://picard.sourceforge.net/command-line-overview.shtml
  - Sort: SortSam.jar
  - Merge: MergeSamFiles.jar
  - Index: BuildBamIndex.jar
- Order: sort, merge (optional), index
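The difference between the two sort orders is easy to see with contig names. In this hedged Python sketch, the "reference" order is simply the order of the @SQ lines shown in the header slide (only the contigs actually listed there are included), and the lexicographical order is plain string sorting. The variable and function names are made up for the example.

```python
from typing import List, Tuple

# Contig order as listed in the @SQ header lines shown earlier (elided contigs omitted).
REFERENCE_ORDER = ["chrM", "chr1", "chr2", "chr3", "chr19", "chr20",
                   "chr21", "chr22", "chrX", "chrY"]

# Plain string sorting puts chr19 before chr2 and chrM after chr3.
LEXICOGRAPHICAL_ORDER = sorted(REFERENCE_ORDER)


def coordinate_sort(alignments: List[Tuple[str, int]],
                    contig_order: List[str]) -> List[Tuple[str, int]]:
    """Sort (contig, position) pairs by contig rank, then by position."""
    rank = {name: i for i, name in enumerate(contig_order)}
    return sorted(alignments, key=lambda a: (rank[a[0]], a[1]))


alignments = [("chr2", 500), ("chr19", 10), ("chr1", 27682), ("chr1", 27675), ("chrM", 42)]
print(coordinate_sort(alignments, REFERENCE_ORDER))
print(coordinate_sort(alignments, LEXICOGRAPHICAL_ORDER))
```

Tools that walk the BAM and the reference in parallel generally expect both to use the same contig order, which is why mixing the two conventions tends to cause errors downstream.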

Duplicate Marking
  $ java -Xmx4g -jar <path to picard>/MarkDuplicates.jar \
      INPUT=aligned.sorted.bam \
      OUTPUT=aligned.sorted.dedup.bam \
      VALIDATION_STRINGENCY=LENIENT \
      METRICS_FILE=aligned.dedup.metrics.txt \
      REMOVE_DUPLICATES=false \
      ASSUME_SORTED=true
http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
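The general idea behind duplicate marking can be sketched in a few lines of Python. This is a simplified illustration and not Picard's implementation: reads that share a reference, start position and strand are treated as duplicates, the one with the highest summed base quality is kept, and the rest get the duplicate bit (0x400) set in their FLAG, mirroring REMOVE_DUPLICATES=false. The dictionary layout and function name are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, List

DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate
REVERSE_FLAG = 0x10     # SAM FLAG bit: read on the reverse strand


def mark_duplicates(reads: List[Dict]) -> List[Dict]:
    """Set the duplicate bit on all but the best read in each duplicate group.

    Each read is a dict with keys: name, chrom, pos, flag, quals (Phred+33 string).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], bool(read["flag"] & REVERSE_FLAG))
        groups[key].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: sum(ord(c) - 33 for c in r["quals"]))
        for read in group:
            if read is not best:
                read["flag"] |= DUPLICATE_FLAG
    return reads


reads = [
    {"name": "a", "chrom": "chr1", "pos": 100, "flag": 0, "quals": "IIII"},
    {"name": "b", "chrom": "chr1", "pos": 100, "flag": 0, "quals": "++++"},
    {"name": "c", "chrom": "chr1", "pos": 100, "flag": 16, "quals": "++++"},
    {"name": "d", "chrom": "chr1", "pos": 200, "flag": 0, "quals": "++++"},
]
for read in mark_duplicates(reads):
    print(read["name"], read["flag"])
```

Marking rather than deleting keeps the original data intact; downstream tools can then decide whether to skip reads with the duplicate bit set.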

SAM/BAM Format: Alignment Records
  [benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
  HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191
http://picard.sourceforge.net/explain-flags.html
http://samtools.sourceforge.net/SAM1.pdf
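The FLAG column (147 in the record above) is a bit field. The Python sketch below decodes it using the bit meanings from the SAM specification; the function name and the wording of the descriptions are chosen for this illustration.

```python
from typing import List

# (bit value, meaning) pairs from the SAM specification.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag: int) -> List[str]:
    """List the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]


for flag in (147, 83):
    print(flag, explain_flag(flag))
```

Both example records are properly paired reads on the reverse strand; 147 is the second read of its pair and 83 is the first.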

Local Realignment
- BWT-based alignment is fast for matching reads to reference
- Individual base alignments often sub-optimal at indels
Approach
- Fast read mapping with BWT-based aligner
- Realign reads at indel sites using gold standard (but much slower) Smith-Waterman [1] algorithm
Benefits
- Refines location of indels
- Reduces erroneous SNP calls
- Very high alignment accuracy in significantly less time, with fewer resources
[1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
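For reference, a compact Python sketch of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for the example, and this is not the realigner's actual implementation; it only shows why the method is accurate but slow, since it fills a table of size len(a) x len(b).

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

Restricting this quadratic-time step to the neighborhoods of suspected indels is what makes the two-stage approach above practical.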

Local Realignment
[Figure panels: Raw BWA alignment; Post re-alignment at indels]
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Base Quality Recalibration
STEP 1: Find covariates at non-dbSNP sites using:
- Reported quality score
- The position within the read
- The preceding and current nucleotide (sequencer properties)
  java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -I alignment.bam \
    -R hg19/ucsc.hg19.fasta \
    -knownSites hg19/dbsnp_135.hg19.vcf \
    -o alignment.recal_data.grp
STEP 2: Generate BAM with recalibrated base scores:
  java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -I alignment.bam \
    -R hg19/ucsc.hg19.fasta \
    -BQSR alignment.recal_data.grp \
    -o alignment.recalibrated.bam
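The principle behind Step 1 can be illustrated with a toy Python sketch: group base observations by the covariates listed above, count mismatches against the reference at sites that are not known variants, and convert the observed error rate in each group to an empirical Phred score. This is an assumption-laden simplification and not GATK's model; in particular the +1/+2 smoothing and the input tuple layout are choices made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (reported Q, position in read, preceding+current nucleotide, mismatch?, known variant site?)
Observation = Tuple[int, int, str, bool, bool]
Covariates = Tuple[int, int, str]


def empirical_qualities(observations: Iterable[Observation]) -> Dict[Covariates, float]:
    """Empirical Phred score per covariate bin, skipping known variant sites."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for reported_q, cycle, dinuc, is_mismatch, is_known_site in observations:
        if is_known_site:  # a mismatch here may be real variation, not an error
            continue
        entry = counts[(reported_q, cycle, dinuc)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    # Smoothed error rate so that empty or error-free bins stay finite.
    return {key: -10.0 * math.log10((m + 1) / (n + 2))
            for key, (n, m) in counts.items()}


# 998 bases reported as Q30, of which 9 mismatch the reference at non-dbSNP sites.
obs = [(30, 5, "AC", i < 9, False) for i in range(998)]
obs.append((30, 5, "AC", True, True))  # mismatch at a known site: ignored
print(empirical_qualities(obs))
```

In this toy bin the bases were reported as Q30 but behave like roughly Q20, which is the kind of discrepancy recalibration is meant to correct.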

Base Quality Recalibration (Cont’d)

Raw reads -> Read assessment and prep -> Mapping -> Duplicate marking -> Local realignment -> Base quality recalibration -> Analysis-ready reads

Is there an easier way to get started?! http://galaxyproject.org/ Click on “Use Galaxy”

Getting Started
