NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing
Na Cai, 4th year DPhil in Clinical Medicine
Supervisor: Jonathan Flint

Example projects
CONVERGE
- 1.7x whole genome sequencing in 12,000 Han Chinese women
- Cases of MD, 6000 controls
- Detailed questionnaire
- 45 TB of sequencing data
Commercial Outbred Mice
- 0.1x whole genome sequencing in 2,000 mice
- Known breeding history
- Extensive phenotyping
- 2 TB of sequencing data

NGS data processing Taken from:

Large-scale sequencing projects
- Lots of data – Terabytes!
- Storage problems, I/O problems, RAM problems
- Time consuming to process
- Errors! Lots of them!
- Contamination
- Duplication
- Missing data
- Difficult regions/features of the genome

Approach to NGS data
- Explore the data before processing at large scale
- Pilot your experiments with small subsets
- Try the default parameters of software before altering them
- Check output
  – Right number of lines?
  – Did anything fail silently?
  – Different handling of different classes of input?
  – How are missing values coded?
  – % failure?
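The output checks above can be scripted so they run after every pilot. A minimal sketch, assuming tab-delimited text output; the function names and the set of missing-value tokens are hypothetical and should be adapted to whatever the tool in question actually writes:

```python
from collections import Counter

# Assumed codings for missing values; check the documentation of each tool.
MISSING_TOKENS = {"", "NA", "nan", "NaN", "."}


def summarize_table(lines, sep="\t", missing=MISSING_TOKENS):
    """Summarise a delimited text table given as an iterable of lines."""
    n_lines = 0
    field_counts = Counter()   # number of fields on a line -> number of such lines
    missing_seen = Counter()   # missing-value token -> number of occurrences
    for line in lines:
        n_lines += 1
        fields = line.rstrip("\n").split(sep)
        field_counts[len(fields)] += 1
        for field in fields:
            if field in missing:
                missing_seen[field] += 1
    return {
        "n_lines": n_lines,
        "field_counts": dict(field_counts),
        "missing": dict(missing_seen),
    }


def failure_rate(n_expected, n_observed):
    """Percentage of expected records that are absent from the output."""
    if n_expected <= 0:
        raise ValueError("n_expected must be positive")
    return 100.0 * (n_expected - n_observed) / n_expected
```

More than one entry in `field_counts` means some lines were truncated or handled differently, which is one of the commoner signs of a silent failure.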

Exploratory work in R
– read.table(…, as.is=T, na.strings=c("NA", "nan"))
– dim(), str(), mode(), complete.cases()
– head(), tail()
– table(), summary()
– order(), rank()
– plot(), library(ggplot2)
– library(plyr)

Pipeline writing
– Arguments/options for different input
– Arguments/options for parameters/auxiliary files
– Reusable functions
– Reasonably flexible input format recognition
– Set up for parallelizing
– stderr: for debugging and checking progress, but beware of its size and I/O!
– Create new directories as you go along
– Create flag files to indicate successful completion of each step
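The last three points (stderr messages, directories created on the fly, flag files) fit in one small helper. A minimal sketch using only the standard library; `run_step` and the `.done` suffix are hypothetical names, not part of any particular pipeline framework:

```python
import os
import sys


def run_step(name, func, workdir):
    """Run func(step_dir) once; skip it if its flag file already exists.

    Returns True if the step ran, False if it was skipped.
    """
    step_dir = os.path.join(workdir, name)
    os.makedirs(step_dir, exist_ok=True)          # create directories as you go
    flag = os.path.join(step_dir, name + ".done")
    if os.path.exists(flag):
        print(f"[{name}] already done, skipping", file=sys.stderr)
        return False
    print(f"[{name}] starting", file=sys.stderr)
    func(step_dir)                                # an exception here means no flag is written
    with open(flag, "w") as handle:
        handle.write("ok\n")
    print(f"[{name}] finished", file=sys.stderr)
    return True
```

Because the flag is written only after `func` returns, a crashed step is rerun on the next invocation while completed steps are skipped.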

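Make
- Specify input file and output file
- Specify command for input → output
- Make checks for the presence of the output file (and whether it is older than its input) before running the command
- Make can delete the output of commands that did not finish running (on interruption, or on error when .DELETE_ON_ERROR is set)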
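The two behaviours that matter most for pipelines, rerunning only when needed and never leaving a half-written output, can be mimicked in a few lines. This is a sketch of the idea in Python, not of how Make is implemented; `needs_rebuild`, `build` and the `.tmp` suffix are hypothetical:

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, command):
    """Run command(sources, tmp_path) only when needed.

    The command writes to a temporary path that is renamed onto the
    target on success, so a partial file never carries the target name.
    Returns True if the command ran, False if the target was up to date.
    """
    if not needs_rebuild(target, sources):
        return False
    tmp = target + ".tmp"
    try:
        command(sources, tmp)
        os.replace(tmp, target)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)   # discard output of a step that did not finish
        raise
    return True
```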

Ruffus
- Flexible: one → many and many → one processes
- Fully integrated with Python programming
- Need only specify the max number of cores allowed for parallelisation
- Useful printout options to check pipeline
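To make the one → many and many → one shapes concrete, here is a sketch that uses only the Python standard library. It is not Ruffus's API; the stage functions are hypothetical and the "work" is a toy (summing read lengths). The only parallelism setting is a cap on the number of workers, as in the point above:

```python
from concurrent.futures import ThreadPoolExecutor


def split_stage(items, n_chunks):
    """One -> many: deal a list of items into n_chunks interleaved chunks."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def transform_stage(chunk):
    """Per-chunk work; here just the total length of the toy read strings."""
    return sum(len(read) for read in chunk)


def merge_stage(partial_results):
    """Many -> one: combine the per-chunk results."""
    return sum(partial_results)


def run_pipeline(items, n_chunks=4, max_workers=2):
    """Split, process chunks with at most max_workers at a time, then merge."""
    chunks = split_stage(items, n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(transform_stage, chunks))
    return merge_stage(partial)
```

Threads are enough when each stage mostly waits on an external program; CPU-bound Python stages would need processes instead.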

Setting up Ruffus

Once Ruffus is set up - Help

Once Ruffus is set up – just print

NGS data processing Taken from:

Processing a raw BAM file
Practical concerns
– Number of samples
– Size of files
– Run time
– Server/cluster usage: how the jobs can be parallelized
Scientific concerns
– Ploidy of genome
– Source of DNA
– Features of genome
– Variation between samples
– Genome coverage
– Error rates

Manipulating a BAM file
– Converting between BAM and FASTQ
– Indexing
– Coordinate sorting
– Splitting or merging
– Filtering out reads using bitwise flags/other criteria
– Masking entire regions
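Filtering on bitwise flags comes down to masking the FLAG column (the second field of a SAM record). A minimal sketch on SAM text lines, with flag values taken from the SAM specification; the function names and the default exclusion mask are illustrative choices, and the exclude/require pair mirrors the `-F`/`-f` options of `samtools view`:

```python
# Flag bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400

DEFAULT_EXCLUDE = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE


def keep_read(flag, exclude=DEFAULT_EXCLUDE, require=0):
    """Keep a read if no excluded bit is set and every required bit is set."""
    return (flag & exclude) == 0 and (flag & require) == require


def filter_sam(lines, exclude=DEFAULT_EXCLUDE, require=0):
    """Yield header lines unchanged and only those alignment lines that pass."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])
        if keep_read(flag, exclude, require):
            yield line
```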

Example: Contaminants

Useful Resource: Harvard Sysbio
- Remove duplicate sequences in FASTA
- Remove short sequences in FASTA
- Format FASTA
onal/scriptome/UNIX/Protocols/Sequences.html
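The three FASTA operations listed above are short enough to sketch in Python. This is not the Scriptome code (which is not reproduced in the slides); the function names are hypothetical, and "duplicate" is assumed here to mean an identical sequence, ignoring case:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_fasta(lines, min_length=0):
    """Drop sequences shorter than min_length and repeated sequences.

    The first copy of each sequence is kept; comparison ignores case.
    """
    seen = set()
    for header, seq in read_fasta(lines):
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield header, seq


def format_fasta(records, width=60):
    """Yield FASTA lines with sequences wrapped to a fixed width."""
    for header, seq in records:
        yield ">" + header
        for i in range(0, len(seq), width):
            yield seq[i:i + width]
```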

Useful Resource: NGSUtils
- Tools (in Python) for FASTA, BAM, BED, GTF file processing
- E.g. bamutils filter can filter out reads with more than x mismatches

Useful Resource: PicardTools
- Tools (in Java) for BAM and FASTA processing
- Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
- Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY

Useful Resource: GATK
- Tools (in Java) for NGS processing and analysis
- Cool things about it: Best Practices page, Forum, Tutorials, Presentations

Useful Resource: GATK

Indel Realignment

Why Realign Around Indels?

Why Realign Around Indels?

How does it work?
Identified intervals:
- Known indels
- Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
- Reads where there is evidence of possible misalignment

The Indel Realigner Workflow

Implementing RealignerTargetCreator
[Diagram: grid of reads, samples 1 to 7 (rows) by sites 1 to 8 (columns)]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there → need to pass in data for all samples at the same time

Implementing IndelRealigner
[Diagram: grid of reads, samples 1 to 7 (rows) by sites 1 to 8 (columns)]
Once the intervals are identified, reads from any single sample can be realigned individually based on the sample's own insertion/deletion lengths → only need to pass in one sample's data at a time
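The difference between the two steps shows up in the shape of the job list. A sketch of one way to lay the jobs out, assuming the joint step is parallelised by region; the function names, dictionary keys and file names are hypothetical, and no GATK command line is implied:

```python
def target_creation_jobs(sample_bams, regions):
    """Joint step: one job per region, each given every sample's BAM."""
    return [{"region": region, "bams": list(sample_bams)} for region in regions]


def realignment_jobs(sample_bams, intervals_file):
    """Per-sample step: one job per BAM, all sharing the same intervals."""
    return [{"bam": bam, "intervals": intervals_file} for bam in sample_bams]


# Hypothetical inputs, for illustration only.
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]
joint = target_creation_jobs(bams, ["chr1", "chr2"])
per_sample = realignment_jobs(bams, "targets.intervals")
```

With 3 samples and 2 regions this gives 2 joint jobs that each read 3 BAMs, followed by 3 independent jobs that each read 1 BAM.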

Base Quality Score Recalibration (BQSR)

Why BQSR?

The BQSR workflow

Implementing BaseRecalibrator
[Diagram: grid of reads, samples 1 to 7 (rows) by sites 1 to 8 (columns)]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset → need to pass in all of the data for each sample
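Why the whole of a sample's data is needed becomes clearer from a toy version of a recalibration table: bases are binned by their reported quality, mismatches at unmasked sites are counted, and the observed error rate is converted back to a Phred score, Q = -10 log10(p). This is a deliberately simplified sketch, not GATK's implementation, which bins on further covariates as well; the function names and the add-one smoothing are assumptions made for illustration:

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Convert an error probability to a Phred-scaled quality."""
    return -10 * math.log10(error_probability)


def recalibration_table(observations):
    """Map each reported quality to an empirical quality.

    observations: iterable of (reported_quality, is_mismatch) pairs for
    bases at unmasked sites, where a mismatch is taken to be an error.
    """
    counts = defaultdict(lambda: [0, 0])   # reported Q -> [n_bases, n_mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (n_bases, n_mismatches) in counts.items():
        # Add-one smoothing keeps the rate above zero when no mismatch is seen.
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        table[reported_q] = round(phred(error_rate))
    return table
```

The estimate for each bin only stabilises with many observations, which is why a small slice of the data is not enough.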

Variant Calling

Variant Calling

Implementing Variant Calling
[Diagram: grid of reads, samples 1 to 7 (rows) by sites 1 to 8 (columns)]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at that site → need to pass in data for all samples at a particular site at the same time
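Since all samples must be seen together at each site, a natural unit of parallel work is a genomic interval rather than a sample. A minimal sketch that cuts chromosomes into fixed-size chunks; `chunk_regions` is a hypothetical helper, the chromosome lengths in any call are supplied by the user, and the `chrom:start-end` strings use 1-based inclusive coordinates:

```python
def chunk_regions(chrom_lengths, chunk_size):
    """Yield 'chrom:start-end' strings covering every chromosome.

    chrom_lengths: dict mapping chromosome name to its length in bases.
    Coordinates are 1-based and inclusive; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield f"{chrom}:{start}-{end}"
```

Each interval can then be given to one calling job together with every sample's BAM, and the per-interval results concatenated in order.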

Useful Resource: Variant Callers

Acknowledgements
- Jonathan Flint, Richard Mott
- Robbie Davies, Winni Kretzschmar
- Kiran Garimella (GATK)
- Leo Goodstadt (Ruffus)
- Gerton Lunter (Stampy)
- Andy Rimmer (Platypus)
- Zam Iqbal (Cortex)
- John Broxholme (all software help and maintenance)
- Jon Diprose, Robert Esnouf (Clusters)
- Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)