NGS data processing: bioinformatics tips, tools of the trade and pipeline writing
Na Cai, 4th year DPhil in Clinical Medicine
Supervisor: Jonathan Flint
Example projects
CONVERGE
- 1.7x whole genome sequencing in 12,000 Han Chinese women
- Cases of MD, 6000 controls
- Detailed questionnaire
- 45 TB of sequencing data
Commercial Outbred Mice
- 0.1x whole genome sequencing in 2,000 mice
- Known breeding history
- Extensive phenotyping
- 2 TB of sequencing data
NGS data processing Taken from:
Large-scale sequencing projects
– Lots of data: terabytes! Storage problems, I/O problems, RAM problems
– Time consuming to process
– Errors! Lots of them!
  – Contamination
  – Duplication
  – Missing data
  – Difficult regions/features of the genome
Approach to NGS data
– Explore the data before processing at large scale
– Pilot your experiments with small subsets
– Try a tool's default parameters before altering them
– Check the output
  – Right number of lines?
  – Did anything fail silently?
  – Are different classes of input handled differently?
  – How are missing values coded?
  – What percentage failed?
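The output checks listed above can be automated for any tabular result. The following is a minimal sketch in Python, assuming a tab-delimited file; the function name `check_table` and the set of missing-value codes are illustrative assumptions, not part of any tool mentioned in these slides.

```python
import csv
import io

# Assumed missing-value codes; adjust to whatever the tool under test emits.
MISSING_CODES = {"", "NA", "nan", "NaN", "."}


def check_table(handle, expected_rows, expected_cols, delimiter="\t"):
    """Summarise a delimited output file: row count, malformed rows, % missing."""
    n_rows = 0
    malformed = 0
    missing = 0
    cells = 0
    for row in csv.reader(handle, delimiter=delimiter):
        n_rows += 1
        if len(row) != expected_cols:
            malformed += 1  # a truncated row often signals a silent failure
        cells += len(row)
        missing += sum(1 for value in row if value.strip() in MISSING_CODES)
    return {
        "rows_ok": n_rows == expected_rows,
        "n_rows": n_rows,
        "malformed_rows": malformed,
        "pct_missing": 100.0 * missing / cells if cells else 0.0,
    }


# Toy output: the third row is truncated and two cells are coded as missing.
example = io.StringIO("s1\t0.1\tNA\ns2\t0.3\t0.5\ns3\tnan\n")
report = check_table(example, expected_rows=3, expected_cols=3)
print(report)
```

Running a check like this on a small pilot subset first makes it cheap to find out how a tool codes missing values before the full dataset is processed.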
Exploratory work in R
– read.table(…, as.is=TRUE, na.strings=c("NA", "nan"))
– dim(), str(), mode(), complete.cases()
– head(), tail()
– table(), summary()
– order(), rank()
– plot(), library(ggplot2)
– library(plyr)
Pipeline writing
– Arguments/options for different input
– Arguments/options for parameters/auxiliary files
– Reusable functions
– Reasonably flexible input format recognition
– Set up for parallelizing
– stderr: for debugging and checking progress, but beware of its size and I/O cost!
– Create new directories as you go along
– Create flag files to indicate successful completion of each step
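Several of the points above (command-line options, per-step directories, progress on stderr, flag files) fit in a few lines of Python. This is a sketch only: the step name `count`, the option `--min-length` and the flag file name `completed.flag` are hypothetical choices made for illustration.

```python
import argparse
import sys
import tempfile
from pathlib import Path


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Toy pipeline step runner")
    parser.add_argument("--input", required=True, help="input file")
    parser.add_argument("--outdir", required=True, help="output directory")
    parser.add_argument("--min-length", type=int, default=1,
                        help="example parameter: minimum line length to count")
    return parser.parse_args(argv)


def run_step(name, outdir, func):
    """Run func(step_dir) once; skip it if a flag file from an earlier run exists."""
    step_dir = Path(outdir) / name
    step_dir.mkdir(parents=True, exist_ok=True)  # create directories as we go
    flag = step_dir / "completed.flag"
    if flag.exists():
        print(f"[{name}] already done, skipping", file=sys.stderr)
        return False
    func(step_dir)
    flag.touch()  # written only after func returned without raising
    print(f"[{name}] finished", file=sys.stderr)
    return True


def count_lines(args):
    """Build a reusable step that counts lines of at least --min-length characters."""
    def step(step_dir):
        with open(args.input) as handle:
            n = sum(1 for line in handle
                    if len(line.rstrip("\n")) >= args.min_length)
        (step_dir / "n_lines.txt").write_text(f"{n}\n")
    return step


with tempfile.TemporaryDirectory() as tmp:
    infile = Path(tmp) / "reads.txt"
    infile.write_text("ACGT\nAC\n\nACGTACGT\n")
    args = parse_args(["--input", str(infile), "--outdir", tmp,
                       "--min-length", "3"])
    first_run = run_step("count", args.outdir, count_lines(args))
    second_run = run_step("count", args.outdir, count_lines(args))
    counted = (Path(tmp) / "count" / "n_lines.txt").read_text().strip()
```

Because the flag file is written last, a step that crashes part-way leaves no flag and is rerun the next time the pipeline starts.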
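Make
– Specify the input file and output file
– Specify the command that turns the input into the output
– Make checks for the output file before running the command, and reruns the command only if the output is missing or older than the input
– Make deletes the output of commands that did not finish running (GNU Make does this when it is interrupted; for commands that exit with an error it needs the .DELETE_ON_ERROR special target)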
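The two behaviours described above, skipping up-to-date outputs and cleaning up after unfinished commands, can be imitated in a few lines. This sketch is plain Python written for illustration and is not how Make itself is implemented; `needs_rebuild` and `build` are hypothetical helper names.

```python
import os
import tempfile


def needs_rebuild(target, source):
    """True if the target is missing or older than its source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def build(target, source, command):
    """Run command(source, target) if needed; remove partial output on failure."""
    if not needs_rebuild(target, source):
        return "up to date"
    try:
        command(source, target)
    except Exception:
        if os.path.exists(target):
            os.remove(target)  # never leave a half-written output behind
        return "failed"
    return "built"


def to_upper(source, target):
    with open(source) as src, open(target, "w") as dst:
        dst.write(src.read().upper())


def crashes_midway(source, target):
    with open(target, "w") as dst:
        dst.write("partial")
    raise RuntimeError("simulated crash")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    out = os.path.join(tmp, "out.txt")
    bad = os.path.join(tmp, "bad.txt")
    with open(src, "w") as handle:
        handle.write("acgt\n")
    results = [
        build(out, src, to_upper),        # output missing, so it is built
        build(out, src, to_upper),        # output newer than input, so skipped
        build(bad, src, crashes_midway),  # partial output is removed
    ]
    partial_left_behind = os.path.exists(bad)
```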
Ruffus
– Flexible: one-to-many and many-to-one processes
– Fully integrated with Python programming
– Need only specify the maximum number of cores allowed for parallelisation
– Useful printout options to check the pipeline
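To make the one-to-many and many-to-one pattern concrete, here is a sketch that uses only the Python standard library. It is not Ruffus code (Ruffus expresses these stages with its own decorators such as `@split` and `@merge`); the function names, the toy records and `MAX_WORKERS` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by_chromosome(records):
    """One-to-many: group (chromosome, position) records per chromosome."""
    chunks = {}
    for chrom, pos in records:
        chunks.setdefault(chrom, []).append(pos)
    return chunks


def summarise(item):
    """Per-chunk work: count positions and report their range."""
    chrom, positions = item
    return chrom, len(positions), min(positions), max(positions)


def merge(summaries):
    """Many-to-one: combine per-chromosome summaries into one sorted table."""
    return sorted(summaries)


records = [("chr2", 50), ("chr1", 10), ("chr1", 30), ("chr2", 5), ("chr3", 7)]

# The only parallelisation setting: an upper bound on concurrent workers.
MAX_WORKERS = 2
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    summaries = list(pool.map(summarise, split_by_chromosome(records).items()))

table = merge(summaries)
print(table)
```

The middle stage runs independently per chunk, so the scheduler can use whatever number of workers is allowed without changing the result.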
Setting up Ruffus
Once Ruffus is set up - Help
Once Ruffus is set up – just print
Processing a raw BAM file
Practical concerns
– Number of samples
– Size of files
– Run time
– Server/cluster usage: how the jobs can be parallelized
Scientific concerns
– Ploidy of genome
– Source of DNA
– Features of genome
– Variation between samples
– Genome coverage
– Error rates
Manipulating a BAM file
– Converting between BAM and FASTQ files
– Indexing
– Coordinate sorting
– Splitting or merging
– Filtering out reads using bitwise flags/other criteria
– Masking entire regions
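Filtering on bitwise flags works by testing the FLAG column of each alignment against a mask. The sketch below operates on SAM text lines so that it needs no third-party library; the flag values come from the SAM specification, while the function names and the toy reads are made up for illustration.

```python
# Flag bits defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400


def keep_read(sam_line, exclude_mask, min_mapq=0):
    """Keep a read if none of the excluded bits are set and MAPQ is high enough."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    return (flag & exclude_mask) == 0 and mapq >= min_mapq


def filter_sam(lines, exclude_mask, min_mapq=0):
    """Yield header lines unchanged and alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_read(line, exclude_mask, min_mapq):
            yield line


sam = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",    # proper pair
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",  # 99 + duplicate bit
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                # unmapped
    "r4\t0\tchr1\t300\t5\t4M\t*\t0\t0\tACGT\tIIII",          # low MAPQ
]

exclude = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_QC_FAIL | FLAG_DUPLICATE
kept = list(filter_sam(sam, exclude, min_mapq=20))
kept_names = [line.split("\t")[0] for line in kept]
print(exclude, kept_names)
```

The combined mask here is 1796, which is the same kind of value that would be passed to an exclusion option in a BAM tool; on real BAM files a dedicated tool is far faster than parsing text.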
Example: Contaminants
Useful Resource: Harvard Sysbio
– Remove duplicate sequences in FASTA
– Remove short sequences in FASTA
– Format FASTA
onal/scriptome/UNIX/Protocols/Sequences.html
Useful Resource: NGSUtils
– Tools (in Python) for FASTA, BAM, BED, GTF file processing
– e.g. bamutils filter can filter out reads with more than x mismatches
Useful Resource: PicardTools
– Tools (in Java) for BAM and FASTA processing
– Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
– Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
Useful Resource: GATK
– Tools (in Java) for NGS processing and analysis
– Cool things about it: Best Practices page, Forum, Tutorials, Presentations
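The three FASTA operations named above are short enough to sketch directly. This is an illustrative Python version, not the one-liners from the resource itself; the function names, the case-insensitive duplicate test and the toy records are assumptions.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_fasta(records, min_length):
    """Drop short sequences and all but the first copy of duplicate sequences."""
    seen = set()
    for name, seq in records:
        key = seq.upper()  # assumption: duplicates are compared ignoring case
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield name, seq


def format_fasta(records, width=60):
    """Write records back out with sequence lines wrapped at a fixed width."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


raw = [">a", "ACGTAC", "GT", ">b", "acgtacgt", ">c", "ACG", ">d", "TTTTGGGG"]
cleaned = list(clean_fasta(read_fasta(raw), min_length=5))
text = format_fasta(cleaned, width=4)
print(text)
```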
Indel Realignment
Why Realign Around Indels?
How does it work?
Identified intervals:
– Known indels
– Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
– Reads where there is evidence of possible misalignment
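To show what "indels discovered in CIGAR strings" means in practice, the sketch below walks a CIGAR string and reports the reference coordinates of each insertion and deletion, then merges overlapping or adjacent sites across reads. It illustrates the idea only and is not GATK's implementation; the function names and the toy reads are assumptions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def indel_sites(pos, cigar):
    """Yield (op, ref_start, ref_end) for each I or D; pos is the 1-based POS."""
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # An insertion sits between reference bases ref - 1 and ref.
            yield ("I", ref - 1, ref)
        elif op == "D":
            yield ("D", ref, ref + length - 1)
        if op in CONSUMES_REFERENCE:
            ref += length


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Toy reads as (POS, CIGAR) pairs.
reads = [(100, "5M2I3M1D4M"), (102, "6M1D6M"), (500, "3S10M"), (300, "4M3D4M")]

first_read = list(indel_sites(*reads[0]))
candidate_intervals = merge_intervals(
    (start, end)
    for pos, cigar in reads
    for _, start, end in indel_sites(pos, cigar)
)
print(first_read, candidate_intervals)
```

Two reads supporting the same deletion at position 108 collapse into a single candidate interval, which is why pooling reads across samples helps at this stage.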
The Indel Realigner Workflow
Implementing RealignerTargetCreator
[Figure: grid of reads by sample (rows: sample1 to sample7) and site (columns: Site1 to Site8)]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there, so we need to pass in data for all samples at the same time.
Implementing IndelRealigner
[Figure: grid of reads by sample (rows: sample1 to sample7) and site (columns: Site1 to Site8)]
Once the intervals are identified, reads from any single sample can be realigned individually based on the sample's own insertion/deletion lengths, so we only need to pass in one sample's data at a time.
Base Quality Score Recalibration (BQSR)
Why BQSR?
The BQSR workflow
Implementing BaseRecalibrator
[Figure: grid of reads by sample (rows: sample1 to sample7) and site (columns: Site1 to Site8)]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset, so we need to pass in all of the data for each sample.
Variant Calling
Implementing Variant Calling
[Figure: grid of reads by sample (rows: sample1 to sample7) and site (columns: Site1 to Site8)]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at the site, so we need to pass in data for all samples at a particular site at the same time.
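The data requirements of the four steps suggest two ways of cutting the work into jobs: by region with every sample included, or by sample across the whole genome. The sketch below encodes that reading of the slides as a small job planner; the scope labels, `plan_jobs` and the sample and region names are hypothetical, and no actual GATK command lines are produced.

```python
# How each step needs its data, following the slides' descriptions.
STEP_SCOPE = {
    "RealignerTargetCreator": "all_samples_per_region",
    "IndelRealigner": "one_sample",
    "BaseRecalibrator": "one_sample",
    "UnifiedGenotyper": "all_samples_per_region",
}


def plan_jobs(step, samples, regions):
    """Return one job description per unit of work that can run independently."""
    scope = STEP_SCOPE[step]
    if scope == "all_samples_per_region":
        # Every job sees all samples but only one region of the genome.
        return [{"step": step, "region": region, "samples": list(samples)}
                for region in regions]
    # Every job sees one sample across all regions.
    return [{"step": step, "region": "all", "samples": [sample]}
            for sample in samples]


samples = ["sample1", "sample2", "sample3"]
regions = ["chr1", "chr2"]

target_jobs = plan_jobs("RealignerTargetCreator", samples, regions)
realign_jobs = plan_jobs("IndelRealigner", samples, regions)
calling_jobs = plan_jobs("UnifiedGenotyper", samples, regions)
print(len(target_jobs), len(realign_jobs), len(calling_jobs))
```

With thousands of samples the per-sample steps give many small jobs, while the per-region steps give fewer jobs that each read from every file, so I/O becomes the main concern there.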
Useful Resource: Variant Callers
Acknowledgements
– Jonathan Flint, Richard Mott
– Robbie Davies, Winni Kretzschmar
– Kiran Garimella (GATK)
– Leo Goodstadt (Ruffus)
– Gerton Lunter (Stampy)
– Andy Rimmer (Platypus)
– Zam Iqbal (Cortex)
– John Broxholme (all software help and maintenance)
– Jon Diprose, Robert Esnouf (Clusters)
– Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)