Published by August Lloyd. Modified over 9 years ago.
1
NGS data processing
Bioinformatics tips, tools of the trade and pipeline writing
Na Cai, 4th year DPhil in Clinical Medicine
Supervisor: Jonathan Flint
2
Example projects
CONVERGE
– 1.7x whole genome sequencing in 12,000 Han Chinese women
– 6,000 cases of MD, 6,000 controls
– Detailed questionnaire
– 45 TB of sequencing data
Commercial Outbred Mice
– 0.1x whole genome sequencing in 2,000 mice
– Known breeding history
– Extensive phenotyping
– 2 TB of sequencing data
3
NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
4
Large-scale sequencing projects
Lots of data – Terabytes!
– Storage problems, I/O problems, RAM problems
– Time consuming to process
Errors! Lots of them!
– Contamination
– Duplication
– Missing data
– Difficult regions/features of the genome
5
Approach to NGS data
Explore the data before processing at large scale
Pilot your experiments with small subsets
Try the default parameters of software before altering them
Check output
– Right number of lines?
– Did anything fail silently?
– Different handling of different classes of input?
– How are missing values coded?
– % failure?
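The output checks above can be partly automated. The following is a minimal sketch, assuming a tab-delimited output file with a known expected number of rows and columns. The function name `check_table` and the set of missing-value codes are illustrative assumptions, not part of any tool named in the slides.

```python
import csv
import io

# Assumption: adjust this set to however the tool at hand codes missing values.
MISSING_CODES = {"", "NA", "nan", "."}


def check_table(handle, expected_rows, expected_cols, delimiter="\t"):
    """Summarise a delimited output file: row count, ragged rows, missing cells."""
    n_rows = 0
    ragged = []       # 1-based row numbers with the wrong number of fields
    n_missing = 0
    reader = csv.reader(handle, delimiter=delimiter)
    for row_number, fields in enumerate(reader, start=1):
        n_rows += 1
        if len(fields) != expected_cols:
            ragged.append(row_number)
        n_missing += sum(1 for field in fields if field.strip() in MISSING_CODES)
    return {
        "rows_ok": n_rows == expected_rows,
        "n_rows": n_rows,
        "ragged_rows": ragged,
        "pct_missing": 100.0 * n_missing / max(1, n_rows * expected_cols),
    }


# Toy output: the third row has silently lost a column.
example = io.StringIO("s1\t0.9\t12\ns2\tNA\t15\ns3\t0.7\n")
report = check_table(example, expected_rows=3, expected_cols=3)
print(report)
```

Running a check like this on a small pilot subset first makes silent failures visible before the full dataset is processed.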
6
Exploratory work in R
– read.table(…, as.is=T, na.strings=c("NA", "nan"))
– dim(), str(), mode(), complete.cases()
– head(), tail()
– table(), summary()
– order(), rank()
– plot(), library(ggplot2)
– library(plyr)
7
Pipeline writing
– Arguments/options for different input
– Arguments/options for parameters/auxiliary files
– Reusable functions
– Reasonably flexible input format recognition
– Set up for parallelizing
– stderr – for debugging, checking progress, but beware of its size and I/O!
– Create new directories as you go along
– Create flag files to indicate successful completion of each step
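Three of the points above (directories created on demand, progress on stderr, and flag files) fit in one reusable function. This is a sketch only: `run_step`, the `.done` suffix and the stand-in `write_sorted` step are hypothetical names chosen for illustration.

```python
import sys
import tempfile
from pathlib import Path


def run_step(name, out_dir, action):
    """Run action(step_dir) once; a flag file marks successful completion."""
    step_dir = Path(out_dir) / name
    step_dir.mkdir(parents=True, exist_ok=True)    # create directories as you go
    flag = step_dir / (name + ".done")             # hypothetical flag-file naming
    if flag.exists():
        print(f"[{name}] already complete, skipping", file=sys.stderr)
        return False
    print(f"[{name}] starting", file=sys.stderr)   # progress goes to stderr
    action(step_dir)                               # an exception here leaves no flag
    flag.touch()
    print(f"[{name}] finished", file=sys.stderr)
    return True


def write_sorted(step_dir):
    # Stand-in for a real processing command.
    names = sorted(["chr2", "chr1"])
    (step_dir / "reads.sorted.txt").write_text("\n".join(names) + "\n")


work_dir = tempfile.mkdtemp()
first = run_step("sort", work_dir, write_sorted)    # runs the step
second = run_step("sort", work_dir, write_sorted)   # skipped: flag file exists
```

Because the flag is written only after the action returns, a step that crashes midway is rerun on the next invocation rather than silently treated as done.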
8
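Make
– Specify the input file and the output file
– Specify the command that turns the input into the output
– Make checks whether the output file exists (and is newer than its input) before running the command
– Make deletes the output of a command that is interrupted (and, with .DELETE_ON_ERROR set, of a command that fails)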
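The behaviour described above can be mimicked in a few lines, which helps when reasoning about what Make does on a rerun. This is a sketch of the idea in Python, not Make itself. The names `needs_update`, `build` and `uppercase` are illustrative.

```python
import os
import tempfile


def needs_update(target, source):
    """Rebuild if the target is missing or older than its source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def build(target, source, command):
    """Run command(source, target) only when needed; remove partial output on failure."""
    if not needs_update(target, source):
        return "up to date"
    try:
        command(source, target)
    except Exception:
        if os.path.exists(target):
            os.remove(target)      # never leave a half-written target behind
        raise
    return "built"


def uppercase(source, target):
    # Stand-in for a real command such as sorting or indexing a file.
    with open(source) as src, open(target, "w") as dst:
        dst.write(src.read().upper())


tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "in.txt")
out_path = os.path.join(tmp, "out.txt")
with open(src_path, "w") as handle:
    handle.write("acgt\n")

first = build(out_path, src_path, uppercase)    # target missing, so it is built
second = build(out_path, src_path, uppercase)   # target present and not older
```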
9
Ruffus
http://www.ruffus.org.uk
– Flexible: one-to-many and many-to-one processes
– Fully integrated with Python programming
– Need to specify only the maximum number of cores allowed for parallelisation
– Useful printout options to check the pipeline
10
Setting up Ruffus
11
Once Ruffus is set up - Help
12
Once Ruffus is set up – just print
13
NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
14
Processing a raw BAM file
Practical concerns
– Number of samples
– Size of files
– Run time
– Server/cluster usage: how the jobs can be parallelized
Scientific concerns
– Ploidy of genome
– Source of DNA
– Features of genome
– Variation between samples
– Genome coverage
– Error rates
15
Manipulating a BAM file
– Converting between BAMs and FASTQs
– Indexing
– Coordinate sorting
– Splitting or merging
– Filter out reads using bitwise flags/other criteria
– Mask entire regions
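Filtering on bitwise flags amounts to testing the FLAG column against a mask. The sketch below works on SAM-format text lines using only the standard library. The bit values come from the SAM specification, while the choice of which bits to exclude and the toy reads are illustrative assumptions. Header lines starting with `@` are not handled.

```python
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400

# Illustrative choice of filter; 4 + 256 + 512 + 1024 = 1796.
EXCLUDE = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE


def keep_read(sam_line, exclude_mask=EXCLUDE):
    """Return True if none of the bits in exclude_mask are set in the read's FLAG."""
    flag = int(sam_line.split("\t")[1])   # FLAG is the second SAM column
    return (flag & exclude_mask) == 0


sam_lines = [
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",     # paired, properly aligned
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                  # unmapped
    "r3\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",   # 99 + 1024: marked duplicate
]
kept = [line.split("\t")[0] for line in sam_lines if keep_read(line)]
print(kept)   # ['r1']
```

On real BAM files the same mask would normally be passed to an existing tool, for example as `samtools view -F 1796`, rather than reimplemented.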
16
Example: Contaminants
18
Useful Resource: Harvard Sysbio
– Remove duplicate sequences in FASTA
– Remove short sequences in FASTA
– Format FASTA
http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
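For readers who prefer Python, the first two tasks listed above might be sketched as follows. This is not the code from the Harvard resource; `read_fasta`, `clean_fasta` and the `min_length` parameter are hypothetical names, and duplicates are defined here as exact, case-insensitive sequence matches.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_fasta(records, min_length=0):
    """Drop sequences shorter than min_length and exact duplicate sequences."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield header, seq


fasta = io.StringIO(">a\nACGT\nACGT\n>b\nAC\n>c\nacgtacgt\n>d\nGGGGCCCC\n")
cleaned = list(clean_fasta(read_fasta(fasta), min_length=4))
print([header for header, _ in cleaned])   # ['a', 'd']
```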
19
Useful Resource: NGSUtils
– Tools (in Python) for FASTA, BAM, BED, GTF file processing
– E.g. bamutils filter can filter out reads with more than x mismatches
http://ngsutils.org
20
Useful Resource: PicardTools
– Tools (in Java) for BAM and FASTA processing
– Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
– Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
http://broadinstitute.github.io/picard
21
Useful Resource: GATK
– Tools (in Java) for NGS processing and analysis
– Cool things about it: Best Practices page, Forum, Tutorials, Presentations
https://www.broadinstitute.org/gatk/
22
Useful Resource: GATK http://www.broadinstitute.org/gatk/guide/best-practices
23
Indel Realignment http://www.broadinstitute.org/gatk/guide/best-practices
24
Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
25
Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
26
How does it work?
Identified intervals:
– Known indels
– Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
– Reads where there is evidence of possible misalignment
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
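To make the second source of intervals concrete, the sketch below shows how indels can be read off a CIGAR string. It illustrates the CIGAR encoding only and is not GATK's implementation. The function name and the convention for reporting insertion positions are assumptions of this sketch.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")   # per the SAM specification


def indels_from_cigar(pos, cigar):
    """Return (ref_position, op, length) for each I or D in a read's CIGAR string.

    pos is the 1-based leftmost mapping position (SAM POS column). A deletion is
    reported at its first deleted base; an insertion at the reference base that
    follows it (an arbitrary convention chosen for this sketch).
    """
    found = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID":
            found.append((ref_pos, op, length))
        if op in CONSUMES_REFERENCE:
            ref_pos += length
    return found


# 5 soft-clipped bases, 20 matches, a 2-base insertion, 30 matches,
# a 3-base deletion, then 10 matches, for a read mapped at position 100.
print(indels_from_cigar(100, "5S20M2I30M3D10M"))
# [(120, 'I', 2), (150, 'D', 3)]
```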
27
The Indel Realigner Workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
28
Implementing RealignerTargetCreator
[Figure: reads laid out as a grid of samples (sample1 to sample7, rows) by sites (Site1 to Site8, columns)]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there
– Need to pass in data for all samples at the same time
30
Implementing IndelRealigner
[Figure: reads laid out as a grid of samples (sample1 to sample7, rows) by sites (Site1 to Site8, columns)]
Once the intervals are identified, reads from any single sample can be realigned individually, based on the sample's own insertion/deletion lengths
– Only need to pass in one sample's data at a time
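The two access patterns suggest two different ways of splitting the work: target creation by region across all samples, realignment by sample. The sketch below only builds GATK 3.x style command lines and does not run anything. File names, the region list, the jar path and the merged `all.intervals` file are hypothetical, and the merging of per-region interval files is assumed to happen in a separate step not shown here.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]   # GATK 3.x style invocation


def target_creator_job(region, bams, reference):
    """One job per region; every sample's BAM is passed in together."""
    cmd = GATK + ["-T", "RealignerTargetCreator", "-R", reference,
                  "-L", region, "-o", region + ".intervals"]
    for bam in bams:
        cmd += ["-I", bam]
    return cmd


def realigner_job(bam, intervals, reference):
    """One job per sample; only that sample's BAM is passed in."""
    out = bam[:-len(".bam")] + ".realigned.bam"
    return GATK + ["-T", "IndelRealigner", "-R", reference, "-I", bam,
                   "-targetIntervals", intervals, "-o", out]


bams = ["sample1.bam", "sample2.bam", "sample3.bam"]   # hypothetical file names
regions = ["chr1", "chr2"]

create_jobs = [target_creator_job(region, bams, "ref.fa") for region in regions]
# Assumption: the per-region interval files have been merged into all.intervals.
realign_jobs = [realigner_job(bam, "all.intervals", "ref.fa") for bam in bams]

for cmd in create_jobs + realign_jobs:
    print(" ".join(cmd))
```

With many samples, the first stage parallelizes over regions and the second over samples, so the number of concurrent jobs and the I/O per job differ between the two stages.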
32
Base Quality Score Recalibration (BQSR) http://www.broadinstitute.org/gatk/guide/best-practices
33
Why BQSR? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
34
The BQSR workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
35
Implementing BaseRecalibrator
[Figure: reads laid out as a grid of samples (sample1 to sample7, rows) by sites (Site1 to Site8, columns)]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to build the recalibration table for the dataset
– Need to pass in all of the data for each sample
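The reason all of a sample's reads are needed is that recalibration rests on counting: for each reported quality, how often the base actually disagrees with the reference at sites not masked as known variants. A heavily simplified sketch of that idea follows. It bins by reported quality only, whereas the real tool uses further covariates, and the add-one smoothing is an assumption of this sketch rather than a statement about GATK's formula.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred quality.

    observations: iterable of (reported_quality, is_mismatch) pairs for bases
    at sites that are not masked as known variants.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)
    table = {}
    for reported_q in totals:
        # Add-one smoothing (an assumption of this sketch) avoids log10(0).
        error_rate = (errors[reported_q] + 1) / (totals[reported_q] + 2)
        table[reported_q] = round(-10 * math.log10(error_rate), 1)
    return table


# Toy data: 998 bases reported as Q30, of which 9 mismatch the reference.
toy = [(30, True)] * 9 + [(30, False)] * 989
table = empirical_qualities(toy)
print(table)
```

In this toy case the smoothed error rate is 10/1000, so bases reported as Q30 behave like Q20: the sequencer was overconfident, which is the kind of bias recalibration is meant to correct.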
37
Variant Calling http://www.broadinstitute.org/gatk/guide/best-practices
38
Variant Calling http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf
39
Implementing Variant Calling
[Figure: reads laid out as a grid of samples (sample1 to sample7, rows) by sites (Site1 to Site8, columns)]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at the site
– Need to pass in data for all samples at a particular site at the same time
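Since the caller needs every sample at a site but sites can be handled independently, a common way to parallelize is to cut the genome into windows and run one multi-sample job per window. A minimal sketch of the windowing step, with made-up chromosome lengths and a hypothetical function name:

```python
def split_into_regions(chrom_lengths, window):
    """Yield 'chrom:start-end' strings (1-based, inclusive) covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            yield f"{chrom}:{start}-{end}"


# Toy lengths for illustration; real values would come from the reference index.
regions = list(split_into_regions({"chr1": 2500, "chr2": 1000}, window=1000))
print(regions)
# ['chr1:1-1000', 'chr1:1001-2000', 'chr1:2001-2500', 'chr2:1-1000']
```

Each region string can then be handed to one calling job together with the BAM files of all samples.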
41
Useful Resource: Variant Callers
42
Acknowledgements
Jonathan Flint, Richard Mott
Robbie Davies, Winni Kretzschmar
Kiran Garimella (GATK)
Leo Goodstadt (Ruffus)
Gerton Lunter (Stampy)
Andy Rimmer (Platypus)
Zam Iqbal (Cortex)
John Broxholme (all software help and maintenance)
Jon Diprose, Robert Esnouf (Clusters)
Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)