NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing. Na Cai, 4th year DPhil in Clinical Medicine. Supervisor: Jonathan Flint.


1 NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing. Na Cai, 4th year DPhil in Clinical Medicine. Supervisor: Jonathan Flint

2 Example projects
CONVERGE
– 1.7x whole genome sequencing in 12,000 Han Chinese women
– 6,000 cases of MD, 6,000 controls
– Detailed questionnaire
– 45 TB of sequencing data
Commercial Outbred Mice
– 0.1x whole genome sequencing in 2,000 mice
– Known breeding history
– Extensive phenotyping
– 2 TB of sequencing data

3 NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

4 Large-scale sequencing projects
Lots of data – Terabytes!
– Storage problems, I/O problems, RAM problems
– Time consuming to process
Errors! Lots of them!
– Contamination
– Duplication
– Missing data
– Difficult regions/features of the genome

5 Approach to NGS data
Explore the data before processing at large scale
Pilot your experiments with small subsets
Try the default parameters of a piece of software before altering them
Check output
– Right number of lines?
– Did anything fail silently?
– Different handling of different classes of input?
– How are missing values coded?
– % failure?
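The output checks on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function names, the tab-delimited layout and the set of missing-value codes are all assumptions to adapt to whatever the tool in question actually writes.

```python
import csv
import io

# Assumed missing-value codes; check how each tool really encodes them.
MISSING_CODES = {"", "NA", "nan", "NaN", "."}


def summarise_table(handle, delimiter="\t"):
    """Count data rows, ragged rows and missing values per column."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    n_rows = 0
    ragged = 0
    missing = {name: 0 for name in header}
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            # A row with the wrong number of fields often marks a silent failure.
            ragged += 1
            continue
        for name, value in zip(header, row):
            if value.strip() in MISSING_CODES:
                missing[name] += 1
    return {"rows": n_rows, "ragged": ragged, "missing": missing}


def check_row_count(summary, expected_rows):
    """True only if the output has exactly the expected number of data rows."""
    return summary["rows"] == expected_rows


# Hypothetical three-sample output: one missing depth, one truncated row.
example = io.StringIO("sample\tdepth\nS1\t1.7\nS2\tNA\nS3\n")
report = summarise_table(example)
print(report)
```

Running this on a pilot subset first, and comparing the row count against the number of input samples, catches most truncated or silently failed outputs before the full run.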

6 Exploratory work in R
– read.table(…, as.is=TRUE, na.strings=c("NA", "nan"))
– dim(), str(), mode(), complete.cases()
– head(), tail()
– table(), summary()
– order(), rank()
– plot(), library(ggplot2)
– library(plyr)

7 Pipeline writing
– Arguments/options for different input
– Arguments/options for parameters/auxiliary files
– Reusable functions
– Reasonably flexible input format recognition
– Set up for parallelizing
– stderr: for debugging and checking progress, but beware of its size and I/O!
– Create new directories as you go along
– Create flag files to indicate successful completion of each step
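Several of these points (command-line options, reusable functions, progress on stderr, directories created on the fly, flag files) fit into one small skeleton. The sketch below is an assumption-laden toy: the step name `filter`, the flag file name `DONE`, the `--min-length` option and the line-filtering task are all hypothetical stand-ins for a real processing step.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    """Options for input, output location and one example parameter."""
    parser = argparse.ArgumentParser(description="Toy one-step pipeline")
    parser.add_argument("--input", required=True, help="input text file")
    parser.add_argument("--outdir", required=True, help="output directory")
    parser.add_argument("--min-length", type=int, default=1,
                        help="hypothetical parameter: minimum line length to keep")
    return parser.parse_args(argv)


def log(message):
    """Progress goes to stderr so that stdout stays clean for data."""
    print(message, file=sys.stderr)


def run_step(name, outdir, func, *args):
    """Run func unless the step's flag file exists; write the flag only on success."""
    step_dir = Path(outdir) / name
    step_dir.mkdir(parents=True, exist_ok=True)  # create directories as you go
    flag = step_dir / "DONE"
    if flag.exists():
        log(f"[{name}] already complete, skipping")
        return False
    func(step_dir, *args)  # an exception here means no flag is written
    flag.touch()
    log(f"[{name}] finished")
    return True


def filter_lines(step_dir, input_path, min_length):
    """Example step: keep lines of at least min_length characters."""
    lines = Path(input_path).read_text().splitlines()
    kept = [line for line in lines if len(line) >= min_length]
    (step_dir / "filtered.txt").write_text("\n".join(kept) + "\n")


def main(argv=None):
    args = parse_args(argv)
    return run_step("filter", args.outdir, filter_lines, args.input, args.min_length)
```

Calling `main(["--input", "reads.txt", "--outdir", "out", "--min-length", "3"])` twice runs the step once and skips it the second time, because the flag file is already present.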

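8 Make
Specify input file and output file
Specify command for input → output
Make checks for the presence of the output file (and whether it is older than the input) before running the command
Make deletes the output of commands that were interrupted; with .DELETE_ON_ERROR it also deletes the output of commands that failed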
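The two behaviours that matter here, skipping work whose output is already up to date and never leaving a half-written output behind, can be imitated in Python. This is only a sketch of the idea, not how Make is implemented; the `.partial` suffix and the function names are assumptions.

```python
import os


def needs_rebuild(source, target):
    """Rebuild if the target is missing or older than its source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


def build(source, target, action):
    """Run action(source, partial_path), then rename to target on success.

    Returns True if the action ran and False if the target was up to date.
    """
    if not needs_rebuild(source, target):
        return False
    partial = target + ".partial"  # assumed naming convention
    try:
        action(source, partial)
    except BaseException:
        # Clean up so a crashed step never looks like a finished one.
        if os.path.exists(partial):
            os.remove(partial)
        raise
    os.replace(partial, target)  # the target only appears once complete
    return True
```

Writing to a temporary name and renaming at the end means that the mere presence of the output file is a reliable sign that the step finished.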

9 Ruffus http://www.ruffus.org.uk
Flexible: one → many and many → one processes
Fully integrated with Python programming
Need only specify the maximum number of cores allowed for parallelisation
Useful printout options to check the pipeline

10 Setting up Ruffus

11 Once Ruffus is set up - Help

12 Once Ruffus is set up – just print

13 NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

14 Processing a raw BAM file
Practical concerns
– Number of samples
– Size of files
– Run time
– Server/cluster usage: how the jobs can be parallelized
Scientific concerns
– Ploidy of genome
– Source of DNA
– Features of genome
– Variation between samples
– Genome coverage
– Error rates

15 Manipulating a BAM file
– Converting between BAMs and FASTQs
– Indexing
– Coordinate sorting
– Splitting or merging
– Filtering out reads using bitwise flags/other criteria
– Masking entire regions
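Filtering on bitwise flags is easier to reason about with the flag bits spelled out. The sketch below works on SAM text lines (the decompressed form of a BAM) using the flag values from the SAM specification; the function names are illustrative, and the `require`/`exclude` arguments are modelled on the `-f`/`-F` options of `samtools view`.

```python
# Bit values of the FLAG field, as defined in the SAM specification.
FLAGS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "mate_unmapped": 0x8,
    "reverse": 0x10,
    "mate_reverse": 0x20,
    "read1": 0x40,
    "read2": 0x80,
    "secondary": 0x100,
    "qc_fail": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}


def decode_flag(flag):
    """List the names of all bits set in a SAM flag."""
    return [name for name, bit in FLAGS.items() if flag & bit]


def keep_read(flag, require=0, exclude=0):
    """Keep a read if all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


def filter_sam_lines(lines, require=0, exclude=0):
    """Yield header lines unchanged and alignment lines that pass the flag filter."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if keep_read(flag, require, exclude):
            yield line


print(decode_flag(99))
```

For example, `exclude=FLAGS["unmapped"] | FLAGS["duplicate"]` drops unmapped reads and reads marked as duplicates while keeping everything else.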

16 Example: Contaminants

17

18 Useful Resource: Harvard Sysbio
Remove duplicate sequences in FASTA
Remove short sequences in FASTA
Format FASTA
http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
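The Scriptome page provides one-liners for these tasks; the same three operations can be sketched in plain Python as below. This is an independent illustration, not a translation of the Scriptome code: the function names are made up, and treating sequences as duplicates regardless of case is an assumption.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_fasta(records, min_length=0):
    """Drop sequences shorter than min_length and repeated sequences.

    The first record with a given sequence is kept; comparison ignores case.
    """
    seen = set()
    for name, seq in records:
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield name, seq


def format_fasta(records, width=60):
    """Return FASTA text with sequence lines wrapped at 'width' characters."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


text = ">a\nACGT\nAC\n>b\nacgtac\n>c\nAC\n>d\nGGGGGG\n"
cleaned = list(clean_fasta(read_fasta(text.splitlines()), min_length=3))
print(format_fasta(cleaned, width=4))
```

In the example, record b is dropped as a duplicate of a and record c is dropped as too short.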

19 Useful Resource: NGSUtils
Tools (in Python) for FASTA, BAM, BED, GTF file processing
E.g. bamutils filter can filter out reads with more than x mismatches
http://ngsutils.org

20 Useful Resource: PicardTools
Tools (in Java) for BAM and FASTA processing
Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
http://broadinstitute.github.io/picard

21 Useful Resource: GATK
Tools (in Java) for NGS processing and analysis
Cool things about it: Best Practices page, Forum, Tutorials, Presentations
https://www.broadinstitute.org/gatk/

22 Useful Resource: GATK http://www.broadinstitute.org/gatk/guide/best-practices

23 Indel Realignment http://www.broadinstitute.org/gatk/guide/best-practices

24 Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

25 Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

26 How does it work?
Identified intervals:
– Known indels
– Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
– Reads where there is evidence of possible misalignment
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
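To make "indels discovered in CIGAR strings" concrete, the sketch below pulls insertions and deletions out of a CIGAR string and pads them into candidate intervals. It illustrates the idea only and is not the GATK algorithm: the padding size, the merging rule and the function names are assumptions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def indels_from_cigar(pos, cigar):
    """Return (kind, ref_pos, length) for each I or D operation.

    pos is the 1-based leftmost mapped position, as in the SAM POS column.
    For an insertion, ref_pos is the reference base just after the inserted
    sequence; for a deletion, it is the first deleted reference base.
    """
    events = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("I", "D"):
            events.append((op, ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return events


def candidate_intervals(events, pad=5):
    """Pad each indel by 'pad' bases (an assumed value) and merge overlaps."""
    raw = []
    for kind, ref_pos, length in events:
        span = length if kind == "D" else 0
        raw.append((max(1, ref_pos - pad), ref_pos + span + pad))
    merged = []
    for start, end in sorted(raw):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


events = indels_from_cigar(100, "10M2I5M3D20M")
print(events)
print(candidate_intervals(events))
```

Soft-clipped bases (S) do not consume the reference, so a read with CIGAR `5S10M1D5M` at position 50 reports its deletion at reference position 60.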

27 The Indel Realigner Workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

28 Implementing RealignerTargetCreator
[Figure: grid of reads per sample (sample1 to sample7) across sites Site1 to Site8]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there → need to pass in data for all samples at the same time

29

30 Implementing IndelRealigner
[Figure: grid of reads per sample (sample1 to sample7) across sites Site1 to Site8]
Once the intervals are identified, reads from any single sample can be realigned individually based on the sample's own insertion/deletion lengths → only need to pass in one sample's data at a time
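The contrast between the two steps suggests two ways of splitting the work: by genomic region with every sample in each job (RealignerTargetCreator), or by sample (IndelRealigner). The sketch below only plans the jobs and returns plain dictionaries; it does not build real command lines. The chunk size, file names and function names are hypothetical.

```python
def region_chunks(chrom_lengths, chunk_size):
    """Yield 'chrom:start-end' strings (1-based, inclusive) covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield f"{chrom}:{start}-{end}"


def jobs_by_region(tool, bams, chrom_lengths, chunk_size):
    """One job per region; every job sees all samples (site-wise steps)."""
    return [
        {"tool": tool, "inputs": list(bams), "region": region}
        for region in region_chunks(chrom_lengths, chunk_size)
    ]


def jobs_by_sample(tool, bams):
    """One job per sample; each job sees a single BAM (sample-wise steps)."""
    return [{"tool": tool, "inputs": [bam], "region": None} for bam in bams]


# Hypothetical inputs, for illustration only.
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]
chrom_lengths = {"chr1": 250, "chr2": 100}

target_jobs = jobs_by_region("RealignerTargetCreator", bams, chrom_lengths, 100)
realign_jobs = jobs_by_sample("IndelRealigner", bams)
print(len(target_jobs), len(realign_jobs))
```

Each dictionary can then be turned into a cluster submission or a pipeline task; the point is that the unit of parallelisation follows from what data each tool needs to see at once.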

31

32 Base Quality Score Recalibration (BQSR) http://www.broadinstitute.org/gatk/guide/best-practices

33 Why BQSR? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

34 The BQSR workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

35 Implementing BaseRecalibrator
[Figure: grid of reads per sample (sample1 to sample7) across sites Site1 to Site8]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset → need to pass in all of the data of each sample
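Why the whole of a sample's data is needed becomes clearer with a toy recalibration table: the empirical quality of a bin is estimated from how often bases in that bin mismatch the reference at unmasked sites, so more observations give a better estimate. The sketch below is a deliberate simplification and not the GATK implementation. It bins on the reported quality only, whereas the real tool also uses covariates such as read group, machine cycle and sequence context, and the +1/+2 smoothing is an assumption made here to avoid taking the logarithm of zero.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Phred scale: Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(error_probability)


def recalibration_table(observations):
    """Map each reported quality to an empirical quality.

    observations: iterable of (reported_quality, is_mismatch) pairs taken
    from sites that are not masked as known variants.
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [bases, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (n_bases, n_mismatches) in sorted(counts.items()):
        # Assumed smoothing: add one mismatch and two observations.
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        table[reported_q] = round(phred(error_rate), 1)
    return table


# Hypothetical data: 998 bases reported as Q30, 9 of which mismatch.
observations = [(30, i < 9) for i in range(998)]
table = recalibration_table(observations)
print(table)
```

In this made-up example the bases reported as Q30 behave more like Q20, which is the kind of discrepancy recalibration is meant to correct.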

36

37 Variant Calling http://www.broadinstitute.org/gatk/guide/best-practices

38 Variant Calling http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf

39 Implementing Variant Calling
[Figure: grid of reads per sample (sample1 to sample7) across sites Site1 to Site8]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine if there is a variant at the site → need to pass in data for all samples at a particular site at the same time

40

41 Useful Resource: Variant Callers

42 Acknowledgements
Jonathan Flint, Richard Mott
Robbie Davies, Winni Kretzschmar
Kiran Garimella (GATK)
Leo Goodstadt (Ruffus)
Gerton Lunter (Stampy)
Andy Rimmer (Platypus)
Zam Iqbal (Cortex)
John Broxholme (all software help and maintenance)
Jon Diprose, Robert Esnouf (Clusters)
Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)

