DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments

O. Valladares 1,2, C.-F. Lin 1,2, D. M. Childress 1,2, E. Klevak 3, E. T. Geller 1, Y.-C. Hwang 2,4, E. A. Tsai 4,5, A. B. Partch 1,2, G. D. Schellenberg 1, L.-S. Wang 1,2

1) Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA; 2) Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA; 3) Department of Physics, University of Washington, Seattle, WA; 4) Genomics and Computational Biology Graduate Group, University of Pennsylvania, Philadelphia, PA; 5) Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA.

Introduction

Next-generation sequencing (NGS) has redefined what big data means in biomedical research. Advances in quality and capacity have lowered the cost of implementation, allowing NGS to be used in a wide range of experiments at a variety of scales, from a few samples in small laboratories to thousands of samples in multi-institute collaborations. Processing terabytes of data requires a level of information technology and bioinformatics expertise that can be daunting to small laboratories with limited resources. The programs we developed enable these groups to process DNA-seq data and identify single-nucleotide variants and small insertions and deletions (indels).

DRAW: DNA Resequencing Analysis Workflow

- Integrates open-source programs to analyze DNA-seq data in a Linux environment: GATK, SAMtools, BWA, Picard, SnpEff
- Operates on a distributed resource management system (Oracle Grid Engine)
- Job dependency and error checking
- Available on Amazon Elastic Compute Cloud (EC2)

Acknowledgements

We thank members of the Schellenberg and Wang labs, collaborators from the ARRA autism sequencing consortium, Nancy B. Spinner, Samir Wadwahan, Maja Bucan, Chris Stoeckert, and members of the Penn HTS group for their constructive input.
Funding: The authors gratefully acknowledge funding from NIMH (R01 MH089004, R01 MH094382, and R01 MH094382), NIA (U24 AG041689, U01 AG032984, P30 AG010124), NINDS (P50 NS053488), and the CurePSP Foundation.

SneakPeek: Quality Metrics Management System

- Provides an overview of all samples processed through a dynamic web interface
- Allows users to assess the quality of sequencing data
- Identifies samples with unusual QC metric(s)
- Identifies batch problems

DRAW+SneakPeek Availability

- Released under the MIT license
- Free for academic and non-profit use
- Available at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS): source code, Amazon Machine Images (AMIs), install guide, documentation, and tutorial

Running DRAW

One command will run all three phases of DRAW.

Input: Demultiplexed FastQ files
Phase 1: Mapping using BWA (align reads, paired ends)
Phase 2: QC using GATK/Picard/Samtools (mark duplicates, local realignment, base quality recalibration)
Phase 3: Variant and coverage using GATK/snpEff (variant detection, filtration, annotation)
Phase 5: Import into MySQL tables using in-house scripts (quality metrics on SneakPeek)
Output: Read, base/depth coverage, and QC metrics; annotated VCF file

Running DRAW on Amazon EC2: A benchmark study

- One flow cell: Illumina HiSeq 2000, 100-bp paired-end, ~350 Gb, 34 multiplexed samples using the NimbleGen Human Exome v2 library
- 1.1 TB of data in two days; total cost $528

Running DRAW on Amazon

Guide available on NIAGADS.org

Figure panels: What Motivates DRAW+SneakPeek; Features of DRAW

Workflow Comparison

A comparison of DRAW+SneakPeek with other workflows.

Reference

Lin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, and Wang LS. DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments. Bioinformatics, Oct 1;29(19): Epub 2013 Aug 13.
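
Illustrative sketch: job dependency on Grid Engine

The poster states that DRAW runs on Oracle Grid Engine with job dependency between phases, but it does not show how jobs are chained. The sketch below is one plausible way to express such a chain, not DRAW's actual implementation. It only builds `qsub` argument lists, using the standard Grid Engine options `-N` (job name) and `-hold_jid` (wait for a named job), so that each phase is held until the previous one finishes. The phase labels, script names, and sample name are hypothetical placeholders.

```python
from typing import List, Optional

# Phase labels mirror the three DRAW phases named on the poster.
# The script names are hypothetical; they are not DRAW's real scripts.
PHASES = [
    ("phase1_mapping", "phase1_mapping.sh"),
    ("phase2_qc", "phase2_qc.sh"),
    ("phase3_variants", "phase3_variants.sh"),
]


def build_submissions(sample: str) -> List[List[str]]:
    """Return one qsub argument list per phase, each held on the previous phase."""
    commands: List[List[str]] = []
    previous: Optional[str] = None
    for label, script in PHASES:
        job_name = f"{sample}_{label}"
        cmd = ["qsub", "-N", job_name]
        if previous is not None:
            # -hold_jid keeps this job queued until the named job has finished.
            cmd += ["-hold_jid", previous]
        cmd += [script, sample]
        commands.append(cmd)
        previous = job_name
    return commands


if __name__ == "__main__":
    # Print the commands instead of submitting them, so no cluster is needed.
    for cmd in build_submissions("sampleA"):
        print(" ".join(cmd))
```

Note that `-hold_jid` releases a held job once its predecessor has finished, regardless of whether it succeeded, so a chain like this would still need separate error checking of each phase's output.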