Cloud-based NGS data analysis

Cloud-based NGS data analysis of KM12 cell line

Ettore Rizzo (1,2), Ph.D., Roberta Bosotti (2), Giovanni Carapezza (2), Sebastiano Di Bella (2), Antonella Isacchi (2), Riccardo Bellazzi (1), Ph.D.

(1) Laboratory of Bioinformatics, Mathematical Modelling and Synthetic Biology, University of Pavia, via Ferrata 1, Pavia 27100, Italy
(2) Nerviano Medical Sciences S.r.l., Via Pasteur Louis, 10, 20014 Nerviano (Milan), Italy

BACKGROUND. To detect DNA variants and investigate gene expression data, we developed two scalable, parallelizable workflows that run on the cloud (e.g., Amazon Web Services, AWS), reducing analysis time while remaining cost-effective. The pipelines integrate state-of-the-art bioinformatics tools and are built on top of COSMOS [1], a workflow management system that lowers the cost of genomic data analysis in two ways: 1) it implements a highly parallelizable workflow that can be run quickly and efficiently on a large compute cluster, and 2) it takes advantage of AWS spot-instance pricing to reduce the cost per hour. As a test case, the pipelines were applied to Next-Generation Sequencing data from DNA and RNA extracted from the KM12 human colorectal cancer cell line, covering whole-exome and whole-transcriptome sequencing. The analysis highlighted the characteristic TPM3-NTRK1 genomic rearrangement harbored by this cell line [2].

MATERIALS AND METHODS. This study focuses on the implementation of two COSMOS workflows: one performs variant discovery in DNAseq data, the other evaluates differential gene expression in RNAseq data. The DNAseq workflow implements the GATK [3] best practice protocol (Broad Institute), a widely accepted analysis standard. It involves the following steps: mapping and marking duplicates; local realignment around indels; base quality score recalibration; variant calling with HaplotypeCaller; and variant quality score recalibration.
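The ordering of the DNAseq steps listed above can be sketched as a small dependency graph. This is only an illustration using the Python standard library (`graphlib`, Python 3.9+), not the COSMOS API. The stage names are hypothetical labels for the steps named in the text, and splitting "mapping and marking duplicates" into two stages is an assumption made for clarity.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names for the DNAseq steps named in the text.
# Each stage maps to the set of stages it depends on.
# This is an illustrative sketch, not the COSMOS workflow definition.
DNASEQ_STAGES = {
    "map_reads": set(),
    "mark_duplicates": {"map_reads"},
    "realign_around_indels": {"mark_duplicates"},
    "recalibrate_base_quality": {"realign_around_indels"},
    "call_variants_haplotypecaller": {"recalibrate_base_quality"},
    "recalibrate_variant_quality": {"call_variants_haplotypecaller"},
}


def execution_order(stages):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


order = execution_order(DNASEQ_STAGES)
for position, stage in enumerate(order, start=1):
    print(position, stage)
```

Expressing the steps as a graph rather than a fixed list means that independent branches, if any were added, could be scheduled concurrently by a workflow manager.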
The DNAseq pipeline also includes annotation with Annovar [4] and structural variation screening with DELLY [5], a structural variant (SV) discovery method suited to detecting copy-number alterations, duplication events, and balanced rearrangements (inversions, translocations).

The RNAseq workflow implements the TCGA RNAseq pipeline. Reads are aligned to the reference genome with MapSplice [6], which also detects splice junctions. Isoform-level and gene-level abundances are then estimated with RSEM [7]. Finally, differential expression analysis is performed with the Bioconductor package edgeR [8].

To automate and simplify building, configuring, and managing the AWS EC2 cluster that runs these pipelines, we rely on the StarCluster toolkit. It launches and shuts down cluster nodes without user intervention and automatically installs both a job manager and a file-sharing system on all cluster nodes. After the workflow management system loads a workflow, a workflow parser breaks each stage into multiple jobs that are then executed in parallel. Jobs are distributed from a master node to worker nodes by a standard job manager such as Grid Engine. Through a dynamic web interface provided by COSMOS, users can monitor in real time the state of their workflows, the job dependencies, and the resources used by each job.

RESULTS. The pipelines were executed in a cloud computing environment of 5 nodes (1 master and 4 workers), each a "cc2.8xlarge" AWS instance with 32 cores and 60 GB of RAM. The DNAseq analysis took less than 3 hours of AWS wall-clock time from raw data processing through the annotation step and, more importantly, cost less than 50€. The RNAseq analysis took less than 2 hours and cost less than 40€.
To evaluate experiment quality and the accuracy of our DNAseq pipeline, we compared our SNV calls against publicly available NGS data for KM12 (CCLE, Cancer Cell Line Encyclopedia [9]), obtaining a 95% overlap between call sets. Finally, the known TPM3-NTRK1 rearrangement in KM12 was detected by both exome and transcriptome analysis (see below).

REFERENCES
[1] Gafni, Erik, et al. "COSMOS: Python library for massively parallel workflows." Bioinformatics (2014): btu385.
[2] Ardini, Elena, et al. "The TPM3-NTRK1 rearrangement is a recurring event in colorectal carcinoma and is associated with tumor sensitivity to TRKA kinase inhibition." Molecular Oncology 8.8 (2014): 1495-1507.
[3] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome Research 20.9 (2010): 1297-1303.
[4] Wang, Kai, Mingyao Li, and Hakon Hakonarson. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data." Nucleic Acids Research 38.16 (2010): e164.
[5] Rausch, Tobias, et al. "DELLY: structural variant discovery by integrated paired-end and split-read analysis." Bioinformatics 28.18 (2012): i333-i339.
[6] Wang, Kai, et al. "MapSplice: accurate mapping of RNA-seq reads for splice junction discovery." Nucleic Acids Research 38.18 (2010): e178.
[7] Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12.1 (2011): 323.
[8] Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26.1 (2010): 139-140.
[9] Barretina, Jordi, et al. "The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity." Nature 483.7391 (2012): 603-607.
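As an illustration of the call-set comparison reported in the Results, the sketch below computes the overlap between two sets of SNV calls. The text does not state how the 95% overlap was defined, so the choice of denominator here (the fraction of the public call set recovered by the pipeline) is an assumption. The `Variant` type, the function name, and the toy variants are hypothetical and are not KM12 data.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A single-nucleotide variant identified by position and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def overlap_fraction(calls: Set[Variant], reference: Set[Variant]) -> float:
    """Fraction of `reference` variants that are also present in `calls`.

    Assumption: overlap is measured relative to the reference call set;
    the source does not specify the definition it used.
    """
    if not reference:
        raise ValueError("reference call set is empty")
    return len(calls & reference) / len(reference)


# Toy values for illustration only (not real KM12 variants).
pipeline_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
}
public_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 75, "G", "A"),
    Variant("chr3", 10, "T", "C"),
    Variant("chr4", 5, "A", "T"),
}

fraction = overlap_fraction(pipeline_calls, public_calls)
print(f"{fraction:.0%} of the public calls were recovered")
```

Keying each variant on chromosome, position, and both alleles means that two calls at the same site with different alternate alleles are counted as distinct, which is a stricter comparison than matching on position alone.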