NGS Data Analysis with the Galaxy Platform - an application to ChIP-seq. Monterotondo, 16 April 2015. Charles Girardot, Genome Biology Computational Support.

Presentation transcript:

0 NGS Data Analysis with the Galaxy Platform - an application to ChIP-seq Monterotondo, 16 April 2015 Charles Girardot Genome Biology Computational Support (GBCS)

What is Galaxy? 1
How does a computational biologist work?
- Install the “tool” on their computer
- Check how to run the “tool”
- Execute the command line
- Look at the results
Diagram: input dataset file(s), private (e.g. NGS reads) or public (e.g. genes), go in to a Tool (e.g. a read mapper); result dataset file(s) come out (e.g. mapped reads, pictures, statistics).

What is Galaxy? 2
But ...
- tool installation is not always a piece of cake
- “public” files need to be assembled
- the task needs to be executed on a server or a compute cluster
- results might not be human-readable (appropriate tools/software needed)
Diagram: input dataset file(s), private (e.g. NGS reads) or public (e.g. genes), go in to a Tool (e.g. a read mapper); result dataset file(s) come out (e.g. mapped reads, pictures, statistics).

What is Galaxy? 3
Multiple tools need to be chained in a defined order (in -> Tool 1 -> out/in -> Tool 2 -> out).
... you don’t want to manually check when “Tool 1” has completed!
=> Organize command lines in a “script”
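
As an illustration of what such a "script" might look like, here is a minimal Python sketch that runs two commands in order and feeds the output of the first to the second. `TOOL_1` and `TOOL_2` are hypothetical stand-ins (tiny Python one-liners) rather than real NGS tools, and `run_chain` is a name chosen for this example.

```python
import subprocess
import sys

# Hypothetical stand-ins for "Tool 1" and "Tool 2": each reads stdin, writes stdout.
TOOL_1 = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
TOOL_2 = [sys.executable, "-c", "import sys; print(sys.stdin.read()[::-1], end='')"]


def run_chain(commands, first_input=""):
    """Run commands in order, passing each output to the next command.

    check=True makes the chain stop (CalledProcessError) as soon as one
    tool fails, so the next tool never starts on incomplete data.
    """
    data = first_input
    for cmd in commands:
        done = subprocess.run(cmd, input=data, capture_output=True,
                              text=True, check=True)
        data = done.stdout
    return data


if __name__ == "__main__":
    print(run_chain([TOOL_1, TOOL_2], "acgt"))
```

The script waits for "Tool 1" before starting "Tool 2", which is exactly the manual check the slide wants to avoid.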

What is Galaxy? 4
- Multiple outputs need to be combined
- Execution of tools should be controlled (parallel processing)
- Workflows contain dozens of steps and generate a lot of temporary data
- Some tool executions might fail
... you don’t want to implement flow control in your script!
=> How do you adapt the needed resources (CPU, memory) in a single script?
Diagram: a workflow connecting Tool 1, Tool 2, Tool 3 and Tool 4.

What is Galaxy? 5
Sequencer throughput requires parallel processing of multiple samples.
=> How do you efficiently monitor all these workflow executions?
Diagram: the same Tool 1 to Tool 4 workflow, repeated once per sample.
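
To make the flow-control burden concrete, the sketch below shows roughly what a hand-written script needs just to run one workflow per sample in parallel and keep track of failures. `process_sample` is a hypothetical placeholder for a whole per-sample workflow; the sample names are borrowed from the replicate slide and used only as labels.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_sample(sample):
    """Hypothetical placeholder for a complete per-sample workflow."""
    if not sample:
        raise ValueError("empty sample name")
    return f"{sample}.done"


def run_all(samples, max_workers=4):
    """Run process_sample on every sample in parallel.

    Returns (results, failures) so that one failed sample does not
    hide the outcome of the others.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                results[sample] = future.result()
            except Exception as exc:  # record the failure, keep going
                failures[sample] = str(exc)
    return results, failures


if __name__ == "__main__":
    ok, failed = run_all(["YG1", "YG2", ""])
    print("completed:", ok)
    print("failed:", failed)
```

Even this small example has to handle scheduling, error collection and reporting by hand; resource requests per tool and restarts are not covered at all.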

What is Galaxy? 6
Galaxy is a web application that does ALL this for you (and more).
Open source platform for high-throughput genomics:
- analyze (genomics) data and create workflows
- get and integrate public + private data
- visualize, share and publish
- encourage reproducible science
- tool installation from the public Tool Shed
- enable biologists to analyze their data without bioinformaticians
The Galaxy Project is supported in part by NSF, NHGRI, The Huck Institutes of the Life Sciences, The Institute for CyberScience at Penn State, and Johns Hopkins University.

What is Galaxy? 7
Extremely active and popular project:
- more than 60 public servers, some with a specific focus (proteomics, metagenomics, metabolomics)
- solid and stable team of developers
- user conferences occurring regularly, on all continents
- national Galaxy hubs, mailing lists and workshop events
- tons of online learning material
Learning Galaxy is a valuable long-term investment, both for biologists and bioinformaticians.
Provides advanced features for bioinformaticians:
- RESTful APIs and the bioblend scripting interface
- can be launched on the cloud
- ...
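
As a rough idea of what the bioblend scripting interface allows, the sketch below invokes one workflow per sample, each in its own new history. It assumes the third-party `bioblend` package is installed and that the server URL, API key, workflow ID and dataset IDs are supplied by the caller; all of these, and the helper names, are placeholders for illustration. Keying inputs by step index ("0", "1", ...) is also an assumption about how the workflow's inputs are laid out.

```python
def build_workflow_inputs(dataset_ids):
    """Map workflow input steps ("0", "1", ...) to history datasets ("hda")."""
    return {str(index): {"id": dataset_id, "src": "hda"}
            for index, dataset_id in enumerate(dataset_ids)}


def invoke_for_samples(galaxy_url, api_key, workflow_id, samples):
    """Invoke the same workflow once per sample.

    samples maps a sample name to the list of dataset IDs it needs.
    Requires the bioblend package; imported here so that the helper
    above stays usable without it.
    """
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=galaxy_url, key=api_key)
    invocations = {}
    for name, dataset_ids in samples.items():
        invocations[name] = gi.workflows.invoke_workflow(
            workflow_id,
            inputs=build_workflow_inputs(dataset_ids),
            history_name=f"{name} analysis",  # one new history per sample
        )
    return invocations


if __name__ == "__main__":
    # Placeholder IDs: replace with values from your own Galaxy server.
    print(build_workflow_inputs(["dataset-id-1", "dataset-id-2"]))
```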

Why a local Galaxy? 8
Working with public Galaxy servers is limited:
- volume of data to transfer back and forth
- confidentiality of data
- impossibility of integrating custom tools
Our HD-Galaxy is also relevant to you:
- sample sequencing occurs in HD
- raw files are automatically streamed to Galaxy and emBASE
- Galaxy gives you access to all HD resources and compute power
- store your data on your file servers
- use other tools through web interfaces or the command line
- transfer only relevant, processed data to MR (even sync at night)
We maintain a Galaxy Server at

Welcome to Galaxy 9
Screenshot labels:
- Tools: search tools; tools are organized by categories, click a category to see all its tools
- Main panel: launch jobs / view results
- History
- Personal workflows == pipelines

Notable Tool Categories 10
- Upload personal files
- Fetch public data from UCSC, ENSEMBL, mouseMINE ...
- NGS tools, organized by application (QC, mapping, peak calling, RNA, ...) or by package name (bedtools, picard)
- General sequence and text format manipulation tools
- Proteomics tools
- Utility (local), Test (beta version) and Deprecated (for backward compatibility)

Running a Tool is easy 11
Click a tool to bring it up in the middle panel, e.g. FastQC.

Running a Tool is easy 12
(1) Select input files
(2) Set parameters
(3) Click Execute
The job is submitted to the compute cluster => a new dataset block is added to the active history:
- Green: successfully completed
- Yellow: running
- Grey: waiting
- Red: failed job
Running the tool on many files is easy too!

Tool summary 13
- More than 350 tools available!
- All results can be downloaded or directly transferred to your project folder
- Missing tools can be easily integrated
- Easy way to add a GUI to your own script
- Parameters for cluster submission can be adjusted for each tool (and even be dynamically computed)

Tools can be assembled into workflows 14

Fetch public data from popular resources 15
Fetch data from the UCSC Table Browser, but also from Ensembl, SRA, MouseMine ...

Organize your files in Libraries 16
- Each lab has its own access-protected library
- Datasets are organized into “folders” (can be nested)
- We add your files automatically upon data release from GeneCore
- These files are links, to avoid wasting space

How do you like your ChIP? 17
ChIP against ...
- TFs
- Co-factors
- Histone modifications
- Nucleosomes
- Transcription machinery
- ...
To find ...
- regulatory elements
- activity states
- ...

Overview of ChIP-seq processing 18
[Park et al, Nat. Gen. 2009]
1. Sequencing: single end, no strand specificity (one strand is sequenced at random)
QC: quality per base, GC content ...
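
The two read-level QC metrics named here can be sketched in a few lines of Python. This is only an illustration of what the metrics mean, not what FastQC does internally; it assumes Phred+33 encoded quality strings, and the function names are made up for this example.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def mean_quality_per_base(quality_strings, offset=33):
    """Mean Phred score at each read position.

    Assumes Phred+33 encoding (offset=33); reads may differ in length.
    """
    if not quality_strings:
        return []
    length = max(len(q) for q in quality_strings)
    totals = [0] * length
    counts = [0] * length
    for quality in quality_strings:
        for position, char in enumerate(quality):
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    print(gc_content("ACGT"))
    print(mean_quality_per_base(["II", "5I"]))
```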

Overview of ChIP-seq processing 19
[Park et al, Nat. Gen. 2009]
2. Mapping:
- positive-strand reads map upstream of the target
- negative-strand reads map downstream of the target
QC:
- #pos/#neg == 1 is expected
- % of reads that do not map
- % of reads that map at multiple positions
- % of duplicate reads
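
A minimal sketch of how these mapping QC numbers could be derived from SAM records, given only each record's FLAG and MAPQ. The flag bits are those of the SAM specification; treating MAPQ 0 as "maps at multiple positions" is an assumption that depends on the aligner, and the duplicate bit is only meaningful if duplicates were marked beforehand.

```python
# SAM FLAG bits (from the SAM specification)
UNMAPPED = 0x4
REVERSE = 0x10
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def mapping_qc(records, multimap_mapq=0):
    """Summarise (flag, mapq) pairs into the QC metrics listed above.

    Assumption: a mapped read with MAPQ <= multimap_mapq is counted as
    multi-mapping; the right threshold depends on the aligner.
    """
    total = unmapped = multi = duplicates = plus = minus = 0
    for flag, mapq in records:
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count every read only once
        total += 1
        if flag & UNMAPPED:
            unmapped += 1
            continue
        if mapq <= multimap_mapq:
            multi += 1
        if flag & DUPLICATE:
            duplicates += 1
        if flag & REVERSE:
            minus += 1
        else:
            plus += 1

    def pct(count):
        return 100.0 * count / total if total else 0.0

    return {
        "pos_neg_ratio": plus / minus if minus else None,
        "pct_unmapped": pct(unmapped),
        "pct_multimapped": pct(multi),
        "pct_duplicates": pct(duplicates),
    }


if __name__ == "__main__":
    example = [(0, 30), (16, 30), (4, 0), (1024, 30), (16, 0)]
    print(mapping_qc(example))
```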

Overview of ChIP-seq processing 20
[Park et al, Nat. Gen. 2009]
3. Strand-specific coverage:
- symmetrical distribution expected
- summits separated by the average fragment length
QC: strand cross-correlation
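
The idea behind strand cross-correlation can be sketched as follows: correlate the plus-strand signal with the minus-strand signal shifted by k bases, for a range of k; the correlation should peak when k is close to the fragment length. This is a toy pure-Python version on per-base count lists, meant to show the principle rather than reproduce any existing tool; the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlation between plus[i] and minus[i + shift] for shift = 0..max_shift.

    plus and minus are per-base read-start counts over the same region.
    """
    profile = {}
    for shift in range(max_shift + 1):
        n = len(plus) - shift
        profile[shift] = pearson(plus[:n], minus[shift:shift + n])
    return profile


if __name__ == "__main__":
    # Toy data: the minus-strand signal is the plus-strand signal moved 3 bases right.
    plus = [0, 4, 0, 0, 2, 0, 0, 0, 0, 0]
    minus = [0, 0, 0, 0, 4, 0, 0, 2, 0, 0]
    profile = strand_cross_correlation(plus, minus, max_shift=5)
    print(max(profile, key=profile.get))
```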

Strand cross-correlation 21
Landt et al, Gen. Res., 2012
- NSC values < 1.1 are relatively low (ENCODE)
- Highly enriched experiments have RSC > 1
- RSC << 1 may indicate low quality
- A “ChIP Quality” score is derived from these metrics
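
For readers unfamiliar with the two metrics: following Landt et al. (2012), NSC is the cross-correlation at the fragment-length shift divided by the minimum (background) cross-correlation, and RSC is the fragment-length peak minus background divided by the read-length ("phantom") peak minus background. The sketch below computes both from a precomputed profile; the profile values in the example are invented for illustration, and the function name is hypothetical.

```python
def nsc_rsc(cc_by_shift, fragment_shift, read_length):
    """Return (NSC, RSC) from a strand cross-correlation profile.

    cc_by_shift maps a strand shift in bp to the cross-correlation value.
    fragment_shift is the shift of the fragment-length peak and
    read_length the shift of the "phantom" peak; both must be keys.
    """
    cc_min = min(cc_by_shift.values())
    cc_fragment = cc_by_shift[fragment_shift]
    cc_phantom = cc_by_shift[read_length]
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


if __name__ == "__main__":
    # Invented profile: fragment peak at 180 bp, phantom peak at the 36 bp read length.
    profile = {0: 0.22, 36: 0.25, 100: 0.27, 180: 0.30, 400: 0.20}
    nsc, rsc = nsc_rsc(profile, fragment_shift=180, read_length=36)
    print(round(nsc, 2), round(rsc, 2))
```

With these made-up values the experiment would sit above both thresholds quoted on the slide (NSC well above 1.1, RSC above 1).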

Strand cross-correlation 22 Landt et al, Gen. Res., 2012 cross-correlation metrics for ENCODE datasets

Overview of ChIP-seq processing 23
[Park et al, Nat. Gen. 2009]
4. ChIP profile map:
- shift each distribution by half the fragment length
- extend each read to the fragment length
- used for peak calling
QC:
- visualize in a genome browser
- BAM fingerprint
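
Both ways of building the profile can be sketched on a single read. The example assumes 0-based, half-open coordinates and a known fragment length; the function names are made up for this illustration.

```python
def extend_read(start, end, strand, fragment_length):
    """Extend a read to the fragment length in its 3' direction.

    Coordinates are 0-based, half-open. Plus-strand reads grow to the
    right, minus-strand reads grow to the left (never below 0).
    """
    if strand == "+":
        return start, max(end, start + fragment_length)
    return max(0, min(start, end - fragment_length)), end


def shift_to_centre(start, end, strand, fragment_length):
    """Shift a read's 5' end by half the fragment length, towards the fragment centre."""
    half = fragment_length // 2
    if strand == "+":
        return start + half
    return max(0, end - half)


if __name__ == "__main__":
    print(extend_read(100, 136, "+", 200))
    print(extend_read(500, 536, "-", 200))
    print(shift_to_centre(100, 136, "+", 200))
    print(shift_to_centre(500, 536, "-", 200))
```

Either way, reads from both strands end up piling on the same position, the presumed binding site, which is what peak callers then score.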

BAM Fingerprints (Deeptools) 24
Visualize enrichment without calling peaks!
In section “NGS: Deeptools”
Plot axes: cumulative read sum vs. rank
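
The curve itself is easy to picture with a small sketch: count reads in fixed-size genomic bins, sort the bins by count, and plot the cumulative fraction of reads against the bin rank. The code below is a simplified illustration of that idea, not the deepTools implementation, and the bin counts are invented.

```python
def fingerprint(bin_counts):
    """Return (rank fraction, cumulative read fraction) points.

    Bins are sorted from the lowest to the highest read count. A flat
    sample (input-like) follows the diagonal; a strongly enriched ChIP
    stays low and rises sharply at the highest ranks.
    """
    ordered = sorted(bin_counts)
    total = sum(ordered)
    n_bins = len(ordered)
    points = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        points.append((rank / n_bins, running / total if total else 0.0))
    return points


if __name__ == "__main__":
    print(fingerprint([5, 5, 5, 5]))    # uniform coverage: diagonal
    print(fingerprint([0, 0, 0, 10]))   # all reads in one bin: sharp elbow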

All these steps and QC metrics can be modeled in one workflow 25
One-sample “Read QC, Mapping and Filtering WF”, i.e. does not include peak calling.
This workflow is public and can be used as such, or copied and modified to your needs.

… executed on X samples in just one click 26

each workflow is executed in its own history 27

Easily visualize results of each step 28
- Image results from tools
- Tabular datasets (e.g. BED)
- HTML reports

Trackster: Embedded Genome Browser 29
- BAM files
- bigWig files
- BED files
Visualizations can be saved and shared.

Interactive Charting 30

Overview of ChIP-seq processing 31
[Park et al, Nat. Gen. 2009]
5. Find peaks
Detection of enriched regions must be adapted to the ChIPed target:
- Sequence-specific binding: “point source”, e.g. TF
- Mixture (Pol II): peak followed by broad enrichment
- Medium-size peaks
- Large-size peaks

Dealing with replicates 32
Sample       #unique reads   #peaks (MACS14)   #peak cov.   InOtherSet
YG1 IP       15.1 M          2,968             3,898,236    85% in YG2
YG2 IP       14.75 M         2,844             3,820,358    89% in YG1
YG1+YG2      28 M            3,217             4,710,262
YG3 Input    12 M
Correlation with signal in merged peaks (log2(IP/Input)): p=0.94
A good correlation “allows” you to merge them and call peaks on the merged reads.
The approach suffers from cutoff effects (p-value distributions are sample specific).
Workflow available in Galaxy.
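
The two kinds of numbers used above, overlap between replicate peak sets and log2(IP/Input) signal, can be sketched as follows. This is an illustration only: it counts overlapping peaks, whereas the slide's "InOtherSet" column may be based on covered bases, and the pseudocount is an assumption added to avoid dividing by zero. Peaks are (chromosome, start, end) tuples in half-open coordinates.

```python
from math import log2


def fraction_overlapping(peaks_a, peaks_b):
    """Share of peaks in A that overlap at least one peak in B."""
    if not peaks_a:
        return 0.0
    hits = 0
    for chrom, start, end in peaks_a:
        if any(c == chrom and s < end and start < e for c, s, e in peaks_b):
            hits += 1
    return hits / len(peaks_a)


def log2_enrichment(ip_count, input_count, pseudocount=1.0):
    """log2(IP/Input) for one region; the pseudocount is an assumption."""
    return log2((ip_count + pseudocount) / (input_count + pseudocount))


if __name__ == "__main__":
    replicate_1 = [("chr1", 100, 200), ("chr1", 240, 260),
                   ("chr1", 500, 600), ("chr2", 100, 200)]
    replicate_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
    print(fraction_overlapping(replicate_1, replicate_2))
    print(log2_enrichment(7, 1))
```

Computing the enrichment per merged peak in each replicate and correlating the two vectors gives the kind of replicate correlation quoted on the slide.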

The irreproducible discovery rate (IDR) 33
Landt et al, 2012
Unified approach to measure the reproducibility of findings identified from replicate high-throughput experiments.
Idea: call peaks with a low cutoff and classify peaks as reproducible or not (bivariate rank distributions), based on the overlap of ranked peaks (consistency).
- This is a little stringent if the ChIP efficiencies are not equivalent
- Not for broad regions

The IDR Workflow 34
Landt et al, 2012
Assess sample reproducibility and compute the final peak list with a “rescue” strategy.
Workflow available in Galaxy.

Functional analysis of peak list(s) 35
Now you have a peak list (using IDR or the traditional way):
- DiffBind tool
- Trackster
- Get Sequence, then go to the MEME/RSAT server

The “Deeptools” Package 36 Available in Galaxy

Platforms like Galaxy are now essential 37
- Bioinformatics is now the bottleneck, e.g. not enough bioinformaticians to cope with the amount of data
  => biologists need to learn more and more bioinformatics and execute simple tasks themselves
- The field calls for easier reproducibility of data analysis, e.g. “Rebooting review”, Nat Biotech April 2015
  => systems easing this process must be used
- Automation helps reduce manual errors, i.e. dozens/hundreds of datasets in a study are becoming common (Arner et al., Science, 1189 CAGE libs!)
  => pipelines using parallel processing must be applied

The GBCS NGS Ecosystem 38
Diagram labels: Data; GeneCore Online Ordering; GC Bridge; Annotate data, Manage data sets, (Analyze arrays), Export to EBI; Archive; RStudio Server; GB Servers; File servers; SEPP libraries; IT LSF Cluster (jobs run on cluster); NGS Analysis, Build/Store Workflows.

Thank you 39
GeneCore: Jonathon Blake, Juergen Zimmermann, Vladimir Benes
Eileen Furlong and Lars Steinmetz
IT Services: Michael Wahlers, Andres Lindau
All GB members