Using Galaxy to build and run data processing pipelines
Jelle Scholtalbers / Charles Girardot
GBCS, Genome Biology Computational Support



What is Galaxy
A web-based framework for running command-line utilities from a graphical user interface
- Keep track of history
- Share data and analysis steps
- Create workflows
- Visualize results

What is Galaxy
- Extremely active and popular open source project
- More than 60 public servers, some focused on proteomics, metagenomics or metabolomics
- Solid and stable team of developers
- User conferences held regularly, on all continents
- National Galaxy hubs and workshop events
- Tons of online learning material
- Provides advanced features for bioinformaticians
  - RESTful APIs and bioblend scripting interface
  - Can be launched on the cloud
  - …

Why do we need it
- Easy to manage your workspace
- Rerun tools with a click
- Store, export and share complete analyses
- Has a workflow manager

The main public instance is at

Tools are on the left, history on the right
[Screenshot labels: Available Tools; Dataset History]

Tool parameters are given in the central view
[Screenshot labels: Available Tools; Main pane: run tools, view results; Dataset History]

A tool without the UI looks like:

    $ fastqc --help

        FastQC - A high throughput sequence QC analysis tool

    SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

        fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
               [-c contaminant file] seqfile1 .. seqfileN
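To illustrate what the graphical form spares the user from typing, here is a minimal sketch that assembles a FastQC command line from form-like parameters. It uses only the options shown in the help text above (-o, --extract/--noextract, -f, -c). The function name `build_fastqc_command` and its parameter names are hypothetical, and this is not how Galaxy's own tool wrappers are implemented.

```python
from typing import List, Optional


def build_fastqc_command(
    seqfiles: List[str],
    output_dir: Optional[str] = None,
    extract: Optional[bool] = None,
    file_format: Optional[str] = None,
    contaminant_file: Optional[str] = None,
) -> List[str]:
    """Return an argv list for fastqc; nothing is executed here."""
    if not seqfiles:
        raise ValueError("at least one sequence file is required")
    if file_format is not None and file_format not in ("fastq", "bam", "sam"):
        raise ValueError("format must be fastq, bam or sam")

    cmd = ["fastqc"]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if extract is not None:
        # --extract / --noextract are the two forms of [--(no)extract]
        cmd.append("--extract" if extract else "--noextract")
    if file_format is not None:
        cmd += ["-f", file_format]
    if contaminant_file is not None:
        cmd += ["-c", contaminant_file]
    # Sequence files always come last, as in the synopsis.
    return cmd + list(seqfiles)


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    print(" ".join(build_fastqc_command(["reads.fastq"], output_dir="qc")))
```

Returning a list rather than a single string keeps file names with spaces intact if the command is later passed to `subprocess.run`.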

With the UI:
[Screenshot labels: Available Tools; Main pane: run tools, view results; Dataset History]

View results
[Screenshot labels: Available Tools; Main pane: run tools, view results; View]

Rerun tools
[Screenshot labels: Main pane: run tools, view results; Rerun; Save; View]

Shared data libraries

Shared histories

Shared workflows

What is a workflow manager
Allows one to create a chain of dependent tasks to achieve a defined goal
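The idea of a chain of dependent tasks can be sketched with the Python standard library: each step declares which steps it needs, and a topological sort yields an order in which every step runs after its inputs. The step names and the `run_workflow` helper below are hypothetical and only illustrate the concept; Galaxy's workflow engine is not implemented this way.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names; each step maps to the steps it depends on.
DEPENDENCIES = {
    "fastqc": {"reads"},
    "bowtie": {"reads"},
    "flagstat": {"bowtie"},
}


def execution_order(dependencies):
    """Return one valid order in which every step runs after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, actions):
    """Call each step's action in dependency order and collect the results."""
    results = {}
    for step in execution_order(dependencies):
        # Hand each step the outputs of the steps it depends on.
        inputs = {dep: results[dep] for dep in dependencies.get(step, ())}
        results[step] = actions[step](inputs)
    return results


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
```

Several orders can be valid (here `fastqc` and `bowtie` are independent of each other), which is what lets a workflow manager run such steps in parallel.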

Instead of

Galaxy workflows
[Screenshot labels: Available Tools; Main pane: design workflow; Tool parameters]

Galaxy workflows

Galaxy at EMBL runs on a compute cluster
[Diagram labels: bowtie, fastqc, flagstat; Cluster Queue; Compute Cluster]

Output states
[Screenshot labels: Click on the bug; Info button]

Practical: Building a Workflow for ChIP-seq processing in Galaxy

Exercise 1: Build a workflow for basic processing of FASTQ files, according to the specifications below
Go to Workflow => “Create new workflow”
Input dataset: 1 fastq file
Steps:
– Check read quality
  Tool: FastQC
– Map reads with bowtie2
  Tool: Bowtie2 (organism: dm3)
– Remove unmapped and multi-mapping reads
  Tool: Filter BAM
– Remove duplicates
  Tool: MarkDuplicates
– Check strand cross-correlation
  Tool: SPP (replicates removed: yes)
– Generate bigwig coverage file for visualization
  Tool: bamCoverage (organism: dm3)
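To make the "Remove unmapped and multi-mapping reads" step more concrete, here is a minimal sketch of such a filter on SAM text records. It relies on two facts from the SAM format: flag bit 0x4 marks an unmapped read, and the fifth column holds the mapping quality (MAPQ). Treating low MAPQ as a proxy for multi-mapping and using a threshold of 10 are assumptions made for illustration; they are not necessarily the settings of the Filter BAM tool used in the exercise. The function names are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit: the read is unmapped


def keep_alignment(sam_line, min_mapq=10):
    """Return True for a mapped read whose MAPQ is at least min_mapq.

    min_mapq=10 is an illustrative assumption, not a recommended value.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & UNMAPPED_FLAG:
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=10):
    """Yield header lines unchanged and only the alignments worth keeping."""
    for line in lines:
        # Header lines start with '@'; read names cannot start with '@'.
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        "@HD\tVN:1.6",
        "r1\t0\tchr2L\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for kept in filter_sam(records):
        print(kept)
```

In practice this filtering is done on BAM files with dedicated tools; the sketch only shows the decision applied to each read.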

Exercise 2: Execute the workflow
Grab a (small) data file:
– Go to Shared Data | Data Libraries Beta | Training | ChIPseq Training
– Select file “K27ac_R2_chr2L_1-5M.fastq”
– Click the “to History” button to import the dataset into your history
Execute your workflow:
– Go to Workflow
– Locate your workflow, click the down arrow and select “run”
– Set parameters where needed and click “Run Workflow”

Exercise 3: Check workflow results
Look at the FastQC and SPP results
– c/Help/ or c/Help/
Check read statistics (MarkDuplicates report)
Visualize the bigwig file in Trackster (dm3 genome)
I have pre-run a similar workflow on two K27ac replicates and their input control. You can get all QC results by importing the History:
– “EMBL ChIP-seq Training: QC Results”

Break

Exercise 4: Run additional quality checks and call peaks using IDR workflow
We collected all filtered BAM and bigwig files in the History “EMBL ChIP-seq Training : Result files (BAMs,bigwig) and further analysis”
– Import it
– Check results for “Correlate BAM GenomeWide (2K bins)”
  NB: you can check what was run by clicking on “Run this job again”
– Check results for “bamFingerprint GenomeWide” (dataset 16)
– Check IDR results (datasets 25 and 28)

Exercise 5: Generate heatmap and average plots
Still using the history “EMBL ChIP-seq Training : Result files (BAMs,bigwig) and further analysis”
– Prepare a signal file representing the IP signal corrected for input (subtraction, e.g. IP - input), in which both IP and input are replicate averages.
  Use tools “Average multiple (Big)Wig files” and “Subtract two (Big)Wig files”.
  Convert the final file to bigwig format with “Wig/BedGraph-to-bigWig converter”
  Precomputed datasets: 38 to 41
– Check results and visualize all bigwig files (individual files and summarized ones) in Trackster
  Use the genome layout fetched from UCSC (dataset 42)
– Prepare a data matrix summarizing signal values around all TSSs of the genome
  TSSs are defined in the “35 : TSS_dm3.bed” file
  Use the computeMatrix tool (result: dataset 43)
– Plot the data matrix as a heatmap and a profile average
  Use Deeptools’ heatmapper (result: dataset 44)
  Use Deeptools’ profiler (result: dataset 45)
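The arithmetic behind the input-corrected signal file can be sketched on plain lists of per-bin values: average the replicates bin by bin, then subtract the averaged input from the averaged IP. This is only an illustration of the calculation. The function names and the numbers are made up, and the Galaxy tools named in the exercise operate on (Big)Wig files rather than Python lists.

```python
def average_tracks(tracks):
    """Average several equally binned signal tracks, bin by bin."""
    if not tracks:
        raise ValueError("need at least one track")
    n_bins = len(tracks[0])
    if any(len(track) != n_bins for track in tracks):
        raise ValueError("all tracks must have the same number of bins")
    return [sum(values) / len(tracks) for values in zip(*tracks)]


def subtract_tracks(ip, control):
    """Return ip - control for each bin."""
    if len(ip) != len(control):
        raise ValueError("tracks must have the same number of bins")
    return [a - b for a, b in zip(ip, control)]


if __name__ == "__main__":
    # Made-up coverage values for three bins, two replicates each.
    ip_replicates = [[4.0, 8.0, 2.0], [6.0, 10.0, 2.0]]
    input_replicates = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]

    ip_mean = average_tracks(ip_replicates)        # [5.0, 9.0, 2.0]
    input_mean = average_tracks(input_replicates)  # [2.0, 2.0, 2.0]
    print(subtract_tracks(ip_mean, input_mean))    # [3.0, 7.0, 0.0]
```

Bins where the corrected value is close to zero or negative carry no enrichment over input, which is why the subtraction is done before plotting heatmaps and profiles.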

Part of fastQC wrapper

Exercise 1