Published by Antonia Janis Thomas. Modified over 8 years ago.
1
Using Galaxy to build and run data processing pipelines
Jelle Scholtalbers / Charles Girardot
GBCS (Genome Biology Computational Support)
2
What is Galaxy
A web-based framework for running command-line utilities from a graphical user interface
- Keep track of history
- Share data and analysis steps
- Create workflows
- Visualize results
3
What is Galaxy
- Extremely active and popular open source project
- More than 60 public servers, some focused on proteomics, metagenomics or metabolomics
- Solid and stable team of developers
- User conferences held regularly, on all continents
- National Galaxy hubs and workshop events
- Tons of online learning material
- Provides advanced features for bioinformaticians
  - RESTful APIs and bioblend scripting interface
  - Can be launched on the cloud
  - …
4
Why do we need it
- Easy to manage your workspace
- Rerun tools with a click
- Store, export and share complete analyses
- Has a workflow manager
5
The main public instance is at http://usegalaxy.org
6
Tools are on the left, history on the right
[Screenshot labels: Available Tools; Dataset History]
7
Tool parameters are given in the central view
[Screenshot labels: Available Tools; Main pane: run tools, view results; Dataset History]
8
A tool without the UI looks like:
$ fastqc --help
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
    fastqc seqfile1 seqfile2 .. seqfileN
    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
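To make the link between the command line and the graphical form concrete, here is a minimal Python sketch of how form values could be turned into a `fastqc` invocation. It only uses the options visible in the synopsis above (`-o`, `--extract`/`--noextract`, `-f`, `-c`). The function name, its parameter names and the file name `reads.fastq` are illustrative assumptions, and this is not how Galaxy's own tool wrapper is written.

```python
import shlex


def build_fastqc_command(seqfiles, output_dir=None, extract=None,
                         file_format=None, contaminant_file=None):
    """Assemble a fastqc argument list from form-like parameters.

    Hypothetical helper: only the flags shown in the synopsis are used.
    """
    if not seqfiles:
        raise ValueError("at least one sequence file is required")
    cmd = ["fastqc"]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if extract is True:
        cmd.append("--extract")
    elif extract is False:
        cmd.append("--noextract")
    if file_format is not None:
        if file_format not in ("fastq", "bam", "sam"):
            raise ValueError("format must be fastq, bam or sam")
        cmd += ["-f", file_format]
    if contaminant_file is not None:
        cmd += ["-c", contaminant_file]
    # Sequence files always come last, as in the synopsis.
    return cmd + list(seqfiles)


# Show the shell command that the chosen parameters correspond to.
command = build_fastqc_command(["reads.fastq"], output_dir="qc",
                               extract=False, file_format="fastq")
print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is only used for display.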
9
With the UI:
[Screenshot labels: Available Tools; Main pane: run tools, view results; Dataset History]
10
View results
[Screenshot labels: Available Tools; Main pane: run tools, view results; View]
11
Rerun tools
[Screenshot labels: Main pane: run tools, view results; Rerun; Save; View]
12
Shared data libraries
13
Shared histories
14
Shared workflows
15
What is a workflow manager
Allows one to create a chain of dependent tasks to achieve a defined goal
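The idea of a "chain of dependent tasks" can be sketched as a small dependency graph that is sorted so every task runs after the tasks it depends on. The step names below are borrowed from the ChIP-seq practical in these slides, but the dependencies between them are an assumption (each step is taken to consume the output of the previous processing step), and the code says nothing about how Galaxy schedules workflows internally. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# The dependency edges are assumed for illustration.
WORKFLOW = {
    "FastQC": {"input_fastq"},
    "Bowtie2": {"input_fastq"},
    "Filter BAM": {"Bowtie2"},
    "MarkDuplicates": {"Filter BAM"},
    "SPP": {"MarkDuplicates"},
    "bamCoverage": {"MarkDuplicates"},
}


def execution_order(graph):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```

Several orders are valid here (FastQC and Bowtie2 are independent of each other), which is why a workflow manager can run such steps in parallel on a cluster.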
16
Instead of
17
Galaxy workflows
[Screenshot labels: Available Tools; Main pane: design workflow; Tool parameters]
18
Galaxy workflows
20
Galaxy at EMBL runs on a compute cluster
[Diagram labels: bowtie, fastqc, flagstat; Cluster Queue; Compute Cluster]
21
Output states
[Screenshot labels: Click on the bug; Info button]
22
Practical: Building a workflow for ChIP-seq processing in Galaxy
http://galaxy.embl.de
23
Exercise 1: Build a workflow for basic processing of FASTQ files, according to the specifications below
Go to Workflow => “Create new workflow”
Input dataset: 1 FASTQ file
Steps:
– Check read quality
  Tool: FastQC
– Map reads with Bowtie2
  Tool: Bowtie2 (organism: dm3)
– Remove unmapped and multi-mapping reads
  Tool: Filter BAM
– Remove duplicates
  Tool: MarkDuplicates
– Check strand cross-correlation
  Tool: SPP (replicates removed: yes)
– Generate a bigwig coverage file for visualization
  Tool: bamCoverage (organism: dm3)
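For readers wondering what "Remove unmapped and multi-mapping reads" amounts to, the following Python sketch applies the idea to SAM text lines. It relies on two facts of the SAM format: the FLAG bit `0x4` marks an unmapped read, and the fifth column holds the mapping quality (MAPQ). Treating a low MAPQ as "multi-mapping" is a common heuristic and an assumption here; the settings actually used in the Filter BAM tool are not shown on the slide, and the threshold of 10 and the example reads are made up.

```python
UNMAPPED = 0x4  # SAM FLAG bit set when the read has no alignment


def keep_alignment(sam_line, min_mapq=10):
    """Return True for a mapped read whose MAPQ is at least min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    return not (flag & UNMAPPED) and mapq >= min_mapq


def filter_sam(lines, min_mapq=10):
    """Yield header lines unchanged plus the alignments that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


# Made-up records: one confident hit, one ambiguous hit, one unmapped read.
example = [
    "@HD\tVN:1.6",
    "unique\t0\tchr2L\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "ambiguous\t0\tchr2L\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "unmapped\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_sam(example))
print("\n".join(kept))
```

In practice this filtering is done on BAM files by dedicated tools; the sketch only illustrates the selection criteria.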
25
Exercise 2: Execute the workflow
Grab a (small) data file:
– Go to Shared Data | Data Libraries Beta | Training | ChIPseq Training
– Select file “K27ac_R2_chr2L_1-5M.fastq”
– Click the “to History” button to import the dataset into your history
Execute your workflow:
– Go to Workflow
– Locate your workflow, click the down arrow and select “run”
– Set parameters where needed and click “Run Workflow”
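The same "locate the workflow and run it" steps can also be scripted through the bioblend interface mentioned earlier in these slides. The sketch below is written under several assumptions: bioblend is installed, you have an API key for the server, the workflow's input dataset is step `"0"`, and the dataset is already in the history (`"src": "hda"`). The helper names, workflow name and history name are hypothetical; only `GalaxyInstance`, `get_workflows`, `get_histories` and `invoke_workflow` are bioblend calls.

```python
def connect(url, api_key):
    """Open a Galaxy API connection (needs the third-party bioblend package)."""
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def find_by_name(items, name):
    """Pick the single dict in items whose 'name' matches exactly."""
    matches = [item for item in items if item["name"] == name]
    if len(matches) != 1:
        raise LookupError(f"expected one item named {name!r}, found {len(matches)}")
    return matches[0]


def run_workflow(gi, workflow_name, history_name, dataset_id):
    """Invoke a named workflow on one history dataset and return the invocation."""
    workflow = find_by_name(gi.workflows.get_workflows(), workflow_name)
    history = find_by_name(gi.histories.get_histories(), history_name)
    # Assumption: the workflow has a single input dataset at step index 0.
    inputs = {"0": {"id": dataset_id, "src": "hda"}}
    return gi.workflows.invoke_workflow(
        workflow["id"], inputs=inputs, history_id=history["id"]
    )
```

A call might look like `run_workflow(connect("http://galaxy.embl.de", "YOUR_API_KEY"), "My ChIP-seq workflow", "My history", "DATASET_ID")`, where every argument except the server URL is a placeholder to replace with your own values.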
26
Exercise 3: Check workflow results
– Look at the FastQC and SPP results
  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ or http://bit.ly/1RcoFtN
– Check read statistics (MarkDuplicates report)
– Visualize the bigwig file in Trackster (dm3 genome)
I have pre-run a similar workflow on two K27ac replicates and their input control. You can get all QC results by importing the History:
– “EMBL ChIP-seq Training: QC Results”
27
Break
28
Exercise 4: Run additional quality checks and call peaks using the IDR workflow
We collected all filtered BAM and bigwig files in the History “EMBL ChIP-seq Training : Result files (BAMs,bigwig) and further analysis”
– Import it
– Check results for “Correlate BAM GenomeWide (2K bins)”
  NB: you can check what was run by clicking on “Run this job again”
– Check results for “bamFingerprint GenomeWide” (dataset 16)
– Check IDR results (datasets 25 and 28)
29
Exercise 5: Generate heatmap and average plots
Still using the history “EMBL ChIP-seq Training : Result files (BAMs,bigwig) and further analysis”
– Prepare a signal file representing the IP signal corrected for input (by subtraction, e.g. IP - input), in which both IP and input are replicate averages
  Use tools “Average multiple (Big)Wig files” and “Subtract two (Big)Wig files”
  Convert the final file to bigwig format with “Wig/BedGraph-to-bigWig converter”
  Precomputed datasets: 38 to 41
– Check results and visualize all bigwig files (individual and summarized) in Trackster
  Use the genome layout fetched from UCSC (dataset 42)
– Prepare a data matrix summarizing signal values around all TSSs of the genome
  TSSs are defined in the “35 : TSS_dm3.bed” file
  Use the computeMatrix tool (result: dataset 43)
– Plot the data matrix as a heatmap and an average profile
  Use deepTools’ heatmapper (result: dataset 44)
  Use deepTools’ profiler (result: dataset 45)
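The arithmetic behind the input-corrected signal in the first step of this exercise is simple enough to sketch in plain Python. The example below assumes each track has already been reduced to one value per genomic bin, with all tracks using the same bins; the numbers are invented and the function names are hypothetical. The Galaxy tools named in the exercise operate on (Big)Wig files instead.

```python
def average_tracks(tracks):
    """Average several per-bin signal tracks of equal length, bin by bin."""
    if not tracks:
        raise ValueError("need at least one track")
    if any(len(track) != len(tracks[0]) for track in tracks):
        raise ValueError("all tracks must cover the same bins")
    return [sum(values) / len(tracks) for values in zip(*tracks)]


def subtract_tracks(ip, control):
    """Return IP minus control for each bin."""
    if len(ip) != len(control):
        raise ValueError("tracks must cover the same bins")
    return [a - b for a, b in zip(ip, control)]


# Invented per-bin coverage for two IP replicates and two input replicates.
ip_mean = average_tracks([[4, 6, 8], [2, 6, 10]])
input_mean = average_tracks([[1, 2, 3], [1, 2, 1]])
corrected = subtract_tracks(ip_mean, input_mean)
print(corrected)
```

Averaging replicates before subtracting keeps the IP and input signals on comparable scales, provided the tracks were normalized consistently beforehand.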
30
Part of the FastQC wrapper
31
Exercise 1