NGS Data Analysis with the Galaxy Platform: an application to ChIP-seq
Monterotondo, 16 April 2015
Charles Girardot, Genome Biology Computational Support (GBCS)
What is Galaxy?
[Diagram: input dataset file(s) (private, e.g. NGS reads; public, e.g. genes) -> in -> Tool (e.g. read mapper) -> out -> result dataset file(s) (e.g. mapped reads, pictures, statistics)]
How does a computational biologist work?
- Install the “tool” on his computer
- Check how to run the “tool”
- Execute the command line
- Look at the results
What is Galaxy?
But …
- tool installation is not always a piece of cake
- “public” files need to be assembled
- the task needs to be executed on a server / a compute cluster
- results might not be human-readable (appropriate tools/software needed)
[Diagram: input dataset file(s) (private, e.g. NGS reads; public, e.g. genes) -> in -> Tool (e.g. read mapper) -> out -> result dataset file(s) (e.g. mapped reads, pictures, statistics)]
What is Galaxy?
[Diagram: in -> Tool 1 -> out -> in -> Tool 2 -> out]
- Multiple tools need to be chained in a defined order
- … you don’t want to manually check when “Tool 1” has completed!
- Organize command lines in a “script”
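As an illustration of organizing command lines in a script, here is a minimal Python sketch that runs two tools in a fixed order and feeds the output of the first into the second. `run_chain`, `TOOL_1` and `TOOL_2` are hypothetical names, and the two "tools" are tiny placeholder commands standing in for real programs such as a read mapper.

```python
import subprocess
import sys


def run_chain(commands, first_input):
    """Run each command in order, piping the previous output into the next.

    check=True makes the chain stop with an exception as soon as one
    tool exits with a non-zero status.
    """
    data = first_input
    for command in commands:
        completed = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = completed.stdout
    return data


# Placeholder tools (assumptions for illustration only):
# TOOL_1 upper-cases its input, TOOL_2 reverses it.
TOOL_1 = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read().upper())"]
TOOL_2 = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read()[::-1])"]

if __name__ == "__main__":
    print(run_chain([TOOL_1, TOOL_2], "acgt"))
```

A script like this removes the need to watch for the completion of “Tool 1”, but it runs the steps strictly one after another on a single machine.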
What is Galaxy?
[Diagram: workflow in which Tool 1, Tool 2, Tool 3 and Tool 4 each take an input and produce an output]
- Multiple outputs need to be combined
- execution of tools should be controlled (parallel processing)
- workflows contain dozens of steps, generate a lot of tmp data
- some tool execution might fail
- … you don’t want to implement flow control in your script!
- How do you adapt needed resources (cpu, memory) in a single script?
What is Galaxy?
[Diagram: the same four-tool workflow (Tool 1 to Tool 4) repeated once per sample]
- Sequencer throughput requires parallel processing of multiple samples
- How do you efficiently monitor all these workflow executions?
What is Galaxy?
Open source platform for high-throughput genomics:
- analyze (genomics) data and create workflows
- get and integrate public + private data
- visualize, share and publish
- encourage reproducible science
- tool installation from public toolshed
- enable biologists to analyze their data w/o bioinformaticians
The Galaxy Project is supported in part by NSF, NHGRI, The Huck Institutes of the Life Sciences, The Institute for CyberScience at Penn State, and Johns Hopkins University.
Galaxy is a web application that does ALL this for you (and more)
What is Galaxy?
Extremely active and popular project:
- more than 60 public servers, including servers focused on proteomics, metagenomics, metabolomics
- solid and stable team of developers
- user conferences held regularly, on all continents
- National Galaxy hubs, mailing lists and workshop events
- tons of online learning material
Learning Galaxy is a valuable long-term investment both for biologists and bioinformaticians
Provides advanced features for bioinformaticians:
- RESTful APIs and bioblend scripting interface
- Can be launched on the cloud
- …
Why a local Galaxy?
Working with public Galaxy servers is limited:
- Volume of data to transfer back and forth
- Confidentiality of data
- Impossibility of integrating custom tools
Our HD-Galaxy is also relevant to you:
- Sample sequencing occurs in HD
- Raw files are automatically streamed to Galaxy and emBASE
- Galaxy gives you access to all HD resources and compute power
- Store your data on your file servers
- Use other tools using web interfaces or the command line
- Transfer to MR only relevant and processed data (even sync at night)
We maintain a Galaxy Server at
Welcome to Galaxy
[Screenshot labels]
- Tools
- History
- Main Panel: Launch Jobs / View Results
- Search tools
- Personal workflows == pipelines
- Tools are organized by Categories
- Click a Category to see all tools
Notable Tool Categories
- Upload personal files
- Fetch public data from UCSC, ENSEMBL, mouseMINE…
- NGS Tools organized by: application (QC, mapping, peak calling, RNA, …), package name (bedtools, picard)
- General sequence and text format manipulation Tools
- Proteomics Tools
- Utility (local), Test (beta version) and Deprecated (for backward comp.)
Running a Tool is easy
Click a tool to bring it up in the middle panel, e.g. FastQC
Running a Tool is easy
(1) Select input files
(2) Set parameters
(3) Click Execute
Job is submitted to compute cluster => a new dataset block is added in the active history
- Green: Successfully completed
- Yellow: Running
- Grey: Waiting
- Red: Failed job
Running the tool on many files is easy too!
Tool summary
- More than 350 tools available!
- All results can be downloaded or directly transferred to your project folder
- Missing Tools can be easily integrated
- Easy way to add a GUI to your own script
- Parameters for cluster submission can be adjusted for each tool (and even be dynamically computed)
Tools can be assembled into workflows
Fetch public data from popular resources
- Fetch Data from UCSC Table Browser
- But also from Ensembl, SRA, MouseMine …
Organize your files in Libraries
- Each Lab has its own access-protected library
- Datasets are organized into “folders” (can be nested)
- We add your files automatically upon data release from GeneCore
- These files are links to avoid wasting space
How do you like your ChIP?
ChIP against…
- TF
- Co-factors
- Histone modifications
- Nucleosomes
- Transcription machinery
- …
To find …
- regulatory elements
- activity states
- …
Overview of ChIP-seq processing [Park et al, Nat. Gen. 2009]
1. Sequencing: single end, no strand specificity
- randomly sequencing one strand
- QC: quality per base, GC content…
Overview of ChIP-seq processing [Park et al, Nat. Gen. 2009]
2. Mapping:
- pos. strand reads map upstream of the target
- neg. strand reads map downstream of the target
- QC: #pos/#neg == 1 is expected
- QC: % of reads that do not map
- QC: % of reads that map at multiple positions
- QC: % of read duplicates
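The mapping QC values listed on this slide are simple ratios of read counts. The sketch below shows one way to compute them; `mapping_qc` and its parameter names are hypothetical, and the choice of denominator (all reads for the unmapped and multi-mapped percentages, mapped reads for duplicates) is an assumption, not something the slide specifies.

```python
def mapping_qc(total_reads, unmapped, multi_mapped, duplicates,
               plus_strand, minus_strand):
    """Summarize mapping QC from raw read counts.

    Denominators are assumptions made for this sketch:
    unmapped and multi-mapped are relative to all reads,
    duplicates are relative to mapped reads.
    """
    mapped = total_reads - unmapped
    return {
        # expected to be close to 1 for a ChIP-seq sample
        "strand_ratio": plus_strand / minus_strand,
        "pct_unmapped": 100.0 * unmapped / total_reads,
        "pct_multi_mapped": 100.0 * multi_mapped / total_reads,
        "pct_duplicates": 100.0 * duplicates / mapped,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only
    for name, value in mapping_qc(1000, 100, 50, 90, 450, 450).items():
        print(f"{name}: {value:.2f}")
```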
Overview of ChIP-seq processing [Park et al, Nat. Gen. 2009]
3. Strand specific coverage:
- symmetrical distribution expected
- summits separated by avg fragment length
- QC: strand cross-correlation
Strand cross-correlation [Landt et al, Gen. Res., 2012]
- NSC values < 1.1 are relatively low (ENCODE)
- Highly enriched experiments have RSC > 1
- RSC << 1 may indicate low quality
- A “ChIP Quality” score is derived from these metrics
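To make the metrics concrete, here is a rough pure-Python sketch of strand cross-correlation. It assumes the usual reading of Landt et al.: the profile is the Pearson correlation between plus-strand and minus-strand read-start counts at increasing shifts, NSC is the fragment-length peak divided by the minimum of the profile, and RSC is (fragment-length peak minus minimum) divided by (read-length "phantom" peak minus minimum). All function names are hypothetical, and taking the fragment peak as the global maximum is a simplification.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts at position i with minus-strand counts at i + d."""
    n = len(plus)
    return [pearson(plus[: n - d], minus[d:]) for d in range(max_shift + 1)]


def nsc_rsc(cc, read_length):
    """Return (fragment_shift, NSC, RSC) from a cross-correlation profile.

    Assumes a positive background and a phantom peak above background,
    as in real profiles; no guard against division by zero is included.
    """
    background = min(cc)
    fragment_shift = max(range(len(cc)), key=cc.__getitem__)
    fragment_cc = cc[fragment_shift]
    nsc = fragment_cc / background
    rsc = (fragment_cc - background) / (cc[read_length] - background)
    return fragment_shift, nsc, rsc
```

In practice dedicated tools compute these values on genome-wide data; the sketch only shows how the three numbers relate to the profile.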
Strand cross-correlation [Landt et al, Gen. Res., 2012]
[Figure: cross-correlation metrics for ENCODE datasets]
Overview of ChIP-seq processing [Park et al, Nat. Gen. 2009]
4. ChIP profile map
- shift each dist. by half fragment length
- extend each read by fragment length
- used for peak calling
- QC: visualize in genome browser
- QC: Bam fingerprint
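The read-extension variant of the profile can be sketched in a few lines of Python. Each read is extended from its 5' end to the estimated fragment length, and the extended intervals are piled up into a per-base coverage. Coordinates are assumed to be 0-based and half-open, and `extend_read` and `coverage` are hypothetical helper names.

```python
def extend_read(start, end, strand, fragment_length):
    """Extend a read from its 5' end to the full fragment length."""
    if strand == "+":
        # 5' end is the start; extend to the right
        return start, start + fragment_length
    # minus strand: 5' end is the end coordinate; extend to the left
    return max(0, end - fragment_length), end


def coverage(reads, fragment_length, chrom_length):
    """Per-base coverage of fragment-extended reads on one chromosome."""
    profile = [0] * chrom_length
    for start, end, strand in reads:
        ext_start, ext_end = extend_read(start, end, strand, fragment_length)
        for pos in range(ext_start, min(ext_end, chrom_length)):
            profile[pos] += 1
    return profile


if __name__ == "__main__":
    # Two toy reads flanking a binding site, fragment length of 5
    print(coverage([(2, 4, "+"), (6, 8, "-")], 5, 10))
```

With the toy reads above, the plus-strand and minus-strand reads overlap after extension, so the coverage is highest between them, which is where the bound site is expected.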
BAM Fingerprints (Deeptools)
- Visualize enrichment w/o calling peaks!
- In Section NGS: Deeptools
[Plot: cumulative read sum (y axis) against rank (x axis)]
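The idea behind the fingerprint plot can be illustrated with a short sketch: count reads per genomic bin, sort the bins by count, and plot the cumulative fraction of reads against the bin rank. This is a simplified illustration under those assumptions, not the Deeptools implementation, and `fingerprint` is a hypothetical name.

```python
def fingerprint(bin_counts):
    """Return (rank fraction, cumulative read fraction) pairs for sorted bins."""
    ordered = sorted(bin_counts)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no reads in any bin")
    n_bins = len(ordered)
    curve = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        curve.append((rank / n_bins, running / total))
    return curve


if __name__ == "__main__":
    # Uniform coverage (input-like): the curve follows the diagonal
    print(fingerprint([5, 5, 5, 5]))
    # All reads in one bin (strong enrichment): the curve rises only at the end
    print(fingerprint([0, 0, 0, 10]))
```

The further the curve of an IP sample falls below the diagonal compared with its input, the more the reads are concentrated in a small fraction of the genome.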
All these steps and QC metrics can be modeled in one workflow
- One sample “Read QC, Mapping and Filtering WF”, i.e. does not include peak calling
- This workflow is public and can be used as such or copied and modified to your needs
… executed on X samples in just one click
Each workflow is executed in its own history
Easily visualize results of each step
- Image results from tool
- Tabular datasets (e.g. bed)
- HTML Report
Trackster: Embedded Genome Browser
- Bam files
- bigwig files
- bed files
Visualization can be saved and shared
Interactive Charting
Overview of ChIP-seq processing [Park et al, Nat. Gen. 2009]
5. Find peaks
Detection of enriched regions must be adapted to the ChIPed target:
- Sequence-specific binding, “point source”, e.g. TF
- Mixture (Pol II): peak followed by broad enrichment
- Medium-size peaks
- Large-size peaks
Dealing with replicates

Sample      #unique reads   #peaks (MACS14)   #peak cov.   InOtherSet
YG1 IP      15.1 M          2,968             3,898,236    85% in YG2
YG2 IP      14.75 M         2,844             3,820,358    89% in YG1
YG1+YG2     28 M            3217              4,710,262
YG3 Input   12 M

Correlation with Signal in Merged Peaks (log2(IP/Input)): p=0.94
- A good correlation “allows” you to merge them and call peaks on the merged reads
- Approach suffers from cutoff effects (pval dist. are sample specific)
- Workflow available in Galaxy
The irreproducible discovery rate (IDR) [Landt et al, 2012]
- Unified approach to measure the reproducibility of findings identified from replicate high-throughput experiments
- Idea: call peaks with a low cutoff and classify peaks as reproducible or not (bivariate rank distributions) based on the overlap of ranked peaks (consistency)
- This is a little stringent if the ChIP efficiencies are not equivalent
- Not for broad regions
The IDR Workflow [Landt et al, 2012]
- Assess sample reproducibility and compute final peak list with a “rescue” strategy
- Workflow available in Galaxy
Functional analysis of peak list(s)
Now you have a peak list (using IDR or traditional way):
- DiffBind Tool
- Trackster
- Get Sequence, then go to MEME/RSAT server
The “Deeptools” Package
Available in Galaxy
Platforms like Galaxy are now essential
- Bioinformatics is now the bottleneck, e.g. not enough bioinformaticians to cope with the amount of data
  biologists need to learn more and more bioinformatics and execute simple tasks themselves
- Field calls for easier reproducibility of data analysis, e.g. “Rebooting review”, Nat Biotech April 2015
  systems easing this process must be used
- Automation helps reduce manual errors, i.e. dozens/hundreds of datasets in a study become common (Arner et al., Science, 1189 CAGE libs!)
  => pipelines using parallel processing must be applied
The GBCS NGS Ecosystem
[Diagram labels: Data; GeneCore; Online Ordering; GC Bridge; Annotate data; Manage data sets; (Analyze arrays); Export to EBI Archive; R studio Server; GB Servers; File servers; SEPP libraries; IT LSF Cluster; jobs run on cluster; NGS Analysis; Build/Store Workflows]
Thank you
GeneCore: Jonathon Blake, Juergen Zimmermann, Vladimir Benes
Eileen Furlong and Lars Steinmetz
IT Services: Michael Wahlers, Andres Lindau
All GB members