GBCS ecosystem for NGS data: data management with emBASE, data analysis with Galaxy
The big picture
  1. emBASE is a database with a web front-end that stores all metadata about your data files (e.g. FASTQ).
  2. Your data files remain on your group file server, in your “NGS data library”, and are directly accessible.
  What you do in emBASE (NGS GB):
  - Annotate data: sample description, protocol description
  - Manage data sets: link files to experiments/projects
  - Publish data to a public repository upon publication
  - Export to tape for long-term storage
  GCBridge (fed by GeneCore Online Ordering): automated data transfer from GC servers to emBASE, to avoid:
  - file renaming, i.e. loss of data traceability
  - duplication of data files in several places (with different names!)
  - unreliable or unknown storage places (your laptop…)
  - data not being loaded into the system
NGS ecosystem by GBCS
  - Data, file servers, NGS GB
  - GeneCore Online Ordering, GCBridge
  - IT LSF cluster: jobs run on the cluster
  - NGS analysis: build and store workflows
  - RStudio Server
  - GB servers: access files directly, fetch info with JemBASEAPI
  - SEPP libraries
0. What is the “data”?
The typical user view of the “model” (NGS GB : Data model)
  Sample -> Sequencing -> File (FASTQ, BAM)
  - Send my sample to sequencing
  - Download the file
  - Mail the bioinformatician where the file is
A more realistic view of the process
  Sample (e.g. embryos, cells) -> Extract (e.g. DNA, mRNA) -> Library -> Sequencing -> File (FASTQ, BAM)
  - Protocols: growth, treatment, extraction, amplification, sequencing, …
  - Annotations
  - Annotations and protocols need to be controlled as much as possible (AMAP)
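A more realistic view of the process: replicates
  - Annotations and protocols need to be controlled as much as possible (AMAP)
  - Replicates need to be described properly (sample replicates vs. library re-sequencing)
  - Biological replicate: Sample1 -> Extract1 -> Library1 -> Sequencing -> File1 and Sample2 -> Extract2 -> Library2 -> Sequencing -> File2
  - Technical replicate: Sample1 -> Extract1 -> Library1 -> Sequencing -> File1, plus Extract2 -> Library2 -> Sequencing -> File2 from the same Sample1
  - Library re-sequencing: Sample1 -> Extract1 -> Library1, sequenced twice, giving File1 and File2
  - These cases are not equivalent (≠) and must not be confused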
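The three replicate cases above can be told apart from the chain of items alone. The following is a minimal sketch of that idea; the class names and the `replicate_kind` function are hypothetical illustrations, not the emBASE schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """One sequencing library and the items it was derived from."""
    sample: str
    extract: str
    name: str


@dataclass(frozen=True)
class SequencingFile:
    """One sequencing output file (e.g. FASTQ) and the library it came from."""
    path: str
    library: Library


def replicate_kind(a: SequencingFile, b: SequencingFile) -> str:
    """Classify how two sequencing files relate to each other."""
    if a.library == b.library:
        # Same library sequenced more than once.
        return "library re-sequencing"
    if a.library.sample == b.library.sample:
        # Same sample, but a different extract/library.
        return "technical replicate"
    # Different samples; assumes both belong to the same condition.
    return "biological replicate"
```

The classification only works if every file is linked to its library, and every library to its extract and sample, which is the point of describing the full chain.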
A complete view of the situation
  - Samples are commonly multiplexed: Sample1 … Sample4, each with its own Extract and Library, share one Sequencing lane and one File (FASTQ, BAM), with barcode info
  - Projects are mixed in the same lane (e.g. Exp X / Project P and Exp Y / Project Q)
  - Analysis: stored (meta)data must be readily accessible for analysis
  - Publish (e.g. EBI): model and vocabulary should match standards for final publishing
1. emBASE
  “Data management, organization, annotations and publication”
emBASE Items
  Diagram: the data model mapped onto emBASE items.
  - Sample (e.g. embryos, cells) -> Extract (e.g. DNA, mRNA) -> Library, with Protocols and Sample Annotations
  - NGS Assay (SeqLane), with its file(s) and barcode info
  - RawBioAssay1, RawBioAssay2, each with a file (BAM, FASTQ)
  - Workflow
emBASE (NGS GB :: Data Management :: emBASE)
  - Developed in house using BASE
  - Initially a LIMS for arrays
  - Has been running for 9 years
emBASE Modules
  - Access: please request a login
  - Controlled vocabulary
  - Samples, Extracts, Libraries…
  - Assays grouped in Experiments and Projects
  - NGS Assays, Microarrays, In Situ Images
emBASE NGS Assay List Page
  - Lists all NGS Assays (an NGS Assay == a lane)
  - Access rights are set for each assay (Unix-like)
Search NGS Assays
  - Powerful search on all “list” pages
  - Customizable table view
  - Locate your assay and follow the link for details
NGS Assay: Example of a multiplexed lane
  - Assay (= lane) info and rights
  - Sequencing run info
  - Lane file and location
  - Individual data sets and de-multiplexed files
  - Related raw data sets are grouped in “experiments”
  - Link to Libraries, i.e. Samples
Biomaterials
Sample Annotation
  Sample Annotation Types (SATs):
  - are typed: free text, number (int, float), pre-defined values (enum)
  - are owned
  - can be created as needed by authorized users, e.g. as required by ICGC
  - Select the SATs to use
Custom sample annotations
  - Unlimited number of annotations
  - Annotation types can be customized (per group)
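To illustrate what a typed annotation buys you, here is a minimal sketch of a typed, owned annotation type that validates raw values. `AnnotationType`, its fields and the kind names are assumptions made for this example; emBASE's own implementation is not shown in these slides.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class AnnotationType:
    """Hypothetical sample annotation type: a name, a value kind and an owner."""
    name: str
    kind: str                      # "text", "int", "float" or "enum"
    allowed: Tuple[str, ...] = ()  # pre-defined values, used only for "enum"
    owner: str = ""

    def parse(self, raw: str) -> Union[str, int, float]:
        """Convert a raw string to a typed value, or raise ValueError."""
        if self.kind == "text":
            return raw
        if self.kind == "int":
            return int(raw)
        if self.kind == "float":
            return float(raw)
        if self.kind == "enum":
            if raw not in self.allowed:
                raise ValueError(f"{raw!r} is not one of {self.allowed} for {self.name}")
            return raw
        raise ValueError(f"unknown annotation kind: {self.kind!r}")
```

With an enum type such as `AnnotationType("sex", "enum", allowed=("female", "male"))`, a misspelled value is rejected at entry time instead of surfacing later during analysis or publication.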
Grouping data sets into Experiments
  - An experiment has a single ‘type’, e.g. ChIP-seq, RNA-seq
  - Search raw data sets and add them to, or remove them from, an experiment
New emBASE Project Layer
  - An experiment is tied to a single type, e.g. ChIP-seq, RNA-seq, iCLIP-seq
  - Group related experiments into a project
Wait a sec... Do we really have to fill in all these web forms?! NO!
  1. GCBridge: all “items” are pre-created for you
  2. Protocols and sample annotations remain to be done
2. Decentralized NGS File Data Library
  “Your data lives on your file server and is readily accessible”
NGS data library
  - The NGS data library root folder can be anywhere you like
  - Sub-folders containing the FASTQ files are organized by “Sequencer Run”
  - Everything in your data library is managed by emBASE and is read-only, to avoid data deletion, renaming or moving
NGS data library extended to better support demultiplexed files
  - Lane directory: one per (existing) lane; read-only
  - Library directory (named after the immutable internal emBASE id); read-only
  - Data file directory, one per file type; read-write until you lock it, then read-only
Locking / Unlocking concept
  1. Library file sub-directories are unlocked (writable for the group): you can work and replace files as you wish
  2. At some point, files are ready and directories can be locked (read-only):
     - from this point, emBASE starts to track these files
     - emBASE will allow deletion of the lane file once all its multiplexed libraries are locked
  3. Locking is done via the web interface, on the whole lane or per library (for shared lanes)
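The locking rules above can be sketched with plain file permissions. This is only an illustration of the concept, assuming a POSIX file system: in emBASE, locking is done through the web interface, and the function names below are hypothetical.

```python
import stat
from pathlib import Path

# Write permission bits for user, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def lock_directory(directory: Path) -> None:
    """Remove write permission from the files directly inside a directory, then from the directory itself."""
    for path in [*directory.iterdir(), directory]:
        path.chmod(path.stat().st_mode & ~WRITE_BITS)


def is_locked(directory: Path) -> bool:
    """A directory counts as locked when nobody has write permission on it."""
    return (directory.stat().st_mode & WRITE_BITS) == 0


def lane_file_deletable(library_dirs) -> bool:
    """The lane file may be deleted only once every library directory of the lane is locked."""
    dirs = list(library_dirs)
    return bool(dirs) and all(is_locked(d) for d in dirs)
```

The last function captures rule 2: as long as one library of a multiplexed lane is still being worked on, the lane file has to stay.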
3. GCBridge
  “Ensuring smooth data transfer from GeneCore to emBASE”
GCBridge: making your life as easy as possible (NGS GB :: Automated Data Transfer)
  1. Transfer file (GeneCore Online Ordering -> NGS Lib)
  2. Call GCBridge
  - Info is fetched from the GC DB
  - The user gets a notification upon transfer completion
  - The user gets a notification when demultiplexing has been performed
3. Practical steps
  - Validate the GCBridge Transfer Form
  - Annotate samples, link protocols
Data released
  - Click the link to get to the GCBridge Transfer Form
Single Library Form
  - Lane file(s)
  - The Bridge is connected to emBASE experiments
  - Sample names can be matched against existing Samples or Libraries; the search ignores the prefix
  - Matching an existing Sample1 and adding Extract2 -> Library2 -> NGSAssay models a technical replicate
  - Matching an existing Library1 and adding an NGSAssay models library re-sequencing
  - New entries (Sample1 -> Extract1 -> Library1 -> NGSAssay) are created by default
Multiplexed Library Form
  - Two parts: fields identical to the Single Library Form, and multiplex-specific fields
  - Tell us the number of libraries, so we can check submissions…
Easy demultiplexing, directly in the Data Library
  - Request demultiplexing (runs on the cluster); it starts when the submission is complete
  - Jemultiplexer is emBASE-aware (i.e. it knows where files go in the Data Library)
  - Jemultiplexer can also be (re)launched from the command line
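Conceptually, demultiplexing assigns each read of a lane to a library according to its barcode. The toy sketch below shows only that idea, assuming the barcode is the first bases of each read and must match exactly; it is not Jemultiplexer, and the function and argument names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_library, barcode_length):
    """Group reads by the library whose barcode matches the read's first bases.

    Reads with an unknown barcode are collected under "unassigned".
    The barcode itself is clipped from the returned sequences.
    """
    by_library = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        library = barcode_to_library.get(barcode, "unassigned")
        by_library[library].append(insert)
    return dict(by_library)
```

For example, with barcodes `{"ACGT": "Library1", "TGCA": "Library2"}` and a barcode length of 4, the read `ACGTTTTT` goes to Library1 as `TTTT`. Real demultiplexers additionally handle mismatches, paired-end reads and quality strings.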
Easy selection of lane mates
  - Select all lane mates
Re-use emBASE samples and libraries
  - Select the search level: sample or library
  - Select the appropriate items
  - Match levels can be mixed
  - This makes it possible to model replicates accurately (technical vs. biological)
  - A step-by-step tutorial is available (see Quick Links)
Already demultiplexed samples
Automatic notification
Now what?
  1. GCBridge: all “items” are pre-created for you
  2. Protocols and sample annotations remain to be done in emBASE
Working in batch with emBASE
  1. Narrow your search to locate the samples you want
  2. Select the ones you want, or all (N.B.: increase the number of items per page in the GUI settings if needed)
  3. Associate protocols, or change access rights, for all selected samples in one click
  4. Download a pre-filled Excel file for batch annotation
  5. Keep the columns you need, fill in your annotations in Excel, and save back as text
  6. Batch (re)annotate your samples using this file
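Before uploading the text file saved in step 5, it can be worth checking what it actually contains. The sketch below assumes the file is tab-delimited with a header row; the column names in the usage note are made up, and the function is not part of emBASE.

```python
import csv


def read_annotations(handle, id_column, keep):
    """Read a tab-delimited annotation file.

    Returns {item id: {column: value}} restricted to the columns in `keep`,
    skipping empty cells.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    result = {}
    for row in reader:
        values = {}
        for column in keep:
            cell = (row.get(column) or "").strip()
            if cell:
                values[column] = cell
        result[row[id_column]] = values
    return result
```

Calling `read_annotations(open("samples.txt", newline=""), "Name", ["Genotype", "Stage"])` would list, per sample, the annotations that will be set, so that empty or misplaced cells are spotted before the batch import.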
emBASE Advanced Features (for the command line user)
Working with emBASE
  1. Export experiment or project views using the web interface
  2. Use the new command-line emBASE API to learn where files are or should be placed
     - These commands extract all info from emBASE for a lane, an experiment or a project
  Documentation at :
Concept: work as you like
  - NGS Lib: the real files
  - Database: samples, libs, RBAs, exp, project
  - Link real files; pull info as needed
Export Project View to disk
emBASE API Example
  - Assume you want to discover all libraries and associated files in a given lane …
  - Available from anywhere
  - The logged-in user is used to authenticate in emBASE
  - Rights apply the same way as in emBASE
  - Example: create symlinks on the fly to the NGS data library for all libraries of a new lane
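The symlink example could look roughly like the sketch below. The slides do not show the output of the emBASE API commands, so the `files_by_library` mapping (library name to file paths in the NGS data library) is a hypothetical stand-in for whatever the API returns.

```python
from pathlib import Path


def link_lane_files(files_by_library, workdir):
    """Create workdir/<library>/<file name> symlinks pointing into the data library.

    Existing links are left untouched, so the function can be re-run safely.
    Returns the list of link paths.
    """
    links = []
    for library, files in files_by_library.items():
        target_dir = Path(workdir) / library
        target_dir.mkdir(parents=True, exist_ok=True)
        for source in files:
            link = target_dir / Path(source).name
            if not link.is_symlink() and not link.exists():
                link.symlink_to(source)
            links.append(link)
    return links
```

Symlinks keep a single copy of each file in the read-only data library while letting you lay out a working directory any way you like.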
Archiving of emBASE Data
  Goal: save space by moving data offline when projects are finished
  - Fill in the options; the emBASE admin is notified
  What happens next?
  - All data files connected to the experiments are exported
  - IT performs the backup on tape
  - We delete ‘deletable’ files (concept of active experiment):
    - emBASE knows which files can be deleted, which ones have been deleted, and how to get them back if needed
    - deleted files are replaced locally with a small file containing backup information
  - You can follow the archiving status in emBASE
  This is a couple of clicks on your side, but remember that you still pay the bill!
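Replacing a deleted file with a small file containing backup information can be sketched as follows. The stub name (`.archived.json`), the recorded fields and the `archive_ref` argument are all assumptions for illustration; the actual emBASE stub format is not described in these slides.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks, so large data files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replace_with_stub(data_file, archive_ref):
    """Delete an archived data file and leave a small JSON stub next to it.

    Only call this after the backup has been verified.
    """
    data_file = Path(data_file)
    info = {
        "original_name": data_file.name,
        "size_bytes": data_file.stat().st_size,
        "sha256": sha256_of(data_file),
        "archive_ref": archive_ref,  # hypothetical pointer to the tape backup
    }
    stub = data_file.with_name(data_file.name + ".archived.json")
    stub.write_text(json.dumps(info, indent=2))
    data_file.unlink()
    return stub
```

Recording the size and checksum in the stub makes it possible to verify a file once it has been restored from tape.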
Galaxy (First Steps)
  “Powerful data analysis made easy and reproducible”
Galaxy is a web-based job management platform (NGS GB :: Data Analysis :: Galaxy)
  - Tools
  - History (active analysis)
  - Launch analysis jobs
  - Log in with your EMBL account
Finding your data
  - Select your group library
Run jobs
Jobs can be assembled into workflows
Apply workflows to each demultiplexed data set in one click
Each data set analysis is clearly identified
Galaxy Summary
  1. Galaxy is a job management / analysis platform
     - Run standard analyses (trimming, QC, mapping, peak calling, …)
     - Assemble workflows and perform parallel processing
  2. Jobs are sent to the new EMBL LSF cluster
     - We implement cluster good practices (copy to local /tmp, …)
     - Tools are available under BCR/SEPP
  3. Continuous update/addition of tools and indices
  4. Open source and very active project
  5. Galaxy uses the data from your NGS data library directly
  6. Easy transfer of results from Galaxy to your own disks
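The "copy to local /tmp" practice mentioned above means a job works on node-local copies of its files rather than reading and writing the shared file server repeatedly. A minimal sketch of the pattern follows; the function name and the `process` callback are hypothetical, and this is not the Galaxy job runner's actual code.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_local_scratch(input_file, output_file, process, scratch_root=None):
    """Copy the input to local scratch, process it there, then copy the result back.

    `process(local_input, local_output)` does the actual work.
    `scratch_root=None` uses the system temporary directory (usually /tmp).
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_in = Path(scratch) / Path(input_file).name
        local_out = Path(scratch) / ("out_" + Path(output_file).name)
        shutil.copy2(input_file, local_in)
        process(local_in, local_out)
        shutil.copy2(local_out, output_file)
```

The scratch directory is removed when the block exits, even if `process` raises, so failed jobs do not fill up the node's local disk.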
Conclusion
  There are absolutely no drawbacks in using our system, only benefits!
Thank you
  Joscha Sauer, Shu-yi Su, Laura O’Donovan, Matthias Monfort
  Alumni: Aziz Moussa, M. Chaturvedi, L-A Schmitt, Nicolas Delhomme, Leila Tlili, Arnaud Huaulme
  GeneCore: Jonathon Blake, Juergen Zimmermann, Markus Fritz, Vladimir Benes
  Eileen Furlong
  IT Services: Michael Wahlers, Andres Lindau
  All GB members, Chenchen Zhu, Simon Anders, Tobias Rausch
  Frank Thommen (CBB)