Published by Lindsay Williamson. Modified over 8 years ago.
1
GBCS ecosystem for NGS data: data management with emBASE, data analysis with Galaxy
2
The big picture (diagram: data, file servers). 1. emBASE (http://gbcs.embl.de/base/) is a database, with a web front-end, storing all metadata about your data files (e.g. FASTQ). NGS Data @ GB
3
The big picture. 2. Your data files remain on your group file server, in your “NGS data library”, and are directly accessible.
4
The big picture. Annotate data: sample description, protocol description. Manage data sets: link files to experiments/projects. Publish data to a public repository upon publication. Export to tape for long-term storage.
5
The big picture: GeneCore Online Ordering and GCBridge. Automated data transfer from the GC servers to emBASE avoids: file renaming, i.e. lack of data traceability; duplication of data files in several places (with different names!); unreliable or unknown storage places (your laptop…); data not being loaded in the system.
6
NGS ecosystem by GBCS. Diagram components: data file servers; GeneCore Online Ordering and GCBridge; IT LSF cluster (jobs run on the cluster); NGS analysis (build/store workflows); RStudio Server and GB servers (access files directly, fetch info with JemBASEAPI); SEPP libraries.
7
0. What is the “data”?
8
The typical user view of the “model”: Sample → Sequencing → File (FASTQ, BAM). Send my sample to sequencing; download the file; mail the bioinformatician where the file is. NGS Data @ GB : Data model
9
A more realistic view of the process: Sample (e.g. embryos, cells) → Extract (e.g. DNA, mRNA) → Library → Sequencing File (FASTQ, BAM). Each step carries protocols (growth, treatment, extraction, amplification, sequencing, …) and annotations. Annotations and protocols need to be controlled as much as possible.
10
A more realistic view of the process. Annotations and protocols need to be controlled as much as possible. Replicates need to be described properly (sample replicates vs library re-sequencing). These cases are not the same: (1) biological replicate: Sample1 → Extract1 → Library1 → Sequencing File1 and Sample2 → Extract2 → Library2 → Sequencing File2; (2) technical replicate: Sample1 → Extract1 → Library1 → Sequencing File1, plus Extract2 → Library2 → Sequencing File2 from the same Sample1; (3) library re-sequencing: Sample1 → Extract1 → Library1 → Sequencing File1, plus Sequencing File2 from the same Library1.
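The three replicate cases can be told apart from the chain of items behind each file. The sketch below is only an illustration of that idea, not emBASE code: the `SequencingFile` class and the `relation` function are hypothetical names, and it assumes both files belong to the same experimental condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequencingFile:
    """One sequencing file and the (hypothetical) chain of items behind it."""
    name: str
    sample: str
    extract: str
    library: str


def relation(a: SequencingFile, b: SequencingFile) -> str:
    """Classify how two files of the same condition relate to each other."""
    if a.sample != b.sample:
        # Different starting material: independent samples.
        return "biological replicate"
    if a.library != b.library:
        # Same sample, but a new extract and/or library was prepared.
        return "technical replicate"
    # Same library, sequenced again.
    return "library re-sequencing"


file1 = SequencingFile("File1", "Sample1", "Extract1", "Library1")
bio = SequencingFile("File2", "Sample2", "Extract2", "Library2")
tech = SequencingFile("File2", "Sample1", "Extract2", "Library2")
reseq = SequencingFile("File2", "Sample1", "Extract1", "Library1")

for other in (bio, tech, reseq):
    print(relation(file1, other))
```

Without the sample, extract and library recorded for each file, the three File2 variants above would be indistinguishable, which is why the chain has to be stored.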
11
A complete view of the situation. Samples are commonly multiplexed: Sample1 → Extract1 → Library1, Sample2 → Extract2 → Library2, Sample3 → Extract3 → Library3, Sample4 → Extract4 → Library4, … share one Sequencing File (FASTQ, BAM), described by a barcode info file. Projects are mixed in the same lane (Exp X / Project P, Exp Y / Project Q). Analysis: stored (meta)data must be readily accessible for analysis. Publish (e.g. EBI): model and vocabulary should match standards for final publishing.
12
1. emBASE: “Data management, organization, annotations and publication”
13
emBASE Items. The real-world chain (Sample, e.g. embryos, cells → Extract, e.g. DNA, mRNA → Library → Sequencing File (FASTQ, BAM), with protocols, annotations and a barcode info file) maps to emBASE items: Sample → Extract → Library → NGS Assay, with protocols and sample annotations. The NGS Assay holds the SeqLane file(s); each data set is a RawBioAssay (RawBioAssay1, RawBioAssay2, …) with its own file (BAM, FASTQ) and workflow.
14
emBASE: developed in house using BASE; initially a LIMS for arrays; has been running for 9 years now. NGS Data @ GB :: Data Management :: emBASE
15
emBASE Modules (http://gbcs.embl.de/base ; please request a login): controlled vocabulary; Sample, Extract, Libraries…; assays (NGS assays, microarrays, in situ images) grouped in Experiments and Projects.
16
emBASE NGS Assay List Page: lists all NGS Assays (an NGS Assay == a lane).
17
emBASE NGS Assay List Page: access rights are set for each assay (Unix-like).
18
Search NGS Assays: powerful search on all “list” pages; customizable table view; locate your assay and follow the link for details.
19
NGS Assay: example of a multiplexed lane. The page shows: lane file & location; sequencing run info; assay (= lane) info & rights; individual data sets & de-multiplexed files. Related raw data sets are grouped in “experiments”.
20
NGS Assay: example of a multiplexed lane. Links to libraries, i.e. samples.
21
Biomaterials
22
Sample Annotation. Sample Annotation Types (SATs): are typed (free text; number (int, float); pre-defined values (enum)); are owned; can be created as needed by authorized users, e.g. as required by ICGC.
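To make the idea of a typed annotation concrete, here is a minimal sketch of how a typed, owned annotation type could validate the values entered for it. It is not emBASE's implementation: the `AnnotationType` class, its field names and the example types are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnnotationType:
    """Hypothetical typed annotation: free text, number, or enum."""
    name: str
    kind: str  # one of "text", "int", "float", "enum"
    owner: str  # who owns this annotation type
    allowed: Optional[Tuple[str, ...]] = None  # only used for "enum"

    def parse(self, raw: str):
        """Convert a raw string to the declared type, or raise ValueError."""
        if self.kind == "text":
            return raw
        if self.kind == "int":
            return int(raw)
        if self.kind == "float":
            return float(raw)
        if self.kind == "enum":
            if self.allowed is None or raw not in self.allowed:
                raise ValueError(f"{raw!r} is not allowed for {self.name}")
            return raw
        raise ValueError(f"unknown kind {self.kind!r}")


# Example annotation types (names and values are made up).
tissue = AnnotationType("tissue", "enum", "group_a", ("embryo", "cell line"))
age = AnnotationType("age_hours", "float", "group_a")

print(tissue.parse("embryo"))
print(age.parse("2.5"))
```

Rejecting values at entry time is what keeps the vocabulary controlled: a misspelled enum value raises an error instead of silently creating a new category.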
23
Sample Annotation: select SATs (Sample Annotation Types).
24
Custom sample annotations: unlimited number of annotations; annotation types can be customized (per group).
25
Grouping data sets into Experiments: an experiment has a single ‘type’, e.g. ChIP-seq, RNA-seq.
26
Grouping data sets into Experiments: search raw data sets and add them to, or remove them from, an experiment.
27
New emBASE Project Layer: an experiment is tied to a single type (e.g. ChIP-seq, RNA-seq, iCLIP-seq); group related experiments into a project.
28
Wait a sec... Do we really have to fill in all these web forms?! NO! 1. GCBridge: all “items” are pre-created for you. 2. Protocols and sample annotations remain to be done.
29
2. Decentralized NGS File Data Lib: “Your data lives on your file server and is readily accessible”
30
NGS data Library. The NGS data library root folder can be anywhere you like. Sub-folders containing the FASTQ files are organized by “Sequencer Run”. Everything in your data library is managed by emBASE and is read-only, to avoid data deletion, renaming or moving. 1. emBASE is a database, with a web front-end, storing all metadata about your data files (e.g. FASTQ). 2. Your data files remain on your group file server, in your “NGS data library”, and are directly accessible.
31
NGS data Library, extended to better support demultiplexed files. Lane directory: one per (existing) lane; read-only.
32
NGS data Library, extended to better support demultiplexed files. Library directory: named after the immutable internal emBASE id; read-only.
33
NGS data Library, extended to better support demultiplexed files. Data file directory: one per file type; read-write until you lock it, then read-only.
34
Locking / Unlocking concept
1. Library file sub-directories are initially unlocked (writable for the group): you can work and replace files as you wish.
2. At some point, files are ready and directories can be locked (read-only). From this point on, emBASE tracks these files, and emBASE allows deletion of the lane file once all of its multiplexed libraries are locked.
3. Locking is done via the web interface, on the whole lane or per library (for shared lanes).
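The locking rules above can be expressed as a small state model. This is only a sketch of the stated rules, not emBASE code; the `Lane` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lane:
    """Hypothetical model of a lane and the lock state of its libraries."""
    name: str
    locked: Dict[str, bool] = field(default_factory=dict)

    def add_library(self, library_id: str) -> None:
        # Library sub-directories start unlocked (writable for the group).
        self.locked[library_id] = False

    def lock(self, library_id: str) -> None:
        # Per-library locking, e.g. for shared lanes.
        self.locked[library_id] = True

    def lock_all(self) -> None:
        # Locking the whole lane at once.
        for library_id in self.locked:
            self.locked[library_id] = True

    def is_writable(self, library_id: str) -> bool:
        return not self.locked[library_id]

    def lane_file_deletable(self) -> bool:
        # The lane file may go only when every multiplexed library is locked.
        return bool(self.locked) and all(self.locked.values())


lane = Lane("lane1")
lane.add_library("lib_a")
lane.add_library("lib_b")
lane.lock("lib_a")
print(lane.lane_file_deletable())  # one library is still unlocked
lane.lock("lib_b")
print(lane.lane_file_deletable())  # all libraries are now locked
```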
35
3. GC Bridge: “Ensuring smooth data transfer between GeneCore and emBASE”
36
GCBridge: making your life as easy as possible. 1. Transfer file (GeneCore Online Ordering → NGS Lib). NGS Data @ GB :: Automated Data Transfer
37
GCBridge: making your life as easy as possible. 1. Transfer file (GeneCore Online Ordering → NGS Lib). 2. Call GC Bridge (e-mail).
39
GCBridge: making your life as easy as possible. 1. Transfer file (GeneCore Online Ordering → NGS Lib). 2. Call GC Bridge (e-mail); info is fetched from the GC Db. The user gets an e-mail upon transfer completion, and another e-mail when demultiplexing has been performed.
40
3. Practical steps: validate the GCBridge Transfer Form; annotate samples and link protocols.
41
Data released e-mail: click the link to get to the GCBridge Transfer Form.
42
Single Library Form: lane file(s). The Bridge is connected to emBASE experiments.
43
Single Library Form
44
Single Library Form
45
Single Library Form. New entries are created by default (Sample1 → Extract1 → Library1 → NGSAssay). Sample names can instead be matched against existing samples or libraries; the search ignores the prefix. Matching an existing sample gives Sample1 → Extract2 → Library2 → NGSAssay, i.e. a technical replicate; matching an existing library gives Library1 → NGSAssay, i.e. library re-sequencing.
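The slide does not say what the ignored prefix looks like. As a sketch only, the example below assumes the prefix is everything up to the first underscore; `strip_prefix`, `find_match` and the item ids are hypothetical and do not reflect how GCBridge performs its search.

```python
from typing import Dict, Optional


def strip_prefix(name: str, separator: str = "_") -> str:
    """Drop the assumed prefix (text before the first separator), if any."""
    _head, sep, tail = name.partition(separator)
    return tail if sep and tail else name


def find_match(submitted: str, existing: Dict[str, str]) -> Optional[str]:
    """Return the id of the first existing item whose name matches."""
    key = strip_prefix(submitted)
    for item_id, item_name in existing.items():
        if strip_prefix(item_name) == key:
            return item_id
    return None  # no match: a new entry would be created by default


# Made-up existing samples, keyed by a made-up id.
existing_samples = {"S1": "labA_wt_embryo", "S2": "labA_mutant"}

print(find_match("run7_wt_embryo", existing_samples))
print(find_match("run7_unknown", existing_samples))
```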
46
Multiplexed Library Form: identical fields, plus multiplex-specific fields.
47
Multiplexed Library Form: tell us the number of libraries, so we can control submissions…
48
Easy demultiplexing, directly in the Data Lib. Request demultiplexing (runs on the cluster); it starts when the submission is complete. Jemultiplexer is emBASE-aware (i.e. it knows where files go in the Data Library). Jemultiplexer can also be (re)launched from the command line.
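For readers new to demultiplexing, the sketch below shows the basic idea of assigning reads to libraries by barcode. It is not how Jemultiplexer works: it assumes, for illustration only, that the barcode sits at the start of each read, that all barcodes have the same length and that only exact matches count. All names and sequences are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def demultiplex(reads: Iterable[Read],
                barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Bin reads per library using an exact match on the leading barcode."""
    length = len(next(iter(barcodes)))  # assumes equal-length barcodes
    bins: Dict[str, List[Read]] = defaultdict(list)
    for name, sequence in reads:
        library = barcodes.get(sequence[:length], "unassigned")
        # Store the read with its barcode clipped off.
        bins[library].append((name, sequence[length:]))
    return dict(bins)


barcode_info = {"ACGT": "Library1", "TTAG": "Library2"}
lane_reads = [("r1", "ACGTGGGG"), ("r2", "TTAGCCCC"), ("r3", "NNNNAAAA")]

for library, library_reads in demultiplex(lane_reads, barcode_info).items():
    print(library, library_reads)
```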
49
Easy selection of lane mates: select all lane-mates.
50
Re-use emBASE samples and libraries. Select search level: sample or library. Step-by-step tutorial at http://gbcs.embl.de (Quick Links).
51
Re-use emBASE samples and libraries. Select search level: sample or library. Select the appropriate items. Match levels can be mixed, which allows replicates to be modelled accurately (technical vs biological).
52
Re-use emBASE samples and libraries. Select search level: sample or library. Step-by-step tutorial at http://gbcs.embl.de (Quick Links).
53
Already demultiplexed samples
54
Automatic notification
55
Now what? 1. GCBridge: all “items” are pre-created for you. 2. Protocols and sample annotations remain to be done in emBASE.
56
Working in batch with emBASE. 1. Narrow your search to locate the wanted samples.
57
Working in batch with emBASE. 2. Select the ones you want, or all. N.B.: increase the number of items per page in the GUI settings if needed.
58
Working in batch with emBASE. 3. Associate protocols with, or change access rights for, all selected samples in one click.
59
Working in batch with emBASE. 4. Download the pre-filled Excel file for batch annotation.
60
Working in batch with emBASE. In the downloaded file: 1. Keep the columns you need. 2. Fill in your annotations in Excel. 3. Save back as text.
61
Working in batch with emBASE. 5. Batch (re)annotate your samples using this file.
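The "keep columns, fill in, save back as text" step can also be scripted instead of done by hand in Excel. The sketch below assumes the text file is tab-delimited with a header row; the column names (`Name`, `Tissue`, `Stage`, `Comment`) and the function `fill_annotations` are hypothetical, and the real export layout may differ.

```python
import csv
import io
from typing import Dict, List


def fill_annotations(exported_text: str, keep: List[str],
                     values: Dict[str, Dict[str, str]]) -> str:
    """Keep selected columns and fill in annotations per sample name."""
    reader = csv.DictReader(io.StringIO(exported_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        kept = {column: row[column] for column in keep}
        # Annotation columns must be among the kept ones,
        # otherwise DictWriter raises ValueError.
        kept.update(values.get(row["Name"], {}))
        writer.writerow(kept)
    return out.getvalue()


# A made-up pre-filled export with empty annotation columns.
exported = "Name\tTissue\tStage\tComment\nSample1\t\t\t\nSample2\t\t\t\n"
annotated = fill_annotations(exported, ["Name", "Tissue"],
                             {"Sample1": {"Tissue": "embryo"}})
print(annotated)
```

Keeping only the columns you intend to change limits the risk of overwriting other annotations when the file is loaded back.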
62
emBASE Advanced Features (for the command line user)
63
Working with emBASE. 1. Export experiment or project views using the web interface. 2. Use the new command-line emBASE API to learn where files are or should be placed; these commands extract all info from emBASE for a lane, an experiment or a project. Documentation at: http://gbcs.embl.de/
64
Concept: work as you like
65
Concept: work as you like. The database (samples, libs, RBAs, experiments, projects) links to the real files in the NGS Lib; pull info as needed.
66
Export Project View to disk
67
emBASE API Example: assume you want to discover all libraries and associated files in a given lane…
68
emBASE API Example: available from anywhere; the logged-in user is used to authenticate in emBASE; rights apply the same way as in emBASE.
69
emBASE API Example: create symlinks on the fly to the NGS data lib for all libs of a new lane.
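The actual emBASE API commands and their output format are not reproduced in this transcript, so the sketch below does not call them. It only illustrates the symlinking step, assuming the library-to-file mapping has already been obtained; the dictionary, the directory names and `link_lane_files` are hypothetical stand-ins.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def link_lane_files(files_by_library: Dict[str, List[Path]],
                    workdir: Path) -> List[Path]:
    """Create one sub-directory per library holding symlinks to its files."""
    links = []
    for library, files in files_by_library.items():
        target_dir = workdir / library
        target_dir.mkdir(parents=True, exist_ok=True)
        for source in files:
            link = target_dir / source.name
            if not link.is_symlink() and not link.exists():
                link.symlink_to(source)  # data stays in the data library
            links.append(link)
    return links


# Demo in a temporary directory, with a fake data library.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fastq = root / "data_lib" / "lane1" / "lib_42" / "reads.fastq"
    fastq.parent.mkdir(parents=True)
    fastq.write_text("@r1\nACGT\n+\nIIII\n")
    created = link_lane_files({"lib_42": [fastq]}, root / "analysis")
    print(created[0], "->", os.readlink(created[0]))
```

Symlinks leave the read-only originals untouched in the data library while giving each analysis its own working layout.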
70
Archiving of emBASE Data. Goal: save space by moving data offline when projects are finished. Fill in the options; the emBASE admin is then notified.
71
Archiving of emBASE Data: what happens next? All data files connected to the experiments are exported. IT performs the backup on tape. We delete ‘deletable’ files (concept of active experiment): emBASE knows which files can be deleted, which ones have been deleted and how to get them back, if needed; deleted files are replaced locally with a small file containing backup information. You can follow the archiving status in emBASE. This is a couple of clicks on your side, but remember that you still pay the bill!
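Replacing a deleted file with a small file holding backup information can be sketched as follows. This is an illustration of the concept only: the stub file name, the JSON layout and the `archive` field are hypothetical and not what emBASE actually writes.

```python
import json
import tempfile
from pathlib import Path


def replace_with_stub(data_file: Path, backup_info: dict) -> Path:
    """Write a small stub describing the backup, then delete the data file."""
    stub = data_file.with_name(data_file.name + ".archived.json")
    record = dict(backup_info)
    record["original_name"] = data_file.name
    record["original_size_bytes"] = data_file.stat().st_size
    # Write the stub first, so the backup information exists
    # before the data file is removed.
    stub.write_text(json.dumps(record, indent=2))
    data_file.unlink()
    return stub


with tempfile.TemporaryDirectory() as tmp:
    fastq = Path(tmp) / "reads.fastq"
    fastq.write_text("@r1\nACGT\n+\nIIII\n")
    stub_file = replace_with_stub(fastq, {"archive": "hypothetical-tape-001"})
    print(stub_file.name)
    print(stub_file.read_text())
```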
72
Galaxy (First Steps): “Powerful data analysis made easy and reproducible”
73
Galaxy is a web-based job management platform (http://gbcs.embl.de/galaxy/ ; log in with your EMBL account). Interface panels: Tools; Launch Analysis Jobs; History (active analysis). NGS Data @ GB :: Data Analysis :: Galaxy
74
Finding your data: select your group library.
75
Run jobs
76
Jobs can be assembled into workflows
77
Apply workflows to each demultiplexed data set in one click
78
Each data set analysis is well identified
79
Galaxy Summary
1. Galaxy is a job management / analysis platform: run standard analyses (trimming, QC, mapping, peak calling, …); assemble workflows and perform parallel processing.
2. Jobs are sent to the new LSF EMBL cluster: we implement cluster good practices (copy to local /tmp, …); tools are available under BCR/SEPP.
3. Continuous update/addition of tools & indices.
4. Open source and very active project.
NGS Data @ GB :: Data Analysis
81
Galaxy Summary
1. Galaxy is a job management / analysis platform: run standard analyses (trimming, QC, mapping, peak calling, …); assemble workflows and perform parallel processing.
2. Jobs are sent to the new LSF EMBL cluster: we implement cluster good practices (copy to local /tmp, …); tools are available under BCR/SEPP.
3. Continuous update/addition of tools & indices.
4. Galaxy uses the data from your NGS Data library directly.
5. Easy transfer of results from Galaxy to your own disks.
82
Conclusion: there are absolutely no drawbacks in using our system, only benefits!
83
Thank you: Joscha Sauer, Shu-yi Su, Laura O’Donovan, Matthias Monfort; Alumni: Aziz Moussa, M. Chaturvedi, L-A Schmitt, Nicolas Delhomme, Leila Tlili, Arnaud Huaulme; GeneCore: Jonathon Blake, Juergen Zimmermann, Markus Fritz, Vladimir Benes; Eileen Furlong; IT Services: Michael Wahlers, Andres Lindau; All GB members; Chenchen Zhu, Simon Anders, Tobias Rausch, Frank Thommen (CBB).