GBCS ecosystem for NGS data: data management with emBASE, data analysis with Galaxy


1 GBCS ecosystem for NGS data: data management with emBASE, data analysis with Galaxy

2 The big picture. NGS Data @ GB. Diagram labels: Data, File servers. 1. emBASE (http://gbcs.embl.de/base/) is a database with a web front-end that stores all metadata about your data files (e.g. fastq).

3 The big picture. 2. Your data files remain on your group file server, in your “NGS data library”, and are directly accessible.

4 The big picture. Annotate data: sample description, protocol description. Manage data sets: link files to experiments/projects. Publish data to a public repository: upon publication. Export to tape: long-term storage.

5 The big picture. GeneCore Online Ordering, GCBridge. Automated data transfer from GC servers to emBASE, to avoid: file renaming (i.e. lack of data traceability); duplication of data files in several places (with different names!); unreliable or unknown storage places (your laptop…); data not being loaded into the system.

6 NGS ecosystem by GBCS. Data, file servers; GeneCore Online Ordering; GCBridge; IT LSF cluster (jobs run on the cluster); NGS analysis (build/store workflows); R studio Server and GB servers (access files directly, fetch info with JemBASEAPI); SEPP libraries.

7 0. What is the “data”?

8 The typical user view of the “model” (NGS Data @ GB: data model). Sample → Sequencing → File (FASTQ, BAM). Send my sample to sequencing; download the file; mail the bioinformatician where the file is.

9 A more realistic view of the process. Sample (e.g. embryos, cells) → Extract (e.g. DNA, mRNA) → Library → Sequencing → File (FASTQ, BAM). Protocols: growth, treatment, extraction, amplification, sequencing, … Annotations. Annotations and protocols need to be controlled as much as possible (AMAP).

10 A more realistic view of the process. Annotations and protocols need to be controlled AMAP. Replicates need to be described properly (sample replicates vs library re-sequencing). The following three cases are not equivalent (≠):
Biol. rep.: Sample1 → Extract1 → Library1 → Sequencing File1, and Sample2 → Extract2 → Library2 → Sequencing File2.
Tech. rep.: Sample1 → Extract1 → Library1 → Sequencing File1, and from the same sample Extract2 → Library2 → Sequencing File2.
Library re-sequencing: Sample1 → Extract1 → Library1 → Sequencing File1, and from the same library Sequencing File2.
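
The three replicate cases on this slide can be told apart by walking up the Sample → Extract → Library chain. A minimal Python sketch of that idea follows; the class and function names are illustrative only and are not part of emBASE, and it assumes that two different samples are replicates of the same condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequencingFile:
    """One sequencing file and the chain of items it came from."""
    sample: str
    extract: str
    library: str
    path: str


def relation(a: SequencingFile, b: SequencingFile) -> str:
    """Classify how two sequencing files relate to each other."""
    if a.sample != b.sample:
        # Assumption: both samples belong to the same condition.
        return "biological replicate"
    if a.library != b.library:
        # Same sample, but a new extract and/or library was prepared.
        return "technical replicate"
    # Same library, sequenced a second time.
    return "library re-sequencing"


file1 = SequencingFile("Sample1", "Extract1", "Library1", "file1.fastq")
for other in (
    SequencingFile("Sample2", "Extract2", "Library2", "file2.fastq"),
    SequencingFile("Sample1", "Extract2", "Library2", "file2.fastq"),
    SequencingFile("Sample1", "Extract1", "Library1", "file2.fastq"),
):
    print(relation(file1, other))
```

Storing the full chain with every file is what makes this distinction recoverable later; a file name alone does not carry it.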

11 A complete view of the situation. Samples are commonly multiplexed: Sample1 → Extract1 → Library1, Sample2 → Extract2 → Library2, Sample3 → Extract3 → Library3, …, Sample4 → Extract4 → Library4; Sequencing; File (FASTQ, BAM); barcode info file. Projects are mixed in the same lane (Exp X / Project P, Exp Y / Project Q). Analysis: stored (meta)data must be readily accessible for analysis. Publish (e.g. EBI): model and vocabulary should match standards for final publishing.

12 1. emBASE: “Data management, organization, annotations and publication”

13 emBASE Items
Process: Sample1 → Extract1 → Library1 and Sample2 (e.g. embryos, cells) → Extract2 (e.g. DNA, mRNA) → Library2, with protocols and annotations; Sequencing; File (FASTQ, BAM); barcode info file.
In emBASE: Sample1 → Extract1 → Library1 and Sample2 → Extract2 → Library2, with protocols and sample annotations; NGS Assay + SeqLane file(s); RawBioAssay1 and RawBioAssay2, each + file (BAM, FASTQ); Workflow.

14 emBASE (NGS Data @ GB :: Data Management :: emBASE). Developed in house using BASE. Initially a LIMS for arrays. Has been running for 9 years.

15 emBASE Modules. http://gbcs.embl.de/base ; please request a login. Controlled vocabulary. Sample, Extract, Libraries… Assays grouped in Experiments and Projects: NGS Assays, Microarrays, In Situ Images.

16 emBASE NGS Assay List Page. Lists all NGS Assays (an assay == a lane).

17 emBASE NGS Assay List Page. Access rights are set for each assay (unix-like).

18 Search NGS Assays. Powerful search on all “list” pages. Customize the table view. Locate your assay and follow the link for details.

19 NGS Assay: example of a multiplexed lane. Lane file & location. Sequencing run info. Assay (= lane) info & rights. Related raw data sets are grouped in “experiments”. Individual data sets & de-multiplexed files.

20 NGS Assay: example of a multiplexed lane. Link to Libraries, i.e. Samples.

21 Biomaterials

22 Sample Annotation. Sample Annotation Types (SATs): are typed (free text; number: int, float; pre-defined values: enum); are owned; can be created as needed by authorized users, e.g. as required by ICGC.
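
To illustrate what “typed” annotations buy you, here is a minimal Python sketch of an annotation type that validates raw values against its declared kind. This is not emBASE code: the class, its fields and the example names (`tissue`, `group_a`, …) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class AnnotationType:
    """A typed, owned annotation definition (illustrative only)."""
    name: str
    kind: str                       # "text", "int", "float" or "enum"
    owner: str
    allowed: Tuple[str, ...] = ()   # only used when kind == "enum"

    def parse(self, raw: str) -> Union[str, int, float]:
        """Convert a raw string to the declared type, or raise ValueError."""
        if self.kind == "text":
            return raw
        if self.kind == "int":
            return int(raw)
        if self.kind == "float":
            return float(raw)
        if self.kind == "enum":
            if raw not in self.allowed:
                raise ValueError(f"{raw!r} is not one of {self.allowed}")
            return raw
        raise ValueError(f"unknown annotation kind {self.kind!r}")


tissue = AnnotationType("tissue", "enum", "group_a", ("embryo", "cell line"))
age = AnnotationType("age_hours", "int", "group_a")
print(tissue.parse("embryo"), age.parse("6"))
```

Rejecting a value at entry time is what keeps the vocabulary controlled; free-text fields would accept any spelling.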

23 Sample Annotation. Select SATs.

24 Custom sample annotations. Unlimited number of annotations. Annotation types can be customized (per group).

25 Grouping data sets into Experiments. An experiment has a single ‘type’, e.g. ChIP-seq, RNA-seq.

26 Grouping data sets into Experiments. Search raw data sets and add them to, or remove them from, an experiment.

27 New emBASE Project Layer. An experiment is tied to a single type (e.g. ChIP-seq, RNA-seq, iCLIP-seq). Group related experiments into a project.

28 Wait a sec... Do we really have to fill in all these web forms?! NO! 1. GCBridge: all “items” are pre-created for you. 2. Protocols and sample annotations remain to be done.

29 2. Decentralized NGS File Data Lib: “Your data lives on your file server and is readily accessible”

30 NGS data library. The NGS data library root folder can be anywhere you like. Sub-folders containing the fastq files are organized by “Sequencer Run”. Everything in your data library is managed by emBASE and is read-only, to avoid data deletion, renaming or moving.

31 NGS data library, extended to better support demultiplexed files. Lane directory: one per (existing) lane; read-only.

32 NGS data library, extended to better support demultiplexed files. Library directory (named after the immutable internal emBASE id); read-only.

33 NGS data library, extended to better support demultiplexed files. Data file directory, one per file type; read-write until you lock it, then read-only.
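
The nesting described on the last four slides (root, sequencer run, lane, library id, file type) can be pictured as a path. The Python sketch below only illustrates that nesting; the directory names are invented, and the actual naming scheme used by emBASE is not shown in these slides.

```python
from pathlib import PurePosixPath


def data_file_dir(root, run, lane, library_id, file_type):
    """Build root/run/lane/library_id/file_type (hypothetical naming)."""
    return PurePosixPath(root) / run / lane / str(library_id) / file_type


# All names below are made up for illustration.
print(data_file_dir("/g/mygroup/ngs_library", "run_0042", "lane3", 1234, "fastq"))
```

Naming the library directory after an immutable id, rather than a sample name, means the path stays valid when a sample is renamed in the database.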

34 Locking / Unlocking concept
1. Library file sub-directories are unlocked (writable for the group): you can work and replace files as you wish.
2. At some point, files are ready and directories can be locked (read-only):
   2.1 emBASE starts, at this point, to track these files.
   2.2 emBASE will allow lane file deletion when all of the lane's multiplexed libraries are locked.
3. Locking is done via the web interface, on the whole lane or per library (case of shared lanes).
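
The rule in point 2.2 is easy to state in code. The Python sketch below models only the bookkeeping (which libraries are locked, and whether the lane file may go); it is an illustration with invented names, not the emBASE implementation, and it does not touch file permissions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lane:
    """A multiplexed lane: library id -> locked flag (False = writable)."""
    libraries: Dict[str, bool] = field(default_factory=dict)

    def add_library(self, library_id: str) -> None:
        """New library directories start unlocked."""
        self.libraries[library_id] = False

    def lock(self, library_id: str) -> None:
        """Lock one library (the per-library case of shared lanes)."""
        if library_id not in self.libraries:
            raise KeyError(library_id)
        self.libraries[library_id] = True

    def lock_all(self) -> None:
        """Lock the whole lane at once."""
        for library_id in self.libraries:
            self.libraries[library_id] = True

    def lane_file_deletable(self) -> bool:
        """The lane file may be deleted only when every library is locked."""
        return bool(self.libraries) and all(self.libraries.values())


lane = Lane()
lane.add_library("lib_a")
lane.add_library("lib_b")
lane.lock("lib_a")
print(lane.lane_file_deletable())  # lib_b is still unlocked
lane.lock("lib_b")
print(lane.lane_file_deletable())
```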

35 3. GC Bridge: “Ensuring smooth data transfer between GeneCore and emBASE”

36 GCBridge: making your life as easy as possible (NGS Data @ GB :: Automated Data Transfer). GeneCore Online Ordering; NGS Lib. 1. Transfer file.

37 GCBridge: making your life as easy as possible. GeneCore Online Ordering; NGS Lib. 1. Transfer file. 2. Call GC Bridge. E-mail.

39 GCBridge: making your life as easy as possible. GeneCore Online Ordering; NGS Lib; fetch info from GC Db. 1. Transfer file. 2. Call GC Bridge. The user gets an e-mail upon transfer completion. The user gets an e-mail when demultiplexing has been performed.

40 3. Practical steps. Validate the GCBridge Transfer Form. Annotate samples, link protocols.

41 Data released e-mail. Click the link to get to the GCBridge Transfer Form.

42 Single Library Form. Lane file(s). The Bridge is connected to emBASE experiments.

43 Single Library Form

44 Single Library Form

45 Single Library Form. Sample names can be matched against existing Samples or Libraries. The search is performed ignoring the prefix. New entries are created by default: Sample1 → Extract1 → Library1 → NGSAssay. Matching an existing Sample adds Extract2 → Library2 → NGSAssay to it, i.e. a tech. replicate. Matching an existing Library adds a new NGSAssay to Library1, i.e. a lib. re-sequencing.
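
A prefix-insensitive name search could look like the Python sketch below. The slide does not say what the prefix is or how it is delimited, so this assumes, purely for illustration, that a prefix is everything up to the first underscore; the function names are invented as well.

```python
from typing import Iterable, List


def strip_prefix(name: str, sep: str = "_") -> str:
    """Drop everything up to the first separator (assumed prefix format)."""
    head, found, tail = name.partition(sep)
    return tail if found and tail else name


def match_existing(submitted: str, existing: Iterable[str]) -> List[str]:
    """Return existing names that equal the submitted one once prefixes are ignored."""
    key = strip_prefix(submitted)
    return [name for name in existing if strip_prefix(name) == key]


known = ["grpA_embryo-2h", "grpA_embryo-4h", "embryo-6h"]
print(match_existing("run12_embryo-2h", known))
```

An empty result corresponds to the default behaviour on the slide: a new entry is created.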

46 Multiplexed Library Form. Identical part; multiplex-specific part.

47 Multiplexed Library Form. Tell us the number of libraries, so we can control submissions…

48 Easy demultiplexing directly in the Data Lib. Request demultiplexing (runs on the cluster); it starts when the submission is complete. Jemultiplexer is emBASE-aware (i.e. it knows where files go in the Data Library). Jemultiplexer can also be (re)launched from the command line.
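
For readers new to the term, the Python sketch below shows what demultiplexing means in its simplest form: each read is assigned to a library by the barcode at its start. It is a toy illustration and not how Jemultiplexer works; it assumes exact barcode matches, a barcode at the 5' end of the read, and barcodes of one common length.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def demultiplex(reads: Iterable[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Split reads per library using exact matches on a leading barcode."""
    lengths = {len(barcode) for barcode in barcodes}
    if len(lengths) != 1:
        raise ValueError("this sketch expects barcodes of one common length")
    (bc_len,) = lengths
    bins: Dict[str, List[Read]] = defaultdict(list)
    for name, seq in reads:
        library = barcodes.get(seq[:bc_len])
        if library is None:
            bins["unassigned"].append((name, seq))         # keep the read untouched
        else:
            bins[library].append((name, seq[bc_len:]))     # clip the barcode
    return dict(bins)


lane_reads = [("r1", "ACGTTTTT"), ("r2", "TGCAGGGG"), ("r3", "NNNNCCCC")]
print(demultiplex(lane_reads, {"ACGT": "Library1", "TGCA": "Library2"}))
```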

49 Easy selection of lane mates. Select all lane-mates.

50 Re-use emBASE samples and libraries. Step-by-step tutorial at http://gbcs.embl.de (Quick Links). Select the search level: sample or library.

51 Re-use emBASE samples and libraries. Select the search level: sample or library. Select the appropriate items. Match levels can be mixed. This allows replicates to be modelled accurately (tech. vs biol.).

53 Already demultiplexed samples

54 Automatic notification

55 Now what? 1. GCBridge: all “items” are pre-created for you. 2. Protocols and sample annotations remain to be done in emBASE.

56 Working in batch with emBASE. 1. Narrow your search to locate the wanted samples.

57 Working in batch with emBASE. 2. Select the ones you want, or all. N.B.: increase the number of items per page in the GUI settings if needed.

58 Working in batch with emBASE. 3. Associate protocols or change access rights for all selected samples in one click.

59 Working in batch with emBASE. 4. Download a pre-filled Excel file for batch annotation.

60 Working in batch with emBASE. In that file: keep the columns you need; fill in your annotations in Excel; save it back as text.

61 Working in batch with emBASE. 5. Batch (re)annotate your samples using this file.
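
The Excel step can also be scripted. The Python sketch below keeps selected columns of a tab-delimited export and fills in annotation values per sample. It assumes the export is tab-delimited text with a header row and a `Name` column; the real column names of the emBASE export are not shown in these slides, so every name here is a placeholder.

```python
import csv
import io
from typing import Dict, List


def fill_annotations(tsv_text: str, keep: List[str],
                     values: Dict[str, Dict[str, str]], key: str = "Name") -> str:
    """Keep only the `keep` columns and fill in per-sample values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep, delimiter="\t",
                            lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        row.update(values.get(row[key], {}))   # leave unknown samples as they are
        writer.writerow(row)
    return out.getvalue()


exported = "Name\tTissue\tStage\nS1\t\t\nS2\t\t\n"
print(fill_annotations(exported, ["Name", "Tissue"], {"S1": {"Tissue": "embryo"}}))
```

The returned text can be saved to a file and used for the batch (re)annotation step.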

62 emBASE Advanced Features (for the command line user)

63 Working with emBASE. 1. Export experiment or project views using the web interface. 2. Use the new command line emBASE API to learn where files are or should be placed: these commands extract all info from emBASE for a lane, an experiment or a project. Documentation at: http://gbcs.embl.de/

64 Concept: work as you like

65 Concept: work as you like. NGS Lib: link real files. Database (samples, libs, RBAs, exp, project): pull info as needed.

66 Export Project View to disk

67 emBASE API Example. Assume you want to discover all libraries and associated files in a given lane…

68 emBASE API Example. Available from anywhere. The logged-in user is used to authenticate in emBASE. Rights apply the same way as in emBASE.

69 emBASE API Example. Example: create symlinks on the fly to the NGS data lib for all libs of a new lane.
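
The symlink idea can be sketched in Python as follows. The sketch starts from a plain mapping of library name to file path, which is assumed to have been obtained beforehand (for example from the emBASE command line API, whose actual commands and output format are not reproduced here). The function name and link naming are illustrative.

```python
import os
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def link_lane_files(files_by_library: Dict[str, PathLike], workdir: PathLike) -> List[Path]:
    """Create one symlink per library in workdir, pointing into the data library."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    links = []
    for library, target in sorted(files_by_library.items()):
        target = Path(os.path.abspath(target))       # absolute, so the link works from anywhere
        link = workdir / f"{library}__{target.name}"
        if not link.is_symlink():                    # safe to call twice
            link.symlink_to(target)
        links.append(link)
    return links
```

Usage would look like `link_lane_files({"Library1": "/path/in/data/lib/file.fastq"}, "analysis/lane_x")`, with both paths being placeholders. Linking rather than copying keeps a single copy of each file in the read-only data library.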

70 Archiving of emBASE Data. Goal: save space by moving data offline when projects are finished. Fill in the options; the emBASE admin is notified.

71 Archiving of emBASE Data. This is a couple of clicks on your side, but remember that you still pay the bill! What happens next?
All data files connected to the experiments are exported.
IT performs the backup on tape.
We delete ‘deletable’ files (concept of active experiment): emBASE knows which files can be deleted, which ones have been deleted and how to get them back if needed; deleted files are locally replaced with a small file containing backup information.
You can follow the archiving status in emBASE.
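
Replacing a deleted file with a small information file can be sketched as below. The content and name of the real emBASE placeholder file are not described in the slides, so the JSON fields, the `.archived.json` suffix and the `archive_ref` argument are all assumptions made for illustration. In practice the original would only be removed once the tape backup is confirmed.

```python
import hashlib
import json
from pathlib import Path


def replace_with_stub(path, archive_ref: str) -> Path:
    """Delete a data file and leave a small JSON stub describing its backup."""
    path = Path(path)
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large FASTQ/BAM files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    info = {
        "original_name": path.name,
        "size_bytes": size,
        "sha256": digest.hexdigest(),
        "archive_ref": archive_ref,   # hypothetical pointer to the tape backup
    }
    stub = path.with_name(path.name + ".archived.json")
    stub.write_text(json.dumps(info, indent=2))
    path.unlink()
    return stub
```

Keeping a checksum in the stub lets a restored file be verified against the original.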

72 Galaxy (First Steps): “Powerful data analysis made easy and reproducible”

73 Galaxy is a web-based job management platform (NGS Data @ GB :: Data Analysis :: Galaxy). Tools; History (active analysis); launch analysis jobs. http://gbcs.embl.de/galaxy/ : log in with your EMBL account.

74 Finding your data => select your group library

75 Run jobs

76 Jobs can be assembled into workflows

77 Apply workflows to each demultiplexed data set in one click

78 Each data set analysis is well identified

79 Galaxy Summary
1. Galaxy is a job management / analysis platform. Run standard analyses (trimming, QC, mapping, peak calling, …). Assemble workflows and perform parallel processing.
2. Jobs are sent to the new EMBL LSF cluster. We implement cluster good practices (copy to local /tmp, …). Tools are available under BCR/SEPP.
3. Continuous update/addition of tools & indices.
4. Open source and very active project.

81 Galaxy Summary
1. Galaxy is a job management / analysis platform. Run standard analyses (trimming, QC, mapping, peak calling, …). Assemble workflows and perform parallel processing.
2. Jobs are sent to the new EMBL LSF cluster. We implement cluster good practices (copy to local /tmp, …). Tools are available under BCR/SEPP.
3. Continuous update/addition of tools & indices.
4. Galaxy uses the data from your NGS Data library directly.
5. Easy transfer of results from Galaxy to your own disks.
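
The “copy to local /tmp” practice from point 2 can be sketched in Python: stage the input on scratch space, compute there, and copy only the result back to the shared file server. This is a generic illustration, not the Galaxy or LSF configuration used at EMBL; `run_staged` and the `work` callback are invented names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_staged(input_path, output_path, work: Callable[[Path, Path], None]) -> Path:
    """Run work(local_input, local_output) on scratch space, then copy the result back."""
    input_path, output_path = Path(input_path), Path(output_path)
    # TemporaryDirectory uses $TMPDIR if set, otherwise the system default (often /tmp).
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / ("in_" + input_path.name)
        local_out = Path(scratch) / ("out_" + output_path.name)
        shutil.copy2(input_path, local_in)     # one sequential read from shared storage
        work(local_in, local_out)              # all intermediate I/O stays local
        shutil.copy2(local_out, output_path)   # one sequential write back
    return output_path
```

The scratch directory is removed when the block exits, so nothing is left behind on the compute node.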

82 Conclusion. There are absolutely no drawbacks in using our system, only benefits!

83 Thank you. Joscha Sauer, Shu-yi Su, Laura O’Donovan, Matthias Monfort. Alumni: Aziz Moussa, M. Chaturvedi, L-A Schmitt, Nicolas Delhomme, Leila Tlili, Arnaud Huaulme. GeneCore: Jonathon Blake, Juergen Zimmermann, Markus Fritz, Vladimir Benes. Eileen Furlong. IT Services: Michael Wahlers, Andres Lindau. All GB members. Chenchen Zhu, Simon Anders, Tobias Rausch, Frank Thommen (CBB).

