Published by Trevor Richardson. Modified over 8 years ago.
1
Life science requirements from e-infrastructure: initial results from a joint BioMedBridges workshop
Rafael Jimenez (ELIXIR CTO, BioMedBridges)
Stephanie Suhr (BioMedBridges Project Manager)
2
BioMedBridges: Biomedical sciences research infrastructures stronger through common links
- FP7-funded cluster project
- 21 project partners in 9 countries
- Computational ‘data and service’ bridges between the BMS RIs
- Interoperability between data and services in the biological, medical, translational and clinical domains
3
Second BioMedBridges AGM: e-infrastructure advisory board meeting with BMS RI technical representatives
10-11 March 2014, Florence, Italy
4
BioMedBridges workshop: E-Infrastructure support for the life sciences: Preparing for the data deluge
15 May 2014, Genome Campus, Hinxton, UK
5
Knowledge exchange workshop
- Discussion of big data challenges in life sciences
- Focus on a few representative domains
- Looking 5 years ahead
- Jointly identify potential solutions to our problems
[Diagram labels: Data, ICT e-infrastructures, LS life sciences, Physical facilities, Scientific information, Transfer, Computation, Storage]
6
How does it affect data sharing in life sciences?
7
Large-scale data sharing in the life sciences http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
8
How does big data affect data sharing?
Source: http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
[Diagram labels: What, How, Where; Compute, Storage, Transfer]
10
Growing data
11
Cost of DNA sequencing
12
Data generation vs. data transfer
Data generated in 24 hours: ~100 GB, ~4 TB
[Figure labels: DNA sequencing, Mass spectrometry, Microscopy; Network, File, Transfer]

Network speed    ~100 GB     ~4 TB
1 Gb             ~30 min     ~9 hours
100 Mb           ~5 hours    ~4 days
10 Mb            ~2 days     ~5 weeks
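The transfer times on this slide can be sanity-checked with a back-of-the-envelope calculation. The sketch below computes an ideal lower bound only, and it rests on assumptions the slide does not state: decimal units (1 GB = 10^9 bytes), link speeds read as bits per second, and no protocol or disk overhead. The function names are illustrative.

```python
# Ideal (overhead-free) transfer time for a file over a network link.
# Assumptions not stated on the slide: decimal units (1 GB = 10**9 bytes)
# and link speeds interpreted as bits per second.

def transfer_seconds(size_bytes: float, link_bits_per_s: float) -> float:
    """Lower bound on transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_s


def human(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    units = (("weeks", 604800), ("days", 86400), ("hours", 3600), ("min", 60))
    for unit, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.0f} s"


SIZES = {"~100 GB": 100e9, "~4 TB": 4e12}
LINKS = {"1 Gbit/s": 1e9, "100 Mbit/s": 100e6, "10 Mbit/s": 10e6}

if __name__ == "__main__":
    for link_name, rate in LINKS.items():
        for size_name, size in SIZES.items():
            ideal = human(transfer_seconds(size, rate))
            print(f"{size_name} over {link_name}: at least {ideal}")
```

Under these assumptions the ~4 TB figures land close to the slide's values, while the ideal ~100 GB times are roughly half of what the slide shows. That would be consistent with the slide allowing for real-world overhead, although the slide does not say how its numbers were derived.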
13
Potential bottlenecks in life sciences
- Data production grows faster than storage
- Cost of data production technologies declines faster than cost of storage
- It takes longer to transfer the data than to produce it
14
Data growth: how to reduce the IT budget shortfall?
Source: http://www.eweek.com/
15
Data growth: how to reduce the IT budget shortfall?
Source: http://www.eweek.com/
Optimization
- Using technology more effectively
- Selecting relevant data
16
Potential solutions
- Storage: data compression; select what we store; evaluate data reproducibility and value of data
- Network: faster protocols; partitioning; network upgrade
- Computation: clouds; data close to computation
17
Data compression
- Efficient representation
- Capacity for controlled data reduction
- Efficient transformations
[Diagram labels: Tool chain, Precision, Compression, CRAM]
References:
- Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
- Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 2012, 1:2
- http://www.ebi.ac.uk/ena/about/cram_toolkit
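The idea behind reference-based compression, as named in the Fritz et al. title above, can be illustrated with a toy sketch: instead of storing every base of an aligned read, store where it aligns on a shared reference and only the bases that differ. This is not the CRAM file format or its toolkit API; the function names, the tuple layout and the restriction to ungapped alignments are all assumptions made for illustration.

```python
# Toy illustration of reference-based compression: a read is stored as its
# alignment position, its length, and only the bases that differ from the
# reference. This is NOT the CRAM format; all names here are hypothetical.
from typing import List, Tuple

# (position on reference, read length, [(offset in read, substituted base)])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, position: int) -> Encoded:
    """Encode an ungapped read aligned at `position` on `reference`."""
    if position < 0 or position + len(read) > len(reference):
        raise ValueError("read does not fit on the reference at this position")
    mismatches = [
        (offset, base)
        for offset, base in enumerate(read)
        if reference[position + offset] != base
    ]
    return position, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference slice plus stored mismatches."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"
    read = "CGTTAGACGA"  # aligns at position 5 with a single substitution
    encoded = encode_read(reference, read, 5)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

The fewer differences a read has from the reference, the less there is to store, which is why this approach suits resequencing data from well-characterised genomes.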
18
Data transfer optimization, e.g. getting more from available bandwidth
19
Data partitioning
- Organisation of data around biological concepts
- Indexing system around these concepts
- Support for requests for partitions along this index
- Reference-oriented indexing
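One way to picture reference-oriented indexing is an index keyed on genomic coordinates, so that a request for a region returns only the matching partition rather than the whole dataset. The sketch below is a minimal in-memory illustration under that assumption; the record layout and the `PartitionIndex` class are hypothetical and do not describe any particular archive's implementation.

```python
# Sketch of reference-oriented indexing: records are kept sorted by
# (chromosome, position), so a request for a region such as chr2:100-300
# becomes a binary search instead of a scan over all data.
# The record layout and class name are hypothetical.
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Record = Tuple[str, int, str]  # (chromosome, position, payload)


class PartitionIndex:
    def __init__(self, records: List[Record]) -> None:
        self._records = sorted(records, key=lambda r: (r[0], r[1]))
        self._keys = [(r[0], r[1]) for r in self._records]

    def fetch(self, chromosome: str, start: int, end: int) -> List[Record]:
        """Return records on `chromosome` with start <= position <= end."""
        lo = bisect_left(self._keys, (chromosome, start))
        hi = bisect_right(self._keys, (chromosome, end))
        return self._records[lo:hi]


if __name__ == "__main__":
    index = PartitionIndex([
        ("chr2", 250, "variant-c"),
        ("chr1", 50, "variant-a"),
        ("chr2", 120, "variant-b"),
        ("chr2", 400, "variant-d"),
    ])
    for record in index.fetch("chr2", 100, 300):
        print(record)
```

Only the requested slice needs to leave the repository, which ties partitioning back to the network bottleneck discussed earlier in the deck.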
20
What data is relevant?
21
Life sciences diversity
[Diagram labels: Genomes, Nucleotides, Transcripts, Proteins, Complexes, Pathways, Small molecules, Structures, Domains, Cells, Biobanks, Tissues and organs, Human populations, Therapies, Disease prevention, Early diagnosis, Human individuals]
22
Life sciences diversity
- Different communities
- Some similar requirements
- Not always the same solutions
[Diagram labels: Proteomics, Metabolomics, Clinical data, Genomics, Imaging]
23
Factors that can influence data availability
- Scientific (e.g. data reproducibility, uniqueness, value of processed and/or raw data)
- Financial (cost of data storage, transfer, reproduction)
- Technical (storage, network, computation…)
- Political (drivers, e.g. from funding bodies, large organisations, national interests)
- Social (data sharing mentality of the community in question)
- Legal/ethical/formal (requirements and constraints for data storage, transfer and access, e.g. the need to store data on German citizens in Germany; requirements from journal publishers, data management plans, etc.)
24
Some conclusions
- Opportunity for e-infrastructures to better understand BMS RI problems
- Identification of bottlenecks
- Discussion of some potential solutions
- Data growth will change how we do things today
- Different communities -> different models -> some common solutions
- Solutions have to come from use cases
- BMS RIs need to define their requirements better
- We need to use technology more efficiently
- The BMS community has to evaluate the practicality of storing everything
- Privacy issues make big data more challenging
- It is difficult to separate big data from computation
- Shortage of expertise in how to deal with scientific data and IT services
25
Thank you! Questions?
Special thanks to … Stephanie Suhr (BioMedBridges), Tom Hancocks (EMBL-EBI), Cath Brooksbank (EMBL-EBI)
26
Data deposition
27
Data submission
[Diagram labels: Submissions (raw data, processed data, metadata); Centralized database]
28
Data sharing
- The casual approach: ‘data on my disk and available to anyone who requests it’
- Submission to data repositories
29
Data submissions
[Diagram labels: Data repository, Journal submission, Journal request, Curator, reads, Data Management Plan submission, Data management]
30
Data sharing
- The casual approach: ‘data on my disk and available to anyone who requests it’
- Submission to data repositories
Will big data affect data deposition?
31
Data submissions
- How much data?
- How much available data?