Published by Annabella Lamb. Modified over 9 years ago.
1
Scientific Data Management - From the Lab to the Web
Semantic Data Management, Dagstuhl Seminar, 22-27 April 2012
José Manuel Gómez Pérez, iSOCO
www.wf4ever-project.org
2
Some facts
The data deluge
Source: IDC's The 2011 Digital Universe Study – Extracting Value from Chaos
» In 2010 the size of the digital universe exceeded 1 zettabyte (1 ZB = 1 trillion GB)
» 1.8 ZB in 2011
» 35 ZB expected in 2020
» 90% unstructured data
» 70% user-generated
» 75% resulting from data copying, merging, and transforming
» Metadata is the fastest growing data category
» Much of such data is dynamic, real-time, volatile
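As a rough check on these figures, growing from 1.8 ZB in 2011 to a projected 35 ZB in 2020 implies a compound annual growth rate of roughly 39%. The sketch below shows the arithmetic. The function name `compound_annual_growth` is illustrative, and constant year-on-year growth is an assumption of this sketch, not a claim made by the IDC study.

```python
def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end` after `years` years."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must all be positive")
    return (end / start) ** (1 / years) - 1


# Figures taken from the slide, in zettabytes (assumption: smooth growth in between).
rate = compound_annual_growth(start=1.8, end=35.0, years=2020 - 2011)
print(f"Implied growth: {rate:.0%} per year")
```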
3
Two main challenges
Dealing with dynamicity
» Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand
› First-class data citizens
» Challenge 2: Managing the lifecycle of data entities
› Preservation
› Evolution and versioning
› Decay
Both technical and social aspects involved
4
Workflows in the Scientific Method
The Research Lifecycle
Example: Genome-Wide Association Studies
[Figure labels: Experiment; Results (data); Scientific Interpretation; Background; Hypothesis; Assumptions; Input data; Method; Publication]
5
Workflow-based Science
What is a Scientific Workflow?
» A mechanism for coordinating the execution of services and linking together resources.
» The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving
Scientific workflows are at the core of scientific data management
› Enable automation
› Encourage best practices
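The idea of "a configurable, structured set of steps" can be illustrated with a minimal sketch in which a workflow is an ordered list of named steps and each step consumes the previous step's output. The names `Step` and `run_workflow` and the three example steps are hypothetical; real workflow systems add scheduling, remote services, and provenance capture that this sketch leaves out.

```python
from typing import Any, Callable

# A step is a (name, function) pair; a workflow is an ordered list of steps.
Step = tuple[str, Callable[[Any], Any]]


def run_workflow(steps: list[Step], data: Any) -> tuple[Any, list[tuple[str, Any]]]:
    """Run each step on the previous step's output and record every intermediate result."""
    trace = []
    for name, step in steps:
        data = step(data)
        trace.append((name, data))
    return data, trace


# Hypothetical three-step workflow over a list of measurements.
steps: list[Step] = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("square", lambda xs: [x * x for x in xs]),
    ("total", sum),
]

result, trace = run_workflow(steps, [1, None, 2, 3])
print(result)
for name, value in trace:
    print(name, value)
```

Because the steps are plain data, reconfiguring the workflow means editing the list, and the recorded trace is a very simple stand-in for workflow provenance.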
6
Challenge 1
Identifying and structuring the relevant portions of the data for the task at hand
First-class data citizens
7
Questions for Scientific Data and Workflows | Issues
Who are you? Where and when were you born? Who were your parents (creators)? | Identity and Description, Authenticity, Uniqueness
For which purpose were you conceived and used? | Reuse, Repurpose
What do you have inside? | Inspection, Visualization, Annotations
How is your content linked? | Graphical Representation
May I access all your parts? | Access Rights
Which parts can I replace? | Adaptability
What have they done to you? Who and when? Why did they do that? | Provenance, Versioning
Why have you been recommended to me? Can I believe what you are saying or trust your results? | Information Quality
Do you still produce the same results? | Reproducibility
Are you still working? How could I repair you? | Completeness, Stability
How could I thank you? How could I talk about you? | Credit
8
Research Objects as Technical Objects
Challenge 1: Identifying and structuring the relevant data
Carriers of Research Context
» Referenceable
» Aggregation, Dispersed
› Heterogeneous
› Local and External
» Annotated metadata
› Provenance
› Structured: Manifests, Recipes, Permissions, Discourse
» Lifecycle
› Publishing, Evolution
› Versioning
» Mixed Stewardship
› Graceful Degradation
» Sharing
» Security & Privacy
» Stereotypical User Profiles
» Services
[Figure labels: Distributed Third Party Tenancy Alien Store; Technical Objects; Social Objects; OAI-ORE]
9
Research Objects as Social Objects
Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit
10
Research Object model core (simplified)
http://purl.org/wf4ever/ro#
[Figure labels: ro:Resource; ro:ResearchObject; ro:Manifest; ro:AggregatedAnnotation; ore:aggregates; ro:annotatesAggregatedResource; wfdesc:Workflow; ore:isDescribedBy]
Note: This figure shows a simplified view of the RO core.
RO specification: http://wf4ever.github.com/ro
› ro (aggregation and annotation)
› wfdesc (workflow description)
› Minim * (minimum info model)
› wfprov (workflow provenance)
› roprov (RO provenance)
› roevo (evolution model)
* Minim based on M. Gamble’s MIM
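To make the aggregation-and-annotation core more concrete, here is a minimal sketch that builds a Research Object in memory and lists the statements it implies, using only the class and property names shown on the slide plus the standard `rdf:type`. How the terms connect is an assumption read off the figure labels, not a statement of the RO specification. The Python class, its methods, and all `example.org` URIs are hypothetical.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]


@dataclass
class ResearchObject:
    uri: str
    manifest_uri: str
    resources: list[str] = field(default_factory=list)
    annotations: list[tuple[str, str]] = field(default_factory=list)  # (annotation, target)

    def aggregate(self, resource_uri: str) -> None:
        """Add a resource to the aggregation (idempotent)."""
        if resource_uri not in self.resources:
            self.resources.append(resource_uri)

    def annotate(self, annotation_uri: str, target_uri: str) -> None:
        """Attach an annotation to a resource that is already aggregated."""
        if target_uri not in self.resources:
            raise ValueError(f"{target_uri} is not aggregated by {self.uri}")
        self.annotations.append((annotation_uri, target_uri))

    def triples(self) -> list[Triple]:
        """Statements implied by the aggregation (assumed reading of the figure)."""
        out: list[Triple] = [
            (self.uri, "rdf:type", "ro:ResearchObject"),
            (self.uri, "ore:isDescribedBy", self.manifest_uri),
            (self.manifest_uri, "rdf:type", "ro:Manifest"),
        ]
        for resource in self.resources:
            out.append((self.uri, "ore:aggregates", resource))
            out.append((resource, "rdf:type", "ro:Resource"))
        for annotation, target in self.annotations:
            out.append((self.uri, "ore:aggregates", annotation))
            out.append((annotation, "rdf:type", "ro:AggregatedAnnotation"))
            out.append((annotation, "ro:annotatesAggregatedResource", target))
        return out


# Hypothetical example URIs.
ro = ResearchObject("http://example.org/ro/1/", "http://example.org/ro/1/manifest")
ro.aggregate("http://example.org/ro/1/workflow.t2flow")
ro.annotate("http://example.org/ro/1/ann1", "http://example.org/ro/1/workflow.t2flow")
for triple in ro.triples():
    print(triple)
```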
11
Challenge 2
Managing the lifecycle of data entities
Evolution and Decay
12
RO Evolution & Versioning
Challenge 2: Managing the lifecycle of data entities
13
RO Decay
Challenge 2: Managing the lifecycle of data entities
Workflow Decay
» Component level flux/decay/unavailability
» Data level
» Infrastructure level
Experiment Decay
» Methodological changes
» New technologies
» New resources/components
» New data
14
Preservation, Conservation, Recreating
Preserving: Archived Record, Fixed Snapshots, Review, Rerun & Replay
Conserving: Active Instrument, Live, Rerun & Reuse, Repair & Restore
Recreating: Archived Record, Active Instrument, Live, Rebuild, Recycle, Repurpose
15
Possible types of decay (an example)
Challenge 2: Managing the lifecycle of data entities
16
A Taxonomy of RO decay
Decay Analysis
1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
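One way to make such a taxonomy usable by monitoring tools is to encode it as an enumeration and map simple observations onto it. The sketch below numbers the members as in the list above; the member names, the `diagnose_service` helper, and its three boolean inputs are hypothetical and cover only decay types 2 to 4.

```python
from enum import IntEnum
from typing import Optional


class DecayType(IntEnum):
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABILITY = 6
    DATA_FORMAT_CHANGES = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


def diagnose_service(descriptor_found: bool, reachable: bool,
                     output_as_expected: bool) -> Optional[DecayType]:
    """Map three observations about one service to a decay type, or None if healthy."""
    if not descriptor_found:
        return DecayType.SERVICE_DESCRIPTOR_DISAPPEARED
    if not reachable:
        return DecayType.SERVICE_NOT_CONTACTABLE
    if not output_as_expected:
        return DecayType.SERVICE_FUNCTIONALITY_CHANGED
    return None


finding = diagnose_service(descriptor_found=True, reachable=False, output_as_expected=True)
print(finding.name if finding else "no decay detected")
```

The checks run from the most basic failure to the most subtle one, so a service whose descriptor is gone is never reported as merely unreachable.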
17
A taxonomy of workflow decay
Sample decay type
18
1.0 Certificate – Evaluation of Stability and Completeness
Decay Analysis
Stability: Is the RO free from any form of decay preventing workflow execution?
» Focus on reproducibility
» Assisted detection of RO decay
» Active monitoring on decay forms
» RO and workflow provenance
Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent?
» RO checklists
» Produced by scientists
» Automatically checked against minimal model (minim)
» RO evolution
1.0 Certificate of quality
» Notification
» Explanation
1.0 Certificate notion originally proposed by Yde de Jong
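The two questions behind the certificate can be sketched as one evaluation that combines a checklist of required resources (completeness) with a list of detected decay findings (stability) and returns an explanation alongside the verdict. This is only an illustration under stated assumptions: the `Certificate` class, the `evaluate` function, and the resource kinds used in the example are hypothetical and do not reproduce the Minim model.

```python
from dataclasses import dataclass


@dataclass
class Certificate:
    stable: bool             # no decay that prevents workflow execution
    complete: bool           # every required kind of resource is aggregated
    explanation: list[str]   # human-readable reasons for a failed check

    @property
    def granted(self) -> bool:
        return self.stable and self.complete


def evaluate(aggregated_kinds: set[str], required_kinds: set[str],
             decay_findings: list[str]) -> Certificate:
    """Check a checklist of required resource kinds and a list of decay findings."""
    missing = sorted(required_kinds - aggregated_kinds)
    explanation = [f"decay detected: {finding}" for finding in decay_findings]
    explanation += [f"missing required resource: {kind}" for kind in missing]
    return Certificate(stable=not decay_findings, complete=not missing,
                       explanation=explanation)


# Hypothetical checklist and RO contents.
required = {"workflow", "input data", "hypothesis"}
aggregated = {"workflow", "input data"}
cert = evaluate(aggregated, required, decay_findings=["service up but not contactable"])
print(cert.granted)
for line in cert.explanation:
    print(line)
```

Returning the reasons together with the verdict mirrors the slide's pairing of notification with explanation.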
19
Lessons learnt
Recap
» Data with a Purpose
» Encapsulate & Conquer
› Goal-driven (purpose)
› Aggregation
› Community-managed
» Nothing is immutable, especially data.
› Foster evolution
› Monitor decay
Scalability
Provenance
20
Questions
Thanks for your Attention! Any Questions?
http://www.wf4ever-project.org/