Brief: Data Science Progress/Activities and Renewal Plans
DCO Executive Committee, Oct. 8-9, Rome (IT)
DCO-DS = DCO Data Science
Since March (Intl. Science Mtg.)
DCO Statistics (now):
●Over 5,500 people across 698 organizations
●Over 2,100 publications (548 via DCO^)
●Over 216 projects, field studies (76), equipment (43),...
●Over 1,995 research topics
●Over 155 datasets (-24 Igor, +)
●DCO Data Types …
●Even more “objects”, reports,...
DCO Statistics:
●Over 4,700 people across 567 organizations
●Over 1,400 publications*
●Over 210 projects including field studies
●Over 1,600 research topics
●Over 160 datasets
●Over 590 research locations
●Over objects
Aug. 2014
DCO Context: Virtual Observatory and Virtual Organization
Linking the resources!
Deep Carbon Observatory Online
Peter Fox and Janet Kozyra, 2015, eScience and Informatics for international science programs, Progress in Earth and Planetary Science, 2:12, pp. 9. doi: /s
DCO Knowledge Graph: Refactoring and Resolving
●The DCO Ontology is an important contribution to the scientific ontology ecosystem; it organizes scientific knowledge and unlocks the “data” about DCO
●Meets specific ontology best practices
●Recognized opportunities for ontology reuse, esp...
○representing datasets using the W3C Data Catalog Vocabulary (DCAT)
○incorporating provenance into DCO using the W3C Provenance Ontology (PROV-O)
●Clarified labels and descriptions of all concepts and relationships
●Added annotations
●Made DCO Ontology browsable and resolvable via content negotiation
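A minimal sketch of what “resolvable via content negotiation” can mean in practice: the server reads the HTTP Accept header and picks the serialization of an ontology term that the client prefers. The list of available serializations and the function names are assumptions made for illustration, not the DCO implementation. A production server would typically answer 406 when nothing matches instead of falling back to a default.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


# Assumption: hypothetical serializations offered for an ontology term.
AVAILABLE = ["text/html", "text/turtle", "application/rdf+xml"]


def negotiate(header, available=AVAILABLE, default="text/html"):
    """Return the media type to serve for the given Accept header."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/*" -> "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return default  # simplification: fall back rather than reject
```

With this sketch, a browser sending `text/html,application/xhtml+xml,*/*;q=0.8` gets the browsable HTML page, while a client sending `text/turtle` gets the machine-readable form of the same term.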
New Faceted Search Interface
●Implementing for: People, Publications, Projects, Field Studies, Datasets, DataTypes
●Replaces slower, prototype faceted browser
●Using open source platform “ElasticSearch”
●Provides faster text-based searching (based on inverted indices)
●Faster and easier to develop and maintain
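To illustrate why an inverted index makes text search fast, here is a small pure-Python sketch: each term maps directly to the set of records that contain it, so a query is a set intersection instead of a scan over every record. The records, field names and function names are hypothetical, and this is not how ElasticSearch is implemented internally; it only shows the idea, plus the per-facet counts a faceted interface displays.

```python
import re
from collections import Counter, defaultdict

# Assumption: hypothetical records standing in for portal entries.
RECORDS = [
    {"id": 1, "type": "Publication", "community": "EPC",
     "text": "Thermodynamic data for carbon-bearing fluids"},
    {"id": 2, "type": "Dataset", "community": "EPC",
     "text": "Enthalpy and entropy data rescued from publications"},
    {"id": 3, "type": "Project", "community": "Deep Life",
     "text": "Carbon cycling in subsurface microbial communities"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each term to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for record in records:
        for term in tokenize(record["text"]):
            index[term].add(record["id"])
    return index


def search(index, records, query, **facets):
    """Return records matching every query term and every facet value."""
    terms = tokenize(query)
    if not terms:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [r for r in records
            if r["id"] in ids
            and all(r.get(k) == v for k, v in facets.items())]


def facet_counts(hits, field):
    """Count hits per value of one facet field, e.g. 'type'."""
    return Counter(r[field] for r in hits)


index = build_index(RECORDS)
hits = search(index, RECORDS, "data")
```

Here `facet_counts(hits, "type")` gives the numbers shown next to each facet value, and `search(index, RECORDS, "carbon", community="EPC")` narrows a text query by a selected facet.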
DCO Knowledge Graph Analytics
1.Identified key areas of DCO for analysis and visualization, initially:
○Publications and publication keywords
○User registrations
○DCO Member areas of expertise
2.Implemented simple visualizations using open source visualization libraries
3.Generated dynamically via direct queries to DCO Knowledge Graph
4.What would you like to see?
DCO Knowledge Graph Analytics: Publication Subject Area Word Cloud
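A word cloud of this kind reduces to counting how often each keyword occurs and scaling the counts to font sizes. The sketch below assumes the keyword lists have already been retrieved from the knowledge graph; the sample keywords, size range and function name are hypothetical.

```python
from collections import Counter

# Assumption: hypothetical keyword lists, one list per publication.
PUBLICATION_KEYWORDS = [
    ["carbonate", "mantle", "high pressure"],
    ["Mantle", "diamond"],
    ["carbonate ", "diamond", "mantle"],
]


def keyword_weights(keyword_lists, min_size=10, max_size=40):
    """Map each normalized keyword to a font size scaled by its frequency."""
    counts = Counter(
        kw.strip().lower()
        for keywords in keyword_lists
        for kw in keywords
        if kw.strip()
    )
    if not counts:
        return {}
    top = max(counts.values())
    # Linear scaling: the most frequent keyword gets max_size.
    return {
        kw: round(min_size + (max_size - min_size) * n / top)
        for kw, n in counts.items()
    }


weights = keyword_weights(PUBLICATION_KEYWORDS)
```

Normalizing case and whitespace before counting keeps variants such as “Mantle” and “mantle” from appearing as separate words in the cloud.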
DCO-DS Boundary Activities
EPC: Thermodynamic Data Rescue
●Many geoscience publications contain datasets that exist only within the publication text and are not available outside it
●Extracting, organizing, and reusing these datasets is valuable
●The Data Science Team and Extreme Physics and Chemistry community member Mark Ghiorso identified thermodynamic datasets about the enthalpy and entropy of chemicals
Thermodynamic Data Rescue
●Method for extracting ‘dark data’ in publications
○Locate and download journal article (PDF document)
○Generate metadata about material, experiment and results
○Tabulate results from document and run OCR over it to generate data
○Generate candidate dataset using OCR software
○Deposit data into data repository; link to original document
(Workflow diagram labels: DCO Knowledge Store; DCO Data Repository; Data Review and Evaluation)
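The step from OCR output to a candidate dataset can be sketched as follows. This is an illustration under stated assumptions, not the team's tooling: the three-column layout (phase, enthalpy, entropy), the placeholder values, the source identifier and the function names are all hypothetical. Rows that do not parse cleanly are set aside for human review instead of being silently corrected.

```python
import csv
import io

# Assumption: hypothetical OCR output with placeholder names and values.
# The second row contains a typical OCR error: letter "O" instead of zero.
OCR_TEXT = """\
phase-A  -1234.5  41.2
phase-B  -987.O   52.7
phase-C  -456.0   63.1
"""


def parse_ocr_table(text):
    """Split OCR text into parsed rows and lines that need manual review."""
    rows, needs_review = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            # Wrong column count and non-numeric values both raise ValueError.
            name, enthalpy, entropy = fields
            rows.append({
                "phase": name,
                "enthalpy": float(enthalpy),
                "entropy": float(entropy),
            })
        except ValueError:
            needs_review.append((line_no, line))
    return rows, needs_review


def to_csv(rows, source):
    """Serialize parsed rows as CSV, citing the originating document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["phase", "enthalpy", "entropy", "source"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "source": source})
    return buffer.getvalue()


rows, needs_review = parse_ocr_table(OCR_TEXT)
candidate = to_csv(rows, "hypothetical-source-id")
```

Keeping the source identifier in every row preserves the link back to the original document once the file is deposited.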
Implementation of the data rescue workflow
●The data rescue work for each paper has a card
●Move the card to the next step when the task of the previous step is done
●Members can communicate within a card and paste links to relevant resources
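The card-based workflow amounts to a simple linear state machine: one card per paper, one column per step, and a card moves forward only when its current step is done. The sketch below models that idea; the class and method names are hypothetical and do not refer to any particular card-board tool.

```python
from dataclasses import dataclass, field

# Steps taken from the data rescue method, in order.
STEPS = [
    "Locate and download article",
    "Generate metadata",
    "Tabulate results and run OCR",
    "Generate candidate dataset",
    "Deposit data and link to document",
]


@dataclass
class Card:
    """One card per paper, tracking its current step and discussion."""
    paper: str
    step: int = 0
    comments: list = field(default_factory=list)

    @property
    def column(self):
        """Name of the step the card currently sits in."""
        return STEPS[self.step]

    def comment(self, author, text):
        """Record a note or a pasted link on the card."""
        self.comments.append((author, text))

    def advance(self):
        """Move the card to the next step once the current task is done."""
        if self.step == len(STEPS) - 1:
            raise ValueError("card is already in the final step")
        self.step += 1
        return self.column


card = Card("hypothetical paper")
card.comment("member", "link to the downloaded PDF")
```

Calling `card.advance()` after each completed task walks the card through the columns in order, and the comment list keeps the discussion attached to the paper it concerns.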
Thermodynamic Data Rescue: Output
●New datasets available via dataset browser
●Includes citations to the originating publication
●Data files accessible through dataset repository
●Replicable to other Communities, e.g. R&F
DCO-DS Evaluation Form as key input to DCO-DS renewal
●Focused on the evaluation of the Deep Carbon virtual Observatory (DCvO)
●Evaluation questions will help determine DCvO's role in
○Increasing members, activity and awareness of DCO activities
○Enabling search, access, exchange and use of data & information for DCO scientific and educational needs
○Addressing needs to further integrate with DCO Members' essential technologies
●Phased roll-out to begin early Oct
○Wave 1: Executive Committee, Secretariat, Community leads, selected others
○Wave 2: DCO SSCs, Engagement
○Waves 3, 4, 5, 6: DCO Communities
Current work (examples)
●DCO Data Legacy preparation (see later on the agenda)
●DCO data registration from all DCO projects, by DCO data curator hosted at LDEO, funded by secretariat
●DCO project reporting
●Deep Time Data Infrastructure (Keck)
Current Work: Geo Sample curation and IGSN
●Have GeoSample as a class in DCO ontology and collect the core metadata items for sample registration in the DCO data portal
●Interface between the DCO IGSN Allocation Agent and the IGSN registry agent, with two potential functionalities:
○Assign IGSN to a sample record through the DCO data portal in collaboration with UT funded activity
○Use IGSN to import sample records from existing repositories to the DCO data portal, if there is a mature IGSN metadata API
Future Work: Instrument Reporting and Browsing*
●Progress to-date:
○Reporting on DCO-funded Instrument use by Projects and Field Studies
○Referencing DCO Instrument use within Grant Summary Reports
■within Instrument grants and related project/field study grants
●Future work: The Instrument Browser
○Dynamically generated instrument list and instrument summary page
○A faceted search interface for instruments
○Instrument discovery based on nature of use, data collected, projects and point of contact
* Outcome from the DCO Data Science day at RPI in 2014!!!
Future Work: Deep Carbon Science Trend Analysis
●Natural Language Processing (NLP) based analysis of Deep Carbon publication corpus
○Extracts entities and relations from the corpus
○Constructs a Deep Carbon Knowledge Base consisting of unified entities and relations
○Provides structured knowledge for downstream applications and analysis
●Includes retrieval of authoritative metadata into DCO Knowledge Graph
●Includes Deep Carbon Science Visualization Dashboard
Future Work: Leveraging existing data resources
●Interface between DCO Data Portal and other data repositories: key part of post-2019 efforts (e.g. Spring 2015 effort with CoDL/MBL)
●Incorporate specific metadata requirements into the DCO Knowledge Store
●Extend DCO Ontology for incorporation of other repository data, and/or utilize existing schema
●Provide data in a variety of formats for use by non-specialists
●Populate the metadata and data repository for DCO projects that do not already have their own portal
●Disseminate template data management plan for new projects
Future Work: Continue Infrastructure Evolution
●Better integration between Community Portal and Data Portal
●Easier data entry for key concepts (project updates, datasets, publications, etc.)
●DevOps enhancements for easier and faster deployment of infrastructure updates
●Improve new user onboarding
●Improve usability of Data Portal
●Create annotations for representation of evolving deep carbon concepts and relationships
Expected work in renewal
●To lead into 2019: a technology refresh for major platform components for the DCO network, and a “network” succession plan
●Prioritized efforts based on evaluations (Oct-Dec)
●Inputs from DCO synthesis discussions and post-2019 committees/task groups
●Significant efforts on data registration and data legacies
●Complete key boundary activities
●Two years or 3.5?
●Draft in December
Your inputs are essential!