Data Knowledge Base Prototype for Collaborative Scientific Research M. Grigorieva (National Research Center "Kurchatov Institute"), M. Gubin (Tomsk Polytechnic.

Presentation transcript:

Data Knowledge Base Prototype for Collaborative Scientific Research
M. Grigorieva (National Research Center "Kurchatov Institute"), M. Gubin (Tomsk Polytechnic University), A. Alexeev (Tomsk Polytechnic University), M. Golosova (National Research Center "Kurchatov Institute"), A. Klimentov (Brookhaven National Lab, National Research Center "Kurchatov Institute"), V. Osipova (Tomsk Polytechnic University)

Thanks
This talk drew on presentations, discussions, comments and input from many. Thanks to all, including those I’ve missed.
D. Golubkov, D. Krasnopevtsev and D. Laykom
Special thanks go to Torre Wenaus, who initiated this work, for his ideas about Data Knowledge Base content and design.
This work was funded in part by the Russian Ministry of Science and Education under contract #14.Z and the Russian Foundation for Basic Research under contract #

Outline
– Data Knowledge Base highlights
– Sources of metadata
– DKB architecture and prototype
– Data analysis ontology
– Summary and next steps

Data Knowledge Base Role in Science
The Data Knowledge Base (DKB) is the intelligence behind GUIs and APIs that aggregates and synthesizes a range of primary metadata sources and enhances them with flexible, schema-less addition of aggregated data.
One of the main goals is to equip the scientific community with a knowledge-based infrastructure that provides fast access to relevant scientific information, which is currently scattered across different services for each experiment.
The DKB should be capable of automatically acquiring knowledge from miscellaneous, incoherent and distributed sources, including archives of scientific papers, research group wiki pages, tutorials, and conference and workshop information, and of linking this information with well-structured technical data about the experiment (datasets, analysis code, and metadata on all signal and background samples used).
The DKB should provide a coherent and integrated view of the experiment life cycle.
Possible DKB applications:
– Assisting scientists in customizing their experimentation environments
– Preserving the data analysis process and reproducing analysis results (e.g. for collaborators outside the original team). “Often the only people that can realistically reproduce the results are those that were part of the original analysis team. This poses a potential problem for long-term preservation, in the case that people take on different responsibilities in the collaboration or even leave the field entirely” (K. Cranmer, L. Heinrich, R. Jones, D. South. Analysis Preservation in ATLAS // Journal of Physics: Conference Series 664 (2015))
– Preventing the deletion of datasets used in publications during the analysis and journal review periods
– Discovering similar or related datasets that have a high probability of containing the data required for a specific purpose, ranked by that probability

Metadata Sources [ATLAS as an example]
– Data processing:
  Rucio (Distributed Data Management System)
  Production System:
    » DEFT [Database Engine For Tasks]
    » JEDI [Job Execution and Definition Interface]
  JIRA ITS (Issue Tracking Service)
  Analysis code repositories (ATLAS policy required all analysis code to be checked into version control systems to preserve it for later reference)
  Google Docs (dataset lists)
  ATLAS virtual machine images (preserving exact software and hardware configurations)
– Scientific analysis:
  Indico (management of complex conferences, workshops and meetings)
  CERN Document Server
  CERN TWiki
  ATLAS supporting documents (Internal Notes)
  AMI (ATLAS Metadata Interface)
  GLANCE (search engine for the ATLAS Collaboration)
Despite the available documentation, it is in practice often quite involved to trace exactly how a given result was produced. The necessary information is scattered over many different repositories, such as the metadata interface, the various source code repositories, internal documents and web pages.
We need to represent the whole data analysis life cycle, from the physicist's idea to the data processing chain and the resulting publications.
In order to be interpreted and mined, experimental data must be accompanied by auxiliary metadata, which are recorded at each data processing step. Metadata describe scientific data and represent scientific objects or results of scientific experiments, allowing them to be shared by various applications, recorded in databases, or published on the Web.

Prototype Data Knowledge Base Architecture

ATLAS Data Analysis Ontology
Although the ATLAS Collaboration has published many papers, there is still “no formal way of representing or classifying experimental results – no metadata accompanies an article to formally describe the physics result therein” [D. Carral, M. Cheatham. “An ontology Design Pattern for Particle Physics Analysis”].
An ontology is a domain-specific dictionary of terms and definitions. It can also capture the semantic relationships between the terms, thus allowing logical inference about the entities represented by the ontology and about the data annotated with the ontology’s terms.
The ontology-based approach to knowledge representation opens up new approaches to data mining that go beyond a simple search for patterns in the primary data by integrating the information encoded in the structure of the ontology.
Ontological storage will provide a linked representation of all elements of the ATLAS data analysis.

ATLAS Experiment Ontology Prototype
Each ATLAS publication is based on a physics hypothesis that should be confirmed or refuted. To test a hypothesis, scientists usually use two kinds of data sets: simulated data (Monte-Carlo) and real data from the ATLAS detector. These data sets are processed in the ATLAS Data Processing Chain, and the results of the data analysis are described in Papers and Conference Notes. Each ATLAS Paper has a link to its Supporting Document (Internal ATLAS Note), which describes the initial data samples used for the analysis.
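The chain described above (Paper, Supporting Document, Data Samples) can be pictured as a small set of subject-predicate-object triples. The following is only a minimal sketch in plain Python: the identifiers, the `dkb:` prefix and the property names (`dkb:hasSupportingDocument`, `dkb:usesDataSample`) are hypothetical and are not taken from the actual OWL ontology.

```python
# Hypothetical triples; class and property names are illustrative only.
TRIPLES = [
    ("paper:P1", "rdf:type", "dkb:Paper"),
    ("note:N1", "rdf:type", "dkb:SupportingDocument"),
    ("paper:P1", "dkb:hasSupportingDocument", "note:N1"),
    ("note:N1", "dkb:usesDataSample", "sample:MC1"),
    ("note:N1", "dkb:usesDataSample", "sample:DATA1"),
    ("sample:MC1", "rdf:type", "dkb:MonteCarloDataSample"),
    ("sample:DATA1", "rdf:type", "dkb:RealDataSample"),
]


def objects(triples, subject, predicate):
    """Return all objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def data_samples_for_paper(triples, paper):
    """Follow Paper -> Supporting Document -> Data Sample links."""
    samples = []
    for note in objects(triples, paper, "dkb:hasSupportingDocument"):
        samples.extend(objects(triples, note, "dkb:usesDataSample"))
    return samples


samples = data_samples_for_paper(TRIPLES, "paper:P1")
```

Traversing the links this way is what makes the representation "linked": a question about a paper can be answered from facts recorded about its supporting note.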

Example: Data samples representation in ATLAS supporting documents
– a list of datasets
– a table of dataset attributes
– a simple description of how the signal and the background data samples were obtained

Example: Dataset IDs in ATLAS Notes
Monte-Carlo dataset IDs
Production System: DEFT database

What metadata can be extracted from ATLAS Internal Notes:
– LHC Energy
– Run
– LHC Luminosity
– Year, Run Number, Periods
– Colliding beams (p-p, Pb-Pb)
– Monte-Carlo generators
– Trigger menus
– Statistics
– Data Samples:
  Real Data Samples
  Monte-Carlo Data Samples
  Signal
  Background
– Software Release
– Conditions Data
Each item is either available in the paper’s metadata or available only in the full text of a document.
Experiment-specific metadata must be derived automatically from the texts of ATLAS papers and internal documents.
Formalization of the data analysis description
In general, the data analysis described in an Internal Note is well structured. The authors use well-defined sentences, words and phrases to describe how, and on which datasets, an experiment was conducted. This will allow us to annotate the text with the significant elements of the knowledge base.
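Because Notes describe datasets in a regular way, even a simple pattern match can pull candidate dataset names out of the full text. The sketch below is not the extraction module mentioned in the talk: it assumes dataset names follow a dotted six-field layout (project, number, short name, processing step, format, version tags), and the sample sentence and field names are invented for illustration.

```python
import re

# Assumed layout: project.number.name.step.format.tags (illustrative only).
DATASET_RE = re.compile(
    r"\b(?P<project>(?:mc|data)\d+_[0-9A-Za-z]+)"
    r"\.(?P<number>\d+)"
    r"\.(?P<name>\w+)"
    r"\.(?P<step>\w+)"
    r"\.(?P<format>\w+)"
    r"\.(?P<tags>\w+)"
)


def extract_datasets(text):
    """Return one dict per dataset name found in `text`."""
    found = []
    for match in DATASET_RE.finditer(text):
        record = match.groupdict()
        record["full_name"] = match.group(0)
        # Classify by project prefix: simulated vs. real detector data.
        record["kind"] = "Monte-Carlo" if record["project"].startswith("mc") else "real"
        found.append(record)
    return found


# Invented sample sentence standing in for a Note's full text.
SAMPLE_TEXT = (
    "The signal sample mc12_8TeV.117050.PowhegPythia_P2011C_ttbar.merge.AOD."
    "e1727_a188_a171_r3549 was generated with Powheg. Collision data were "
    "taken from data12_8TeV.00200804.physics_Muons.merge.AOD.f431_m1111."
)

datasets = extract_datasets(SAMPLE_TEXT)
```

A pattern like this only finds names that are spelled out in full; tables and abbreviated lists in real Notes would need additional handling.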

Datasets mining workflow
1. Parametric search of ATLAS papers and Internal Documents in CDS
2-4. Analyze the full text of the document and extract information about datasets (insert the paper’s metadata into Virtuoso storage)
5. Result: list of datasets
6. Request to the Hadoop Production System storage for the datasets’ metadata
7. Insert the datasets’ metadata into Virtuoso storage
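The steps above can be sketched end to end with in-memory stand-ins. Everything here is an assumption made for illustration: `CDS` and `PRODSYS` are toy substitutes for the CERN Document Server and the Hadoop Production System storage, the returned list of triples stands in for Virtuoso, and the property names are hypothetical.

```python
import re

# Toy stand-ins for the real services (all hypothetical).
CDS = [
    {"id": "NOTE-1", "year": 2015,
     "text": "Signal: mc12_8TeV.117050.ttbar.merge.AOD.e1727"},
    {"id": "NOTE-2", "year": 2012, "text": "No datasets listed."},
]
PRODSYS = {"mc12_8TeV.117050.ttbar.merge.AOD.e1727": {"events": 1000}}

# Assumed six-field dotted dataset name; illustrative pattern only.
DATASET_RE = re.compile(r"\b(?:mc|data)\d+_\w+(?:\.\w+){5}")


def mine(cds, prodsys, year):
    """Run the workflow and return the triples that would be stored."""
    store = []
    for doc in cds:
        # Step 1: parametric search (here: filter by year).
        if doc["year"] != year:
            continue
        # Steps 2-4: record the document, then scan its full text.
        store.append((doc["id"], "type", "Document"))
        # Step 5: the list of datasets found in this document.
        for name in DATASET_RE.findall(doc["text"]):
            store.append((doc["id"], "usesDataset", name))
            # Step 6: look up the dataset's metadata.
            metadata = prodsys.get(name, {})
            # Step 7: store the dataset's metadata.
            for key, value in metadata.items():
                store.append((name, key, value))
    return store


triples = mine(CDS, PRODSYS, 2015)
```

Keeping each step as a separate, replaceable piece means the stand-ins could later be swapped for real clients without changing the overall flow.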

Summary & Conclusion
Development of the Data Knowledge Base prototype for HEP (using ATLAS metadata as an example):
– Ontology storage:
  Developed the ATLAS Data Analysis ontology prototype for the main classes: Document, Data Sample, ATLAS Member, ATLAS Experiment [OWL]
  Virtuoso ontology storage installed at Tomsk Polytechnic University
– Transitional Hadoop storage installed at Kurchatov Institute:
  Production System metadata exported from Oracle DB and imported into Hadoop storage
– Internal Notes processing:
  Developed tools to prepare the full texts of Notes for data mining
  Developed a dataset extraction module for Notes
– In progress:
  Web interface prototype, based on the Node.js framework
  Tools to insert, update and select data in Virtuoso using the Virtuoso API
  Search interface for ATLAS documents using the RESTful Invenio API

Near-term plan
To develop the first DKB architecture prototype, v.1.0.0, including:
– an ontological database backend (Virtuoso) with the ontology model version “Document-Dataset-Experiment”;
– a simple web interface that lets users search the metadata of experiments, publications and datasets by parameters;
– tools for mining datasets from the full texts of internal documentation;
– tools for data manipulation in Virtuoso.