Emergent Semantics: Towards Self-Organizing Scientific Metadata


Emergent Semantics: Towards Self-Organizing Scientific Metadata Bill Howe, David Maier Oregon Health and Science University

“The file ‘anim-sal_estuary_7.gif’ is a data product derived from the output of the ELCIRC simulation program run for the period January 8-15 2002. The image shows salinity (practical salinity units) in the estuary region of the domain. It’s actually an animation, where each frame is a horizontal slice 7 meters below the mean sea level. There are 96 frames, each representing 15 minutes.”
program = ELCIRC
simStart = 1/8/02
simEnd = 1/15/02
region = estuary
variable = salinity
timesteps = 96
plottype = animation
These descriptors are not standard, and different users may record different descriptors for different purposes.
How can we extract direct, immediate benefit from the knowledge domain experts have about the file?
9/20/2018 Oregon Health and Science University

Environmental Observation and Forecasting System
- Daily forecasts and 1000s of ad hoc hindcasts
- One simulation involves ~20k files: inputs, parameters, outputs, derived data products
- This scale mandates query access rather than simple filesystem browsing
- This scale mandates automation everywhere

Tasks
- Collect metadata.
- Organize collected metadata.
- Publish organized metadata for querying.

Challenges
- Metadata is scattered: in file paths, within file headers, in “nearby” files
- Metadata requirements change frequently: new simulation codes, new data product types, new users (internal and external)
Example: the path …/anim-sal_estuary_7.gif carries Type = “Animation”, Variable = “Salinity”, Region = “Estuary”, Depth = “7”
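To make the "metadata in file paths" point concrete, here is a minimal sketch of recovering descriptors from a file name. The naming convention (plot type, variable, region, depth), the regular expression, and the abbreviation tables are assumptions made for illustration; the slides only show the one example name.

```python
import os
import re

# Assumed convention: <plottype>-<variable>_<region>_<depth>.<extension>
# These abbreviation tables are illustrative, not taken from the real system.
PLOTTYPES = {"anim": "animation"}
VARIABLES = {"sal": "salinity"}

FILENAME_PATTERN = re.compile(
    r"^(?P<plottype>[a-z]+)-(?P<variable>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.\w+$"
)


def descriptors_from_path(path):
    """Return descriptors recovered from the file name, or {} if it does not match."""
    name = os.path.basename(path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return {}
    parts = match.groupdict()
    return {
        "plottype": PLOTTYPES.get(parts["plottype"], parts["plottype"]),
        "variable": VARIABLES.get(parts["variable"], parts["variable"]),
        "region": parts["region"],
        "depth": parts["depth"],
    }


example = descriptors_from_path("forecasts/2003-184/images/anim-sal_estuary_7.gif")
print(example)
```

A name that does not follow the assumed convention simply yields no descriptors, which is one reason the other sources (file headers, nearby files) still matter.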

“Obvious” Solution
Data Managers work with Domain Experts to design a relational schema, load data, test, and repeat.
But:
- Large up-front cost to DB design
- Slow return on investment
- Use cases unknown
- Significant change is anticipated
- DB languages/APIs not necessarily within scientists’ skill set
(Diagram labels: file, data product, region)

Alternative Solution: Steps 1-3
1. Harvest metadata via simple collection scripts written by the domain experts
2. Use RDF as a schema-independent metadata representation
3. Use RDBMS technology for storage and management
(Diagram labels: filesystem, 1. collection scripts, 2. rdf, 3. db)
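Steps 1-3 can be sketched end to end in a few lines. This is an illustration only: SQLite stands in for whatever RDBMS the project used, and the function names `collect` and `load` are hypothetical. The file path, property names, and the `statements` table name come from later slides.

```python
import sqlite3


def collect(path, descriptors):
    """Steps 1-2: turn one file's descriptors into (subject, property, object) triples."""
    subject = "file://" + path
    return [
        (subject, "property:" + key, str(value))
        for key, value in sorted(descriptors.items())
    ]


def load(connection, triples):
    """Step 3: store the triples in one generic, schema-independent table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS statements (subject TEXT, property TEXT, object TEXT)"
    )
    connection.executemany("INSERT INTO statements VALUES (?, ?, ?)", triples)
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
triples = collect(
    "forecasts/2003-184/images/anim-sal_estuary_7.gif",
    {"region": "estuary", "variable": "salt", "plottype": "animation"},
)
load(connection, triples)
row_count = connection.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
print(row_count)
```

The point of the design is that the domain expert only ever writes the `collect` side; nothing in it depends on how the database is organized.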

A Narrower Interface
- Conventional: the filesystem and a rich schema are connected through SQL statements, database APIs, load strategies, and data formats/models.
- Proposed: the filesystem and a generic schema are connected through collection scripts that emit RDF triples.

Generic RDF Schema
subject: file://forecasts/2003-184/images/anim-sal_estuary_7.gif
property             object
property:region      estuary
property:variable    salt
property:plottype    animation
property:source      file://forecasts/2003-184/run/1_salt.63
Variations to improve performance exist:
- Use integer keys for subjects, properties and objects.
- Apply efficient integer processing routines.
Scalability?
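The integer-key variation can be illustrated with a toy in-memory store. This is a sketch of the general dictionary-encoding idea, not the authors' implementation; the class and method names are hypothetical.

```python
class TripleStore:
    """Toy store that replaces every string with an integer key before storing it."""

    def __init__(self):
        self.key_of = {}         # string -> integer key
        self.string_of = []      # integer key -> string
        self.statements = set()  # (subject_key, property_key, object_key)

    def intern(self, text):
        """Return the integer key for text, assigning a new one on first sight."""
        if text not in self.key_of:
            self.key_of[text] = len(self.string_of)
            self.string_of.append(text)
        return self.key_of[text]

    def add(self, subject, prop, obj):
        self.statements.add((self.intern(subject), self.intern(prop), self.intern(obj)))

    def objects(self, subject, prop):
        """Look up objects by comparing integer keys only, then decode the results."""
        s = self.key_of.get(subject)
        p = self.key_of.get(prop)
        return sorted(
            self.string_of[o] for (s2, p2, o) in self.statements if s2 == s and p2 == p
        )


store = TripleStore()
image = "file://forecasts/2003-184/images/anim-sal_estuary_7.gif"
store.add(image, "property:region", "estuary")
store.add(image, "property:variable", "salt")
store.add(image, "property:plottype", "animation")
store.add(image, "property:source", "file://forecasts/2003-184/run/1_salt.63")
print(store.objects(image, "property:region"))
```

Each distinct string is stored once, and comparisons during lookup touch only small integers. Whether that is enough at the scale discussed is exactly the open "Scalability?" question.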

Is Generic RDF Good Enough?
“Find files with region, plottype, and variable descriptors”
    SELECT r.subject AS file, r.object AS region,
           p.object AS plottype, v.object AS variable
    FROM statements r, statements p, statements v
    WHERE r.subject = p.subject
      AND p.subject = v.subject
      AND r.property = 'property:region'
      AND p.property = 'property:plottype'
      AND v.property = 'property:variable'
A three-way self-join! With 60 million descriptors, these joins are unacceptable.
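The query can be tried on a handful of rows to see what it returns. The sketch below uses SQLite with straight single quotes around the string literals; the sample rows are illustrative, and the second file is a hypothetical one that lacks a variable descriptor.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")
connection.executemany(
    "INSERT INTO statements VALUES (?, ?, ?)",
    [
        ("anim-sal_estuary_7.gif", "property:region", "estuary"),
        ("anim-sal_estuary_7.gif", "property:plottype", "animation"),
        ("anim-sal_estuary_7.gif", "property:variable", "salt"),
        # Hypothetical file with no variable descriptor: it must not appear in the result.
        ("isofar.gif", "property:region", "far"),
        ("isofar.gif", "property:plottype", "isoline"),
    ],
)

QUERY = """
SELECT r.subject AS file, r.object AS region,
       p.object AS plottype, v.object AS variable
FROM statements r, statements p, statements v
WHERE r.subject = p.subject
  AND p.subject = v.subject
  AND r.property = 'property:region'
  AND p.property = 'property:plottype'
  AND v.property = 'property:variable'
"""

rows = connection.execute(QUERY).fetchall()
print(rows)
```

Every additional descriptor in the result adds another copy of `statements` to the FROM clause, which is why the cost grows quickly on a large table.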

Decomposed Data
So we can query the RDF directly, but there are no grouping structures to aid query formulation and processing.
- Automatically infer groupings from the RDF data, observing that related files often share signatures.
- Let users impose groupings using a web interface (like views).
Example triples: <isofar.gif, type, isoline>, <isofar.gif, region, far>, <animsal.gif, timesteps, 10>, <animsal.gif, var, salt>, ...
(Diagram labels: filesystem, db, plot, animation)

Alternative Solution: Steps 4-6
4. Partition descriptors into equivalence classes based on file signatures
5. Expose signatures via the web to facilitate browsing and querying
6. Recompute signature extents as new metadata is integrated
(Diagram labels: db, 4. partition data, 5. publish to the web, website, 6. query and browse via profiles)
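Step 4 can be sketched as a grouping over the triples, taking a file's signature to be the set of properties used to describe it. The function name and the third sample file (`isonear.gif`) are hypothetical; the other triples are the ones shown on the "Decomposed Data" slide.

```python
from collections import defaultdict


def signature_extents(triples):
    """Map each signature (a frozenset of properties) to the set of files that have it."""
    properties_of = defaultdict(set)
    for subject, prop, _obj in triples:
        properties_of[subject].add(prop)

    extents = defaultdict(set)
    for subject, props in properties_of.items():
        extents[frozenset(props)].add(subject)
    return dict(extents)


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
    # Hypothetical extra file that shares a signature with isofar.gif.
    ("isonear.gif", "type", "isoline"),
    ("isonear.gif", "region", "near"),
]
extents = signature_extents(triples)
for signature, files in extents.items():
    print(sorted(signature), sorted(files))
```

Step 6 then amounts to rerunning this function whenever new triples are loaded, since nothing about the grouping is fixed in advance.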

Signature: the set of properties defined for a particular file

Signatures
- A file’s signature is just the set of properties used to describe it.
- If signatures were fixed, we might derive a relational schema from them.
- Instead, we need to respond to changes.
(Diagram labels: db, 4. partition data, find signatures, compute signature extents)
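To illustrate how a signature could stand in for a relational schema, the sketch below pivots the files of one signature into wide rows, one column per property. It assumes each property has a single value per file, and the function name is hypothetical; the sample triples are those shown on the "Decomposed Data" slide.

```python
def pivot_extent(triples, signature):
    """Return one wide row per file whose set of properties equals the signature."""
    by_file = {}
    for subject, prop, obj in triples:
        # Assumption: single-valued properties, so a later value overwrites an earlier one.
        by_file.setdefault(subject, {})[prop] = obj

    wanted = frozenset(signature)
    rows = []
    for subject in sorted(by_file):
        props = by_file[subject]
        if frozenset(props) == wanted:
            row = {"file": subject}
            row.update(props)
            rows.append(row)
    return rows


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
]
plot_rows = pivot_extent(triples, {"type", "region"})
print(plot_rows)
```

Because the rows are derived from the triples on demand, a change in the descriptors changes the derived "schema" without anyone editing table definitions.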

Example: Consolidate Files with Similar Signatures
(DM = data manager, DE = domain expert)
- Modify schema (DM)
- Transfer tuples from A to B (DM)
- Modify collection programs
- Modify extraction routines (DE)
- Modify internal organization (DE)
- Modify SQL statements (DM)

Alternative
- Change two lines in a collection script (DE):
    Assert(fileA, "animation", "")
    Assert(fileA, "plottype", "animation")
    Assert(fileB, "plottype", "animation")
- Reload data (automatic)
- Recompute signatures (automatic)
- Republish data (automatic)
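The effect of such an edit on signatures can be sketched as follows. One possible reading of the slide, assumed here, is that the first call on fileA is replaced by the second so that fileA describes itself the same way fileB does. The helper `make_asserter` and the function `assert_triple` are hypothetical stand-ins for the slide's Assert.

```python
def make_asserter(triples):
    """Return a function that records (subject, property, value) triples in a list."""
    def assert_triple(subject, prop, value):
        triples.append((subject, prop, value))
    return assert_triple


def signature(triples, subject):
    """The set of properties used to describe one file."""
    return frozenset(prop for s, prop, _value in triples if s == subject)


# Before the edit (assumed reading): fileA uses a property that fileB does not.
before = []
assert_triple = make_asserter(before)
assert_triple("fileA", "animation", "")
assert_triple("fileB", "plottype", "animation")

# After the edit: both files use the same property.
after = []
assert_triple = make_asserter(after)
assert_triple("fileA", "plottype", "animation")
assert_triple("fileB", "plottype", "animation")

same_before = signature(before, "fileA") == signature(before, "fileB")
same_after = signature(after, "fileA") == signature(after, "fileB")
print(same_before, same_after)
```

Under this reading, the two files fall into the same signature once the script is changed and the data reloaded, with no schema or SQL edits.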

Benefits
- Narrow interface between data creators and data managers
- Metadata exploitable prior to finalizing a thorough schema
- Derived schema can adapt to changing requirements automatically
- Profiles constitute emergent semantics: meaning is assigned after data is collected.