Data Products and Product Management Bill Howe June 3, 2003.

Presentation transcript:

Data Products and Product Management Bill Howe June 3, 2003

"...a trend within astronomy over the past decade or so, which has seen the field change from being very individualistic (individual astronomer proposes observations, goes to telescope, brings data back on tape, reduces data, writes paper, puts tape in drawer and forgets about it) to being a more collective enterprise where much telescope time is devoted to systematic sky surveys and much science is performed using data from the archives they produce." -- Bob Mann, U of Edinburgh

Traditional Scientific Data Processing: personally managed, convenient, efficient. [Diagram labels: filesystem, dataset, analyze, data product; browse others’ datasets]

Modern Scientific Data Processing: more data, more analyses, more users. [Diagram labels: filesystem, dataset, analyze, data product; browse? others’ datasets]

A New Environment. Axes of growth: number of users; number of datasets; size of datasets; number of analysis routines; complexity of analysis routines. Problems: too many files to browse; multiple users performing the same analyses; too many routines to invoke manually; datasets too large to fit into memory.

Solutions. Number of users: better sharing of data and data products, both intra-group and inter-group. Number of datasets: better organization (metadata); query instead of browse. Size of datasets: better hardware; better algorithms. Number of analyses: identify equivalences; reuse common operations. Complexity of analyses: simpler applications; better understanding of data products. [Slide labels: Macro-data issues, Micro-data issues]

Roadmap. Micro-data issues: expressing data products; executing data product recipes. Macro-data issues: techniques for managing data and processes. Provenance.

CORIE Vertical Grid Horizontal Grid Data Products

Some Data Products: timeseries (1D), isoline (2D), transect (2D), isosurface (3D), volume render (3D), calculation (?D), animations (+1D), ensembles (+1D).

Expressing Data Products. Specification: Salt: “Show the salinity at depth 3m”. Max: “Show the maximum salinity over depth”. Vort: “Show the vorticity at depth 3m”. Implementation: how should we translate these descriptions into something executable?

Expressing Data Products (2). Criteria: 1. Simple specs ⇒ simple implementations. 2. Small change in spec ⇒ small change in implementation. 3. Change in environment ⇒ minimal change in implementation. Existing technology: general-purpose programming languages (example: CORIE, in Perl and C); visualization libraries/applications (examples: VTK, Data Explorer (DX), XMVIS).

Simple Specification ⇒ Simple Implementation, in CORIE. Salinity: XMVIS5d does the job, with a little help. Max salinity: custom C program (read, traverse grid, write). Vorticity: custom C program (read, find neighbors, traverse grid, write).

Simple Specification ⇒ Simple Implementation, in VTK/DX. Salinity: horizontal slice is simple. Max salinity: not so simple, since the 3D grid is unstructured. Vorticity: vorticity is simple.

Small change in spec ⇒ small change in implementation, in CORIE. Vertical slice instead of horizontal slice: custom code: likely a drastic change (why?); XMVIS5d: just feed in the vertical ‘region’. Zoom in on estuary, then horizontal slice: custom code: not insignificant changes; XMVIS5d: give region coords in the .par file.

Small change in spec ⇒ small change in implementation, in VTK/DX. Vertical slice instead of horizontal slice: equivalent to the horizontal case (why?). Zoom in on estuary, then horizontal slice: just filter out the unwanted portion.

Change in environment ⇒ minimal change in implementation, in CORIE. Change in file layout: custom code: usually drastic; XMVIS5d: just an extra conversion task? Unstructured grid ⇒ structured grid: custom code: significant changes; XMVIS5d: can convert, but missing an opportunity. Grid undergoes significant refinement: does this matter?

Change in environment ⇒ minimal change in implementation, in VTK/DX. Change in file layout: DX provides a file import language; VTK would need a new ‘reader’ module to be written. Unstructured grid ⇒ structured grid: DX: changes in the import module only; VTK: structured grids require a different set of objects (the algorithms are different, so the objects are different). Grid is significantly refined: does this matter?

Expressing Data Products. Programs are frequently too tightly coupled to data characteristics: file format; data size (>, < available memory); data location (file, memory); data type (float, int; scalar, vector); grid type (structured, unstructured); grid construction (repeated horizontal grid).

Executing Data Product Recipes

Efficient algorithms are tuned to the environment: in-memory vs. out-of-memory; parallel vs. sequential. But we want to specify one operation that works in multiple environments, so each specified operation must correspond to multiple algorithms. How can we separate our specifications from the algorithms that implement them? Hold off on this for now…

Execution: Pipelining. Two reasons: reduce the memory required; return partial results earlier. VTK and DX materialize intermediate results, which requires a lot of memory. The CORIE forecast engine pipelines timesteps, mostly to return partial results early; there are (were?) occasional memory problems. Is pipelining always a good idea? The code is more complex, and if you have enough memory, pipelining will be slower.
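The contrast between pipelining and materializing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_timesteps` is a hypothetical stand-in for a reader that produces one timestep at a time, and the per-timestep maximum stands in for any data product. It is not how CORIE, VTK, or DX are implemented.

```python
from typing import Iterable, Iterator, List


def read_timesteps(n_steps: int, n_nodes: int) -> Iterator[List[float]]:
    """Hypothetical reader: yields one timestep of values at a time."""
    for t in range(n_steps):
        yield [float(t + i) for i in range(n_nodes)]


def max_per_timestep(steps: Iterable[List[float]]) -> Iterator[float]:
    """Pipelined stage: consumes one timestep, emits one result, retains nothing."""
    for values in steps:
        yield max(values)


def pipelined(n_steps: int, n_nodes: int) -> Iterator[float]:
    """Only one timestep is in memory at any moment."""
    return max_per_timestep(read_timesteps(n_steps, n_nodes))


def materialized(n_steps: int, n_nodes: int) -> List[float]:
    """Materializing version: every timestep is held in memory before computing."""
    all_steps = list(read_timesteps(n_steps, n_nodes))
    return [max(values) for values in all_steps]


# The first partial result is available after reading a single timestep.
first = next(pipelined(1000, 4))
```

Both versions compute the same results; the pipelined one trades a little extra control-flow complexity for lower peak memory and earlier partial results, which is the trade-off the slide asks about.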

Execution: Parallelism. Data parallel: split up the data and compute a piece of the data product on each node. Pros? Cons? Task parallel: if a data product consists of independent steps, perform each on a separate processor. Pros? Cons?
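A minimal sketch of the data-parallel case, assuming the data product is a single maximum over a list of values. Worker threads stand in for nodes here, and the function names and sample values are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(values: Sequence[float], n_parts: int) -> List[Sequence[float]]:
    """Split values into at most n_parts contiguous, non-empty pieces."""
    if not values:
        raise ValueError("values must be non-empty")
    size = -(-len(values) // n_parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def data_parallel_max(values: Sequence[float], n_workers: int = 4) -> float:
    """Compute a partial maximum per piece, then combine the partial results."""
    parts = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(max, parts))
    return max(partial)


# Illustrative salinity values, not taken from any real dataset.
salinity = [31.2, 28.4, 33.9, 30.0, 29.7, 32.5, 27.1]
overall = data_parallel_max(salinity, n_workers=3)
```

This works because max can be combined from partial results. Operations that need neighboring values, such as a gradient, would also need data exchanged across piece boundaries, which is one of the likely cons.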

Micro-Data Summary At some level, we should capture the logical data product. A logical ‘recipe’ should be immune to changes at the physical level. The logical recipe + data characteristics together are precise enough to execute in the computer.

Logical vs. Physical. Logical: data model, operators. Physical: data representations, algorithms.

Logical vs. Physical. Logical layer (expression): correctness, simplicity. Physical layer (execution): file formats, memory, algorithms, etc.

Digression: Business Data Processing, ca. 1960
Record-oriented data. We ask queries: Who works in Sales? Where is Jim’s Dept? How to connect two records?
Employees(Name, HiredOn, Salary, Dept)
    Sue, 6/21/2000, $46k, Engineering
    Jim, 1/11/2001, $39k, Sales
    Yin, 12/1/2000, $42k, (Sales, Engineering)
Dept(Name, City)
    Sales, New York
    Engineering, Denver

50s – 60s: Hierarchical Data Model
Who works in Sales?
    FIND ANY Dept USING ‘Sales’
    FIND FIRST Employee WITHIN Dept
    DOWHILE DB-STATUS <> 0
        GET Employee
        FIND NEXT Employee WITHIN Dept
    ENDDO
[Diagram labels: Sales; Jim, Yin, …; Eng; Sue, …; Yin]

50s – 60s: Hierarchical Data Model
[Diagram labels: Sales; Jim, Yin, …; Eng; Sue, …; Yin]
Will the same query work now?
    FIND ANY Dept USING ‘Sales’
    FIND FIRST Employee WITHIN Dept
    DOWHILE DB-STATUS <> 0
        GET Employee
        FIND NEXT Employee WITHIN Dept
    ENDDO

Data Dependence. What changed? The representation, but not the information. Observation: representation-dependent queries break. Codd* investigated this problem of data dependence. *E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” CACM 13(6), 1970.
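The breakage can be illustrated with a small Python sketch. The two nested dictionaries below are assumed stand-ins for two hierarchical layouts of the same information (the exact layouts on the slides are not recoverable from the transcript), and the function names are hypothetical.

```python
# Representation A: departments own their employees.
by_dept = {"Sales": ["Jim", "Yin"], "Engineering": ["Sue", "Yin"]}

# Representation B: the same information, but employees own their departments.
by_emp = {"Sue": ["Engineering"], "Jim": ["Sales"], "Yin": ["Sales", "Engineering"]}


def who_works_in_navigational(db, dept):
    """Representation-dependent: assumes the top-level keys are departments."""
    return sorted(db.get(dept, []))


def relation_from_dept(db):
    """View representation A as a set of (dept, employee) pairs."""
    return {(d, e) for d, emps in db.items() for e in emps}


def relation_from_emp(db):
    """View representation B as the same set of (dept, employee) pairs."""
    return {(d, e) for e, depts in db.items() for d in depts}


def who_works_in(relation, dept):
    """Representation-independent: written once against the logical pairs."""
    return sorted(e for d, e in relation if d == dept)


works_a = who_works_in_navigational(by_dept, "Sales")   # finds Jim and Yin
works_b = who_works_in_navigational(by_emp, "Sales")    # silently finds nobody
logical = who_works_in(relation_from_emp(by_emp), "Sales")
```

The navigational query returns the right answer only for the layout it was written against, while the query over the logical relation is unaffected by which layout holds the data.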

Data Dependence: Solution. Define a logical data model: tables are relations between their columns; even the Dept-Emp connection is modeled as a relation. Extract a few logical operators: select records that match criteria; project away unused columns; join two tables based on common values. DB management systems provide physical implementations of the logical operators. Users are insulated from representational and algorithmic complexity and are free to focus on asking the right query.

1970: Relational Model
Explicit connection between Emps and Depts. Data model is provably correct. Relational algebra is provably complete. System designers’ tasks are reduced to finding efficient implementations (not to say this is trivial!).
Employees(Name, HiredOn, Salary)
    Sue, 6/21/2000, $46k
    Jim, 1/11/2001, $39k
    Yin, 12/1/2000, $42k
Dept(Name, City)
    Sales, New York
    Engineering, Denver
Dept-Emp(Name, Name)
    Engineering, Sue
    Engineering, Yin
    Sales, Yin
    Sales, Jim
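As a sketch of how the three logical operators answer the earlier questions over these tables, here is a naive Python version. The column names, the list-of-dicts encoding, and the nested-loop join are assumptions made for illustration; a real DBMS chooses its own physical implementations.

```python
# The three relations from the slide, encoded as lists of dicts.
employees = [
    {"name": "Sue", "hired_on": "6/21/2000", "salary": 46000},
    {"name": "Jim", "hired_on": "1/11/2001", "salary": 39000},
    {"name": "Yin", "hired_on": "12/1/2000", "salary": 42000},
]
depts = [
    {"dept": "Sales", "city": "New York"},
    {"dept": "Engineering", "city": "Denver"},
]
dept_emp = [
    {"dept": "Engineering", "name": "Sue"},
    {"dept": "Engineering", "name": "Yin"},
    {"dept": "Sales", "name": "Yin"},
    {"dept": "Sales", "name": "Jim"},
]


def select(rows, predicate):
    """Keep the records that match the criteria."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Keep only the named columns, dropping duplicate records."""
    result = []
    for r in rows:
        reduced = {c: r[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right, key):
    """Naive nested-loop join on a shared column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Who works in Sales?
sales_staff = project(select(dept_emp, lambda r: r["dept"] == "Sales"), ["name"])

# Where is Jim's Dept?
jims_city = project(
    select(join(dept_emp, depts, "dept"), lambda r: r["name"] == "Jim"),
    ["city"],
)
```

Neither query mentions how the records are stored or linked, which is the insulation from representation that the previous slide describes.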

So What? For Scientific Data, can we find: A Logical Data Model? Logical Operators? …rich enough to express all relevant data products …precise enough to guide efficient implementations

Scientific Data Analysis. [Diagram labels: filesystem, database, tertiary storage; retrieve; grid & data representation; read; analyze; data product(s)]

Scientific Data Manipulation. [Figure labels: Preparation; Representation; Manipulation; Fortran, C, Perl; Library]

Scientific Data Manipulation: Patterns. [Figure labels: Preparation; Representation Patterns; Manipulation Patterns; iteration, aggregation, filtering]

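Data Model. Datasets are defined over grids. Grids are sets of cells. Cells: nodes, edges, faces, ... A GridField associates data values with grid elements. GridField = (G, k, g), also written $G^k_g$, where G is a grid, k is an integer, and $g : G_k \to a$ maps the k-cells of G to data values of type a. [Figure: a grid G and a gridfield $G^2_{area}$ that binds an area value to each 2-cell.]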
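One possible encoding of this data model in Python is sketched below. The class names, the choice to identify a cell by the set of nodes it touches, and the single-triangle example are all assumptions for illustration; the slides do not prescribe a concrete representation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Generic, Set, TypeVar

A = TypeVar("A")
Cell = FrozenSet[str]  # assumption: a cell is identified by the nodes it touches


@dataclass
class Grid:
    """A grid is a set of cells, grouped here by dimension (0 = nodes, 1 = edges, ...)."""
    cells: Dict[int, Set[Cell]]

    def k_cells(self, k: int) -> Set[Cell]:
        return self.cells.get(k, set())


@dataclass
class GridField(Generic[A]):
    """The triple (G, k, g): g assigns one value to every k-cell of G."""
    grid: Grid
    k: int
    data: Dict[Cell, A]

    def __post_init__(self) -> None:
        if set(self.data) != self.grid.k_cells(self.k):
            raise ValueError("data must cover exactly the k-cells of the grid")


# A single triangle: three nodes, three edges, one face.
nodes = {frozenset({n}) for n in "abc"}
edges = {frozenset(pair) for pair in ("ab", "bc", "ca")}
tri = frozenset("abc")
grid = Grid(cells={0: nodes, 1: edges, 2: {tri}})

# Bind an area value to the 2-cells, as in the slide's figure.
area = GridField(grid, 2, {tri: 0.5})
```

Keeping k explicit lets the same grid carry several gridfields, for example node coordinates at k = 0 and face areas at k = 2.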

Operators
Pattern: Operator
    associating grids with data: bind
    combining grids topologically: union, intersection, cross product
    reducing a grid using data values: restrict
    transforming grids: aggregate

Restrict restrict(<19)
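A simplified sketch of restrict, assuming a gridfield's data is a plain mapping from cell identifiers to values. The cell names and temperatures are made up, and a fuller version would also have to decide what happens to grid cells incident to the ones removed.

```python
from typing import Callable, Dict, TypeVar

V = TypeVar("V")


def restrict(data: Dict[str, V], predicate: Callable[[V], bool]) -> Dict[str, V]:
    """Keep only the cells whose bound value satisfies the predicate."""
    return {cell: value for cell, value in data.items() if predicate(value)}


# Hypothetical node temperatures; restrict(<19) keeps the cooler nodes.
temps = {"n1": 18.2, "n2": 19.4, "n3": 17.9, "n4": 21.0}
cool = restrict(temps, lambda t: t < 19)
```

Because the predicate looks only at bound values, the same operator works regardless of how the grid is stored.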

Merge. [Figure: merging a gridfield with values x1, x2, x3 and a gridfield with values y1, y2, y3, y4 yields the paired values (x1, y1), (x2, y2), (x3, y3).]

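Aggregate. Two steps: assignment, then aggregation. [Figure: the input gridfield binds temperatures 12.1 °C, 12.6 °C, 13.1 °C, 13.2 °C, 12.8 °C, 12.5 °C to the 0-cells of G; the assignment function chunk(3) assigns the sets {12.8 °C, 12.5 °C, 12.1 °C} and {12.6 °C, 13.1 °C, 13.2 °C} to the cells of the target grid T; the aggregation function is average.]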
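The two steps can be sketched as follows. This is an illustrative reading of the figure: the source values are ordered so that chunk(3) reproduces the two sets shown on the slide, and the function signatures are assumptions rather than the presentation's actual operator interface.

```python
from typing import Callable, List, Sequence


def chunk(size: int) -> Callable[[int, Sequence[int]], Sequence[int]]:
    """Assignment function: target cell i receives the i-th run of `size` source cells."""
    def assign(target_index: int, source_cells: Sequence[int]) -> Sequence[int]:
        return source_cells[target_index * size:(target_index + 1) * size]
    return assign


def average(values: Sequence[float]) -> float:
    """Aggregation function: reduce the assigned values to one value."""
    return sum(values) / len(values)


def aggregate(n_targets: int, assign, agg, source_data: Sequence[float]) -> List[float]:
    """For each target cell, gather its assigned source values and aggregate them."""
    source_cells = range(len(source_data))
    result = []
    for i in range(n_targets):
        assigned = assign(i, source_cells)
        result.append(agg([source_data[c] for c in assigned]))
    return result


# Temperatures in degrees C, ordered to match the slide's two chunks.
temps = [12.8, 12.5, 12.1, 12.6, 13.1, 13.2]
coarse = aggregate(2, chunk(3), average, temps)
```

Separating assignment from aggregation means either piece can be swapped independently, for example chunk(3) with max, or a neighborhood assignment with average.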

Example: Max Salinity. 1. Target grid: H. 2. Assignment: cross, where $cross_V(e) = e \times V$. 3. Aggregate: max. Expression: agg($H^2$, cross(V), max) applied to $(H \times V)^2_{salt}$ yields $H^2_{maxsalt}$.
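A small sketch of this recipe, assuming H is a list of horizontal cell names, V a list of depth levels, and salt a mapping over their cross product. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple

H: List[str] = ["h0", "h1", "h2"]   # horizontal cells (target grid)
V: List[int] = [0, 1]               # depth levels

# Salinity bound to the cells of H x V (illustrative values).
salt: Dict[Tuple[str, int], float] = {
    ("h0", 0): 30.1, ("h0", 1): 31.4,
    ("h1", 0): 28.0, ("h1", 1): 27.5,
    ("h2", 0): 33.2, ("h2", 1): 33.9,
}


def cross_v(e: str) -> List[Tuple[str, int]]:
    """Assignment: a horizontal cell e is assigned every cell of e x V."""
    return [(e, v) for v in V]


def agg_max(target: List[str], data: Dict[Tuple[str, int], float]) -> Dict[str, float]:
    """Aggregate: take the maximum over each target cell's assigned cells."""
    return {e: max(data[c] for c in cross_v(e)) for e in target}


maxsalt = agg_max(H, salt)
```

The recipe never mentions file layout or whether the 3D grid is structured, in contrast to the custom C program described earlier for the same data product.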

Example: Salinity Gradient. 1. Target grid: G. 2. Assignment: neighbors. 3. Aggregate: gradient. Expression: agg($H^2$, neighbors, gradient) applied to $G^2_{salt}$ yields $G^2_{saltgrad}$.

Example: WetGrid. [Operator-tree labels: H, bind (x, y, bathym); V, bind (z); merge; restrict(z > bathym); result: gfWetGrid]

Example: Plume. [Operator-tree labels: gfWetGrid; bind (salt); restrict(x > 300); H, bind (elev); V, bind (z); merge; restrict(z < elev); restrict(salt < 26); result: gfPlume]

Logical Analysis m SsSs XxXx r(S,X) m r(S) SsSs XxXx r(X) m SsSs XxXx r(S) X’  S X = S O(1) O(n 2 )

Macro-Data Issues Database Extensions for Scientific Data Management

Scientific Data. Big: efficiency is paramount. Complex processing: formats must match existing tools. Few updates: concurrency is not a major concern. Extensive metadata: provenance/lineage/pedigree.

Science is hitting a wall: FTP and GREP are not adequate.
    You can GREP 1 MB in 1 sec, and FTP 1 MB in 1 sec.
    You can GREP 1 GB in 1 min, and FTP 1 GB in 1 min.
    You can GREP 1 TB in 2 days, and FTP 1 TB in 2 days (1K$).
    You can GREP 1 PB in 3 years, and FTP 1 PB in 3 years (1M$).
    Oh!, and 1 PB ~ 10,000 disks.
At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.
Slide courtesy of Alex Szalay and Jim Gray, from “Public Access to Large Astronomical Datasets,” presented at the Data Provenance Workshop 2002.

With No Database: no backup, no recovery, no transactions, no concurrency, limited security, no query, limited sharing, limited automation, data dependence. [Diagram labels: filesystem, file, read, array, analyze, data product; browse]

First Attempt: Relational Databases. Problems: impedance mismatch, performance. [Diagram labels: database, relation, dbms, query, results, array, special tools, render, data product]

Second Attempt: DB Managed Files (IBM Datalinks, Oracle IFS). Problems: redundant computation, repetitive computation, data dependence. [Diagram labels: database, dbms, query, results, file, read, array, special tools, analyze, data product]

Third Attempt: Analysis-Aware DB (ESSW, Chimera). Problem: data dependent. [Diagram labels: database, dbms, query, results, file, special tools, analyze, data product]

Process Modeling. [Diagram labels: Recipe (R1), Version, Mesh (M1, M2), Forcings (F1, F2, F3), Executions (E1, E2), Parameters (P1, P2), Data Product Pipeline, Bindings; sample values: 10/3, C=.2, atm, tide, cpu, mem]

Fourth Attempt: Array Types for DBs (AML, AQL, Monoid Calculus). Open questions: data dependence? expressiveness? [Diagram labels: database, dbms, query, results, array, special tools, render, data product]

Fifth Attempt: Specialized Data Models for DBs (Active Data Repository, Aurora, GridFields, ...). Problem: query language doesn’t exist! [Diagram labels: database, dbms, query, results, specialized data model, special tools, render, data product]

Macro-Data Summary. Science is outgrowing its infrastructure; databases can help. Competing solutions, no clear winner: extensions to existing database technology; specialized scientific computing platforms. Limited industrial interest (science is not a big source of $$$).

A Final Topic: Data Provenance

Data Provenance
“...a record of the origin and history of a piece of data.” -- Dave Pearson, Oracle UK
“...a history of steps and procedures associated with the processing of associated data” -- Bob Mann, University of Edinburgh
“...metadata which uniquely defines data and provides a traceable path to its origin.” -- Carmen Pancerella et al., Sandia Natl Lab
“...determining the validity of data by gaining access to a complete audit trail describing how the data was produced from [base] datasets...” -- Ian Foster, U of Chicago

Data Provenance. Used for: discovery (querying), validation, reproducibility. Related issues: annotations, federated databases, publishing.

Data Provenance Research Thrusts. Domain-specific standards: astronomy; high energy physics; bioinformatics; environmental observation and forecasting? Representation: XML; BLOBs; explicit schema support. Database extensions: tracking provenance through queries implicitly.

Summary. Micro-data issues: logical level (convenient expression, genericity, algebraic optimization); physical level (efficient execution). Macro-data issues: database features for scientific data; metadata; provenance.

 =

Timeseries

Isoline

Transect

Ensemble

Volume Rendering

Isosurface

Salt in DX

Vorticity in DX

Max Salt in DX

Filtering in DX