Data Products and Product Management Bill Howe June 3, 2003.


2 Data Products and Product Management Bill Howe June 3, 2003

3 "...a trend within astronomy over the past decade or so, which has seen the field change from being very individualistic (individual astronomer proposes observations, goes to telescope, brings data back on tape, reduces data, writes paper, puts tape in drawer and forgets about it) to being a more collective enterprise where much telescope time is devoted to systematic sky surveys and much science is performed using data from the archives they produce." -- Bob Mann, U of Edinburgh

4 Traditional Scientific Data Processing
Personally managed, convenient, efficient.
[Diagram: analyze a dataset from the filesystem to produce a data product; browse others' datasets]

5 Modern Scientific Data Processing
More data, more analyses, more users.
[Diagram: the same pipeline, but can you still browse others' datasets?]

6 A New Environment
Axes of growth: number of users, number of datasets, size of datasets, number of analysis routines, complexity of analysis routines.
Problems: too many files to browse; multiple users performing the same analyses; too many routines to manually invoke; datasets too large to fit into memory.

7 Solutions
Number of users: better sharing of data and data products, both intra-group and inter-group.
Number of datasets: better organization (metadata); query instead of browse.
Size of datasets: better hardware; better algorithms.
Number of analyses: identify equivalences; reuse common operations.
Complexity of analyses: simpler applications; better understanding of data products.
[These divide into macro-data issues and micro-data issues]

8 Roadmap
Micro-data issues: expressing data products; executing data product recipes.
Macro-data issues: techniques for managing data and processes; provenance.

9 CORIE
[Figures: vertical grid, horizontal grid, data products]

10 Some Data Products
Timeseries (1D), Isoline (2D), Transect (2D), Isosurface (3D), Volume Render (3D), Calculation (?D), Animations (+1D), Ensembles (+1D)

11 Expressing Data Products
Specification:
Salt: "Show the salinity at depth 3m"
Max: "Show the maximum salinity over depth"
Vort: "Show the vorticity at depth 3m"
Implementation: how should we translate these descriptions into something executable?

12 Expressing Data Products (2)
Criteria:
1. Simple specs → simple implementations.
2. Small spec change → small implementation change.
3. Environment change → minimal implementation change.
Existing technology:
General-purpose programming languages. Example: CORIE (Perl and C).
Visualization libraries/applications. Examples: VTK, Data Explorer (DX), XMVIS.

13 Simple specification → simple implementation? In CORIE:
Salinity: XMVIS5d does the job, with a little help.
Max salinity: custom C program (read, traverse grid, write).
Vorticity: custom C program (read, find neighbors, traverse grid, write).

14 Simple specification → simple implementation? In VTK/DX:
Salinity: horizontal slice is simple.
Max salinity: not so simple, since the 3D grid is unstructured.
Vorticity: simple.

15 Small spec change → small implementation change? In CORIE:
Vertical slice instead of horizontal slice:
Custom code: likely a drastic change. Why?
XMVIS5d: just feed in the vertical 'region'.
Zoom in on the estuary, then horizontal slice:
Custom code: not insignificant changes.
XMVIS5d: give the region coords in the .par file.

16 Small spec change → small implementation change? In VTK/DX:
Vertical slice instead of horizontal slice: equivalent to the horizontal case. Why?
Zoom in on the estuary, then horizontal slice: just filter out the unwanted portion.

17 Environment change → minimal implementation change? In CORIE:
Change in file layout:
Custom code: usually drastic.
XMVIS5d: just an extra conversion task?
Unstructured grid → structured grid:
Custom code: significant changes.
XMVIS5d: can convert, but missing an opportunity. The grid undergoes significant refinement; does this matter?

18 Environment change → minimal implementation change? In VTK/DX:
Change in file layout:
DX provides a file import language.
VTK would need a new 'reader' module to be written.
Unstructured grid → structured grid:
DX: changes in the import module only.
VTK: structured grids require a different set of objects, since the algorithms are different.
The grid is significantly refined; does this matter?

19 Expressing Data Products
Programs are frequently too tightly coupled to data characteristics:
file format
data size (larger or smaller than available memory)
data location (file, memory)
data type (float, int; scalar, vector)
grid type (structured, unstructured)
grid construction (repeated horizontal grid)

20 Executing Data Product Recipes

21 Efficient algorithms are tuned to the environment: in-memory vs. out-of-memory, parallel vs. sequential. But we want to specify one operation that works in multiple environments, so each specified operation must correspond to multiple algorithms. How can we separate our specifications from the algorithms that implement them? Hold off on this for now…

22 Execution: Pipelining
Two reasons to pipeline: reduce the memory required, and return partial results earlier.
VTK and DX materialize intermediate results, which requires a lot of memory.
The CORIE forecast engine pipelines timesteps, mostly to return partial results early; there are (were?) occasional memory problems.
Is pipelining always a good idea? The code is more complex, and if you have enough memory, pipelining will be slower.
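The materialize-vs-pipeline tradeoff above can be sketched with Python generators. This is a hypothetical illustration: the stage functions are made up, standing in for real steps like reading a timestep and deriving a quantity from it; they are not CORIE code.

```python
# Contrast of the two execution strategies from the slide: materializing
# every intermediate result versus pipelining items through the stages.

def materialized(timesteps):
    # Each stage builds a complete intermediate list before the next starts:
    # simple, but memory grows with the dataset.
    read = [t * 2 for t in timesteps]
    derived = [r + 1 for r in read]
    return derived

def pipelined(timesteps):
    # Generator stages: each item flows through the whole pipeline before
    # the next is read, so partial results appear early and memory stays small.
    read = (t * 2 for t in timesteps)
    derived = (r + 1 for r in read)
    yield from derived

print(materialized([1, 2, 3]))     # [3, 5, 7]
print(list(pipelined([1, 2, 3])))  # [3, 5, 7]
```

Note that the pipelined version yields its first result before the second timestep is even read, which is exactly the "return partial results earlier" motivation on the slide.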

23 Execution: Parallelism
Data parallel: split up the data and compute a piece of the data product on each node. Pros? Cons?
Task parallel: if a data product consists of independent steps, perform each on a separate processor. Pros? Cons?
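A minimal sketch of the data-parallel pattern, assuming a max-salinity-style reduction. Worker threads stand in for the nodes on the slide; a real deployment would use separate processors or machines, and the function names are illustrative.

```python
# Data parallelism: split the data, compute a partial result per "node",
# then combine the partials. The combine step works because max is
# associative (a pro of this pattern); a con is that it needs a
# splittable input and a cheap way to merge results.
from concurrent.futures import ThreadPoolExecutor

def partial_max(chunk):
    # The per-node computation: reduce one piece of the data.
    return max(chunk)

def data_parallel_max(values, n_pieces=4):
    size = max(1, len(values) // n_pieces)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(partial_max, chunks))
    return max(partials)  # the final combine runs on one node

print(data_parallel_max(list(range(1000))))  # 999
```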

24 Micro-Data Summary
At some level, we should capture the logical data product. A logical 'recipe' should be immune to changes at the physical level. The logical recipe plus data characteristics together are precise enough to execute in the computer.

25 Logical vs. Physical
Logical: data model, operators.
Physical: data representations, algorithms.

26 Logical vs. Physical
Logical layer (expression): correctness, simplicity.
Physical layer (execution): file formats, memory, algorithms, etc.

27 Digression: Business Data Processing, ca. 1960
Record-oriented data. We ask queries: Who works in Sales? Where is Jim's dept? How do we connect two records?
Employees(Name, HiredOn, Salary, Dept)
Sue, 6/21/2000, $46k, Engineering
Jim, 1/11/2001, $39k, Sales
Yin, 12/1/2000, $42k, (Sales, Engineering)
Dept(Name, City)
Sales, New York
Engineering, Denver

28 50s–60s: Hierarchical Data Model
Who works in Sales?
FIND ANY Dept USING 'Sales'
FIND FIRST Employee WITHIN Dept
DOWHILE DB-STATUS <> 0
  GET Employee
  FIND NEXT Employee WITHIN Dept
ENDDO
[Tree: Sales → Jim, Yin, …; Eng → Sue, …, Yin]

29 50s–60s: Hierarchical Data Model
[Tree: the same data in a rearranged hierarchy]
Will the same query work now?
FIND ANY Dept USING 'Sales'
FIND FIRST Employee WITHIN Dept
DOWHILE DB-STATUS <> 0
  GET Employee
  FIND NEXT Employee WITHIN Dept
ENDDO

30 What Changed?
The representation, but not the information.
Observation: representation-dependent queries break.
Codd* investigated this problem of data dependence.
*E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," CACM 13(6), 1970.
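Slides 28-30 can be condensed into a small sketch. The nested-dict representation and the helper who_works_in are hypothetical, standing in for the CODASYL-style navigation above.

```python
# Representation 1: departments point at their employees, mirroring the
# tree on slide 28. The query navigates the structure directly, just as
# FIND/GET ... WITHIN Dept navigates the hierarchy.
db_v1 = {"Sales": ["Jim", "Yin"], "Eng": ["Sue", "Yin"]}

def who_works_in(db, dept):
    # A representation-dependent query: it assumes depts are the keys.
    return list(db[dept])

print(who_works_in(db_v1, "Sales"))  # ['Jim', 'Yin']

# Representation 2: the same information, inverted so that employees
# point at their departments.
db_v2 = {"Sue": ["Eng"], "Jim": ["Sales"], "Yin": ["Sales", "Eng"]}

# who_works_in(db_v2, "Sales") now raises KeyError: no information
# changed, but the query broke. That is Codd's data-dependence problem.
```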

31 Data Dependence: Solution
Define a logical data model: tables are relations between their columns; even the Dept-Emp connection is modeled as a relation.
Extract a few logical operators: Select records that match criteria; Project away unused columns; Join two tables based on common values.
DB management systems provide physical implementations of the logical operators. Users are insulated from representational and algorithmic complexity, free to focus on asking the right query.

32 1970: Relational Model
Explicit connection between Emps and Depts. The data model is provably correct, and the relational algebra is provably complete. The systems designer's task is reduced to finding efficient implementations (not to say this is trivial!).
Employees(Name, HiredOn, Salary)
Sue, 6/21/2000, $46k
Jim, 1/11/2001, $39k
Yin, 12/1/2000, $42k
Dept(Name, City)
Sales, New York
Engineering, Denver
Dept-Emp(Name, Name)
Engineering, Sue
Engineering, Yin
Sales, Yin
Sales, Jim
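The relational version of "Who works in Sales?" can be sketched with Python's built-in sqlite3. The schema follows the slide; the Dept-Emp relation's columns are renamed DeptName/EmpName for clarity, an assumption, since the slide names both columns Name.

```python
# The declarative query states what is wanted; the join over the
# dept-employee relation replaces the navigational FIND/GET loop of
# slide 28, and it keeps working however the physical layout changes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (Name TEXT, HiredOn TEXT, Salary TEXT);
    CREATE TABLE Dept (Name TEXT, City TEXT);
    CREATE TABLE DeptEmp (DeptName TEXT, EmpName TEXT);
    INSERT INTO Employees VALUES
        ('Sue', '6/21/2000', '$46k'),
        ('Jim', '1/11/2001', '$39k'),
        ('Yin', '12/1/2000', '$42k');
    INSERT INTO Dept VALUES ('Sales', 'New York'), ('Engineering', 'Denver');
    INSERT INTO DeptEmp VALUES
        ('Engineering', 'Sue'), ('Engineering', 'Yin'),
        ('Sales', 'Yin'), ('Sales', 'Jim');
""")

rows = conn.execute("""
    SELECT e.Name
    FROM Employees e JOIN DeptEmp d ON e.Name = d.EmpName
    WHERE d.DeptName = 'Sales'
    ORDER BY e.Name
""").fetchall()
print([name for (name,) in rows])  # ['Jim', 'Yin']
```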

33 So What?
For scientific data, can we find a logical data model and logical operators rich enough to express all relevant data products, and precise enough to guide efficient implementations?

34 Scientific Data Analysis
[Diagram: retrieve from filesystem, database, or tertiary storage; read the grid & data representation; analyze to produce data product(s)]

35 Scientific Data Manipulation
[Diagram: manipulation = preparation of the representation plus manipulation proper, done today with Fortran, C, Perl, or libraries]

36 Scientific Data Manipulation: Patterns
[Diagram: representation patterns plus manipulation patterns: iteration, aggregation, filtering]

37 Data Model
Datasets are defined over grids; grids are sets of cells (nodes, edges, faces, ...).
A GridField associates data values with grid elements:
GridField = (G, k, g), where G is a grid, k is an integer selecting the k-cells of G, and g : G_k → A binds a value to each k-cell.
[Example: G with k = 2 and g = area associates an area with each 2-cell of G]

38 Operators
Pattern → operator:
associating grids with data → bind
combining grids topologically → union, intersection, cross product
reducing a grid using data values → restrict
transforming grids → aggregate

39 Restrict
[Example: a grid with values 18, 21, 15, 14; restrict(<19) keeps the cells carrying 18, 15, 14]
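A minimal, hypothetical sketch of the GridField and restrict ideas from slides 37-39. The class and its list-based representation are assumptions for illustration, not the actual GridFields implementation.

```python
# A GridField binds one data value to each cell of a grid; restrict keeps
# only the cells whose bound value satisfies a predicate (slide 39).

class GridField:
    def __init__(self, cells, values):
        # cells: cell identifiers; values: parallel list of bound data
        assert len(cells) == len(values)
        self.cells = list(cells)
        self.values = list(values)

    def restrict(self, pred):
        # "Reducing a grid using data values": the restrict operator.
        kept = [(c, v) for c, v in zip(self.cells, self.values) if pred(v)]
        return GridField([c for c, _ in kept], [v for _, v in kept])

# The example from slide 39: values 18, 21, 15, 14; restrict(< 19).
gf = GridField(["c0", "c1", "c2", "c3"], [18, 21, 15, 14])
print(gf.restrict(lambda v: v < 19).values)  # [18, 15, 14]
```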

40 Merge
[Example: merging a gridfield carrying values y1..y4 with one carrying x1..x3 yields the pairs (x1,y1), (x2,y2), (x3,y3) on the cells the grids share]

41 Aggregate
Two parts: an assignment maps each target cell to a chunk of input cells, and an aggregation function reduces each chunk to one output value.
[Example: input temperatures grouped by chunk(3) into {12.8°C, 12.5°C, 12.1°C} and {12.6°C, 13.1°C, 13.2°C}, then averaged to one output value per chunk]
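The two-part aggregate operator can be sketched as follows. This is a hypothetical rendering: chunk_aggregate and average are illustrative names, and the temperatures are the slide's input values.

```python
# Aggregate = assignment + aggregation: the assignment groups input cells
# into chunks (here, chunk(3) takes consecutive runs of three), and the
# aggregation function reduces each chunk to one output value.

def chunk_aggregate(values, chunk_size, fn):
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    return [fn(chunk) for chunk in chunks]

def average(chunk):
    return sum(chunk) / len(chunk)

temps = [12.8, 12.5, 12.1, 12.6, 13.1, 13.2]  # degrees C
print([round(t, 2) for t in chunk_aggregate(temps, 3, average)])  # [12.47, 12.97]
```

Swapping average for max with a different assignment gives the max-salinity example on the next slide; only the assignment and the aggregation function change, not the pattern.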

42 Example: Max Salinity
1. target grid: H
2. assignment: cross product with V (each cell e maps to e × V)
3. aggregate: max
(H × V)_2 salt → agg(H_2, cross(V), max) → H_2 maxsalt

43 Example: Salinity Gradient
1. target grid: G
2. assignment: neighbors
3. aggregate: gradient
G_2 salt → agg(G_2, neighbors, gradient) → G_2 saltgrad

44 Example: WetGrid
[Recipe diagram: bind bathymetry to H; bind z to V; merge; restrict(z > bathym) → gfWetGrid]

45 Example: Plume
[Recipe diagram: start from gfWetGrid; bind elev to H; bind z to V; merge; restrict(z < elev); restrict(x > 300); bind salt; restrict(salt < 26) → gfPlume]

46 Logical Analysis
[Diagram: logically equivalent placements of restrict over gridfields S_s and X_x: r(S, X) after the merge versus r(S) and r(X) pushed below it; the choice of plan changes cost, e.g. O(1) vs. O(n²)]

47 Macro-Data Issues: Database Extensions for Scientific Data Management

48 Scientific Data
Big: efficiency is paramount.
Complex processing: formats must match existing tools.
Few updates: concurrency is not a major concern.
Extensive metadata: provenance/lineage/pedigree.

49 Science is hitting a wall: FTP and GREP are not adequate.
You can GREP 1 MB in 1 sec, and FTP 1 MB in 1 sec.
You can GREP 1 GB in 1 min, and FTP 1 GB in 1 min.
You can GREP 1 TB in 2 days, and FTP 1 TB in 2 days ($1K).
You can GREP 1 PB in 3 years, and FTP 1 PB in 3 years ($1M).
Oh, and 1 PB is about 10,000 disks.
At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.
Slide courtesy of Alex Szalay and Jim Gray, from "Public Access to Large Astronomical Datasets," presented at the Data Provenance Workshop 2002.

50 With No Database
[Diagram: read an array from a file in the filesystem; browse; analyze into a data product]
Drawbacks: no backup, no recovery, no transactions, no concurrency, limited security, no query (hence limited sharing), limited automation, data dependence.

51 First Attempt: Relational Databases
[Diagram: relations in a DBMS; query results rendered into a data product; arrays handled by special tools]
Drawbacks: impedance mismatch, performance.

52 Second Attempt: DB-Managed Files (IBM DataLinks, Oracle iFS)
[Diagram: the DBMS manages files; query results name files whose arrays are read and analyzed by special tools into a data product]
Drawbacks: redundant computation, repetitive computation, data dependence.

53 Third Attempt: Analysis-Aware DB (ESSW, Chimera)
[Diagram: the DBMS records analyses; query results plus files feed special tools that produce the data product]
Drawback: data dependent.

54 Process Modeling
[Diagram: recipe versions, meshes, forcings (atm, tide), parameters (e.g. 10/3, C=.2), and executions (cpu, mem), with bindings linking files into a data product pipeline]

55 Fourth Attempt: Array Types for DBs (AML, AQL, Monoid Calculus)
[Diagram: arrays stored in the database; query results rendered into a data product via special tools]
Drawbacks: data dependence? expressiveness?

56 Fifth Attempt: Specialized Data Models for DBs (Active Data Repository, Aurora, GridFields, ...)
[Diagram: a specialized data model in the database; query results rendered into a data product]
Drawback: the query language doesn't exist!

57 Macro-Data Summary
Science is outgrowing its infrastructure; databases can help.
Competing solutions, no clear winner: extensions to existing database technology vs. specialized scientific computing platforms.
Limited industrial interest (science is not a big source of $$$).

58 A Final Topic: Data Provenance

59 Data Provenance
"...a record of the origin and history of a piece of data." -- Dave Pearson, Oracle UK
"...a history of steps and procedures associated with the processing of associated data." -- Bob Mann, University of Edinburgh
"...metadata which uniquely defines data and provides a traceable path to its origin." -- Carmen Pancerella et al., Sandia National Laboratories
"...determining the validity of data by gaining access to a complete audit trail describing how the data was produced from [base] datasets..." -- Ian Foster, U of Chicago

60 Data Provenance
Used for: discovery (querying), validation, reproducibility.
Related issues: annotations, federated databases, publishing.

61 Data Provenance Research Thrusts
Domain-specific standards: astronomy, high-energy physics, bioinformatics, environmental observation and forecasting?
Representation: XML, BLOBs, explicit schema support.
Database extensions: tracking provenance through queries implicitly.

62 Summary
Micro-data issues. Logical level: convenient expression, genericity, algebraic optimization. Physical level: efficient execution.
Macro-data issues: database features for scientific data, metadata, provenance.


64 Timeseries

65 Isoline

66 Transect

67 Ensemble

68 Volume Rendering

69 Isosurface

70 Salt in DX

71 Vorticity in DX

72 Max Salt in DX

73 Filtering in DX

