2
Data Products and Product Management. Bill Howe, June 3, 2003
3
"...a trend within astronomy over the past decade or so, which has seen the field change from being very individualistic (individual astronomer proposes observations, goes to telescope, brings data back on tape, reduces data, writes paper, puts tape in drawer and forgets about it) to being a more collective enterprise where much telescope time is devoted to systematic sky surveys and much science is performed using data from the archives they produce." -- Bob Mann, U of Edinburgh
4
Traditional Scientific Data Processing: personally managed, convenient, efficient. [Diagram: browse others’ datasets in the filesystem → analyze a dataset → data product]
5
Modern Scientific Data Processing: more data, more analyses, more users. [Diagram: browse? others’ datasets in the filesystem → analyze a dataset → data product]
6
A New Environment
Axes of growth: number of users; number of datasets; size of datasets; number of analysis routines; complexity of analysis routines.
Problems: too many files to browse; multiple users performing the same analyses; too many routines to manually invoke; datasets too large to fit into memory.
7
Solutions
Number of users → better sharing of data and data products, both intra-group and inter-group.
Number of datasets → better organization (metadata); query instead of browse.
Size of datasets → better hardware; better algorithms.
Number of analyses → identify equivalences; reuse common operations.
Complexity of analyses → simpler applications; better understanding of data products.
(These span both macro-data and micro-data issues.)
8
Roadmap
Micro-data issues: expressing data products; executing data product recipes.
Macro-data issues: techniques for managing data and processes; provenance.
9
CORIE. [Diagram: the CORIE vertical grid, horizontal grid, and example data products.]
10
Some Data Products: timeseries (1D), isoline (2D), transect (2D), isosurface (3D), volume render (3D), calculation (?D), animations (+1D), ensembles (+1D).
11
Expressing Data Products
Specification — Salt: “Show the salinity at depth 3m.” Max: “Show the maximum salinity over depth.” Vort: “Show the vorticity at depth 3m.”
Implementation — how should we translate these descriptions into something executable?
12
Expressing Data Products (2)
Criteria: 1. Simple specs → simple implementations. 2. Small changes to a spec → small changes to the implementation. 3. Changes to the environment → minimal changes to the implementation.
Existing technology: general-purpose programming languages (example: CORIE, in Perl and C); visualization libraries/applications (examples: VTK, Data Explorer (DX), XMVIS5d).
13
Simple Specification → Simple Implementation? In CORIE:
Salinity: XMVIS5d does the job, with a little help.
Max salinity: custom C program — read, traverse grid, write.
Vorticity: custom C program — read, find neighbors, traverse grid, write.
14
Simple Specification → Simple Implementation? In VTK/DX:
Salinity: a horizontal slice is simple.
Max salinity: not so simple, since the 3D grid is unstructured.
Vorticity: vorticity is simple.
15
Small spec change → Small Implementation change? In CORIE:
Vertical slice instead of horizontal slice — custom code: likely a drastic change (why?); XMVIS5d: just feed in the vertical ‘region’.
Zoom in on the estuary, then horizontal slice — custom code: not insignificant changes; XMVIS5d: give the region coords in the .par file.
16
Small spec change → small implementation change? In VTK/DX:
Vertical slice instead of horizontal slice — equivalent to the horizontal case (why?).
Zoom in on the estuary, then horizontal slice — just filter out the unwanted portion.
17
Environment change → Minimal Implementation change? In CORIE:
File layout — custom code: usually drastic; XMVIS5d: just an extra conversion task?
Unstructured grid → structured grid — custom code: significant changes; XMVIS5d: can convert, but missing an opportunity.
Grid undergoes significant refinement — does this matter?
18
Environment change → Minimal Implementation change? In VTK/DX:
File layout — DX provides a file import language; VTK would need a new ‘reader’ module to be written.
Unstructured grid → structured grid — DX: changes in the import module only; VTK: structured grids require a different set of objects (the algorithms are different, so the objects are different).
Grid is significantly refined — does this matter?
19
Expressing Data Products
Programs are frequently too tightly coupled to data characteristics: file format; data size (greater or less than available memory); data location (file, memory); data type (float, int; scalar, vector); grid type (structured, unstructured); grid construction (repeated horizontal grid).
20
Executing Data Product Recipes
21
Efficient algorithms are tuned to the environment (in-memory vs. out-of-memory, parallel vs. sequential). But we want to specify one operation that works in multiple environments, so each specified operation must correspond to multiple algorithms. How can we separate our specifications from the algorithms that implement them? Hold off on this for now…
22
Execution: Pipelining
Two reasons to pipeline: reduce the memory required; return partial results earlier.
VTK and DX materialize intermediate results, which requires a lot of memory. The CORIE forecast engine pipelines timesteps, mostly to return partial results early; there are (were?) occasional memory problems.
Is pipelining always a good idea? The code is more complex, and if you have enough memory, pipelining will be slower.
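The trade-off can be sketched in a few lines. This is a hypothetical illustration (not CORIE or VTK code) assuming a per-timestep analysis: the materialized plan holds every timestep in memory at once, while the pipelined plan holds one timestep at a time and yields each partial result as soon as it is ready.

```python
# Hypothetical sketch: pipelined vs. materialized execution over timesteps.

def read_timesteps(n):
    """Stand-in for reading n timesteps from disk, one at a time."""
    for t in range(n):
        yield [t * 1.0] * 4          # one timestep's worth of data

def analyze(step):
    """Stand-in for a per-timestep analysis."""
    return sum(step)

def materialized(n):
    steps = list(read_timesteps(n))  # all timesteps resident in memory
    return [analyze(s) for s in steps]

def pipelined(n):
    for step in read_timesteps(n):   # one timestep resident at a time
        yield analyze(step)          # partial result available early

results = list(pipelined(3))         # same answers, smaller footprint
```

Both plans compute the same results; the difference is when memory is held and when partial answers become visible downstream.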
23
Execution: Parallelism
Data parallel: split up the data and compute a piece of the data product on each node. Pros? Cons?
Task parallel: if a data product consists of independent steps, perform each on separate processors. Pros? Cons?
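The data-parallel pattern can be sketched as partition / compute / combine. A hypothetical illustration (a thread pool stands in for cluster nodes), computing a max data product:

```python
# Hypothetical sketch of data parallelism: partition the dataset, compute
# a piece of the data product on each worker, then combine the pieces.
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal chunks."""
    k = (len(data) + n - 1) // n
    return [data[i:i + k] for i in range(0, len(data), k)]

def local_max(chunk):
    """Per-node piece of the data product: max over the local chunk."""
    return max(chunk)

def parallel_max(data, nodes=4):
    chunks = partition(data, nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(local_max, chunks))
    return max(partials)             # combine step

assert parallel_max([3, 1, 4, 1, 5, 9, 2, 6]) == 9
```

The pros and cons fall out of the sketch: data parallelism scales with dataset size, but the combine step and data movement are overhead; task parallelism instead runs independent pipeline stages concurrently, and is limited by the number of independent steps.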
24
Micro-Data Summary At some level, we should capture the logical data product. A logical ‘recipe’ should be immune to changes at the physical level. The logical recipe + data characteristics together are precise enough to execute in the computer.
25
Logical vs Physical
Logical: data model, operators.
Physical: data representations, algorithms.
26
Logical vs. Physical
Logical layer (expression): correctness, simplicity.
Physical layer (execution): file formats, memory, algorithms, etc.
27
Digression: Business Data Processing, ca. 1960
Record-oriented data; we ask queries: Who works in Sales? Where is Jim’s dept? How do we connect two records?
Employees(Name, HiredOn, Salary, Dept): Sue, 6/21/2000, $46k, Engineering; Jim, 1/11/2001, $39k, Sales; Yin, 12/1/2000, $42k, (Sales, Engineering).
Dept(Name, City): Sales, New York; Engineering, Denver.
28
50s–60s: Hierarchical Data Model
Who works in Sales?
FIND ANY Dept USING ‘Sales’
FIND FIRST Employee WITHIN Dept
DOWHILE DB-STATUS <> 0
  GET Employee
  FIND NEXT Employee WITHIN Dept
ENDDO
[Diagram: Sales → Jim, Yin, …; Eng → Sue, …, Yin]
29
50s–60s: Hierarchical Data Model
[Diagram: the same records, restructured — Yin now stored under a different parent in the hierarchy]
Will the same query work now?
FIND ANY Dept USING ‘Sales’
FIND FIRST Employee WITHIN Dept
DOWHILE DB-STATUS <> 0
  GET Employee
  FIND NEXT Employee WITHIN Dept
ENDDO
30
What Changed? The representation, but not the information.
Observation: representation-dependent queries break. Codd* investigated this problem of data dependence.
*E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” CACM 13(6), 1970.
31
Data Dependence: Solution
Define a logical data model: tables are relations between their columns; even the Dept–Emp connection is modeled as a relation.
Extract a few logical operators: select records that match criteria; project away unused columns; join two tables based on common values.
DB management systems provide physical implementations of the logical operators. Users are insulated from representational and algorithmic complexity, free to focus on asking the right query.
32
1970: Relational Model
Explicit connection between Emps and Depts. The data model is provably correct; relational algebra is provably complete. Systems designers’ tasks are reduced to finding efficient implementations (not to say this is trivial!).
Employees(Name, HiredOn, Salary): Sue, 6/21/2000, $46k; Jim, 1/11/2001, $39k; Yin, 12/1/2000, $42k.
Dept(Name, City): Sales, New York; Engineering, Denver.
Dept-Emp(Name, Name): Engineering, Sue; Engineering, Yin; Sales, Yin; Sales, Jim.
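The insulation described on the previous slide can be illustrated with a toy evaluator. A hypothetical sketch (lists of dicts, not a real DBMS) of the three logical operators over the slide's tables; "Who works in Sales?" is now phrased against relations, independent of any physical layout:

```python
# Hypothetical sketch of the three logical operators over relations
# represented as lists of dicts (not a real DBMS).

def select(rows, pred):
    """SELECT: keep the records that match a criterion."""
    return [r for r in rows if pred(r)]

def project(rows, cols):
    """PROJECT: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, lcol, rcol):
    """JOIN: pair rows with matching values in the join columns."""
    return [{**l, **r} for l in left for r in right if l[lcol] == r[rcol]]

employees = [
    {"emp": "Sue", "hired": "6/21/2000", "salary": 46},
    {"emp": "Jim", "hired": "1/11/2001", "salary": 39},
    {"emp": "Yin", "hired": "12/1/2000", "salary": 42},
]
dept_emp = [
    {"dept": "Engineering", "emp": "Sue"},
    {"dept": "Engineering", "emp": "Yin"},
    {"dept": "Sales", "emp": "Yin"},
    {"dept": "Sales", "emp": "Jim"},
]

# "Who works in Sales?" -- no traversal code, no representation details:
sales = project(select(join(dept_emp, employees, "emp", "emp"),
                       lambda r: r["dept"] == "Sales"),
                ["emp"])
```

Restructuring the stored data only changes the physical implementations of these operators; the query itself keeps working, which is exactly the property the hierarchical query lacked.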
33
So What? For scientific data, can we find a logical data model and logical operators rich enough to express all relevant data products, and precise enough to guide efficient implementations?
34
Scientific Data Analysis. [Diagram: retrieve from filesystem / database / tertiary storage → read the grid & data representation → analyze → data product(s)]
35
Scientific Data Manipulation. [Diagram: much of the code (Fortran, C, Perl, libraries) is preparation — handling the representation — rather than manipulation proper.]
36
Scientific Data Manipulation: Patterns. [Diagram: representation patterns and manipulation patterns — iteration, aggregation, filtering.]
37
Data Model
Datasets are defined over grids. Grids are sets of cells; cells are nodes, edges, faces, … A GridField associates data values with grid elements: GridField = (G, k, g), written G_k g, where G is a grid, k is an integer (the dimension of the cells that carry the data), and g maps each k-cell of G to a data value. [Diagram: G_2 area — each 2-cell of G carries its area value a.]
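A minimal sketch of the model under an assumed in-memory representation (not the actual GridFields implementation): a grid lists its cells by dimension, and a GridField checks that g assigns a value to every k-cell.

```python
# Hypothetical in-memory representation of the GridField data model.

class GridField:
    def __init__(self, grid, k, values):
        # grid:   dict mapping dimension k -> list of k-cell ids
        # k:      the dimension of the cells carrying the data
        # values: the function g, as a dict cell id -> data value
        assert set(values) == set(grid[k]), "g must cover every k-cell"
        self.grid, self.k, self.values = grid, k, values

# A grid with four nodes (0-cells) and one face (2-cell); "G_2 area"
# binds an area value to the 2-cell (cell ids are made up).
grid = {0: ["n1", "n2", "n3", "n4"], 2: ["face1"]}
area = GridField(grid, 2, {"face1": 3.5})
```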
38
Operators (pattern → operator):
associating grids with data → bind
combining grids topologically → union, intersection, cross product
reducing a grid using data values → restrict
transforming grids → aggregate
39
Restrict. [Diagram: restrict(<19) applied to a gridfield with values 18, 21, 15, 14 keeps the cells valued 18, 15, 14; the cell valued 21 is cut away, along with its part of the grid.]
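Restrict can be sketched as a filter on the value function, with the surviving cells implicitly cutting down the grid. A hypothetical representation (a gridfield as a dict from cell id to value; the ids are made up):

```python
# Hypothetical sketch of restrict: keep only the cells whose data value
# satisfies the predicate (here, < 19).

def restrict(values, pred):
    """values: cell id -> data value; returns the restricted gridfield."""
    return {cell: v for cell, v in values.items() if pred(v)}

gf = {"c1": 18, "c2": 21, "c3": 15, "c4": 14}
assert restrict(gf, lambda v: v < 19) == {"c1": 18, "c3": 15, "c4": 14}
```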
40
Merge. [Diagram: merging a gridfield with values y1…y4 and a gridfield with values x1…x3 yields, over the cells they share, the pairs (x1, y1), (x2, y2), (x3, y3).]
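Merge can be sketched the same way: over the cells the two gridfields share, pair up their data values (hypothetical dict representation; the cell ids c1–c4 are made up):

```python
# Hypothetical sketch of merge: pair values on the shared cells.

def merge(a, b):
    """a, b: cell id -> value; returns cell id -> (a value, b value)."""
    return {cell: (a[cell], b[cell]) for cell in a if cell in b}

ys = {"c1": "y1", "c2": "y2", "c3": "y3", "c4": "y4"}
xs = {"c1": "x1", "c2": "x2", "c3": "x3"}
pairs = merge(xs, ys)   # c4 drops out: only xs's cells survive
```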
41
Aggregate. [Diagram: an assignment (chunk(3)) maps each target cell to a chunk of the input G_0 temp — {12.8 C, 12.5 C, 12.1 C} and {12.6 C, 13.1 C, 13.2 C} — and an aggregation (average) reduces each chunk to one output value: 12.45 C and 12.95 C.]
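A minimal sketch of aggregate under the same assumed dict representation: the assignment maps each target cell to its chunk of source cells, and the aggregation function folds each chunk's values into one value on the target cell. (The cell ids are made up, and the rounded averages come out slightly different from the slide's figures.)

```python
# Hypothetical sketch of aggregate: assignment + aggregation function.

def aggregate(source, assignment, f):
    """source: cell id -> value; assignment: target cell -> source cells."""
    return {t: f([source[c] for c in cells])
            for t, cells in assignment.items()}

def average(xs):
    return round(sum(xs) / len(xs), 2)

temp = {"c1": 12.1, "c2": 12.6, "c3": 13.1,
        "c4": 13.2, "c5": 12.8, "c6": 12.5}
chunks = {"t1": ["c5", "c6", "c1"],     # {12.8, 12.5, 12.1}
          "t2": ["c2", "c3", "c4"]}     # {12.6, 13.1, 13.2}
result = aggregate(temp, chunks, average)
```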
42
Example: Max Salinity
1. Target grid: H. 2. Assignment: cross with V (each cell e ↦ e × V). 3. Aggregate: max. Applying agg(H_2 cross(V), max) to (H × V)_2 salt yields H_2 maxsalt.
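Composing the pieces, the max-salinity recipe can be sketched end to end: cross the horizontal grid H with the vertical axis V, read salt over the 3-D product grid, then aggregate with max back onto H, one maxsalt value per horizontal cell. (Hypothetical sketch; H, V, and the salt values are made up.)

```python
# Hypothetical end-to-end sketch of the max-salinity recipe.

def cross(h_cells, v_cells):
    """Assignment induced by H x V: each H cell -> its column of 3-D cells."""
    return {h: [(h, v) for v in v_cells] for h in h_cells}

def aggregate(source, assignment, f):
    return {t: f(source[c] for c in cells)
            for t, cells in assignment.items()}

H = ["h1", "h2"]
V = ["v1", "v2", "v3"]
# salt over (H x V)_2, indexed by (horizontal cell, vertical level):
salt = {("h1", "v1"): 28.0, ("h1", "v2"): 30.5, ("h1", "v3"): 29.0,
        ("h2", "v1"): 25.0, ("h2", "v2"): 26.5, ("h2", "v3"): 27.0}

maxsalt = aggregate(salt, cross(H, V), max)   # H_2 maxsalt
```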
43
Example: Salinity Gradient
1. Target grid: G. 2. Assignment: neighbors. 3. Aggregate: gradient. Applying agg(G_2 neighbors, gradient) to G_2 salt yields G_2 saltgrad.
44
Example: WetGrid. [Pipeline diagram: bind x and y onto H; bind bathym and z; merge with V; then restrict(z > bathym) to produce gfWetGrid.]
45
Example: Plume. [Pipeline diagram: starting from gfWetGrid, bind elev and salt over H and V, merge, then restrict(x > 300), restrict(z < elev), and restrict(salt < 26) to produce gfPlume.]
46
Logical Analysis. [Diagram: equivalent plans combining a restrict r with a merge of S_s and X_x — applying r(S, X) after the merge, versus pushing r(S) or r(X) below the merge; choosing the right plan changes the cost, e.g. from O(n²) to O(1).]
47
Macro-Data Issues: Database Extensions for Scientific Data Management
48
Scientific Data
Big → efficiency is paramount.
Complex processing → formats must match existing tools.
Few updates → concurrency is not a major concern.
Extensive metadata → provenance/lineage/pedigree.
49
Science is hitting a wall: FTP and GREP are not adequate
You can GREP 1 MB in 1 sec, and FTP 1 MB in 1 sec
You can GREP 1 GB in 1 min, and FTP 1 GB in 1 min
You can GREP 1 TB in 2 days, and FTP 1 TB in 2 days (1K$)
You can GREP 1 PB in 3 years, and FTP 1 PB in 3 years (1M$)
Oh, and 1 PB ~ 10,000 disks
At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.
Slide courtesy of Alex Szalay and Jim Gray, from “Public Access to Large Astronomical Datasets,” presented at the Data Provenance Workshop, 2002.
50
With No Database: no backup, no recovery, no transactions, no concurrency, limited security; no query, hence limited sharing and limited automation; data dependence. [Diagram: read arrays from files in the filesystem; browse; analyze → data product]
51
First Attempt: Relational Databases. Problems: impedance mismatch; performance. [Diagram: arrays stored as relations in the database; the DBMS answers queries; special tools render the results into data products]
52
Second Attempt: DB-Managed Files (IBM DataLinks, Oracle iFS). Problems: redundant computation; repetitive computation; data dependence. [Diagram: the DBMS manages files; analysis reads arrays from the files; special tools turn query results into data products]
53
Third Attempt: Analysis-Aware DBs (ESSW, Chimera). Problem: still data dependent. [Diagram: the DBMS tracks files and analyses; special tools turn query results into data products]
54
Process Modeling. [Diagram: a process model linking recipe versions (M1, M2), meshes, forcings (atm, tide), parameters (10/3, C=.2), executions (E1, E2; cpu, mem), files (F1–F3), and the bindings of the data product pipeline.]
55
Fourth Attempt: Array Types for DBs (AML, AQL, Monoid Calculus). Problems: data dependence? expressiveness? [Diagram: arrays in the database; the DBMS answers queries; special tools render the results into data products]
56
Fifth Attempt: Specialized Data Models for DBs (Active Data Repository, Aurora, GridFields, …). Problem: the query language doesn’t exist! [Diagram: a specialized data model in the database; the DBMS answers queries; special tools render the results into data products]
57
Macro-Data Summary
Science is outgrowing its infrastructure; databases can help. Competing solutions, no clear winner: extensions to existing database technology vs. specialized scientific computing platforms. Limited industrial interest (science is not a big source of $$$).
58
A Final Topic: Data Provenance
59
Data Provenance
“...a record of the origin and history of a piece of data.” — Dave Pearson, Oracle UK
“...a history of steps and procedures associated with the processing of associated data.” — Bob Mann, University of Edinburgh
“...metadata which uniquely defines data and provides a traceable path to its origin.” — Carmen Pancerella et al., Sandia National Laboratories
“...determining the validity of data by gaining access to a complete audit trail describing how the data was produced from [base] datasets...” — Ian Foster, U of Chicago
60
Data Provenance
Used for: discovery (querying), validation, reproducibility.
Related issues: annotations, federated databases, publishing.
61
Data Provenance Research Thrusts
Domain-specific standards: astronomy, high-energy physics, bioinformatics, environmental observation and forecasting?
Representation: XML, BLOBs, explicit schema support.
Database extensions: tracking provenance through queries implicitly.
62
Summary
Micro-data issues — logical level: convenient expression, genericity, algebraic optimization; physical level: efficient execution.
Macro-data issues — database features for scientific data: metadata, provenance.
64
Timeseries
65
Isoline
66
Transect
67
Ensemble
68
Volume Rendering
69
Isosurface
70
Salt in DX
71
Vorticity in DX
72
Max Salt in DX
73
Filtering in DX