Modeling Data Product Generation
Bill Howe, Dave Maier
Data Product Management
- Thesis: the value of an EOFS is the number of products it provides
- Limits on the number of products:
  - Amount of oversight for current products
  - Time to create a new product
  - Resources required to generate products
11/14/2018 Modeling Data Product Generation
Modeling Data Products
- Data Product Definitions (DPDs), or “recipes”
- Initially for documentation
- A “blueprint” for manual construction
Beyond Documentation: Quality Analysis and Translation
- Calculate quality metrics from DPDs (e.g., resolution)
- Translate DPDs into an executable network of Infopipes (meeting a quality standard)
Product Generation and Documentation
- Management and scheduling of the product suite based on input availability, resources, and dissemination requirements
- Job shop, assembly line, adaptive
- Eventually: priorities, feedback to sensors and models
Performance Optimization
- Algebraic optimization
- Common subresults and shared scans on groups of products
Remote Computation
- “Product kit”: final product built at the consumer site
- Remote “product factory”
Exercise: Fill in the Acronym
CORMORANT: COlumbia River Modeling, Observation, Retrieval?? & Archive…
Roadmap
- Vision
- Status
  - Past: Graphical Diagram, Process Modeling, Type System
  - Current: Abstract Grids, Grid Functions
Graphical System Description
- Studied relevant files and codes to model:
  - Producers and consumers
  - Control flow
  - Data flow
- Benefits:
  - understanding within the project
  - communication outside the project
- Drawbacks:
  - only a ‘snapshot’
  - very literal
  - no scheduling help...
Brittle Scheduling
- Contentious codes cause crashes
- Annotate the diagram with cron job information?
- But it would be nice to capture real executions of all system components for careful study
Instrumenting CORIE
- Model the executions of codes using a relational database
- Monitor CORIE activity using SGI’s FAM technology
- Try to identify bottlenecks, problem spots, and resource consumption properties
- Status: we’re poised to perform further testing; some security concerns have been raised
More than just processes...
- The model is too close a fit
- Let’s start at a higher level...
A Candidate Type System
- Relevant types:
  - TimeSeries (TS)
  - ElementField (EF) / NodeField (NF)
  - DepthField (DF)
- Examples:
  - salt.63 = TS (EF (DF Salinity))
  - fort.21 = EF Depth
  - findmax63 = TS (EF (DF a)) → TS (EF a)
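One way to read the findmax63 signature is as a function that collapses the depth dimension and leaves the time and element dimensions alone. The sketch below is an illustration only: it assumes nested Python lists as the encoding of TS, EF, and DF, and the function body is a guess at the intent rather than the actual CORIE code.

```python
from typing import List

# Hypothetical encodings of the slide's types as nested lists.
DepthField = List[float]          # DF a: one value per depth level
ElementField = List[DepthField]   # EF (DF a): one depth column per element
TimeSeries = List[ElementField]   # TS (EF (DF a)): one field per time step


def findmax(ts: TimeSeries) -> List[List[float]]:
    """TS (EF (DF a)) -> TS (EF a): replace each depth column by its maximum."""
    return [[max(column) for column in field] for field in ts]


# Two time steps, two elements, three depth levels each (made-up values).
salt = [
    [[30.1, 31.5, 32.0], [12.0, 18.4, 25.3]],
    [[29.8, 31.0, 31.9], [10.5, 17.2, 24.8]],
]
max_salt = findmax(salt)
print(max_salt)
```

The result keeps the outer two levels of nesting and drops the innermost one, which is exactly what the change from TS (EF (DF a)) to TS (EF a) says.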
Abstract Data Product Recipes
- But consider compute_plumevol:
  [Diagram: dataflow over Grid, Vol, and Elev using select(sal<30), subgrid(Ocean), sum(grid), and +, producing plumevol]
- This informal recipe seems appropriate regardless of the specifics of our data representation
- This information should be captured somewhere!
- Currently it’s obscured by C codes and tightly coupled with the TS (EF (DF a)) structure
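The recipe can be written down without reference to the file layout. The following is a minimal sketch, assuming each cell already carries a region label, a salinity, and a volume; the Cell class, the region names, and the sample values are all hypothetical, and the Elev input from the diagram is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cell:
    region: str      # hypothetical region label, e.g. "Ocean" or "Estuary"
    salinity: float  # psu
    volume: float    # hypothetical per-cell volume


def subgrid(cells: List[Cell], region: str) -> List[Cell]:
    """Restrict the grid to the cells of one region."""
    return [c for c in cells if c.region == region]


def select(cells: List[Cell], predicate: Callable[[Cell], bool]) -> List[Cell]:
    """Keep only the cells that satisfy the predicate."""
    return [c for c in cells if predicate(c)]


def plume_volume(cells: List[Cell]) -> float:
    """subgrid(Ocean), then select(sal < 30), then sum the volumes."""
    ocean = subgrid(cells, "Ocean")
    plume = select(ocean, lambda c: c.salinity < 30)
    return sum(c.volume for c in plume)


cells = [
    Cell("Ocean", 25.0, 2.0),
    Cell("Ocean", 33.0, 4.0),
    Cell("Estuary", 10.0, 8.0),
    Cell("Ocean", 29.5, 1.0),
]
print(plume_volume(cells))
```

Nothing in plume_volume depends on whether the data arrives as a time series of element fields or in some other layout, which is the point of stating the recipe abstractly.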
Topological Grid
- A more general grid Gd is a collection of k-cells of dimension k, k in {0..d}
- A grid function GF is a mapping from a k-cell to a value of type T
- GF : k-cell → T
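These two definitions translate fairly directly into data structures. The sketch below is one possible encoding, not the project's implementation: the class names KCell, Grid, and GridFunction and the integer identifiers are assumptions, and incidence relations between cells are omitted.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Generic, Set, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class KCell:
    dim: int    # k: 0 = node, 1 = edge, 2 = face, ...
    ident: int  # hypothetical identifier, unique within a dimension


@dataclass(frozen=True)
class Grid:
    d: int
    cells: FrozenSet[KCell]

    def k_cells(self, k: int) -> Set[KCell]:
        """All cells of dimension k."""
        return {c for c in self.cells if c.dim == k}


class GridFunction(Generic[T]):
    """GF : k-cell -> T, stored as a dictionary over the k-cells of a grid."""

    def __init__(self, grid: Grid, k: int, values: Dict[KCell, T]) -> None:
        if set(values) != grid.k_cells(k):
            raise ValueError("a grid function must assign a value to every k-cell")
        self.grid, self.k, self.values = grid, k, values

    def __call__(self, cell: KCell) -> T:
        return self.values[cell]


# A single triangle: three nodes, three edges, one face (d = 2).
nodes = [KCell(0, i) for i in range(3)]
edges = [KCell(1, i) for i in range(3)]
face = KCell(2, 0)
triangle = Grid(d=2, cells=frozenset(nodes + edges + [face]))

# A temperature grid function defined on the 0-cells (made-up values).
temp = GridFunction(triangle, 0, {nodes[0]: 15.0, nodes[1]: 15.5, nodes[2]: 14.8})
print(temp(nodes[0]))
```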
Imagine a big 4D grid representing our current best data
- [Figure labels: hindcast experimental ELCIRC vers missing hindcast forecast]
- Grid Functions (GF) map grid locations to values, e.g. 15 °C, 23.4 psu
Grid Functions
- We can derive new grid functions from our original set
- [Diagram: GF Salt, GF Magnitude, GF Velocity, GF Velo N’hood, GF Temp, GF Vorticity, GF Elev, GF Neighbors]
Benefits
- Say we have recipes that involve a grid, some grid functions, and some operators
- So what? Well,
  - We can reason about data product outputs
  - We can optimize recipe execution
Reasoning about Types
- GF Velocity → applytoall(vort) → GF Vorticity
- GF Salt → applytoall(vort) → GF ???
- High-level recipes can detect this kind of error before wasting compute resources
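A check of this kind needs only the operator signatures, not the data. The sketch below assumes a small hypothetical signature table and walks a recipe symbolically; the table contents, the RecipeTypeError name, and check_recipe are illustrative and not part of any existing CORIE tool.

```python
from typing import Dict, List, Tuple

# Hypothetical signatures: operator name -> (input value type, output value type).
SIGNATURES: Dict[str, Tuple[str, str]] = {
    "vort": ("Velocity", "Vorticity"),
    "magnitude": ("Velocity", "Magnitude"),
}


class RecipeTypeError(Exception):
    """Raised when an operator is applied to a grid function of the wrong type."""


def applytoall(op: str, gf_type: str) -> str:
    """Return the value type of applytoall(op) on a GF of type gf_type."""
    expected, result = SIGNATURES[op]
    if gf_type != expected:
        raise RecipeTypeError(f"{op} expects GF {expected}, got GF {gf_type}")
    return result


def check_recipe(input_type: str, ops: List[str]) -> str:
    """Type-check a chain of applytoall steps without touching any data."""
    current = input_type
    for op in ops:
        current = applytoall(op, current)
    return current


print(check_recipe("Velocity", ["vort"]))
try:
    check_recipe("Salt", ["vort"])
except RecipeTypeError as err:
    print("rejected:", err)
```

The second recipe is rejected before any grid is read, which is where the saving in compute resources comes from.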
Reasoning about Schema
- GF1 → subgrid(Ocean) → GF2
- type(GF1) = type(GF2), but schema(GF1) ≠ schema(GF2), since GF2 is defined over a smaller grid than GF1
- By tracking schema information through complex recipes we can:
  - check for errors
  - estimate resource requirements (big schemas require big buffers)
- [Figure: a valid transect and an invalid transect]
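Schema tracking can also be done symbolically. The sketch below assumes that a schema is simply the set of cell ids a grid function is defined over, and that a transect is valid when all of its cells lie inside that set; both are simplifying assumptions, as are the names GFInfo, buffer_bytes, and transect_is_valid and the 8-byte value size.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class GFInfo:
    value_type: str
    schema: FrozenSet[int]  # hypothetical: ids of the cells the GF is defined over


def subgrid(gf: GFInfo, region_cells: FrozenSet[int]) -> GFInfo:
    """Same value type, smaller schema."""
    return GFInfo(gf.value_type, gf.schema & region_cells)


def buffer_bytes(gf: GFInfo, bytes_per_value: int = 8) -> int:
    """Rough buffer estimate: one value per cell in the schema."""
    return len(gf.schema) * bytes_per_value


def transect_is_valid(gf: GFInfo, transect_cells: Iterable[int]) -> bool:
    """A transect can only sample cells where the GF is defined."""
    return set(transect_cells) <= gf.schema


gf1 = GFInfo("Salinity", frozenset(range(10)))
ocean = frozenset({0, 1, 2, 3})
gf2 = subgrid(gf1, ocean)

print(gf1.value_type == gf2.value_type, gf1.schema == gf2.schema)
print(buffer_bytes(gf1), buffer_bytes(gf2))
print(transect_is_valid(gf2, [1, 2, 3]), transect_is_valid(gf2, [2, 7]))
```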
Reasoning about Quality
- Say we have operators coarsen and refine, which lower resolution via grouping and raise resolution via interpolation, respectively
- GF1 → coarsen → refine → GF2
- type(GF1) = type(GF2) and schema(GF1) = schema(GF2), but qual(GF1) ≠ qual(GF2)
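A small one-dimensional example shows why the round trip does not restore quality. The sketch assumes a hypothetical quality metric, the number of independent samples behind the values, and uses group averaging for coarsen and piecewise-constant interpolation for refine as the simplest stand-ins; a real implementation would likely use something more careful.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GF:
    values: List[float]
    resolution: int  # hypothetical quality metric: number of independent samples


def coarsen(gf: GF, factor: int = 2) -> GF:
    """Lower resolution by averaging groups of `factor` consecutive values."""
    groups = [gf.values[i:i + factor] for i in range(0, len(gf.values), factor)]
    means = [sum(g) / len(g) for g in groups]
    return GF(means, min(gf.resolution, len(means)))


def refine(gf: GF, factor: int = 2) -> GF:
    """Raise resolution by piecewise-constant interpolation.

    The result has more cells but no more information, so the quality
    metric is carried over unchanged.
    """
    repeated = [v for v in gf.values for _ in range(factor)]
    return GF(repeated, gf.resolution)


gf1 = GF([1.0, 3.0, 5.0, 7.0], resolution=4)
gf2 = refine(coarsen(gf1))

print(gf2.values)
print(len(gf1.values) == len(gf2.values), gf1.resolution, gf2.resolution)
```

Both grid functions cover the same four cells, yet the tracked resolution of the second is half that of the first.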
Optimize via Algebraic Manipulations
- Different sequences of operators can give equivalent results
- [Diagram: two pipelines over GF Elev and GF Area producing GF Vol; the first applies computevol and then subgrid(Ocean), the second applies subgrid(Ocean) and then computevol]
- These are equivalent, but the second avoids computing volume over the entire grid
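The equivalence and the saving can both be checked on a toy grid. In the sketch below the volume formula (elevation times area), the dictionary encoding, and the work counter are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

work = {"computevol": 0}  # counts how many cells computevol has processed


def computevol(cells: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """cells maps id -> (elev, area); hypothetical volume = elev * area."""
    work["computevol"] += len(cells)
    return {cid: elev * area for cid, (elev, area) in cells.items()}


def subgrid(cells: dict, keep: Set[int]) -> dict:
    """Restrict any cell-indexed mapping to the ids in `keep`."""
    return {cid: v for cid, v in cells.items() if cid in keep}


grid = {0: (2.0, 10.0), 1: (3.0, 10.0), 2: (1.0, 5.0), 3: (4.0, 5.0)}
ocean = {0, 1}

# Plan A: compute volume everywhere, then restrict to the ocean.
plan_a = subgrid(computevol(grid), ocean)
work_a = work["computevol"]

# Plan B: restrict first, then compute volume only where needed.
work["computevol"] = 0
plan_b = computevol(subgrid(grid, ocean))
work_b = work["computevol"]

print(plan_a == plan_b, work_a, work_b)
```

The rewrite is valid here because computevol works cell by cell, so it commutes with subgrid; an operator that reads neighboring cells would need a more careful rule.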
Optimize via Choice of Implementation
- GF Salt → select(s < 30) → ?
- Candidate result representations:
  - GF Bool: F T
  - GF (Maybe Salt): - 22 24 23
  - {KCell}: {c1, c2, c3}
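The three candidates can be written side by side. This sketch assumes a grid function stored as a dictionary from cell name to salinity; the function names are hypothetical, and the sample values are chosen to mirror the ones on the slide.

```python
from typing import Callable, Dict, Optional, Set

Pred = Callable[[float], bool]

salt: Dict[str, float] = {"c0": 31.0, "c1": 22.0, "c2": 24.0, "c3": 23.0}


def select_as_mask(gf: Dict[str, float], pred: Pred) -> Dict[str, bool]:
    """GF Bool: same schema, one flag per cell."""
    return {c: pred(v) for c, v in gf.items()}


def select_as_optional(gf: Dict[str, float], pred: Pred) -> Dict[str, Optional[float]]:
    """GF (Maybe Salt): same schema, rejected cells hold None."""
    return {c: (v if pred(v) else None) for c, v in gf.items()}


def select_as_cells(gf: Dict[str, float], pred: Pred) -> Set[str]:
    """{KCell}: only the set of cells that passed."""
    return {c for c, v in gf.items() if pred(v)}


def below_30(s: float) -> bool:
    return s < 30


print(select_as_mask(salt, below_30))
print(select_as_optional(salt, below_30))
print(sorted(select_as_cells(salt, below_30)))
```

The mask and the optional form keep the original schema, which suits later cell-by-cell operators; the cell set is the most compact when few cells pass. Which one is best depends on what the rest of the recipe does with the result.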
Optimize via Shared Intermediate Results
- A node’s neighbors don’t often change, so we can avoid re-computing this result
- [Diagram: GF Velocity, GF Velo N’hood, GF Vorticity, GF Neighbors, GF Salt N’hood, GF Salinity, GF Salt Gradient]
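Caching the neighbor lookup is enough to share it between products. The sketch below assumes a tiny hypothetical mesh and uses a neighborhood mean as a stand-in for the vorticity and gradient computations, which are not specified on the slide.

```python
from functools import lru_cache
from typing import FrozenSet, List

EDGES = ((0, 1), (1, 2), (2, 3), (0, 2))  # hypothetical mesh connectivity
neighbor_computations = 0                  # counts real (uncached) lookups


@lru_cache(maxsize=None)
def neighbors(node: int) -> FrozenSet[int]:
    """Nodes sharing an edge with `node`; computed once per node, then cached."""
    global neighbor_computations
    neighbor_computations += 1
    return frozenset(b if a == node else a for a, b in EDGES if node in (a, b))


def neighborhood_mean(values: List[float], node: int) -> float:
    """Stand-in for a neighborhood operator such as vorticity or gradient."""
    hood = neighbors(node) | {node}
    return sum(values[n] for n in hood) / len(hood)


salt = [30.0, 20.0, 10.0, 0.0]
speed = [1.0, 2.0, 3.0, 4.0]

salt_hood = [neighborhood_mean(salt, n) for n in range(4)]
speed_hood = [neighborhood_mean(speed, n) for n in range(4)]

print(neighbor_computations)
```

Two products were derived over four nodes, but the neighbor relation was computed only once per node. If the mesh does change, the cache can be reset with neighbors.cache_clear().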
Other niceties...
- We don’t have to re-implement everything to realize benefits
- But eventually we’ll want to wag the dog!
- A collection of recipes can help...
  - communicate the product catalog
  - provide provenance
  - derive new recipes from parts of old ones
  - support product lines
Summary
- Modeling the current CORIE
  - Graphical System Description
  - pmon
- Modeling the future CORIE
  - Grid Functions
  - Recipes
  - Reasoning
  - Optimization
Milestones
- RPE this spring
- Specify existing data products using the model
- Perform checks on existing production plans
  - Type
  - Schema / Resources
  - Quality
A Thorough Experiment Management Schema
A Good Start...
- task definition
- task instance (with parameters)
- task execution
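The three levels could be laid out as relational tables along the following lines. This is a minimal sketch using an in-memory SQLite database; every table name, column name, and sample row is hypothetical and is not taken from the actual pmon schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task_definition (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL                 -- the code being run
);
CREATE TABLE task_instance (
    id            INTEGER PRIMARY KEY,
    definition_id INTEGER NOT NULL REFERENCES task_definition(id),
    parameters    TEXT NOT NULL        -- a definition bound to parameters
);
CREATE TABLE task_execution (
    id          INTEGER PRIMARY KEY,
    instance_id INTEGER NOT NULL REFERENCES task_instance(id),
    exit_status INTEGER NOT NULL       -- one row per actual run
);
""")

# Made-up sample rows: one definition, one instance, two executions.
conn.execute("INSERT INTO task_definition (id, name) VALUES (1, 'compute_plumevol')")
conn.execute(
    "INSERT INTO task_instance (id, definition_id, parameters) VALUES (1, 1, 'threshold=30')"
)
conn.executemany(
    "INSERT INTO task_execution (instance_id, exit_status) VALUES (?, ?)",
    [(1, 0), (1, 1)],
)

# Runs and failures per task definition.
rows = conn.execute("""
    SELECT d.name, COUNT(e.id), SUM(e.exit_status != 0)
    FROM task_definition AS d
    JOIN task_instance  AS i ON i.definition_id = d.id
    JOIN task_execution AS e ON e.instance_id = i.id
    GROUP BY d.name
""").fetchall()
print(rows)
```

Separating the instance from the execution lets the same parameterized task be run many times while each run keeps its own outcome, which is what a study of bottlenecks and problem spots would query.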
pmon Architecture
- pmon (Process Monitor), Database, Web Server
- fam (File Alteration Monitor): imon, dnotify, or polling, depending on kernel patch
- Filesystem
- pacct (stopped process stats)
- /proc (running process info)
- acct (process accounting)
- Process to Monitor
- Linux Kernel