Virtual Data and the Chimera System*

Ian Foster
Mathematics and Computer Science Division, Argonne National Laboratory
and Department of Computer Science, The University of Chicago

*Joint work with Jens Vöckler, Mike Wilde, Yong Zhao

HPC 2002 Conference, Cetraro, June 26, 2002

ARGONNE CHICAGO

Overview
• Problem
  – Managing programs and computations as community resources
• Technology
  – Chimera virtual data system
• Applications
  – Virtual Data ≠ Virtual Concept!
• Futures
  – Research challenges & plans

Programs as Community Resources: Data Derivation and Provenance
• Most [scientific] data are not simple “measurements”; essentially all are:
  – Computationally corrected/reconstructed
  – And/or produced by numerical simulation
• And thus, as data and computers become ever larger and more expensive:
  – Programs are significant community resources
  – So are the executions of those programs
• Management of the transformations that map between datasets is an important problem

Motivations (1)
[Figure: Transformation, Derivation, and Data, linked by the relations created-by, execution-of, and consumed-by/generated-by]
• “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
• “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
• “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
• “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”

Motivations (2)
• Data trackability and result auditability
  – Universally sought by GriPhyN applications
• Repair and correction of data
  – Rebuild data products (cf. “make”)
• Workflow management
  – A new, structured paradigm for organizing, locating, specifying, and requesting data products
• Performance optimizations
  – Ability to re-create data rather than move it
• And others, some we haven’t thought of

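The "make"-style repair scenario can be illustrated with a small sketch. It assumes that each recorded computation is known only by the files it read and wrote; the catalog contents and every name below (`DERIVATIONS`, `stale_products`, the file names) are hypothetical and are not part of Chimera.

```python
from collections import deque

# Hypothetical catalog: derivation name -> (input files, output files).
DERIVATIONS = {
    "calibrate": ({"raw"}, {"calibrated"}),
    "analyze": ({"calibrated"}, {"catalog"}),
    "unrelated": ({"other_raw"}, {"other_catalog"}),
}


def stale_products(changed_file, derivations):
    """Return every file derived, directly or indirectly, from changed_file."""
    # Index each file by the derivations that consume it.
    consumers = {}
    for name, (inputs, _outputs) in derivations.items():
        for f in inputs:
            consumers.setdefault(f, []).append(name)

    stale = set()
    queue = deque([changed_file])
    while queue:
        current = queue.popleft()
        for name in consumers.get(current, []):
            for out in derivations[name][1]:
                if out not in stale:
                    stale.add(out)
                    queue.append(out)
    return stale


print(sorted(stale_products("raw", DERIVATIONS)))  # ['calibrated', 'catalog']
```

A breadth-first walk over the consumer index is enough here because only the set of affected products is wanted, not the order in which to rebuild them.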
Chimera Virtual Data System
• Virtual data catalog (VDC)
  – Transformations, derivations, data
• Virtual data language
  – VDC definition and query
• Applications include browsers and data analysis applications
[Figure labels: GriPhyN VDT: replica catalog, DAGMan, Globus Toolkit, etc.]

Transformations and Derivations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition" in C
• Derivation
  – Formal invocation of a Transformation
  – Similar to "function call" in C
  – Stores past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation (future)
  – Record of each Derivation (re)execution
  – Similar to strace (BSD) or truss (SysV)

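The function-definition and function-call analogy can be made concrete with a minimal sketch. The class and field names below are assumptions chosen for illustration; they do not reproduce the Java class hierarchy of the Virtual Data API.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (the "function definition")."""
    name: str
    executable: str
    formal_args: tuple  # names of the formal parameters


@dataclass
class Derivation:
    """Binding of actual arguments to a transformation (the "function call")."""
    name: str
    transformation: Transformation
    actual_args: dict

    def __post_init__(self):
        # Reject actual arguments that the transformation does not declare.
        unknown = set(self.actual_args) - set(self.transformation.formal_args)
        if unknown:
            raise ValueError(f"unknown arguments: {sorted(unknown)}")


@dataclass
class Invocation:
    """Record of one execution (or re-execution) of a derivation."""
    derivation: Derivation
    exit_code: int


# Hypothetical usage: one template, one call, one execution record.
tr1 = Transformation("tr1", "/usr/bin/app1", ("a1", "a2"))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0)
```

Keeping the three concepts as separate types mirrors the slide: a derivation can exist as a recipe before any invocation of it has been recorded.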
Virtual Data Catalog Structure

Virtual Data Tools
• Virtual Data API
  – A Java class hierarchy to represent transformations and derivations
• Virtual Data Language
  – Textual for people & illustrative examples
  – XML for machine-to-machine interfaces
• Virtual Data Database
  – Makes the objects of a virtual data definition persistent
• Virtual Data Service
  – Provides a service interface (e.g., OGSA) to persistent objects

Virtual Data Language: XML

Example Transformation
TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  profile hints.exec-pfn = "/usr/bin/app3";
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}
[Figure: t1 reads $a1 and writes $a2]

Example Derivations
DV d1->t1 (
  env="20000", pa="600",
);
DV d2->t1 (
);

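How a derivation's actual arguments might combine with the defaults declared by t1 can be sketched as follows. This is an illustration only: the merge and expansion rules are assumptions rather than the documented VDL semantics, the argument templates are a simplified rendering of the `argument` lines of t1, and the file names `in.dat` and `out.dat` are hypothetical stand-ins for file bindings that the example derivations do not show.

```python
import re

# Formal parameters of t1 and their defaults, taken from the transformation
# example. None marks a parameter without a default (the files a1 and a2).
T1_DEFAULTS = {"a2": None, "a1": None, "pa": "500", "env": "100000"}

# Simplified argument templates of t1, in declaration order.
T1_ARGUMENTS = ["-p ${pa}", "-f ${a1}", "-x -y"]


def bind(defaults, actuals):
    """Merge a derivation's actual arguments over the transformation defaults."""
    unknown = set(actuals) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    bound = {**defaults, **actuals}
    missing = sorted(k for k, v in bound.items() if v is None)
    if missing:
        raise ValueError(f"unbound parameters: {missing}")
    return bound


def expand(template, bound):
    """Replace every ${name} in the template with its bound value."""
    return re.sub(r"\$\{(\w+)\}", lambda m: bound[m.group(1)], template)


def command_line(executable, templates, bound):
    """Assemble the command line for one bound derivation."""
    return " ".join([executable] + [expand(t, bound) for t in templates])


# d1 overrides env and pa; the file names are hypothetical.
d1 = bind(T1_DEFAULTS, {"env": "20000", "pa": "600",
                        "a1": "in.dat", "a2": "out.dat"})
print(command_line("/usr/bin/app3", T1_ARGUMENTS, d1))
# /usr/bin/app3 -p 600 -f in.dat -x -y
```

A derivation such as d2, which supplies no values for `pa` or `env`, would fall back to "500" and "100000" under this merge rule.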
Managing Dependencies
TR tr1( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(
DV x2->tr2(
[Figure: file1, file2, and file3 linked through derivations x1 and x2]

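One way dependencies between derivations could be inferred is by matching each derivation's input files against the outputs of the others. The sketch below assumes that x1 maps file1 to file2 and x2 maps file2 to file3; the slide's derivation bodies are not shown, so these bindings are a guess, and the function names are hypothetical. It relies on `graphlib`, which is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Assumed bindings: derivation name -> (input files, output files).
DERIVATIONS = {
    "x2": ({"file2"}, {"file3"}),
    "x1": ({"file1"}, {"file2"}),
}


def dependency_graph(derivations):
    """Map each derivation to the set of derivations that produce its inputs."""
    producer = {}
    for name, (_inputs, outputs) in derivations.items():
        for f in outputs:
            producer[f] = name
    return {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _outputs) in derivations.items()
    }


def execution_order(derivations):
    """Return the derivations in an order that respects file dependencies."""
    graph = dependency_graph(derivations)
    return list(TopologicalSorter(graph).static_order())


print(execution_order(DERIVATIONS))  # ['x1', 'x2']
```

No dependency is declared explicitly: x2 follows x1 only because x2 consumes a file that x1 produces.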
Initial “Strawman” Architecture (Use of GriPhyN Virtual Data Toolkit)
[Figure: VDLx, abstract planner, DAX, concrete planner, DAGMan]

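The role of a planner in this architecture can be hinted at with a toy example: given a requested file and the set of files that already exist, work backwards to the derivations that still have to run. This is a sketch under stated assumptions, not the Chimera planner; the catalog layout, the `plan` function, and the file names are hypothetical, and the derivation graph is assumed to be acyclic.

```python
# Hypothetical catalog: derivation name -> (input files, output files).
DERIVATIONS = {
    "x1": ({"file1"}, {"file2"}),
    "x2": ({"file2"}, {"file3"}),
}


def plan(target, derivations, materialized):
    """Return the derivations to run, in order, so that target exists."""
    producer = {
        out: name
        for name, (_inputs, outputs) in derivations.items()
        for out in outputs
    }
    steps = []

    def visit(f):
        if f in materialized:
            return  # already available, nothing to compute
        if f not in producer:
            raise LookupError(f"no derivation produces {f!r}")
        name = producer[f]
        if name in steps:
            return  # already scheduled through another output
        for needed in sorted(derivations[name][0]):
            visit(needed)
        steps.append(name)

    visit(target)
    return steps


print(plan("file3", DERIVATIONS, {"file1"}))           # ['x1', 'x2']
print(plan("file3", DERIVATIONS, {"file1", "file2"}))  # ['x2']
```

The second call shows the virtual data idea in miniature: an intermediate product that is already materialized prunes the work needed to satisfy the request.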
Chimera Application: Sloan Digital Sky Survey Analysis
Joint work with Jim Annis, Steve Kent, FNAL
• Size distribution of galaxy clusters?
• Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
[Figure: galaxy cluster size distribution]

Cluster-finding Data Pipeline
[Figure: pipeline diagram with data products tsObj, field, brg, core, cluster, and catalog, connected by processing steps 1 through 5]

Cluster-Finding Pipeline Execution

Small SDSS Cluster-Finding DAG

And Even Bigger: 744 Files, 387 Nodes

Virtual Data Usage Model
• Transformation designers create programmatic abstractions
  – Simple or compound; augment with metadata
• Production managers create bulk derivations
  – Can materialize data products or leave virtual
• Users track their work through derivations
  – Augment (replace?) the scientist’s log book
• Definitions can be augmented with metadata
  – The key to intelligent data retrieval
  – Issues relating to metadata propagation

Virtual Data Research Issues
• Representation
  – Metadata: how is it created, stored, propagated?
  – What knowledge must be represented? How?
  – Capturing notions of data approximation
  – Higher-order knowledge: virtual transformations
• VDC as a community resource
  – Automating data capture
  – Access control and privacy issues
  – Quality control
• Data derivation
  – Query estimation and request planning

Virtual Data Research Issues
• “Engineering” issues
  – Dynamic (runtime-computed) dependencies
  – Large dependent sets
  – Extensions to other data models: relational, OO
  – Virtual data browsers
  – XML vs. relational databases & query languages
• Additional usage modalities
  – E.g., meta-analyses, automated experiment generation, “active notebooks”
• Virtual data browsers, editors

Status of Chimera R&D
• Early virtual data system demonstrated Nov ’01: HEP collision simulations
• Larger-scale problems addressed recently: “cluster finding” in SDSS
• First public release in June: Chimera v1.0
• Enhancements planned throughout the summer
• Physics & astronomy applications by SC’02
• Future R&D focus #1: request planning
• Future R&D focus #2: knowledge representation
• Future apps: bioinformatics, earth sciences

Related Work
• Data provenance
  – Materialized views, lineage: Cui, Widom
  – Data provenance tracking: Buneman et al.
• Capturing transformations
  – ZOO system and conceptual schema
• Data Grid technologies
  – GriPhyN, Globus Project, EU DataGrid

Summary
• Concept: Tools to support management of transformations and derivations as community resources
• Technology: Chimera virtual data system including virtual data catalog and virtual data language; use of GriPhyN virtual data toolkit for automated data derivation
• Results: Successful early applications to CMS and SDSS data generation/analysis
• Future: Public release of prototype, new apps, knowledge representation, planning

For More Information
• GriPhyN project (NSF ITR funded)
• Chimera virtual data system
  – “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
  – “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002