Virtual Data and the Chimera System* Ian Foster Mathematics and Computer Science Division Argonne National Laboratory and Department of Computer Science.

1 Virtual Data and the Chimera System*
Ian Foster
Mathematics and Computer Science Division, Argonne National Laboratory
and Department of Computer Science, The University of Chicago
http://www.mcs.anl.gov/~foster
*Joint work with Jens Vöckler, Mike Wilde, Yong Zhao
HPC 2002 Conference, Cetraro, June 26, 2002

2 Overview
foster@mcs.anl.gov ARGONNE CHICAGO
• Problem
  – Managing programs and computations as community resources
• Technology
  – Chimera virtual data system
• Applications
  – Virtual Data ≠ Virtual Concept!
• Futures
  – Research challenges & plans

4 Programs as Community Resources: Data Derivation and Provenance
• Most [scientific] data are not simple “measurements”; essentially all are:
  – Computationally corrected/reconstructed
  – And/or produced by numerical simulation
• And thus, as data and computers become ever larger and more expensive:
  – Programs are significant community resources
  – So are the executions of those programs
• Management of the transformations that map between datasets is an important problem

5 Motivations (1)
[Diagram: Transformation, Derivation, and Data, linked by the relationships “created-by”, “execution-of”, and “consumed-by/generated-by”]
• “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
• “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
• “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
• “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”

6 Motivations (2)
• Data track-ability and result audit-ability
  – Universally sought by GriPhyN applications
• Repair and correction of data
  – Rebuild data products (cf. “make”)
• Workflow management
  – A new, structured paradigm for organizing, locating, specifying, and requesting data products
• Performance optimizations
  – Ability to re-create data rather than move it
• And others, some we haven’t thought of
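
The calibration-error question and the “make” analogy both come down to a downstream traversal of recorded provenance. The following is a minimal Python sketch of that idea, not Chimera's implementation; the record layout, the file names, and the function `stale_products` are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical provenance records: (derivation name, input files, output files).
derivations = [
    ("calibrate", ["raw.dat"], ["calibrated.dat"]),
    ("reconstruct", ["calibrated.dat"], ["tracks.dat"]),
    ("summarize", ["tracks.dat"], ["summary.dat"]),
    ("unrelated", ["other.dat"], ["other.out"]),
]

def stale_products(derivations, changed):
    """Return every data product that transitively depends on `changed`."""
    consumers = defaultdict(list)  # file -> outputs of derivations that read it
    for _name, inputs, outputs in derivations:
        for f in inputs:
            consumers[f].extend(outputs)
    stale, queue = set(), deque([changed])
    while queue:  # breadth-first walk downstream of the changed file
        current = queue.popleft()
        for out in consumers[current]:
            if out not in stale:
                stale.add(out)
                queue.append(out)
    return stale

print(sorted(stale_products(derivations, "raw.dat")))
# prints ['calibrated.dat', 'summary.dat', 'tracks.dat']
```

Only products reachable from the changed input are flagged, so `other.out` is left alone.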

8 Chimera Virtual Data System
• Virtual data catalog
  – Transformations, derivations, data
• Virtual data language
  – VDC definition and query
• Applications include browsers and data analysis applications
[Diagram labels: GriPhyN VDT: replica catalog, DAGMan, Globus Toolkit, etc.]

9 Transformations and Derivations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition" in C
• Derivation
  – Formal invocation of a Transformation
  – Similar to "function call" in C
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation (future)
  – Record of each Derivation (re)execution
  – Similar to strace (BSD) or truss (SysV)
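
The function-definition / function-call analogy can be made concrete with three small record types. This is an illustrative Python sketch only: the class and field names are assumptions for this example and do not reflect the actual Chimera Java class hierarchy.

```python
from dataclasses import dataclass

@dataclass
class Transformation:
    """Abstract template of a program invocation, like a function definition."""
    name: str
    executable: str
    formal_args: tuple  # formal parameter names

@dataclass
class Derivation:
    """Binding of actual arguments to a transformation, like a function call."""
    name: str
    transformation: Transformation
    actual_args: dict  # formal name -> logical file name or value

    def unbound(self):
        """Formal arguments that this derivation leaves without a value."""
        return [a for a in self.transformation.formal_args
                if a not in self.actual_args]

@dataclass
class Invocation:
    """Record of one execution of a derivation."""
    derivation: Derivation
    exit_code: int

tr1 = Transformation("tr1", "/usr/bin/app1", ("a1", "a2"))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0)
print(x1.unbound())  # prints []
```

A derivation can exist without any invocation, which is what lets it serve as a recipe for data that has not been produced yet.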

10 Virtual Data Catalog Structure

11 Virtual Data Tools
• Virtual Data API
  – A Java class hierarchy to represent transformations and derivations
• Virtual Data Language
  – Textual for people & illustrative examples
  – XML for machine-to-machine interfaces
• Virtual Data Database
  – Makes the objects of a virtual data definition persistent
• Virtual Data Service
  – Provides a service interface (e.g., OGSA) to persistent objects

12 Virtual Data Language: XML

13 Example Transformation
    TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
      profile hints.exec-pfn = "/usr/bin/app3";
      argument = "-p "${pa};
      argument = "-f "${a1};
      argument = "-x -y";
      argument stdout = ${a2};
      profile env.MAXMEM = ${env};
    }
[Diagram: $a1 → t1 → $a2]

14 Example Derivations
    DV d1->t1 (
      env="20000", pa="600",
      a2=@{out:run1.exp15.T1932.summary},
      a1=@{in:run1.exp15.T1932.raw}
    );
    DV d2->t1 (
      a1=@{in:run1.exp16.T1918.raw},
      a2=@{out:run1.exp16.T1918.summary}
    );
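
To see what the two derivations of t1 resolve to, the Python sketch below merges each derivation's actual arguments over the transformation's defaults and expands the result into a command. The helper names `bind` and `expand` are hypothetical, and splitting `"-p "${pa}` into separate argv elements is an assumption; the slides do not specify how Chimera assembles the final command line.

```python
# Formal parameters of t1 from the example slide; "pa" and "env" carry defaults.
T1_REQUIRED = ("a1", "a2")
T1_DEFAULTS = {"pa": "500", "env": "100000"}

def bind(actuals):
    """Merge a derivation's actual arguments over the transformation defaults."""
    missing = [name for name in T1_REQUIRED if name not in actuals]
    if missing:
        raise ValueError(f"unbound arguments: {missing}")
    return {**T1_DEFAULTS, **actuals}

def expand(args):
    """Assumed expansion of t1's body into argv, stdout target, and environment."""
    argv = ["/usr/bin/app3", "-p", args["pa"], "-f", args["a1"], "-x", "-y"]
    return {"argv": argv, "stdout": args["a2"], "env": {"MAXMEM": args["env"]}}

# d1 overrides both defaults; d2 supplies only the two files.
d1 = expand(bind({"env": "20000", "pa": "600",
                  "a1": "run1.exp15.T1932.raw",
                  "a2": "run1.exp15.T1932.summary"}))
d2 = expand(bind({"a1": "run1.exp16.T1918.raw",
                  "a2": "run1.exp16.T1918.summary"}))

print(d1["argv"])
print(d2["argv"], d2["env"])
```

Under these assumptions d2 runs with `-p 500` and `MAXMEM=100000`, because it leaves `pa` and `env` at their declared defaults.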

15 Managing Dependencies
    TR tr1( out a2, in a1 ) {
      profile hints.exec-pfn = "/usr/bin/app1";
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR tr2( out a2, in a1 ) {
      profile hints.exec-pfn = "/usr/bin/app2";
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV x1->tr1( a2=@{out:file2}, a1=@{in:file1});
    DV x2->tr2( a2=@{out:file3}, a1=@{in:file2});
[Diagram: file1 → x1 → file2 → x2 → file3]
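
The dependency between x1 and x2 is never declared; it follows from file2 being an output of one derivation and an input of the other. A rough Python sketch of that inference, using the standard-library `graphlib` module (Python 3.9+), is shown below. The dictionary layout and the function `execution_order` are assumptions for illustration, not Chimera's planner.

```python
from graphlib import TopologicalSorter

# The two derivations from the slide, reduced to their input and output files.
derivations = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}

def execution_order(derivations):
    """Order derivations so that each runs after the producers of its inputs."""
    # Map every output file to the derivation that produces it.
    producer = {f: name for name, d in derivations.items() for f in d["out"]}
    # A derivation depends on the producers of its inputs; files with no
    # producer (such as file1) are treated as already existing.
    graph = {
        name: {producer[f] for f in d["in"] if f in producer}
        for name, d in derivations.items()
    }
    return list(TopologicalSorter(graph).static_order())

print(execution_order(derivations))  # prints ['x1', 'x2']
```

`TopologicalSorter` raises `graphlib.CycleError` if the file relationships form a cycle, which is a reasonable failure mode for a recipe that can never be executed.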

16 Initial “Strawman” Architecture (Use of GriPhyN Virtual Data Toolkit)
[Diagram labels: VDLx, abstract planner, DAX, concrete planner, DAGMan]

18 Chimera Application: Sloan Digital Sky Survey Analysis
Joint work with Jim Annis, Steve Kent, FNAL
• Question: Size distribution of galaxy clusters?
• Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
[Figure: Galaxy cluster size distribution]

19 Cluster-finding Data Pipeline
[Diagram: pipeline with stages numbered 1 to 5 over the data products tsObj, field, brg, core, cluster, and catalog]

20 Cluster-Finding Pipeline Execution

21 Small SDSS Cluster-Finding DAG

22 And Even Bigger: 744 Files, 387 Nodes
[Figure: DAG with groups labeled 108, 168, 60, and 50]

24 Virtual Data Usage Model
• Transformation designers create programmatic abstractions
  – Simple or compound; augment with metadata
• Production managers create bulk derivations
  – Can materialize data products or leave virtual
• Users track their work through derivations
  – Augment (replace?) the scientist’s log book
• Definitions can be augmented with metadata
  – The key to intelligent data retrieval
  – Issues relating to metadata propagation
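
The choice between materializing a data product and leaving it virtual can be pictured as lazy evaluation over recipes. The Python sketch below is a toy model under stated assumptions: the in-memory `store` stands in for a replica catalog, `recipes` stands in for derivations, and all names and data are hypothetical.

```python
def materialize(target, recipes, store):
    """Return `target`, deriving it and any missing inputs only when absent."""
    if target in store:  # already materialized: reuse it
        return store[target]
    if target not in recipes:
        raise KeyError(f"no data and no recipe for {target!r}")
    inputs, func = recipes[target]
    values = [materialize(name, recipes, store) for name in inputs]
    store[target] = func(*values)  # virtual product becomes materialized
    return store[target]

# Hypothetical catalog: one materialized product and two virtual ones.
store = {"raw": [1, 2, 3]}
recipes = {
    "doubled": (["raw"], lambda xs: [2 * x for x in xs]),
    "total": (["doubled"], sum),
}

print(materialize("total", recipes, store))  # prints 12
print(sorted(store))  # prints ['doubled', 'raw', 'total']
```

Requesting `total` materializes `doubled` along the way; a second request finds both in the store and runs nothing.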

25 Virtual Data Research Issues
• Representation
  – Metadata: how is it created, stored, propagated?
  – What knowledge must be represented? How?
  – Capturing notions of data approximation
  – Higher-order knowledge: virtual transformations
• VDC as a community resource
  – Automating data capture
  – Access control and privacy issues
  – Quality control
• Data derivation
  – Query estimation and request planning

26 Virtual Data Research Issues
• “Engineering” issues
  – Dynamic (runtime-computed) dependencies
  – Large dependent sets
  – Extensions to other data models: relational, OO
  – Virtual data browsers
  – XML vs. relational databases & query languages
• Additional usage modalities
  – E.g., meta-analyses, automated experiment generation, “active notebooks”
• Virtual data browsers, editors

27 Status of Chimera R&D
• Early virtual data system demonstrated Nov ’01: HEP collision simulations
• Larger scale problems addressed recently: “cluster finding” in SDSS
• First public release in June: Chimera v1.0
• Enhancements planned throughout the summer
• Physics & astronomy applications by SC’02
• Future R&D focus #1: request planning
• Future R&D focus #2: knowledge representation
• Future apps: bioinformatics, earth sciences

28 Related Work
• Data provenance
  – Materialized views, lineage: Cui, Widom
  – Data provenance tracking: Buneman et al.
• Capturing transformations
  – ZOO system and conceptual schema
• Data Grid technologies
  – GriPhyN, Globus Project, EU DataGrid

29 Summary
• Concept: Tools to support management of transformations and derivations as community resources
• Technology: Chimera virtual data system including virtual data catalog and virtual data language; use of GriPhyN virtual data toolkit for automated data derivation
• Results: Successful early applications to CMS and SDSS data generation/analysis
• Future: Public release of prototype, new apps, knowledge representation, planning

30 For More Information
• GriPhyN project (NSF ITR funded)
  – www.griphyn.org
• Chimera virtual data system
  – www.griphyn.org/chimera
  – “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
  – “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002

