Virtual Data and the Chimera System* Ian Foster Mathematics and Computer Science Division Argonne National Laboratory and Department of Computer Science.

Presentation transcript:

Virtual Data and the Chimera System*
Ian Foster
Mathematics and Computer Science Division, Argonne National Laboratory
and Department of Computer Science, The University of Chicago
*Joint work with Jens Vöckler, Mike Wilde, Yong Zhao
HPC 2002 Conference, Cetraro, June 26, 2002

2 ARGONNE  CHICAGO Overview
• Problem
  – Managing programs and computations as community resources
• Technology
  – Chimera virtual data system
• Applications
  – Virtual Data ≠ Virtual Concept!
• Futures
  – Research challenges & plans


4 Programs as Community Resources: Data Derivation and Provenance
• Most [scientific] data are not simple “measurements”; essentially all are:
  – Computationally corrected/reconstructed
  – And/or produced by numerical simulation
• And thus, as data and computers become ever larger and more expensive:
  – Programs are significant community resources
  – So are the executions of those programs
• Management of the transformations that map between datasets is an important problem

5 Motivations (1)
[Diagram: Transformation, Derivation, and Data, with relationship labels created-by, execution-of, and consumed-by/generated-by]
“I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
“I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
“I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
“I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”

6 Motivations (2)
• Data trackability and result auditability
  – Universally sought by GriPhyN applications
• Repair and correction of data
  – Rebuild data products (cf. “make”)
• Workflow management
  – A new, structured paradigm for organizing, locating, specifying, and requesting data products
• Performance optimizations
  – Ability to re-create data rather than move it
• And others, some we haven’t thought of
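The “make”-style repair idea can be pictured as a downstream walk over recorded derivations: starting from a file known to be bad, collect everything computed from it. The following is a minimal Python sketch under stated assumptions, not Chimera code; the record layout, the derivation names, and the file names are all hypothetical.

```python
from collections import deque

# Hypothetical provenance records: derivation name -> (input files, output files).
DERIVATIONS = {
    "calibrate": (["raw.dat"], ["calibrated.dat"]),
    "reconstruct": (["calibrated.dat"], ["tracks.dat"]),
    "histogram": (["tracks.dat"], ["summary.dat"]),
    "unrelated": (["other.dat"], ["other.out"]),
}


def stale_products(derivations, bad_file):
    """Return every file derived, directly or indirectly, from bad_file."""
    stale = set()
    queue = deque([bad_file])
    while queue:
        current = queue.popleft()
        for inputs, outputs in derivations.values():
            if current in inputs:
                for out in outputs:
                    if out not in stale:
                        stale.add(out)
                        queue.append(out)
    return stale


print(sorted(stale_products(DERIVATIONS, "raw.dat")))
```

The walk only identifies what needs rebuilding; deciding when and where to recompute it is a separate planning question.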


8 Chimera Virtual Data System
• Virtual data catalog
  – Transformations, derivations, data
• Virtual data language
  – VDC definition and query
• Applications include browsers and data analysis applications
[Diagram: GriPhyN VDT: Replica catalog, DAGMan, Globus Toolkit, etc.]

9 Transformations and Derivations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition" in C
• Derivation
  – Formal invocation of a Transformation
  – Similar to "function call" in C
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation (future)
  – Record of each Derivation (re)execution
  – Similar to strace (BSD) or truss (SysV)
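The function-definition / function-call analogy can be made concrete with a small data model. This is an illustrative Python sketch, not the Chimera API: the class names, fields, file names, and host name are assumptions chosen for the example. A derivation with no recorded invocation is a recipe ("future"); once an invocation is recorded it also serves as a record ("past").

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    executable: str
    formal_args: tuple  # names of the formal parameters


@dataclass
class Derivation:
    """Binding of actual arguments to a transformation (like a function call)."""
    name: str
    transformation: Transformation
    actual_args: dict
    invocations: list = field(default_factory=list)  # one entry per (re)execution

    def is_virtual(self):
        # No recorded invocation: the derivation is still only a recipe.
        return not self.invocations

    def record_invocation(self, exit_code, host):
        self.invocations.append({"exit_code": exit_code, "host": host})


tr = Transformation("t1", "/usr/bin/app3", ("a1", "a2"))
dv = Derivation("d1", tr, {"a1": "input.raw", "a2": "output.summary"})
print(dv.is_virtual())
dv.record_invocation(exit_code=0, host="node01.example.org")
print(dv.is_virtual())
```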

10 Virtual Data Catalog Structure

11 Virtual Data Tools
• Virtual Data API
  – A Java class hierarchy to represent transformations and derivations
• Virtual Data Language
  – Textual for people & illustrative examples
  – XML for machine-to-machine interfaces
• Virtual Data Database
  – Makes the objects of a virtual data definition persistent
• Virtual Data Service
  – Provides a service interface (e.g., OGSA) to persistent objects

12 Virtual Data Language: XML

13 Example Transformation
TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  profile hints.exec-pfn = "/usr/bin/app3";
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}
[Diagram: $a1 → t1 → $a2]

14 Example Derivations
DV d1->t1 ( env="20000", pa="600", );
DV d2->t1 ( );
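The two derivations differ in whether they override the defaults declared by TR t1 (pa = "500", env = "100000"). A minimal Python sketch of that binding step follows. It is an illustration, not how Chimera renders commands: the file names bound to a1 and a2 are placeholders, since the transcript shows only env and pa, and splitting "-p "${pa} into two argv elements is a simplification.

```python
# Defaults taken from the formal parameters of TR t1.
T1_DEFAULTS = {"pa": "500", "env": "100000"}
T1_EXECUTABLE = "/usr/bin/app3"


def bind(defaults, actuals):
    """Actual arguments from a derivation override transformation defaults."""
    bound = dict(defaults)
    bound.update(actuals)
    return bound


def render_t1(actuals):
    """Render t1 as (argv, stdout file, environment) for one derivation."""
    args = bind(T1_DEFAULTS, actuals)
    argv = [T1_EXECUTABLE, "-p", args["pa"], "-f", args["a1"], "-x", "-y"]
    return argv, args["a2"], {"MAXMEM": args["env"]}


# d1 overrides both defaults; the file names are placeholders.
d1 = {"env": "20000", "pa": "600", "a1": "in1.dat", "a2": "out1.dat"}
# d2 relies entirely on the defaults.
d2 = {"a1": "in2.dat", "a2": "out2.dat"}

for name, actuals in (("d1", d1), ("d2", d2)):
    argv, stdout, env = render_t1(actuals)
    print(name, " ".join(argv), ">", stdout, env)
```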

15 Managing Dependencies
TR tr1( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(
DV x2->tr2(
[Diagram: files file1, file2, file3 and derivations x1, x2]
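Dependencies between derivations need not be declared explicitly; they can be inferred by matching one derivation's output files to another's inputs. The Python sketch below shows that inference under an assumption: the derivation arguments are cut off in this transcript, so the bindings used here (x1 reads file1 and writes file2, x2 reads file2 and writes file3) are a guess at the intended chain, not a quotation.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Assumed bindings (not shown in the transcript):
# x1 reads file1 and writes file2; x2 reads file2 and writes file3.
DERIVATIONS = {
    "x2": {"in": ["file2"], "out": ["file3"]},
    "x1": {"in": ["file1"], "out": ["file2"]},
}


def dependency_graph(derivations):
    """Map each derivation to the set of derivations whose outputs it consumes."""
    producer = {f: name for name, d in derivations.items() for f in d["out"]}
    return {
        name: {producer[f] for f in d["in"] if f in producer}
        for name, d in derivations.items()
    }


graph = dependency_graph(DERIVATIONS)
# static_order() yields each derivation after everything it depends on.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because file1 has no producer in the catalog, it is treated as an existing input rather than as a dependency.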

16 Initial “Strawman” Architecture (Use of GriPhyN Virtual Data Toolkit)
[Diagram: VDLx, abstract planner, DAX, concrete planner, DAGMan]
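One way to picture the planning step is a backward walk from a requested file to the derivations needed to produce it, stopping at files that already exist. The Python sketch below is only that picture: it is not a description of how Chimera's abstract and concrete planners divide the work, the `abstract_plan` function and its record layout are hypothetical, and it assumes the dependency graph is acyclic.

```python
def abstract_plan(derivations, target, materialized):
    """Return the derivations to run, in order, so that target exists.

    Assumes the derivation graph has no cycles.
    """
    producer = {f: name for name, d in derivations.items() for f in d["out"]}
    plan = []

    def visit(filename):
        if filename in materialized:
            return  # already exists: nothing to derive
        if filename not in producer:
            raise LookupError(f"no derivation produces {filename}")
        name = producer[filename]
        if name in plan:
            return  # already scheduled
        for needed in derivations[name]["in"]:
            visit(needed)
        plan.append(name)

    visit(target)
    return plan


# Hypothetical catalog contents for illustration.
CATALOG = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}

print(abstract_plan(CATALOG, "file3", materialized={"file1"}))
print(abstract_plan(CATALOG, "file3", materialized={"file1", "file2"}))
```

In the second call the intermediate file already exists, so only the last derivation is scheduled; this is the sense in which a data product can be left virtual until it is requested.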


18 Chimera Application: Sloan Digital Sky Survey Analysis
Joint work with Jim Annis, Steve Kent, FNAL
Size distribution of galaxy clusters?
Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
[Plot: Galaxy cluster size distribution]

19 Cluster-finding Data Pipeline
[Diagram: pipeline steps 1 to 5 over the data products tsObj, field, brg, core, cluster, and catalog]

20 Cluster-Finding Pipeline Execution

21 Small SDSS Cluster-Finding DAG

22 And Even Bigger: 744 Files, 387 Nodes


24 Virtual Data Usage Model
• Transformation designers create programmatic abstractions
  – Simple or compound; augment with metadata
• Production managers create bulk derivations
  – Can materialize data products or leave virtual
• Users track their work through derivations
  – Augment (replace?) the scientist’s log book
• Definitions can be augmented with metadata
  – The key to intelligent data retrieval
  – Issues relating to metadata propagation

25 Virtual Data Research Issues
• Representation
  – Metadata: how is it created, stored, propagated?
  – What knowledge must be represented? How?
  – Capturing notions of data approximation
  – Higher-order knowledge: virtual transformations
• VDC as a community resource
  – Automating data capture
  – Access control and privacy issues
  – Quality control
• Data derivation
  – Query estimation and request planning

26 Virtual Data Research Issues
• “Engineering” issues
  – Dynamic (runtime-computed) dependencies
  – Large dependent sets
  – Extensions to other data models: relational, OO
  – Virtual data browsers
  – XML vs. relational databases & query languages
• Additional usage modalities
  – E.g., meta-analyses, automated experiment generation, “active notebooks”
• Virtual data browsers, editors

27 Status of Chimera R&D
• Early virtual data system demonstrated Nov ’01: HEP collision simulations
• Larger scale problems addressed recently: “cluster finding” in SDSS
• First public release in June: Chimera v1.0
• Enhancements planned throughout the summer
• Physics & astronomy applications by SC’02
• Future R&D focus #1: request planning
• Future R&D focus #2: knowledge representation
• Future apps: bioinformatics, earth sciences

28 Related Work
• Data provenance
  – Materialized views, lineage: Cui, Widom
  – Data provenance tracking: Buneman et al.
• Capturing transformations
  – ZOO system and conceptual schema
• Data Grid technologies
  – GriPhyN, Globus Project, EU DataGrid

29 Summary
• Concept: Tools to support management of transformations and derivations as community resources
• Technology: Chimera virtual data system including virtual data catalog and virtual data language; use of GriPhyN virtual data toolkit for automated data derivation
• Results: Successful early applications to CMS and SDSS data generation/analysis
• Future: Public release of prototype, new apps, knowledge representation, planning

30 For More Information
• GriPhyN project (NSF ITR funded)
• Chimera virtual data system
  – “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
  – “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002