GriPhyN & iVDGL Architectural Issues
GGF5 BOF: Data Intensive Applications – Common Architectural Issues and Drivers
Edinburgh, 23 July 2002
Mike Wilde, Argonne National Laboratory
Grid Physics Network / International Virtual Data Grid Laboratory

Project Summary
• Principal requirements
  – IT research: virtual data and transparent execution
  – Grid building: deploy an international grid lab at scale
• Components developed/used
  – Virtual Data Toolkit; Linux deployment platform
  – Virtual Data Catalog, request planner and executor, DAGMan, NeST
• Scale of current testbeds
  – ATLAS Test Grid: 8 sites
  – CMS Test Grid: 5 sites
  – Compute nodes: UW, UofC, UWM, UTB, ANL
  – >50 researchers and grid-builders working on IT research challenge problems and demos
• Future directions (2002 & 2003)
  – Extensive work on virtual data, planning and catalog architecture, and fault tolerance

Chimera Overview
• Concept: tools to support management of transformations and derivations as community resources
• Technology: the Chimera virtual data system, including the virtual data catalog and virtual data language; use of the GriPhyN Virtual Data Toolkit for automated data derivation
• Results: successful early applications to CMS and SDSS data generation/analysis
• Future: public release of the prototype, new applications, knowledge representation, planning

“Chimera” Virtual Data Model
• Transformation designers create programmatic abstractions
  – Simple or compound; augment with metadata
• Production managers create bulk derivations
  – Can materialize data products or leave them virtual
• Users track their work through derivations
  – Augment (replace?) the scientist’s log book
• Definitions can be augmented with metadata
  – The key to intelligent data retrieval
  – Issues relating to metadata propagation
(A data-model sketch follows below.)
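
As a rough illustration of the model above, the following Python sketch shows hypothetical record types for transformations and derivations, and how a derivation can either be materialized or left virtual. All class and field names are assumptions chosen for illustration; this is not the actual Chimera catalog schema.

    # Minimal sketch of a virtual-data-catalog data model (hypothetical names,
    # not the actual Chimera schema).
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Transformation:
        """A programmatic abstraction: an executable plus formal parameters."""
        name: str
        executable: str
        inputs: List[str]          # formal input parameter names
        outputs: List[str]         # formal output parameter names
        metadata: Dict[str, str] = field(default_factory=dict)

    @dataclass
    class Derivation:
        """An invocation of a transformation with actual logical file names."""
        transformation: Transformation
        bindings: Dict[str, str]   # formal parameter -> logical file name
        materialized: bool = False # False => product exists only "virtually"

        def output_files(self) -> List[str]:
            return [self.bindings[p] for p in self.transformation.outputs]

    # Example: record a bulk derivation and leave its product virtual.
    cmkin = Transformation("cmkin", "/usr/local/demo/binaries/kine_make_ntpl.exe",
                           inputs=["param_file"], outputs=["ntpl_file"])
    d = Derivation(cmkin, {"param_file": "lfn:cmkin_params_001",
                           "ntpl_file": "lfn:ntpl_001"})
    print(d.output_files())   # ['lfn:ntpl_001'] -- derivable on demand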

CMS Pipeline in VDL-0
(Pipeline stages shown in the figure: pythia_input → pythia.exe → cmsim_input → cmsim.exe → writeHits → writeDigis)

    begin v /usr/local/demo/scripts/cmkin_input.csh
      file i ntpl_file_path
      file i template_file
      file i num_events
      stdout cmkin_param_file
    end

    begin v /usr/local/demo/binaries/kine_make_ntpl_pyt_cms121.exe
      pre cms_env_var
      stdin cmkin_param_file
      stdout cmkin_log
      file o ntpl_file
    end

    begin v /usr/local/demo/scripts/cmsim_input.csh
      file i ntpl_file
      file i fz_file_path
      file i hbook_file_path
      file i num_trigs
      stdout cmsim_param_file
    end

    begin v /usr/local/demo/binaries/cms121.exe
      condor copy_to_spool=false
      condor getenv=true
      stdin cmsim_param_file
      stdout cmsim_log
      file o fz_file
      file o hbook_file
    end

    begin v /usr/local/demo/binaries/writeHits.sh
      condor getenv=true
      pre orca_hits
      file i fz_file
      file i detinput
      file i condor_writeHits_log
      file i oo_fd_boot
      file i datasetname
      stdout writeHits_log
      file o hits_db
    end

    begin v /usr/local/demo/binaries/writeDigis.sh
      pre orca_digis
      file i hits_db
      file i oo_fd_boot
      file i carf_input_dataset_name
      file i carf_output_dataset_name
      file i carf_input_owner
      file i carf_output_owner
      file i condor_writeDigis_log
      stdout writeDigis_log
      file o digis_db
    end

Data Dependencies – VDL-1

    TR tr1( out a2, in a1 ) {
      profile hints.exec-pfn = "/usr/bin/app1";
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR tr2( out a2, in a1 ) {
      profile hints.exec-pfn = "/usr/bin/app2";
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV x1->tr1(
    DV x2->tr2(

(Figure: a small dependency chain linking derivations x1 and x2 through logical files file1, file2, and file3.)
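
The point of this slide is that dependencies are never stated explicitly: they are inferred by matching one derivation's output files against another's inputs. A minimal Python sketch of that inference, using made-up derivation records rather than parsed VDL, might look like this:

    # Sketch: infer a dependency DAG from derivation input/output files.
    # Derivation records here are hypothetical, not parsed from real VDL.
    derivations = {
        "x1": {"inputs": ["file1"], "outputs": ["file2"]},
        "x2": {"inputs": ["file2"], "outputs": ["file3"]},
    }

    def infer_edges(derivs):
        """Return (producer, consumer) pairs: consumer reads a file the producer writes."""
        producer_of = {f: name for name, d in derivs.items() for f in d["outputs"]}
        edges = []
        for name, d in derivs.items():
            for f in d["inputs"]:
                if f in producer_of:
                    edges.append((producer_of[f], name))
        return edges

    print(infer_edges(derivations))   # [('x1', 'x2')] -- x2 depends on x1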

Executor Example: Condor DAGMan
• Directed Acyclic Graph Manager
• Specify the dependencies between Condor jobs using a DAG data structure
• Manage dependencies automatically
  – e.g., “Don’t run job B until job A has completed successfully.”
• Each job is a “node” in the DAG
• Any number of parent or child nodes
• No loops
(Figure: Job A with children Job B and Job C, both parents of Job D.)
Slide courtesy Miron Livny, U. Wisconsin
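
To make the “manage dependencies automatically” bullet concrete, here is a small Python sketch (not DAGMan itself) that releases each job only after all of its parents have finished, using the diamond-shaped DAG from the figure:

    # Sketch of dependency-ordered execution for the A -> {B, C} -> D diamond.
    # This imitates the scheduling rule DAGMan enforces; it is not DAGMan.
    from collections import deque

    parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

    def run(job):
        print(f"running job {job}")   # stand-in for submitting a Condor job

    def execute_dag(parents):
        remaining = {j: set(p) for j, p in parents.items()}
        ready = deque(j for j, p in remaining.items() if not p)
        done = set()
        while ready:
            job = ready.popleft()
            run(job)
            done.add(job)
            for j, p in remaining.items():
                if j not in done and j not in ready and p <= done:
                    ready.append(j)
        if len(done) != len(parents):
            raise RuntimeError("cycle detected: DAGs must have no loops")

    execute_dag(parents)   # runs A, then B and C, then D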

Chimera Application: Sloan Digital Sky Survey Analysis
Joint work with Jim Annis and Steve Kent, FNAL
Question: what is the size distribution of galaxy clusters?
(Figure: the question flows through the Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs) to produce the galaxy cluster size distribution.)

Cluster-finding Data Pipeline
(Figure: numbered pipeline stages over the tsObj, field, brg, core, cluster, and catalog data products.)

Small SDSS Cluster-Finding DAG

And Even Bigger: 744 Files, 387 Nodes

Vision: Distributed Virtual Data Service
(Figure: applications at local sites, regional centers, and Tier 1 centers all served by VDC nodes forming a distributed virtual data service.)

Knowledge Management – Strawman Architecture
• Knowledge-based requests are formulated in terms of science data
  – E.g., “Give me a specific transform of channels c, p, & t over time range t0-t1”
• Finder finds the data files
  – Translates the range “t0-t1” into a set of files
• Coder creates an execution plan and defines derivations from known transformations
  – Can deal with missing files (e.g., file c in the LIGO example)
• The knowledge request is answered in terms of datasets
• Coder translates datasets into logical files (or objects, queries, tables, …)
• Planner translates logical entities into physical entities
(A sketch of this Finder → Coder → Planner flow follows below.)
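
As a rough illustration of the strawman flow (all function and catalog names below are assumptions, not part of any released GriPhyN interface), a request could pass through finder, coder, and planner stages like this:

    # Sketch of the strawman Finder -> Coder -> Planner flow.
    # All names are illustrative; this is not a released GriPhyN interface.

    def finder(channels, t0, t1):
        """Translate a science-level time range into logical file names."""
        return [f"lfn:{ch}-{t}" for ch in channels for t in range(t0, t1)]

    def coder(logical_files, transform):
        """Define derivations (one per missing file) and an abstract plan."""
        missing = [lfn for lfn in logical_files if not in_replica_catalog(lfn)]
        derivations = [{"transform": transform, "output": lfn} for lfn in missing]
        return {"inputs": logical_files, "derivations": derivations}

    def planner(abstract_plan, sites):
        """Map logical entities onto physical files and execution sites."""
        return [{"derivation": d, "site": sites[0]} for d in abstract_plan["derivations"]]

    def in_replica_catalog(lfn):
        return lfn.endswith("-0")        # stand-in for a replica catalog lookup

    plan = coder(finder(["c", "p", "t"], 0, 2), transform="bandpass")
    print(planner(plan, sites=["site-A", "site-B"]))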

GriPhyN/PPDG Data Grid Architecture
(Figure: layered architecture. An Application submits an abstract DAG to the Planner; the Planner emits a concrete DAG to the Executor (DAGMan, Kangaroo), which drives Compute Resources (GRAM) and Storage Resources (GridFTP; GRAM; SRM). Supporting services shown: Catalog Services (MCAT; GriPhyN catalogs), Info Services (MDS), Policy/Security (GSI, CAS), Monitoring (MDS), Replica Management (GDMP), and a Reliable Transfer Service, with Globus as the underlying toolkit.)

Common Problem #1: (Evolving) View of the Data Grid Stack
(Figure: layered data grid stack; components shown include the following. A composition sketch follows below.)
• Data Transport (GridFTP)
• Storage Element
• Local Replica Catalog (flat or hierarchical)
• Reliable File Transfer
• Replica Location Service
• Publish-Subscribe Service (GDMP)
• Storage Element Manager
• Reliable Replication
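
One way to read the stack is that higher-level services compose the ones below them. A minimal Python sketch under that assumption (function names are placeholders, not the real RLS/RFT/GridFTP APIs) shows reliable replication built from a replica lookup, a reliable transfer, and a catalog registration:

    # Sketch: reliable replication composed from lower stack layers.
    # Function names are placeholders, not the real RLS/RFT/GridFTP APIs.

    def replica_location_lookup(lfn):
        """Ask a Replica Location Service for physical replicas of a logical file."""
        return ["gsiftp://site-a.example.org/data/" + lfn]

    def transport(source_url, dest_url):
        print(f"copy {source_url} -> {dest_url}")   # stand-in for the data transport layer
        return True

    def reliable_file_transfer(source_url, dest_url):
        """Retry the underlying transport until it succeeds (bounded here)."""
        for attempt in range(3):
            if transport(source_url, dest_url):
                return True
        return False

    def register_replica(lfn, dest_url):
        print(f"register {lfn} at {dest_url}")      # stand-in for a catalog update

    def reliable_replicate(lfn, dest_site):
        source = replica_location_lookup(lfn)[0]
        dest = f"gsiftp://{dest_site}/data/{lfn}"
        if reliable_file_transfer(source, dest):
            register_replica(lfn, dest)

    reliable_replicate("run42/hits.db", "site-b.example.org")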

Architectural Complexities

Common Problem #2: Request Planning
• Map of grid resources
• Incoming work to plan
  – A queue? With lookahead?
• Status of grid resources
  – State (up/down)
  – Load (current, queued, and anticipated)
  – Reservations
• Policy
  – Allocation (commitment of a resource to a VO or group, based on policy)
• Ability to change decisions dynamically
(A data-structure sketch of these planner inputs follows below.)
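
A minimal sketch of how these planner inputs could be represented; the field names are assumptions for illustration, not a GriPhyN schema:

    # Sketch: a possible shape for the planner's inputs (illustrative only).
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ResourceStatus:
        site: str
        up: bool
        running_jobs: int
        queued_jobs: int
        reservations: List[str] = field(default_factory=list)

    @dataclass
    class AllocationPolicy:
        site: str
        shares: Dict[str, float]      # VO or group -> fraction of the site

    @dataclass
    class PlannerInputs:
        resource_map: List[str]       # known sites
        work_queue: List[dict]        # abstract DAGs awaiting planning
        status: Dict[str, ResourceStatus]
        policy: Dict[str, AllocationPolicy]

    inputs = PlannerInputs(
        resource_map=["caltech", "bnl"],
        work_queue=[{"dag": "cms-prod-001"}],
        status={"caltech": ResourceStatus("caltech", True, 40, 12)},
        policy={"caltech": AllocationPolicy("caltech", {"CMS": 0.8, "other": 0.2})},
    )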

Policy
• Focus is on resource allocation (not on security)
• Allocation examples:
  – “CMS should get 80% of the resources at Caltech” (averaged monthly)
  – “The Higgs group has high priority at BNL until 8/1”
• Need to apply fair-share scheduling to the grid
• Need to understand the allocation models dictated by funders and data centers
(A fair-share sketch follows below.)
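
To illustrate what a rule like “CMS gets 80% at Caltech” might mean operationally, the following generic fair-share sketch (not the policy engine the projects actually built) compares each group's recent usage against its target share and favors the most under-served group:

    # Sketch: pick the most under-served group under target shares.
    # Generic fair-share logic, not a GriPhyN/PPDG policy engine.

    target_shares = {"CMS": 0.8, "other": 0.2}        # e.g. Caltech, averaged monthly
    recent_usage = {"CMS": 700.0, "other": 300.0}     # CPU-hours this month

    def next_group_to_schedule(targets, usage):
        total = sum(usage.values()) or 1.0
        # deficit > 0 means the group has received less than its target share
        deficit = {g: targets[g] - usage.get(g, 0.0) / total for g in targets}
        return max(deficit, key=deficit.get)

    print(next_group_to_schedule(target_shares, recent_usage))   # 'CMS'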

Grids as overlays on shared resources

Grid Scheduling Problem
• Given an abstract DAG representing logical work:
  – Where should each compute job be executed?
    > What does site and VO policy say?
    > What does grid “weather” dictate?
  – Where is the required data now?
  – Where should data results be sent?
• Stop and re-schedule computations?
• Suspend or de-prioritize work in progress to let higher-priority work go through?
• Degree of policy control?
• Is a “grid” an entity? An “aggregator” of resources?
• How is data placement coordinated with planning?
• Use of an execution profiler in the planner architecture:
  – Characterize the resource needs of an application over time
  – Parameterize the resource requirements of an application by its parameters
• What happens when things go wrong?
(A site-selection sketch follows below.)
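
The core question above, where each job should run, combines policy, grid weather, and data location. The following Python sketch scores candidate sites with a hypothetical weighting; the weights and fields are inventions chosen only to illustrate the trade-off:

    # Sketch: rank candidate sites for one job by policy, load, and data locality.
    # Weights and fields are hypothetical, chosen only to illustrate the trade-off.

    sites = {
        "caltech": {"up": True,  "load": 0.6, "policy_share": 0.8, "has_input_data": True},
        "bnl":     {"up": True,  "load": 0.2, "policy_share": 0.1, "has_input_data": False},
        "ufl":     {"up": False, "load": 0.0, "policy_share": 0.1, "has_input_data": True},
    }

    def score(site):
        if not site["up"]:
            return float("-inf")                 # grid weather: site is down
        locality = 1.0 if site["has_input_data"] else 0.0
        return 2.0 * site["policy_share"] + 1.0 * locality - 1.0 * site["load"]

    best = max(sites, key=lambda name: score(sites[name]))
    print(best)    # 'caltech' under these made-up numbers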

Policy and the Planner
• The planner considers:
  – Policy (fairly static, from CAS/SAS)
  – Grid status
  – Job (user/group) resource consumption history
  – Job profiles (resources over time) from Prophesy

Open Issues – Planner (1)
• Does the planner have a queue? If so, how does it manage that queue?
• How many planners are there? Is the planner a service?
• How is responsibility partitioned between the planner and the executor (cluster scheduler)?
• How many other entities need to be coordinated?
  – RFT, DAPman, SRM, NeST, …?
  – How to wait on reliable file transfers?
• How does the planner estimate times if it has only partial responsibility for when and where things run?
• How is data placement planning coordinated with request planning?

Open Issues – Planner (2)
• Incremental planning is clearly needed (e.g., for analysis)
• Stop and re-schedule computations?
• Suspend or de-prioritize work in progress to let higher-priority work go through?
• Degree of policy control?
• Is the “grid” an entity?
• Use of an execution profiler in the planner architecture:
  – Characterize the resource requirements of an application over time
  – Parameterize the resource requirements of an application w.r.t. its (salient) parameters
• What happens when things go wrong?

Issue Summary
• Consolidate the data grid stack
  – Reliable file transfer
  – Reliable replication
  – Replica catalog and virtual data catalog scaled for global use
• Define interfaces and locations of planners
• Unify job workflow representation around DAGs
• Define how to state and manage policy
• Strategies for fault tolerance: similar to re-planning for weather and policy changes?
• Evolution of services to OGSA