GriPhyN Virtual Data System
Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division
LISHEP 2004, UERJ, Rio de Janeiro, 13 Feb 2004

A Large Team Effort!
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde, and Yong Zhao. The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi. The applications described are the work of many people, including James Annis, Rick Cavanaugh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams.

Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation. The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.

Tutorial Objectives
- Provide a detailed introduction to existing services for virtual data management in grids
- Provide descriptions and interactive demonstrations of:
  - the Chimera system for managing virtual data products
  - the Pegasus system for planning and execution in grids
- Intended for those interested in creating and running huge workflows on the grid

Tutorial Outline
- Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
- The Chimera system (25 minutes)
- The Pegasus system (25 minutes)
- Summary (5 minutes)

GriPhyN – Grid Physics Network Mission
Enhance scientific productivity through:
- Discovery and application of datasets
- Enabling use of a worldwide data grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow "recipes" and recording their provenance.

Virtual Data System Approach
Producing data from transformations with uniform, precise data interface descriptions enables…
- Discovery: finding and understanding datasets and transformations
- Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  - Forming new workflows
  - Building new workflows from existing patterns
  - Managing change
- Planning: automated to make the Grid transparent
- Audit: explanation and validation via provenance

Virtual Data Scenario
(Diagram: a five-step workflow in which simulate produces file1 and file2, reformat and conv derive file3-file6 from them, and psearch and summarize produce file8 and file7.)
- On-demand data generation
- Update workflow following changes
- Manage workflow
- Explain provenance, e.g. for file8:
  psearch -t 10 -i file3 file4 file5 -o file8
  summarize -t 10 -i file6 -o file7
  reformat -f fz -i file2 -o file3 file4 file5
  conv -l esd -o aod -i file2 -o file6
  simulate -t 10 -o file1 file2

The Grid
- Emerging computational, networking, and storage infrastructure
  - Pervasive, uniform, and reliable access to remote data, computational, sensor, and human resources
- Enables new approaches to applications and problem solving
  - Remote resources the rule, not the exception
- Challenges
  - Heterogeneous components
  - Component failures common
  - Different administrative domains
  - Local policies for security and resource usage

Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.

Grid3 – Cumulative CPU Days to ~25 Nov 2003

Grid2003: ~100 TB data processed to ~25 Nov 2003

Requirements for Virtual Data Management
- Terabytes or petabytes of data
  - Often read-only data, "published" by experiments
  - Other systems need to maintain data consistency
- Large data storage and computational resources shared by researchers around the world
  - Distinct administrative domains
  - Respect local and global policies governing how resources may be used
- Access raw experimental data
- Run simulations and analysis to create "derived" data products

Requirements for Virtual Data Management (cont.)
- Locate existing data
  - Record and query for the existence of data
- Data access based on metadata
  - High-level attributes of data
- Support high-speed, reliable data movement
  - E.g., for efficient movement of large experimental data sets
- Planning, scheduling, and monitoring execution of data requests and computations
- Management of data replication
  - Register and query for replicas
  - Select the best replica for a data transfer
- Virtual data
  - Desired data may be stored on a storage system ("materialized") or created on demand

Tutorial Content
- The Chimera system for managing virtual data products
  - Virtual data: materialize data on demand
  - Virtual data language, catalog, and interpreter
- The Pegasus system for planning and execution in grids
  - Pegasus is a configurable system that can map and execute complex workflows on grid resources

Tutorial Outline
- Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
- The Chimera system (25 minutes)
- The Pegasus system (25 minutes)
- Summary (5 minutes)

Chimera Virtual Data System

Chimera Virtual Data System Outline
- Virtual data concept and vision
- VDL – the Virtual Data Language
- Simple virtual data examples
- Virtual data applications in High Energy Physics and Astronomy
- Use of virtual data tools

Virtual Data System Capabilities
Producing data from transformations with uniform, precise data interface descriptions enables…
- Discovery: finding and understanding datasets and transformations
- Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  - Forming new workflows
  - Building new workflows from existing patterns
  - Managing change
- Planning: automated to make the Grid transparent
- Audit: explanation and validation via provenance

VDL: Virtual Data Language Describes Data Transformations
- Transformation
  - Abstract template of a program invocation
  - Similar to a "function definition"
- Derivation
  - A "function call" to a Transformation
  - Stores past and future:
    - A record of how data products were generated
    - A recipe of how data products can be generated
- Invocation
  - Record of a Derivation execution

Example Transformation
TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}
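For intuition, a derivation that binds a1 and a2 to logical files (say, file1 and file2, which are only illustrative names, not from the slides) and keeps the defaults pa="500" and env="100000" would be planned, roughly, as the command below, run with MAXMEM=100000 set in its environment:

  t1 -p 500 -f file1 -x -y > file2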

Example Derivations
DV d1->t1( env="20000", pa="600", … );
DV d2->t1( … );
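A complete derivation also binds the file parameters using Chimera's @{in:...}/@{out:...} logical-file notation; a minimal sketch, with file names assumed purely for illustration:

  DV d1->t1( a2=@{out:"file2"}, a1=@{in:"file1"}, env="20000", pa="600" );
  DV d2->t1( a2=@{out:"file4"}, a1=@{in:"file3"} );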

Workflow from File Dependencies
TR tr1( in a1, out a2 ) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2( in a1, out a2 ) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
(Diagram: two derivations, x1 and x2, chain file1 → file2 → file3.)
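A minimal sketch of those two derivations, assuming the straightforward chaining shown in the diagram (the @{in:...}/@{out:...} notation declares the logical files each step consumes and produces):

  DV x1->tr1( a1=@{in:"file1"}, a2=@{out:"file2"} );
  DV x2->tr2( a1=@{in:"file2"}, a2=@{out:"file3"} );

Chimera infers the x1 → x2 ordering from these file declarations alone; no explicit dependency statement is needed.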

Example Invocation
An invocation record captures: completion status and resource usage, attributes of the executable transformation, and attributes of the input and output files.

Example Workflow
- Complex structure
  - Fan-in
  - Fan-out
  - "left" and "right" can run in parallel
- Uses an input file
  - Registered with the RC
- Complex file dependencies
  - Glue the workflow together
(Diagram: preprocess → findrange ("left" and "right") → analyze.)

Workflow step "preprocess"
- TR preprocess turns f.a into f.b1 and f.b2:
TR preprocess( output b[], input a ) {
  argument = "-a top";
  argument = " -i "${input:a};
  argument = " -o " ${output:b};
}
- Makes use of the "list" feature of VDL
  - Generates 0..N output files
  - The number of files depends on the caller

Workflow step "findrange"
- Turns two inputs into one output:
TR findrange( output b, input a1, input a2,
              none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${a1} " " ${a2};
  argument = " -o " ${b};
  argument = " -p " ${p};
}
- Uses the default-argument feature

Can also use list[] parameters
TR findrange( output b, input a[],
              none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${" "|a};
  argument = " -o " ${b};
  argument = " -p " ${p};
}

Workflow step "analyze"
- Combines intermediary results:
TR analyze( output b, input a[] ) {
  argument = "-a bottom";
  argument = " -i " ${a};
  argument = " -o " ${b};
}

Complete VDL workflow
- Generate the appropriate derivations:
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a1=@{in:"f.b1"}, a2=@{in:"f.b2"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a1=@{in:"f.b1"}, a2=@{in:"f.b2"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );

Compound Transformations
- Using compound TRs
  - Permits composition of complex TRs from basic ones
  - Calls are independent
    - unless linked through an LFN
  - A call is effectively an anonymous derivation
    - Late instantiation at workflow generation time
  - Permits bundling of repetitive workflows
  - Model: function calls nested within a function definition

Compound Transformations (cont.)
- TR diamond bundles black-diamonds:
TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}

Compound Transformations (cont.)
- Multiple DVs allow easy generator scripts (see the sketch below):
DV d1->diamond( p2="100", p1="0" );
DV d2->diamond( p2="…", p1="0" );
...
DV d70->diamond( p2="800", p1="18" );
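A minimal sketch of such a generator, here as a plain shell script that prints one derivation per parameter pair into a VDL file (the parameter ranges and output file name are made up for illustration):

  #!/bin/sh
  # Emit one diamond derivation per (p1, p2) combination.
  i=0
  for p2 in 100 200 400 800; do
    for p1 in 0 6 12 18; do
      i=`expr $i + 1`
      echo "DV d$i->diamond( p2=\"$p2\", p1=\"$p1\" );"
    done
  done > diamonds.vdl

Loading the generated file into the virtual data catalog then yields one independent diamond workflow per parameter combination, all from a single compound TR definition.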

Virtual Data Example: Galaxy Cluster Search
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)
(Figures: Sloan data, the cluster-search DAG, and the galaxy cluster size distribution.)

Cluster Search Workflow Graph and Execution Trace
(Charts: the workflow graph and an execution trace of workflow jobs vs. time.)

Virtual Data Application: High Energy Physics Data Analysis
(Diagram: a tree of derived datasets, each defined by parameter sets such as mass = 200, decay = WW / ZZ / bb, stability = 1 or 3, event = 8, plot = 1, LowPt = 20.)
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

Observations
- A provenance approach based on interface definition and data-flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
- Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

Virtual Data Grid Vision

Vision for Provenance in the Large
- Universal knowledge management and production systems
- Vendors integrate the provenance-tracking protocol into data processing products
- Ability to run anywhere "in the Grid"

Functional View of Virtual Data Management
(Diagram: an application consults a Metadata Service for location based on metadata attributes, a Replica Location Service for the location of one or more physical replicas, and Information Services for the state of grid resources, performance measurements, and predictions. A Planner performs data location, replica selection, and selection of compute and storage resources, subject to security and policy. An Executor initiates data transfers and computations, using data movement and data access services over compute and storage resources.)

GriPhyN/PPDG Data Grid Architecture
(Diagram: an Application hands an abstract DAG to the Planner, which produces a concrete DAG for the Executor (DAGMan, Kangaroo). Supporting services include Catalog Services (MCAT; GriPhyN catalogs), Info Services (MDS), Policy/Security (GSI, CAS), Monitoring (MDS), Replica Management (GDMP), and a Reliable Transfer Service; compute resources are reached via GRAM and storage resources via GridFTP, GRAM, and SRM. The stack is built on Globus.)

Executor Example: Condor DAGMan
- Directed Acyclic Graph Manager
- Specify the dependencies between Condor jobs using a DAG data structure
- Manage dependencies automatically
  - (e.g., "Don't run job B until job A has completed successfully.")
- Each job is a "node" in the DAG
- Any number of parent or child nodes
- No loops
(Example DAG: Job A is the parent of Jobs B and C, which are the parents of Job D.)
Slide courtesy of Miron Livny, U. Wisconsin
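As a concrete illustration (not from the original slides; the submit-file names are assumed), this diamond DAG would be described to DAGMan with a small text file:

  # diamond.dag
  JOB A a.submit
  JOB B b.submit
  JOB C c.submit
  JOB D d.submit
  PARENT A CHILD B C
  PARENT B C CHILD D

Running condor_submit_dag diamond.dag starts DAGMan itself as a Condor job, and DAGMan then submits A, B, C, and D in dependency order.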

Executor Example: Condor DAGMan (cont.)
- DAGMan acts as a "meta-scheduler"
  - Holds and submits jobs to the Condor queue at the appropriate times, based on DAG dependencies
- If a job fails, DAGMan continues until it can no longer make progress and then creates a "rescue" file with the current state of the DAG
  - When the failed job is ready to be re-run, the rescue file is used to restore the prior state of the DAG
(Diagram: DAGMan feeding jobs from the DAG into the Condor job queue.)
Slide courtesy of Miron Livny, U. Wisconsin

Virtual Data in CMS
Virtual Data Long Term Vision of CMS: CMS Note 2001/047, GRIPHYN

CMS Data Analysis
(Diagram: per-event data flow from raw data (simulated or real) and calibration data, through a reconstruction algorithm, jet finders 1 and 2, and tags 1 and 2, to reconstructed data produced by physics analysis jobs; per-object sizes range from ~100 b to ~300 KB. Uploaded data, virtual data, and algorithms are distinguished, with virtual data expected to be the dominant use in the future.)

Production Pipeline: GriPhyN-CMS Demo (SC2001 demo version; 1 run = 500 events)
  Stage       Output      Data      CPU
  pythia      truth.ntpl  0.5 MB    2 min
  cmsim       hits.fz     175 MB    8 hours
  writeHits   hits.DB     275 MB    5 min
  writeDigis  digis.DB    105 MB    45 min

Pegasus: Planning for Execution in Grids
- Maps from abstract to concrete workflow
  - Algorithmic and AI-based techniques
- Automatically locates physical locations for both components (transformations and data)
  - Uses the Globus Replica Location Service and the Transformation Catalog
- Finds appropriate resources to execute the jobs
  - Via the Globus Monitoring and Discovery Service
- Reuses existing data products where applicable
- Publishes newly derived data products
  - To the RLS and the Chimera virtual data catalog

Replica Location Service
- Pegasus uses the RLS to find input data
- Pegasus uses the RLS to register new data products
(Diagram: computation sites publish mappings into Local Replica Catalogs (LRCs), which are indexed by a Replica Location Index (RLI).)
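For flavor, the RLS also ships with a command-line client; a rough sketch of registering and then looking up a mapping (host names and file names are hypothetical, and the exact client syntax may vary by RLS version):

  # register a logical-to-physical mapping in a Local Replica Catalog
  globus-rls-cli create f.b1 gsiftp://storage.example.org/data/f.b1 rls://lrc.example.org
  # list the physical replicas known for that logical name
  globus-rls-cli query lrc lfn f.b1 rls://lrc.example.org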

Use of MDS in Pegasus
- MDS provides up-to-date Grid state information
  - Total and idle job queue lengths on a pool of resources (Condor)
  - Total and available memory on the pool
  - Disk space on the pools
  - Number of jobs running on a job manager
- Can be used for resource discovery and selection
  - Developing various task-to-resource mapping heuristics
- Can be used to publish information necessary for replica selection
  - Developing replica selection components

Abstract Workflow Reduction
(Figure: an example DAG of jobs a-i; the legend distinguishes original nodes, input transfer nodes, registration nodes, output transfer nodes, and nodes deleted by the reduction algorithm.)
- The output jobs for the DAG are all the leaf nodes, i.e. f, h, i
- Each job requires 2 input files and generates 2 output files
- The user specifies the output location

Optimizing from the Point of View of Virtual Data
- Jobs d, e, f have output files that have been found in the Replica Location Service
- The jobs that exist only to feed them are therefore deleted as well
- All of jobs a, b, c, d, e, f are removed from the DAG
(Same figure, with nodes a-f marked as deleted by the reduction algorithm.)

Plans for Staging Data In
- The planner picks execution and replica locations
- Transfer nodes are added for the input files of the root nodes
(Same figure, with input transfer nodes attached to the remaining root jobs.)

Staging Data Out and Registering New Derived Products in the RLS
- Staging-out and registration nodes are added for each job that materializes data (g, h, i)
- The output files of the leaf job (f) are transferred to the output location
(Same figure, with output transfer and registration nodes added.)

The Final Executable DAG
(Figure: the input DAG of jobs a-i shown next to the final executable DAG, which retains only jobs g, h, i plus the added input transfer, output transfer, and registration nodes.)

Pegasus Components
- Concrete Planner and Submit-file generator (gencdag)
  - The Concrete Planner of the VDS makes the logical-to-physical mapping of the DAX, taking into account the pool where the jobs are to be executed (execution pool) and the final output location (output pool)
- Java Replica Location Service clients (rls-client and rls-query-client)
  - Used to populate and query the Globus Replica Location Service

Pegasus Components (cont'd)
- XML Pool Config generator (genpoolconfig)
  - The Pool Config generator queries the MDS as well as local pool config files to generate an XML pool config, which is used by Pegasus
  - MDS is preferred for generating the pool configuration, as it provides much richer information about the pool, including queue statistics, available memory, etc.
- The following catalogs are consulted to make the translation:
  - Transformation Catalog (tc.data)
  - Pool Config File
  - Replica Location Services
  - Monitoring and Discovery Services

Transformation Catalog (Demo)
- Consists of a simple text file
  - Contains mappings of logical transformations to physical transformations
- Format of the tc.data file:
  #poolid  logical tr   physical tr               env
  isi      preprocess   /usr/vds/bin/preprocess   VDS_HOME=/usr/vds/;
- All physical transformations are absolute path names
- The environment string contains all the environment variables required for the transformation to run on the execution pool
- A DB-based TC is in the testing phase

Pool Config (Demo)
- Pool Config is an XML file which contains information about the various pools on which DAGs may execute
- Some of the information contained in the Pool Config file:
  - The various job managers that are available on the pool for the different types of Condor universes
  - The GridFTP storage servers associated with each pool
  - The Local Replica Catalogs where data residing in the pool has to be cataloged
  - Profiles, such as environment hints, which are common site-wide
  - The working and storage directories to be used on the pool

Pool Config
- Two ways to construct the Pool Config file:
  - Monitoring and Discovery Service
  - Local pool config file (text based)
- Client tool to generate the Pool Config file:
  - The tool genpoolconfig is used to query the MDS and/or the local pool config file(s) to generate the XML Pool Config file

Gvds.Pool.Config (Demo)
- This file is read by the information provider and published into MDS
- Format (one key : value pair per line; values shown here as placeholders):
  gvds.pool.id         : <pool id>
  gvds.pool.lrc        : <lrc url>
  gvds.pool.gridftp    : <gridftp server>
  gvds.pool.gridftp    : <gridftp server>
  gvds.pool.universe   : <universe and jobmanager contact>
  gvds.pool.gridlaunch : <path to gridlaunch>
  gvds.pool.workdir    : <work directory>
  gvds.pool.profile    : <profile>
  gvds.pool.profile    : <profile>

Properties (Demo)
- The properties file defines and modifies the behavior of Pegasus
- Properties set in $VDS_HOME/properties can be overridden by defining them either in $HOME/.chimerarc or by giving them on the command line of any executable
  - e.g. gendax -Dvds.home=<path to vds home> …
- Some examples follow, but for more details please read the sample.properties file in the $VDS_HOME/etc directory
- Basic required properties:
  - vds.home: auto-set by the clients from the environment variable $VDS_HOME
  - vds.properties: path to the default properties file
    - Default: ${vds.home}/etc/properties

Concrete Planner gencdag (Demo)
- The concrete planner takes the DAX produced by Chimera and converts it into a set of Condor DAG and submit files
- Usage: gencdag --dax <dax file> --p <execution pool> [--dir <directory for output files>] [--o <output pool>] [--force]
- You can specify more than one execution pool; execution will take place on the pools on which the executable exists. If the executable exists on more than one pool, the pool on which it will run is selected randomly
- The output pool is the pool to which you want all the output products to be transferred; if not specified, the materialized data stays on the execution pool
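For example, a run against a hypothetical DAX, using the "isi" pool from the tc.data example above (all names here are illustrative), might look like:

  gencdag --dax diamond.dax --p isi --o isi --dir ./run1
  condor_submit_dag ./run1/*.dag

The first command writes the Condor DAG and submit files into ./run1; the second hands the generated DAG to DAGMan for execution.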

Future Improvements
- A sophisticated concrete planner with AI technology
- A sophisticated transformation catalog with a DB backend
- Smarter scheduling of workflows by deciding whether the workflow is compute intensive or data intensive
- In-time planning
- Using resource queue information and network bandwidth information to make a smarter choice of resources
- Reservation of disk space on remote machines

Pegasus Portal

Tutorial Outline
- Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
- The Chimera system (25 minutes)
- The Pegasus system (25 minutes)
- Summary (5 minutes)

Summary: GriPhyN Virtual Data System
- Using virtual data helps reduce the time and cost of computation
- Services in the Virtual Data Toolkit:
  - Chimera constructs a virtual plan
  - Pegasus constructs a concrete grid plan from this virtual plan
- Some current applications of the Virtual Data Toolkit follow

Astronomy
- Montage (NASA and NVO) (B. Berriman, J. Good, G. Singh, M. Su)
  - Delivers science-grade custom mosaics on demand
  - Produces mosaics from a wide range of data sources (possibly in different spectra)
  - User-specified parameters of projection, coordinates, size, rotation, and spatial sampling
(Figure: mosaic of the M101 galaxy images created by Pegasus-based Montage from a run on the TeraGrid.)

Montage Workflow (1202 nodes)

BLAST: a set of sequence-comparison algorithms used to search sequence databases for optimal local alignments to a query.
Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program.
Two major runs were performed using Chimera and Pegasus:
1) 60 genomes (4,000 sequences each), processed in 24 hours
   - Genomes selected from DOE-sponsored sequencing projects
   - 67 CPU-days of processing time delivered
   - ~10,000 Grid jobs
   - >200,000 BLAST executions
   - 50 GB of data generated
2) 450 genomes processed
Speedups of 5-20x were achieved because the compute nodes were used efficiently, by keeping the submission of jobs to the compute cluster constant.

For further information
- Globus Project:
- Chimera:
- Pegasus: pegasus.isi.edu
- MCS: