Virtual Data Toolkit
R. Cavanaugh
GriPhyN Analysis Workshop, Caltech, June 2003
Very Early GriPhyN Data Grid Architecture
[Figure: architecture diagram. Component boxes: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Technology labels: DAGMAN, Condor-G; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS; MCAT, GriPhyN catalogs; GDMP; MDS; Globus. Legend: marked boxes = initial solution is operational.]
Currently Evolved GriPhyN Picture
[Figure: picture taken from Mike Wilde]
Current VDT Emphasis
• Current reality
  – Easy grid construction
    > Strikes a balance between flexibility and “easibility”
    > Purposefully errs (just a little bit) on the side of “easibility”
  – Long-running, high-throughput, file-based computing
  – Abstract description of complex workflows
  – Virtual Data Request Planning
  – Partial provenance tracking of workflows
• Future directions (current research), including:
  – Policy-based scheduling
    > With notions of Quality of Service (advance reservation of resources, etc.)
  – Dataset-based (arbitrary type structures)
  – Full provenance tracking of workflows
  – Several others…
Current VDT Flavors
• Client
  – Globus Toolkit 2
    > GSI
    > globusrun
    > GridFTP Client
  – CA signing policies for DOE and EDG
  – Condor-G / DAGMan
  – RLS Client
  – MonALISA Client (soon)
• Chimera
• SDK
  – Globus
  – ClassAds
  – RLS Client
  – Netlogger
• Server
  – Globus Toolkit
    > GSI
    > Gatekeeper
    > job-managers and GASS Cache
    > MDS
    > GridFTP Server
  – MyProxy
  – CA signing policies for DOE and EDG
  – EDG Certificate Revocation List
  – Fault Tolerant Shell
  – GLUE Schema
  – mkgridmap
  – Condor / DAGMan
  – RLS Server
  – MonALISA Server (soon)
Chimera Virtual Data System
• Virtual Data Language
  – textual
  – XML
• Virtual Data Catalog
  – MySQL or PostgreSQL based
  – File-based version available
Virtual Data Language

    TR CMKIN( out a2, in a1 ) {
      argument file = ${a1};
      argument file = ${a2};
    }

    TR CMSIM( out a2, in a1 ) {
      argument file = ${a1};
      argument file = ${a2};
    }

    DV x1->CMKIN(
    DV x2->CMSIM(

[Figure: nodes file1, file2, file3 and derivations x1, x2. Picture taken from Mike Wilde]
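To make the TR/DV distinction concrete, here is a minimal Python sketch (not Chimera code) that models a transformation as a template with formal arguments and a derivation as that template bound to logical file names. The class names, the `producer_of` helper, and above all the bindings file1 → x1 → file2 → x2 → file3 are assumptions: the DV statements on the slide are cut off, so the chain is only inferred from the figure labels.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """A TR: a named template with formal input and output arguments."""
    name: str
    inputs: list
    outputs: list


@dataclass
class Derivation:
    """A DV: a transformation bound to actual logical file names."""
    name: str
    transformation: Transformation
    bindings: dict  # formal argument name -> logical file name

    def input_files(self):
        return [self.bindings[arg] for arg in self.transformation.inputs]

    def output_files(self):
        return [self.bindings[arg] for arg in self.transformation.outputs]


# Formal arguments follow the TR signatures on the slide: (out a2, in a1).
cmkin = Transformation("CMKIN", inputs=["a1"], outputs=["a2"])
cmsim = Transformation("CMSIM", inputs=["a1"], outputs=["a2"])

# ASSUMED bindings: the slide's DV statements are truncated.
x1 = Derivation("x1", cmkin, {"a1": "file1", "a2": "file2"})
x2 = Derivation("x2", cmsim, {"a1": "file2", "a2": "file3"})


def producer_of(lfn, derivations):
    """Return the derivation that writes `lfn`, or None for a raw input."""
    for dv in derivations:
        if lfn in dv.output_files():
            return dv
    return None
```

Under these assumed bindings, `producer_of("file3", [x1, x2])` returns `x2`, whose input `file2` is in turn produced by `x1`. That chaining of outputs to inputs is what lets a catalog of derivations be read as a dependency graph.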
Virtual Data Request Planning
• Abstract Planner
  – Graph traversal of (virtual) data dependencies
  – Generates the graph with maximal data dependencies
  – Somewhat analogous to Build Style
• Concrete (Pegasus) Planner
  – Prunes execution steps for which data already exists (RLS lookup)
  – Binds all execution steps in the graph to a site
  – Adds “housekeeping” steps
    > Create environment, stage-in data, stage-out data, publish data, clean-up environment, etc.
  – Generates a graph with minimal execution steps
  – Somewhat analogous to Make Style
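The difference between the two planning styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Chimera or Pegasus implementation: the `plan` function, the `producers` mapping, and the file1/file2/file3 chain are hypothetical, and a plain set stands in for the RLS lookup.

```python
def plan(target, producers, existing=frozenset()):
    """Return the steps needed to derive `target`, dependencies first.

    producers: logical file name -> (step name, [input logical file names])
    existing:  logical file names that already have a replica; their
               producing steps (and everything upstream) are skipped.
    """
    order, seen = [], set()

    def visit(lfn):
        if lfn in existing or lfn not in producers:
            return  # already materialized, or a raw input with no producer
        step, inputs = producers[lfn]
        if step in seen:
            return
        seen.add(step)
        for dep in inputs:
            visit(dep)
        order.append(step)  # post-order: a step follows its dependencies

    visit(target)
    return order


# Hypothetical two-step chain: file1 -> x1 -> file2 -> x2 -> file3.
producers = {
    "file2": ("x1", ["file1"]),
    "file3": ("x2", ["file2"]),
}

abstract_steps = plan("file3", producers)                      # "build" style
concrete_steps = plan("file3", producers, existing={"file2"})  # "make" style
```

With no knowledge of existing data the traversal returns every step (`["x1", "x2"]`). Once `file2` is known to exist, the step that would regenerate it is pruned and only `["x2"]` remains.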
Chimera Virtual Data System: Mapping Abstract Workflows onto Concrete Environments
• Abstract DAGs (virtual workflow)
  – Resource locations unspecified
  – File names are logical
  – Data destinations unspecified
  – build style
• Concrete DAGs (stuff for submission)
  – Resource locations determined
  – Physical file names specified
  – Data delivered to and returned from physical locations
  – make style
In general there is a full range of planning steps between abstract workflows and concrete workflows.
[Figure labels: VDL, VDC, Abs. Plan, DAX (XML), RLS, C. Plan., DAG, DAGMan; Logical, Physical. Picture taken from Mike Wilde]
A virtual space of simulated data is generated for future use by scientists... (Supercomputing 2002)
[Figure: diagram of simulated datasets, each node labelled by its parameters:]
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW
  mass = 200, decay = ZZ
  mass = 200, decay = bb
  mass = 200, plot = 1
  mass = 200, event = 8
Scientists may add new derived data branches... (Supercomputing 2002)
[Figure: the same diagram of simulated datasets with one new node added:]
  mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt =
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW
  mass = 200, decay = ZZ
  mass = 200, decay = bb
  mass = 200, plot = 1
  mass = 200, event = 8
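One way to read these two pictures is as a catalog in which each dataset is identified by its parameter set and is only produced when someone first asks for it. The Python sketch below illustrates that reading; it is an assumption about the idea, not code from the Supercomputing 2002 demo. The `VirtualDataSpace` class and its methods are hypothetical, a string stands in for real simulation output, and the new branch omits `HighPt` because its value is missing from the slide text.

```python
class VirtualDataSpace:
    """Toy catalog of datasets identified by their parameter sets."""

    def __init__(self):
        self.declared = set()    # parameter sets that may be requested
        self.materialized = {}   # parameter set -> produced data
        self.runs = 0            # times the stand-in simulation actually ran

    @staticmethod
    def _key(params):
        # Order-independent, hashable identity for a parameter set.
        return frozenset(params.items())

    def declare(self, **params):
        """Add a dataset to the virtual space without producing it."""
        self.declared.add(self._key(params))

    def request(self, **params):
        """Return the dataset, producing it only on the first request."""
        key = self._key(params)
        if key not in self.declared:
            raise KeyError(f"no recipe declared for {params}")
        if key not in self.materialized:
            self.runs += 1
            self.materialized[key] = f"data for {sorted(params.items())}"
        return self.materialized[key]

    def ancestors(self, **params):
        """Declared datasets whose parameters are a strict subset of these."""
        key = self._key(params)
        subsets = sorted((k for k in self.declared if k < key), key=len)
        return [dict(k) for k in subsets]


space = VirtualDataSpace()
space.declare(mass=200, decay="WW")
space.declare(mass=200, decay="WW", stability=1)

first = space.request(mass=200, decay="WW")
again = space.request(decay="WW", mass=200)  # same dataset, no second run

# A scientist adds a new derived branch under an existing node.
space.declare(mass=200, decay="WW", stability=1, LowPt=20)
```

Requesting the same parameters twice runs the stand-in simulation once, and the newly declared branch reports the two existing nodes it refines via `ancestors`.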
Example CMS Data/Workflow
[Figure: workflow diagram with nodes POOL, Generator, Simulator, Formator, Digitiser, Calib. DB, writeESD, writeAOD, writeTAG (the three write steps appear twice), and Analysis Scripts]
Data/workflow is a collaborative endeavour!
[Figure: CMS workflow diagram (POOL, Generator, Simulator, Formator, Digitiser, Calib. DB, writeESD, writeAOD, writeTAG, Analysis Scripts) annotated with the groups involved: Online Teams, (Re)processing Team, MC Production Team, Physics Groups]
A “Concurrent Analysis Versioning System”: Complex Data Flow and Data Provenance in HEP
[Figure labels: Raw, ESD, AOD, TAG; Plots, Tables, Fits (twice); Comparisons; Real Data; Simulated Data]
• Family History of a Data Analysis
• Collaborative Analysis Development Environment
• "Check-point" a Data Analysis
• Analysis Development Environment (like CVS)
• Audit a Data Analysis
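The "family history" and "audit" ideas amount to walking a provenance record backwards from a result to everything it was derived from. A minimal Python sketch follows. The `lineage` function is hypothetical, and the `derived_from` mapping assumes a simple Raw → ESD → AOD → TAG → Plots chain, which the slide's labels suggest but do not spell out.

```python
from collections import deque


def lineage(product, derived_from):
    """Return every ancestor of `product`, nearest ancestors first."""
    ancestors, queue, seen = [], deque([product]), {product}
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:  # guard against shared ancestors
                seen.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors


# ASSUMED provenance record: each product maps to what it was derived from.
derived_from = {
    "ESD": ["Raw"],
    "AOD": ["ESD"],
    "TAG": ["AOD"],
    "Plots": ["TAG"],
}

history = lineage("Plots", derived_from)
```

Here `history` lists TAG, AOD, ESD, and Raw in that order, which is the audit trail for the plots. A full provenance system would also record the transformation and parameters used at each step.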
Current Prototype GriPhyN “Architecture”
[Figure: picture taken from Mike Wilde]
Post-talk: My wandering mind…

Typical VDT Configuration
• Single public head-node (gatekeeper)
  – VDT-server installed
• Many private worker-nodes
  – Local scheduler software installed
  – No grid-middleware installed
• Shared file system (e.g. NFS)
  – User area shared between head-node and worker-nodes
  – One or many RAID systems typically shared
Default middleware configuration from the Virtual Data Toolkit
[Figure labels: submit host, Chimera, DAGMan, Condor-G, gahp_server; remote host, gatekeeper, Local Scheduler (Condor, PBS, etc.); compute machines]
EDG Configuration (for comparison)
• CPU separate from Storage
  – CE: single gatekeeper for access to cluster
  – SE: single gatekeeper for access to storage
• Many public worker-nodes (at least NAT)
  – Local scheduler installed (LSF or PBS)
  – Each worker-node runs a GridFTP Client
• No assumed shared file system
  – Data access is accomplished via globus-url-copy to local disk on worker-node
Why Care?
• Data Analyses would benefit from being fabric independent!
• But… the devil is (still) in the details!
  – Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences, and vice versa.
• Are existing middleware configurations sufficient for Data Analysis (“scheduled” and “interactive”)?
  – Really need input from groups like those here!
  – What kind of fabric layer is necessary for “interactive” data analysis using PROOF, JAS?
• Does the VDT need multiple configuration flavors?
  – Production, batch-oriented (current default)
  – Analysis, interactive-oriented