Distributed Services for Grid Enabled Data Analysis
Scenario
Liz and John are members of CMS. Liz is from Caltech and is an expert in event reconstruction; John is from Florida and is an expert in statistical fits. They wish to combine their expertise and collaborate on a CMS data analysis project.
[Architecture diagram: distributed grid services exposed through a uniform web-service layer – Grid Monitoring Service: MonALISA; Grid Resource Service: VDT Server; Grid Execution Service: VDT Client; Grid Scheduling Service: Sphinx; Virtual Data Service: Chimera; Workflow Generation Service: ShahKar; Collaborative Environment Service: CAVE; Grid-services Web Service: Clarens; Remote Data Service: Clarens; Analysis Clients: IGUANA, ROOT, Web Browser, PDA]

Demo Goals
Prototype a vertically integrated system
–Transparent/seamless experience
Distribute grid services using a uniform web service – Clarens!
–Understand system latencies and failure modes
Investigate request scheduling in a resource-limited and dynamic environment
–Emphasize functionality over scalability
Investigate interactive vs. scheduled data analysis on a grid
–Hybrid example
–Understand where the difficult issues are
Data Discovery
Virtual data products are pre-registered with the Chimera Virtual Data Service. Using Clarens, Liz and John discover data products by remotely browsing the Chimera Virtual Data Service.
[Diagram: Chimera Virtual Data System with derivation chains x.cards → pythia → x.ntpl → h2root → x.root and y.cards → pythia → y.ntpl → h2root → y.root; the clients browse the catalog and request products]
Data Analysis
Liz wants to analyse x.root using her analysis code a.C:

    // Analysis code: a.C
    #include "TFile.h"
    #include "TTree.h"
    #include "TBrowser.h"
    #include "TH1.h"
    #include "TH2.h"
    #include "TH3.h"
    #include "TRandom.h"
    #include "TCanvas.h"
    #include "TPolyLine3D.h"
    #include "TPolyMarker3D.h"
    #include "TString.h"

    void a( char treefile[], char newtreefile[] )
    {
       // HEPEVT-style ntuple variables
       Int_t   Nhep;
       Int_t   Nevhep;
       Int_t   Isthep[3000];
       Int_t   Idhep[3000], Jmohep[3000][2], Jdahep[3000][2];
       Float_t Phep[3000][5], Vhep[3000][4];
       Int_t   Irun, Ievt;
       Float_t Weight;
       Int_t   Nparam;
       Float_t Param[200];

       // Open the input file and attach to the h2root-converted ntuple "h10"
       TFile *file = new TFile( treefile );
       TTree *tree = (TTree*) file -> Get( "h10" );
       tree -> SetBranchAddress( "Nhep", &Nhep );
       // ... remaining branch addresses, event loop, and output to newtreefile
       //     are not shown on the slide
    }

[Diagram: x.cards → pythia → x.ntpl → h2root → x.root; Chimera Virtual Data System]
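For orientation, a CINT macro like a.C is normally loaded and called from an interactive ROOT session. The session below is an illustrative sketch only; the file names match the demo, but this exact invocation does not appear on the slides.

    root -l
    root [0] .L a.C
    root [1] a("x.root", "xa.root");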
Interactive Workflow Generation
Liz browses the local directory for her analysis code and the Chimera Virtual Data Service for input LFNs.
[Screenshot: Clarens browser – Select CINT script / Select input LFN / Define output LFN / register; Chimera chain x.cards → pythia → x.ntpl → h2root → x.root]
Interactive Workflow Generation
She selects and registers (to the Grid) her analysis code, the appropriate input LFN, and a newly defined output LFN.
[Screenshot: Clarens browser – CINT scripts a.C, b.C, c.C, d.C; input LFN x.root; output LFN xa.root; catalog entries y.ntpl, y.root, x.ntpl, x.root]
Interactive Workflow Generation
A branch is automatically added to the Chimera Virtual Data Catalog, and a.C is uploaded into “gridspace” and registered with RLS.
[Diagram: new derivation root(a.C): x.root → xa.root appended to the x.cards → pythia → x.ntpl → h2root → x.root chain]
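Conceptually, the new catalog branch is just a derivation record linking inputs to an output through a transformation, so xa.root exists "virtually" before it is ever materialised. The C++ sketch below is a hypothetical illustration of that record; Chimera itself stores this information in its own VDL/database format, and all names here are invented for illustration.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical, simplified picture of what the new catalog branch records.
    struct Derivation {
        std::string transformation;        // transformation / executable name
        std::vector<std::string> inputs;   // input logical file names (LFNs)
        std::string output;                // output LFN, "virtual" until derived
    };

    int main() {
        // Liz's new branch: run ROOT with a.C over x.root to produce xa.root
        Derivation lizBranch{ "root", { "a.C", "x.root" }, "xa.root" };

        std::cout << lizBranch.output << " <- " << lizBranch.transformation << "(";
        for (size_t i = 0; i < lizBranch.inputs.size(); ++i)
            std::cout << (i ? ", " : "") << lizBranch.inputs[i];
        std::cout << ")\n";   // prints: xa.root <- root(a.C, x.root)
        return 0;
    }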
Interactive Workflow Generation
Querying the Virtual Data Service, Liz sees that xa.root is now available to her as a new virtual data product.
[Diagram: catalog now lists y.ntpl, y.root, x.ntpl, x.root, and xa.root; chain x.cards → pythia → x.ntpl → h2root → x.root → root(a.C) → xa.root]
Request Submission
She requests it….
[Diagram: request for xa.root submitted against the chain x.cards → pythia → x.ntpl → h2root → x.root → root(a.C) → xa.root]
Brief Interlude: The Grid is Busy and Resources are Limited!
Busy:
–Production is taking place
–Other physicists are using the system
–Use MonALISA to avoid congestion in the grid
Limited:
–As grid computing becomes standard fare, oversubscription to resources will be common!
CMS gives Liz a global high priority. Based upon local and global policies, and current Grid weather, a grid scheduler:
–must schedule her requests for optimal resource use
Sphinx Scheduling Server
Nerve Centre
–Global view of system
Data Warehouse
–Information driven
–Repository of current state of the grid
Control Process
–Finite State Machine: different modules modify jobs, graphs, and workflows, and change their state (a minimal sketch follows below)
–Flexible
–Extensible
[Diagram: Sphinx Server – Control Process with Message Interface, Job/Graph Admission Control, Job/Graph Predictor, Graph Reducer, Graph Data Planner, Job Execution Planner, and Graph Tracker; Data Warehouse holding Data Management, Information Gatherer, Policies, Accounting Info, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc.]
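The control-process idea can be illustrated with a tiny finite state machine in which each "module" advances a job from one state to the next. This is a hypothetical C++ sketch, not Sphinx's actual implementation; the real state set and module interfaces are richer.

    #include <iostream>

    // Hypothetical job states, loosely modeled on the admission -> planning ->
    // execution -> tracking flow on the slide; not Sphinx's real state set.
    enum class JobState { Submitted, Admitted, Planned, Running, Finished, Failed };

    // Each module inspects the current state and, if responsible, advances it.
    JobState admissionControl(JobState s) { return s == JobState::Submitted ? JobState::Admitted : s; }
    JobState executionPlanner(JobState s) { return s == JobState::Admitted  ? JobState::Planned  : s; }
    JobState tracker(JobState s, bool ok) {
        if (s == JobState::Planned) return JobState::Running;
        if (s == JobState::Running) return ok ? JobState::Finished : JobState::Failed;
        return s;
    }

    int main() {
        JobState s = JobState::Submitted;
        s = admissionControl(s);   // Submitted -> Admitted
        s = executionPlanner(s);   // Admitted  -> Planned
        s = tracker(s, true);      // Planned   -> Running
        s = tracker(s, true);      // Running   -> Finished
        std::cout << "final state code: " << static_cast<int>(s) << "\n";
        return 0;
    }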
Distributed Services for Grid Enabled Data Analysis
[Deployment diagram: a ROOT Data Analysis Client talks through Clarens to the Sphinx Scheduling Service, Chimera Virtual Data Service, RLS Replica Location Service, Sphinx/VDT Execution Service, and MonALISA Monitoring Service; File Services with VDT Resource Services at Caltech, Florida, Fermilab, and Iowa; communication via Clarens, Globus GridFTP, and MonALISA]
Collaborative Analysis
Meanwhile, John has been developing his statistical fits in b.C by analysing the data product x.root.
[Diagram: catalog with xa.root and xb.root; derivations root(a.C): x.root → xa.root and root(b.C): x.root → xb.root; John requests xb.root]
Collaborative Analysis
After Liz has finished optimising the event reconstruction, John uses his analysis code b.C on her data product xa.root to produce the final statistical fits and results!
[Diagram: new derivation root(b.C): xa.root → xab.root added alongside the existing xa.root and xb.root branches; John requests xab.root]
Key Features
Distributed Services Prototype in Data Analysis
–Remote Data Service
–Replica Location Service
–Virtual Data Service
–Scheduling Service
–Grid-Execution Service
–Monitoring Service
Smart Replication Strategies for “Hot Data”
–Virtual Data w.r.t. Location
Execution Priority Management on a Resource Limited Grid
–Policy Based Scheduling & QoS
–Virtual Data w.r.t. Existence
Collaborative Environment
–Sharing of Datasets
–Use of Provenance
Credits
California Institute of Technology
–Julian Bunn, Iosif Legrand, Harvey Newman, Suresh Singh, Conrad Steenberg, Michael Thomas, Frank Van Lingen, Yang Xia
University of Florida
–Paul Avery, Dimitri Bourilkov, Richard Cavanaugh, Laukik Chitnis, Jang-uk In, Mandar Kulkarni, Pradeep Padala, Craig Prescott, Sanjay Ranka
Fermi National Accelerator Laboratory
–Anzar Afaq, Greg Graham
DMC (Data Management Component)
Scheduling the data transfers to achieve optimal workflow execution
The problem: combining data and execution scheduling
Various kinds of data transfers
Smart replication (a sketch of one such heuristic follows below)
–User initiated
–Workflow based replication
–Automatic replication
Hot data management
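As an illustration of the "hot data" idea, the hypothetical C++ sketch below replicates a dataset to additional sites once its recent access count crosses a threshold. The threshold, site names, and data structures are invented for illustration and do not come from the DMC design.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical hot-data replication heuristic: purely illustrative numbers and names.
    struct Catalog {
        std::map<std::string, int> recentAccesses;              // LFN -> recent access count
        std::map<std::string, std::set<std::string>> replicas;  // LFN -> sites holding a copy
    };

    void replicateHotData(Catalog& c, const std::vector<std::string>& sites, int threshold) {
        for (auto& [lfn, count] : c.recentAccesses) {
            if (count < threshold) continue;                 // not "hot" yet
            for (const auto& site : sites) {
                if (c.replicas[lfn].insert(site).second)     // schedule a transfer to a new site
                    std::cout << "replicate " << lfn << " -> " << site << "\n";
            }
        }
    }

    int main() {
        Catalog c;
        c.recentAccesses["xa.root"] = 12;                    // heavily requested product
        c.replicas["xa.root"] = { "Caltech" };
        replicateHotData(c, { "Caltech", "Florida", "Iowa" }, 10);
        return 0;
    }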
Monitoring in SPHINX
The scheduler needs information to make decisions
–The information needs to be as “current” as possible
That brings monitoring into the picture (a sketch of how these values might feed a site ranking follows after this slide)
–Load Average
–Free Memory
–Disk Space
Virtual Organization (VO) Quota System
–Different policies for resources
–Needs monitoring and accounting/tracking of resource quotas
MonALISA
–Dynamic discovery of sites
–Configurable monitoring service and parameters
–View generation using filters
–Displays SPHINX job information
Future Directions
–As the grid grows, the problem of latency becomes more potent
–Solution: data fusion/aggregation
–In line with the hierarchical views of the grid (VO) and the hierarchical scheduler!
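To make the monitoring-to-scheduling link concrete, here is a hypothetical C++ sketch that ranks candidate sites from the kinds of values listed above (load average, free memory, free disk). The weighting, field names, and numbers are invented for illustration; they are not the actual SPHINX or MonALISA interfaces.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical snapshot of monitored values for one site (illustrative names,
    // not the real MonALISA parameter names).
    struct SiteSnapshot {
        std::string name;
        double loadAverage;   // lower is better
        double freeMemoryGB;  // higher is better
        double freeDiskGB;    // higher is better
    };

    // Simple, made-up scoring: prefer lightly loaded sites with room for data.
    double score(const SiteSnapshot& s) {
        return -s.loadAverage + 0.1 * s.freeMemoryGB + 0.01 * s.freeDiskGB;
    }

    int main() {
        std::vector<SiteSnapshot> sites = {
            { "Caltech", 0.8, 12.0, 500.0 },
            { "Florida", 2.5, 32.0, 900.0 },
            { "Iowa",    0.2,  8.0, 200.0 },
        };
        std::sort(sites.begin(), sites.end(),
                  [](const SiteSnapshot& a, const SiteSnapshot& b) { return score(a) > score(b); });
        for (const auto& s : sites)
            std::cout << s.name << " score=" << score(s) << "\n";
        return 0;
    }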
Distributed Services for Grid Enabled Data Analysis
[Closing slide: repeats the deployment diagram shown earlier – ROOT Data Analysis Client, Clarens, Sphinx Scheduling Service, Chimera Virtual Data Service, RLS Replica Location Service, Sphinx/VDT Execution Service, MonALISA Monitoring Service, and the Caltech, Florida, Fermilab, and Iowa file services with VDT Resource Services]