1 Grid Canada
CLS eScience Workshop, 21st November 2005.
2 Grid Canada
A joint venture with CANARIE/C3.ca/NRC.
Grid Canada:
–Set up a testbed some three years ago using Globus.
–CANARIE hosts the web site, managed by Darcy.
–Main aim is to increase Grid awareness across Canada.
–Runs the GC Certificate Authority (over 2500 certificates issued).
–The main project is currently GridX1.
3 GridX1: A Canadian Computational Grid
A. Agarwal, M. Ahmed, D. Bickle, B. Caron, D. Deatrich, A. Dimopoulos, L. Groer, R. Haria, R. Impey, L. Klektau, G. Mateescu, C. Lindsay, D. Quesnel, B. St. Arnaud, R. Simmons, R. Sobie, D. Vanderster, M. Vetterli, R. Walker, M. Yuen
University of Alberta, University of Calgary, University of Toronto, University of Victoria, Simon Fraser University, National Research Council Canada, CANARIE, TRIUMF
4 Motivation
GridX1 is driven by the scientific need for a Grid:
–the ATLAS particle physics experiment at CERN
–linked to the Large Hadron Collider (LHC) Grid Project
Particle physics (HEP) simulations are "embarrassingly parallel": multiple instances of serial (integer) jobs.
We want to exploit the unused cycles at non-HEP sites:
–minimal software demands on sites
Open to other applications (serial, integer):
–grid-enabling an application is as complicated as building the Grid
–a BaBar particle physics application (SLAC) is under development
5 GridX1 model
A number of facilities are dedicated to particle physics groups, but most are shared with researchers in other fields.
Each shared facility may have unique configuration requirements.
GridX1 model:
–Generic middleware (Virtual Data Toolkit: GT 2.4.3 + fixes)
–No OS requirement: SuSE and Red Hat clusters
–Generic user accounts: gcprod01... gcprodmn
–Condor-G Resource Broker for load balancing
6 Overview
GridX1 currently has 9 clusters: Alberta (2), NRC Ottawa (2), WestGrid, Victoria (3), Toronto (1).
Discussions are underway with McGill (HEP), which is just about to be added.
Total resources exceed 1000 CPUs, 10 TB of disk and 400 TB of tape.
The maximum number of jobs running on GridX1 has exceeded 250.
Condor-G grid:
–an extension of the Condor batch system
–scalable to thousands of jobs
–intuitive commands for running jobs on remote resources (see the sketch below)
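To make the Condor-G workflow concrete, here is a minimal Python sketch that writes a Globus-universe submit description and hands it to condor_submit. The gatekeeper contact (cluster.example.ca/jobmanager-pbs), the executable name and the output file names are hypothetical placeholders, and the attributes follow GT2-era Condor-G usage rather than GridX1's actual submit files.

import subprocess

def submit_condor_g_job(gatekeeper, executable):
    """Write a minimal Condor-G (Globus universe) submit file and submit it."""
    submit_description = (
        "universe        = globus\n"
        f"globusscheduler = {gatekeeper}\n"
        f"executable      = {executable}\n"
        "output          = job.$(Cluster).out\n"
        "error           = job.$(Cluster).err\n"
        "log             = job.$(Cluster).log\n"
        "queue\n"
    )
    with open("job.submit", "w") as f:
        f.write(submit_description)
    # condor_submit is the standard Condor command-line submission tool.
    subprocess.run(["condor_submit", "job.submit"], check=True)

if __name__ == "__main__":
    # Hypothetical GT2 gatekeeper contact string: host/jobmanager-<scheduler>.
    submit_condor_g_job("cluster.example.ca/jobmanager-pbs", "atlas_sim.sh")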
7 Resource Management
ClassAds are used for passing site and job specifications.
Resources periodically publish their state to the collector:
–free/total CPUs, number of running and waiting jobs, estimated queue waiting time
Job ClassAds contain a resource Requirements expression:
–CPU requirements, OS, application software (see the matchmaking sketch below)
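The matchmaking idea can be sketched with plain Python dictionaries. Real ClassAds use Condor's own expression language; the attribute names below (FreeCpus, EstWaitMinutes, HasAtlasSoftware and so on) simply mirror this slide and are not taken from actual GridX1 advertisements.

# A resource ClassAd: the state a site might publish to the collector.
site_ad = {
    "Name": "example-cluster",   # hypothetical site name
    "FreeCpus": 40,
    "TotalCpus": 200,
    "RunningJobs": 160,
    "WaitingJobs": 12,
    "EstWaitMinutes": 25,
    "OpSys": "LINUX",
    "HasAtlasSoftware": True,
}

# A job ClassAd with a Requirements expression over the resource's attributes.
job_ad = {
    "Cmd": "atlas_sim.sh",
    "Requirements": lambda r: (
        r["OpSys"] == "LINUX"
        and r["FreeCpus"] > 0
        and r["HasAtlasSoftware"]
    ),
}

def matches(job, resource):
    """Return True if the resource satisfies the job's Requirements."""
    return job["Requirements"](resource)

print(matches(job_ad, site_ad))  # True for the example ads above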
8 Job management
Each site specifies the maximum number of grid jobs, maxJobs (100 at UVictoria).
Jobs are sent to the site with the lowest wait time.
Sites are selected on a round-robin basis.
The RB submits jobs to a site until the number of jobs pending at that site reaches 10% of maxJobs (see the sketch below).
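One way to express the site selection and throttling rule above is the short Python sketch below; the site names, pending counts and wait times are invented for illustration, and the real broker implements this logic through Condor-G matchmaking rather than custom code.

def eligible_sites(sites):
    """Keep only sites whose pending grid jobs are below 10% of maxJobs."""
    return [s for s in sites if s["pending"] < 0.10 * s["maxJobs"]]

def pick_site(sites):
    """Among eligible sites, choose the one with the lowest estimated wait."""
    candidates = eligible_sites(sites)
    if not candidates:
        return None  # leave the job unsubmitted until a site frees up
    return min(candidates, key=lambda s: s["estWaitMinutes"])

# Invented example: Toronto is over its pending cap, so UVic wins on wait time.
sites = [
    {"name": "uvic",    "maxJobs": 100, "pending": 9,  "estWaitMinutes": 20},
    {"name": "alberta", "maxJobs": 60,  "pending": 1,  "estWaitMinutes": 45},
    {"name": "toronto", "maxJobs": 80,  "pending": 12, "estWaitMinutes": 5},
]
print(pick_site(sites)["name"])  # prints "uvic"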
9 Monitoring
10 System Status
unsubmitted: waiting on the GridX1 RB (no site identified yet)
pending: sent to a resource but not yet running
running: active
waiting time: estimated time for the next job to run
11 Local resource status
Each site sets its own policy: some backfill grid jobs, while others limit the number of jobs.
12 Status
GridX1 is used by the ATLAS experiment via the LCG-TRIUMF gateway.
Over 12,000 ATLAS jobs have been successfully completed.
13 Challenges
GridX1 is equivalent to a moderate-sized computing facility:
–it requires a "grid" system administrator to keep the system operational
We need a more automated way to install applications.
Monitoring is in good shape, but further improvements are needed:
–improve reliability and scalability
Error recovery has not been an issue with LCG jobs:
–we will have to address this with the BaBar simulation application
14 Data management
No data grid management was required for the ATLAS Data Challenge:
–data analysis jobs will require access to large input data sets
Prototype data grid elements are in place:
–replica catalog
–jobs running on GridX1 query the RLI and either copy data from UVic to the grid cache, or link to the file if it already exists in the grid cache (see the sketch below)
Install dCache at UVic with a Storage Resource Manager (SRM):
–dCache is developed by Fermilab (Chicago) and DESY (Hamburg)
–SRMs are used to interface storage facilities on the Grid
–interface GridX1 storage to LCG via GridX1-SRM
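A rough Python sketch of the link-or-copy behaviour described above, assuming a hypothetical query_rli() lookup and placeholder cache path and storage URL; globus-url-copy is the standard GridFTP client of that era, but the exact commands and paths used on GridX1 may differ.

import os
import subprocess

GRID_CACHE = "/grid/cache"  # hypothetical local grid-cache directory

def query_rli(logical_file):
    """Hypothetical stand-in for the Replica Location Index lookup."""
    return f"gsiftp://storage.example.ca/data/{logical_file}"  # placeholder URL

def stage_input(logical_file, workdir):
    """Make logical_file available in workdir, using the grid cache if possible."""
    cached_path = os.path.join(GRID_CACHE, logical_file)
    local_path = os.path.join(workdir, logical_file)
    if not os.path.exists(cached_path):
        # Not cached yet: fetch it from the storage element via GridFTP
        # and place it in the grid cache for later jobs.
        os.makedirs(GRID_CACHE, exist_ok=True)
        source_url = query_rli(logical_file)
        subprocess.run(["globus-url-copy", source_url, f"file://{cached_path}"],
                       check=True)
    # Either way, the job just links to the cached copy.
    os.symlink(cached_path, local_path)
    return local_path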
15 Plans
Short-term plans:
–improve the reliability and scalability of the monitoring
–get all sites operational (e.g. NRC Venus)
–get the BaBar application running on more GridX1 sites
Long-term plans:
–add data grid capability
–high-speed network links between sites
–explore the virtual computing concept (e.g. Xen)
–web-services-based monitoring
–investigate grid resource broker algorithms (PhD thesis)
16 Summary
GridX1 is working very well.
Over 12,000 ATLAS jobs in the past 6 months (5000 in March).
In Feb/Mar GridX1 was running 7% to 10% of all LHC jobs world-wide.
The BaBar application is running on a subset of GridX1:
–typically 200 BaBar jobs run on the UVic clusters and WestGrid
Talks at international conferences and press coverage (national and international).
We want to add more sites:
–other applications could be run on the Grid (we are looking at two more)
–requirements and instructions are available at www.gridx1.ca