Slide 1: Running Grid Testbeds and Planning for LCG-1 at CERN (update on the HEPiX 2002 report)
Markus Schulz (CERN), 20/05/2003
- EDG & LCG: a minimal introduction
- EDG testbeds / CERN testbeds (history)
- Planning for LCG-1
Reporting from the perspective of a CERN testbed manager, EDG integration team member, and CERN IT-GD group member.
I have too many slides!
Slide 2: EDG (http://www.edg.org)
- European Data Grid (3-year project)
- Project for middleware and fabric management
- Emphasis on data-intensive scientific computing
- Large-scale testbeds to demonstrate production quality
- Based on the Globus toolkit
- Applications: HEP, Biology/Medical Science, Earth Observation; organized into VOs (Virtual Organizations)
- Main partners: IBM-UK, CS-SI (FR), Datamat (IT) + 12 research and university institutes (provided part of the funding)
Slide 3: LCG (http://lcg.web.cern.ch/lcg)
- LHC Computing Grid: provides computing for the LHC experiments
- Not a software development project (but makes changes to harden services)
- Selects grid software from other projects (such as EDG)
- Worldwide (Europe, Asia, US), but for the LHC experiments only
- Focus on production-quality service, operation and deployment
- Some milestones: provide the LCG-1 service in July 2003; in January 2004 start the production service for the CMS and ALICE DC04 data challenges
Slide 4: EDG Testbeds (all)
- Started in October 2001: ~15 machines, one site (CERN), 20 users, 20 GB storage
- Now (2/2003): testbeds at many sites, peak 1200 CPUs, ~16 TB disk, 200 TB tape, MSS, 470 users
- 12 Virtual Organizations (VOs), >400 users:

  VO             Users
  CMS            106
  WP6            87
  ALICE          63
  ATLAS          55
  Earth Obs.     29
  BaBar          29
  LHCb           28
  ITeam          22
  Genomic        22
  TSTG           16
  Medical Img.   6
  D0             3
Slide 5: EDG Application Testbeds (2/2003)

  Site            Country  CPUs   Storage
  CC-IN2P3        FR       620    192 GB
  CERN            CH       138    1321 GB
  CNAF            IT       48     1300 GB
  Ecole Poly.     FR       6      220 GB
  Imperial Coll.  UK       92     450 GB
  Liverpool       UK       2      10 GB
  Manchester      UK       9      15 GB
  NIKHEF          NL       142    433 GB
  Oxford          UK       1      30 GB
  Padova          IT       11     666 GB
  RAL             UK       6      332 GB
  SARA            NL       0      10000+ GB
  TOTAL                    1075   14969 GB

(Core sites + development testbeds.)
Slide 6: Types of Testbeds until 2/2003
- Application testbed (many sites)
  - Last known most stable release (certified); focus on service availability
  - Used by the applications for large-scale tests and productions
  - Significant resources; installed and controlled by the site managers
- Development testbeds (core sites)
  - Current tagged release + new packages -> next tagged release
  - Installation and setup by site admins + developers
  - Integration, evaluation, tests (currently certification); focus on rapid changes
  - A few machines (every service represented)
- Developer nodes (local)
  - Basic installation by local admins, latest software from the developers
  - Developers control the machines
Slide 7: Now: Some Extra Testbeds
- Integration testbeds (core sites)
  - Like a development testbed, but used for the integration of EDG 2
  - Many changes; for long periods not integrated (despite the name)
- Certification testbeds (LCG)
  - New release candidates are integrated here
  - Evaluate stability
  - Define the configuration for large-scale deployment
  - Streamline installation
  - Considerable resources: every service represented, multiple sites, at CERN multiple emulated sites
Slide 8: Basic Services (EDG v1.4.x) — what makes running EDG grid services hard
- Authentication
  - GSI (Grid Security Infrastructure), based on PKI (OpenSSL)
  - Globus gatekeeper, proxy renewal service, ...
- GIS (Grid Information Service)
  - MDS (Metacomputing Directory Service), based on LDAP; VOs (see the query sketch below)
- Storage management
  - Replica Catalog (LDAP), GDMP, GSI-enabled FTP, RFIO, ...
- Resource management
  - Resource Broker, Jobmanager
- Job submission
  - Batch system (PBS, LSF)
- Logging and bookkeeping, ...
Many networked daemons.
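As an illustration of how the LDAP-based information service was typically inspected, here is a minimal Python sketch that queries a GRIS/MDS endpoint. The host name is a placeholder, and port 2135 plus the base DN "Mds-Vo-name=local, o=grid" are assumptions based on common MDS 2.x setups, not figures taken from the talk.

```python
# Minimal sketch: query an MDS/GRIS LDAP endpoint and print the entries found.
# Host, port and base DN are assumptions (typical MDS 2.x defaults).
from ldap3 import Server, Connection, ALL

GRIS_HOST = "ce01.example.org"           # hypothetical Computing Element
GRIS_PORT = 2135                         # assumed default MDS/GRIS port
BASE_DN = "Mds-Vo-name=local, o=grid"    # assumed MDS base DN

server = Server(GRIS_HOST, port=GRIS_PORT, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind, as MDS usually allowed

# Fetch all entries below the base DN and show the first few DNs.
conn.search(BASE_DN, "(objectClass=*)", attributes=["*"])
for entry in conn.entries[:10]:
    print(entry.entry_dn)

conn.unbind()
```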
Slide 9: What Runs Where (EDG v1.4.x)
The slide shows a matrix mapping each grid daemon to the node types it runs on (UI, MDS, CE, WN, SE, RC, BDII, RB, PX):
- EDG Gatekeeper
- Replica Catalog
- GSI-enabled FTPd
- Globus MDS
- Info-MDS
- Broker
- Job Submission service
- Information Index
- Logging & Bookkeeping
- Local Logger
- CRL update
- Grid-mapfile update
- RFIO
- GDMP
- MyProxy
Many services on many different node types.
Slide 10: Services + Constraints
- Services are interdependent
- Composite services
  - Require their own database (MySQL, PostgreSQL, ...)
  - CondorG
- Services impose constraints on the setup
  - Some services require a shared file system between nodes
    - WN and CE need to share part of /home (GASS cache)
    - CE and SE share /etc/grid-security and the SE disks
  - Some services cannot coexist with other services
- EDG users are not all CERN users
  - Local user administration on the testbeds is needed (e.g. using NIS)
  - Exempt from site policies (no entry in the HR database)
Slide 11: CERN Application Testbed (2/2003) — http://cern.ch/markusw/cernEDG.html
- SE 1: access to grid Castor; SE 2: access to CERN Castor
- CE 1: general worker nodes (20-70 WNs); CE 2: tutorial
- UI 1 and UI 2 in a NIS domain
- Proxy, MDS, RC, BDII, RB1, RB2
- LCFG installs and (almost) completely configures (almost) all nodes
- NFS: 5 servers, 2.5 TB (mirrored), providing /home/griduserxxx and /flatfiles/SE00/VOXX
Slide 12: CERN EDG Testbeds — http://cern.ch/markusw/cernEDG.html
Major testbeds, 20.02.2003:
- Application testbed v1.4.4 — 38 nodes
  - Frequent restarts of services (daily)
  - Test production of applications, demonstrations
- Tutorial testbed — 3 nodes
  - Used for tutorials (shares services with the application testbed); used every few weeks
- Development 1.4.4 — 10 nodes
  - Error reproduction; used from time to time to test patches
- Integration — 8 nodes
  - Development, porting to Globus 2.2.4, RH 7.3
- Nodes for development + infrastructure — 50 nodes
  - 5 disk servers; frequent changes
2 FTEs spread over 4.
Slide 13: CERN Production Testbed (now) — http://cern.ch/markusw/cernEDG.html
- Reduced to limit the impact on the integration of EDG 2; services moved to other sites, production trials still going on!
- Remaining: UI 1 and UI 2 in a NIS domain, SE 2 (access to CERN Castor), RH 6.2, EDG
- LCFG installs and (almost) completely configures (almost) all nodes
- NFS: 5 servers, 2.5 TB (mirrored), providing /home/griduserxxx and /flatfiles/SE00/VOXX
- Demonstrated that EDG has spread knowledge!
Slide 14: CERN EDG Testbeds — http://cern.ch/markusw/cernEDG.html
Major testbeds, 20.05.2003:
- Application testbed v1.4.4 — 2 nodes
  - Test production of applications, demonstrations
- Integration — 20 nodes
  - Development, integrating EDG v2, RH 7.3, VDT 1.1.8
  - On average 2-3 new tags per day
- Nodes for development + infrastructure — 15 nodes
  - 5 disk servers; frequent changes
Slide 15: Operation: Releases

  Version  Date
  1.1.2    27 Feb 2002
  1.1.3    02 Apr 2002
  1.1.4    04 Apr 2002
  1.2.a1   11 Apr 2002
  1.2.b1   31 May 2002
  1.2.0    12 Aug 2002
  1.2.1    04 Sep 2002
  1.2.2    09 Sep 2002
  1.2.3    25 Oct 2002
  1.3.0    08 Nov 2002
  1.3.1    19 Nov 2002
  1.3.2    20 Nov 2002
  1.3.3    21 Nov 2002
  1.3.4    25 Nov 2002
  1.4.0    06 Dec 2002
  1.4.1    07 Jan 2003
  1.4.2    09 Jan 2003
  1.4.3    14 Jan 2003
  1.4.4    18 Feb 2003

- Successes: matchmaking/job management, basic data management. Known problems: high-rate submissions, long FTP transfers.
- Known problems: GASS cache coherency, race conditions in the gatekeeper, unstable MDS. Intense use by applications! Limitations: resource exhaustion, size of logical collections.
- Successes: improved MDS stability, FTP transfers OK. Known problems: interactions with the RC.
- ATLAS phase 1 start; CMS stress test Nov. 30 - Dec. 20 (CMS, ATLAS, LHCb, ALICE on 1.4.4-1.4.11, security and application software).
- Production tests gave the most valuable feedback.
Slide 16: Bug Tracking (the long way to stability)
Slide 17: Release Procedure (simplified)
- Before 1.2: no procedures, no records (the Dark Ages)
  - Integration seemed to take infinite time
  - Interaction with the developers was very unreliable
  - Heroic efforts required by everyone
- Now (simplified description of the practice):
  - RPMs are built by the autobuild system from CVS and delivered to the integration team; the configuration is in CVS
  - Installed on the development testbed at CERN (highest rate of changes); first tests
  - Core sites install the version on their development testbeds; some distributed tests
  - The software is deployed on the application testbed at CERN and the core sites; more tests
  - Applications start using the application testbed
  - Other sites install, get certified by the ITeam, and join
- Some bureaucratic effort (simple changes take longer), but the process evolved into a documented release procedure
- Application software is installed on demand, outside the release cycle
Slide 18: Detailed Release Procedure (LCG/EDG) — PERFECT!!! The test and validation process to be used:
- WPs add unit-tested code to the CVS repository; the build system runs nightly builds and automated tests, producing tagged packages and tagged releases
- A tagged release candidate is selected for certification on the certification testbed (~40 CPUs, with LCG): grid certification, problem fixing, and application certification with the test group and the application representatives
- A certified release is selected for deployment; the certified public release runs on the production testbed (~1000 CPUs) for use by the applications, 24x7
- The development testbed (~15 CPUs) and WP-specific machines are used by the integration team for individual WP tests and overall release tests (office hours)
- Bugzilla is used for anomaly reports
- (*) current infrastructure; (**) with LCG
Slide 19: Reality: EDG 2 Integration (Cutting Corners)
- The release plan was too tight to follow the procedure
  - No reinstalls between the WPs to be integrated
  - Generous interpretation of the procedure (accepting non-autobuild RPMs, ...)
  - Minimal tests between adding new WP software
- Result: some of the expected problems
  - Upgrade differs from a new install
  - Configuration via the install tool not always working
- But: a large improvement over what we did for 1.x (almost complete record, about 3 CVS tags/day)
Slide 20: Installation and Configuration: LCFG(ng) — Local ConFiGuration system (University of Edinburgh + EDG WP4, http://www.lcfg.org)
- Experience (2001 to 2002):
  - Early versions not adequate
  - Only a few middleware components provided configuration objects
  - No feedback from the configured nodes; predictability issues
  - Total control of the node by LCFG
  - Installation tool introduced at the time of the first release
- Intermediate:
  - Added local monitoring of the installation
  - Added PXE/DHCP-based install
  - Added serial-line console monitoring + remote resets (retrofitted to the nodes); this improved the turnaround time of installs
  - Used NIS for managing user accounts, LCFG for system accounts
  - Learned to handle developers' nodes
- Package counts on an EDG 1.4.x node: EDG 50 pkgs, Globus (21) 233 pkgs, Globus (r24) 70 pkgs, external 50 pkgs, RedHat 340 pkgs
Slide 21: Installation and Configuration: LCFG(ng) — Local ConFiGuration system (University of Edinburgh + EDG WP4, http://www.lcfg.org)
- Now: LCFGng + LCFGng-lite
  - Many improvements (PXE integrated, robustness, predictability, ...)
  - The WPs have written the objects needed for automatic configuration
  - The lite version allows LCFG to be used on already installed nodes
  - Introduced well before the integration of EDG v2 started
  - Still a few functionalities missing (feedback from the nodes, limitations, ...)
- LCFGng works quite well (now), if:
  - The configuration is tested on a dedicated LCFGng server/client by the developers
  - All services are configured by working LCFG objects (almost there)
  - You know and respect the limitations of the tool
- WP4's final product is late for the first LCG-1 deployment (EDG 2 integration is being done with LCFGng). A sketch of the kind of post-install check this requires follows below.
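The sanity check we typically ran after an LCFG(ng) install — does the node actually have the packages its profile expects? — can be sketched in a few lines of Python. The expected-package file format and its path are hypothetical; `rpm -qa --qf` is the only real interface used.

```python
# Sketch: compare the RPMs installed on a node against the list an install
# profile expects. The profile file (one "name-version" per line) and its
# path are hypothetical; the real check was done against LCFG profiles.
import subprocess

def installed_packages():
    """Return the set of name-version strings reported by rpm."""
    out = subprocess.run(
        ["rpm", "-qa", "--qf", "%{NAME}-%{VERSION}\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

def expected_packages(profile_path):
    """Read the expected package list (hypothetical plain-text format)."""
    with open(profile_path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith("#")}

if __name__ == "__main__":
    expected = expected_packages("/etc/expected-packages.list")  # hypothetical path
    installed = installed_packages()
    print("missing:   ", sorted(expected - installed))
    print("unexpected:", sorted(installed - expected))
```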
Slide 22: Running the Middleware (EDG v1.x.x)
- EDG & Globus are R&D projects (but advertised, to some degree, as being for production)
- Many services are fragile (daily restarts)
- Complex fault patterns (every release created different ones); the "right" way to do things had to be discovered
- Developers had a strong focus on function, not operation (storage management, backup, accounting, traceability, resources, restarts, ...)
- The project provided fabric monitoring, but no service monitoring
- The wide-area effect
  - A simple configuration error on one site can bring the whole grid down
  - Finding errors without access to remote sites is tricky
- No central grid-wide operation (is there an EDG operations phone number?)
- Changes propagate slowly through the grid
- Big improvements due to the release procedures
Slide 23: Middleware Solutions
- Ad hoc creation of monitoring/restart tools (+/-); a watchdog sketch follows below
- Setting up multiple instances of a service (++)
- Giving feedback about missing functionality (+)
- Providing upgraded machines (+)
- EDG replaced a critical component (II -> BDII) (++)
- Release procedure (+++)
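The "ad hoc monitoring/restart tools" were essentially small watchdogs run from cron. A minimal sketch of the idea in Python follows; the service name, port and init-script path are hypothetical examples, not the actual EDG scripts.

```python
# Minimal watchdog sketch: if a service no longer answers on its TCP port,
# restart it via its init script. Service name, port and script path are
# hypothetical; the real EDG tools were site-specific shell scripts.
import socket
import subprocess
import sys

SERVICE = "edg-broker"                  # hypothetical service name
PORT = 7771                             # hypothetical port
INIT_SCRIPT = "/etc/init.d/edg-broker"  # hypothetical init script

def port_alive(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if port_alive("localhost", PORT):
        sys.exit(0)                     # service looks healthy, nothing to do
    print(f"{SERVICE} not responding on port {PORT}, restarting")
    subprocess.run([INIT_SCRIPT, "restart"], check=False)
```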
Slide 24: Operation: CERN CA (Certification Authority)
- Provides CERN users with X.509 PKI certificates
- In the past it was run in an amateur way; CERN IT-GD is converting it into a service (0.7 FTEs)
  - Much work on policies & procedures
  - Registration moved to the experiments (RAs)
  - Semi-automatic processing of large numbers of certificates
- Base certificates on Kerberos credentials?
  - We are exploring the automatic generation of short-lived certificates from Kerberos credentials (KCA); a prototype is running
  - Security (the KCA is online)?
  - Renewal of the proxy for long-running jobs (>12 h)? A lifetime-check sketch follows below.
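Short-lived certificates make lifetime checks routine. The sketch below shows one way to test whether a certificate or proxy file is about to expire, using the standard `openssl x509 -checkend` option; the file path is a placeholder, and this is not the EDG proxy-renewal service itself.

```python
# Sketch: warn when a certificate or proxy file is close to expiring.
# `openssl x509 -checkend N` exits non-zero if the certificate expires
# within N seconds. The path below is a placeholder.
import subprocess
import sys

CERT_FILE = "/tmp/x509up_u1001"   # placeholder proxy/certificate path
WARN_SECONDS = 12 * 3600          # warn if less than 12 hours of lifetime left

result = subprocess.run(
    ["openssl", "x509", "-in", CERT_FILE, "-noout",
     "-checkend", str(WARN_SECONDS)],
    capture_output=True, text=True,
)
if result.returncode != 0:
    print(f"{CERT_FILE}: expires within {WARN_SECONDS // 3600} hours, renew it")
    sys.exit(1)
print(f"{CERT_FILE}: still valid for more than {WARN_SECONDS // 3600} hours")
```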
Slide 25: Lessons Learned (EDG 1.x)
What matters:
- The number of different services
- The lack of stability and of service monitoring
- The number of different configurations
- Tracking down problems in a distributed system
- Fast responses to problems are expected (no planned activities)
- Manual interventions during setup
- Demonstrations, tutorials and production tests are important, but expensive
What doesn't matter:
- The number of nodes is secondary (scaling from 3 to 50 WNs)
The applications' wish list: reliability, stability, scalability.
Slide 26: More Lessons Learned
- Communication is a problem
  - A real problem for sysadmins: an extreme rate of mail from EDG users, developers and administrators (the integration-team list averages 7000 mails/year, i.e. about 19/day, with peaks of ~100/day)
- Installation and configuration tools are essential
- Grid core sites are very resource intensive
- Administrators need a detailed understanding of the services and their fault patterns
- User support needs a dedicated service
Slide 27: The Future is Now: EDG v2.x.x
- Integration:
  - Almost done (workload management under way)
  - Many of the issues of EDG v1 are now addressed
  - Better unit testing by the WPs (hope for faster integrated testing)
  - Better integrated teams (the ITeam knows the people better)
- Concerns for LCG-1 (July):
  - The core services are all new -> no experience with their operation
  - Up to now, no integrated stress/scale tests (remember 1.2 -> 1.4)
  - Some middleware services have single points of failure (upgrades during the next few months will fix this in some cases)
  - Some scalability problems (partially due to the Globus components)
Slide 28: Moving to Production (the LCG-GDB WG1-4 cover all of this in their final reports)
- Integrating existing fabrics
- Scalability
- Reliability / recovery / risk assessment
- Accounting, audit trails
- Security (local policies vs. the global grid)
- Operational aspects
- Application software distribution
- User support
- ...
I'll describe a few of these for illustration. LCG: the site where the sun never sets. We are just starting.
Slide 29: Integrating Existing Fabrics
- Network connectivity
  - EDG services require outgoing connectivity from all farm nodes
  - CERN plans to move all batch nodes to non-routed networks (NAT)
- Scheduler, batch systems
  - Local scheduler policies are too complex to be translated into the grid schema
  - EDG accounting is not compatible with the schedulers in use
- Installation procedures
  - EDG: most sites use LCFGng for services, but not all
  - CERN FIO will (for good reasons) not give up their procedures
  - Application software distribution and installation (ongoing)
- Authorization and authentication
  - Local system authentication is based on Kerberos, not PKI X.509
- Scale of the local fabric
  - Number of concurrent jobs: EDG a few hundred; CERN alone: 1500 running, 10k queued
Slide 30: Scaling Examples
- Job services
  - The middleware requires, for every queued or running job, a live process on the submission and gatekeeper nodes (RB and CE) (GRAM)
  - These processes communicate through 2-5 ports
  - CERN batch: 10k jobs queued + 1.5k running -> the CEs and RBs need >10k processes and >30k ports; 20 CE and 20 RB machines to keep 600 nodes going, and the farm will grow to 2k nodes (LSF needs far fewer than 10 nodes). A back-of-the-envelope sketch follows below.
- GASS cache
  - GRAM requires a shared file system between the CE and the WNs
  - It places more than 100 tiny files per job into this cache
  - If a job terminates due to some problem, the files stay (inode leak)
  - Solution: we try to replace the shared file system with a transaction at the start and end of the job (not perfect)
- Information system
  - The RB in EDG v2 still contacts the MDS-based GRIS on every matching CE
  - EDG v2 still uses grid-mapfile based authentication (heavy load on the CEs)
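The process and port estimate on this slide is simple arithmetic; the sketch below just reproduces it. The per-job inputs (one live process on each of the RB and CE, 2-5 ports) come from the slide; treating them as exact per-job constants is a simplification for illustration.

```python
# Back-of-the-envelope reproduction of the slide's scaling estimate.
# Inputs are taken from the slide; per-job constants are a simplification.
queued_jobs = 10_000
running_jobs = 1_500
jobs = queued_jobs + running_jobs

procs_per_job_rb = 1      # live process on the submission node (RB)
procs_per_job_ce = 1      # live process on the gatekeeper node (CE)
ports_per_job = 3         # slide says 2-5 ports; take a middle value

print("processes on RBs:", jobs * procs_per_job_rb)   # 11500 -> ">10k procs"
print("processes on CEs:", jobs * procs_per_job_ce)   # 11500
print("ports in use:    ", jobs * ports_per_job)      # 34500 -> ">30k ports"
```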
Slide 31: Reliability / Recovery / Risk Assessment
- Many services maintain state for running jobs
- We studied how the services keep their state and whether they can be re-instantiated
  - Wide variation between services (in memory, files, databases)
  - Much work is needed on saving state information and restarting (a generic sketch follows below)
- A risk assessment based on the cost of the computing has been done for several services
- First plan for hardening some of the services (hardware + software), e.g. using mirrored removable disks (KIS)
- EDG 2 in its first incarnation will contain incomplete releases of some core services, with only limited capability to handle network cuts and remote service failures
- New service-level monitoring for the new services is needed
- Planning the replication of services
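As a generic illustration of "saving state information and restarting" — not how any particular EDG service actually does it — here is a minimal checkpoint pattern: write the in-memory job table atomically to disk and reload it at start-up. The path and the job-table structure are hypothetical.

```python
# Generic checkpoint/restore sketch for a service that keeps job state in
# memory. Path and data structure are hypothetical.
import json
import os

STATE_FILE = "/var/lib/myservice/jobs.json"   # hypothetical location

def save_state(jobs):
    """Atomically persist the in-memory job table."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(jobs, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, STATE_FILE)   # atomic on POSIX: old state or new, never half

def load_state():
    """Reload the job table after a restart; start empty if nothing was saved."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

if __name__ == "__main__":
    jobs = load_state()
    jobs["job-42"] = {"status": "running", "ce": "ce01.example.org"}
    save_state(jobs)
```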
Slide 32: Accounting and Audit Trails
- No real shared production without accounting
  - Local collection of accounting data
  - Convert the information to an agreed XML-based format (see the sketch below)
  - An operation centre collects and re-distributes summaries
  - This can be automated at some point
  - CERN is working on the format definition and a sample implementation
- Audit trails
  - For EDG, the testbeds at CERN had been exempt from this
  - Production with remote users is imprudent without auditing
  - Many delicate privacy issues for trails that span sites
  - The quality of the middleware log files has to be improved
  - We (CERN) are developing a system to collect middleware and process audit trails locally and make them locally accessible (based on WP4 fabric monitoring)
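Since the slide says the XML accounting format was still being defined, the record below is purely illustrative: the element and attribute names are invented for the sketch and are not the agreed schema.

```python
# Sketch: turn one locally collected batch-accounting record into XML.
# Element and attribute names are invented for illustration; the real
# format was still being defined at the time of the talk.
import xml.etree.ElementTree as ET

def job_record_to_xml(record):
    """Build a single <jobUsageRecord> element from a dict of local data."""
    root = ET.Element("jobUsageRecord", site=record["site"], vo=record["vo"])
    ET.SubElement(root, "localJobId").text = record["local_job_id"]
    ET.SubElement(root, "user").text = record["grid_user_dn"]
    ET.SubElement(root, "cpuSeconds").text = str(record["cpu_seconds"])
    ET.SubElement(root, "wallSeconds").text = str(record["wall_seconds"])
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    sample = {
        "site": "CERN", "vo": "cms", "local_job_id": "lsf-123456",
        "grid_user_dn": "/O=Grid/CN=Some User",
        "cpu_seconds": 5400, "wall_seconds": 7200,
    }
    print(job_record_to_xml(sample))
```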
Slide 33: Operational Aspects — First Deployment & Upgrades (Middleware)
- We are trying to define a set of concrete procedures as a starting point (deployment, upgrades), based on our experience with the various EDG releases
- Introduce roles (which are only used during deployment)
  - Rotate them between persons and sites to spread knowledge
- List of conditions to be met for a meaningful deployment
  - State of the delivered software, documentation needed, site preparation, ...
- Introduce an operational hierarchy between sites
  - To limit, at each level, the amount of communication needed
  - Make incident reporting more effective
  - Due to proximity (time zones), communication is more effective
- Upgrades to non-backward-compatible versions: communication is problematic even between core sites
Slide 34: Operational Aspects
- Set up a daily phone conference? Plus IRC with published logs, plus e-mail.
Slide 35: Operational Aspects — Roles (straw man)
- Deployment Manager (DM) and Deputy Deployment Manager (DDM)
  - Based at a core site; assigned for a deployment or major upgrade
  - At rotation, either the DM or the DDM must have been in a deployment team before
  - The DM has to be a grid site manager of a core site
  - For a first deployment or a non-backward-compatible upgrade, they move to the same location
  - Deploys first and steers the process (announces the start, aborts, announces the return to operation)
- Core-site Deployment Manager (CDM)
  - Responsible for the local site
  - Steers the minor sites depending on that core site
  - Runs validation tests on the minor sites and registers them after a change
  - One CDM 8 hours away from the DM acts as a link to remote time zones
- Grid Site Manager (GSM)
  - Represents the site during the deployment
  - Controls or does the installation and configuration work
  - Only one GSM per site is visible to the CDM
- Communication: the GSM reports to the CDM, which escalates to the DM/DDM; the DM follows the same hierarchy downwards to avoid confusion
Slide 36: Operational Aspects — Structure for Deployment (a first guess)
- Operation centre: publishes the status of the deployment
- LCG-GDB czar: policies, versions, target date, ...
- Deployment team (DM, DDM)
- Grid-unique and replicated services
- 1-10 core sites (one CDM each), each with SE, IS, RB, UI, WN, CE and VO services
- 0-10 minor sites per core site (one GSM each), each with SE, UI, WN, CE
- Independent users with a UI
Slide 37: Summary
- The EDG 1 testbeds have been a perfect training ground
- The HEP experiments' production trials gave important feedback
- The EDG middleware and procedures showed tremendous improvement
- The EDG 2 services still have to be field tested
- There is some distance to go to get the LCG-1 service off the ground
The wish list now: reliability, stability, scalability — and manageability.