Belle Data Grid Deployment … Beyond the Hype
Lyle Winton, Experimental Particle Physics, University of Melbourne
eScience, December 2005
Belle Experiment
Belle is at KEK, Japan
– investigates symmetries in nature
– CPU and data requirements explosion! 4 billion events needed to be simulated in 2004 to keep up with data production
Belle MC production effort
– Australian HPC has contributed
Belle is an ideal case
– has real research data
– has a known application workflow
– has a real need for distributed access and processing
HEP Simulation (Monte Carlo)
Simulated collisions or events (using Monte Carlo techniques)
– used to predict what we’ll see (features of data)
– essential to support design of systems
– essential for analysis acceptances/efficiencies, fine tuning, and understanding uncertainties
Computationally intensive
– simulate beam particle collisions, interactions, and decays
– all components and materials (Belle is 10x10x20? m, ?000 tons, 100 µm accuracy)
– tracking and energy deposition through all components
– all electronics effects (signal shapes, thresholds, noise, cross-talk)
– data acquisition system (DAQ)
We need a ratio of greater than 3:1 for simulated:real data to reduce statistical fluctuations (see the sketch below).
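A back-of-the-envelope illustration of the 3:1 target, assuming simple Poisson counting statistics (this is not the collaboration's exact error model, and the data yield used here is a made-up number):

    # Illustrative only: why more simulated than real events helps.
    # With Poisson statistics the relative error on a count N is ~1/sqrt(N).
    # If the MC sample is k times the data sample, the MC contribution to the
    # combined (quadrature) uncertainty shrinks as k grows.
    from math import sqrt

    N_data = 1_000_000            # hypothetical real-data yield
    for k in (1, 3, 10):
        N_mc = k * N_data
        rel_err = sqrt(1.0 / N_data + 1.0 / N_mc)   # combined relative error
        penalty = rel_err / sqrt(1.0 / N_data)      # inflation vs. data-only error
        print(f"MC:data = {k}:1  ->  combined error x{penalty:.2f} of data-only")

    # At 3:1 the combined error is only ~1.15x the irreducible data error,
    # which is why ratios of at least 3:1 are targeted.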
Background
The general idea…
– investigation of Grid tools (Globus v1, v2, LCG)
– deployment to a distributed testbed
– utilisation of the APAC and partner facilities
– deployment to the APAC National Grid
Australian Belle Testbed
Rapid deployment at 5 sites in 9 days
– U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
– IBM Australia donated dual Xeon 2.6 GHz nodes
Belle MC generation of 1,000,000 events
– simulation and analysis demonstrated at PRAGMA4 and SC2003
Globus 2.4 middleware
Data management (see the transfer sketch after this list)
– Globus 2 replica catalogue
– GSIFTP
Job management
– GQSched (U.Melb Physics)
– GridBus (U.Melb CS)
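A minimal sketch of the kind of data movement used on the testbed: fetching a file from a GSIFTP server with the Globus 2 globus-url-copy client, driven from Python. The host name and file paths are hypothetical placeholders, and a valid proxy (grid-proxy-init) is assumed.

    # Sketch: pull a file over GSIFTP using the GT2 globus-url-copy client.
    import subprocess

    def fetch(src_url: str, dest_path: str) -> None:
        # Requires globus-url-copy on PATH and a valid Grid proxy.
        subprocess.run(
            ["globus-url-copy", src_url, f"file://{dest_path}"],
            check=True,
        )

    fetch("gsiftp://belle-se.example.edu.au/data/mc/run0001.mdst",
          "/scratch/run0001.mdst")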
Initial Production Deployment
Custom-built central job dispatcher
– initially used ssh and PBS commands, as we feared the Grid was unreliable (a sketch follows below)
– at the time only 50% of facilities were Grid accessible
SRB (Storage Resource Broker)
– transfer of input data: KEK → ANUSF → Facility
– transfer of output data: Facility → ANUSF → KEK
Successfully participated in Belle’s 4×10^9 event MC production during 2004
Now running on the APAC NG using LCG2/EGEE
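A sketch of the simple ssh + PBS dispatch approach described above: submit a pre-staged job script on a remote cluster head node and capture the PBS job id. The host, user, and script path are hypothetical.

    # Sketch: dispatch a batch job via ssh + qsub (PBS) on a remote head node.
    import subprocess

    def submit(host: str, job_script: str) -> str:
        """Run qsub on the remote head node and return the PBS job id."""
        result = subprocess.run(
            ["ssh", host, "qsub", job_script],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    job_id = submit("belle@hpc.example.edu.au", "/home/belle/jobs/evtgen_001.pbs")
    print("submitted", job_id)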
Issues
Deployment
– time consuming for experts
– even more time consuming for site admins with no experience
– requires loosening security (network, unknown services, NFS on exposed boxes)
– Grid services and clients generally require public IPs with open ports
Middleware/Globus bugs, instabilities, failures
– too many to list here
– errors, logs, and manuals are frequently insufficient
Distributed management
– version problems between Globus releases (e.g. globus-url-copy can hang)
– stable middleware is compiled from source, but OS upgrades can break it
– once installed, how do we keep it configured, considering growing numbers of users and communities (VOs) and expanding interoperable Grids (more CAs)?
Applications
– installing by hand at each site
– many require access to a DB or remote data while processing
– most clusters/facilities have private/off-internet compute nodes
Issues
Staging workarounds (a sketch follows below)
– GridFTP is not a problem; however, SRB is more difficult
– remote queues for staging (APAC NF)
– front-end node staging to a shared FS (via jobmanager-fork)
– front-end node staging via SSH
No national CA (for a while)
– started with an explosion of toy CAs
User access barriers
– user has a cert. from a CA … then what?
– access to facilities is more complicated (allocation/account/VO applications)
– then all the above problems start!
– Is Grid worth the effort?
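A sketch of the "front-end node staging via SSH" workaround: push input to the head node's shared filesystem before submission and pull output back afterwards. Hosts and paths are hypothetical.

    # Sketch: stage files in/out through the cluster head node with scp.
    import subprocess

    HEAD = "belle@hpc.example.edu.au"

    def stage_in(local_file: str, remote_dir: str) -> None:
        subprocess.run(["scp", local_file, f"{HEAD}:{remote_dir}/"], check=True)

    def stage_out(remote_file: str, local_dir: str) -> None:
        subprocess.run(["scp", f"{HEAD}:{remote_file}", local_dir], check=True)

    stage_in("run0001.in", "/shared/belle/stage")
    # ... submit the batch job and wait for completion here ...
    stage_out("/shared/belle/stage/run0001.out", "./results")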
Observations
Middleware
– Everything is fabric; lack of user tools!
– Initially only Grid fabric (low level), e.g. Globus2
– Application-level or 3rd-generation middleware, e.g. LCG/EGEE, VDT
  – overarching, joining, coordinating fabric
  – user tools for application deployment
– Everybody must develop additional tools/portals for everyday user access (non-expert)
– No out-of-the-box solutions
Real Data Grids!
– Many international big-science research collaborations are data focused
– This is not simply a staging issue!
– Jobs need seamless access to data (at the start, middle, and end of a job)
  – many site compute nodes have no external access
  – middleware cannot stage/replicate databases
  – in some cases file access is determined at run time (ATLAS)
– Currently jobs must be modified/tailored for each site – not Grid
Observations
Information Systems
– required for resource brokering and debugging problems
– MDS/GRIS/BDII are often unused (e.g. by Nimrod/G, GridBus), not because of the technology, but because they were:
  – never given a certificate
  – never started
  – never configured for the site (PBS etc.)
  – never configured to publish (GIIS or top-level BDII)
  – never checked (a sketch of a basic check follows below)
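A sketch of the kind of sanity check that was rarely done: query a site's GRIS/MDS endpoint and fail loudly if nothing comes back. This assumes an anonymously readable MDS 2.x GRIS on the conventional port 2135; the host name is a placeholder.

    # Sketch: check that a site's GRIS answers an anonymous LDAP query.
    import subprocess

    def gris_responds(host: str) -> bool:
        result = subprocess.run(
            ["ldapsearch", "-x", "-h", host, "-p", "2135",
             "-b", "mds-vo-name=local,o=grid"],
            capture_output=True, text=True,
        )
        # MDS 2.x entries carry attributes beginning with "Mds-".
        return result.returncode == 0 and "Mds-" in result.stdout

    print("GRIS OK" if gris_responds("gatekeeper.example.edu.au")
          else "GRIS is down or empty")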
Lessons/Recommendations
NEED tools to determine what's going on (debugging)
– jobs and scripts must have debug output/modes
– middleware debugging MUST be well documented: error codes and messages, troubleshooting, log files
– application middleware must be coded for failure!
  – service death, intermittent connection failure, data removal, proxy timeout, and hangs are all to be expected
  – all actions must include external retry and timeout (see the sketch below)
– information systems, e.g. queue is full, application not installed, not enough memory
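A minimal sketch of the "code for failure" rule: every external action is wrapped with a timeout and a bounded number of retries, and failures are logged with enough context to debug later. The retry counts, timeouts, and the example command are illustrative.

    # Sketch: wrap external actions with retry + timeout + logging.
    import logging
    import subprocess
    import time

    log = logging.getLogger("gridjob")

    def run_with_retry(cmd, attempts=3, timeout=600, backoff=60):
        """Run an external command; retry on non-zero exit, hang, or crash."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.run(cmd, check=True, timeout=timeout)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
                log.warning("attempt %d/%d failed for %s: %s",
                            attempt, attempts, cmd[0], err)
                time.sleep(backoff)
        log.error("giving up on %s after %d attempts", cmd[0], attempts)
        return False

    # e.g. a transfer that is known to hang occasionally:
    run_with_retry(["globus-url-copy",
                    "gsiftp://belle-se.example.edu.au/data/f", "file:///tmp/f"])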
Lessons/Recommendations
Quality and availability are key issues
Create service regression test scripts! (a sketch follows below)
– small config changes or updates can have big consequences
– run from the local site (tests services)
– run from a remote site (tests network)
Site validation/quality checks
– 1 – are all services up and accessible?
– 2 – can we stage-in + run + stage-out a baseline batch job?
– 3 – do the I.S. conform to minimum schema standards?
– 4 – are the I.S. populated, accurate, and up to date?
– 5 – repeat 1–4 regularly
Operational metrics are essential
– help determine stability and usability
– eventually provide justification for using the Grid
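A sketch of a service regression / site-validation script along the lines of checks 1 and 2 above: are the well-known GT2 service ports reachable, and does a trivial job run via the gatekeeper? The host name is a placeholder, the ports are the conventional GT2 defaults, and the baseline job here is just /bin/hostname rather than a full stage-in/stage-out test.

    # Sketch: basic site validation (service ports + baseline job).
    import socket
    import subprocess

    SITE = "gatekeeper.example.edu.au"
    SERVICES = {"GRAM gatekeeper": 2119, "GridFTP": 2811, "GRIS/MDS": 2135}

    def port_open(host, port, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, port in SERVICES.items():
        status = "up" if port_open(SITE, port) else "DOWN"
        print(f"[{status}] {name} on {SITE}:{port}")

    # Check 2 (simplified): run a trivial job via globus-job-run.
    baseline = subprocess.run(
        ["globus-job-run", SITE, "/bin/hostname"],
        capture_output=True, text=True,
    )
    print("[up]" if baseline.returncode == 0 else "[DOWN]", "baseline job:",
          baseline.stdout.strip() or baseline.stderr.strip())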
Lessons/Recommendations
Start talking to system/network admins early
– education about Grid, GSI, and Globus
– logging and accounting
– public IPs with a shared home filesystem
Have a dedicated node manager, for both OS and middleware
– don't underestimate the time required
– installation and testing: ~2–4 days for an expert, 5–10 days for a novice (with instruction)
– maintenance (testing, metrics, upgrades): ~1 day in 10
Have a middleware distribution bundle
– too many steps to do at each site
– APAC NG hoping to solve this with Xen VM images
Automate general management tasks (a sketch follows below)
– authentication lists (VO)
– CA files, especially CRLs
– host cert checks and imminent expiry warnings
– service up checks (auto restart?)
– file clean-up (GRAM logs, GASS cache?, GT4 persisted)
– BADG Installer: single-step, guided GT2 installation
– GridMgr: manages VOs, certs, CRLs
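A sketch of one automatable management task from the list above: warn when the host certificate is within 30 days of expiry, using the openssl CLI (x509 -checkend returns non-zero if the certificate expires within the given number of seconds). The certificate path is the conventional Grid location; adjust per site.

    # Sketch: imminent host-certificate expiry warning via openssl.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"
    WARN_SECONDS = 30 * 24 * 3600   # 30 days

    check = subprocess.run(
        ["openssl", "x509", "-in", CERT, "-noout", "-checkend", str(WARN_SECONDS)]
    )
    if check.returncode != 0:
        print(f"WARNING: {CERT} expires within 30 days (or is already expired)")
    else:
        print(f"{CERT} is valid for at least another 30 days")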
International Interoperability
HEP case study
– application groups had to develop coordinated dispatchers and adapters
  – researchers jumping through hoops -> in my opinion, a failure
– limited manpower, limited influence over implementation
– if we are serious we MUST allocate serious manpower and priority, with authority over Grid infrastructure
– minimal services and the same middleware are not enough
– test case applications are essential
– operational metrics are essential
Benefits
Access to resources
– funding to develop expertise and for manpower
– central expertise and manpower (APAC NG)
– other infrastructure (GrangeNet, APAC NG, TransPORT SX)
Early adoption has been important
– initially gave access to more infrastructure
– ability to provide experienced feedback
Enabling large-scale collaboration
– e.g. ATLAS produces up to 10 PB/year of data: 1800 people, 150+ institutes, 34 countries
– aim to provide low-latency access to data within 48 hrs of production