Belle Data Grid Deployment … Beyond the Hype
Lyle Winton, Experimental Particle Physics, University of Melbourne
eScience, December 2005
Belle Experiment
Belle at KEK, Japan
– investigates symmetries in nature
– CPU and data requirements explosion! 4 billion events needed to be simulated in 2004 to keep up with data production
Belle MC production effort – Australian HPC has contributed
Belle is an ideal case:
– has real research data
– has a known application workflow
– has a real need for distributed access and processing
HEP Simulation (Monte Carlo)
Simulated collisions or events (using Monte Carlo techniques)
– used to predict what we’ll see (features of the data)
– essential to support design of systems
– essential for analysis: acceptances/efficiencies, fine tuning, understanding uncertainties
Computationally intensive
– simulate beam particle collisions, interactions, decays
– all components and materials (Belle is 10x10x20? m, ?000 tons, 100 µm accuracy)
– tracking and energy deposition through all components
– all electronics effects (signal shapes, thresholds, noise, cross-talk)
– data acquisition system (DAQ)
– we need a ratio of greater than 3:1 for simulated:real data to reduce statistical fluctuations
Background
The general idea…
– investigation of Grid tools (Globus v1, v2, LCG)
– deployment to a distributed testbed
– utilisation of the APAC and partner facilities
– deployment to the APAC National Grid
Australian Belle Testbed
Rapid deployment at 5 sites in 9 days
– U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
– IBM Australia donated dual Xeon 2.6 GHz nodes
Belle MC generation of 1,000,000 events
Simulation and analysis demonstrated at PRAGMA4 and SC2003
Globus 2.4 middleware
Data management
– Globus 2 replica catalogue
– GSIFTP
Job management
– GQSched (U.Melb Physics)
– GridBus (U.Melb CS)
Initial Production Deployment
Custom-built central job dispatcher (sketched below)
– initially used ssh and PBS commands – we feared the Grid was unreliable
– at the time only 50% of facilities were Grid accessible
SRB (Storage Resource Broker)
– transfer of input data: KEK → ANUSF → Facility
– transfer of output data: Facility → ANUSF → KEK
Successfully participated in Belle’s 4×10⁹ event MC production during 2004
Now running on the APAC National Grid using LCG2/EGEE
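The dispatcher itself is not shown in the slides; the following is a minimal sketch, under assumed host names, queue names, and a hypothetical job script path, of how ssh + PBS submission of this kind can be driven from a central node:

```python
# Minimal sketch of a central dispatcher submitting Belle MC jobs over
# plain ssh + PBS, in the spirit of the pre-Grid fallback described above.
# Host names, queue names, and the script path are illustrative assumptions.
import subprocess

FACILITIES = {
    "anusf":   {"host": "gateway.anu.example.au", "queue": "normal"},
    "unimelb": {"host": "grid.ph.example.edu.au", "queue": "belle"},
}

def submit(facility, job_script="/home/belle/run_mc.sh", run_id=1):
    """Submit one MC production job via ssh + qsub; return the PBS job id."""
    site = FACILITIES[facility]
    remote_cmd = f"qsub -q {site['queue']} -v RUN_ID={run_id} {job_script}"
    result = subprocess.run(
        ["ssh", site["host"], remote_cmd],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return result.stdout.strip()          # e.g. "12345.pbsserver"

def poll(facility, job_id):
    """Return raw qstat output for the job (empty once it has left the queue)."""
    site = FACILITIES[facility]
    result = subprocess.run(
        ["ssh", site["host"], f"qstat {job_id}"],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout

if __name__ == "__main__":
    jid = submit("unimelb", run_id=42)
    print("submitted", jid)
```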
Issues
Deployment
– time consuming for experts
– even more time consuming for site admins with no experience
– requires loosening security (network, unknown services, NFS on exposed boxes)
– Grid services and clients generally require public IPs with open ports
Middleware/Globus bugs, instabilities, failures
– too many to list here
– errors, logs, and manuals are frequently insufficient
Distributed management
– version mismatches between Globus installations (eg. globus-url-copy can hang)
– stable middleware is compiled from source – but OS upgrades can break it
– once installed, how do we keep it configured, considering…
  growing numbers of users and communities (VOs)
  expanding interoperable Grids (more CAs)
Applications
– installing by hand at each site
– many require access to a DB or remote data while processing
– most clusters/facilities have private/off-internet compute nodes
Issues (continued)
Staging workarounds
– GridFTP is not a problem; SRB, however, is more difficult
– remote queues for staging (APAC NF)
– front-end node staging to a shared filesystem (via jobmanager-fork)
– front-end node staging via SSH (a sketch follows below)
No national CA (for a while)
– started with an explosion of toy CAs
User access barriers
– the user has a certificate from a CA … then what?
– access to facilities is more complicated (allocation/account/VO applications)
– then all the above problems start!
– Is Grid worth the effort?
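A rough illustration of the SSH-based front-end staging workaround, assuming a hypothetical front-end host and a shared filesystem path that is visible to the compute nodes:

```python
# Sketch of "front-end node staging via SSH": copy input data onto the
# cluster's shared filesystem through the publicly reachable front-end node,
# then let the batch job read it locally. Hosts and paths are assumptions.
import subprocess

FRONT_END = "frontend.facility.example.au"
SHARED_STAGE = "/shared/belle/stage"        # assumed NFS path seen by compute nodes

def stage_in(local_file, run_id):
    """Create a per-run directory on the shared area and copy the input there."""
    subprocess.run(["ssh", FRONT_END, f"mkdir -p {SHARED_STAGE}/run{run_id}"],
                   check=True, timeout=120)
    subprocess.run(["scp", local_file, f"{FRONT_END}:{SHARED_STAGE}/run{run_id}/"],
                   check=True, timeout=3600)

def stage_out(remote_file, local_dir="."):
    """Copy an output file back from the shared area via the front-end node."""
    subprocess.run(["scp", f"{FRONT_END}:{remote_file}", local_dir],
                   check=True, timeout=3600)
```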
Observations
Middleware
– everything is fabric – a lack of user tools!
– initially only Grid fabric (low level), eg. Globus2
– application-level or 3rd-generation middleware, eg. LCG/EGEE, VDT
  overarching, joining, coordinating fabric
  user tools for application deployment
– everybody must develop additional tools/portals for everyday (non-expert) user access
– no out-of-the-box solutions
Real Data Grids!
– many international big-science research collaborations are data focused
– this is not simply a staging issue!
– jobs need seamless access to data (at the start, middle, and end of a job)
  many site compute nodes have no external access
  middleware cannot stage/replicate databases
  in some cases file access is determined at run time (ATLAS)
– current jobs must be modified/tailored for each site – that is not Grid
Observations (continued)
Information systems
– required for resource brokering and for debugging problems
– MDS/GRIS/BDII are often unused (eg. by Nimrod/G, GridBus), not because of the technology, but because the services were:
  never given a certificate
  never started
  never configured for the site (PBS etc.)
  never configured to publish (to a GIIS or top-level BDII)
  never checked (a simple check is sketched below)
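As a minimal illustration of "never checked", a site can probe whether its GRIS/BDII actually publishes anything by querying the LDAP endpoint and counting entries. The host is a placeholder, and the port and base DN below follow common Globus/LCG defaults, which should be treated as assumptions for any particular site:

```python
# Rough check of whether a site's information system is populated at all:
# query the GRIS/BDII LDAP endpoint and count the returned entries.
# Port 2170 / base "o=grid" are typical BDII defaults (GRIS traditionally
# uses 2135 / "mds-vo-name=local,o=grid"); both are assumptions here.
import subprocess

def query_infosys(host, port=2170, base="o=grid"):
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{host}:{port}", "-b", base]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    entries = [line for line in result.stdout.splitlines() if line.startswith("dn:")]
    return len(entries)

if __name__ == "__main__":
    n = query_infosys("bdii.site.example.au")
    print("published entries:", n)   # zero usually means "never configured"
```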
Lessons/Recommendations
We NEED tools to determine what's going on (debugging)
– jobs and scripts must have debug output/modes
– middleware debugging MUST be well documented:
  error codes and messages
  troubleshooting
  log files
– application middleware must be coded for failure!
  service death, intermittent connection failure, data removal, proxy timeout, and hangs are all to be expected
  all actions must include external retry and timeout (a sketch follows below)
– information systems, eg. the queue is full, the application is not installed, not enough memory
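A sketch of the "external retry and timeout" recommendation: wrap an arbitrary external command (for example a globus-url-copy transfer that may hang) in a hard timeout and a bounded retry loop, logging enough to debug failures afterwards. The transfer URLs in the usage comment are placeholders:

```python
# Sketch of "all actions must include external retry and timeout".
import logging
import subprocess
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def run_with_retry(cmd, attempts=3, timeout=600, backoff=30):
    """Run cmd; kill it after `timeout` seconds; retry up to `attempts` times."""
    for i in range(1, attempts + 1):
        try:
            logging.info("attempt %d/%d: %s", i, attempts, " ".join(cmd))
            subprocess.run(cmd, check=True, timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            logging.warning("attempt %d timed out after %ds", i, timeout)
        except subprocess.CalledProcessError as err:
            logging.warning("attempt %d failed with exit code %d", i, err.returncode)
        if i < attempts:
            time.sleep(backoff)
    return False

# Example usage (URLs are placeholders):
# ok = run_with_retry(["globus-url-copy",
#                      "gsiftp://se.site.example/belle/in.dat",
#                      "file:///tmp/in.dat"])
```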
Lessons/Recommendations (continued)
Quality and availability are the key issues
Create service regression test scripts! (a sketch follows below)
– small config changes or updates can have big consequences
– run from the local site (tests services)
– run from a remote site (tests the network)
Site validation/quality checks
– 1 – are all services up and accessible?
– 2 – can we stage in, run, and stage out a baseline batch job?
– 3 – do the information systems conform to minimum schema standards?
– 4 – are the information systems populated, accurate, and up to date?
– 5 – repeat 1-4 regularly
Operational metrics are essential
– they help determine stability and usability
– they eventually provide justification for using the Grid
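A minimal sketch of checks 1 and 2 from the list above: probe the Globus service ports and run a trivial job through the gatekeeper with globus-job-run. The hostname is a placeholder, and the ports are the usual GT2 defaults, which a site may have changed:

```python
# Sketch of a site validation script: (1) are the basic Grid service ports
# reachable, (2) can a trivial fork job run through the gatekeeper.
import socket
import subprocess

SERVICES = {"gatekeeper": 2119, "gridftp": 2811, "gris": 2135}  # assumed defaults

def port_open(host, port, timeout=10):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def baseline_job(host):
    """Check 2 (simplified): run /bin/hostname via the gatekeeper and check output."""
    result = subprocess.run(["globus-job-run", host, "/bin/hostname"],
                            capture_output=True, text=True, timeout=300)
    return result.returncode == 0 and result.stdout.strip() != ""

if __name__ == "__main__":
    host = "grid.site.example.au"   # placeholder
    for name, port in SERVICES.items():
        print(f"{name:12s} port {port}: {'up' if port_open(host, port) else 'DOWN'}")
    print("baseline job:", "ok" if baseline_job(host) else "FAILED")
```

Run regularly from both the local and a remote site, and kept with its results, a script like this doubles as the operational metric the slide calls for.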
Lessons/Recommendations (continued)
Start talking to system/network admins early
– education about Grid, GSI, and Globus
– logging and accounting
– public IPs with a shared home filesystem
Have a dedicated node manager, for both OS and middleware
– don't underestimate the time required
– installation and testing ~ 2-4 days for an expert, 5-10 days for a novice (with instruction)
– maintenance (testing, metrics, upgrades) ~ 1/10 days
Have a middleware distribution bundle
– too many steps to do at each site
– APAC NG is hoping to solve this with Xen VM images
Automate general management tasks (an example is sketched below)
– authentication lists (VO)
– CA files, especially CRLs
– host cert checks and imminent expiry warnings
– service up checks (auto restart?)
– file clean up (GRAM logs, GASS cache?, GT4 persisted)
BADG Installer – single-step, guided GT2 installation: http://epp.ph.unimelb.edu.au/EPPGrid
GridMgr – manages VOs, certs, CRLs: http://epp.ph.unimelb.edu.au/EPPGrid
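One of the automation tasks above, host certificate expiry warnings, can be scripted with openssl's -checkend option; the certificate path below is the conventional Globus location and may differ per site:

```python
# Sketch of an automated host-certificate expiry check.
import subprocess
import sys

HOSTCERT = "/etc/grid-security/hostcert.pem"   # conventional Globus location
WARN_SECONDS = 14 * 24 * 3600                  # warn two weeks ahead

def cert_expiring(path=HOSTCERT, within=WARN_SECONDS):
    """Return True if the certificate expires within `within` seconds."""
    result = subprocess.run(
        ["openssl", "x509", "-in", path, "-noout", "-checkend", str(within)],
        capture_output=True, text=True,
    )
    # openssl exits 0 when the cert will NOT expire within the given period
    return result.returncode != 0

if __name__ == "__main__":
    if cert_expiring():
        print("WARNING: host certificate expires within 14 days", file=sys.stderr)
        sys.exit(1)
    print("host certificate OK")
```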
International Interoperability
HEP case study
– application groups had to develop coordinated dispatchers and adapters
  researchers jumping through hoops → in my opinion, a failure
– limited manpower, limited influence over implementation
– if we are serious we MUST allocate serious manpower and priority, with authority over Grid infrastructure
– minimal services and the same middleware are not enough
– test-case applications are essential
– operational metrics are essential
Benefits
Access to resources
– funding to develop expertise and for manpower
– central expertise and manpower (APAC NG)
– other infrastructure (GrangeNet, APAC NG, TransPORT SX)
Early adoption has been important
– initially, access to more infrastructure
– ability to provide experienced feedback
Enabling large-scale collaboration
– eg. ATLAS produces up to 10 PB/year of data
  1800 people, 150+ institutes, 34 countries
– aim to provide low-latency access to data within 48 hrs of production