Belle Data Grid Deployment … Beyond the Hype
Lyle Winton, Experimental Particle Physics, University of Melbourne
eScience, December 2005
Belle Experiment
Belle is at KEK, Japan
– investigates symmetries in nature
– CPU and data requirements explosion! 4 billion events needed to be simulated in 2004 to keep up with data production
Belle MC production effort
– Australian HPC has contributed
Belle is an ideal case
– has real research data
– has a known application workflow
– has a real need for distributed access and processing
HEP Simulation (Monte Carlo)
Simulated collisions or events (using Monte Carlo techniques)
– used to predict what we’ll see (features of data)
– essential to support design of systems
– essential for analysis acceptances/efficiencies, fine tuning, and understanding uncertainties
Computationally intensive
– simulate beam particle collisions, interactions, and decays
– all components and materials (Belle is 10x10x20? m, ?000 tons, 100 µm accuracy)
– tracking and energy deposition through all components
– all electronics effects (signal shapes, thresholds, noise, cross-talk)
– data acquisition system (DAQ)
We need a ratio of greater than 3:1 for simulated:real data to reduce statistical fluctuations (see the sketch below).
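A back-of-the-envelope illustration of the 3:1 target, assuming simple Poisson counting statistics (this is not the collaboration's exact error model, and the data yield used here is a made-up number):

    # Illustrative only: why more simulated than real events helps.
    # With Poisson statistics the relative error on a count N is ~1/sqrt(N).
    # If the MC sample is k times the data sample, the MC contribution to the
    # combined (quadrature) uncertainty shrinks as k grows.
    from math import sqrt

    N_data = 1_000_000            # hypothetical real-data yield
    for k in (1, 3, 10):
        N_mc = k * N_data
        rel_err = sqrt(1.0 / N_data + 1.0 / N_mc)   # combined relative error
        penalty = rel_err / sqrt(1.0 / N_data)      # inflation vs. data-only error
        print(f"MC:data = {k}:1  ->  combined error x{penalty:.2f} of data-only")

    # At 3:1 the combined error is only ~1.15x the irreducible data error,
    # which is why ratios of at least 3:1 are targeted.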
Background
The general idea…
– investigation of Grid tools (Globus v1, v2, LCG)
– deployment to a distributed testbed
– utilisation of the APAC and partner facilities
– deployment to the APAC National Grid
Australian Belle Testbed
Rapid deployment at 5 sites in 9 days
– U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
– IBM Australia donated dual Xeon 2.6 GHz nodes
Belle MC generation of 1,000,000 events
– simulation and analysis demonstrated at PRAGMA4 and SC2003
Globus 2.4 middleware
Data management (see the transfer sketch after this list)
– Globus 2 replica catalogue
– GSIFTP
Job management
– GQSched (U.Melb Physics)
– GridBus (U.Melb CS)
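A minimal sketch of the kind of data movement used on the testbed: fetching a file from a GSIFTP server with the Globus 2 globus-url-copy client, driven from Python. The host name and file paths are hypothetical placeholders, and a valid proxy (grid-proxy-init) is assumed.

    # Sketch: pull a file over GSIFTP using the GT2 globus-url-copy client.
    import subprocess

    def fetch(src_url: str, dest_path: str) -> None:
        # Requires globus-url-copy on PATH and a valid Grid proxy.
        subprocess.run(
            ["globus-url-copy", src_url, f"file://{dest_path}"],
            check=True,
        )

    fetch("gsiftp://belle-se.example.edu.au/data/mc/run0001.mdst",
          "/scratch/run0001.mdst")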
Initial Production Deployment
Custom-built central job dispatcher
– initially used ssh and PBS commands, as we feared the Grid was unreliable (a sketch follows below)
– at the time only 50% of facilities were Grid accessible
SRB (Storage Resource Broker)
– transfer of input data: KEK → ANUSF → Facility
– transfer of output data: Facility → ANUSF → KEK
Successfully participated in Belle’s 4×10^9 event MC production during 2004
Now running on the APAC NG using LCG2/EGEE
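A sketch of the simple ssh + PBS dispatch approach described above: submit a pre-staged job script on a remote cluster head node and capture the PBS job id. The host, user, and script path are hypothetical.

    # Sketch: dispatch a batch job via ssh + qsub (PBS) on a remote head node.
    import subprocess

    def submit(host: str, job_script: str) -> str:
        """Run qsub on the remote head node and return the PBS job id."""
        result = subprocess.run(
            ["ssh", host, "qsub", job_script],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    job_id = submit("belle@hpc.example.edu.au", "/home/belle/jobs/evtgen_001.pbs")
    print("submitted", job_id)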
Issues
Deployment
– time consuming for experts
– even more time consuming for site admins with no experience
– requires loosening security (network, unknown services, NFS on exposed boxes)
– Grid services and clients generally require public IPs with open ports
Middleware/Globus bugs, instabilities, failures
– too many to list here
– errors, logs, and manuals are frequently insufficient
Distributed management
– version problems between Globus releases (e.g. globus-url-copy can hang)
– stable middleware is compiled from source, but OS upgrades can break it
– once installed, how do we keep it configured, considering growing numbers of users and communities (VOs) and expanding interoperable Grids (more CAs)?
Applications
– installing by hand at each site
– many require access to a DB or remote data while processing
– most clusters/facilities have private/off-internet compute nodes
Issues
Staging workarounds (a sketch follows below)
– GridFTP is not a problem; however, SRB is more difficult
– remote queues for staging (APAC NF)
– front-end node staging to a shared FS (via jobmanager-fork)
– front-end node staging via SSH
No national CA (for a while)
– started with an explosion of toy CAs
User access barriers
– user has a cert. from a CA … then what?
– access to facilities is more complicated (allocation/account/VO applications)
– then all the above problems start!
– Is Grid worth the effort?
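A sketch of the "front-end node staging via SSH" workaround: push input to the head node's shared filesystem before submission and pull output back afterwards. Hosts and paths are hypothetical.

    # Sketch: stage files in/out through the cluster head node with scp.
    import subprocess

    HEAD = "belle@hpc.example.edu.au"

    def stage_in(local_file: str, remote_dir: str) -> None:
        subprocess.run(["scp", local_file, f"{HEAD}:{remote_dir}/"], check=True)

    def stage_out(remote_file: str, local_dir: str) -> None:
        subprocess.run(["scp", f"{HEAD}:{remote_file}", local_dir], check=True)

    stage_in("run0001.in", "/shared/belle/stage")
    # ... submit the batch job and wait for completion here ...
    stage_out("/shared/belle/stage/run0001.out", "./results")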
Observations
Middleware
– Everything is fabric; lack of user tools!
– Initially only Grid fabric (low level), e.g. Globus2
– Application-level or 3rd-generation middleware, e.g. LCG/EGEE, VDT
  – overarching, joining, coordinating fabric
  – user tools for application deployment
– Everybody must develop additional tools/portals for everyday user access (non-expert)
– No out-of-the-box solutions
Real Data Grids!
– Many international big-science research collaborations are data focused
– This is not simply a staging issue!
– Jobs need seamless access to data (at the start, middle, and end of a job)
  – many site compute nodes have no external access
  – middleware cannot stage/replicate databases
  – in some cases file access is determined at run time (ATLAS)
– Currently jobs must be modified/tailored for each site – not Grid
Observations
Information Systems
– required for resource brokering and debugging problems
– MDS/GRIS/BDII are often unused (e.g. by Nimrod/G, GridBus), not because of the technology, but because they were:
  – never given a certificate
  – never started
  – never configured for the site (PBS etc.)
  – never configured to publish (GIIS or top-level BDII)
  – never checked (a sketch of a basic check follows below)
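A sketch of the kind of sanity check that was rarely done: query a site's GRIS/MDS endpoint and fail loudly if nothing comes back. This assumes an anonymously readable MDS 2.x GRIS on the conventional port 2135; the host name is a placeholder.

    # Sketch: check that a site's GRIS answers an anonymous LDAP query.
    import subprocess

    def gris_responds(host: str) -> bool:
        result = subprocess.run(
            ["ldapsearch", "-x", "-h", host, "-p", "2135",
             "-b", "mds-vo-name=local,o=grid"],
            capture_output=True, text=True,
        )
        # MDS 2.x entries carry attributes beginning with "Mds-".
        return result.returncode == 0 and "Mds-" in result.stdout

    print("GRIS OK" if gris_responds("gatekeeper.example.edu.au")
          else "GRIS is down or empty")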
Lessons/Recommendations
NEED tools to determine what's going on (debugging)
– jobs and scripts must have debug output/modes
– middleware debugging MUST be well documented: error codes and messages, troubleshooting, log files
– application middleware must be coded for failure!
  – service death, intermittent connection failure, data removal, proxy timeout, and hangs are all to be expected
  – all actions must include external retry and timeout (see the sketch below)
– information systems, e.g. queue is full, application not installed, not enough memory
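A minimal sketch of the "code for failure" rule: every external action is wrapped with a timeout and a bounded number of retries, and failures are logged with enough context to debug later. The retry counts, timeouts, and the example command are illustrative.

    # Sketch: wrap external actions with retry + timeout + logging.
    import logging
    import subprocess
    import time

    log = logging.getLogger("gridjob")

    def run_with_retry(cmd, attempts=3, timeout=600, backoff=60):
        """Run an external command; retry on non-zero exit, hang, or crash."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.run(cmd, check=True, timeout=timeout)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
                log.warning("attempt %d/%d failed for %s: %s",
                            attempt, attempts, cmd[0], err)
                time.sleep(backoff)
        log.error("giving up on %s after %d attempts", cmd[0], attempts)
        return False

    # e.g. a transfer that is known to hang occasionally:
    run_with_retry(["globus-url-copy",
                    "gsiftp://belle-se.example.edu.au/data/f", "file:///tmp/f"])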
Lessons/Recommendations
Quality and availability are key issues
Create service regression test scripts! (a sketch follows below)
– small config changes or updates can have big consequences
– run from the local site (tests services)
– run from a remote site (tests network)
Site validation/quality checks
– 1 – are all services up and accessible?
– 2 – can we stage-in + run + stage-out a baseline batch job?
– 3 – do the I.S. conform to minimum schema standards?
– 4 – are the I.S. populated, accurate, and up to date?
– 5 – repeat 1–4 regularly
Operational metrics are essential
– help determine stability and usability
– eventually provide justification for using the Grid
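A sketch of a service regression / site-validation script along the lines of checks 1 and 2 above: are the well-known GT2 service ports reachable, and does a trivial job run via the gatekeeper? The host name is a placeholder, the ports are the conventional GT2 defaults, and the baseline job here is just /bin/hostname rather than a full stage-in/stage-out test.

    # Sketch: basic site validation (service ports + baseline job).
    import socket
    import subprocess

    SITE = "gatekeeper.example.edu.au"
    SERVICES = {"GRAM gatekeeper": 2119, "GridFTP": 2811, "GRIS/MDS": 2135}

    def port_open(host, port, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, port in SERVICES.items():
        status = "up" if port_open(SITE, port) else "DOWN"
        print(f"[{status}] {name} on {SITE}:{port}")

    # Check 2 (simplified): run a trivial job via globus-job-run.
    baseline = subprocess.run(
        ["globus-job-run", SITE, "/bin/hostname"],
        capture_output=True, text=True,
    )
    print("[up]" if baseline.returncode == 0 else "[DOWN]", "baseline job:",
          baseline.stdout.strip() or baseline.stderr.strip())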
Lessons/Recommendations
Start talking to system/network admins early
– education about Grid, GSI, and Globus
– logging and accounting
– public IPs with a shared home filesystem
Have a dedicated node manager, for both OS and middleware
– don't underestimate the time required
– installation and testing: ~2–4 days for an expert, 5–10 days for a novice (with instruction)
– maintenance (testing, metrics, upgrades): ~1 day in 10
Have a middleware distribution bundle
– too many steps to do at each site
– APAC NG hoping to solve this with Xen VM images
Automate general management tasks (a sketch follows below)
– authentication lists (VO)
– CA files, especially CRLs
– host cert checks and imminent expiry warnings
– service up checks (auto restart?)
– file clean-up (GRAM logs, GASS cache?, GT4 persisted)
– BADG Installer: single-step, guided GT2 installation
– GridMgr: manages VOs, certs, CRLs
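A sketch of one automatable management task from the list above: warn when the host certificate is within 30 days of expiry, using the openssl CLI (x509 -checkend returns non-zero if the certificate expires within the given number of seconds). The certificate path is the conventional Grid location; adjust per site.

    # Sketch: imminent host-certificate expiry warning via openssl.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"
    WARN_SECONDS = 30 * 24 * 3600   # 30 days

    check = subprocess.run(
        ["openssl", "x509", "-in", CERT, "-noout", "-checkend", str(WARN_SECONDS)]
    )
    if check.returncode != 0:
        print(f"WARNING: {CERT} expires within 30 days (or is already expired)")
    else:
        print(f"{CERT} is valid for at least another 30 days")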
International Interoperability
HEP case study
– application groups had to develop coordinated dispatchers and adapters
  – researchers jumping through hoops -> in my opinion, a failure
– limited manpower, limited influence over implementation
– if we are serious we MUST allocate serious manpower and priority, with authority over Grid infrastructure
– minimal services and the same middleware are not enough
– test case applications are essential
– operational metrics are essential
Benefits
Access to resources
– funding to develop expertise and for manpower
– central expertise and manpower (APAC NG)
– other infrastructure (GrangeNet, APAC NG, TransPORT SX)
Early adoption has been important
– initially gave access to more infrastructure
– ability to provide experienced feedback
Enabling large-scale collaboration
– e.g. ATLAS produces up to 10 PB/year of data: 1800 people, 150+ institutes, 34 countries
– aim to provide low-latency access to data within 48 hrs of production