Markus Schulz - LCG Deployment

Presentation transcript:

LHCC Review: Grid Middleware
Markus Schulz - LCG Deployment
February 2010, CERN

Overview
- Middleware(s)
- Computing Access
- Workload Management
- MultiUserPilotJob support
- Data Management
- Information System
- Infrastructure Monitoring
- Release Process
- Summary

Focus on:
- Changes since last year
- Issues
- Plans
Will not cover all components.

Middleware(s)
WLCG depends on three middleware stacks:
- ARC (NDGF): most sites in northern Europe, ~10% of WLCG CPUs
- OSG: most North American sites, >25% of WLCG CPUs
- gLite: used by the EGEE infrastructure
All are based on the same security infrastructure and all interoperate (via the experiments' frameworks).
Variety of SRM-compliant storage systems: BeStMan, dCache, StoRM, DPM, CASTOR, ...

Middleware(s)
All core components:
- Have been in production use for several years
- Evolved based on feedback during challenges and through the LCG Architects Forum
- Software stabilized significantly during the last year
Significant set of shared components: Condor, Globus, MyProxy, GSI-OpenSSH, BDII, VOMS, GLUE 1.3 (2) schema
All support at least SL4 and SL5; moved to 64-bit on SL5 (RHEL 5), with 32-bit libraries for compatibility
Differences:
- gLite strives to support complex workflows directly
- ARC focuses on simplicity and strong coupling of data and job control
- OSG (VDT) moves complexity into experiment-specific services

Computing Access: EGEE
Computing Elements (CE): gateways to farms
LCG-CE (~450 instances)
- Minor work on stabilization/scalability (50u/4KJ), bug fixes
- LEGACY SERVICE: no port to SL5 planned
CREAM-CE (69 instances, up from 26)
- Significant investment in production readiness and scalability
- Handles direct submission (pilot-job friendly); see the sketch below
- In production use by ALICE for more than 1 year
- Tested by all experiments (directly or via the WMS)
- SL4/SL5
- BES standard compliant, parameter passing between grid and batch
- Future: gLite Consortium, EMI
- Issue: slow uptake by sites
(Diagram: CE as gateway in front of the site batch system and worker nodes)
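
To illustrate what "direct submission" means in practice, the following sketch shows a pilot factory submitting straight to a CREAM-CE with the gLite CREAM command-line client. The endpoint, queue name and JDL content are invented for the example, and the client options are taken from common usage rather than from this talk.

    import subprocess

    # Minimal JDL for a pilot job (illustrative attribute values).
    JDL = """[
    Executable    = "pilot.sh";
    StdOutput     = "pilot.out";
    StdError      = "pilot.err";
    OutputSandbox = {"pilot.out", "pilot.err"};
    ]"""

    def submit_to_cream(ce="ce.example.org:8443/cream-pbs-long"):
        with open("pilot.jdl", "w") as f:
            f.write(JDL)
        # -a: automatic proxy delegation, -r: CE endpoint and queue
        out = subprocess.check_output(
            ["glite-ce-job-submit", "-a", "-r", ce, "pilot.jdl"])
        return out.strip()  # the CREAM job identifier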

Computing Access: ARC
ARC-CE (~20 instances): gateways to farms
- Improved scalability
- Moved to BDII and GLUE 1.3
- KnowARC features included in the release
- Support for pilot jobs
- Future: EMI

Computing Access: OSG
OSG-CE (Globus) (>50 instances): gateways to farms
- Several sites offer access to resources via pilot factories: local (automated) submission of pilot jobs
- Evaluation of the GT5 gatekeeper (~2 Hz, >2.5k jobs)
- Integration of CREAM and Condor(-G): in a test phase; planning the tasks and decisions that lead to deployment; review in mid-March
- Future: OSG/Globus

Workload Management: EGEE WMS/LB (124 instances)
- Matches resources and requests, including data location
- Handles failures (re-submission)
- Manages complex workflows
- Tracks job status
- Fully supports the LCG-CE and CREAM-CE; early versions had some WMS<->CREAM incompatibilities
- Several updates during the year; much improved stability and performance
- LCG VOs use only a small subset of the functionality
- Future: gLite Consortium / EMI
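
Requests are described to the WMS in JDL; the hedged sketch below shows how requirements and input-data location enter the match-making. Attribute names follow the gLite JDL conventions, but the file names and values are invented for illustration.

    import subprocess

    # Illustrative JDL: the Requirements expression and InputData list are
    # what the WMS match-making uses to pick a CE close to the data.
    JDL = """
    Executable         = "analysis.sh";
    StdOutput          = "job.out";
    StdError           = "job.err";
    InputSandbox       = {"analysis.sh"};
    OutputSandbox      = {"job.out", "job.err"};
    Requirements       = other.GlueCEPolicyMaxWallClockTime > 720;
    InputData          = {"lfn:/grid/atlas/user/example/events.root"};
    DataAccessProtocol = {"rfio", "gsidcap"};
    """

    with open("analysis.jdl", "w") as f:
        f.write(JDL)
    # -a: delegate the proxy automatically; the WMS picks a matching CE
    subprocess.check_call(["glite-wms-job-submit", "-a", "analysis.jdl"])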

MultiUserPilotJobs
Pilot jobs (PanDA, DIRAC, AliEn, ...):
- The framework sends pilot jobs to sites; they carry no "physics" workload
- When active, the pilot contacts the VO's task queue
- The experiment schedules a suitable job, moves it to the pilot, and executes it
- This is repeated until the maximum queue time is reached (see the schematic loop below)
MUPJs run workloads from different users, while the batch system is only aware of the pilot's identity:
- Flexibility for the experiment
- Conflicts with site security policies: lack of traceability, "leaks" between users
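
The pilot life cycle described above can be summarised as a schematic loop; the task-queue interface below is purely hypothetical and stands in for PanDA/DIRAC/AliEn-style servers.

    import subprocess
    import time

    MAX_WALL_TIME = 24 * 3600   # assumed limit of the batch slot, in seconds

    def run_pilot(task_queue):
        start = time.time()
        while time.time() - start < MAX_WALL_TIME:
            payload = task_queue.fetch_matching_job()   # ask the VO task queue for work
            if payload is None:
                time.sleep(60)                          # nothing matched, poll again later
                continue
            subprocess.call(payload.command)            # run the user workload in this slot
            task_queue.report_status(payload.job_id, "done")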

MultiUserPilotJobs
Remedy for this problem: change the UID/GID according to the workload.
Implementation:
- EGEE: glexec (setuid or logging-only mode) on the worker node, with the SCAS or ARGUS service handling authorization
- OSG: glexec / GUMS, in production for several years
glexec/SCAS is ready for deployment: scalability and stability tests passed, but it is deployed only at a few sites.
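
In setuid mode the pilot hands the payload owner's proxy to glexec, which maps it to a local account before running the payload. The sketch below is an assumption-laden illustration: the environment variable names (GLEXEC_CLIENT_CERT, GLEXEC_SOURCE_PROXY) and the glexec path follow common deployment documentation and may differ per site.

    import os
    import subprocess

    def run_payload_as_user(payload_cmd, user_proxy_path):
        env = dict(os.environ)
        # The payload owner's proxy: glexec authorizes it via SCAS or ARGUS
        # and switches to the mapped local UID/GID. (Variable names assumed.)
        env["GLEXEC_CLIENT_CERT"] = user_proxy_path
        env["GLEXEC_SOURCE_PROXY"] = user_proxy_path
        # glexec re-executes the payload under the mapped identity
        return subprocess.call(["/usr/sbin/glexec"] + payload_cmd, env=env)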

MultiUserPilotJobs
glexec/ARGUS:
- ARGUS is the new authorization framework for EGEE
- Much richer policy management than SCAS
- Certified; deployed at a few test sites
Both solutions have had little exposure to production and need some time to fully mature.
Future: glexec/SCAS/ARGUS with the gLite Consortium / EMI

Data Management
Storage Elements (SEs):
- External interfaces based on SRM 2.2 and GridFTP
- Local interfaces: POSIX, dcap, secure rfio, rfio, xrootd
- DPM (241), dCache (82), StoRM (40), BeStMan (26), CASTOR (19), "Classic SE" (27) - a legacy service for 2 years
Catalogue: LFC (local and global)
File Transfer Service (FTS)
Data management clients: GFAL / lcg_utils (see the example below)
(Diagram: SE at the site exposing SRM, GridFTP, rfio and xrootd to the worker nodes)
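
As a concrete illustration of the client layer, the snippet below uses the lcg_utils command-line tools to register a file on an SE and fetch a replica. The VO, LFN and SE host are placeholders, not values from this talk.

    import subprocess

    def copy_and_register(local_file, lfn, se_host, vo="atlas"):
        # lcg-cr copies a local file to the SE and registers it in the LFC
        subprocess.check_call([
            "lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn,
            "file:" + local_file])

    def fetch(lfn, dest, vo="atlas"):
        # lcg-cp resolves the LFN through the catalogue and copies a replica
        subprocess.check_call(["lcg-cp", "--vo", vo, lfn, "file:" + dest])

    # fetch("lfn:/grid/atlas/user/example/events.root", "/tmp/events.root")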

Data Management
Common problems:
- Scalability
- I/O operations: random I/O (analysis), bulk operations
- Synchronization between SEs and file catalogues
- Quotas
- VO-admin interfaces
All services improved significantly during the year.

Data Management examples: DPM and FTS
DPM
- Several bulk operations added
- Improved support for checksums
- RFIO improvements for analysis
- Improved xrootd support
- Next release: DPM 1.8 (end of April) with user banning and VO-admin capabilities
FTS
- Many bug fixes
- Improved monitoring
- Checksum support
- Next release: FTS 2.3 (end of April) with better handling of downtime and overload of storage elements, a move from "channels" to an SE representation in the databases, and an administrative web interface
- Longer term: support for small, non-SRM SEs (Tier-3)
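
For orientation, this is roughly how a transfer is handed to FTS 2.x from the command line (wrapped in Python here). The service endpoint and SURLs are placeholders, and the exact web-service path may differ per installation.

    import subprocess

    FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"

    def submit_transfer(src_surl, dst_surl):
        # glite-transfer-submit returns the identifier of the new transfer job
        return subprocess.check_output(
            ["glite-transfer-submit", "-s", FTS, src_surl, dst_surl]).strip()

    def transfer_state(job_id):
        # poll the job until FTS reports it done or failed
        return subprocess.check_output(
            ["glite-transfer-status", "-s", FTS, job_id]).strip()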

Data Management examples: CASTOR
Consolidation:
- CASTOR 2.1.9 deployed
- Improved monitoring with detailed indicators for stager and SRM performance
Next release: SRM v2.9 (February)
- Addresses SRM instabilities reported during the last run
- Improved monitoring as requested by the experiments
Observation: xroot access to CASTOR is sufficient for analysis
- Further improvements: tuning the ROOT client and the xroot servers
- Plan: deploy native xroot instances for analysis (low-latency storage)
- Discussion started on the dataflow: before summer disk only, after summer disk + backup

Data Management examples: dCache
- Introduced the Chimera name-space engine: improved scalability
- Released the "golden release" dCache 1.9.5: functionality will be stable during the first 12 months, with bug-fix releases as required
Plans (12 months):
- Multiple SRM front ends (improved file-open speed)
- NFS 4.1 (security has to be added); first performance tests are promising
- WebDAV (https)
- Integration with ARGUS
- Information system and monitoring

Data Management examples: StoRM
- Added a tape backend
- SRM 2.2 + WLCG extensions implemented
Future:
- dCache, StoRM, DPM, FTS, LFC, clients → EMI
- CASTOR → CERN
- BeStMan → OSG

Information System: BDII
Several updates during the year:
- Improved stability and scalability
- Support for the new GLUE 2 schema (an OGF standard), published in parallel to GLUE 1.3 to allow a smooth migration
- Better separation of "static" and "dynamic" information, opening the door to a new strategy towards scalability
Issues:
- Complex schema
- Wrong data published by sites
- Bootstrapping
Future: gLite Consortium / EMI
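
For reference, the BDII is an LDAP service (conventionally on port 2170) with the GLUE 1.3 tree rooted at "o=grid". The hedged sketch below queries a made-up top-level BDII for storage elements with the python-ldap module; host name and attribute selection are illustrative.

    import ldap  # python-ldap

    def list_storage_elements(bdii_host="lcg-bdii.example.org"):
        conn = ldap.initialize("ldap://%s:2170" % bdii_host)
        # Anonymous read; GlueSE entries describe the storage elements.
        results = conn.search_s(
            "o=grid", ldap.SCOPE_SUBTREE, "(objectClass=GlueSE)",
            ["GlueSEUniqueID", "GlueSEImplementationName"])
        return [attrs for _dn, attrs in results]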

Information System: Gstat 2.0
http://gstat-prod.cern.ch/gstat/stats/GRID/ALL
- Information system monitor and browser
- Consistency checks
- Solid implementation based on standard components
- Developed by CERN / Academia Sinica Taipei

Infrastructure Monitoring
- Distributed system based on standard technology: Nagios, Django
- ActiveMQ-based messaging infrastructure (see the sketch below)
- Integrated the existing SAM tests
- Uses MyOSG-based visualisation → MyEGEE
- Reflects the operational structure of EGI
- Replaces the SAM system, a "grown" central system
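
To make the messaging idea concrete, the sketch below publishes a single probe result to a STOMP-capable broker (ActiveMQ) using the stomp.py client. The broker host, port, destination name and message format are invented, and stomp.py call signatures vary between versions.

    import json
    import stomp  # stomp.py

    def publish_result(broker="msg-broker.example.org", port=6163):
        conn = stomp.Connection([(broker, port)])
        conn.connect(wait=True)  # anonymous connection assumed
        result = {"host": "ce.example.org", "service": "CREAM-CE",
                  "metric": "org.sam.CREAM-JobSubmit", "status": "OK"}
        conn.send(destination="/topic/grid.probe.metricOutput",
                  body=json.dumps(result))
        conn.disconnect()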

Release Process
Refined, component-based release process:
- Frequent releases (2-week intervals)
- Monitored process
- Fast rollback
Components have reached a high level of quality:
- Synthetic testing is limited; fast rollback limits the impact
- Staged rollout: final validation in production
Transition to Product Teams:
- Responsible for development, testing, integration and certification
- Based on project policies

General Evolution
Move to standard building blocks:
- ActiveMQ, Django, Nagios
- Globus GSI → OpenSSL
Data management:
- Cluster file systems as building blocks: StoRM, BeStMan, (DPM)
- Using standard clients: NFS 4.1
- Reducing complexity (FTS)
Workflow management and direct control by users:
- Direct submission of pilots to CREAM-CEs (no WMS)
Virtualization:
- Fabric/application independence
- User-controlled environments

Open Issues
EC-funded projects EGI/EMI:
- Not sufficient to continue all activities at the current level
- The rate of change can be reduced and some activities can be stopped
- Middleware support will depend more on community support; build and integration systems will be adapted to support this
Continuity:
- Significant staff rotation and reduction
- Uptake of new services is very slow
Development of a long-term vision:
- After 10 years a paradigm change might be due...

Summary
WLCG middleware handles the core tasks adequately.
Most developments are targeted at:
- Improved control: quotas, security, monitoring, VO-admin interfaces
- Improved recovery from problems: catalogue/SE resynchronization
- Simplification: move to standard components
- Performance improvements
Stability: how stable is the software?

Open Bugs / Usage
The number of open bugs is almost flat, while usage has increased exponentially (example: gLite).
(Plot: open bugs vs. usage over time, up to 2010)

Open Bugs per Million CPU Hours
(Plot: open bugs per million CPU hours, July 2005 - January 2010)