Published by Earl Bryan. Modified over 9 years ago.
SSC2 and Update on Multi-user Pilot Jobs Framework
Mingchao Ma, STFC – RAL
HEPSysMan Meeting, 20/06/2008
Slide 2 Security Service Challenge
What is it?
How does it work?
SSC 2 - UKI ROC experience
Slide 3 SSC - What is it?
“The goal of the LCG/EGEE Security Service Challenge, is to investigate whether sufficient information is available to be able to conduct an audit trace as part of an incident response, and to ensure that appropriate communications channels are available”
Like a fire drill!
Slide 4 SSC – Why and How?
To check whether the communication channels among the involved parties (sites, VOs, security contacts, etc.) are functioning;
An exercise for system admins in tracing users’ activities and getting to know the various log files;
Not intrusive – only ‘legal’ operations;
No penetration and no execution of exploits;
Conducted and monitored by the OSCT and ROC security officers;
–CERN challenges ALL Tier1 sites;
–The ROC security officer challenges the Tier2 sites within that ROC
Slide 5 Security Service Challenge
SSC 1: challenges the Workload Management System (WMS) on the Grid: Resource Broker (RB) and Compute Element (CE) (2005)
SSC 2: challenges the Storage Elements on the Grid (2007/2008)
SSC 3: challenges the Operational Diligence of the LCG/EGEE Grid Sites (ongoing)
https://twiki.cern.ch/twiki/bin/view/LCG/LCGSecurityChallenge
Slide 6 SSCs - UKI ROC
Security Service Challenge 2
–22 Tier2 sites (SEs) in the UKI ROC were challenged by the ROC security officer
Security Service Challenge 3
–RAL Tier1 was challenged by CERN on 06 March 2008
http://www.gridpp.ac.uk/security/ssc/
https://www.gridpp.ac.uk/security/ssc/ssc2/index.html
http://grid-deployment.web.cern.ch/grid-deployment/ssc/SSC_2/SSC_2_google.html
Slide 7 Security Service Challenge 2
Timeline
–From 21 January 2008 to 10 March 2008
–In total 22 sites (SEs) challenged
–Job submission: from 21 Jan. to 28 Jan.
–4-week (Feb. 2008) cool-down period
–GGUS ticket opened: 03 March 2008
–Challenge completed: 5pm, 10 March 2008
Slide 8 Security Service Challenge 2
Basic Statistics
–22 SEs/sites challenged, of which:
One site failed to run the challenge job;
One site opted out of the challenge due to a site rebuild;
One site is no longer part of the EGEE Grid;
Initial response received from 21 sites;
18 sites acknowledged the initial alert ticket within 24 hours;
2 sites acknowledged the ticket within 48 hours;
1 site acknowledged the ticket within 72 hours;
Slide 9 Security Service Challenge 2 - Result
Slide 10 Security Service Challenge 2
Preliminary Analysis
–All sites that responded (18) found some traces of the job activities and identified at least one SE operation
–The communication channel seems to work well;
Most sites acknowledged the ticket within 24 hours
1 site took up to 72 hours: a new staff member had no support role in GGUS and was therefore unable to answer the ticket
Slide 11 Security Service Challenge 2
Issues observed
–None of the 19 sites was able to identify the Lookup operation
–Some sites only provided raw logs (though the correct part of the log) with little or no analysis
–A few sites had missing logs (log files accidentally deleted due to misconfiguration; log retention of only one month, again due to misconfiguration; or log files lost during a system rebuild, etc.)
–SE logs (syntax and format) are still too complex; it seems to be very difficult to fully reconstruct some operations (site configuration, or insufficient log information?); too many log files!
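Tracing one user's activity across several log files, as the challenged sites were asked to do, largely amounts to filtering by identity and time window and then merging by timestamp. The Python sketch below illustrates that idea only. The log line format, the file names and the DNs are invented for illustration and do not correspond to any real SE log format, which the slide notes is considerably more complex.

```python
import re
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical log format: "<ISO timestamp> dn=<user DN> op=<operation> path=<file>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) dn=(?P<dn>.+?) op=(?P<op>\S+) path=(?P<path>\S+)$"
)

# Invented sample data standing in for separate log files on one SE.
SAMPLE_LOGS: Dict[str, List[str]] = {
    "transfer.log": [
        "2008-01-22T10:15:02 dn=/C=UK/O=Example/CN=Test User op=write path=/data/f1",
        "2008-01-22T10:17:40 dn=/C=UK/O=Example/CN=Other User op=read path=/data/f2",
    ],
    "namespace.log": [
        "2008-01-22T10:14:55 dn=/C=UK/O=Example/CN=Test User op=lookup path=/data/f1",
        "2008-01-22T10:20:11 dn=/C=UK/O=Example/CN=Test User op=delete path=/data/f1",
    ],
}


def trace_user(
    logs: Dict[str, List[str]], dn: str, start: datetime, end: datetime
) -> List[Tuple[datetime, str, str, str]]:
    """Return (timestamp, log file, operation, path) for one DN, in time order."""
    events = []
    for filename, lines in logs.items():
        for line in lines:
            match = LINE_RE.match(line)
            if match is None:
                continue  # skip lines that do not fit the assumed format
            ts = datetime.fromisoformat(match["ts"])
            if match["dn"] == dn and start <= ts <= end:
                events.append((ts, filename, match["op"], match["path"]))
    return sorted(events)


if __name__ == "__main__":
    timeline = trace_user(
        SAMPLE_LOGS,
        "/C=UK/O=Example/CN=Test User",
        datetime(2008, 1, 21),
        datetime(2008, 1, 28),
    )
    for ts, filename, op, path in timeline:
        print(ts.isoformat(), filename, op, path)
```

Merging by timestamp across files is what lets an operation recorded in only one log (such as the lookup in this invented data) appear in the reconstructed sequence.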
Slide 12 Multi-user Pilot Jobs Framework
Slide 13 What is a multi-user pilot job?
A multi-user pilot job, hereafter referred to simply as a pilot job, is a Grid job for which the following holds*:
–a Grid job is submitted with a set of credentials belonging to either a member of the VO or to a service owned and operated by the VO
–when this Grid job begins to execute at a Site, it pulls down and executes workload, hereafter called a user job, owned and submitted by a different member of the VO or multiple user jobs owned and submitted by multiple different members of the VO
*Policy on Grid Multi-User Pilot Jobs https://edms.cern.ch/cedar/plsql/doc.info?cookie=7587020&document_id=855383&version=1
Slide 14 Pilot Jobs Framework
A VO/Experiment-specific Workload Management System (WMS):
–CMS glideinWMS http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
–LHCb DIRAC WMS http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
–ATLAS PanDA https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA
–ALICE ???
Slide 15 A Simplified Diagram
[Diagram. Components: End User; Central Job Repository/VO-Specific WMS; VOMS Server; MyProxy Server; Others; Site 1 and Site 2, each with Worker Node(s) running Pilot Job, Glexec and User Job. Arrow labels: "Jobs + Proxy"; "Submit Pilot Job + Pilot Proxy"; "Get User Jobs & User Proxy".]
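The pull model in the diagram can be sketched in a few lines of Python. This is an illustrative model only, under stated assumptions: `CentralTaskQueue`, `UserJob`, `run_as_user` and `pilot` are hypothetical names, credentials are plain strings, and the identity switch that glexec performs on a real worker node is represented by a placeholder function rather than by any real glexec call.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass
class UserJob:
    owner_dn: str    # identity of the VO member who submitted the job
    command: str     # payload description (not executed in this sketch)
    user_proxy: str  # placeholder for the user's delegated credential


class CentralTaskQueue:
    """Hypothetical stand-in for a central job repository / VO-specific WMS."""

    def __init__(self) -> None:
        self._jobs: Deque[UserJob] = deque()

    def submit(self, job: UserJob) -> None:
        # "Jobs + Proxy": the end user hands over the job and a proxy.
        self._jobs.append(job)

    def fetch(self, pilot_dn: str) -> Optional[UserJob]:
        # "Get User Jobs & User Proxy": a real WMS would authenticate the
        # pilot here; this sketch only hands out the next queued job.
        return self._jobs.popleft() if self._jobs else None


def run_as_user(job: UserJob) -> Tuple[str, str]:
    """Placeholder for the identity switch shown as Glexec in the diagram."""
    return (job.owner_dn, job.command)


def pilot(queue: CentralTaskQueue, pilot_dn: str, max_jobs: int = 10) -> List[Tuple[str, str]]:
    """A pilot submitted with pilot credentials pulls and runs user jobs."""
    executed = []
    for _ in range(max_jobs):
        job = queue.fetch(pilot_dn)
        if job is None:
            break  # no more work for this pilot
        executed.append(run_as_user(job))
    return executed


if __name__ == "__main__":
    wms = CentralTaskQueue()
    wms.submit(UserJob("/CN=User A", "analysis-1", "proxy-a"))
    wms.submit(UserJob("/CN=User B", "analysis-2", "proxy-b"))
    print(pilot(wms, "/CN=VO Pilot Service"))
```

The point of the sketch is that a single pilot, running under one set of credentials, ends up executing work owned by several different users, which is why the identity change on the worker node matters for accounting and traceability.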
Slide 16 Pilot Job Frameworks Review Workgroup
GDB working group mandated by WLCG MB on Jan. 22, 2008
Mission
–Review security issues in the pilot job framework of each experiment
Pilot jobs are taken as multi-user in this context
–Define a minimum set of security requirements
–Advise on improvements
Per framework or common to all
–Report to GDB and MB
Time frame is a few months
Members
–ALICE: Predrag Buncic
–ATLAS: Torre Wenaus
–CMS: Igor Sfiligoi
–LHCb: Andrei Tsaregorodtsev
–WLCG: Maarten Litmaath (chair)
–EGEE: David Groep
–FNAL: Eileen Berman
–GridPP: Mingchao Ma
–OSG: Mine Altunay
* Content from Maarten Litmaath, GDB, 2008/06/11
Slide 17 Questionnaire
Describe in a schematic way all components of the system.
–If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get.
Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage.
What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent?
How can running processes be accounted to the correct user?
How is a task spawned on the WN and how is it destroyed?
How can a site be blocked?
Slide 18 Questionnaire (cont.)
What site security processes are applied to the machine(s) running the WMS?
–Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access?
–How are authorized individuals authenticated on the machine(s)?
–What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches?
–Do you have an identified security contact?
–Describe the incident response plan to deal with security incidents and reports of unauthorized use.
–What services (in general) run on the machine(s) that offer the WMS service?
–What processes exist to maintain audit logs (e.g. for use during an incident)?
–What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use?
Can you limit the users that can submit jobs to the VO WMS? How?