SAZ Server/Service Update 17-May-2010 Keith Chadwick, Neha Sharma, Steve Timm.


Why a new SAZ Server?

The current SAZ server (V2_0_1b) has shown itself extremely vulnerable to user-generated authorization “tsunamis”:
– Very short duration jobs.
– A user issues condor_rm on a large (>1000) glidein.
This is fixed in the new SAZ server (V2_7_0), which runs under Tomcat and uses pools of execution and Hibernate threads (see the sketch below).
We have found and fixed various other bugs in the current SAZ server and sazclient.
We want to add support for the XACML protocol (used by Globus).
– We will NOT transition to using XACML (yet).
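
As a rough illustration of why a bounded thread pool helps against tsunamis, here is a minimal sketch: a fixed-size executor queues excess authorization requests instead of letting each one claim its own thread and file descriptors. The class name, pool size and request type are hypothetical, not the actual SAZ V2_7_0 code.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /* Hypothetical sketch: bound the number of concurrently processed
     * authorization requests so that a burst queues up instead of
     * exhausting threads and file descriptors. */
    public class AuthorizationPoolSketch {

        // Fixed-size execution pool (32 is an arbitrary illustrative size).
        private final ExecutorService executionPool = Executors.newFixedThreadPool(32);

        /** Submit one authorization decision task; excess tasks wait in the queue. */
        public Future<Boolean> submit(Callable<Boolean> authorizationTask) {
            return executionPool.submit(authorizationTask);
        }

        public void shutdown() {
            executionPool.shutdown();
        }
    }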

Old (“current”) SAZ Protocol – Port 8888

Client sends the “entire” proxy to the SAZ server via port 8888.
Server parses out DN, VO, Role, CA.
– In SAZ V2.0.0b, the parsing logic does not work well, and the SAZ server frequently has to invoke the voms-proxy-info shell script to parse the proxy (see the sketch below).
– In the new SAZ V????, the parsing logic has been completely rewritten, and it no longer has to invoke voms-proxy-info to parse the proxy.
Server performs MySQL queries.
Server constructs the answer and sends it to the client.
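
For concreteness, a minimal sketch of the fallback path described above – shelling out to voms-proxy-info from Java to pull attributes out of a proxy. The -file, -subject and -fqan flags are the usual voms-proxy-info options and are assumed here; the slides do not show the actual SAZ parsing code, and the proxy path is hypothetical.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    /* Hypothetical sketch of invoking voms-proxy-info to extract proxy attributes. */
    public class VomsProxyInfoSketch {

        static String run(String proxyPath, String attributeFlag) throws IOException, InterruptedException {
            Process p = new ProcessBuilder("voms-proxy-info", "-file", proxyPath, attributeFlag)
                    .redirectErrorStream(true)
                    .start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line = r.readLine();      // e.g. the subject DN or an FQAN
                p.waitFor();
                return line;
            }
        }

        public static void main(String[] args) throws Exception {
            String dn   = run("/tmp/x509up_u500", "-subject");  // DN
            String fqan = run("/tmp/x509up_u500", "-fqan");     // VO and Role come from the FQAN
            System.out.println(dn + " " + fqan);
        }
    }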

New SAZ (XACML) Protocol – Port 8443

Client parses out DN, VO, Role, CA and sends that information via XACML to the SAZ server via port 8443.
Server performs MySQL queries.
Server constructs the answer and sends it to the client.
The new SAZ server supports both the 8888 and 8443 protocols simultaneously.
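
As a rough illustration of what “sends the information via XACML” means, the sketch below builds a minimal XACML 2.0-style request context carrying the four values. The subject-id attribute identifier is the standard XACML one; the vo/role/ca identifiers are placeholders, not the identifiers actually used by SAZ or Globus, and the real exchange over SSL on port 8443 is not shown.

    /* Illustrative only: a minimal XACML 2.0-style request carrying DN, VO, Role and CA. */
    public class XacmlRequestSketch {

        static String attribute(String id, String value) {
            return "    <Attribute AttributeId=\"" + id + "\""
                 + " DataType=\"http://www.w3.org/2001/XMLSchema#string\">\n"
                 + "      <AttributeValue>" + value + "</AttributeValue>\n"
                 + "    </Attribute>\n";
        }

        static String request(String dn, String vo, String role, String ca) {
            return "<Request xmlns=\"urn:oasis:names:tc:xacml:2.0:context:schema:os\">\n"
                 + "  <Subject>\n"
                 + attribute("urn:oasis:names:tc:xacml:1.0:subject:subject-id", dn)
                 + attribute("http://example.org/saz/attributes/vo",   vo)   // placeholder id
                 + attribute("http://example.org/saz/attributes/role", role) // placeholder id
                 + attribute("http://example.org/saz/attributes/ca",   ca)   // placeholder id
                 + "  </Subject>\n"
                 + "  <!-- Resource/Action/Environment elements omitted for brevity -->\n"
                 + "</Request>\n";
        }

        public static void main(String[] args) {
            System.out.print(request("/DC=org/DC=doegrids/OU=People/CN=Example User",
                                     "fermilab", "Analysis", "DOEGrids CA"));
        }
    }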

Comparison of Old & New Protocol (diagram)

SAZ clients connect either on port 8888 (sending the whole proxy, which the Tomcat container parses for DN, VO, Role, CA) or on port 8443 (sending the already-parsed DN, VO, Role, CA). The SAZ server then queries MySQL, constructs the decision, logs the decision, and transmits it back to the client.

Some Definitions

Width = # of processes doing SAZ calls per slot or system.
Depth = # of SAZ calls.
Current = SAZ V2.0.0b
– The currently deployed version of the SAZ server.
New = New SAZ Server
– It handled small authorization tsunamis well.
– It was vulnerable to large (~1000) authorization tsunamis (running out of file descriptors).
Fixed = “Fixed” New SAZ Server
– It has handled extra-large (5000) authorization tsunamis without incident (ulimit raised to deal with the large number of open files).
– It also has a greatly improved CRL access algorithm.
All of the tests are run on/against SAZ servers on the fgtest systems:
– fgtest[0-6] are FORMER production systems, 4+ years old, non-redundant.
– Current production systems are at least 2x faster and redundant.
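
Worked example of the shorthand used on the following slides: a test set written NxWxD means N jobs, each running W processes (width) that each make D SAZ calls (depth). So 1000x1x50 is 1,000 × 1 × 50 = 50,000 authorizations in total, and 5000x1x50 is 250,000 authorizations.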


Tsunami Testing

Using fgtest systems (4+ years old).
Submit the first set of SAZ tests:
– jobs=50, width=1, depth=1000
Wait until all jobs are running, then trigger authorizations of the first test set.
Submit the second set of SAZ tests, either:
– jobs=1000, width=1, depth=50, or
– jobs=5000, width=1, depth=50
Wait until all jobs are running, then trigger authorizations of the second test set.
Measure the elapsed time for the first and second sets (a sketch of one test set follows).
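
A minimal sketch of what one test set amounts to under the width/depth definitions above: each submitted job starts width threads, each thread makes depth authorization calls, and the harness measures the elapsed time. The class and the sazAuthorize() stand-in are hypothetical; the real tests are driven by the saz_tsunami.sh script shown on later slides.

    import java.util.concurrent.CountDownLatch;

    /* Hypothetical sketch of one job in a tsunami test set. */
    public class TsunamiSetSketch {

        static boolean sazAuthorize() {            // stand-in for the real sazclient call
            return true;
        }

        static long runSet(int width, int depth) throws InterruptedException {
            CountDownLatch done = new CountDownLatch(width);
            long start = System.currentTimeMillis();
            for (int w = 0; w < width; w++) {
                new Thread(() -> {
                    for (int d = 0; d < depth; d++) {
                        sazAuthorize();            // one SAZ authorization per call
                    }
                    done.countDown();
                }).start();
            }
            done.await();                          // wait for every thread to finish
            return System.currentTimeMillis() - start;   // elapsed milliseconds
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("elapsed ms: " + runSet(1, 50));
        }
    }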

Results of the SAZ Tsunami Tests

Current SAZ (V2_0_1b) – fgt5x3:
– Base=50x1x1000, Tsunami=1000x1x50
– Immediate failure.
New SAZ (V2_7_0) – fgt6x3:
– Base=50x1x1000, Tsunami=1000x1x50
– Ran for a few minutes, then failed (“too many open files”).
Fixed New SAZ (V2_7_0) – fgt6x3:
– Base=50x1x1000, Tsunami=1000x1x50
  Ran without incident. Average of … authorizations/second. Total elapsed time was 11,205 seconds (3.11 hours).
– Base=50x1x1000, Tsunami=5000x1x50
  Ran without incident. Average >15 authorizations/second, peak >22 authorizations/second. Total elapsed time was ~8 hours to process the base+tsunami load.

SAZ Tsunami Profile (5000x1x50) (plot)


Another Tsunami (5000x1x50) (plot)

Load on fgdevsaz.fnal.gov (plot)

“Real World” Scale Tsunamis

The previous tests, using 5000x1x50 = 250,000-authorization tsunamis, are well beyond actual real-world experience.
Most authorization tsunamis have been caused by a user issuing condor_rm on a set of glide-in jobs.
So comparison tests of fgdevsaz and fgt5x3 were run with a “real world” scale authorization tsunami – 5000x1x5 = 25,000 authorizations.

fgt5x3 – tsunami 5000x1x5 (plot)

Black = number of condor jobs; Red = number of SAZ network connections.
Started 11:45:12; failures began at 11:45:19.
25,000 authorizations: 14,183 successes, 10,817 failures.
Finished 11:58:28; elapsed time 13m 16s.

fgdevsaz – tsunami 5000x1x5 (plot)

Black = number of condor jobs; Red = number of SAZ network connections.
Started 00:46:20; 25,000 authorizations: 25,000 successes, 0 failures.
Finished 01:05:03; elapsed time = 18m 43s; … authorizations/sec.
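
For reference, 25,000 authorizations in 18m 43s (1,123 seconds) works out to an average of roughly 22 authorizations per second.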

What’s Next?

Formal Change Management Request:
– Risk Level 4 (Minor Change).
For build & test, we propose to deploy the Fixed New SAZ service (V2_7_0) on the pair of dedicated CDF Sleeper Pool SAZ servers.
– This will allow us to benchmark the Fixed New SAZ service on a deployment that substantially matches the current production SAZ service (LVS, redundant SAZ servers).
For release, we propose to upgrade one SAZ server at a time:
– VM operating system “upgrade” from 32-bit to 64-bit.
– Install the Fixed New SAZ server (V2_7_0).
– Verify functionality before moving on to the next server.

Proposed Schedule of Changes

SAZ Servers                           Server 1       Server 2
CDF Sleeper Pool                      27-Apr-2010    …-Apr-2010
CMS Grid Cluster Worker Node          11-May-2010    …-May-2010
CDF Grid Cluster Worker Node          12-May-2010    …-May-2010
GP Grid Cluster Worker Node           18-May-2010    …-May-2010
D0 Grid Cluster Worker Node           19-May-2010    …-May-2010
Central SAZ Server for Gatekeepers    25-May-2010    …-May-2010

Note 1: All of the above dates are tentative, subject to approval by the Change Management Board and the corresponding stakeholder.
Note 2: FermiGrid-HA will maintain continuous service availability during these changes.

Single client SAZ V2_7_0 Performance (plot)

Multiple client SAZ V2_7_0 Performance (plot)

cdfsaz:8882 – SAZ V2.7.1 saz_tsunami.sh cdfsaz

tsunami]$ ./saz_tsunami.sh cdfsaz
[./saz_tsunami.sh] - called with parameters cdfsaz on fgtest0.fnal.gov
[./saz_tsunami.sh] - sazclient=/usr/local/vdt /sazclient/bin/sazclient.new
Waiting for jobs to complete...
====================================
Width=100 Depth=100 Sleep=153 Calls=10000 Duration=153 Success=9960 Communication=0
====================================
Failure rate = 40/10,000 = 0.40%

cdfsaz:8882 – SAZ V2.7.1 saz_tsunami.sh cdfsaz

(Per-call failure table – columns: Call #, Fail Count, Total, Remain, Fail %; the numeric values were not preserved in the transcript.)

cdfsaz:8882 – SAZ V2.7.1 saz_tsunami.sh cdfsaz (plot)

cdfsaz:8882 – SAZ V2.7.1 saz_tsunami.sh cdfsaz (plot)

Updated SAZ Server Issues – Release Summary

60-second service shutdown – can cause lots of improper authorization rejections:
– The shutdown duration is set by Tomcat.
– Looking into mechanisms to improve this.
– Adding a temporary iptables rule at service shutdown appears promising.
Authorization failures due to Hibernate reporting stale database/IP connections:
– Already use a try/catch method.
– Will switch to a try/catch/retry with maximum loop count method (see the sketch below).
Query performance with Hibernate:
– Observed number of database connections is higher than expected from Hibernate.
– Possibly caching is not being used effectively.
– We are investigating (not a blocker for release).
Optimization of Hibernate parameters:
– We are reviewing all of the Hibernate configuration parameters to see if there are better values (not a blocker for release).
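
A minimal sketch of the try/catch/retry-with-maximum-loop-count pattern referred to above. The exception type, retry limit and query interface are illustrative rather than the actual SAZ/Hibernate code.

    /* Hypothetical sketch: retry a query a bounded number of times when a
     * stale database/IP connection makes the first attempt fail. */
    public class RetryQuerySketch {

        /** Stand-in for a Hibernate-backed SAZ query. */
        interface Query<T> { T run() throws Exception; }

        static <T> T withRetry(Query<T> query, int maxAttempts) throws Exception {
            for (int attempt = 1; ; attempt++) {
                try {
                    return query.run();            // normal case: succeed on the first try
                } catch (Exception stale) {        // e.g. a stale connection reported by Hibernate
                    if (attempt >= maxAttempts) {
                        throw stale;               // give up once the loop limit is reached
                    }
                    // otherwise loop around and retry the query
                }
            }
        }
    }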

SAZ V2.7.2

Revised Hibernate configuration.
SAZ tsunami: 5000 jobs x 1 thread/job x 250 authorizations/thread.

./saz_analyze.sh cdfsaz /
Duration - min=18646, max=19571, avg= , count=5000, total=
Sleep    - min=25789, max=34153, avg=30508, count=5000, total=
Success  - min=249, max=250, avg= , count=5000, total=
Average Rate = authorizations/second.
4 failures out of 1,250,000 authorizations.
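
(For reference, 4 failures out of 1,250,000 authorizations is a failure rate of roughly 0.0003%.)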

SAZ V2.7.2 Failures

[/grid/testhome/fnalgrid/.globus/.gass_cache/…/data] - called with parameters cdfsaz /grid/data/saz-tests/tsunami on fcdfcaf1002.fnal.gov
saz_tsunami.out : PID: SAZ ERROR GSS.c:110 Handshake Failed... major_status != GSS_S_COMPLETE
saz_tsunami.out : PID: SAZ ERROR SAZProtocol.c:64 handshake failed
saz_tsunami.out : Thu May 13 23:58:13 CDT sazclient=fail
saz_tsunami.out : results - loop=14275, call=250, pass=249, fail=1, comm=0, elapsed=19140

[/grid/testhome/fnalgrid/.globus/.gass_cache/…/data] - called with parameters cdfsaz /grid/data/saz-tests/tsunami on fcdfcaf1007.fnal.gov
saz_tsunami.out : PID: SAZ ERROR GSS.c:110 Handshake Failed... major_status != GSS_S_COMPLETE
saz_tsunami.out : PID: SAZ ERROR SAZProtocol.c:64 handshake failed
saz_tsunami.out : Thu May 13 22:58:18 CDT sazclient=fail
saz_tsunami.out : results - loop=8505, call=250, pass=249, fail=1, comm=0, elapsed=19414

[/grid/testhome/fnalgrid/.globus/.gass_cache/…/data] - called with parameters cdfsaz /grid/data/saz-tests/tsunami on fcdfcaf1029.fnal.gov
saz_tsunami.out : PID: SAZ ERROR GSS.c:110 Handshake Failed... major_status != GSS_S_COMPLETE
saz_tsunami.out : PID: SAZ ERROR SAZProtocol.c:64 handshake failed
saz_tsunami.out : Thu May 13 23:58:13 CDT sazclient=fail
saz_tsunami.out : results - loop=13754, call=250, pass=249, fail=1, comm=0, elapsed=19519

[/grid/testhome/fnalgrid/.globus/.gass_cache/…/data] - called with parameters cdfsaz /grid/data/saz-tests/tsunami on fcdfcaf1049.fnal.gov
saz_tsunami.out : PID: SAZ ERROR GSS.c:110 Handshake Failed... major_status != GSS_S_COMPLETE
saz_tsunami.out : PID: SAZ ERROR SAZProtocol.c:64 handshake failed
saz_tsunami.out : Thu May 13 23:58:13 CDT sazclient=fail
saz_tsunami.out : results - loop=12613, call=250, pass=249, fail=1, comm=0, elapsed=

What happens at xx:58?

At 58 minutes past the hour, the hourly fetch-crl update runs against /usr/local/testgrid from fg0x6.
We appear to have found an issue with fetch-crl – it is supposed to move the new CRL into place “safely”, but this does not work 100% of the time when NFS is involved and 5000 clients are simultaneously using OpenSSL (a sketch of the intended pattern follows).
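
fetch-crl itself is a shell utility, so the following is only a sketch of the “write to a temporary file in the same directory, then rename into place” pattern that a safe CRL install is expected to follow; the class, the directory and the CRL file name are hypothetical. Note that even an atomic rename on the server is not necessarily seen atomically by thousands of NFS clients, which is the behaviour being investigated above.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    /* Hypothetical sketch of a "safe" CRL install: write the complete new CRL
     * to a temporary file in the target directory, then rename it over the old
     * file in a single step so readers never see a partially written CRL. */
    public class CrlInstallSketch {

        public static void install(byte[] newCrl) throws Exception {
            Path dir = Paths.get("/usr/local/testgrid/certificates");   // hypothetical path
            Path tmp = Files.createTempFile(dir, "crl-", ".tmp");
            Files.write(tmp, newCrl);                                    // 1. write the full CRL
            Files.move(tmp, dir.resolve("1c3f2ca8.r0"),                  // 2. rename into place
                       StandardCopyOption.ATOMIC_MOVE);                  //    (hypothetical CRL file name)
        }
    }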

SAZ V2.7.2 – Ready to Deploy

SAZ V2.7.2 has passed all of our acceptance tests and has been demonstrated to handle authorization tsunamis well beyond (in both number and duration) the authorization tsunamis that have been observed “in the wild”.

Revised Schedule of Changes

SAZ Servers                           Server 1       Server 2       Status
CDF Sleeper Pool                      27-Apr-2010    …-Apr-2010     Done
CDF Grid Cluster Worker Node          18-May-2010    …-May-2010     planned
GP Grid Cluster Worker Node           19-May-2010    …-May-2010     planned
D0 Grid Cluster Worker Node           25-May-2010    …-May-2010     planned
Central Gatekeeper SAZ Server         26-May-2010    …-May-2010     planned
CMS Grid Cluster Worker Node          tbd

Note 1: All of the above dates are tentative, subject to approval by the Change Management Board and the corresponding stakeholder.
Note 2: FermiGrid-HA will maintain continuous service availability during these changes.