Published by Graciela Heaslip. Modified over 9 years ago.
1
“Managing a farm without user jobs would be easier” Clusters and Users at CERN Tim Smith CERN/IT
2
2002/10/25, HEPiX fall 2002, Tim.Smith@cern.ch
Contents
The road to shared clusters.
Batch cluster: configuration, user challenges, addressing the challenges.
Interactive cluster: load balancing.
Conclusions.
3
The Demise of Free Choice
[Figure: timeline covering 2000, 2001, 2002, 2003]
4
Cluster Aggregation
5
Organisational Compromises
Clusters per group: sized for the average users; sized for user peaks (users, financiers): wasted resources. Invest effort in recuperating cycles for other groups. Configuration differences / specialities.
Bulk production clusters: production fluctuations dwarf those in user analysis. Complex cross-submission links.
6
Production Farm: Planning
7
Shared Clusters
[Diagram labels: lxplus001, lxbatch001, DNS load balancing, LSF, disk001, tape001, rfio]
750 Batch Servers, 70 Interactive Servers, 120 Disk Servers
8
Simple, Uniform Shared Cluster?
9
Partitioning
Still have identified resources. Uniform configuration.
Sharing: repartitioning or soak-up queues. If the owner experiment reclaims resources, soak-up jobs must be suspended, leaving stranded jobs.
Partitions: ALICE, ATLAS, CMS, LHCb, ALEPH, DELPHI, L3, OPAL, COMPASS, Ntof, OPERA, SLAP, PARC, PARC Int, CVS, BUILD, DELPHI Int, CSF, Public
10
LSF Fair-Share
Trade in a partition for a share.
Multilevel: ATLAS 10%, CMS 12%, …; cmsprod 45%, HiggsWG 15%, …; usera 10%, userb 80%, userc 10%.
Extra shares for productions.
Effort: juggling resources to accounting, demonstrating fairness, protecting, policing.
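As an illustration of what a multilevel share means in practice, the sketch below multiplies the shares along one path of a share tree to get a user's overall fraction of the cluster. The nesting (HiggsWG under CMS, usera/userb/userc under HiggsWG) is an assumption made for the example; the slide gives only the percentages per level. This is plain arithmetic, not the LSF scheduler's actual priority calculation.

```python
from fractions import Fraction

def pct(value):
    """Turn a whole-number percentage into an exact fraction."""
    return Fraction(value, 100)

# Hypothetical share tree: name -> (share at this level, children).
# The parent/child relations are assumed for illustration only.
SHARES = {
    "ATLAS": (pct(10), {}),
    "CMS": (pct(12), {
        "cmsprod": (pct(45), {}),
        "HiggsWG": (pct(15), {
            "usera": (pct(10), {}),
            "userb": (pct(80), {}),
            "userc": (pct(10), {}),
        }),
    }),
}

def effective_share(tree, path):
    """Multiply the shares along a path such as ["CMS", "HiggsWG", "userb"]."""
    share = Fraction(1)
    level = tree
    for name in path:
        fraction, children = level[name]
        share *= fraction
        level = children
    return share

if __name__ == "__main__":
    share = effective_share(SHARES, ["CMS", "HiggsWG", "userb"])
    print(f"userb overall share: {float(share):.2%}")
```

Under these assumed relations, userb's 80% of a 15% working-group share of a 12% experiment share comes to 1.44% of the whole cluster.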
11
Facts and Figures
Accounting: LSF job records. Process with C program. Load into Oracle DB. Prepare plots/tables with Crystal Reports package. LSFAnalyser?
Monitoring: poll the user access tools. SiteAssure?
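The accounting step (job records in, weekly tables out) can be sketched as a small aggregation. The record layout below, a tuple of finish date, group and CPU seconds, is a hypothetical simplification and not the real LSF accounting file format; the C program, Oracle DB and Crystal Reports stages named on the slide are not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, simplified job records: (finish date, group, CPU seconds).
JOBS = [
    (date(2002, 10, 14), "cms", 3600.0),
    (date(2002, 10, 16), "cms", 1800.0),
    (date(2002, 10, 16), "atlas", 7200.0),
    (date(2002, 10, 22), "cms", 900.0),
]

def cpu_hours_per_week(jobs):
    """Sum CPU hours per (ISO year, ISO week, group)."""
    totals = defaultdict(float)
    for finished, group, cpu_seconds in jobs:
        iso = finished.isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1], group)] += cpu_seconds / 3600.0
    return dict(totals)

if __name__ == "__main__":
    for (year, week, group), hours in sorted(cpu_hours_per_week(JOBS).items()):
        print(f"{year}-W{week:02d} {group:6s} {hours:6.2f} CPU hours")
```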
12
CPU Time / Week
Merged user analysis and production farms
13
Performance of Batch Job Slot Analysis
[Figure: time axis Thu, Fri, Sa; 10 min / tick]
14
Challenging Batch (I)
Probing boundaries: flooding, concurrent starts, uncontrolled status polling.
Hitting limits: disk space (/tmp, /pool, /var); memory, swap full. Guarantees for other user jobs?
System issues: queue drainers.
15
Challenging Batch (II)
Un-Fair-Share: logging onto batch machines; batch jobs which resubmit themselves; forking sessions back to remote hosts.
Wasting resources: spawning processes which outlive the jobs; sleeping processes; copying large AFS trees; establishing connections to dead machines.
16
Counter Measures
File system quotas. Virtual memory limits. Concurrent job limits per user/group. Restricted access through PAM. Instant response queues.
Master node setup: dedicated, 1 GB memory, failover cluster.
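To make the "concurrent job limits per user/group" counter measure concrete, here is a minimal sketch of an admission check that refuses to start a job once its user or group already has the maximum number running. The class name, method names and limit values are hypothetical; in the setup described, such limits would be enforced by the batch system's own configuration rather than by code like this.

```python
from collections import Counter

class JobLimiter:
    """Toy admission control with per-user and per-group running-job caps."""

    def __init__(self, user_limit, group_limit):
        self.user_limit = user_limit
        self.group_limit = group_limit
        self.running_by_user = Counter()
        self.running_by_group = Counter()

    def try_start(self, user, group):
        """Return True and record the job if both limits allow it."""
        if self.running_by_user[user] >= self.user_limit:
            return False  # this user already has too many running jobs
        if self.running_by_group[group] >= self.group_limit:
            return False  # the whole group is at its cap
        self.running_by_user[user] += 1
        self.running_by_group[group] += 1
        return True

    def finish(self, user, group):
        """Release the slots held by one finished job."""
        if self.running_by_user[user] > 0:
            self.running_by_user[user] -= 1
            self.running_by_group[group] -= 1

if __name__ == "__main__":
    limiter = JobLimiter(user_limit=2, group_limit=3)  # hypothetical limits
    for user in ["usera", "usera", "usera", "userb", "userc"]:
        started = limiter.try_start(user, "cms")
        print(user, "started" if started else "held back")
```

A job that is held back would stay pending in its queue and be retried once `finish` frees a slot.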
17
Shared Clusters: LSF MultiCluster
[Diagram labels: lxplus001, lxbatch001, DNS load balancing, LSF, disk001, tape001, rfio]
750 Batch Servers, 70 Interactive Servers, 120 Disk Servers
18
Shared Clusters: Single Cluster
[Diagram labels: lxplus001, lxbatch001, DNS load balancing, LSF, disk001, tape001, rfio]
750 Batch Servers, 70 Interactive Servers, 120 Disk Servers
19
Interactive Cluster
DNS load balancing (ISS).
Weighted load indexes: load, memory swap rate, disk IO rate, # processes, # sessions, # window mgr sessions.
Exclusion thresholds: file systems full, nologins.
DNS publishes 2 every 30 seconds: random from lowest 5.
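The selection rule (exclude unhealthy nodes, rank the rest by a weighted load index, publish 2 chosen at random from the lowest 5) can be sketched as follows. The weights, metric names and node dictionary layout are assumptions made for the example; the slide does not give the actual ISS weights or thresholds.

```python
import random

# Hypothetical weights; the real values are not given in the slides.
WEIGHTS = {
    "load": 1.0,
    "swap_rate": 0.5,
    "disk_io_rate": 0.5,
    "processes": 0.01,
    "sessions": 0.1,
}

def load_index(metrics):
    """Weighted sum of the node's metrics; a missing metric counts as 0."""
    return sum(weight * metrics.get(name, 0.0) for name, weight in WEIGHTS.items())

def is_excluded(node):
    """Exclusion thresholds: a full file system or logins disabled."""
    return node["fs_full"] or node["nologin"]

def choose_published(nodes, rng, pool_size=5, publish=2):
    """Pick `publish` nodes at random from the `pool_size` least loaded."""
    candidates = [node for node in nodes if not is_excluded(node)]
    candidates.sort(key=lambda node: load_index(node["metrics"]))
    pool = candidates[:pool_size]
    return rng.sample(pool, min(publish, len(pool)))

if __name__ == "__main__":
    nodes = [
        {"name": f"lxplus{i:03d}", "fs_full": False, "nologin": i == 1,
         "metrics": {"load": float(i), "sessions": 10.0 * i}}
        for i in range(1, 9)
    ]
    rng = random.Random()  # in the setup described, this would rerun every 30 seconds
    print([node["name"] for node in choose_published(nodes, rng)])
```

Choosing at random from the lowest few, rather than always publishing the single least loaded node, avoids sending every new login in the next interval to the same machine.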
20
Daily Users
[Figure: 35 users / node]
21
Challenging Interactive
Sidestep load balancing. Parallel sessions across farm. Running daemons.
Brutal logouts: open connections, defunct processes, CPU-sapping orphaned processes.
Monitoring + beniced + monthly reboots.
22
Interactive Reboots
23
Conclusions
Shared clusters present more user opportunities, both good and bad!
They don't represent a panacea for sysadmins!