System Administration. Diana Scannicchio on behalf of ALICE, ATLAS, CMS, LHCb.


Slide 1: System Administration. Diana Scannicchio on behalf of ALICE, ATLAS, CMS, LHCb (12 March 2013)
Outline:
- Introduction
- Configuration
- Monitoring
- Virtualization
- Security and access
- Support
- Next steps and conclusions

Slide 2: Introduction: run efficiency
Large farms of computers are needed to take data (to run):
- ALICE: ~450 PCs
- ATLAS: ~3000 PCs, ~150 switches
- CMS: ~2900 PCs, ~150 switches
- LHCb: ~2000 PCs, ~200 switches
The aim is good efficiency within the limits of the available hardware, manpower and cost. High availability, from the system administration (not DAQ) point of view, means:
- minimizing the number of single points of failure (some critical systems are unavoidable)
- recovering quickly, to minimize downtime
- using configuration management tools and monitoring systems
This complements the DAQ's own capability of adapting to the loss of nodes. The common goal is run efficiency.

Slide 3: Run
The farms are composed of nodes fulfilling various functions:
- Trigger and Data Acquisition
- Detector Control Systems
- Services: monitoring, authorization, access, LDAP, NTP, MySQL, Apache, ...
- Control Rooms
A run should survive a disconnection from the GPN:
- every vital IT service is duplicated inside the experiment network (DNS, NTP, DHCP, LDAP, domain controllers); a minimal reachability check is sketched below
- event data can be stored locally for 1-2 days (ATLAS and CMS)
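
To make the "duplicated vital services" point concrete, here is a minimal sketch, not taken from any experiment's tooling, of a reachability check one could run from inside the experiment network while the GPN link is down. The host names and ports are hypothetical placeholders.

```python
#!/usr/bin/env python
"""Sketch: verify that the locally duplicated services still answer when the
GPN link is down.  Host names are hypothetical examples, not the experiments'
real ones."""

import socket

# (service, host, TCP port).  NTP is UDP-only and would need a real NTP query,
# so it is left out of this simple TCP check.
SERVICES = [
    ("DNS",   "dns1.example.lan",   53),
    ("LDAP",  "ldap1.example.lan", 389),
    ("MySQL", "db1.example.lan",  3306),
    ("HTTP",  "web1.example.lan",   80),
]

def tcp_alive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except socket.error:
        return False

if __name__ == "__main__":
    for name, host, port in SERVICES:
        status = "OK" if tcp_alive(host, port) else "UNREACHABLE"
        print("%-6s %-22s %s" % (name, host, status))
```

Run from a node inside the experiment network, it prints one OK/UNREACHABLE line per duplicated service.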

Slide 4: Farm Architecture - ATLAS
- Hierarchical structure: Central File Servers (CFS) feed Local File Servers (LFS), which serve the netbooted nodes (a PXE configuration sketch for a diskless node follows below)
- Flat structure: locally installed nodes
- NetApp centralized storage: home directories and the various project areas; 84 disks (6 spares), ~10 TB
(Architecture diagram: CFS/LFS hierarchy together with LDAP, MySQL, DCS, web servers, the NetApp filer, the control room, virtual hosts, core Nagios, public nodes and DNS.)
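
As a hedged illustration of netbooting (this is not ATLAS's in-house diskless configuration system; kernel names, paths and the NFS root are made up), a small script that writes a pxelinux.cfg entry for one diskless node could look like this:

```python
#!/usr/bin/env python
"""Sketch: write a pxelinux.cfg entry for a diskless node that boots a
kernel/initrd over TFTP and mounts its root filesystem from its Local File
Server over NFS.  All names and paths are hypothetical."""

import os

TFTP_ROOT = "/tftpboot"

ENTRY_TEMPLATE = """default netboot
label netboot
  kernel vmlinuz-2.6.32-el6
  append initrd=initramfs-2.6.32-el6.img root=/dev/nfs nfsroot={lfs}:{nfs_root} ip=dhcp ro
"""

def write_pxe_entry(mac, lfs, nfs_root="/exports/slc6-root"):
    """Create the per-node config file, named after the MAC address as pxelinux expects."""
    name = "01-" + mac.lower().replace(":", "-")
    path = os.path.join(TFTP_ROOT, "pxelinux.cfg", name)
    with open(path, "w") as f:
        f.write(ENTRY_TEMPLATE.format(lfs=lfs, nfs_root=nfs_root))

if __name__ == "__main__":
    write_pxe_entry("AA:BB:CC:DD:EE:FF", lfs="lfs-rack12.example.lan")
```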

Slide 5: Farm Architecture - CMS, LHCb, ALICE
- CMS: flat structure; all nodes are locally installed; NetApp centralized storage for home directories and the various project areas (~17 TB)
- LHCb: hierarchical structure; all nodes are netbooted
- ALICE: flat structure; all nodes are locally installed

Slide 6: Efficiency
Single points of failure are impossible to avoid:
- ATLAS: DCS, ROS, NetApp (but the latter is internally redundant)
- CMS: during LS1 a large portion of the DCS will move to blades, with failover to a blade on the surface
- core services (DNS/DHCP/Kerberos, LDAP, LFS) are redundant
Fast recovery is needed, especially for the "single point of failure" systems:
- monitoring is a fundamental tool to be promptly informed about failures or degradation
- configuration management allows a machine to be quickly (re-)installed as it was, e.g. on new hardware (20-40 min)
- moving a DNS alias takes ~15 min, due to propagation and caches (a small check for this is sketched below)
- diskless nodes have no re-install downtime (~5 min) (ATLAS, LHCb); a flexible in-house system configures the diskless nodes, and redundant boot servers serve the boot images, NFS shares, ...
Efficiency loss due to hardware failures has been negligible compared to operator errors or detector failures.
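
Since moving a DNS alias only completes once propagation and caches have caught up, a small watcher like the following sketch (hypothetical host names, not an experiment's actual procedure) can report when the alias finally points at the replacement server:

```python
#!/usr/bin/env python
"""Sketch: watch where a service alias points while it is being moved from a
failed server to its replacement, to see when DNS propagation and caches have
caught up.  Names are hypothetical."""

import socket
import time

ALIAS    = "lfs-rack12.example.lan"      # service alias being moved
EXPECTED = "lfs-rack12-new.example.lan"  # replacement host the alias should reach

def resolves_to(alias, target):
    """True once the alias resolves to one of the target host's addresses."""
    try:
        target_ips = set(socket.gethostbyname_ex(target)[2])
        alias_ips  = set(socket.gethostbyname_ex(alias)[2])
    except socket.gaierror:
        return False
    return bool(alias_ips & target_ips)

if __name__ == "__main__":
    while not resolves_to(ALIAS, EXPECTED):
        print("alias still points to the old host, waiting...")
        time.sleep(30)
    print("%s now resolves to %s" % (ALIAS, EXPECTED))
```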

Slide 7: Configuration (section divider slide)

Slide 8: Configuration
Central configuration management is needed to speed up, and keep under control, the installation of the OS and other software on:
- locally installed nodes
- netbooted nodes
Various configuration management tools are available; the ones in use are:
- Quattor: the CERN IT standard configuration management tool, now being phased out in favour of Puppet; tight control over the installed packages; lacks flexibility for complex configurations and service dependencies
- Puppet: high flexibility; active development community (an external node classifier sketch follows below)
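
One concrete integration point Puppet offers is the external node classifier (ENC): an executable that receives a node name and prints YAML with the node's classes and environment. The following is only a minimal sketch with made-up node naming rules, not any experiment's real classifier:

```python
#!/usr/bin/env python
"""Sketch of a Puppet external node classifier (ENC), assuming a hypothetical
naming scheme (lfs-*, hlt-*, ...).  Puppet calls the ENC with the node name and
expects YAML describing its classes and environment."""

import sys

def classify(node):
    """Map a node name to Puppet classes and an environment (hypothetical rules)."""
    if node.startswith("lfs-"):
        return ["base", "fileserver"], "production"
    if node.startswith("hlt-"):
        return ["base", "hlt"], "production"
    return ["base"], "test"

if __name__ == "__main__":
    node = sys.argv[1]
    classes, environment = classify(node)
    # Emit the YAML document Puppet expects from an ENC.
    print("---")
    print("environment: %s" % environment)
    print("classes:")
    for c in classes:
        print("  - %s" % c)
```

Puppet is pointed at such a script with the node_terminus = exec and external_nodes settings in puppet.conf.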

Slide 9: Quattor and Puppet
Quattor:
- CMS
- LHCb
- ATLAS: some nodes are still configured by a mix of Quattor and Puppet; the phase-out of Quattor will be completed in the next months
Puppet:
- ALICE: the first configuration is done through kickstart, then Puppet takes over (see the sketch below)
- ATLAS: in use for ~3 years, ~15000 lines of code; the complicated servers were the first to be managed by Puppet; on the HLT farm it complements Quattor
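
For the kickstart-then-Puppet hand-over, the post-install step could look roughly like the sketch below. This is a generic illustration, not ALICE's actual kickstart: the repository, the configuration path and the Puppet master name (puppetmaster.example.lan) are hypothetical. The snippet would live inside a kickstart %post section started with --interpreter=/usr/bin/python.

```python
"""Sketch: bootstrap the Puppet agent at the end of a kickstart installation."""
import subprocess

# Install the agent from the locally mirrored repository.
subprocess.check_call(["yum", "-y", "install", "puppet"])

# Point the agent at the internal Puppet master (Puppet 3-era config path).
with open("/etc/puppet/puppet.conf", "a") as conf:
    conf.write("\n[agent]\nserver = puppetmaster.example.lan\n")

# First run: wait up to 60 s for the certificate to be signed, then apply.
# subprocess.call (not check_call) because "puppet agent --test" uses detailed
# exit codes and returns 2 even when changes were applied successfully.
subprocess.call(["puppet", "agent", "--test", "--waitforcert", "60"])
```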

Slide 10: Packages and updates
Software distribution and package management:
- SLC and the other public RPMs come from the CERN repositories; ALICE, ATLAS and CMS also keep mirrors of the repositories in P2, P1 and P5
- Trigger and DAQ software is packaged as RPMs: ALICE and CMS install it locally on each node; ATLAS installs it on the CFS, synchronizes it to the LFS and NFS-mounts it on the clients; LHCb uses an in-house package distribution system (Pacman, the same as for the GRID)
Update policy:
- ATLAS: snapshots of the yum repositories, with versioned test/production/... groups; Quattor clients receive a version list based on their repository group, while Puppet clients pull directly from the assigned repository group (a sketch of the idea follows below)
- CMS: controlled by Quattor/SPMA; updates are pushed as needed
- ALICE: updates are propagated at well-defined moments
- LHCb: updates are deployed at well-defined moments
See next Thursday for detailed news.
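
The "versioned repository groups" idea can be illustrated with a short sketch that pins a node to the snapshot assigned to its group by writing a yum .repo file. The URLs, group names and snapshot dates are invented for the example:

```python
#!/usr/bin/env python
"""Sketch: point a node at a dated snapshot of the yum repositories according
to its group (test / production), so updates only reach production at
well-defined moments.  URLs and paths are hypothetical."""

REPO_TEMPLATE = """[slc6-os-{group}]
name=SLC6 OS snapshot for the {group} group
baseurl=http://repo.example.lan/snapshots/{snapshot}/slc6/os/x86_64/
enabled=1
gpgcheck=1
"""

# Which dated snapshot each group is currently pinned to (made-up values).
GROUP_SNAPSHOTS = {
    "test":       "2013-03-10",
    "production": "2013-02-01",
}

def write_repo(group, path="/etc/yum.repos.d/slc6-snapshot.repo"):
    """Write the .repo file selecting the snapshot assigned to this group."""
    snapshot = GROUP_SNAPSHOTS[group]
    with open(path, "w") as f:
        f.write(REPO_TEMPLATE.format(group=group, snapshot=snapshot))

if __name__ == "__main__":
    write_repo("production")
```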

Slide 11: Monitoring (section divider slide)

Slide 12: Monitoring and alerting
A large infrastructure must be monitored automatically, so that we are proactively warned of any failure or degradation in the system and downtime is avoided or minimized.
What does monitoring mean?
- data collection
- visualization of the collected data (performance, health)
- alerting (SMS, mail) based on the collected data
Various monitoring packages are available; the ones in use are Icinga, Ganglia, Lemon, Nagios and Zabbix (a minimal check plugin is sketched below).
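
Nagios and Icinga (and compatible tools) run small check plugins that print one status line and exit with a standard code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of such a plugin, with made-up load thresholds, could look like this:

```python
#!/usr/bin/env python
"""Sketch of a Nagios/Icinga-style check plugin (hypothetical thresholds):
it prints one status line and exits with the standard plugin codes, so the same
script can feed both data collection and alerting."""

import os
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(warn=8.0, crit=16.0):
    """Check the 1-minute load average against warning/critical thresholds."""
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("UNKNOWN - cannot read load average")
        return UNKNOWN
    # The trailing '| load=...' part is performance data that Icinga/Nagios
    # can hand to a graphing tool.
    msg = "load average %.2f | load=%.2f" % (load1, load1)
    if load1 >= crit:
        print("CRITICAL - " + msg)
        return CRITICAL
    if load1 >= warn:
        print("WARNING - " + msg)
        return WARNING
    print("OK - " + msg)
    return OK

if __name__ == "__main__":
    sys.exit(check_load())
```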

Slide 13: Current monitoring tools
- Lemon is used by ALICE for metric retrieval, display and alerting: it monitors generic Linux hosts and remote devices using SNMP, retrieves DAQ-specific metrics (rates, software configuration, etc.) and does the reporting/alerting
- Nagios (v2) was used by CMS and is still used by ATLAS; it has trouble scaling as the cluster grows, so the configuration is distributed over several servers
- Ganglia is used by ATLAS to provide detailed performance information on the interesting servers (e.g. LFS, virtual hosts, ...); it has no alerting capabilities (a sketch of feeding it a custom metric follows below)
- Icinga is already being used by CMS and LHCb: its configuration is compatible with the Nagios one, so migrating is "easy"; data collection is distributed with Gearman/mod_gearman (a queue system) to spread the work load
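
As an example of pushing a DAQ-specific number into Ganglia alongside the generic host metrics, one can call the stock gmetric tool; the sketch below is illustrative only, and the metric name and value are invented:

```python
#!/usr/bin/env python
"""Sketch: publish a DAQ-specific metric to Ganglia by calling the stock
gmetric tool, so it shows up next to the generic host performance graphs."""

import subprocess

def publish(name, value, units="", metric_type="float"):
    """Send one metric sample to the local gmond via gmetric."""
    subprocess.check_call([
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
    ])

if __name__ == "__main__":
    # Example: the event-building rate seen by this node (made-up number).
    publish("event_rate", 95000.0, units="Hz")
```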

Slide 14: Future monitoring tools
- ALICE will replace Lemon with Zabbix
- ATLAS will complete the migration to Icinga, complementing the information with Ganglia, and will use Gearman/mod_gearman to reduce the load on the monitoring server and improve scaling
- LHCb will also use Ganglia
See next Thursday for detailed news.

Slide 15: Virtualization (section divider slide)

Slide 16: Virtualization in the present
- ALICE: none
- CMS: domain controllers; Icinga workers and a replacement server; a few detector machines
- ATLAS: gateways; domain controllers; a few Windows services; development web servers; core Nagios servers; Puppet and Quattor servers; one detector machine; public nodes
- LHCb: web services; infrastructure services (DNS, domain controller, DHCP, firewalls); critical systems always run as a tandem, one VM plus one real machine (a failover sketch follows below); a few control PCs
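
The "tandem" approach (one real machine plus one VM for each critical system) can be pictured as a tiny failover helper: if the physical box stops answering, start the standby libvirt domain. Host names, the domain name and the health check are hypothetical, and this is not LHCb's actual mechanism:

```python
#!/usr/bin/env python
"""Sketch of the "tandem" idea for a critical service: if the physical box
stops answering, start the standby virtual machine on a KVM host via virsh.
All names are hypothetical."""

import socket
import subprocess

PHYSICAL_HOST = "dns1.example.lan"   # real machine normally providing the service
STANDBY_VM    = "dns1-standby"       # libvirt domain name of the twin VM
VM_HOST       = "kvm01.example.lan"  # hypervisor hosting the standby VM

def service_alive(host, port=53, timeout=3.0):
    """Very rough health check: can we open a TCP connection to the service?"""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except socket.error:
        return False

def start_standby():
    """Start the standby VM remotely through libvirt."""
    subprocess.check_call(
        ["virsh", "-c", "qemu+ssh://root@%s/system" % VM_HOST,
         "start", STANDBY_VM])

if __name__ == "__main__":
    if not service_alive(PHYSICAL_HOST):
        print("physical host down, starting standby VM %s" % STANDBY_VM)
        start_standby()
```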

Slide 17: Virtualization in the future
Virtualization is a very fertile playground and everyone is thinking about how to exploit it. Offline software (analysis and simulation) will run in virtual machines on the ATLAS and CMS HLT farms, with OpenStack used for the management (a sketch follows below). See next Thursday for detailed news.
- ALICE: control room PCs; event builders
- ATLAS: DCS Windows systems
- CMS: servers (DNS, DHCP, Kerberos, LDAP slaves); DAQ services
- LHCb: general login services; gateways and Windows remote desktop; all control PCs (PVSS, Linux, Windows, hardware-specific cases such as CANbus)
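
For the planned OpenStack-managed use of the HLT farms for offline processing, booting a handful of worker VMs could be done with the 2013-era nova client, as in the sketch below; the image and flavor names are invented and this is not the experiments' actual workflow:

```python
#!/usr/bin/env python
"""Sketch: boot a batch of offline-processing VMs on an OpenStack-managed HLT
farm using the nova command-line client.  Image and flavor names are
hypothetical."""

import subprocess

IMAGE  = "slc6-offline-worker"   # hypothetical VM image
FLAVOR = "hlt.8core"             # hypothetical flavor

def boot_worker(name):
    """Ask nova to boot one worker VM."""
    subprocess.check_call([
        "nova", "boot",
        "--image", IMAGE,
        "--flavor", FLAVOR,
        name,
    ])

if __name__ == "__main__":
    for i in range(4):
        boot_worker("offline-worker-%02d" % i)
```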

Slide 18: Security and access (section divider slide)

Slide 19: Authentication
- ATLAS: a local LDAP holds the account information; usernames with a local password where needed (e.g. generic accounts); NICE authentication against mirrors of the CERN Domain Controllers inside P1 (an LDAP lookup sketch follows below)
- ALICE: internal usernames/passwords for the detector people; no synchronization with the NICE users/passwords; RFID/smartcard authentication after LS1; still no access to/from the outside world
- CMS: a local Kerberos server; the same usernames and user IDs as in IT; LDAP stores the user information and the user-to-group mappings
- LHCb: local LDAP; local domain controllers; UIDs, usernames and user information are kept in sync with the CERN LDAP
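
A typical query against such a local LDAP can be made with the standard ldapsearch client; the sketch below is generic, with a hypothetical server, base DN and user, and is not any experiment's actual account tooling:

```python
#!/usr/bin/env python
"""Sketch: look up a user's account information in the local LDAP with the
standard ldapsearch tool, the kind of query the login and sudo infrastructure
relies on.  Server and base DN are hypothetical."""

import subprocess

LDAP_URI = "ldap://ldap1.example.lan"
BASE_DN  = "ou=people,dc=example,dc=lan"

def lookup_user(uid):
    """Return the raw LDIF entry for one uid (simple, anonymous bind)."""
    out = subprocess.check_output([
        "ldapsearch", "-x", "-LLL",
        "-H", LDAP_URI,
        "-b", BASE_DN,
        "(uid=%s)" % uid,
        "uid", "uidNumber", "gidNumber", "gecos",
    ])
    return out.decode()

if __name__ == "__main__":
    print(lookup_user("someuser"))
```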

Slide 20: Security and Access Restriction
- Web pages and logbooks are accessible from outside CERN and secured through the CERN SSO; firewalls and reverse proxies are also used
- The networks are separated from the GPN and the TN (for ATLAS, CMS and LHCb); exceptions are implemented via CERN LanDB Control Sets
- ALICE: no external/GPN access to any DAQ service
- LHCb: no external/GPN access to any DAQ service; access is possible only with an LHCb account, through the Linux gateways or the Windows terminal servers

Slide 21: Security and Access Restriction
ATLAS:
- access to the ATLAS network is controlled
- an RBAC (Role Based Access Control) mechanism, the Access Manager, restricts user access to nodes and resources
- during run time, access is authorized only by the Shift Leader and is time limited
- sudo rules give users limited administration privileges
- logging in to a P1 node takes two steps: first onto the gateway, where the user's roles are checked before the connection is completed (the idea is sketched below), then onto the internal host, managed by a login script
CMS:
- access to the CMS network via the boundary nodes (user head nodes) is never blocked; any valid account can log in
- the internal nodes are not restricted either (anyone can log into any machine)
- sudo rules are restricted according to the type/use of the nodes
- access to the peripheral nodes is through password authentication only (SSH keys are not allowed)
- the boundary nodes are fully fledged nodes, similar to the general nodes on the network
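
The gateway-side role check can be pictured with the following sketch. It is emphatically not the ATLAS Access Manager: the role names, the group-membership convention and the node name patterns are all invented, and it only shows the kind of decision a gateway login script makes before allowing the second hop:

```python
#!/usr/bin/env python
"""Sketch of a gateway role check before the second hop to an internal node.
Role/group names and node patterns are hypothetical."""

import fnmatch
import getpass
import grp
import sys

# Which internal node classes each role may reach (made-up mapping).
ROLE_TO_NODES = {
    "sysadmin":   ["*"],           # everything
    "shifter":    ["control-*"],   # control room machines only
    "det-expert": ["det-*"],       # detector-specific hosts
}

def roles_of(user):
    """Roles = the gateway groups the user belongs to (hypothetical convention)."""
    roles = []
    for role in ROLE_TO_NODES:
        try:
            if user in grp.getgrnam(role).gr_mem:
                roles.append(role)
        except KeyError:
            pass   # group not defined on this gateway
    return roles

def may_access(user, node):
    """True if any of the user's roles matches the requested node name."""
    for role in roles_of(user):
        for pattern in ROLE_TO_NODES[role]:
            if fnmatch.fnmatch(node, pattern):
                return True
    return False

if __name__ == "__main__":
    target = sys.argv[1]
    user = getpass.getuser()
    if not may_access(user, target):
        sys.exit("access to %s denied for %s" % (target, user))
    print("role check passed, continuing to %s" % target)
```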

Slide 22: Support (section divider slide)

Slide 23: Workload and requests management
Ticket systems are used to track issues and requests:
- ALICE and CMS use Savannah and will move to JIRA
- ATLAS has been using Redmine for 3 years (adopted before JIRA became available)
- LHCb uses OTRS and has installed Redmine
Urgent matters are managed via on-call duty, with different philosophies:
- ALICE: DAQ on-call, then the other DAQ experts as needed
- ATLAS: direct call to the TDAQ sysadmins
- CMS and LHCb: the DAQ on-call is the first line, then the sysadmins

Slide 24: Next steps and conclusions (section divider slide)

Slide 25: Next steps
A lot of work is planned by all the experiments during LS1:
- updating the operating systems: SLC6 on both locally installed and netbooted nodes; Windows Server 2008 or later
- completing the migration to the new configuration management tools
- upgrading and improving the monitoring systems
- looking more and more at virtualization: the HLT farms will be used to run offline software in virtual machines

Slide 26: Conclusions
- The systems are working: we happily ran and took data, with complex systems and 24x7 support
- The "cross experiment" meetings are interesting and proactive: they let us share information and compare solutions and performance
- We are converging on the same or similar tools for the "objective" tasks, e.g. monitoring and configuration management; appropriate tools for dealing with big farms are now available, because big farms now exist elsewhere in the world and CERN is no longer a peculiarity
- Differences remain for the "subjective" tasks: whether to restrict access or not; uniformity (netbooted) versus flexibility (locally installed)
- Improvement is always possible... unfortunately it depends on cost, time and manpower

Slide 27: Thanks to...
- ALICE: Adriana Telesca, Ulrich Fuchs
- CMS: Marc Dobson
- LHCb: Enrico Bonaccorsi, Christophe Haen, Niko Neufeld