PLANNING FOR PREDICTABLE NETWORK PERFORMANCE IN THE ATLAS TDAQ
C. Meirosu, B. Martin, A. Topurov, A. Al-Shabibi
CHEP’06, Mumbai, India

February 2006, Catalin Meirosu

Outline
Network management, the basics
TDAQ needs
Proposed solution: architecture and implementation

What is network management?
Management workstation
Hardware inventory
Software version
Device settings
Troubleshooting assistance

Network management
Manager-agent model
De facto standard implementation through IETF specifications
  – Management Information Bases (MIBs)
  – Simple Network Management Protocol (SNMP)
Best practices: IT Infrastructure Library (ITIL)
  – Configuration and change management (among others)
  – Emphasis on service, rather than hardware/software

The ATLAS TDAQ system
[Diagram; visible label: Control network. Courtesy of Stefan Stancu]

What does TDAQ need?
The TDAQ networking scenario
  – Pre-defined number of devices to connect to the network, staged deployment
  – Network resilient by design, optimised for a known traffic pattern
Need to maximise network uptime
  – Rapid and precise fault localisation
Maintain the agreed QoS for the applications known at design time while providing good service for late arrivals
Use the network as a debug tool for application data transfer problems
Provide real-time information on the network status to the physics operator on shift

Network management for TDAQ
[Diagram: TDAQ network management solution. Labels: Physics console middleware; Fault and Performance management; Configuration management; YaTG; Network administrator]

Fault management
Detection
  – Where?
  – When?
Why? Root Cause Analysis (and related methods)
  – Precise location of the actual fault
  – Standard component in commercial packages
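
The idea behind root cause analysis can be illustrated with a minimal sketch. It assumes a tree-shaped topology and a set of devices that failed a reachability check; the device names and the `root_causes` helper are hypothetical, and commercial packages use considerably more elaborate methods. A device is reported as a likely root cause only if its immediate upstream neighbour is still reachable, so the failures that merely follow from it are suppressed.

```python
def root_causes(parent, unreachable):
    """Return the unreachable devices whose upstream neighbour is still reachable.

    parent: dict mapping each device to its upstream device (None for the root).
    unreachable: set of devices that failed the reachability check.
    """
    causes = set()
    for device in unreachable:
        upstream = parent.get(device)
        # Likely root cause: the immediate upstream neighbour has not failed,
        # so this outage is not explained by a fault one hop closer to the root.
        if upstream is None or upstream not in unreachable:
            causes.add(device)
    return causes


# Hypothetical topology: one core switch, two edge switches, three end nodes.
topology = {
    "core-sw": None,
    "edge-sw-1": "core-sw",
    "edge-sw-2": "core-sw",
    "node-a": "edge-sw-1",
    "node-b": "edge-sw-1",
    "node-c": "edge-sw-2",
}

down = {"edge-sw-1", "node-a", "node-b"}
print(sorted(root_causes(topology, down)))  # ['edge-sw-1']
```

Here three devices are unreachable, but only the edge switch is reported, which is the "precise location of the actual fault" the operator needs.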

Performance management
“Why am I not receiving the advertised service?”
Traffic monitoring and reporting
  – Standard best practices: report 1-5 min averages
Low frequency monitoring
  – Included in the commercial tool
High frequency monitoring
  – YaTG (in-house development): 1 s average
  – Rate potentially higher, but not supported by SNMP implementations in many modern switches
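
An average rate over a polling interval is typically derived from two successive readings of an SNMP octet counter. The sketch below shows the arithmetic, including the wrap of a 32-bit counter, which at 1 Gbit/s happens roughly every 34 s. It is an illustration only: the function name and sample values are hypothetical, and the slides do not describe how YaTG computes its averages.

```python
def average_rate_bps(prev_octets, curr_octets, interval_s, counter_bits=32):
    """Average bit rate between two octet-counter samples taken interval_s apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # Modular subtraction handles a single counter wrap between the samples.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Two samples one second apart, 125,000,000 octets transferred: 1 Gbit/s.
print(average_rate_bps(1_000, 125_001_000, 1.0))  # 1000000000.0

# The 32-bit counter wrapped between samples; the delta is still 200 octets.
print(average_rate_bps(2**32 - 100, 100, 1.0))  # 1600.0
```

The result is only meaningful if the switch refreshes its counters at least as often as they are polled, which is the limitation noted for rates above one sample per second.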

Integration with the physics console
The physics operator controls the run of the experiment via the Online Software
  – Needs to see the state of the network
The fault management knows the state of the network
We pass the information (via CORBA) to the Online Software
  – Hence the term “middleware” (meaning “in between”)

Configuration management
Keep track of configuration changes in the network devices
Push pre-defined configurations onto the network
RANCID, open source tool
  – Covers the above basics
Advanced features under discussion
  – Application-driven network reconfiguration?
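
Keeping track of configuration changes comes down to comparing successive snapshots of each device configuration. The sketch below uses Python's standard `difflib` to show the principle; it is not RANCID code, and the `config_changes` helper, device name, and configuration text are hypothetical.

```python
import difflib


def config_changes(old_config, new_config, device="device"):
    """Return a unified diff between two configuration snapshots, as a list of lines."""
    return list(
        difflib.unified_diff(
            old_config.splitlines(),
            new_config.splitlines(),
            fromfile=f"{device} (previous)",
            tofile=f"{device} (current)",
            lineterm="",
        )
    )


# Hypothetical switch configuration snapshots.
previous = "hostname edge-sw-1\ninterface ge-0/1\n speed 1000\n"
current = "hostname edge-sw-1\ninterface ge-0/1\n speed 100\n"

for line in config_changes(previous, current, device="edge-sw-1"):
    print(line)
```

An empty result means the device configuration is unchanged since the last snapshot; a non-empty one shows the network administrator exactly which lines differ.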

Conclusion
A network management solution is a must in the TDAQ context
Areas to be covered
  – Performance management
  – Fault management
  – Configuration management
  – Integration with the physics console
Sophisticated network management is expensive, but the network cannot troubleshoot itself (yet)