Monitoring and operational management in USLHCNet
Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa
CHEP09, Prague, March 2009

Outline
- MonALISA Framework
  - Architecture
  - Data handling
  - Automatic actions
- USLHCNet
  - Network topology
  - Monitoring modules
  - Reliable monitoring & accounting
  - Alarms & triggers
- Conclusions

The MonALISA Architecture
- HL services: regional or global high-level services, repositories & clients
- Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
- MonALISA services (agents): distributed system for gathering and analyzing information based on mobile agents; customized aggregation, triggers, actions
- Network of JINI lookup services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
- Fully distributed system with no single point of failure

MonALISA Service & Data Handling
[Diagram labels:]
- Monitoring modules: collect any type of information; dynamic (re)loading; push and pull
- Agents, filters / triggers
- Data cache service & DB; data store (Postgres)
- Configuration control (SSL)
- Predicates & agents; data (via ML Proxy) to applications, clients or higher-level services
- Web service (WSDL, SOAP) for WS clients and services
- Lookup services: registration, discovery

Local and Global Decision Framework
- Two levels of decisions: local (autonomous) and global (correlations)
- Actions triggered by:
  - values above/below given thresholds
  - absence/presence of values
  - correlations between any values
- Action types:
  - alerts (e-mails, instant messages, Atom feeds)
  - running an external command
  - automatic chart annotations in the repository
  - running custom code, such as securely ordering an ML service to (re)start a site service
[Diagram: sensors (traffic, jobs, hosts, apps, temperature, humidity, A/C, power, ...) feed the ML services, which take local decisions (actions based on local information); global ML services take global decisions (actions based on global information)]
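The three trigger kinds listed on this slide (thresholds, absence of values, correlations between values) can be illustrated with a small sketch. This is not the MonALISA actions framework or its API; `Trigger`, `above`, `absent`, `both` and `evaluate` are hypothetical names chosen for illustration, and the action is reduced to a plain callback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Latest value per parameter; a missing key or None means "no value received".
Values = Dict[str, Optional[float]]
Condition = Callable[[Values], bool]


@dataclass
class Trigger:
    """Hypothetical trigger: a condition plus the action to run when it holds."""
    name: str
    condition: Condition
    action: Callable[[str], None]


def above(param: str, threshold: float) -> Condition:
    """Condition: the parameter has a value and it exceeds the threshold."""
    def check(values: Values) -> bool:
        v = values.get(param)
        return v is not None and v > threshold
    return check


def absent(param: str) -> Condition:
    """Condition: no value has been received for the parameter."""
    return lambda values: values.get(param) is None


def both(a: Condition, b: Condition) -> Condition:
    """Simple correlation: both conditions hold at the same time."""
    return lambda values: a(values) and b(values)


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    """Run the action of every trigger whose condition holds; return their names."""
    fired = []
    for t in triggers:
        if t.condition(values):
            t.action(t.name)
            fired.append(t.name)
    return fired
```

In this sketch a local decision would call `evaluate` with the values seen by one service, while a global decision would call it with values merged from several services.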

Monitoring architecture in ALICE
[Diagram: ApMon instances attached to the AliEn job agents, CEs, SEs, cluster monitors, TQ, IS, optimizers, brokers, MySQL servers, CastorGrid scripts and API services, together with the LCG tools, report to the MonALISA service at each LCG site. The MonALISA repository holds the aggregated data in a long-history DB and drives alerts and actions.]
Monitored parameters include: rss, vsz, cpu time, run time, job slots, free space, number of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, jobs status, sockets, migrated MBytes, active sessions, MyProxy status.
See Costin Grigoras' poster (067): Automated agents for management and control of the ALICE Computing Grid

USLHCNet
- USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
- Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
- The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits offer the functionality needed to develop efficient data transfer services with support for QoS and priorities.
- Hybrid network: uses both Ciena CD devices and Force10 routers
- 4 transatlantic 10G links at the moment (6 links in the second part of this year)*
* See Harvey Newman's talk [502] from Monday: "Status and outlook of the HEP network"

USLHCNet ML weather map
[Figure: MonALISA weather map of the USLHCNet links]

Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
- Force10 (SNMP & sFlow)
  - Traffic per interface
  - sFlow traffic
  - Link status monitoring
- Ciena Core Director (TL1, Transaction Language 1)
  - ETTP (Ethernet Termination Point) traffic
  - EFLOW (Ethernet Flow) traffic
  - OSRP (routing protocol) topology
  - Dynamic circuits inside the optical core of the network
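Per-interface traffic obtained over SNMP is usually derived from octet counters polled at intervals (for example the IF-MIB `ifHCInOctets` counter). The sketch below shows that arithmetic only; it is not the code of the MonALISA Force10 module, and the function name and parameters are assumptions made for illustration.

```python
def counter_rate_bps(prev_count: int, prev_time: float,
                     curr_count: int, curr_time: float,
                     counter_bits: int = 64) -> float:
    """Convert two octet-counter samples into an average rate in bits per second.

    Assumes the counter wrapped at most once between the two samples.
    """
    dt = curr_time - prev_time
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    modulus = 1 << counter_bits
    # Modular difference stays correct when the counter wrapped around once.
    delta_octets = (curr_count - prev_count) % modulus
    return delta_octets * 8 / dt
```

With 32-bit counters on a 10G link a wrap can happen within a few seconds, which is why the counter width is a parameter here.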

USLHCNet monitoring
[Diagram: network devices monitored via SNMP and TL1]

USLHCNet redundant monitoring
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.

Local and global filters
- Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by e-mail, SMS and IM in case of problems.
- The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
- The link status is computed as a logical "AND" between both end points of a link. This also cross-checks the status reported by the hardware equipment.
- We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
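A minimal sketch of how such a global filter could combine redundant reports. Only the logical AND between the two end points comes from the slide. How the reports of the redundant services at one end are merged is not stated, so the "up if any service reports up" rule and the use of the maximum traffic sample are assumptions, and all function names are hypothetical.

```python
from typing import List, Optional


def endpoint_up(reports: List[Optional[bool]]) -> bool:
    """Status of one end point as seen by its redundant monitoring services.

    Assumption: the end point counts as up if at least one service reports
    it up; None marks a service that delivered no report.
    """
    return any(r is True for r in reports)


def link_up(end_a: List[Optional[bool]], end_b: List[Optional[bool]]) -> bool:
    """Link status as the logical AND between both end points of the link."""
    return endpoint_up(end_a) and endpoint_up(end_b)


def merge_traffic(samples: List[Optional[float]]) -> Optional[float]:
    """Merge redundant traffic samples for the same interval.

    Assumption: take the largest available sample, since a service that
    missed polls would under-report; returns None if no service reported.
    """
    seen = [s for s in samples if s is not None]
    return max(seen) if seen else None
```

Under these rules a link is reported as down as soon as either end has no service confirming it is up, which errs on the side of raising an alarm.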

USLHCNet: Precise measurements for the Operational Status on the WAN Link
- Operations & management assisted by agent-based software
- Used on the new Ciena equipment used for network management

USLHCNet: Traffic on different segments

USLHCNet: Accounting for Integrated Traffic
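Integrated traffic is the transferred volume obtained by integrating the measured rate over time. The slide does not say how the repository computes it; the sketch below uses the trapezoidal rule over time-ordered rate samples as one plausible approach, with a hypothetical function name and sample format.

```python
from typing import List, Tuple


def integrated_traffic_bytes(samples: List[Tuple[float, float]]) -> float:
    """Integrate (time_seconds, rate_bits_per_second) samples into bytes.

    Uses the trapezoidal rule between consecutive samples; the samples
    must be sorted by time.
    """
    total_bits = 0.0
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        total_bits += (r0 + r1) / 2.0 * (t1 - t0)
    return total_bits / 8.0
```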

USLHCNet: Ciena alarms monitoring

The Need for Planning and Scheduling for Large Data Transfers
[Figure: two reading tasks run in parallel vs. sequentially]
It is 2.5x faster to perform the two reading tasks sequentially.

Monitoring Optical Switches
Dynamic restoration of a lightpath if a segment has problems
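The restoration idea can be sketched as choosing the first candidate path whose segments are all reported up. This illustrates the decision only and is not the agents' actual logic; the function, the segment names in the usage note and the rule that unknown segments count as down are assumptions.

```python
from typing import Dict, List, Optional


def choose_path(candidate_paths: List[List[str]],
                segment_up: Dict[str, bool]) -> Optional[List[str]]:
    """Return the first path, in preference order, with every segment up.

    Segments without a known status are treated as down. Returns None
    when no candidate path is fully available.
    """
    for path in candidate_paths:
        if all(segment_up.get(seg, False) for seg in path):
            return path
    return None
```

For example, with hypothetical candidates `[["A-B", "B-C"], ["A-C"]]`, a failure on segment `"B-C"` makes the function fall back to `["A-C"]`.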

Controlling Optical Planes: Automatic Path Recovery
[Diagram: CERN Geneva, Starlight, Manlan and CALTECH Pasadena, connected over USLHCNet and Internet2]
- "Fiber cut" simulations: the traffic moves from one transatlantic line to the other one
- The FDT transfer (CERN - CALTECH) continues uninterrupted
- TCP fully recovers in ~20 s
[Plot: FDT transfer at 200+ MBytes/sec from a 1U node, with 4 fiber cut emulations]
For more details, see Iosif Legrand's poster (054): A High Performance Data Transfer Service

Conclusions
- The MonALISA framework provides a flexible and reliable monitoring infrastructure
  - 350+ installed services, 1.5M+ unique parameters, 25 kHz value updates
  - Truly distributed architecture with no single point of failure
  - Highly modular platform
  - Automatic decision-making capability at both local and global levels
- USLHCNet provides a state-of-the-art hybrid network with support for circuit-oriented network services
- Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime
- We are investigating dynamic provisioning of circuits from collaborating agents