Tape Monitoring
Vladimír Bahyl, IT DSS TAB
Storage Analytics Seminar, February 2011
Data & Storage Services, CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it

Slide 2: Overview
From low level
– Tape drives; libraries
Via middle layer
– LEMON
– Tape Log DB
To high level
– Tape Log GUI
– SLS
TSMOD
What is missing?
Conclusion

Slide 3: Low level – towards the vendors
Oracle Service Delivery Platform (SDP)
– Automatically opens tickets with Oracle
– We also receive notifications
– Requires a "hole" in the firewall, but quite useful
IBM TS3000 console
– Central point collecting all information from 4 (out of 5) libraries
– Call home via the Internet (not a modem)
– Engineers come on site to fix issues

Slide 4: Low level – CERN usage
SNMP
– Using it (traps) whenever available
– Need MIB files with SNMPTT actuators (example event definition below)
– IBM libraries send traps on errors
– ACSLS sends activity traps
ACSLS
– Event log messages on multiple lines concatenated into one
– Forwarded via syslog to central store (sketched after this slide)
– Useful for tracking issues with library components (PTP)

Example SNMPTT event definition:

  EVENT ibm3584Trap ibm3584Trap CRITICAL
  FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
  EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
  NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
  SDESC
  Trap for library TapeAlert 004.
  EDESC
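
A minimal sketch of the ACSLS handling described above: event-log entries that span several lines are joined into one message and forwarded to the central store via syslog. The log path, the timestamp pattern and the central syslog host are illustrative assumptions, not the production values.

  # Sketch: join multi-line ACSLS event-log entries and forward them via syslog.
  import logging
  import logging.handlers
  import re
  import socket

  # Assumption: every new event starts with a timestamp like "2011-02-03 10:15:02".
  NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

  def concatenate(lines):
      """Yield one string per event, joining continuation lines into the first one."""
      entry = []
      for line in lines:
          if NEW_ENTRY.match(line) and entry:
              yield " ".join(entry)
              entry = []
          entry.append(line.strip())
      if entry:
          yield " ".join(entry)

  if __name__ == "__main__":
      # Ship every concatenated entry to the central syslog server over TCP.
      handler = logging.handlers.SysLogHandler(
          address=("tapelog.example.cern.ch", 514),        # hypothetical host
          socktype=socket.SOCK_STREAM)
      log = logging.getLogger("acsls")
      log.addHandler(handler)
      log.setLevel(logging.INFO)
      with open("/export/home/ACSSS/log/acsss_event.log") as f:  # hypothetical path
          for event in concatenate(f):
              log.info(event)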

Slide 5: Middle layer – LEMON
Actuators constantly check local log files (see the sketch after this slide)
4 situations covered:
1. Tape drive not operational
2. Request stuck for at least 3600 seconds
3. Cartridge is write protected
4. Bad MIR (Media Information Record)
Ticket is created = e-mail is sent
– All relevant information is provided within the ticket to speed up the resolution
Workflow is followed to find a solution

Example notification:

  Dear SUN Tape Drive maintainer team,
  this is to report that a tape drive has become non-operational.
  Tape T05653 has been disabled.

  PROBABLE ERRORS
  01/28 15:33: rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
  01/28 15:33: chkdriveready: TP002 - ioctl error : Input/output error
  01/28 15:33: rlstape: TP033 - drive not operational

  IDENTIFICATION
  Drive Name:  T10B661D
  Location:    acs0,6,1,13
  Serial Nr:
  Volume ID:   T05653
  Library:     SL8600_1
  Model:       T10000
  Producer:    STK
  Density:     1000GC
  Free Space:  0
  Nb Files:    390
  Status:      FULL|DISABLED
  Pool Name:   compass7_2
  Tape Server: tpsrv963
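
A minimal sketch of what such an actuator might look like. The log path and the error strings are invented for the example (the real actuators match the production CASTOR messages); conditions 1, 3 and 4 are detected from single log lines, while condition 2 needs the request start time.

  # Sketch: LEMON-style actuator scanning the local tape daemon log.
  import os
  import re
  import time

  TAPE_LOG = "/var/log/castor/taped.log"      # hypothetical path

  # Conditions 1, 3 and 4 are spotted from single log lines; the patterns are
  # illustrative, not the exact strings written by the production daemons.
  PATTERNS = {
      "tape drive not operational":   re.compile(r"drive not operational"),
      "cartridge is write protected": re.compile(r"write.?protect", re.I),
      "bad MIR":                      re.compile(r"bad MIR", re.I),
  }

  STUCK_AFTER = 3600   # condition 2: request stuck for at least 3600 seconds

  def request_stuck(started_at, now=None):
      """Condition 2 needs the request start time rather than a log pattern."""
      return ((now or time.time()) - started_at) > STUCK_AFTER

  def scan(path):
      """Return (condition, evidence line) pairs found in the local log file."""
      alarms = []
      with open(path) as f:
          for line in f:
              for name, pattern in PATTERNS.items():
                  if pattern.search(line):
                      alarms.append((name, line.strip()))
      return alarms

  if __name__ == "__main__":
      if os.path.exists(TAPE_LOG):
          for name, evidence in scan(TAPE_LOG):
              # The real actuator opens a ticket / sends an e-mail with the full
              # drive and tape context; here the finding is only printed.
              print("ALARM: %s -- %s" % (name, evidence))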

Slide 6: Middle layer – Tape Log DB
CASTOR log messages from all tape servers are processed and forwarded to a central database
Allows correlation of independent errors (not a complete list; one example is sketched below):
– X input/output errors with Y tapes on 1 drive
– X write errors on Y tapes on 1 drive
– X positioning errors on Y tapes on 1 drive
– X bad MIRs for 1 tape on Y drives
– X write/read errors on 1 tape on Y drives
– X positioning errors on 1 tape on Y drives
– Too many errors on a library
Archives all logs for 120 days, split by VID and tape server
– Q: What happened to this tape?
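
As an illustration, the first correlation above ("X input/output errors with Y tapes on 1 drive") can be expressed as a single aggregate query. The table and column names are invented for the example, and SQLite stands in for the Oracle database.

  # Sketch: "X input/output errors with Y tapes on 1 drive" as one aggregate query.
  import sqlite3

  QUERY = """
  SELECT drive,
         COUNT(*)            AS nb_errors,
         COUNT(DISTINCT vid) AS nb_tapes
  FROM   tape_errors
  WHERE  error_type = 'IO'
    AND  happened_at > datetime('now', '-1 day')
  GROUP  BY drive
  HAVING COUNT(*) >= 5                -- X: at least 5 I/O errors
     AND COUNT(DISTINCT vid) >= 3     -- Y: spread over at least 3 tapes
  """

  conn = sqlite3.connect(":memory:")  # stand-in for the Oracle database
  conn.execute("""CREATE TABLE tape_errors
                  (drive TEXT, vid TEXT, error_type TEXT, happened_at TEXT)""")
  rows = [("T10B661D", "T05653", "IO"), ("T10B661D", "T12345", "IO"),
          ("T10B661D", "T99999", "IO")] * 2   # 6 errors spread over 3 tapes
  conn.executemany(
      "INSERT INTO tape_errors VALUES (?, ?, ?, datetime('now'))", rows)

  for drive, nb_errors, nb_tapes in conn.execute(QUERY):
      print("drive %s: %d I/O errors on %d different tapes" %
            (drive, nb_errors, nb_tapes))

Keeping the correlation as a query over the central database is what allows independent errors from many tape servers to be combined after the fact.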

Slide 7: Tape Log – the data
Origin: rtcpd & taped log messages
– All tape servers sending data in parallel
Content: various file state information
Volume:
– Depends on the activity of the tape infrastructure
– Past 7 days: ~30 GB of text files (raw data)
Frequency:
– Depends on the activity of the tape infrastructure
– Easily > 1000 lines / second
Format: plain text

Slide 8: Tape Log – data transport
Protocol: (r)syslog log messages
Volume: ~150 KB/second
Accepted delays: YES/NO
– YES: if the tape log server cannot upload processed data into the database, it will retry later as it has a local text log file (sketched below)
– NO: if the rsyslog daemon is not running on the tape log server, lost messages will not be processed
Losses acceptable: YES (to some small extent)
– The system is only used for statistics or slow reactive monitoring
– A serious problem will reoccur elsewhere
– We use TCP in order not to lose messages
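
A sketch of the "accepted delays" behaviour, assuming a stand-in Database class and an invented key=value line format: a batch that cannot be uploaded is kept and retried a little later, and the local text log file is what preserves it across longer outages.

  # Sketch: keep a batch and retry later if the database upload fails.
  import tempfile
  import time

  class Database(object):
      """Stand-in for the Oracle back end; insert_many would raise when it is down."""
      def insert_many(self, lines):
          print("uploaded %d log lines" % len(lines))

  def upload_with_retry(lines, db, retry_delay=60, max_retries=3):
      """Try to upload a batch; keep it for the next round if the DB is unreachable."""
      for _ in range(max_retries):
          try:
              db.insert_many(lines)
              return []                    # uploaded, nothing left pending
          except Exception:
              time.sleep(retry_delay)      # DB down: wait, then try again
      return lines                         # still pending, preserved by the local file

  def process(logfile, db, batch=1000):
      """Read the local text log file and upload it in batches."""
      pending = []
      with open(logfile) as f:
          for line in f:
              pending.append(line.rstrip("\n"))
              if len(pending) >= batch:
                  pending = upload_with_retry(pending, db)
      if pending:
          upload_with_retry(pending, db)

  if __name__ == "__main__":
      # Tiny demo with a temporary file standing in for the local log;
      # the key=value line format below is invented for the example.
      with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
          tmp.write("msg=mount vid=T05653 drive=T10B661D\n" * 5)
      process(tmp.name, Database(), batch=2)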

Slide 9: Tape Log – data storage
Medium: Oracle database
Data structure: 3 main tables (a possible layout is sketched below)
– Accounting
– Errors
– Tape history
Amount of data in store:
– 2 GB
– 15–20 million records (2 years' worth of data)
Aging: none, data kept forever
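
A possible shape of the three tables, sketched as DDL. The slides name only the tables, so every column here is a guess for illustration, and SQLite stands in for Oracle.

  # Sketch: a possible shape of the three main tables (column names are guesses).
  import sqlite3

  DDL = [
      """CREATE TABLE accounting   (day TEXT, drive TEXT, vid TEXT,
                                    files_read INTEGER, files_written INTEGER,
                                    bytes_read INTEGER, bytes_written INTEGER)""",
      """CREATE TABLE errors       (happened_at TEXT, tape_server TEXT, drive TEXT,
                                    vid TEXT, error_type TEXT, message TEXT)""",
      """CREATE TABLE tape_history (happened_at TEXT, vid TEXT, drive TEXT,
                                    action TEXT, status TEXT)""",
  ]

  conn = sqlite3.connect(":memory:")   # SQLite stands in for Oracle here
  for statement in DDL:
      conn.execute(statement)
  print([name for (name,) in conn.execute(
      "SELECT name FROM sqlite_master WHERE type='table'")])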

Slide 10: Tape Log – data processing
No additional post-processing once data is stored in the database
Data mining and visualization done online
– Can take up to a minute

Slide 11: High level – Tape Log GUI
Oracle APEX on top of the data in the DB
Trends
– Accounting
– Errors
– Media issues
Graphs
– Performance
– Problems

Slide 12: High level – Tape Log GUI (no text content in the transcript)

Slide 13: Tape Log – pros and cons
Pros
– Used by DG in his talk!
– Using standard transfer protocol
– Only uses in-house supported tools
– Developed quickly; requires little/no support
Cons
– Charting limitations
  Can live with that; see point 1 – not worth supporting something special
– Does not really scale
  OK if only looking at last year's data

Slide 14: High level – SLS
Service view for users
Live availability information as well as capacity/usage trends
– Partially reuses Tape Log DB data
Information organized per VO
– Text and graphs
– Per day/week/month

Slide 15: High level – SLS (no text content in the transcript)

Slide 16: TSMOD
Tape Service Manager on Duty
– Weekly rotating role to:
  Resolve issues
  Talk to vendors
  Supervise interventions
Acts on a twice-daily summary which monitors (see the sketch after this list):
– Drives stuck in (dis-)mounting
– Drives not in production without any reason
– Requests running or queued for too long
– Queue size too long
– Supply tape pools running low
– Too many disabled tapes since the last run
Goal: have one common place to watch
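
A sketch of how such a twice-daily summary could be assembled: each item in the list above becomes a check whose findings are collected into one report. The check bodies are stubs, and the names and output format are assumptions.

  # Sketch: assemble the twice-daily TSMOD summary from a list of checks.
  def drives_stuck_mounting():      return []   # e.g. query drive states
  def drives_out_of_production():   return []   # not in production, no reason given
  def long_running_requests():      return []   # running or queued for too long
  def oversized_queues():           return []
  def low_supply_pools():           return []   # supply tape pools running low
  def newly_disabled_tapes():       return []   # disabled since the last run

  CHECKS = [
      ("Drives stuck in (dis-)mounting",       drives_stuck_mounting),
      ("Drives not in production, no reason",  drives_out_of_production),
      ("Requests running/queued for too long", long_running_requests),
      ("Queue size too long",                  oversized_queues),
      ("Supply tape pools running low",        low_supply_pools),
      ("Tapes disabled since the last run",    newly_disabled_tapes),
  ]

  def summary():
      lines = ["TSMOD summary"]
      for title, check in CHECKS:
          findings = check()
          lines.append("%-40s %d finding(s)" % (title, len(findings)))
          lines.extend("  %s" % f for f in findings)
      return "\n".join(lines)

  if __name__ == "__main__":
      print(summary())   # one common report to watch, produced twice a day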

Slide 17: What is missing?
We often need the full chain
– When was the tape last successfully read?
– On which drive?
– What was the firmware of that drive?
Users hidden within upper layers
– We do not know which exact user is reading/writing right now
– The only information we have is the experiment name, and that is deduced from the stager hostname
Detailed investigations often require the request ID

Slide 18: Conclusion
CERN has extensive tape monitoring covering all layers
The monitoring is fully integrated with the rest of the infrastructure
It is flexible enough to support new hardware (e.g. higher capacity media)
The system is being improved as new requirements arise