IT Analytics on z Systems – IBM zAware 2.0 (z13 orderable feature)
Yuk (Patrick) Chan, IBM Senior Software Engineer
Twitter: @AboutPatrick
6/25/2015

© 2015 IBM Corporation, IT Analytics for System z
Agenda
• Background – Problem determination: existing approach and new approach
• Why the new approach (zAware)?
• Why should you care?
• zAware History
• Analytics on Linux
• zAware UI
• zAware High Level View, Operating Requirements
• Setup and Configuration
• zAware Use Cases

IT Analytics - IBM zAware
IBM System z Advanced Workload Analysis Reporter

Background – Problem Determination: the existing and new approaches
Traditional problem avoidance and determination
• Scenario based, from known problems
• Message or message ID based, e.g. Kernel Panic
• Threshold based, e.g. % of paging space available
A new and different approach: analytics
• What’s expected? From that, deduce what’s unexpected (i.e. an anomaly)
• How it works:
  - Logs representing normal system behavior (normal logs) are used to build a model
  - Realtime logs (tons of them) are analyzed against the model by IBM zAware
  - Messages are placed on a scale from highly expected to highly unexpected
  - IBM zAware analytics algorithm: IBM Research plus deep System z expertise
  - Results appear in the IBM zAware UI, so that you don’t need as much deep System z knowledge
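To make the "how it works" flow concrete, here is a minimal sketch that learns message ID frequencies from normal logs and then ranks realtime message IDs by how unexpected they are. It is not the IBM Research algorithm used by zAware, which the slides do not describe; the scoring rule (negative log of a smoothed frequency), the function names, and the sample IDs are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_model(normal_message_ids):
    """Count how often each message ID appears in logs taken as 'normal'."""
    counts = Counter(normal_message_ids)
    return {"counts": counts, "total": sum(counts.values()), "vocab": len(counts)}


def unexpectedness(model, message_id):
    """Return -log2 of the smoothed probability of a message ID.

    Add-one smoothing (an assumption of this sketch) gives IDs that never
    appeared in the normal logs a small nonzero probability, so they score
    highest instead of causing a division by zero.
    """
    count = model["counts"].get(message_id, 0)
    probability = (count + 1) / (model["total"] + model["vocab"] + 1)
    return -math.log2(probability)


def score_realtime(model, message_ids):
    """Rank the distinct realtime message IDs, most unexpected first."""
    scored = ((unexpectedness(model, m), m) for m in set(message_ids))
    return sorted(scored, reverse=True)


# Hypothetical message IDs: MSG_A is routine, MSG_B is rare, MSG_Z is new.
model = build_model(["MSG_A"] * 98 + ["MSG_B"] * 2)
for score, message_id in score_realtime(model, ["MSG_A", "MSG_Z", "MSG_B"]):
    print(f"{message_id}: {score:.2f}")
```

The point of the sketch is only the shape of the approach: nothing is keyed to a known problem scenario or threshold; the model describes what is normal, and anything far from it stands out.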

Background – Why the new approach (zAware)?
• Complex use cases, complex systems, complex software, complex interactions, complex EVERYTHING!
Difficult to detect using traditional methods
• Problems that were never seen before
• Problems that involve multiple software, firmware, system and hardware components
• Small problems/signs that grow into big problems
Difficult to diagnose and isolate problems
• Failures involving multiple software, firmware, system and hardware components. How do you find the component or system in error?
• The volume of diagnostic data is not humanly consumable
  - One company has 18.6M msgs/day -> 215/second
  - Another has 2.46M msgs/day -> 28/second
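The per-second figures follow from dividing the daily volume by the 86,400 seconds in a day. A small check of that arithmetic (the function name is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def messages_per_second(messages_per_day):
    """Average message rate, assuming traffic is spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


for per_day in (18_600_000, 2_460_000):
    rate = messages_per_second(per_day)
    print(f"{per_day:,} msgs/day is about {rate:.0f} msgs/second")
```

These are averages; real message traffic is bursty, so peak rates would be higher than the numbers shown.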

Background – Why should you care?
Without IBM zAware
• Significant failure: noticed by users, detected by traditional tools
• Time to diagnose the problem, then time to fix the problem
With IBM zAware
• Earlier problem detection: small problems show up as anomalies, and could show up across components
• AVOID, or shorten the time to recover from, an early detected problem
• Shorter time to diagnose the problem
• Shorter MTTR, SLAs better met

zAware History
• zEC12, 3Q2012: zAware V1
  - Pattern recognition for z/OS OPERLOG messages
  - Browser based graphical UI
  - APIs for vendor integration
• z13, 1Q2015: zAware V2
  - Enhanced pattern recognition algorithm
  - Enhanced graphical UI
  - Updated APIs with additional analytic details
• z13, Jun 26, 2015: zAware V2, with MCF
  - Supports Linux on System z syslog messages
  - Supports grouping of similar Linux systems

Analytics on Linux
• Uses a group of Linux systems to build a model of “expected behavior”.
• Analyzes each Linux system independently against the model.
• System X didn’t contribute to the model, but is available for analysis.
• Allows systems that are dynamic in nature (they come and go) to benefit from IBM zAware immediately.
[Diagram: normal logs from Systems A, B and C (similar systems and workload, i.e. a model group) build the model. IBM zAware analyzes realtime logs from System A and System X against it, and the IBM zAware UI answers: What is unexpected from System A? What is unexpected from System X?]
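The model group idea can be sketched as follows: pool the normal logs of several similar systems into one model, then analyze each system on its own, including one that never contributed. This is a deliberately simplified illustration in which the "model" is just the set of message IDs seen in normal logs; the real zAware model is more sophisticated, and the system names and message IDs below are made up.

```python
def build_group_model(normal_logs_by_system):
    """Pool the message IDs seen in the normal logs of every contributing system."""
    model = set()
    for message_ids in normal_logs_by_system.values():
        model.update(message_ids)
    return model


def unexpected_ids(group_model, realtime_message_ids):
    """Analyze one system independently: IDs the group model has never seen."""
    return sorted(set(realtime_message_ids) - group_model)


# Systems A, B and C form the model group (hypothetical data).
group_model = build_group_model({
    "system_a": ["sshd_login", "cron_start", "cron_end"],
    "system_b": ["sshd_login", "cron_start"],
    "system_c": ["cron_start", "cron_end", "disk_sync"],
})

# System X contributed nothing to the model but can still be analyzed.
print(unexpected_ids(group_model, ["cron_start", "kernel_oops", "sshd_login"]))
```

Because the model belongs to the group rather than to a single system, a newly provisioned system needs no training period of its own before it can be analyzed.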

How are anomalies reported?
• Model Group -> System -> Date -> Interval
• What is an interval?
  - For currently incoming logs, a temporary analysis result is provided every 2 minutes.
  - The analysis result is hardened every 10 minutes.
• A 60-minute sliding window?
  - Relationships are found among log messages within the same window.
  - Different OSes might have different window sizes: Linux – 60 minutes, z/OS – 10 minutes.
[Timeline of the log stream: PAST (10:00, 10:10, 10:20; hardened results every 10 minutes, each analyzed using 60 minutes of logs), NOW (10:22; temporary result every 2 minutes), FUTURE (10:30).]
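The interval arithmetic on this slide can be sketched with the example time of 10:22. The 10-minute interval length and the per-OS window sizes come from the slide. This sketch assumes the sliding window ends at the current time and reaches backwards, which the slide does not state outright, and all names are illustrative.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=10)  # results are hardened every 10 minutes
WINDOW_BY_OS = {                  # sliding window sizes from the slide
    "linux": timedelta(minutes=60),
    "zos": timedelta(minutes=10),
}


def current_interval(now):
    """Return (start, end) of the 10-minute interval that contains `now`."""
    start = now.replace(minute=now.minute - now.minute % 10, second=0, microsecond=0)
    return start, start + INTERVAL


def analysis_window(now, os_name):
    """Assumed: the window ends at `now` and reaches back by the OS window size."""
    return now - WINDOW_BY_OS[os_name], now


def is_hardened(interval_end, now):
    """An interval's result is final once the interval has fully elapsed."""
    return interval_end <= now


now = datetime(2015, 6, 25, 10, 22)
start, end = current_interval(now)
print("current interval:", start.time(), "to", end.time())
print("hardened yet?", is_hardened(end, now))
print("Linux window starts at", analysis_window(now, "linux")[0].time())
```

At 10:22 the interval 10:20 to 10:30 is still open, so its result is temporary; the intervals ending at 10:10 and 10:20 are already hardened.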

IBM zAware GUI – Interval View
• Height shows the number of unique message IDs
• Clicking on a bar drills down to the interval
• Color shows the anomaly score

IBM zAware GUI - Heatmap, Group
• Aggregated analysis score for a group, with the ability to drill down
• Monitor multiple plexes

IBM zAware GUI - Systems in a group
• Score by the hour
• Score by the day

IBM zAware GUI – Interval View with details
• IDs are generated
[Screenshot: interval view with numbered callouts 1 to 8]

zAware High Level View
[Diagram]
• IBM zAware host: z13, with the IBM zAware host partition (zAware server) alongside Linux on System z and z/OS
• IBM zAware monitored client: Linux on System z, z/OS, z/VM
• IBM zAware Web GUI to monitor results

Operating Requirements – IBM zAware
IBM zAware Server
• z13: z/OS and zLinux; IFL or CP (recommend 2 partial IFLs or CPs)
• zEC12, zBC12: z/OS
IBM zAware Linux Client
• Linux level: SLES 10 or later, RHEL 6 or later
• Native or as a z/VM guest
• Linux syslog daemon (/var/log/messages), RFC5424 format
  - Supported: rsyslog, syslog-ng
  - Unsupported: syslog relay (direct connection to zAware)
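For readers unfamiliar with RFC5424, the sketch below builds one syslog line in that layout: a priority value (facility times 8 plus severity), version 1, an RFC 3339 timestamp, then hostname, application name, process ID, message ID, structured data, and the message text, with "-" standing for an absent field. In practice rsyslog or syslog-ng produces this format; the function, the hostname `linux01`, and the sample message are hypothetical and only show what such a record looks like.

```python
from datetime import datetime, timezone


def format_rfc5424(facility, severity, timestamp, hostname, app_name, procid, msg, msgid="-"):
    """Build a single RFC 5424 syslog line with no structured data."""
    pri = facility * 8 + severity  # PRI as defined by the RFC
    # RFC 3339 timestamp in UTC with millisecond precision.
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    # Field order: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG
    return f"<{pri}>1 {ts} {hostname} {app_name} {procid} {msgid} - {msg}"


# Facility 4 (security/authorization), severity 5 (notice) gives PRI 37.
when = datetime(2015, 6, 25, 10, 22, 0, tzinfo=timezone.utc)
print(format_rfc5424(4, 5, when, "linux01", "sshd", "1234", "session opened for user demo"))
```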

zAware Setup
• Purchase and install the IBM zAware Feature Code (firmware)
  - Loaded from the Support Element
  - Update firmware: SE, HMC, CDU (Concurrent Driver Upgrade), MCF/MCLs
• Define I/O using HCD or HCM
• Define the zAware partition (similar to other partitions)
  - Define a profile with “zAware Mode”
  - Assign processors
  - Assign storage size
  - Assign network: HiperSockets, shareable OSA ports or IEDN; IP address
• Define storage on the zAware UI
  - Requires ECKD DASD
• Configure security / user credentials and roles on the zAware UI
• Configure analytic options
• Configure monitored clients

References
• IBM System z Advanced Workload Analysis Reporter (IBM zAware) Guide, SC27-2623-00
• Or IBM Resource Link Library → zEC12 → Publications
• IBM System z Advanced Workload Analysis Reporter (IBM zAware) Guide V2.0, SC27-2632-00
• Redbook Web Doc: IBM zAware Migration from an IBM zEC12 to an IBM z13
• Redbook: Extending z/OS System Management Functions with IBM zAware
• IBM Mainframe Insights
• The Journey to IBM zAware
• zAware Installation and Startup
• Top 10 Most Frequently Asked Questions About IBM zAware
• IBM zAware Demo

Sample Use Cases: z/OS

Identify anomaly
• Which z/OS image is having unusual message patterns? Yellow and dark blue on CB88
• When did the behavior start? Around 2:30

Drill down - configuration error
• What component is having the problem?
  - Drill down indicates 900 IRRC131I and IRRC144I messages per interval.
  - A review of SYSLOG showed that this was the result of work being performed in the LDAP address spaces.
  - Further analysis showed that the LDAP PC Callable Interface was not enabled.
  - At 6:40, the function was enabled, and the 131I and 144I messages were no longer generated.
• Impact
  - Unnecessary messages block the ability to see anything else.
  - They also impact the ability to look at the console.
• When did the behavior start? Around 2:30
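The "messages per interval" figure in this drill down is a count of each message ID within a 10-minute interval. A minimal sketch of that bookkeeping, outside of zAware, might look like the following. The record layout (timestamp, message ID) and the sample timestamps are assumptions for illustration; only the message IDs come from the slide.

```python
from collections import Counter
from datetime import datetime


def interval_start(timestamp):
    """Round a timestamp down to the start of its 10-minute interval."""
    return timestamp.replace(minute=timestamp.minute - timestamp.minute % 10,
                             second=0, microsecond=0)


def count_by_interval(records):
    """Count occurrences of each message ID per 10-minute interval.

    `records` is an iterable of (timestamp, message_id) pairs.
    """
    counts = Counter()
    for timestamp, message_id in records:
        counts[(interval_start(timestamp), message_id)] += 1
    return counts


# Made-up records; a real log would hold hundreds per interval.
records = [
    (datetime(2015, 6, 25, 2, 31), "IRRC131I"),
    (datetime(2015, 6, 25, 2, 38), "IRRC131I"),
    (datetime(2015, 6, 25, 2, 39), "IRRC144I"),
    (datetime(2015, 6, 25, 2, 41), "IRRC131I"),
]
for (start, message_id), n in sorted(count_by_interval(records).items()):
    print(start.time(), message_id, n)
```

A message ID whose per-interval count jumps from zero to hundreds, as in this use case, is easy to spot once the counts are grouped this way.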

Identify unusual behavior – quickly
• Which z/OS image is having unusual message patterns? Recurring yellow and dark blue on CB8C
• When did the behavior start? After an IPL at 13:30

Identify unusual behavior – quickly
• Which subsystem or component is abnormal? Examine high-scoring messages
• When did the behavior start? When did the messages start to occur?
• Were similar messages issued previously? Easily examine prior intervals or dates
• Moving left and right by interval shows messages due to TNPROC being cancelled by TCP/IP

Identify unusual behavior after a change
• Are unusual messages being issued after a change?
  - A new / updated workload (OS, middleware, apps) was introduced
  - Detected as yellow bars
• Once the messages are confirmed as OK, you can rebuild your system model, and the workload is then understood as “normal.”
• A new model included several days of the new workload

Sample Use Cases: zLinux

Demo – 3 mirrored Linux systems got similar authentication errors
• After an incident, zAware helps narrow down a problem that traditional methods wouldn’t.
• 3 mirrored systems have an anomaly at the same time?!?

Note: The detailed view is removed for security reasons.
• The detailed view shows that all 3 systems have similar messages caused by logins from the same host.
• This is not necessarily a problem. It could mean a new employee trying to log in to a system that doesn’t have the proper user ID set up, or it could be a cyber attack.
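The observation behind this demo, the same source host appearing in authentication errors on several systems at once, can be sketched as a simple grouping step. This is not how zAware reports the anomaly; it only illustrates the follow-up check an operator might do by hand. The event layout, system names, and addresses are made up.

```python
from collections import defaultdict


def hosts_seen_on_multiple_systems(events, min_systems=2):
    """Find source hosts that appear in events on at least `min_systems` systems.

    `events` is an iterable of (system_name, source_host) pairs, for example
    one pair per failed login extracted from each system's log.
    """
    systems_by_host = defaultdict(set)
    for system, source_host in events:
        systems_by_host[source_host].add(system)
    return {host: sorted(systems)
            for host, systems in systems_by_host.items()
            if len(systems) >= min_systems}


# Hypothetical failed-login events from three mirrored systems.
events = [
    ("mirror1", "10.0.0.5"),
    ("mirror2", "10.0.0.5"),
    ("mirror3", "10.0.0.5"),
    ("mirror1", "10.0.0.9"),
]
print(hosts_seen_on_multiple_systems(events, min_systems=3))
```

Whether the result points to a misconfigured account or to an attack still needs human judgment, as the slide notes.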

Demo – System Upgrade and Restart
[Screenshot labels: logger offline; testing; system or logger offline; system booting]
• zAware paints a picture of “what happened” based on anomalies.

Demo – Kernel pointer dereference error
[Screenshot labels: zAware didn’t receive log data; system restarted; kernel pointer dereference error]

Demo – Kernel pointer dereference error

Demo – Repeating Error
• More anomalous than normal

