Connecticut Computer Measurement Group, 2015 Spring Meeting: 5 Ingredients to Executing Application Performance Management on z/OS
Motivation: “The translation of IT metrics into business meaning (value) is what APM is all about.” Agreed? If so, what are the prerequisites for getting this done efficiently? Is there anything special about doing APM on the mainframe? While user satisfaction is based on response time and availability, you also have to watch the consumed CPU seconds on z/OS to optimize and control costs.
Ingredient 1 One View
Our daily life is full of “communication issues” caused by different (technical) languages, different metric systems, missing information, looking at different spots, … Let’s avoid those troublemakers wherever we can. How would this translate into APM? One view, one language, one solution.
End-To-End: End User Perspective. No Gaps. No Blind Spots. One View on the whole Environment: Browsers / Rich-Client, Mobile Apps, ESB/MB/MQ, Mainframe, .NET, Java, Web Server, CTG, Database.
End-To-End: Key Metrics for each Tier. One Hotspot!
Ingredient 2 Top Down
Detect the Hotspot and Answer Open Questions: Where is the issue? On my Mobile App? A poor Network Connection? On the Web Server? On the Mainframe/DB2? … What’s the Root Cause? How could it be fixed?
30,000 Feet: One user tap in the mobile app introduced 12 calls to CICS and generated 2,671 DB2 calls. Which programs are executed and why? Which program generated these DB2 statements?
Detect the Hotspot and Answer Open Questions: Where is the issue? On my Mobile App? – No. A poor Network Connection? – No. On the Web Server? – At least it’s part of the problem. On the Mainframe/DB2? – Yes. What’s the Root Cause? Inefficient use of the mainframe: too many DB2 statements. How can it be fixed? We need more details to answer that.
Zoooooooooom. Who: A Java application is triggering the mainframe. How: Using the CICS Transaction Gateway. What: Call stack for all programs and DB statements on z/OS.
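The fan-out on this slide can be put into numbers with simple arithmetic. The following is a minimal sketch that uses only the counts quoted above (1 tap, 12 CICS calls, 2,671 DB2 calls); the helper name `fan_out` is illustrative and not part of any APM product.

```python
def fan_out(parent_calls: int, child_calls: int) -> float:
    """Average number of child calls triggered per parent call."""
    if parent_calls <= 0:
        raise ValueError("parent_calls must be positive")
    return child_calls / parent_calls


# Counts taken from the slide: one tap, 12 CICS calls, 2,671 DB2 calls.
user_taps = 1
cics_calls = 12
db2_calls = 2671

cics_per_tap = fan_out(user_taps, cics_calls)
db2_per_cics = fan_out(cics_calls, db2_calls)
db2_per_tap = fan_out(user_taps, db2_calls)

print(f"CICS calls per tap:      {cics_per_tap:.0f}")
print(f"DB2 calls per CICS call: {db2_per_cics:.1f}")
print(f"DB2 calls per tap:       {db2_per_tap:.0f}")
```

On average that is roughly 223 DB2 calls per CICS call, which is the kind of ratio that points at the mainframe tier as the place to dig deeper.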
Ingredient 3 Start Early
Trend your Builds with valuable KPIs: Agile development and continuous integration force teams to automate their build and testing processes and shorten development cycles from months (years?) to weeks or even days. To maintain such a system you have to watch your builds with a handful of smart KPIs. Is this also applicable to z/OS? If yes, what would be a smart set of KPIs?
Test Automation
Build #   Test Case      Status
17        testPurchase   OK
17        testSearch     OK
18        testPurchase   FAILED
18        testSearch     OK
19        testPurchase   OK
19        testSearch     OK
20        testPurchase   OK
20        testSearch     OK
Test framework results are shown next to detailed z/OS data (# Trans., # DB2, # Abend). Let’s look behind the scenes: We identified a regression. An abend is probably the reason for the failed tests. Problem fixed, but now we have an architectural regression. Problem solved: now we have functional and architectural confidence.
What you currently measure: # functional test failures, overall duration. What you should measure, related to a user action: # of z/OS transactions, # executed programs, # executed DB2 statements, # MQ calls, # abends, CPU seconds, execution time of tests, …
Quality Gate between Stages. Stages shown: Release, Acceptance Testing, Unit Testing, Performance Testing; Automated / Semi-Automated. Monitor Tests. Analyze Results. Integrate with Build Infrastructure.
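The KPIs listed above can be trended per build and compared against a known-good baseline. The sketch below is one possible way to do that; the `BuildKpis` record, the `regressions` helper, the 10% tolerance, and every number in `history` are hypothetical and only illustrate the idea of separating functional from architectural regressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildKpis:
    build: int
    failed_tests: int
    transactions: int    # z/OS transactions triggered by the test suite
    db2_statements: int  # DB2 statements executed by the test suite
    abends: int


def regressions(baseline: BuildKpis, current: BuildKpis,
                tolerance: float = 0.10) -> list[str]:
    """Compare a build against a known-good baseline and list findings."""
    findings = []
    if current.failed_tests > 0:
        findings.append(f"functional: {current.failed_tests} failed test(s)")
    if current.abends > baseline.abends:
        findings.append(f"abends rose from {baseline.abends} to {current.abends}")
    for name in ("transactions", "db2_statements"):
        before = getattr(baseline, name)
        after = getattr(current, name)
        # Flag growth beyond the tolerance as an architectural regression.
        if before > 0 and after > before * (1 + tolerance):
            findings.append(f"architectural: {name} rose from {before} to {after}")
    return findings


# Hypothetical numbers, NOT taken from the slide.
history = [
    BuildKpis(build=17, failed_tests=0, transactions=12, db2_statements=120, abends=0),
    BuildKpis(build=18, failed_tests=1, transactions=12, db2_statements=120, abends=1),
    BuildKpis(build=19, failed_tests=0, transactions=12, db2_statements=480, abends=0),
    BuildKpis(build=20, failed_tests=0, transactions=12, db2_statements=120, abends=0),
]

baseline = history[0]
for kpis in history[1:]:
    print(kpis.build, regressions(baseline, kpis) or "OK")
```

With these made-up values, a build whose tests all pass can still be flagged because its DB2 statement count grew, which is exactly the case a pure pass/fail test report would miss.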
Ingredient 4 Focus
Let’s assume: You tuned your top X z/OS transactions; they are really fast and efficient now. Those transactions account for ~90% of your CPU time on z/OS. So your mainframe environment looks like this:
Efficient, Fast & Beautiful: Are you now done with APM?
Now you’ve introduced the mobile users… You can’t let these new small, agile drivers ding your beautiful car (while they are texting). But that’s exactly what can happen when distributed services use the mainframe: Too many mainframe transactions can be triggered. Huge/expensive transactions can be triggered where only a very small portion of the response is used/required. The mainframe is simply not being used as it was designed to be used. How do you tackle this issue and prevent those dings?
Focus on user actions: Analyze the top x user actions/transactions based on production data. What’s the use case? End-to-end APM can tell you what your top user actions are, by invocation count or response time. Which mainframe transactions are currently invoked to serve this use case? End-to-end APM can tell you exactly which mainframe transactions, programs, and DB2 activity are generated by these user actions.
Focus just on solutions for these transactions: Analyze the top x user actions/transactions based on production data. How many times is the same transaction/data needed? Could it be cached on the distributed side? Are new transactions needed to fit the needs of the distributed side in the most efficient way? Ultimately, what you want is an efficient, fast, and beautiful experience for your mobile users.
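The two questions above (what are the top user actions, and which mainframe calls repeat often enough to be cached) can be answered from trace data with a few lines of code. This is a minimal sketch under stated assumptions: the record layout, the transaction names, and the helpers `top_user_actions` and `caching_candidates` are all hypothetical, not the output format of any specific APM tool.

```python
from collections import Counter

# Hypothetical production trace records:
# (user action, mainframe transaction, request key)
records = [
    ("show_balance", "ACCT", "cust=1"),
    ("show_balance", "ACCT", "cust=1"),
    ("show_balance", "ACCT", "cust=2"),
    ("transfer", "XFER", "cust=1"),
    ("show_balance", "ACCT", "cust=1"),
    ("search", "SRCH", "q=loan"),
]


def top_user_actions(recs, x):
    """Rank user actions by invocation count and return the top x."""
    counts = Counter(action for action, _txn, _key in recs)
    return counts.most_common(x)


def caching_candidates(recs, min_repeats=2):
    """Mainframe calls requested repeatedly with an identical key."""
    repeats = Counter((txn, key) for _action, txn, key in recs)
    return {call: n for call, n in repeats.items() if n >= min_repeats}


print(top_user_actions(records, 1))
print(caching_candidates(records))
```

In this toy data set the same `ACCT` request for one customer is issued three times, so it would be a first candidate for caching on the distributed side.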
Efficient, Fast & Beautiful
Ingredient 5 Always On
Always on – Brilliant Idea: Now we should monitor ALL transactions on the mainframe, 24/7, in production? Really?! We try to reduce MIPS wherever we can. How many MIPS are burned for this purpose? What data should be captured in production? What benefits could be worth this investment?
Prerequisites for always-on APM: 1. You did APM in pre-production: you know what your transactions look like, you know how they are used from the distributed side, and you fixed any performance issues. 2. Based on this knowledge you are able to predict what production data you are interested in, how much data will be captured, and how big the investment is to capture this data (MIPS, APM infrastructure).
Reason #1 – Mobile Workload Pricing
Reason #2 – Total Cost of Ownership: What is the operational cost for your new web application or mobile application across the entire enterprise? Which user actions are completed in this rollout? How do these new/modified actions affect mainframe activity? Impact on your KPI metrics: did these increase/decrease compared to the previous version? Baselining. Session comparison.
Reason #3 – User Satisfaction: How is the response time? Who are the main contributors? What’s the bounce rate? What’s the conversion rate? How many user actions are failing? Failing due to z/OS activity, MQ queues, CICS problems?
Reason #4 – Customer Care: Let’s discover issues before they affect the end user. If a user does experience an issue: What went wrong with this particular transaction? Attach this information to a ticket and pass it to development. No need to reproduce the issue; all the information is there. The root cause is identified in minutes, with no war room. Fix it and prevent other users from experiencing the same issue.
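The user-satisfaction questions above map onto a few ratios that can be computed from session data. The sketch below assumes common web-analytics definitions (bounce = session with a single user action, conversion = session that reaches a successful goal action); those definitions, the `UserAction` record, and the sample sessions are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAction:
    name: str
    failed: bool = False


# Hypothetical sessions, each a list of user actions in order.
sessions = [
    [UserAction("login")],
    [UserAction("login"), UserAction("search"), UserAction("purchase")],
    [UserAction("login"), UserAction("purchase", failed=True)],
    [UserAction("login"), UserAction("search")],
]


def bounce_rate(sess):
    """Share of sessions that consist of exactly one user action."""
    return sum(1 for s in sess if len(s) == 1) / len(sess) if sess else 0.0


def conversion_rate(sess, goal="purchase"):
    """Share of sessions that contain a successful goal action."""
    if not sess:
        return 0.0
    converted = sum(
        1 for s in sess if any(a.name == goal and not a.failed for a in s)
    )
    return converted / len(sess)


def failure_rate(sess):
    """Share of all user actions that failed."""
    actions = [a for s in sess for a in s]
    return sum(a.failed for a in actions) / len(actions) if actions else 0.0


print(bounce_rate(sessions), conversion_rate(sessions), failure_rate(sessions))
```

Tying each failed action back to the z/OS, MQ, or CICS activity behind it is what the end-to-end trace adds on top of these plain ratios.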
Conclusion
And don’t forget the recipe… One view – over all departments/teams, end-to-end from the tap on the mobile device down to the DB2 backend. Top down – Start at 30,000 feet and dig into the details for root cause analysis. Start early – Catch performance issues as early as possible. Focus – Just focus on the applications/transactions that matter. Always on – Trend CPU resources and response time in production, for 100% of the transactions of interest, 24x7.