Demo Modules
3 Steps to diagnose a performance problem
Analyzing Individual Transactions
Exploring Transaction Data
Isolating the problem
A. Three steps to diagnose an Application Performance Problem (5 min duration)
This is the essential demo highlighting key AppInternals features. It tells the story of how we can troubleshoot an application problem in three simple steps.
1) This is the End To End monitoring Dashboard for our TradeFast application. TradeFast is a multi-tier .NET application that uses web services connecting to a SQL Server. With the instrumentation of our APM solution we are able to capture the actual end user experience from real users, as well as monitor all the critical components of the application. The unified dashboards allow us to bring multiple performance datasets under the same pane of glass. They are fully customizable and can be viewed on mobile and tablet devices. Dashboards update in real time and can be used to present data to different teams. You can easily build a custom experience for Operations, Development, Business or IT Management. The map shows where our users are coming from and what response time they are experiencing.
2) We have noticed that users in California are experiencing degraded performance. A sizable portion of our end users are seeing increased application latency: 2.5% of users are experiencing delayed response times. Let us investigate the degradation further by inspecting how individual users are experiencing the home page.
Users from CA accessing the TradeFast app
Slowest transactions
Most of the delay appears to be server side
3) We are observing all users who visited our TradeFast app from California. Every dot represents a page view by an individual user. We want to investigate the big red dots, which indicate transactions with severe performance degradation. With our Big Data approach we capture and store every transaction end to end. Having this complete data set allows us to troubleshoot and identify root cause as well as perform historical trending. In the current time frame, the category of delay pie chart on the right indicates that the majority of the delays for the home page are due to slower response times from our servers, as shown by "First request to First byte".
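As a rough illustration of how a category-of-delay breakdown like the pie chart could be derived, the sketch below sums per-page-view timing components and reports each category's share of the total. The field names (`dns_ms`, `connect_ms`, `first_byte_ms`, `render_ms`) and the sample values are assumptions made for this example; they are not AppInternals data or its API.

```python
from collections import defaultdict

# Hypothetical page-view timing records (milliseconds); values are made up.
page_views = [
    {"dns_ms": 5, "connect_ms": 20, "first_byte_ms": 4200, "render_ms": 150},
    {"dns_ms": 4, "connect_ms": 18, "first_byte_ms": 5100, "render_ms": 120},
    {"dns_ms": 6, "connect_ms": 22, "first_byte_ms": 300, "render_ms": 140},
]

def delay_shares(views):
    """Return each timing category's share of the total delay (0.0 to 1.0)."""
    totals = defaultdict(float)
    for view in views:
        for category, ms in view.items():
            totals[category] += ms
    grand_total = sum(totals.values())
    return {category: ms / grand_total for category, ms in totals.items()}

shares = delay_shares(page_views)
# The category with the largest share is where most of the delay is spent.
dominant = max(shares, key=shares.get)
print(dominant)
```

With this made-up sample the server-side "first byte" component dominates, which mirrors the situation described in the demo.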
Let's take one of the slowest transactions and look at what may be contributing to the delay
4) We can select one of the slowest transactions and get more details for that user and page view. In this case the majority of the delay was caused by a server side delay on the backend. The DNS, connection and browser rendering times were all negligible.
5) The Server tab gives us more detail on where this backend delay is coming from: Web, Application code, Database or other Remote calls. We can dissect the backend transaction and find the bottlenecks on the server side. Let's find out why the server response time is so high on the backend for this transaction.
6) AppInternals can give us a per-transaction map of delay for the backend multi-tier transaction. We can quickly identify the bottlenecks for this one transaction. It looks like a delay in an application method is the biggest culprit. There are several different data sets we can review for this transaction.
We can search on the previously mentioned method and see whether it is a problem across the board or limited to one user, one server, etc.
7) Now that we know the offending class, we can verify whether this is a problem across the board. As you can see, these degradations are happening periodically and impacting a lot of users (a dozen dots above the 5s response time line).
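The "across the board or just one user, one server" check can be sketched as a simple grouping over transaction records. The record layout (`user`, `server`, `class_name`, `seconds`) and the sample data are hypothetical; only the 5s line comes from the chart described above.

```python
from collections import defaultdict

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"user": "sam", "server": "web01", "class_name": "BondRequestHandler", "seconds": 6.2},
    {"user": "ann", "server": "web02", "class_name": "BondRequestHandler", "seconds": 5.8},
    {"user": "raj", "server": "web01", "class_name": "BondRequestHandler", "seconds": 0.4},
    {"user": "sam", "server": "web02", "class_name": "stockentity", "seconds": 0.3},
]

def slow_hits(records, class_name, threshold_s=5.0):
    """Collect the distinct users and servers with slow calls into `class_name`."""
    affected = defaultdict(set)
    for rec in records:
        if rec["class_name"] == class_name and rec["seconds"] > threshold_s:
            affected["users"].add(rec["user"])
            affected["servers"].add(rec["server"])
    return affected

hits = slow_hits(transactions, "BondRequestHandler")
# More than one affected user and server suggests a widespread problem.
widespread = len(hits["users"]) > 1 and len(hits["servers"]) > 1
print(sorted(hits["users"]), sorted(hits["servers"]), widespread)
```

If only one user or one server showed up in the result, the investigation would shift toward that client or host rather than the code itself.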
B. Analyzing Individual Transactions (5 min duration)
The goal for this module is to dive deeper into application code and internals and to show our depth to Developers.
Key things to stress:
Web-based interface
Data that is easy to collaborate with
Dynamic maps for every transaction
Multi-tier stitching
Very rich record about each transaction
Transaction details: let's deep dive into CTSecure and BondRequestHandler
Summary of transaction
1) We can investigate individual transactions in depth in the Transaction Details view. We have identified CTSecure and BondRequestHandler as bottlenecks.
Top calls
2) We can do a broader inspection of the internals of the transaction. For example, we can get a list of the slowest classes and methods and see which tier they are running on. This is very useful if we want to systematically review performance and work with development to improve it.
SQL queries
3) Transaction details also capture which SQL statements were executed and their timing. SQL or remote calls are often the culprit behind slow application performance. Once you have narrowed down the list of SQL statements to be optimized, a DBA can provide more insight into how to speed them up.
Exceptions
4) Knowing application performance is critical, but we also have to understand whether we have other internal failures. In certain cases, we can encounter transactions that run very quickly but are in an error state. The example here shows a transaction that executed in milliseconds but failed with a malformed SQL query, so the end user received an error message in return. We can identify errors as well as performance bottlenecks.
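To make the "fast but failed" case concrete, here is a minimal sketch that picks out transactions that completed quickly yet carry an exception. The record fields (`id`, `duration_ms`, `exception`) and the 100 ms cutoff are assumptions chosen for illustration.

```python
# Hypothetical transaction records; fields and values are illustrative only.
transactions = [
    {"id": 1, "duration_ms": 12, "exception": "malformed SQL query"},
    {"id": 2, "duration_ms": 5400, "exception": None},
    {"id": 3, "duration_ms": 35, "exception": None},
]

def fast_failures(records, fast_ms=100):
    """Return IDs of transactions that finished quickly but raised an exception."""
    return [
        rec["id"]
        for rec in records
        if rec["exception"] is not None and rec["duration_ms"] < fast_ms
    ]

print(fast_failures(transactions))
```

A response-time view alone would rate transaction 1 as healthy, which is why errors need to be inspected alongside timings.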
Performance metrics contributing to user experience
5) We have a snapshot of key system metrics taken during the execution of the transaction. This is useful because it can point out system bottlenecks that are impeding application performance. In this case our system CPU is pegged while this transaction is running, most likely contributing to the delays.
We can then take this analysis and create a report to send to relevant teams.
6) Collaboration is critical to achieving an Enterprise APM strategy. Individual transaction performance reports are easy to share. We can get a unique link to the full transaction report and send it to our coworkers so they can get involved and help.
C. Exploring Transaction Data (5 min duration)
This module provides an introduction to accessing transaction data and quickly narrowing down what we are looking for.
Key points:
Big Data repository: storing all transactions efficiently
Collaboration
Fast access to billions of records
Flexible and simple filtering of results
Open-ended search
Numerous transaction fields to search on
TTW allows us to search a wide variety of data in a very short amount of time. Here we are searching on a URL being accessed by “Sam”
1) The Warehouse provides a very intuitive data access mechanism. We can search on different attributes, and type-ahead gives us a clue about the available options. In this case we are searching on a URL and a user name, which will give us all transactions for the home page by user sam.
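Conceptually, a search on a URL plus a user name is a filter that keeps only the records matching every criterion. The sketch below shows that idea over in-memory records; the `search` helper and the field names are hypothetical and do not represent the Warehouse's actual query interface.

```python
# Hypothetical transaction records; the search fields are assumptions.
transactions = [
    {"url": "/home", "user": "sam", "seconds": 6.1},
    {"url": "/home", "user": "ann", "seconds": 0.9},
    {"url": "/quote", "user": "sam", "seconds": 0.4},
]

def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        rec
        for rec in records
        if all(rec.get(field) == value for field, value in criteria.items())
    ]

matches = search(transactions, url="/home", user="sam")
print(matches)
```

Adding or removing keyword arguments widens or narrows the result set, which is the same effect as adding or removing search attributes in the demo.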
We can search on all exceptions thrown.
2) We can also search for transactions that have thrown exceptions and ended in error. These are just as important as the slow transactions because they result in a poor user experience (users getting 500 or 404 errors).
Stockentity class. Execution times
3) We can search on numerous attributes. In this case we are searching for all transactions that have used the stockentity class. It is very easy to find and analyze application code execution with this information. An application developer can get a list and find areas for improvement very quickly.
Can search on a wide variety of criteria
More searching
The Warehouse is a Big Data store. It holds very detailed, complete records and can scale to billions of records per instance. The store is efficient and uses minimal disk space to store the transaction records.
E. Isolating the problem area (5 min duration)
This module showcases some quick triage scenarios and fault domain isolation using AppInternals data in Dashboards.
Key points:
All web-based, real-time and role-based UIs
Collaboration
Cross-domain rich data: app, system, network, database
Collecting high-resolution data every second
Easy workflows
Dashboards can be shared by publishing the link
Collaboration is key to a successful Enterprise APM implementation. Dashboards can be shared with colleagues through a URL.
Here we see our slow pages metric in violation. We can double click to drill into the violation.
Let's walk through a few scenarios showing how we can isolate the problem area and focus our analysis in the right direction. I am seeing that one of my key metrics, %Slow pages, is exceeding the thresholds set. With a double click we can drill into the aggregate and see that it is really the home page showing performance degradation. We have narrowed it down.
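The aggregate-then-drill-down idea can be sketched in a few lines: compute the percentage of slow page views overall, then per page, to see which page drives the violation. The 5 second definition of "slow" and the sample page views are assumptions for this example; the actual metric definition and thresholds are whatever the dashboard is configured with.

```python
from collections import defaultdict

# Hypothetical page views as (page, response time in seconds); values are made up.
page_views = [
    ("home", 6.0), ("home", 5.5), ("home", 0.8),
    ("quote", 0.5), ("quote", 0.7),
]

def percent_slow(views, slow_s=5.0):
    """Return the percentage of views slower than `slow_s`, overall and per page."""
    counts = defaultdict(lambda: [0, 0])  # page -> [slow views, total views]
    for page, seconds in views:
        counts[page][0] += seconds > slow_s
        counts[page][1] += 1
    per_page = {page: 100.0 * slow / total for page, (slow, total) in counts.items()}
    overall = 100.0 * sum(slow for slow, _ in counts.values()) / len(views)
    return overall, per_page

overall, per_page = percent_slow(page_views)
# Drill down: the page with the highest slow percentage drives the aggregate.
worst_page = max(per_page, key=per_page.get)
print(overall, worst_page)
```

Here the aggregate hides the fact that every slow view belongs to one page, which is what the double-click drill-down reveals in the dashboard.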
We can see the exact time the event occurred. The performance degradation occurred at 16:33.
At the same time we see 2 other key indicators deviate from their norm.
At the exact same time I am seeing two of my other key indicators change. There is a slight increase in Server processing time. From this observation, the problem is not network related. Next I notice that my Application Component charts are showing more processing in certain areas. Let's drill down into that. I am maximizing the chart so we can inspect it.
CTSecure app code RT is spiking
I can see more detail, and it looks like CTSecure Application Code (Other classes) is periodically showing more processing time. Drilling down further with a double click.
The CTSecure instance is degrading the most. I can right click and drill down to see all code executing within the CTSecure instance.
In the sparkline view it is easy to compare time series charts. I am observing that the CTSecure instance is the most degraded of all instances. I want to understand which individual classes and code may be contributing to the degradation. I right click on CTSecure and drill down to all the code executing in CTSecure Application Code.
BondRequestHandler appears to be showing the largest delays. We were able to isolate user experience to an individual method executing within a class of code. Let's look at why this is happening.
It looks like the BondRequestHandler class is showing the biggest delays. I was able to isolate the user experience problem step by step: server side and not network related, then a specific instance being the bottleneck, then a specific application class degrading the overall experience.
We can right click on the resource and drill down into system metrics to see if there might be a resource shortage.
It looks like this is a periodic issue. My next question is “Why is the BondRequestHandler periodically degrading?” I can drill down into System Metrics and see if there is a resource shortage.
CPU is spiking, memory is low, RT is up
There definitely seems to be a shortage of the CPU cycles needed to process the application workload. CPU is spiking and we have a higher rate of memory processing in this instance. We have to look into optimizing the application code, or adding more CPUs, to speed up performance of this app under these workloads.