Performance Baselining, Benchmarking, and Monitoring for SQL Server

Kevin Kline Director of Technology, SQL Server Solutions Quest Software, Inc

Introduction
What will we cover today?
What is a performance baseline?
What is a performance benchmark?
How do you perform long-term, 24x7 monitoring?
The white paper describing these concepts will be available on the Quest website in a few days.

The importance of service monitoring
Service monitoring = observing the health of a service in real time. It enables DBAs to observe service behavior proactively and quantitatively. A key objective is to know, quantitatively, what the performance of a given server is and then manage to that standard. It allows you to avoid depending on “user experience” as the key indicator of performance. It helps find problems even when users aren’t on the system.

Two approaches to monitoring
Proactive monitoring – take baseline measurements, benchmark metrics, and maintain an active monitoring regime. Ensures the most comprehensive service levels.
Exceptions monitoring – only exceptions to “normal” service are monitored. Provides the most coverage for the least amount of time.

Proactive Monitoring
Benefits – Best chance of catching errors before they occur. Gets you out of firefighting mode. Best information about environment and apps. Better long-term decisions.
Drawbacks – Requires more time and deeper understanding of apps. Requires review and analysis of charts, graphs, and other information on an on-going basis.

Building a regimen – start with the Baseline
Build a baseline performance profile. It makes you familiar with the operational behavior of each app/server. It clearly documents what is “normal” for a server and/or application. It identifies the types of problems that arise even when the server is behaving normally: some problems require a response, and some require no response.

Where does it all begin? First, you will need to set a baseline for each application. You can do this before reviewing the technologies involved or developing your regimen, which makes the baseline a good place to start. With this baseline, you will begin to become familiar with the operational behaviors of each application. The goal of baseline monitoring is to document clearly how the application acts under normal conditions. Does the application eat up all the CPU cycles made available to it? Does it need so much memory that paging occurs? Are all the user requests serviced? After you understand how the application acts under normal load, you will be able to answer these questions. It is important to understand the application in this way so that future problems can be identified when they occur.

Goal of a baseline
Tells you all about the performance of a server under normal conditions. Document and understand as many background processes as possible (if not all). Build in filters to catch “do not respond” situations before DBAs see them; otherwise, apathy can set in.

You do not want your monitoring personnel to spend countless hours researching symptoms that simply end up needing no response. If your monitoring personnel learn on the job that a large number of symptoms exhibited by your application simply need to be ignored, they will pay less attention to your application and may miss a real error.

Building a baseline
You need a single graphic representation, along with enough information to interpret the results. Use System Monitor: real-time or saved to a log. Choose a sampling interval that balances the need for data against the disk I/O needed to record the collections. Every 15 seconds is the default. Local vs. remote monitoring? There are pros and cons to each. Next, you must assign and assess the SysMon counters. (You need a few more counters for building the baseline than you need for daily monitoring.)

You must determine which server you will use to monitor your SQL Server. You can monitor remotely, but use of the counters across a network connection for an extended period of time could congest traffic on your network. If you have space on your SQL Server for the performance log files, it is recommended that you record performance log information locally. Because of these performance concerns, the use of performance counters needs to be properly implemented. You will need to test the number of counters and frequency of collection that best suits your environment. For the initial baseline, however, it is recommended that as many counters as desired be used with the highest frequency available.

SysMon Counters, OS
Memory – Pages/sec
The number of pages read from or written to disk to resolve hard page faults, situations where a process requires code or data that must be retrieved from disk. It is the primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory: Pages Input/sec and Memory: Pages Output/sec, and it displays the difference between the values observed in the last two samples, divided by the duration of the sample interval. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory: Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.
Network Interface – Bytes total/sec
The number of bytes traveling over the network interface per second. When it is dropping or trending lower, investigate whether network problems are interfering with your application.
PhysicalDisk - Disk Transfers/sec
The rate of read and write operations on the disk. Define a counter for each physical disk on the server. On some operating systems you must enable Diskperf first.
Processor - % Processor Time
The percentage of time that the processor is executing a non-idle thread, acting as the primary indicator of processor activity. It displays the average percentage of busy time observed during the sample interval. It is calculated by measuring the time that the processor spends executing the thread of the idle process in each sample interval, and subtracting that value from 100 percent. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) It can be viewed as the percentage of the sample interval spent doing useful work. If all processors devoted to SQL Server are at 100 percent utilization, end user requests are probably being ignored.
Also: % Privileged Time and % User Time
% Privileged Time measures the amount of time the system processor spends executing NT kernel commands. Much of this time is associated with processing SQL Server I/O requests. % User Time measures the percentage of processor time spent executing user applications such as SQL Server. If SQL Server finds all or most of the required objects in the data cache, relatively little I/O is generated, resulting in a % Privileged Time as low as 5-15 percent. % User Time, however, will climb correspondingly high. If, on the other hand, SQL Server generates a large amount of I/O, % Privileged Time may be higher (30-40 percent), and % User Time will be substantially lower (60-70 percent). Both counters are useful in determining how different types of operations are using the system processor(s). If your system is spending too much time doing I/O, you may need to investigate the disk subsystem and how to relieve it of some I/O, or you may need to add more memory. If your system is spending more time doing SQL Server computing, you may want to investigate denormalization, reducing the number of joins, horizontal partitioning, or upgrading your system processor.
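The two calculations described for Memory – Pages/sec and Processor - % Processor Time can be illustrated with a small Python sketch. This is only an illustration of the arithmetic, not how System Monitor is implemented; the function names and sample values are hypothetical.

```python
def rate_per_second(previous_count, current_count, interval_seconds):
    """Difference between the last two samples, divided by the sample interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_count - previous_count) / interval_seconds


def pages_per_second(prev_in, cur_in, prev_out, cur_out, interval_seconds):
    """Pages/sec as the sum of Pages Input/sec and Pages Output/sec."""
    pages_in = rate_per_second(prev_in, cur_in, interval_seconds)
    pages_out = rate_per_second(prev_out, cur_out, interval_seconds)
    return pages_in + pages_out


def percent_processor_time(idle_seconds, interval_seconds):
    """Busy percentage: 100 minus the share of the interval spent in the idle thread."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    idle_percent = 100.0 * idle_seconds / interval_seconds
    return 100.0 - idle_percent


# Hypothetical raw samples taken 15 seconds apart (the SysMon default interval).
print(pages_per_second(1000, 1300, 400, 550, 15))
print(percent_processor_time(3.0, 15))
```

With these made-up samples, 300 pages were read and 150 written during the interval, and the idle thread ran for 3 of the 15 seconds.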

SysMon Counters, Database I/O
SQLServer:Access Methods - Full Scans/sec
The number of unrestricted full table or index scans per second. The lower, the better.
SQLServer:Buffer Manager – Buffer Cache Hit Ratio
The percentage of pages found in the buffer pool without requiring a read from disk. The higher the number, the less disk I/O is being generated. In a well-tuned system, this value should be 80 or higher.
SQLServer:Databases - Log Growths
The total number of log growths for a given database. This should be a low number.
SQLServer:Databases Application Database - Percent Log Used
The percentage of space in the log that is in use. It will vary over time, but should not get completely full.
SQLServer:Databases Application Database - Transactions/sec
The number of transactions started for the database per second. It should dip temporarily during checkpoints. When transactions start to queue, your disk I/O may not be fast enough. Generally, it won’t be higher than ~250 (7k rpm) or ~280 (10k rpm) per disk.
Also, for physical and logical I/O, see the white paper.

SysMon Counters, Locking
SQLServer:Latches – Average Latch Wait Time
The average latch wait time (in milliseconds) for latch requests that had to wait. When high, your server may be facing contention for resources, particularly memory or I/O.
SQLServer:Locks – Average Wait Time
The average amount of wait time (in milliseconds) for each lock request that resulted in a wait. Watch for upward trending.
SQLServer:Locks – Lock Waits/sec
The number of lock requests that could not be satisfied immediately and forced the caller to wait before the lock was granted.
SQLServer:Locks - Number of Deadlocks/sec
The number of lock requests that resulted in a deadlock. Should remain relatively constant. High numbers may indicate a poorly designed application or set of transactions.

SysMon Counters, General Health
SQLServer:General Statistics - User Connections
The number of users connected to the database server. Dramatic shifts in this value should be researched.
SQLServer:Memory Manager - Memory Grants Pending
The current number of processes waiting for a workspace memory grant. A high or rising number may indicate inadequate memory.
SQLServer:User Settable – Query (a tracer query)
A tracer query is a user-written query that gives you an indication of the overall speed or efficiency of the system. You set the counter's value by calling a procedure named sp_user_counter1. Up to 10 user-settable counters are allowed.

This final baseline counter requires some explanation. SQLServer:User Settable – Query is an application-specific counter: the value it displays is set by your application, which calls sp_user_counter1 and provides a numeric value. The suggested way to employ this counter is as follows:
1. Determine an inexpensive way to check the health of your application with a single SQL statement (perhaps count the number of items that were sold in the last hour). The health check must not degrade the performance of your application in any way.
2. Write a stored procedure that first checks the health of your application, putting the result into a variable. The stored procedure should then call sp_user_counter1 and provide that health check variable as the input parameter.
3. Set up a scheduled job that runs this stored procedure every 15 minutes.
4. Include the SQLServer:User Settable – Query counter, instance User Counter 1, in your baseline and ongoing monitoring regimen.
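The tracer-query steps describe a stored procedure run by a scheduled job. As a rough illustration of the same two steps (run the health check, then hand the result to sp_user_counter1), here is a client-side Python sketch. It is an alternative to the stored procedure, not the approach the presentation prescribes. It assumes a DB-API connection to SQL Server that accepts "?" parameter markers (as pyodbc does); the function name and the health-check query are hypothetical.

```python
def publish_health_counter(connection, health_check_sql):
    """Run a one-value health check and publish it through sp_user_counter1.

    Assumptions: `connection` is a DB-API connection to SQL Server using
    "?" parameter markers, and `health_check_sql` is an inexpensive query
    that returns a single integer in its first column.
    """
    cursor = connection.cursor()
    cursor.execute(health_check_sql)
    row = cursor.fetchone()
    # Treat "no rows" or NULL as zero so the counter always receives a number.
    value = int(row[0]) if row and row[0] is not None else 0
    # sp_user_counter1 sets the value reported by SQLServer:User Settable - Query.
    cursor.execute("EXEC sp_user_counter1 ?", (value,))
    connection.commit()
    return value
```

A scheduler would call this function every 15 minutes with a cheap query, for example a count of items sold in the last hour against a table of your own.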

Using SysMon Demo
1. Open Control Panel >> Administrative Tools >> Performance.
2. Double-click Performance Logs and Alerts in the left window.
3. Enter a name for the baseline chart and click "OK".
4. Choose the first SysMon counter in the Select Counters window and click "Add". Repeat until all counters are added, then click “Close”.
5. Select the sampling interval. A shorter interval consumes more disk space; a longer interval consumes less space but provides less detailed data. Try the default at first (15 seconds).
6. Select the “Log Files” tab and enter a location to save the data. Review it later using “View log file data”.

It can be hard to tell which counter is which in a crowded field of SysMon counters. Don’t forget to use Ctrl-H or the Highlight button to highlight the specific counter you’re looking at.

Benchmarking
The next step is to understand server performance under the several usage scenarios that could possibly occur. This is known as benchmarking. Use the PerfMon counters as you would when baselining, but with a longer polling interval. Ideally, you should benchmark based on actual usage. If real load is not available, use one of several popular benchmark scenarios available in the industry, such as TPC-C or SAP. Alternatively, use a load generation tool. The best solution is to build benchmarking scenarios specific to your application using T-SQL scripts, SQL Profiler, or other third-party tools. Capture and review the results of the benchmarked scenarios.

Again, like baselining, the goal of benchmarking is to know quantitatively the performance capabilities of your server and app before you encounter the scenario in the real world. The idea for both is to build a situation where you have predictable, dependable results every time, all the time. In baselining, you quantitatively know how the server and app will perform at “normal”. In benchmarking, you quantitatively know how the server and app will perform at the extremes. McDonald's didn’t succeed by having the best food. It succeeded by having the most consistent experience every time. That’s what we’re shooting for!

Ongoing Monitoring
Ongoing monitoring is an important, if not the most important, component of proactive monitoring. If nothing else, use SysMon set to a 15-minute polling interval, checking on these counters:
Memory – Pages/sec
Network Interface – Bytes total/sec
Physical Disk – Disk Transfers/sec
Processor - % Processor Time
SQLServer:Access Methods - Full Scans/sec
SQLServer:Buffer Manager – Buffer Cache Hit Ratio
SQLServer:Databases Application Database - Transactions/sec
SQLServer:General Statistics - User Connections
SQLServer:Latches – Average Latch Wait Time
SQLServer:Locks - Average Wait Time
SQLServer:Locks - Lock Timeouts/sec
SQLServer:Locks - Number of Deadlocks/sec
SQLServer:Memory Manager - Memory Grants Pending

The chart that you generate each day (from the log file data) should be reviewed the following day to determine the health of your application. Any strange behavior should be investigated. Was the server overwhelmed at any period of time? Reviewing the charts every day gives you a deeper understanding of how your application responds to load, which will be invaluable when a problem state arises. When you find that there is a problem, it is a good idea to go back to the original baseline-monitoring mode to troubleshoot it. A good practice to remember is "ongoing monitoring when things look healthy, baseline monitoring when problems occur." Of course, Spotlight makes all of this much easier.

After you have a few performance logs saved away, begin to do graph trend analysis with checks every week and every month. Look for trends like "Why is the CPU utilization so high on Fridays?" Also, do not assume that your server went down simply because the counters registered no value for a period of time. Networks can become congested, and the data sent by your counters may be the first thing to get lost. Use corroborating evidence and your knowledge of the apps to reach your conclusions. Compare graphs to one another. Get to know your graphs; it will serve you well in the long run.
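The weekly trend check and the caution about missing samples can both be sketched in a few lines of Python. This is a minimal illustration that assumes the logged counter has already been exported as (timestamp, value) pairs; the function names, the gap tolerance, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def mean_by_weekday(samples):
    """Average a counter per weekday from (timestamp, value) pairs."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[DAY_NAMES[timestamp.weekday()]].append(value)
    return {day: mean(values) for day, values in buckets.items()}


def find_gaps(timestamps, expected_seconds, tolerance=2.0):
    """Return (start, end) pairs where samples are further apart than expected.

    A gap means the counter data is missing, not necessarily that the
    server was down; check other evidence before drawing conclusions.
    """
    ordered = sorted(timestamps)
    limit = expected_seconds * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if (b - a).total_seconds() > limit]


# Hypothetical % Processor Time samples (2024-01-05 was a Friday).
cpu = [
    (datetime(2024, 1, 4, 9, 0), 40.0),
    (datetime(2024, 1, 5, 9, 0), 90.0),
    (datetime(2024, 1, 5, 9, 15), 80.0),
]
print(mean_by_weekday(cpu))
```

Grouping by weekday is one simple way to make a question like "Why is the CPU utilization so high on Fridays?" visible in the numbers before you go looking in the graphs.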

Alerting
Alerts are defined events that raise a notification of some kind. Use a well-defined tool like SQL Server Alerts & Notifications, SysMon, or Quest products like Foglight (for 24x7) or Spotlight (for real-time and short-term monitoring) to raise alerts. But which alerts to raise? At a minimum, use this reference list of alerts:
Errors affecting service – specifically errors with a severity of 19 to 25!
Deadlocks
CPU utilization
Disk utilization
Scans (SQLServer:Access Methods)

Errors affecting service. Each known critical error should have an alert associated with it, if possible. If the error itself cannot be tied to an alert, it is possible to write a query to test for the error state, run the query on an ongoing basis, and alert when the query finds the error.
Deadlocks. Any SQLServer:Locks – Number of Deadlocks/sec value over a given threshold should generate an alert. This threshold should be set to 1 or 2 if your application does not usually cause deadlocks to occur.
CPU utilization. If your application freezes or degrades at a certain CPU utilization level, you need to set an alert when the CPU utilization exceeds that level.
Disk utilization. If your application freezes or degrades at a certain disk utilization level (or queue level), you need to set an alert when the disk utilization exceeds that level.
Scans. Your application database should generate an alert if excessive table scans take place. The definition of excessive, however, is completely up to you. You may have many small tables that do not require indexes, which would cause many acceptable table scans to occur. Monitor the SQLServer:Access Methods - Full Scans/sec counter for a period of time to determine the baseline scan value before you set an alert to watch for the excessive value.
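One possible way to turn a baseline of Full Scans/sec samples into an alert threshold is sketched below in Python. The presentation leaves the definition of "excessive" to you; the mean-plus-three-standard-deviations rule used here is only an assumption chosen for illustration, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def alert_threshold(baseline_samples, sigmas=3.0):
    """Baseline mean plus `sigmas` sample standard deviations.

    The rule itself is an assumption; pick whatever definition of
    "excessive" fits your application.
    """
    if len(baseline_samples) < 2:
        raise ValueError("need at least two baseline samples")
    return mean(baseline_samples) + sigmas * stdev(baseline_samples)


def excessive(samples, threshold):
    """Return the samples that should raise an alert."""
    return [value for value in samples if value > threshold]


# Hypothetical Full Scans/sec readings collected during the baseline period.
baseline = [10, 12, 11, 13, 9]
threshold = alert_threshold(baseline)
print(round(threshold, 2))
print(excessive([12, 30, 14], threshold))
```

A baseline dominated by small, unindexed tables will simply produce a higher threshold, which matches the point that many table scans can be acceptable.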

Sample of Alerting Demo
You can build similar functionality with a lot of elbow grease, SQL Mail, and event forwarding in SQL Server. I describe the SQL Mail approach in an earlier presentation entitled “SQL Server 2000 Essential Checklist”.

Other monitoring activities
If you have a lot of servers, build alerts on the following. If you have one or only a few, perform a daily check on the following:
SQL Server log
SQL Agent log
Windows Application, Security, and System logs
SQL Server job history and trends
Make sure the app upholds good error logging by ensuring that RAISERROR…WITH LOG is used extensively and that meaningful error messages, as well as all-clear and summary messages, are written into the app.

Monitoring the job histories associated with your application can help you spot a problem before it causes any damage. What if, for example, you notice that backups are taking a few minutes longer each and every time they run? Reviewing the job histories is also a great way to troubleshoot a problem once damage has been done. For example, you should be running regularly scheduled DBCC cleanup jobs. (Refer to my e-seminar entitled “SQL Server 2000 Essential Checklist” for a complete set of jobs you should be running.) But the job histories supply only some of the information you need when managing a complex application. The procedures run by your jobs should be written so that they raise errors in the SQL Server log when they encounter a difficulty. This will help ensure that the correct message gets to your monitoring staff. It is possible for the job history to provide no further detail than "job was terminated." This is why it is critical that the code run by your jobs include clear, descriptive error messages when error conditions arise.
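The backup example, a job that takes a little longer on every run, can be checked numerically once the durations have been pulled from the job history. The Python sketch below fits a least-squares line through the durations; the function name and the sample durations are hypothetical, and reading the job history itself is out of scope here.

```python
def duration_slope(durations_minutes):
    """Least-squares slope of job duration against run number.

    The result is in minutes per run; a steadily positive slope means
    the job is taking longer each time it runs.
    """
    n = len(durations_minutes)
    if n < 2:
        raise ValueError("need at least two runs")
    mean_x = (n - 1) / 2
    mean_y = sum(durations_minutes) / n
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(durations_minutes))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical backup durations, in minutes, for four consecutive runs.
print(duration_slope([30, 32, 34, 36]))
```

A slope near zero suggests a stable job, while a persistent positive slope is the kind of trend worth investigating before it causes damage.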

Spotchecking & Monitoring Example
Sometimes all you need is to check whether a service or functionality is up or down. Quest offers this functionality through the inexpensive product called Big Brother. You can modify the HTML to monitor for almost anything. Freeware, called Big Brother at

On-going Exceptions Monitoring
The alternative to proactive monitoring is exceptions monitoring. Only the exceptions to “normal” service are monitored, though the definition of an exception is very flexible. It requires much less set-up time and on-going overhead. It requires more intimate knowledge of the application, since you must anticipate all of the exceptions in advance and capture only information that reveals exceptions. In other words, an exception management regimen uses limited performance monitoring, more alerts, and heavy use of error logs.

An exception could be anything from hardware failure to job failure to performance degradation. Your system would gather only data concerning exceptions and would notify you only when an exception occurred. This system allows you to have the most impact for the least investment of time. Note that you should use exceptions monitoring only for on-going performance monitoring, not as a substitute for baselining and benchmarking!

Exceptions Monitoring Regime
SysMon is used only to catch problem situations:
Memory – Pages/sec
Network Interface – Bytes total/sec
Physical Disk – Disk Transfers/sec
Processor - % Processor Time
SQLServer:Buffer Manager – Buffer Cache Hit Ratio
Set up real-time alerts for all foreseen problem situations (e.g. user count exceeded, failed logins, disk space low, etc.). Monitor or build alerts on the SQL Server error log. Prepare recovery and troubleshooting solutions in advance for all common exceptions the application may experience.

Summary
Build a performance baseline of server and app. Use SysMon counters judiciously. Use third-party tools for added speed, ease of use, and information. Build benchmarks for better understanding of server and app performance. Review and analyze! Perform on-going monitoring, either proactive or exceptions monitoring. Incorporate alerts into your on-going monitoring regimen.

The key idea to remember is to understand the quantitative performance of your servers. It’s not enough to know only that users are complaining about performance. As an enterprise-class DBA, you need to know how and why servers are performing the way they do.

Resources The white paper covers in greater detail the concepts discussed in this slide show. Presentation available at: White paper available at: Archives of past e-seminars available at: Microsoft SQL Server Operations Guide -

Questions & Answers me at
