Fixing Page Life Expectancy

1 Fixing Page Life Expectancy
Steve Hood Blog: SimpleSQLServer.com I chose to talk about Page Life Expectancy because I feel that many people know what the counter is, but very few have a firm grasp on how to affect it. Although it is only a counter, it indicates how much potential you have to reduce reliance on the slowest part of your server: your disks. This is true even with SSDs.

2 What is PLE? Duration in seconds data stays in memory References
Page life expectancy is how long data is able to stay in cache. This is important because data that's required but isn't in cache has to be read from disk, and disk is always the slowest part of your server. When no data is required from disk, PLE goes up 1 second for each second that passes. When anything needs to be read from disk you see the growth slow or the counter drop. The large drops happen when a large amount of data, relative to the size of the cache, needs to be read from disk and other data is pushed out of cache. References Monitor PLE in OS Perf Counters
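Since PLE is exposed as an OS performance counter, you can read it straight from the DMVs instead of Performance Monitor. This is a minimal sketch; the [object_name] filter is written loosely so it also matches named instances:

SELECT [object_name],
    counter_name,
    cntr_value AS PLE_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
    AND [object_name] LIKE '%Buffer Manager%'; --the Buffer Node object also reports PLE per NUMA node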

3 It’s just a counter May always be low on OLAP environments
Are you waiting on PAGEIOLATCH_SH? Monitor your Wait Stats Faster disks can help compensate In the end, PLE is just another counter in OS Performance Counters. In some environments, such as OLAP environments, PLE could always be low. It's simply not financially reasonable to have that much data in cache, and there is nothing wrong with accepting that you'll be reading most of your data from disk. It being low doesn't mean your server is running too slowly, and it being high doesn't mean your server is running to its potential. A more reliable indicator of slowness caused by excessive physical reads is high PAGEIOLATCH wait stats. The reason is that faster disks can supply the data quickly enough that there's less impact on the duration of your queries. However, a lower PLE indicates a higher load on your disks. Even if you have faster disks, you probably don't want to rely on constantly buying faster and more expensive disks to compensate. References Monitor your Wait Stats
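To see whether physical reads are actually hurting you, check the PAGEIOLATCH waits in sys.dm_os_wait_stats. A minimal sketch; note these numbers are cumulative since the last restart or counter clear, so trend them over time rather than reading them once:

SELECT wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%' --PAGEIOLATCH_SH, PAGEIOLATCH_EX, etc.
ORDER BY wait_time_ms DESC;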

4 Ideal Value Common advice of PLE > 300 is outdated
Cache Size In GB / 4 * 300 Best generic formula Even if you've managed to throw enough money at disks to keep your waits at a level acceptable to the users, having good PLE will still be beneficial to you. If you're relying on fast disks then you're putting yourself in a situation where you eventually can't do any more. So what is a good value? Historically the official advice was that anything over 300 seconds was good and anything below that was bad. This was fine when we had 32-bit servers that capped out at 4 GB of memory, but the SQL Server community outgrew that advice a long time ago. Now it's not uncommon to have 128 GB or more of memory, with 80% of that allocated to the buffer cache PLE is watching. That means a value of 300 is saying you can expect to read over 100 GB from disk every 5 minutes. Jonathan Kehayias gave us a much more scalable formula: if 300 was good when we had 4 GB of cache, then it should be 300 seconds for each 4 GB of cache we have now. Although different databases and situations mean no single piece of advice is accurate across the board, this is what I believe to be the most accurate generic advice you can give for an ideal value. References Jonathan Kehayias discusses the plan cache and PLE Comment discussion with Brent Ozar is as good as the article Jonathan Kehayias's free book: Troubleshooting SQL Server
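You can turn that formula into a query so the target scales automatically with the current buffer pool. A minimal sketch, assuming the Buffer Manager 'Database pages' counter as the measure of cache size (counter values are 8 KB pages):

SELECT cntr_value / 128 / 1024 AS cache_size_GB,
    (cntr_value / 128 / 1024) / 4.0 * 300 AS target_PLE_seconds --Cache Size In GB / 4 * 300
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Database pages'
    AND [object_name] LIKE '%Buffer Manager%';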

5 Increase PLE More memory Query Tuning Query Justification
Indexing Changes Data Cleanup Given that we have an ideal value we'd like to reach, how do we get there? There are five primary ways that I came up with: adding memory, justifying the queries that are running, tuning the queries that have to run, making sure you have the right indexes in place, and cleaning up old data.

6 More Memory 64 GB in 2008 and 2012 Standard 128 GB in 2014 Standard
Unlimited in Enterprise 768 GB is reasonable per physical box Coming from someone who prides himself on tuning, a solution of throwing money at the issue may seem out of place. However, I realize that my time is worth something and the tasks on my list all take time and effort. If you're paid $52,000 per year and feel you could cut the memory required by a server in half in two weeks, then you just spent $2,000 resolving the issue. Chances are you could have had the same or better effect with less risk by investing $1,000 in memory. The last item on this list may also seem out of place, but it's really not when you think it through. A typical physical box will have two processors with 6 cores each. With Enterprise Edition costing about $7,000 per core, those 12 cores cost $84,000. Buying a new two-socket server fully loaded with 16 GB chips puts it at 768 GB at a price of about $25,000. Basically this is saying that you can have a slower server for a total investment of $100,000 or a faster server for a total investment of $109,000. When the total price of a server including licensing is kept in mind during purchasing, smaller servers are not a wise investment.

7 What DBs are in cache? References Script: CacheSizeByDB
When you talk about spending that much on memory, the first question should be what is using all of that memory. Don't feel bad if you can't say where your memory is being used; a good portion of DBAs can't tell you this. However, as an act of vengeance, I'm taking that excuse away from you right now. The query attached in the speaker notes of this slide can tell you what databases are using the buffer cache on your server. This example is the real cache from a real production server that has 512 GB of memory. About 430 GB of that memory is for the buffer cache on SQL Server, and almost a quarter of that is used for a single database. The database names were changed to protect my employment. Once you pick out a database for taking up more cache than you feel is appropriate, then you can dive into more detail.

DECLARE @total_buffer_pages INT;

SELECT @total_buffer_pages = cntr_value
FROM sys.dm_os_performance_counters
WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
    AND counter_name = 'Total Pages';

SELECT [db_name] = CASE [database_id]
        WHEN 32767 THEN 'Resource DB'
        ELSE DB_NAME([database_id]) END,
    --db_buffer_pages,
    db_buffer_MB = db_buffer_pages / 128,
    db_buffer_percent = CONVERT(DECIMAL(6,3), db_buffer_pages * 100.0 / @total_buffer_pages)
FROM
(
    SELECT database_id, db_buffer_pages = COUNT_BIG(*)
    FROM sys.dm_os_buffer_descriptors
    --WHERE database_id BETWEEN 5 AND 32766 --user databases only
    GROUP BY database_id
) src
ORDER BY db_buffer_pages DESC;

References Script: CacheSizeByDB

8 What Indexes are in Cache?
The details for a specific database will show you how much cache each index is using, and the distribution of this cache can tell you a lot about how it got there. Almost 30% of the cache used by this database is on a single index, and with it having an index id of 1 you know it's the clustered index. For that much of a clustered index there's a decent chance that either a scan was done a while ago or there's a query that does a lot of key lookups on this table. A more telling sign is the indexes that have nearly 100% of their pages in cache. These indexes had a very recent scan performed against them, and there's typically an easy fix for those. Diving into execution plans is beyond the scope of this presentation, but there's a presentation on that topic today at 11:00. We will, however, get into finding out what queries are using these indexes and how.

SELECT cached_MB
    , ObjName = name
    , index_id
    , index_name
    , Pct_Of_Cache = cast((cached_mb * 100) / cast(SUM(cached_mb) over () as DEC(30,4)) as DEC(5,2))
    , Pct_InRow_Data_In_Cache = cast((100.0 * cached_MB) / (1.0 * Used_InRow_MB) as DEC(6,2))
    , Used_MB
    , Used_InRow_MB
    , Used_LOB_MB
FROM
(
    SELECT count(1)/128 AS cached_MB
        , obj.name
        , i.index_id
        , index_name = ISNULL(idx.name, '--heap--')
        , i.Used_MB
        , i.Used_InRow_MB
        , i.Used_LOB_MB
    FROM sys.dm_os_buffer_descriptors AS bd with (NOLOCK)
        INNER JOIN
        (
            SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id)
                , object_id
                , index_id
                , allocation_unit_id
            FROM sys.allocation_units AS au with (NOLOCK)
                INNER JOIN sys.partitions AS p with (NOLOCK)
                    ON au.container_id = p.hobt_id
                        AND (au.type = 1 OR au.type = 3) --in-row and row-overflow data
            UNION ALL
            SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id)
                , object_id
                , index_id
                , allocation_unit_id
            FROM sys.allocation_units AS au with (NOLOCK)
                INNER JOIN sys.partitions AS p with (NOLOCK)
                    ON au.container_id = p.partition_id
                        AND au.type = 2 --LOB data
        ) obj
            ON bd.allocation_unit_id = obj.allocation_unit_id
        INNER JOIN
        (
            SELECT Name = OBJECT_NAME(PS.Object_ID)
                , PS.Object_ID
                , PS.Index_ID
                , Used_MB = SUM(PS.used_page_count) / 128
                , Used_InRow_MB = SUM(PS.in_row_used_page_count) / 128
                , Used_LOB_MB = SUM(PS.lob_used_page_count) / 128
                , Reserved_MB = SUM(PS.reserved_page_count) / 128
                , row_count = SUM(Row_Count)
            FROM sys.dm_db_partition_stats PS
            GROUP BY PS.Object_ID
                , PS.Index_ID
        ) i
            ON obj.object_id = i.object_id
                AND obj.index_id = i.index_id
        INNER JOIN sys.indexes idx
            ON idx.object_id = i.object_id
                AND idx.index_id = i.index_id
    WHERE database_id = db_id()
    GROUP BY obj.name
        , i.index_id
        , idx.name
        , i.Used_MB
        , i.Used_InRow_MB
        , i.Used_LOB_MB
    HAVING Count(*) > 128 --only indexes using more than 1 MB of cache
) x
ORDER BY 1 DESC;

References Script: CacheSizeByIndex (in speaker notes)

9 Finding Queries to Tune
Look in the Proc Cache (not 100% reliable) Most Expensive Queries Index Usage in Proc Cache Server-Side Trace or Extended Events There are two primary methods for finding the queries that are giving you problems. You can look in the proc cache, which is very useful although not 100% reliable, or you can monitor the server with traces or extended events to find the more expensive individual calls to queries.

10 Proc Cache Limitations
Data since last compile Can monitor it better than that Only cacheable plans The limitations of traces and extended events are more obvious in that you're only capturing queries that meet a threshold you specify. The proc cache, on the other hand, can be a little confusing at first. The contents of the proc cache only go back to the last time a statement was compiled, and, depending on the settings on your server, some queries may need to be called twice just to be registered there. Other queries aren't able to have their execution plans stored in cache and will never show up in queries that look at the proc cache. Despite these limitations, the detail given by the proc cache makes it the best place to start your search into how the data made it into the buffer cache.
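The "called twice" behavior typically comes from the 'optimize for ad hoc workloads' setting, which stores only a small plan stub on the first execution. That's my assumption about the setting being alluded to here; you can check it with a quick query:

SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'optimize for ad hoc workloads'; --1 means single-use ad hoc plans get a stub instead of a full cached plan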

11 Most Expensive Queries
Uses Proc Cache For PLE focus on Physical Reads The most common way to look at the proc cache is with the "Most Expensive Queries" script that's easy to find on the internet. I attached my version to this slide, and also have a link to a way of saving this data for analysis later. Looking at the data SQL Server saves in the DMVs about the queries that are in the proc cache can give you a good idea of what queries are causing the most issues for your server. Looking at the most expensive queries according to physical reads helps you find which queries would give you the most benefit to tune to raise PLE. However, this data is numerical statistics only and won't give you the detail to see what pulled a specific index into the cache.

SELECT TOP 20 ObjName
    , total_worker_time_sec = cast(qs.total_worker_time / 1000 / 1000. AS DEC(20,1)) --DMV reports microseconds
    , total_elapsed_time_sec = cast(qs.total_elapsed_time / 1000 / 1000. AS DEC(20,1))
    , total_logical_reads_k = qs.total_logical_reads / 1000
    , total_physical_reads_k = qs.total_physical_reads / 1000
    , total_logical_writes_k = cast(qs.total_logical_writes / 1000. as DEC(20,1))
    , qs.execution_count
    , DatabaseName
    , Stmt = qs.statement_text
    , Query = qs.[text]
    , qs.creation_time
FROM
(
    SELECT QS.*
        , ST.[text]
        , ObjName = OBJECT_NAME(ST.objectid, ST.dbid)
        , SUBSTRING(ST.text, (QS.statement_start_offset/2) + 1,
            ((CASE statement_end_offset
                WHEN -1 THEN DATALENGTH(ST.text)
                ELSE QS.statement_end_offset
              END - QS.statement_start_offset)/2) + 1) AS statement_text
        , DatabaseName = DB_NAME(st.dbid)
    FROM sys.dm_exec_query_stats AS QS
        OUTER APPLY sys.dm_exec_sql_text(QS.sql_handle) as ST
) as qs
WHERE DatabaseName = DB_Name()
ORDER BY 2 DESC; --order by total_physical_reads_k instead when focusing on PLE
GO

References Query Stats Monitoring – know your stats beyond your current cache Script: Most Expensive Queries (in speaker notes)

12 Index Usage in Proc Cache
Uses Proc Cache Goes along with "What indexes are in cache" Gives Estimated IO cost Gives Seek and Scan Predicates To get that level of detail you need to actually query the XML of the execution plans in the proc cache. As anyone familiar with querying large amounts of XML will expect, this isn't a cheap process and can take up to about 5 minutes to run. The results of the Index Usage script will tell you how a specific index was used by every query in the proc cache that touched it. The Table Usage script does the same, just filtering by all the indexes on a table instead of a single index.

--Index Usage in Proc Cache
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @IndexName SYSNAME = '[Index_Name_Goes_Here--Brackets_Are_Required]';
DECLARE @DatabaseName SYSNAME;
SET @DatabaseName = '[' + DB_NAME() + ']';

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT ObjectName = object_name(tab.objectid),
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    --n.query('.'),
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(@EstimateRows)[1]', 'VARCHAR(128)') AS EstimateRows,
    i.value('(@EstimateIO)[1]', 'VARCHAR(128)') AS EstimateIO,
    i.value('(@EstimateCPU)[1]', 'VARCHAR(128)') AS EstimateCPU,
    cp.usecounts,
    i.value('(./IndexScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./IndexScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./IndexScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    i.value('(./IndexScan/Object/@Index)[1]', 'VARCHAR(128)') as IndexName,
    --i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
        FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
        FOR XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
        FROM i.nodes('./IndexScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
        FOR XML PATH('')),1,2,'') AS seek_columns,
    i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)') as Predicate,
    cp.plan_handle,
    query_plan
FROM
(
    SELECT qs.plan_handle, tp.query_plan, tp.objectid
    FROM
    (
        SELECT DISTINCT plan_handle
        FROM sys.dm_exec_query_stats WITH(NOLOCK)
    ) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
) as tab (plan_handle, query_plan, objectid)
    INNER JOIN sys.dm_exec_cached_plans AS cp
        ON tab.plan_handle = cp.plan_handle
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
    CROSS APPLY n.nodes('.//RelOp[./IndexScan/Object[@Index=sql:variable("@IndexName")
        and @Database=sql:variable("@DatabaseName")]]') as s(i)
--WHERE i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') = 1
OPTION(RECOMPILE, MAXDOP 4);

The Table Usage in Proc Cache script in the speaker notes follows the same pattern: it declares a table name such as '[Table_Name_Goes_Here]' instead of an index name and filters the Object nodes on @Table instead of @Index, so it returns every cached plan touching any index on that table. References Script: Index Usage in Proc Cache (in speaker notes) Script: Table Usage in Proc Cache (in speaker notes)

13 Server-Side Trace or EE
Duration over X seconds (5) Reads over X (100,000) Look at a graph of your PLE. This is a common feature in all monitoring software, or you can capture it at intervals using the scripts in the performance counter monitoring post referenced earlier. When you see a drop in PLE, that means something happened on the server to make it drop. It's possible that the amount of memory allocated to SQL dropped, which would show up in the same dm_os_performance_counters data. More likely, though, there was a query running that required data that wasn't in cache. If that query were tuned to require less data to be in cache then it would almost always have less of an impact on PLE. The reason for the 'almost' is that it could require less data where none of it is in cache instead of more data that could all be in cache, although this would be an atypical case. References Tracing Introduction – Creating your first Server-Side Trace Reading Traces – Querying your Server-Side Trace Erin Stellato Making the Leap from Profiler to Extended Events
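As a sketch of the Extended Events side, here is a minimal session that captures completed statements running longer than 5 seconds or doing more than 100,000 physical reads. The session name and file target are placeholders, and it assumes SQL Server 2012 or later, where duration is reported in microseconds:

CREATE EVENT SESSION [ExpensiveQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE duration > 5000000        --5 seconds, in microseconds
        OR physical_reads > 100000  --swap in logical_reads if you prefer the classic trace-style threshold
)
ADD TARGET package0.event_file (SET filename = N'ExpensiveQueries');
GO

ALTER EVENT SESSION [ExpensiveQueries] ON SERVER STATE = START;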

14 How To Tune Beyond the scope of this presentation
Execution Plans – Grant Fritchey Google SARGability BrentOzar.com SQLskills.com This is a one-hour presentation and an introductory tuning class is a week long. It's not that I don't want to cover this, it's that I can't cover enough of it right now. Look up what SARGability is and what it means for query performance. To understand how to find issues with SARGability, read Grant Fritchey's book on execution plans; it's free and one of the best out there. To learn more about tuning, read the great work by everyone at Brent Ozar Unlimited and SQLskills.com. If someone passes the interview to work at one of those places then they're better at SQL than I am, and their work is peer reviewed by the best in the business.
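As a quick illustration of SARGability, the table and column below are hypothetical; the point is that wrapping the column in a function hides it from an index, while the rewritten predicate lets the optimizer seek:

-- Non-SARGable: the function on the column forces a scan even if OrderDate is indexed
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2015;

-- SARGable: the bare column can be matched against an index on OrderDate with a seek
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderDate >= '20150101'
    AND OrderDate < '20160101';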

15 Appropriate Queries? Did this query need to run? …in prod?
…during peak hours? One of the most important tuning methods is to not run the query at all. Ask these questions seriously, not just as a quick run-through. If you have extremely large reports that slow down your transactional database, do you need a reporting server? Yes, it adds complexity and cost, but it gives you the opportunity to start separating the quick transactional load from the slow analytical load.

16 Indexes – Drop Unused Data modifications pull data into cache
Index maintenance pulls data into cache An unused index wastes cache. When you update data, the page is read from disk into memory if it wasn't there already. Then it's updated, marked as dirty, and stays in cache at least until it can be written to disk (writing to the log is done during the transaction, but writing the actual pages to disk is not). Even then it typically stays in cache for a while; as the name implies, you can expect it to stay in cache for roughly the same number of seconds as your PLE. References Indexes – Unused and Duplicated
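To spot candidates, compare writes against reads in the index usage DMV. A minimal sketch for the current database; note that sys.dm_db_index_usage_stats is cleared on instance restart, so only trust it after a representative business cycle:

SELECT o.name AS TableName,
    i.name AS IndexName,
    us.user_seeks, us.user_scans, us.user_lookups,
    us.user_updates --writes that pull the index into cache with no read payoff
FROM sys.indexes i
    JOIN sys.objects o
        ON o.object_id = i.object_id
    LEFT JOIN sys.dm_db_index_usage_stats us
        ON us.object_id = i.object_id
        AND us.index_id = i.index_id
        AND us.database_id = DB_ID()
WHERE o.is_ms_shipped = 0
    AND i.type_desc = 'NONCLUSTERED'
    AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC;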

17 Indexes – Remove Dupes First couple key fields match
Two indexes that could almost replace each other Even worse than unused indexes are duplicate indexes. An unused index is pulled into cache when it's updated, stats are updated, or index maintenance is performed. Duplicate indexes will compete with each other for space in your cache. Examples of types of duplicate indexes are: key fields that are the same and in the same order (even if just the first one or two key fields of wider indexes match, they may be interchangeable), and key fields that are interchangeable (if you always search an employee table by last name and location then it doesn't matter what order those key fields are in). Look at the results from the queries in "Index Usage in Proc Cache" to get an idea of how your indexes are being used; a listing of key columns per index, as sketched below, also makes near-duplicates easy to spot. References Indexes – Unused and Duplicated
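This is a minimal sketch that lists each index's key columns in order so you can eyeball indexes whose leading keys match:

SELECT o.name AS TableName,
    i.name AS IndexName,
    STUFF((SELECT ', ' + c.name
        FROM sys.index_columns ic
            JOIN sys.columns c
                ON c.object_id = ic.object_id
                AND c.column_id = ic.column_id
        WHERE ic.object_id = i.object_id
            AND ic.index_id = i.index_id
            AND ic.is_included_column = 0
        ORDER BY ic.key_ordinal
        FOR XML PATH('')), 1, 2, '') AS KeyColumns
FROM sys.indexes i
    JOIN sys.objects o
        ON o.object_id = i.object_id
WHERE o.is_ms_shipped = 0
    AND i.type > 0 --skip heaps
ORDER BY o.name, KeyColumns;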

18 Indexes – Compression Enterprise only Some don’t compress at all
Some compress by 90% Index compression isn’t just for disk space. The data is still compressed when it’s in memory, and is only uncompressed as it’s read. That’s a double-edged sword in that it’s taking more CPU to uncompress it every time you read it, but less CPU, memory, and I/O for reading it and keeping it in cache. My experience is that if something can be compressed by 30% then it saves you at least as much CPU as it costs you, and it will always save you memory and I/O. The amount of compression you get will vary by every imaginable way, including what data types you have, how wide your table or index is, and what data is being stored. One thing to note is that although backup compression is almost useless if you’re using TDE, your indexes will compress just as well with or without TDE enabled. The reason for this is that index pages are compressed then encrypted. Backups can’t be compressed with TDE because TDE will make them appear to be random data before the backup is taken, and random data can’t be compressed efficiently.
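If you want to test compression on a specific index before committing, SQL Server ships an estimator procedure; the schema, table, and index names below are placeholders:

-- Estimate the savings for index_id 1 (the clustered index) of a hypothetical dbo.Orders table
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'Orders',
    @index_id = 1,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Apply page compression to a placeholder nonclustered index (Enterprise Edition)
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (DATA_COMPRESSION = PAGE);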

19 Data Cleanup Any data has potential to use cache
If you can purge 25% of your data… If something can be in cache then it eventually will be in cache. The less data that is stored on your server the less potential contention you have for your cache. Think about your accountants here. If you go to an old-school accountant with filing cabinets in their office and ask for sales numbers, what will they say? If you ask for sales numbers from last quarter then they go to the cabinet next to their desk and tell you what you need to know. If you ask for detailed numbers from a couple years ago then they may have to look somewhere else to fulfill your odd request, but they can still get to that data. Get to 6 years ago and they may even have to drive you to a warehouse to find the papers, because they need to make efficient use of their limited cache (the number of filing cabinets they can fit in their office). Go past 7 years and they’ll find a nice way to tell you they hired a pyromaniac with scissors to “take care of their data”. DBAs shouldn’t be much different.
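When you do purge, delete in batches so you don't flood the transaction log or hold long blocking locks. A minimal sketch; the table, column, and 7-year cutoff are placeholder assumptions:

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM dbo.SalesHistory
    WHERE SaleDate < DATEADD(YEAR, -7, GETDATE());

    SET @rows = @@ROWCOUNT;

    -- In FULL recovery, frequent log backups between batches keep the log from growing
END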

20 Fixing Page Life Expectancy
Steve Hood Blog: SimpleSQLServer.com

