Statistics for beginners
Andrii Zrobok
What are statistics? How are they created and updated? Samples.
SQLSat Kyiv Team Yevhen Nedashkivskyi Alesya Zhuk Eugene Polonichko Oksana Borysenko Mykola Pobyivovk Oksana Tkach
Agenda
“Lies, blatant lies, statistics”
Statistics are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view.
- Creating
- Updating
- Optimizer model
- Usage
Statistics: creating
- Column-based (auto-created, single column only)
- Index-based (created together with the index)
- Manually created (CREATE STATISTICS)
Database options (SET):
- Auto Create Statistics
- Auto Create Incremental Statistics

-- SAMPLE TABLE
CREATE TABLE [dbo].[Address](
    [AddressID] [int] IDENTITY(1,1) NOT NULL,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL,
    CONSTRAINT [PK_Address_AddressID] PRIMARY KEY CLUSTERED ([AddressID] ASC))

-- TEST DATA
INSERT INTO [dbo].[Address]
    ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address]
Getting information about statistics objects
- sys.stats
- sys.stats_columns

SELECT s.stats_id StatsID,
       s.name StatsName,
       sc.stats_column_id StatsColID,
       c.name ColumnName
FROM sys.stats s
INNER JOIN sys.stats_columns sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
INNER JOIN sys.columns c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
INNER JOIN sys.objects o
    ON o.object_id = c.object_id
WHERE OBJECT_NAME(s.object_id) = 'Address'
  AND SCHEMA_NAME(o.schema_id) = 'dbo'
ORDER BY s.stats_id, sc.column_id;
Creating statistics

-- COLUMN-BASED STATISTICS
SELECT DISTINCT PostalCode
FROM [dbo].[Address]
WHERE city = 'Concord'

-- INDEX-BASED STATISTICS
CREATE NONCLUSTERED INDEX [IDX_StateProvinceID]
    ON [dbo].[Address] ([StateProvinceID] ASC)

What does WA mean in the name of a statistic?

-- STATISTICS (manually created)
CREATE STATISTICS STA_StateProvinceID
    ON [dbo].[Address] (StateProvinceID)

Manually created statistics are not deleted automatically when an index is later created on the same column.
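The three kinds of statistics created above can be told apart in the catalog. The following is a minimal sketch, assuming the dbo.Address sample table and the statistics from this slide exist in the current database; the stats_kind labels are illustrative names, not SQL Server terminology.

```sql
-- List every statistics object on dbo.Address and classify how it was created.
-- sys.stats.auto_created = 1: created automatically by the query processor.
-- sys.stats.user_created = 1: created with CREATE STATISTICS.
-- Both flags are 0 for statistics that belong to an index.
SELECT s.stats_id,
       s.name AS stats_name,
       CASE
           WHEN s.auto_created = 1 THEN 'column-based (auto-created)'
           WHEN s.user_created = 1 THEN 'manual (CREATE STATISTICS)'
           ELSE 'index-based'
       END AS stats_kind,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Address')
ORDER BY s.stats_id;
```

Auto-created statistics should show up with a _WA_Sys_ prefix in stats_name.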
Statistics: details
- Header
- Density vector
- Histogram

DBCC SHOW_STATISTICS ('[dbo].[Address]', PK_Address_AddressID)
GO

-- statistics are created when needed
SELECT * FROM [dbo].[Address] WHERE [AddressID] = 300
GO
Statistics: details (2012)

Column               Description
RANGE_HI_KEY         Upper-bound key value of the histogram step
RANGE_ROWS           Number of rows whose key value falls within the step, excluding the upper bound
EQ_ROWS              Number of rows whose key value equals RANGE_HI_KEY
DISTINCT_RANGE_ROWS  Number of distinct key values within the step, excluding the upper bound
AVG_RANGE_ROWS       Average number of rows per distinct key value within the step

All density = 1 / number of distinct values
All density * number of rows = average number of duplicates per key value
Creating statistics: multiple columns

CREATE NONCLUSTERED INDEX [idx_city_postalcode]
    ON [dbo].[Address] ([City] ASC, [PostalCode] ASC)

DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode)

The histogram is created for the first column only.
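Although the histogram covers only the first column, the density vector describes every column prefix of the index. As a rough check, the densities can be recomputed by hand and compared with the DBCC output. This sketch assumes dbo.Address is loaded (non-empty) and that idx_city_postalcode exists; the column aliases are made up for illustration.

```sql
-- Density = 1 / number of distinct values, computed per column prefix.
SELECT 1.0 / COUNT(DISTINCT City) AS density_city
FROM dbo.Address;

-- Same idea for the two-column prefix (City, PostalCode).
SELECT 1.0 / COUNT(*) AS density_city_postalcode
FROM (SELECT DISTINCT City, PostalCode FROM dbo.Address) AS d;

-- Compare with the "All density" column reported by SQL Server.
DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode) WITH DENSITY_VECTOR;
```

On a table with a clustered index, the density vector of a nonclustered index usually contains an extra row that includes the clustering key (here AddressID).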
Statistics: updating
Database options (SET):
- Auto Update Statistics
- Auto Update Statistics Asynchronously
Auto (sampling 20%)
Rules (SQL Server 2014 and earlier):
- More than 500 records: statistics are updated after 500 + 20% of the records are modified, regardless of the number of rows in the table
- Fewer than 500 records: statistics are updated after 500 modifications
- SQL Server 2008 R2 SP1 to SQL Server 2014: trace flag 2371 (global, instance level) makes the threshold for large tables (more than 25 000 records) dynamic
- SQL Server 2016 uses this improved algorithm automatically, so statistics on large tables are updated more often
Manual
- UPDATE STATISTICS / sp_updatestats / index rebuild
- sys.dm_db_stats_properties
Updating statistics invalidates cached plans.
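To see how close each statistics object is to an automatic update, the modification counter can be compared with both thresholds. This is a sketch only, assuming the dbo.Address sample table exists. The SQRT(1000 * rows) expression follows the formula Microsoft documents for the dynamic threshold (the lower of the two values applies); the classic formula shown here holds for tables with more than 500 rows. The threshold column aliases are illustrative.

```sql
-- Compare the number of modifications since the last statistics update
-- with the classic (500 + 20%) and the dynamic (SQRT(1000 * rows)) thresholds.
SELECT s.name                 AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter,
       500 + 0.20 * sp.rows   AS classic_threshold,
       SQRT(1000.0 * sp.rows) AS dynamic_threshold
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Address')
ORDER BY s.stats_id;
```

For small tables the classic threshold is the lower one; for large tables the square-root term grows much more slowly than 20% of the row count.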
Statistics usage: Optimizer
SQL Server Query Optimizer: a cost-based optimizer
- Cardinality estimation: the number of rows expected to be returned
- Selectivity: the percentage of input rows that satisfy a predicate
Cardinality Estimation Model (MS SQL 7):
- Independence (new in 2014)
- Uniformity
- Containment
- Inclusion (new in 2014)
TF 9481 enables legacy CE behavior. TF 2312 enables the new CE behavior.

-- SQL Server 2014 compatibility level
-- New Cardinality Estimator
ALTER DATABASE [AdventureWorks2012] SET COMPATIBILITY_LEVEL = 120

SELECT city, count(*)
FROM [Person].[Address]
GROUP BY city
OPTION (QUERYTRACEON 9481)

Trace flag 9481 reverts query compilation and execution to the pre-SQL Server 2014 legacy CE behavior for a specific statement.
Trace flag 2312 enables the new SQL Server 2014 CE for a specific query compilation and execution.

Independence: Data distributions on different columns are independent unless correlation information is available.
Uniformity: Within each histogram step of a statistics object, distinct values are evenly spread and each value has the same frequency.
Containment: If something is being searched for, it is assumed that it actually exists. For a join predicate involving an equijoin of two tables, it is assumed that distinct join column values from one side of the join exist on the other side. In addition, the smaller range of distinct values is assumed to be contained in the larger range.
Inclusion: For filter predicates involving a column-equal-constant expression, the constant is assumed to actually exist in the associated column. If a corresponding histogram step is non-empty, one of the step's distinct values is assumed to match the value from the predicate.
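The opposite direction works the same way: the database can stay on the SQL Server 2012 compatibility level (legacy CE by default) while a single statement opts in to the new CE. A minimal sketch, assuming an AdventureWorks2012 database on a SQL Server 2014 or later instance; note that QUERYTRACEON normally requires sysadmin permissions.

```sql
USE [AdventureWorks2012];
GO
-- SQL Server 2012 compatibility level: the legacy CE is the default.
ALTER DATABASE [AdventureWorks2012] SET COMPATIBILITY_LEVEL = 110;
GO
-- Trace flag 2312 switches this one statement to the new CE.
SELECT city, COUNT(*)
FROM [Person].[Address]
GROUP BY city
OPTION (QUERYTRACEON 2312);
```

Comparing the estimated row counts of this plan with the 9481 variant shows the effect of the two models on the same query.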
CE: under- and overestimating
Underestimating rows can lead to:
- Memory spills to disk, for example where not enough memory was requested for sort or hash operations
- Selection of a serial plan when a parallel plan would have been more optimal
- Inappropriate join strategies
- Inefficient index selection and navigation strategies
Conversely, overestimating rows can lead to:
- Selection of a parallel plan when a serial plan might be more optimal
- Inappropriate join strategy selection
- Inefficient index navigation strategies (scan versus seek)
- Inflated memory grants
- Wasted memory and unnecessarily throttled concurrency
Sample 1: constant “=” (RANGE_)

SELECT Addressid, AddressLine1, Addressline2, City
FROM [dbo].[Address]
WHERE [StateProvinceID] = 17;

DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;
Sample 1a: constant “=” (not RANGE_)

SELECT Addressid, AddressLine1, Addressline2, City
FROM [dbo].[Address]
WHERE City = 'Alexandria';

DBCC SHOW_STATISTICS ('[dbo].[Address]', [idx_city_postalcode]) WITH HISTOGRAM;
Sample 1b: constant “between”

SELECT Addressid, AddressLine1, Addressline2, City
FROM [dbo].[Address]
WHERE [StateProvinceID] BETWEEN 53 AND 57;

DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;

0 + 412 + 16 + 16 + 0 + 57 = 501
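The sum on this slide can be reproduced by loading the histogram into a temporary table and adding up RANGE_ROWS and EQ_ROWS for the steps inside the range. This is a simplified sketch, assuming dbo.Address and IDX_StateProvinceID exist; #hist is a hypothetical temp table name. It slightly overcounts when the first matching step has RANGE_ROWS that belong to key values below 53.

```sql
-- Temp table matching the five columns returned by WITH HISTOGRAM.
CREATE TABLE #hist (
    RANGE_HI_KEY        sql_variant,
    RANGE_ROWS          float,
    EQ_ROWS             float,
    DISTINCT_RANGE_ROWS float,
    AVG_RANGE_ROWS      float
);

-- Capture the DBCC result set.
INSERT INTO #hist
EXEC ('DBCC SHOW_STATISTICS (''[dbo].[Address]'', IDX_StateProvinceID) WITH HISTOGRAM');

-- Add up the rows of every histogram step whose upper bound lies in 53..57.
SELECT SUM(RANGE_ROWS + EQ_ROWS) AS rows_between_53_and_57
FROM #hist
WHERE CAST(RANGE_HI_KEY AS int) BETWEEN 53 AND 57;

DROP TABLE #hist;
```

The result can then be compared with the Estimated Number of Rows in the plan of the BETWEEN query.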
Sample 2: local variable “=”

DECLARE @id INT = 17;
SELECT Addressid, AddressLine1, Addressline2, City
FROM [dbo].[Address]
WHERE [StateProvinceID] = @id;

SELECT 1./count(DISTINCT StateProvinceID) AS [All density],
       (1./count(DISTINCT StateProvinceID))*count(*) AS Estimate
FROM [dbo].[Address];

All density      Estimate
0.01351351351    265.05405398514

DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH DENSITY_VECTOR;
Sample 3: local variable “<” or “>”

DELETE TOP (5) PERCENT FROM [dbo].[Address];

DECLARE @id INT = 3;
SELECT Addressid, AddressLine1, Addressline2, City
FROM [dbo].[Address]
WHERE [StateProvinceID] < @id;

DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH STAT_HEADER;

SELECT (COUNT(*)/100.0) * 30.0 AS real_30_pcnt,
       (19614/100.0) * 30.0 AS stat_30_pcnt
FROM [dbo].[Address]
Sample 4: optimization of “like %”

WITH rs AS (
    SELECT Addressid, AddressLine1, Addressline2, City, StateProvinceID
    FROM [dbo].[Address]
    WHERE AddressLine1 LIKE '%Monti%')
SELECT DISTINCT rs.City, p.StateProvinceCode, p.Name
FROM rs
INNER JOIN [Person].[StateProvince] p
    ON rs.StateProvinceID = p.StateProvinceID;
-- execute query

DROP STATISTICS dbo.address._WA_Sys_00000006_2E3BD7D3;

USE [master]
GO
ALTER DATABASE [AdventureWorks2014] SET AUTO_UPDATE_STATISTICS OFF;
ALTER DATABASE [AdventureWorks2014] SET AUTO_CREATE_STATISTICS OFF;
-- execute the same query
Sample 4: optimization of “like %”

Estimated Number of Rows (with statistics)      1.98121
Estimated Number of Rows (without statistics)   1765.26
Actual Number of Rows                           8
Sample 5: computed columns

SELECT *
FROM Sales.SalesOrderDetail
WHERE UnitPrice * OrderQty > 30000;

SELECT (count(*)/100.0)*30 AS _30_percent
FROM Sales.SalesOrderDetail;
Sample 5: computed columns

ALTER TABLE Sales.SalesOrderDetail ADD total AS UnitPrice * OrderQty;

SELECT *
FROM Sales.SalesOrderDetail
WHERE UnitPrice * OrderQty > 30000;

ALTER TABLE Sales.SalesOrderDetail DROP COLUMN total;
Sample 6: two conditions

SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address]
WHERE City = 'Melbourne' AND StateProvinceID = 77;

SELECT count(*) FROM [Person].[Address]; -- 19614 rows
Sample 6: two conditions

-- 2012: the predicates are treated as independent
SELECT ((110./19614)*(901./19614))*19614 AS estimate;

-- 2014: the selectivity of the second predicate is softened with a square root
SELECT ((110./19614)*SQRT(901./19614))*19614 AS estimate;
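The square root in the 2014 formula is the two-predicate case of what Microsoft describes as exponential backoff: selectivities are ordered from most to least selective and each further one is raised to 1/2, 1/4, 1/8. The sketch below repeats the two numbers from this sample and adds a hypothetical third predicate (@s_third, selectivity 0.5) that is not part of the original deck.

```sql
DECLARE @rows    float = 19614;          -- table cardinality from the slide
DECLARE @s_city  float = 110.0 / 19614;  -- selectivity of City = 'Melbourne'
DECLARE @s_state float = 901.0 / 19614;  -- selectivity of StateProvinceID = 77
DECLARE @s_third float = 0.5;            -- hypothetical third predicate

SELECT
    -- legacy CE: plain product of the selectivities (independence)
    @rows * @s_city * @s_state                                      AS legacy_two_predicates,
    -- new CE: most selective predicate in full, the next one under a square root
    @rows * @s_city * SQRT(@s_state)                                AS backoff_two_predicates,
    @rows * @s_city * @s_state * @s_third                           AS legacy_three_predicates,
    -- each further predicate gets a smaller exponent: 1/2, then 1/4
    @rows * @s_city * POWER(@s_state, 0.5) * POWER(@s_third, 0.25)  AS backoff_three_predicates;
```

The backoff estimates are always at least as large as the independence estimates, which reflects the assumption that real predicates tend to be correlated.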
Sample 7: filtered statistics

CREATE STATISTICS Victoria
    ON Person.Address(City)
    WHERE StateProvinceID = 77;

DBCC SHOW_STATISTICS ('Person.Address', Victoria);
Sample 7: filtered statistics

DBCC FREEPROCCACHE
GO
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address]
WHERE City = 'Melbourne' AND StateProvinceID = 77;

Use case: partitioned tables
Cons: updated in the same way as ordinary statistics
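Filtered statistics can be inspected through the catalog as well. A short sketch, assuming the Victoria statistics from this sample exist on Person.Address; it lists the filter predicate next to the filtered and unfiltered row counts.

```sql
-- Show filtered statistics on Person.Address with their filter predicate.
-- rows = rows that satisfied the filter, unfiltered_rows = rows in the table
-- at the time of the last statistics update.
SELECT s.name AS stats_name,
       s.filter_definition,
       sp.rows,
       sp.unfiltered_rows,
       sp.modification_counter,
       sp.last_updated
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'Person.Address')
  AND s.has_filter = 1;
```

A large gap between rows and unfiltered_rows shows how small a slice of the table the filtered statistics describe.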
Undocumented trace flag (8666)

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
GO
DBCC TRACEON (8666);

WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' AS p)
SELECT qt.text AS SQLCommand,
       qp.query_plan,
       StatsUsed.XMLCol.value('@FieldValue','NVarChar(500)') AS StatsName
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
CROSS APPLY query_plan.nodes('//p:Field[@FieldName="wszStatName"]') StatsUsed(XMLCol)
WHERE qt.text LIKE '%SELECT%'
  AND qt.text LIKE '%addressline1%';

DBCC TRACEOFF (8666);
Sample 9: auto-increment column (2012 vs 2014)

---- table creation
CREATE TABLE dbo.Address_1 (
    [AddressID] [int] IDENTITY NOT NULL PRIMARY KEY,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL)

----- data loading (~ 19 000 records)
INSERT INTO dbo.Address_1
    ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM Person.Address

----- creating / checking statistics
SELECT * FROM dbo.Address_1 WHERE addressid = 1

----- additional data loading (500 records)
INSERT INTO dbo.Address_1
    ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT TOP 500 [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM Person.Address

-- SQL query
SELECT * FROM dbo.Address_1 WHERE Addressid BETWEEN 19800 AND 19950;
Sample 9: auto-increment column (2012 vs 2014)

-- 2012
DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1B01ECE8F5])

-- 2014
DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1BD70E253B])
Sample 9: auto-increment column (2012 vs 2014)

-- 2012
DBCC FREEPROCCACHE
GO
SELECT * FROM dbo.Address_1 WHERE Addressid BETWEEN 19800 AND 19950;

-- 2014
Statistics for beginners
- CE model assumptions differ from the real world
- Statistics are approximate
- Performance depends on up-to-date statistics
- Statistics on non-indexed columns make sense
Q&A
Statistics for beginners Thanks! azrobok@gmail.com