Statistics for beginners

Andrii Zrobok

What are statistics? How are they collected and updated? Samples.

SQLSat Kyiv Team Yevhen Nedashkivskyi Alesya Zhuk Eugene Polonichko
Oksana Borysenko Mykola Pobyivovk Oksana Tkach


Agenda
“Lies, blatant lies, statistics”
Statistics: objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view.
    Creating
    Updating
    Optimizer model
    Usage

Statistics: creating
    Column-based (auto-created, single column only)
    Index-based (created together with the index)
    Manually created (CREATE STATISTICS)
SET options:
    Auto Create Statistics
    Auto Create Incremental Statistics

    -- SAMPLE TABLE
    CREATE TABLE [dbo].[Address](
        [AddressID] [int] IDENTITY(1,1) NOT NULL,
        [AddressLine1] [nvarchar](60) NOT NULL,
        [AddressLine2] [nvarchar](60) NULL,
        [City] [nvarchar](30) NOT NULL,
        [StateProvinceID] [int] NOT NULL,
        [PostalCode] [nvarchar](15) NOT NULL,
        CONSTRAINT [PK_Address_AddressID] PRIMARY KEY CLUSTERED ([AddressID] ASC))

    -- TEST DATA
    INSERT INTO [dbo].[Address]
        ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
    SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
    FROM [Person].[Address]

Get info about statistics objects
sys.stats
sys.stats_columns

    SELECT s.stats_id StatsID, s.name StatsName,
           sc.stats_column_id StatsColID, c.name ColumnName
    FROM sys.stats s
    INNER JOIN sys.stats_columns sc
        ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
    INNER JOIN sys.columns c
        ON sc.object_id = c.object_id AND sc.column_id = c.column_id
    INNER JOIN sys.objects o
        ON o.object_id = c.object_id
    WHERE OBJECT_NAME(s.object_id) = 'Address'
      AND SCHEMA_NAME(o.schema_id) = 'dbo'
    ORDER BY s.stats_id, sc.column_id;

Creating statistics

    -- COLUMN-BASED STATISTICS
    SELECT DISTINCT PostalCode FROM [dbo].[Address] WHERE city = 'Concord'

What does WA mean in the name of the statistic?

    -- INDEX-BASED STATISTICS
    CREATE NONCLUSTERED INDEX [IDX_StateProvinceID]
        ON [dbo].[Address] ([StateProvinceID] ASC)

    -- STATISTICS (manually created)
    CREATE STATISTICS STA_StateProvinceID
        ON [dbo].[Address] (StateProvinceID)

Manually created statistics are not deleted automatically when an index is later created on the same column.

Statistics: details
    Header
    Density vector
    Histogram

    DBCC SHOW_STATISTICS ('[dbo].[Address]', PK_Address_AddressID)
    go

    -- statistics are created when needed
    select * from [dbo].[Address] where [AddressID] = 300
    go

Statistics: details (2012)
Column                Description
RANGE_HI_KEY          Upper-bound key value for the range defined by the histogram step
RANGE_ROWS            Number of rows within the interval
EQ_ROWS               Number of rows whose key value equals RANGE_HI_KEY
DISTINCT_RANGE_ROWS   Number of distinct key values within the interval
AVG_RANGE_ROWS        Average number of rows per distinct key value in the interval

Density describes the average number of duplicates per key value.
All density is calculated as 1 / (number of distinct values).
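How the histogram columns feed an equality estimate can be sketched in a few lines of Python. This is a simplified illustration, not the optimizer's actual code: the histogram steps below are made-up values, and the names HISTOGRAM and estimate_equals are hypothetical. A value that matches a step boundary takes EQ_ROWS; a value inside a step takes AVG_RANGE_ROWS.

```python
from bisect import bisect_left

# Hypothetical histogram steps (made-up numbers, not from AdventureWorks):
# (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS)
HISTOGRAM = [
    (10, 0, 50, 0),
    (20, 90, 40, 9),
    (30, 30, 70, 3),
]


def avg_range_rows(range_rows, distinct_range_rows):
    """AVG_RANGE_ROWS = RANGE_ROWS / DISTINCT_RANGE_ROWS (1 when the range is empty)."""
    if distinct_range_rows == 0:
        return 1.0
    return range_rows / distinct_range_rows


def estimate_equals(value):
    """Simplified row estimate for the predicate column = value."""
    keys = [step[0] for step in HISTOGRAM]
    i = bisect_left(keys, value)  # first step whose RANGE_HI_KEY >= value
    if i == len(HISTOGRAM):
        # Simplification: values beyond the last step are not modelled here.
        return 0.0
    hi_key, range_rows, eq_rows, distinct_range_rows = HISTOGRAM[i]
    if value == hi_key:
        return float(eq_rows)  # value sits exactly on a step boundary
    return avg_range_rows(range_rows, distinct_range_rows)


print(estimate_equals(20))  # boundary value: EQ_ROWS of the step
print(estimate_equals(15))  # inside a step: AVG_RANGE_ROWS of the step
```

Samples 1 and 1a later in the deck show the same two cases against real DBCC SHOW_STATISTICS output.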

Creating statistics: multiple columns
    CREATE NONCLUSTERED INDEX [idx_city_postalcode]
        ON [dbo].[Address] ([City] ASC, [PostalCode] ASC)

    DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode)

The histogram is created for the first column only.

Statistics: updating
SET options:
    Auto Update Statistics
    Auto Update Statistics Asynchronously

Auto (sampling 20%)
    Rules (2014 and earlier; the 20% factor applies regardless of the number of rows in the table):
        More than 500 records: 500 + 20% of the records are modified
        500 records or fewer: 500 modifications
    From SQL Server 2008 R2 SP1: trace flag 2371 (global, instance level) makes the formula more dynamic for large tables.
    SQL Server 2016 uses this improved algorithm automatically. With this change, statistics on large tables are updated more often.

Manual
    UPDATE STATISTICS / sp_updatestats / index rebuild operation

sys.dm_db_stats_properties

Updating statistics invalidates cached plans.
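The two update thresholds can be compared with a short calculation. The sketch below assumes the formulas as they are commonly documented for SQL Server (500 + 20% of rows for the classic rule, and the smaller of that value and SQRT(1000 * rows) for the dynamic rule); the function names are hypothetical and the formulas are not taken from the slides themselves.

```python
import math


def classic_threshold(rows):
    """Modifications needed before auto update (assumed classic rule)."""
    if rows <= 500:
        return 500.0
    return 500 + 0.20 * rows


def dynamic_threshold(rows):
    """Assumed dynamic rule: never higher than the classic threshold."""
    return min(classic_threshold(rows), math.sqrt(1000 * rows))


for rows in (19_614, 1_000_000, 100_000_000):
    print(rows, round(classic_threshold(rows)), round(dynamic_threshold(rows)))
```

For a small table the two rules give the same threshold; the gap only opens up as the table grows, which is why large tables get their statistics refreshed more often under the dynamic rule.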

Statistic usage: Optimizer
SQL Server Query Optimizer: a cost-based optimizer.
Cardinality estimation: the number of records that will be returned.
Selectivity: the percentage of rows from the input that satisfy a predicate.

Cardinality Estimation model (dating from MS SQL 7, new in 2014) assumptions:
    Independence
    Uniformity
    Containment
    Inclusion

TF 9481 enables the legacy CE behavior. TF 2312 enables the new CE behavior.

    -- SQL Server 2014 compatibility level
    -- New Cardinality Estimator
    ALTER DATABASE [AdventureWorks2012] SET COMPATIBILITY_LEVEL = 120

    SELECT city, count(*)
    FROM [Person].[Address]
    GROUP BY city
    OPTION (QUERYTRACEON 9481)

Trace flag 9481 reverts query compilation and execution to the pre-SQL Server 2014 legacy CE behavior for a specific statement.
Trace flag 2312 enables the new SQL Server 2014 CE for a specific query compilation and execution.

Independence: data distributions on different columns are independent unless correlation information is available.
Uniformity: within each histogram step of a statistics object, distinct values are evenly spread and each value has the same frequency.
Containment: if something is being searched for, it is assumed that it actually exists. For an equijoin predicate between two tables, distinct join column values from one side of the join are assumed to exist on the other side. In addition, the smaller range of distinct values is assumed to be contained in the larger range.
Inclusion: for filter predicates of the form column = constant, the constant is assumed to actually exist in the column. If the corresponding histogram step is non-empty, one of the step's distinct values is assumed to match the value from the predicate.

CE: under- / overestimating
Underestimating rows can lead to:
    Memory spills to disk, for example where not enough memory was requested for sort or hash operations.
    Selection of a serial plan when parallelism would have been more optimal.
    Inappropriate join strategies.
    Inefficient index selection and navigation strategies.
Conversely, overestimating rows can lead to:
    Selection of a parallel plan when a serial plan might be more optimal.
    Inappropriate join strategy selection.
    Inefficient index navigation strategies (scan versus seek).
    Inflated memory grants.
    Wasted memory and unnecessarily throttled concurrency.

Sample 1: constant “=” (RANGE_HI_KEY)
    SELECT Addressid, AddressLine1, Addressline2, City
    FROM [dbo].[Address]
    WHERE [StateProvinceID] = 17;

    DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;

Sample 1a: constant “=” (not RANGE_HI_KEY)
    SELECT Addressid, AddressLine1, Addressline2, City
    FROM [dbo].[Address]
    WHERE City = 'Alexandria';

    DBCC SHOW_STATISTICS ('[dbo].[Address]', [idx_city_postalcode]) WITH HISTOGRAM;

Sample 1b: constant “between”
    SELECT Addressid, AddressLine1, Addressline2, City
    FROM [dbo].[Address]
    WHERE [StateProvinceID] BETWEEN 53 AND 57;

    DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;

= 501

Sample 2: local variable “=”

    -- the variable name was lost in the transcript; @StateProvinceID is assumed
    DECLARE @StateProvinceID INT = 17;
    SELECT Addressid, AddressLine1, Addressline2, City
    FROM [dbo].[Address]
    WHERE [StateProvinceID] = @StateProvinceID;

    SELECT 1./count(DISTINCT StateProvinceID) AS [All density],
           (1./count(DISTINCT StateProvinceID))*count(*) AS Estimate
    FROM [dbo].[Address];

    DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH DENSITY_VECTOR;
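The estimate for an equality against a local variable can be reproduced by hand, because the optimizer cannot see the variable's value and falls back on the density vector. A minimal sketch, assuming 19614 rows (the figure used elsewhere in this deck) and a hypothetical count of 74 distinct StateProvinceID values; the function name is also hypothetical.

```python
def density_estimate(total_rows, distinct_values):
    """Estimate = All density * row count, where All density = 1 / distinct values."""
    all_density = 1.0 / distinct_values
    return all_density * total_rows


# 19614 rows comes from the deck; 74 distinct values is an assumed figure.
estimate = density_estimate(19614, 74)
print(round(estimate, 2))
```

The same estimate is returned for every value of the variable, which is the point of the sample: the histogram is not consulted at all.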

Sample 3: local variable “<” or “>”

    DELETE TOP (5) PERCENT FROM [dbo].[Address];

    -- the variable name was lost in the transcript; @StateProvinceID is assumed
    DECLARE @StateProvinceID INT = 3;
    SELECT Addressid, AddressLine1, Addressline2, City
    FROM [dbo].[Address]
    WHERE [StateProvinceID] > @StateProvinceID;  -- or <

    DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH STAT_HEADER;

    SELECT (COUNT(*)/100.0) * 30.0 AS real_30_pcnt,
           (19614/100.0) * 30.0 AS stat_30_pcnt
    FROM [dbo].[Address]
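The two 30% figures in the last query reflect the fixed guess applied to an inequality against a local variable: 30% of the row count recorded in the statistics header, not of the current table. A small sketch, where 19614 is the header row count from the deck and 18633 is a hypothetical row count after the delete; the names are illustrative.

```python
INEQUALITY_GUESS = 0.30  # fixed selectivity guess for < or > with an unknown value


def inequality_estimate(row_count):
    """Rows estimated for column > @variable when the value is unknown."""
    return row_count * INEQUALITY_GUESS


stat_header_rows = 19614  # row count stored in the statistics header (from the deck)
actual_rows = 18633       # hypothetical row count after DELETE TOP (5) PERCENT

print(round(inequality_estimate(stat_header_rows), 1))  # what the optimizer uses
print(round(inequality_estimate(actual_rows), 1))       # 30% of the real table
```

Until the statistics are updated, the estimate keeps following the stale header value.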

Sample 4: optimization of “like %”
    WITH rs AS (
        SELECT Addressid, AddressLine1, Addressline2, City, StateProvinceID
        FROM [dbo].[Address]
        WHERE AddressLine1 LIKE '%Monti%')
    SELECT DISTINCT rs.City, p.StateProvinceCode, p.Name
    FROM rs
    INNER JOIN [Person].[StateProvince] p
        ON rs.StateProvinceID = p.StateProvinceID;

    -- execute query

    DROP STATISTICS dbo.address._WA_Sys_ _2E3BD7D3;

    USE [master]
    GO
    ALTER DATABASE [AdventureWorks2014] SET AUTO_UPDATE_STATISTICS OFF;
    ALTER DATABASE [AdventureWorks2014] SET AUTO_CREATE_STATISTICS OFF;

    -- execute the same query

Sample 4: optimization of “like %”
Compare the Estimated Number of Rows with statistics and without statistics against the actual number of rows (8).

Sample 5: computed columns
    SELECT * FROM Sales.SalesOrderDetail
    WHERE UnitPrice * OrderQty > 30000;

    SELECT (count(*)/100.0)*30 AS _30_percent
    FROM Sales.SalesOrderDetail;

Sample 5: computed columns
    ALTER TABLE Sales.SalesOrderDetail ADD total AS UnitPrice * OrderQty;

    SELECT * FROM Sales.SalesOrderDetail
    WHERE UnitPrice * OrderQty > 30000;

    ALTER TABLE Sales.SalesOrderDetail DROP COLUMN total;

Sample 6: two conditions

    SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
    FROM [Person].[Address]
    WHERE City = 'Melbourne' AND StateProvinceID = 77;

    SELECT count(*) FROM [Person].[Address];

Sample 6: two conditions

    -- 2012: independent predicates
    SELECT ((110./19614)*(901./19614))*19614 AS estimate;

    -- 2014: selectivity of the second predicate is softened with a square root
    SELECT ((110./19614)*SQRT(901./19614))*19614 AS estimate;
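The two formulas can be generalised to any number of predicates. The sketch below uses the numbers from the slide (110 and 901 matching rows out of 19614); the function names are hypothetical, and the backoff rule (most selective predicate first, each following one raised to 1/2, 1/4, 1/8, at most four predicates) is the commonly documented behaviour of the 2014 estimator rather than something stated on the slide.

```python
def independence_estimate(total_rows, matching_rows):
    """Legacy CE: multiply the selectivities of all predicates."""
    estimate = float(total_rows)
    for matches in matching_rows:
        estimate *= matches / total_rows
    return estimate


def backoff_estimate(total_rows, matching_rows):
    """New CE sketch: exponential backoff over the four most selective predicates."""
    selectivities = sorted(matches / total_rows for matches in matching_rows)
    estimate = float(total_rows)
    for i, selectivity in enumerate(selectivities[:4]):
        estimate *= selectivity ** (1.0 / 2 ** i)  # exponents 1, 1/2, 1/4, 1/8
    return estimate


rows = 19614
matches = [110, 901]  # City = 'Melbourne', StateProvinceID = 77

print(round(independence_estimate(rows, matches), 2))
print(round(backoff_estimate(rows, matches), 2))
```

The backoff estimate is several times higher than the independence estimate, which suits correlated columns such as a city and its state.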

Sample 7: filtered statistics
    CREATE STATISTICS Victoria
        ON Person.Address(City)
        WHERE StateProvinceID = 77;

    DBCC SHOW_STATISTICS ('Person.Address', Victoria);

Sample 7: filtered statistics
    DBCC FREEPROCCACHE
    GO
    SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
    FROM [Person].[Address]
    WHERE City = 'Melbourne' AND StateProvinceID = 77;

Use case: partitioned tables.
Cons: filtered statistics are updated by the same rules as normal statistics.

Undocumented (8666)

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    GO
    DBCC TRACEON (8666);
    WITH XMLNAMESPACES (' as p)
    SELECT qt.text AS SQLCommand, qp.query_plan, AS StatsName
    FROM sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
    CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
    CROSS APPLY StatsUsed(XMLCol)
    WHERE qt.text LIKE '%SELECT%'
      AND qt.text LIKE '%addressline1%';
    DBCC TRACEOFF(8666);

Sample 9: auto-increment column (2012 vs 2014)

    ---- table creation
    CREATE TABLE dbo.Address_1 (
        [AddressID] [int] IDENTITY NOT NULL PRIMARY KEY,
        [AddressLine1] [nvarchar](60) NOT NULL,
        [AddressLine2] [nvarchar](60) NULL,
        [City] [nvarchar](30) NOT NULL,
        [StateProvinceID] [int] NOT NULL,
        [PostalCode] [nvarchar](15) NOT NULL)

    ----- data loading (~ records)
    INSERT INTO dbo.Address_1
        ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
    SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
    FROM Person.Address

    ----- creating / check statistics
    SELECT * FROM dbo.Address_1 WHERE addressid = 1

    ----- additional data loading (500 records)
    INSERT INTO dbo.Address_1
        ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
    SELECT TOP 500 [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
    FROM Person.Address

    -- SQL Query
    SELECT * FROM dbo.Address_1 WHERE Addressid BETWEEN AND 19950;

    DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1B01ECE8F5])
    DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1BD70E253B])

    DBCC FREEPROCCACHE
    GO
    SELECT * FROM dbo.Address_1 WHERE Addressid BETWEEN AND 19950;

Conclusions
    CE model assumptions differ from the real world
    Statistics are approximate
    Performance depends on up-to-date statistics
    Statistics on non-indexed columns make sense

Q&A


Thanks!
