Statistics for beginners

Lies, damned lies, and statistics. Demystifying statistics. Андрій Зробок

Agenda
CREATING STATISTICS:
- CREATE STATISTICS (FULLSCAN, SAMPLE NNN PERCENT)
- CREATE INDEX
- AUTO-CREATING: EXECUTE SQL-QUERY
- STATISTICS ON SEVERAL COLUMNS (TWO FOR EXAMPLE)
UPDATING STATISTICS:
- Automatic updates
- Synchronous / Asynchronous
- Manual updates
USAGE SAMPLES:
- WHERE COL_NAME = VALUE
- WHERE COL_NAME = VARIABLE
- WHERE COL_NAME > VARIABLE
- COMPUTED COLUMN
- FILTERED STATISTICS (SEVERAL COLUMNS)
- LIKE '%VAR%'
2 | 11/6/2018 | Statistics for beginners

Test data
database: AdventureWorks2012
table: Person.Address
table: dbo.Address

Optimizer
SQL Server Query Optimizer: a cost-based optimizer
- Cardinality estimation: the number of rows expected to be returned
- Selectivity: the percentage of input rows that satisfy a predicate
- Memory
Incorrect cardinality and cost estimation -> inefficient plans -> negative impact on performance
Quality of the execution plans = accuracy of cost estimations
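The two definitions on this slide combine into one line of arithmetic. The sketch below is only an illustration in Python with made-up numbers (the function name and the values are assumptions, not anything SQL Server exposes): estimated cardinality is the input row count multiplied by the predicate's selectivity.

```python
def estimated_cardinality(input_rows: int, selectivity: float) -> float:
    """Estimated rows out of an operator = input rows * predicate selectivity."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return input_rows * selectivity


# Hypothetical numbers: 10,000 input rows, a predicate that keeps 2.5% of them.
print(estimated_cardinality(10_000, 0.025))
```

If the selectivity is guessed wrong, the error carries straight through to the row estimate and from there to memory grants and plan choice.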

CE: model assumptions
SQL Server's CE component makes certain assumptions based on typical customer database designs, data distributions, and query patterns. The core assumptions are:
- Independence: Data distributions on different columns are independent unless correlation information is available.
- Uniformity: Within each statistics object histogram step, distinct values are evenly spread and each value has the same frequency.
- Containment: If something is being searched for, it is assumed that it actually exists. For a join predicate involving an equijoin for two tables, it is assumed that distinct join column values from one side of the join will exist on the other side of the join. In addition, the smaller range of distinct values is assumed to be contained in the larger range.
- Inclusion: For filter predicates involving a column-equal-constant expression, the constant is assumed to actually exist for the associated column. If a corresponding histogram step is non-empty, one of the step's distinct values is assumed to match the value from the predicate.
Given the vast potential for variations in data distribution, volume and query patterns, there are circumstances where the model assumptions are not applicable.

CE: under / over estimating
Underestimating rows can lead to memory spills to disk, for example, where not enough memory was requested for sort or hash operations. Underestimating rows can also result in:
- The selection of a serial plan when parallelism would have been more optimal.
- Inappropriate join strategies.
- Inefficient index selection and navigation strategies.
Inversely, overestimating rows can lead to:
- Selection of a parallel plan when a serial plan might be more optimal.
- Inappropriate join strategy selection.
- Inefficient index navigation strategies (scan versus seek).
- Inflated memory grants.
- Wasted memory and unnecessarily throttled concurrency.
Improving the accuracy of row estimates can improve the quality of the query execution plan and, as a result, improve the performance of the query.

Creating / updating statistics: SETs

Creating statistics: Test table
USE [AdventureWorks2012]
GO
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
CREATE TABLE [dbo].[Address](
    [AddressID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL,
    [SpatialLocation] [geography] NULL,
    [rowguid] [uniqueidentifier] ROWGUIDCOL NOT NULL,
    [ModifiedDate] [datetime] NOT NULL,
    CONSTRAINT [PK_Address_AddressID] PRIMARY KEY CLUSTERED ( [AddressID] ASC )
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
ALTER TABLE [dbo].[Address] ADD CONSTRAINT [DF_dbo_Address_rowguid] DEFAULT (newid()) FOR [rowguid]
ALTER TABLE [dbo].[Address] ADD CONSTRAINT [DF_dbo_Address_ModifiedDate] DEFAULT (getdate()) FOR [ModifiedDate]

Creating statistics: data loading
set nocount on table (i int = 1 <200 begin insert (i) values +1 end
INSERT INTO [dbo].[Address] ([AddressLine1] ,[AddressLine2] ,[City] ,[StateProvinceID] ,[PostalCode] ,[SpatialLocation] )
SELECT [AddressLine1] ,[StateProvinceID] +i FROM
Statistics are created or updated when they are needed, not immediately after data is loaded or changed:
DBCC SHOW_STATISTICS ('[dbo].[Address]', [PK_Address_AddressID])
The statistics object is empty.

Creating statistics: primary key
select * from [dbo].[Address] where [AddressID] = 1
DBCC SHOW_STATISTICS ('[dbo].[Address]', PK_Address_AddressID)

Creating statistics: definition
- Density is calculated with the formula (1 / frequency), where frequency is the average number of duplicates per key value.
- All density is calculated with the formula (1 / number of distinct values). Multiplied by the number of rows, it gives how many rows each combination of key values has on average.
- The RANGE_HI_KEY column stores a sample value of the key. This value is the upper-bound key value for the range defined by the histogram step.
- The RANGE_ROWS column estimates the number of rows within the interval.
- EQ_ROWS indicates how many rows have a key value equal to the RANGE_HI_KEY upper-bound value.
- DISTINCT_RANGE_ROWS indicates how many distinct key values are within the interval.
- AVG_RANGE_ROWS indicates the average number of rows per distinct key value in the interval.
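The definitions above can be tried on a toy column. This is a minimal Python sketch, not SQL Server's algorithm: the function names, the step boundaries and the data are all assumptions made for illustration. It computes all density as 1 / distinct values and fills in the columns of one histogram step, treating AVG_RANGE_ROWS as RANGE_ROWS / DISTINCT_RANGE_ROWS.

```python
from collections import Counter


def all_density(values):
    """All density = 1 / number of distinct values."""
    return 1.0 / len(set(values))


def histogram_step(values, prev_hi_key, range_hi_key):
    """Describe one histogram step covering (prev_hi_key, range_hi_key]."""
    counts = Counter(values)
    # Values strictly inside the interval; the upper bound is counted in EQ_ROWS.
    inside = {v: c for v, c in counts.items() if prev_hi_key < v < range_hi_key}
    range_rows = sum(inside.values())
    distinct_range_rows = len(inside)
    return {
        "RANGE_HI_KEY": range_hi_key,
        "RANGE_ROWS": range_rows,
        "EQ_ROWS": counts[range_hi_key],
        "DISTINCT_RANGE_ROWS": distinct_range_rows,
        "AVG_RANGE_ROWS": range_rows / distinct_range_rows if distinct_range_rows else 1.0,
    }


# Toy key column (hypothetical data, not from AdventureWorks).
keys = [1, 1, 1, 3, 3, 5, 7, 7, 7, 7, 9, 9]
density = all_density(keys)                 # 5 distinct values
avg_rows_per_key = density * len(keys)      # average rows per distinct key value
step = histogram_step(keys, prev_hi_key=1, range_hi_key=9)
print(density, avg_rows_per_key, step)
```

Within the step, any value other than the upper bound is estimated at AVG_RANGE_ROWS, which is the uniformity assumption in action: the skew between 5 (one row) and 7 (four rows) is invisible.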

Creating statistics: auto-creating
select distinct PostalCode from [dbo].[Address] where city = 'Concord'
(1 row(s) affected)
select * from sys.stats where object_id = object_id('[dbo].[Address]','U')
What does WA mean in the name of the statistic?
DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH STAT_HEADER
go
DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH STAT_HEADER
For string values, SQL Server stores additional information in the statistics, called trie trees.

DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH DENSITY_VECTOR
go
DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH DENSITY_VECTOR

DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH HISTOGRAM
go
DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH HISTOGRAM

Creating statistics: index
SET ANSI_PADDING ON
GO
CREATE NONCLUSTERED INDEX [IDX_StateProvinceID] ON [dbo].[Address] ( [StateProvinceID] ASC )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID)

Creating statistics: index (two columns)
CREATE NONCLUSTERED INDEX [idx_city_postalcode] ON [dbo].[Address] ( [City] ASC, [PostalCode] ASC )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode)
The histogram is created for the first column only.
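While the histogram covers only the leading column, the density vector of a multi-column statistic carries one entry per leading prefix of the key. The Python sketch below imitates that on a handful of invented rows (the function name and the data are assumptions for illustration only): each prefix gets 1 / (number of distinct value combinations).

```python
def density_vector(rows, columns):
    """Return [(prefix_columns, all_density)] for every leading prefix of `columns`."""
    result = []
    for n in range(1, len(columns) + 1):
        prefix = columns[:n]
        # Distinct combinations of values over the leading n columns.
        distinct = {tuple(row[c] for c in prefix) for row in rows}
        result.append((prefix, 1.0 / len(distinct)))
    return result


# Hypothetical rows; only the two indexed columns matter here.
rows = [
    {"City": "Concord", "PostalCode": "94519"},
    {"City": "Concord", "PostalCode": "94519"},
    {"City": "Concord", "PostalCode": "94520"},
    {"City": "Bothell", "PostalCode": "98011"},
]
vector = density_vector(rows, ["City", "PostalCode"])
for prefix, density in vector:
    print(prefix, density)
```

So a predicate on both columns can still use the combined density, but a predicate on PostalCode alone gets no help from this statistic's histogram.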

Creating statistics: two columns
CREATE STATISTICS CityProvince ON dbo.Address(City,StateProvinceID)
GO
DBCC SHOW_STATISTICS ('[dbo].[Address]', CityProvince) WITH STAT_HEADER
CREATE STATISTICS CityProvince ON dbo.Address(City,StateProvinceID) WITH FULLSCAN
GO
DBCC SHOW_STATISTICS ('[dbo].[Address]', CityProvince) WITH STAT_HEADER
CREATE STATISTICS CityProvince ON dbo.Address(City,StateProvinceID) WITH SAMPLE 50 PERCENT
GO
DBCC SHOW_STATISTICS ('[dbo].[Address]', CityProvince) WITH STAT_HEADER

Updating statistics
Auto
- Rules:
  - More than 500 rows: 500 modifications plus 20% of the rows
  - 500 rows or fewer: 500 modifications
  - The row count changes from 0
  - Temp tables: after every 6 modifications
  - Filtered statistics: the same algorithm as for regular statistics
- sp_autostats (turns auto-updating of statistics ON / OFF for particular objects)
- NORECOMPUTE (CREATE STATISTICS option)
- STATISTICS_NORECOMPUTE (CREATE INDEX option)
- Synchronous / Asynchronous
Manual
- UPDATE STATISTICS
- sp_updatestats (updates all statistics that have had at least one underlying row changed since the last statistics update)
- Index rebuild operation
Statistics are not deleted automatically (in case of index creation).
Updating statistics will result in cached plan invalidations.
Auto-updating statistics: sampling 20% rows
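The auto-update rules can be written down as a small function. This Python sketch assumes the commonly documented form of the classic rule for regular tables (500 modifications for small tables, 500 plus 20% of the rows for larger ones); the function name is hypothetical, and the sketch ignores temp tables and the lower dynamic threshold that newer compatibility levels apply to large tables.

```python
def auto_update_threshold(row_count: int) -> int:
    """Modifications needed before statistics count as stale (classic rule, regular tables)."""
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if row_count == 0:
        return 1                      # the row count changes from 0
    if row_count <= 500:
        return 500                    # small table: fixed 500 modifications
    return 500 + row_count // 5       # 500 + 20% of the rows (integer arithmetic)


for n in (0, 100, 1_000, 100_000, 10_000_000):
    print(n, auto_update_threshold(n))
```

The output makes the practical problem visible: on a ten-million-row table roughly two million changes are needed before an automatic update, which is why large tables often need scheduled manual updates.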

Updating statistics: information
SELECT OBJECT_NAME([sp].[object_id]) AS "Table",
       [sp].[stats_id] AS "Statistic ID",
       [s].[name] AS "Statistic",
       [sp].[last_updated] AS "Last Updated",
       [sp].[rows],
       [sp].[rows_sampled],
       [sp].[unfiltered_rows],
       [sp].[modification_counter] AS "Modifications"
FROM [sys].[stats] AS [s]
OUTER APPLY sys.dm_db_stats_properties ([s].[object_id],[s].[stats_id]) AS [sp]
WHERE [s].[object_id] = OBJECT_ID(N'dbo.Address');

New 2014 CE
Trace Flag 9481 reverts query compilation and execution to the pre-SQL Server 2014 legacy CE behavior for a specific statement.
Trace Flag 2312 enables the new SQL Server 2014 CE for a specific query compilation and execution.
--SQL Server 2014 compatibility level - New Cardinality Estimator
ALTER DATABASE [AdventureWorks2012] SET COMPATIBILITY_LEVEL = 120
select city, count(*) from [Person].[Address] group by city OPTION (QUERYTRACEON 9481)
--SQL Server 2012 compatibility level - Old Cardinality Estimator
ALTER DATABASE [AdventureWorks2014] SET COMPATIBILITY_LEVEL = 110
select city, count(*) from [Person].[Address] group by city OPTION (QUERYTRACEON 2312)

Undocumented (8666)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
GO
DBCC TRACEON (8666);
WITH XMLNAMESPACES (' as p)
SELECT qt.text AS SQLCommand, qp.query_plan, AS StatsName
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
CROSS APPLY StatsUsed(XMLCol)
WHERE qt.text LIKE '%SELECT%' AND qt.text LIKE '%addressline1%';
DBCC TRACEOFF(8666);

Undocumented (rowcount, pagecount)
<update_stats_stream_option> ::=
    [ STATS_STREAM = stats_stream ]
    [ ROWCOUNT = numeric_constant ]
    [ PAGECOUNT = numeric_constant ]
<update_stats_stream_option>: This syntax is for internal use only and is not supported. Microsoft reserves the right to change this syntax at any time.
use tempdb
go
create table t1(i int, j int)
create table t2(h int, k int)

select distinct(i) from t1
select * from t1, t2 where i = k order by j + k
update statistics t1 with rowcount = 10000, pagecount = 10000
update statistics t2 with rowcount = , pagecount =

select distinct(i) from t1
select * from t1, t2 where i = k order by j + k

SAMPLES

Sample 1: constant =
select addressid, AddressLine1, addressline2, city from [dbo].[Address] where [StateProvinceID] = 17
Go
8429 row(s) affected

Sample 2: local variable =
int = 17
select addressid, AddressLine1, addressline2, city from [dbo].[Address] where [StateProvinceID]
Go
8429 row(s) affected

Sample 2a: local variable = constant
create table #t (id int not null identity(1,1) primary key, descr varchar(20) not null)
go
insert into #t (descr) values ('descr 0'),('descr 1'),('descr 2'),('descr 3'),('descr 4'),('descr 5'),('descr 6'),('descr 7'),('descr 8'),('descr 9')
go 100
select * from #t where descr = 'descr 2'
go
varchar(20) = 'descr 2'
select * from #t where descr

Sample 3: local variable < (>)
delete top (5) percent from [dbo].[Address]
go
int = 9
select addressid, AddressLine1, addressline2, city from [dbo].[Address] where [StateProvinceID]
0 row(s) affected
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID)
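Samples 2 and 3 differ from Sample 1 because the optimizer cannot see a local variable's value at compile time, so the histogram is of no use. The Python sketch below shows the usual fallback arithmetic under that assumption: an equality predicate is estimated from all density (average rows per distinct value), and an inequality gets a fixed 30% guess. The function names and the numbers are hypothetical.

```python
def estimate_equality_unknown(table_rows: int, all_density: float) -> float:
    """column = @variable: use the average number of rows per distinct value."""
    return table_rows * all_density


def estimate_inequality_unknown(table_rows: int, guess: float = 0.30) -> float:
    """column > @variable (or <): fixed 30% guess, regardless of the actual value."""
    return table_rows * guess


# Hypothetical table: 1,000 rows, 50 distinct values in the filtered column.
rows, distinct_values = 1_000, 50
print(estimate_equality_unknown(rows, 1.0 / distinct_values))
print(estimate_inequality_unknown(rows))
```

This is why the inequality in Sample 3 can be estimated at a large share of the table even though it actually returns 0 rows.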

Sample 4: like %
;with rs as (select addressid, AddressLine1, addressline2, city, StateProvinceID from [dbo].[Address] where AddressLine1 like '%Monti%')
select distinct rs.city, p.StateProvinceCode, p.Name
from rs inner join [Person].[StateProvince] p on rs.StateProvinceID = p.StateProvinceID
go
;with rs as (select addressid, AddressLine1, addressline2, city, StateProvinceID from [dbo].[Address] where AddressLine1 like '%Circle')
DBCC SHOW_STATISTICS ('[dbo].[Address]', _WA_Sys_ _041093DD) WITH STAT_HEADER

Sample 4: like % (with statistics)
(1244 row(s) affected)
(23321 row(s) affected)

Sample 4: like % (with statistics)
(23321 row(s) affected)
(1244 row(s) affected)

Sample 4: like % (without statistics)
drop statistics dbo.address._WA_Sys_ _041093DD
go
USE [master]
GO
ALTER DATABASE [AdventureWorks2012] SET AUTO_UPDATE_STATISTICS OFF
ALTER DATABASE [AdventureWorks2012] SET AUTO_CREATE_STATISTICS OFF

Estimated: with statistics 1078 vs without statistics
Estimated: with statistics vs without statistics

Sample 5: computed columns
SELECT (count(*)/100.0)*30 as _30_percent FROM Sales.SalesOrderDetail
go
SET STATISTICS PROFILE ON
GO
SELECT * FROM Sales.SalesOrderDetail WHERE UnitPrice * OrderQty > 30000
SET STATISTICS PROFILE OFF

Sample 5: computed columns
ALTER TABLE Sales.SalesOrderDetail ADD total AS UnitPrice * OrderQty
DBCC SHOW_STATISTICS ('[Sales].[SalesOrderDetail]', _WA_Sys_ F_44CA3770)
ALTER TABLE Sales.SalesOrderDetail DROP COLUMN total

Sample 7: two conditions (2012; independent)
SELECT [AddressLine1] ,[AddressLine2] ,[City] ,[StateProvinceID] ,[PostalCode]
FROM [Person].[Address]
where city = 'Melbourne' and stateprovinceid = 77
DBCC SHOW_STATISTICS ('[Person].[Address]', IX_Address_StateProvinceID)
DBCC SHOW_STATISTICS ('[Person].[Address]', _WA_Sys_ _164452B1)
Select ((901.0/count(*)) * (110.0/count(*))) * count(*) as EstimatedNumberofRows from [Person].[Address]
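The last query on this slide is the independence assumption written as arithmetic: the two selectivities are simply multiplied. Here is the same calculation as a Python sketch. The counts 901 and 110 come from the slide; the total of 19,614 rows for Person.Address is an assumption on my part, so substitute your own count(*).

```python
def independent_estimate(total_rows: int, rows_matching_a: int, rows_matching_b: int) -> float:
    """Legacy CE style estimate for `A AND B`, assuming the predicates are uncorrelated."""
    sel_a = rows_matching_a / total_rows
    sel_b = rows_matching_b / total_rows
    return sel_a * sel_b * total_rows


TOTAL_ROWS = 19_614  # assumed row count of Person.Address, not stated on the slide
estimate = independent_estimate(TOTAL_ROWS, 901, 110)
print(round(estimate, 2))
```

Because city and state are strongly correlated in real address data, multiplying the selectivities tends to underestimate the result badly.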

Sample 7: two conditions (2014; selectivity)
SELECT [AddressLine1] ,[AddressLine2] ,[City] ,[StateProvinceID] ,[PostalCode]
FROM [Person].[Address]
where city = 'Melbourne' and stateprovinceid = 77
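For conjunctions, the 2014 CE is publicly described as using "exponential backoff": the most selective predicate counts in full, and each following one is softened by a further square root (s0 * s1^(1/2) * s2^(1/4) * s3^(1/8), with at most four predicates). A Python sketch of that formula follows; the function name is hypothetical, the counts 901 and 110 are from the previous slide, and the 19,614-row total is an assumed value.

```python
import math


def backoff_estimate(total_rows: int, selectivities) -> float:
    """Combine predicate selectivities with exponential backoff (most selective first)."""
    combined = 1.0
    # Only the four most selective predicates take part.
    for i, sel in enumerate(sorted(selectivities)[:4]):
        combined *= sel ** (1.0 / (2 ** i))
    return combined * total_rows


TOTAL_ROWS = 19_614  # assumed row count of Person.Address
sels = [901 / TOTAL_ROWS, 110 / TOTAL_ROWS]
new_ce = backoff_estimate(TOTAL_ROWS, sels)
legacy_ce = sels[0] * sels[1] * TOTAL_ROWS  # plain independence, for comparison
print(round(new_ce, 2), round(legacy_ce, 2))
```

The backoff estimate lands between the independence estimate and the estimate for the most selective predicate alone, which is a hedge toward correlated columns.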

Sample 7: filtered statistics
CREATE STATISTICS Victoria ON Person.Address(City) WHERE StateProvinceID = 77

Sample 7: filtered statistics
DBCC FREEPROCCACHE
GO
SELECT [AddressLine1] ,[AddressLine2] ,[City] ,[StateProvinceID] ,[PostalCode]
FROM [Person].[Address]
where city = 'Melbourne' and stateprovinceid = 77
Partition table

Sample 8: out of date statistics
DBCC SHOW_STATISTICS ('[dbo].[Address]', PostalCode)
INSERT INTO [dbo].[Address] ([AddressLine1] ,[AddressLine2],[City],[StateProvinceID],[PostalCode])
VALUES ('AddressLine1', 'AddressLine2','City',5,N'YO16')
GO
select AddressLine1,AddressLine2, city from [dbo].[Address] where postalcode = N'YO16'
logical reads

Sample 8: out of date statistics (memory)
select city,count(*) from [dbo].[Address] where postalcode = N'YO16' group by city

Sample 8: corrected out of date statistics
update statistics [dbo].[Address] [postalcode] with fullscan
select AddressLine1,AddressLine2, city from [dbo].[Address] where postalcode = N'YO16'
logical reads 86357

Sample 8: corrected out of date statistics
select city,count(*) from [dbo].[Address] where postalcode = N'YO16' group by city

Sample 9: auto-increment column (2012 vs 2014)
create table dbo.Address_1 (
    [AddressID] [int] IDENTITY NOT NULL PRIMARY KEY,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL
)
insert into dbo.Address_1 ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
select [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode] from person.address
SELECT * FROM dbo.Address_1 where addressid = 1
insert into dbo.Address_1 ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
select top 500 [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode] from person.address

dbcc freeproccache
go
SELECT * FROM dbo.Address_1 where addressid between and 19950

Question: variation between estimated and actual rows
There is no hard-coded variance that is guaranteed to indicate an actionable cardinality estimate problem. Instead, there are several overarching factors to consider beyond just differences between estimated and actual row counts:
- Does the row estimate skew result in excessive resource consumption? For example, spills to disk because of underestimates of rows or wasteful reservation of memory caused by row overestimates.
- Does the row estimate skew coincide with specific query performance problems (e.g., longer execution time than expected)?

- Model assumptions differ from the real world
- Statistics are approximate
- Performance depends on up-to-date statistics
- Statistics on non-indexed columns make sense
Q&A
The end
