Statistics for beginners


Statistics for beginners (Andrii Zrobok)

What are statistics? How are they collected and updated? Samples.

SQLSat Kyiv Team Yevhen Nedashkivskyi Alesya Zhuk Eugene Polonichko Oksana Borysenko Mykola Pobyivovk Oksana Tkach


Agenda
"Lies, blatant lies, statistics"
Statistics: objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view.
- Creating
- Updating
- Optimizer model
- Usage

Statistics: creating
- Column-based (auto-created, single column only)
- Index-based (created together with the index)
- Manually created (CREATE STATISTICS)
SET options: Auto Create Statistics, Auto Create Incremental Statistics
-- SAMPLE TABLE
CREATE TABLE [dbo].[Address](
    [AddressID] [int] IDENTITY(1,1) NOT NULL,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL,
    CONSTRAINT [PK_Address_AddressID] PRIMARY KEY CLUSTERED ([AddressID] ASC));
-- TEST DATA
INSERT INTO [dbo].[Address] ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address];
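The SET options on this slide are database-level settings. A minimal sketch of how to inspect and enable them, assuming the AdventureWorks2014 database name used later in the deck (the incremental column is only present from SQL Server 2014 onward):

```sql
-- Show the statistics-related options of the current database.
SELECT name,
       is_auto_create_stats_on,
       is_auto_create_stats_incremental_on,  -- available from SQL Server 2014
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases
WHERE database_id = DB_ID();

-- Turn automatic creation of column statistics on.
-- The database name is taken from a later slide; adjust it to your own.
ALTER DATABASE [AdventureWorks2014] SET AUTO_CREATE_STATISTICS ON;
```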

Get info about statistics objects
Catalog views: sys.stats, sys.stats_columns
SELECT s.stats_id StatsID, s.name StatsName, sc.stats_column_id StatsColID, c.name ColumnName
FROM sys.stats s
INNER JOIN sys.stats_columns sc ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
INNER JOIN sys.columns c ON sc.object_id = c.object_id AND sc.column_id = c.column_id
INNER JOIN sys.objects o ON o.object_id = c.object_id
WHERE OBJECT_NAME(s.object_id) = 'Address' AND SCHEMA_NAME(o.schema_id) = 'dbo'
ORDER BY s.stats_id, sc.column_id;

Creating statistics
-- COLUMN-BASED STATISTICS (auto-created by the first query that filters on the column)
SELECT DISTINCT PostalCode FROM [dbo].[Address] WHERE City = 'Concord';
What does WA mean in the name of the statistic?
-- INDEX-BASED STATISTICS
CREATE NONCLUSTERED INDEX [IDX_StateProvinceID] ON [dbo].[Address] ([StateProvinceID] ASC);
-- STATISTICS (manually created)
CREATE STATISTICS STA_StateProvinceID ON [dbo].[Address] (StateProvinceID);
Manually created statistics are not deleted automatically (for example, when an index is later created on the same column).

Statistics: details
Three parts: header, density vector, histogram
DBCC SHOW_STATISTICS ('[dbo].[Address]', PK_Address_AddressID);
GO
-- created when needed
SELECT * FROM [dbo].[Address] WHERE [AddressID] = 300;
GO

Statistics: details (2012)
Histogram columns:
RANGE_HI_KEY: upper-bound key value of the histogram step
RANGE_ROWS: number of rows whose key value falls within the step, excluding the upper bound
EQ_ROWS: number of rows whose key value equals RANGE_HI_KEY
DISTINCT_RANGE_ROWS: number of distinct key values within the step, excluding the upper bound
AVG_RANGE_ROWS: average number of rows per distinct key value within the step
Density = 1 / (number of distinct values). Multiplying density by the number of rows gives the average number of duplicates per key value.
"All density" in the density vector is calculated as 1 / (number of distinct values).
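On newer versions the same histogram can also be read as a rowset. A minimal sketch, assuming SQL Server 2016 SP1 CU2 or later (where sys.dm_db_stats_histogram is available) and the dbo.Address sample table from this deck; the recomputed_avg column is only an illustration of how AVG_RANGE_ROWS relates to the other two columns:

```sql
-- Read the histogram of the primary key statistics as a result set.
SELECT h.step_number,
       h.range_high_key,        -- RANGE_HI_KEY
       h.range_rows,            -- RANGE_ROWS
       h.equal_rows,            -- EQ_ROWS
       h.distinct_range_rows,   -- DISTINCT_RANGE_ROWS
       h.average_range_rows,    -- AVG_RANGE_ROWS
       -- AVG_RANGE_ROWS is RANGE_ROWS / DISTINCT_RANGE_ROWS,
       -- and is reported as 1 when the step holds no distinct values.
       CASE WHEN h.distinct_range_rows > 0
            THEN h.range_rows / h.distinct_range_rows
            ELSE 1
       END AS recomputed_avg
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID(N'dbo.Address')
  AND s.name = N'PK_Address_AddressID'
ORDER BY h.step_number;
```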

Creating statistics: multiple columns
CREATE NONCLUSTERED INDEX [idx_city_postalcode] ON [dbo].[Address] ([City] ASC, [PostalCode] ASC);
DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode);
The histogram is created for the first column only.
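Although the histogram covers only the first column, the density vector has one row per column prefix. A rough sketch of how those densities could be recomputed by hand on the dbo.Address sample table; the column aliases are illustrative, and the real density vector may contain an extra row that also includes the clustering key:

```sql
-- Density for each prefix of the index key: 1 / number of distinct values.
SELECT 1.0 / NULLIF(COUNT(DISTINCT City), 0) AS density_city,
       1.0 / NULLIF((SELECT COUNT(*)
                     FROM (SELECT DISTINCT City, PostalCode
                           FROM dbo.Address) AS d), 0) AS density_city_postalcode
FROM dbo.Address;

-- Compare with what the statistics object stores.
DBCC SHOW_STATISTICS ('[dbo].[Address]', idx_city_postalcode) WITH DENSITY_VECTOR;
```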

Statistics: updating
SET options: Auto Update Statistics, Auto Update Statistics Asynchronously
Auto (sampling 20%)
Rules (2014 and earlier; the 20% applies regardless of the number of rows in the table):
- More than 500 rows: updated after 500 + 20% of the rows are modified
- 500 rows or fewer: updated after 500 modifications
SQL Server 2008 R2 SP1 to SQL Server 2014: Trace Flag 2371 (global, instance level) makes the formula for large tables (more than 25 000 rows) more dynamic.
SQL Server 2016 uses this improved algorithm automatically. With this change, statistics on large tables are updated more often.
Manual: UPDATE STATISTICS / sp_updatestats / index rebuild operation
DMV: sys.dm_db_stats_properties
Updating statistics invalidates the cached plans that depend on them.
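The thresholds above can be compared with the actual modification counters. A sketch under stated assumptions: the classic threshold is the 500 + 20% rule from this slide, and the dynamic threshold uses the SQRT(1000 * rows) formula that Microsoft documents for Trace Flag 2371 and SQL Server 2016; the column aliases are made up for illustration.

```sql
-- For every statistics object on dbo.Address, show how many modifications
-- have accumulated and where the auto-update thresholds would lie.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter,
       500 + 0.20 * sp.rows   AS classic_threshold,  -- 2014 and earlier
       SQRT(1000.0 * sp.rows) AS dynamic_threshold   -- TF 2371 / 2016 (assumed formula)
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Address');
```

For a table of 1 000 000 rows the classic rule gives 200 500 modifications, while the square-root formula gives roughly 31 623, which is why large tables are refreshed more often under the newer rule.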

Statistics usage: Optimizer
SQL Server Query Optimizer: a cost-based optimizer
Cardinality estimation: the number of rows that will be returned
Selectivity: the percentage of input rows that satisfy a predicate
Cardinality Estimation Model (MS SQL 7): Independence (new in 2014), Uniformity, Containment, Inclusion (new in 2014)
TF 9481 enables the legacy CE behavior. TF 2312 enables the new CE behavior.
-- SQL Server 2014 compatibility level
-- New Cardinality Estimator
ALTER DATABASE [AdventureWorks2012] SET COMPATIBILITY_LEVEL = 120;
SELECT City, COUNT(*) FROM [Person].[Address] GROUP BY City OPTION (QUERYTRACEON 9481);
Trace Flag 9481 reverts query compilation and execution to the pre-SQL Server 2014 legacy CE behavior for a specific statement.
Trace Flag 2312 enables the new SQL Server 2014 CE for a specific query compilation and execution.
Independence: Data distributions on different columns are independent unless correlation information is available.
Uniformity: Within each statistics object histogram step, distinct values are evenly spread and each value has the same frequency.
Containment: If something is being searched for, it is assumed that it actually exists. For a join predicate involving an equijoin of two tables, it is assumed that distinct join column values from one side of the join will exist on the other side of the join. In addition, the smaller range of distinct values is assumed to be contained in the larger range.
Inclusion: For filter predicates involving a column-equal-constant expression, the constant is assumed to actually exist for the associated column. If a corresponding histogram step is non-empty, one of the step's distinct values is assumed to match the value from the predicate.
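On later versions the legacy CE can be requested without trace flags. A short sketch, assuming SQL Server 2016 SP1 or later for the query hint and SQL Server 2016 or later for the database scoped configuration; unlike QUERYTRACEON, the USE HINT form does not require sysadmin rights:

```sql
-- Per-query: compile this statement with the legacy cardinality estimator.
SELECT City, COUNT(*) AS cnt
FROM Person.Address
GROUP BY City
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

-- Per-database: use the legacy estimator for all queries in the current
-- database while keeping the newer compatibility level.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
```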

CE: underestimating and overestimating
Underestimating rows can lead to:
- Memory spills to disk, for example where not enough memory was requested for sort or hash operations
- Selection of a serial plan when parallelism would have been more optimal
- Inappropriate join strategies
- Inefficient index selection and navigation strategies
Conversely, overestimating rows can lead to:
- Selection of a parallel plan when a serial plan might be more optimal
- Inappropriate join strategy selection
- Inefficient index navigation strategies (scan versus seek)
- Inflated memory grants
- Wasted memory and unnecessarily throttled concurrency

Sample 1: constant "=" (RANGE_)
SELECT AddressID, AddressLine1, AddressLine2, City FROM [dbo].[Address] WHERE [StateProvinceID] = 17;
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;

Sample 1a: constant "=" (not RANGE_)
SELECT AddressID, AddressLine1, AddressLine2, City FROM [dbo].[Address] WHERE City = 'Alexandria';
DBCC SHOW_STATISTICS ('[dbo].[Address]', [idx_city_postalcode]) WITH HISTOGRAM;

Sample 1b: constant "between"
SELECT AddressID, AddressLine1, AddressLine2, City FROM [dbo].[Address] WHERE [StateProvinceID] BETWEEN 53 AND 57;
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH HISTOGRAM;
Estimate from the histogram steps: 0 + 412 + 16 + 16 + 0 + 57 = 501

Sample 2: local variable "="
DECLARE @id INT = 17;
SELECT AddressID, AddressLine1, AddressLine2, City FROM [dbo].[Address] WHERE [StateProvinceID] = @id;
SELECT 1./COUNT(DISTINCT StateProvinceID) AS [All density],
       (1./COUNT(DISTINCT StateProvinceID)) * COUNT(*) AS Estimate
FROM [dbo].[Address];
-- Result: All density = 0.01351351351, Estimate = 265.05405398514
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH DENSITY_VECTOR;
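With a local variable the optimizer does not know the value at compile time, so it falls back to density multiplied by the row count, as computed above. One common workaround, shown here only as a sketch and not as part of the original demo, is OPTION (RECOMPILE), which lets the optimizer see the current value of the variable and use the histogram instead:

```sql
DECLARE @id INT = 17;

-- The statement is compiled on each execution with the actual value of @id,
-- so the estimate comes from the histogram step rather than "All density".
SELECT AddressID, AddressLine1, AddressLine2, City
FROM [dbo].[Address]
WHERE [StateProvinceID] = @id
OPTION (RECOMPILE);
```

The trade-off is a compilation on every execution, so this fits occasional queries better than very frequent ones.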

Sample 3: local variable "<" or ">"
DELETE TOP (5) PERCENT FROM [dbo].[Address];
DECLARE @id INT = 3;
SELECT AddressID, AddressLine1, AddressLine2, City FROM [dbo].[Address] WHERE [StateProvinceID] < @id;
DBCC SHOW_STATISTICS ('[dbo].[Address]', IDX_StateProvinceID) WITH STAT_HEADER;
SELECT (COUNT(*)/100.0) * 30.0 AS real_30_pcnt,
       (19614/100.0) * 30.0 AS stat_30_pcnt
FROM [dbo].[Address];

Sample 4: optimization of "like %"
WITH rs AS (
    SELECT AddressID, AddressLine1, AddressLine2, City, StateProvinceID
    FROM [dbo].[Address]
    WHERE AddressLine1 LIKE '%Monti%')
SELECT DISTINCT rs.City, p.StateProvinceCode, p.Name
FROM rs
INNER JOIN [Person].[StateProvince] p ON rs.StateProvinceID = p.StateProvinceID;
-- execute query
DROP STATISTICS dbo.address._WA_Sys_00000006_2E3BD7D3;
USE [master]
GO
ALTER DATABASE [AdventureWorks2014] SET AUTO_UPDATE_STATISTICS OFF;
ALTER DATABASE [AdventureWorks2014] SET AUTO_CREATE_STATISTICS OFF;
-- execute the same query

Sample 4: optimization of "like %" (results)
Estimated number of rows, with statistics: 1.98121
Estimated number of rows, without statistics: 1765.26
Actual number of rows: 8

Sample 5: computed columns
SELECT * FROM Sales.SalesOrderDetail WHERE UnitPrice * OrderQty > 30000;
SELECT (COUNT(*)/100.0) * 30 AS _30_percent FROM Sales.SalesOrderDetail;

Sample 5: computed columns (with a computed column added)
ALTER TABLE Sales.SalesOrderDetail ADD total AS UnitPrice * OrderQty;
SELECT * FROM Sales.SalesOrderDetail WHERE UnitPrice * OrderQty > 30000;
ALTER TABLE Sales.SalesOrderDetail DROP COLUMN total;
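To see whether the optimizer actually built statistics on the computed column, the catalog views can be queried before the column is dropped. A small sketch, assuming the demo is run on Sales.SalesOrderDetail as above; the table may list other computed columns as well:

```sql
-- List statistics that cover computed columns of the table.
SELECT s.name        AS stats_name,
       s.auto_created,
       cc.name       AS column_name,
       cc.definition AS column_expression
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.computed_columns AS cc
  ON cc.object_id = sc.object_id AND cc.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'Sales.SalesOrderDetail');
```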

Sample 6: two conditions
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address]
WHERE City = 'Melbourne' AND StateProvinceID = 77;
SELECT COUNT(*) FROM [Person].[Address]; -- 19614 rows

Sample 6: two conditions (estimates)
-- 2012: the predicates are treated as independent, so the selectivities are multiplied
SELECT ((110./19614) * (901./19614)) * 19614 AS estimate;
-- 2014: the selectivity of the less selective predicate is softened with a square root
SELECT ((110./19614) * SQRT(901./19614)) * 19614 AS estimate;
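The two formulas above can be generalised to any list of predicates. The sketch below is an illustration, not the optimizer's actual code: the table variable and its column names are hypothetical, the row counts (110 and 901 out of 19614) are the ones from this sample, and the 1/2, 1/4, 1/8 exponents follow Microsoft's published description of the 2014 estimator, which applies them to at most the four most selective predicates.

```sql
DECLARE @total float = 19614;

-- Hypothetical helper table: one row per predicate with its matching row count.
DECLARE @preds TABLE (predicate nvarchar(100), matching_rows float);
INSERT INTO @preds (predicate, matching_rows)
VALUES (N'City = ''Melbourne''', 110),
       (N'StateProvinceID = 77', 901);

WITH ranked AS (
    SELECT matching_rows / @total AS sel,
           -- position 0 is the most selective predicate
           ROW_NUMBER() OVER (ORDER BY matching_rows) - 1 AS pos
    FROM @preds
)
SELECT
    -- 2012 model: product of all selectivities
    @total * EXP(SUM(LOG(sel))) AS independence_estimate,               -- about 5.05
    -- 2014 model: sel0 * sel1^(1/2) * sel2^(1/4) * ...
    @total * EXP(SUM(LOG(sel) / POWER(CAST(2 AS float), pos))) AS backoff_estimate  -- about 23.58
FROM ranked;
```

The product is computed as EXP(SUM(LOG(...))) because T-SQL has no built-in product aggregate.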

Sample 7: filtered statistics
CREATE STATISTICS Victoria ON Person.Address(City) WHERE StateProvinceID = 77;
DBCC SHOW_STATISTICS ('Person.Address', Victoria);

Sample 7: filtered statistics (usage)
DBCC FREEPROCCACHE
GO
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM [Person].[Address]
WHERE City = 'Melbourne' AND StateProvinceID = 77;
Use case: partitioned tables
Cons: updated by the same rules as normal statistics
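Because filtered statistics follow the same update rules as regular ones, it can help to check them and refresh them by hand. A minimal sketch, assuming the Victoria statistics object created above:

```sql
-- Inspect filtered statistics on the table and how stale they might be.
SELECT s.name AS stats_name,
       s.filter_definition,
       sp.rows,                 -- rows matching the filter when last updated
       sp.unfiltered_rows,      -- rows in the whole table at that time
       sp.modification_counter,
       sp.last_updated
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'Person.Address')
  AND s.has_filter = 1;

-- Refresh the filtered statistics manually.
UPDATE STATISTICS Person.Address Victoria WITH FULLSCAN;
```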

Undocumented trace flag (8666)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
GO
DBCC TRACEON (8666);
WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' AS p)
SELECT qt.text AS SQLCommand,
       qp.query_plan,
       StatsUsed.XMLCol.value('@FieldValue','NVarChar(500)') AS StatsName
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
CROSS APPLY query_plan.nodes('//p:Field[@FieldName="wszStatName"]') StatsUsed(XMLCol)
WHERE qt.text LIKE '%SELECT%' AND qt.text LIKE '%addressline1%';
DBCC TRACEOFF (8666);

Sample 9: auto-increment column (2012 vs 2014)
-- table creation
CREATE TABLE dbo.Address_1 (
    [AddressID] [int] IDENTITY NOT NULL PRIMARY KEY,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL);
-- data loading (~19 000 records)
INSERT INTO dbo.Address_1 ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM Person.Address;
-- creating / checking statistics
SELECT * FROM dbo.Address_1 WHERE AddressID = 1;
-- additional data loading (500 records)
INSERT INTO dbo.Address_1 ([AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode])
SELECT TOP 500 [AddressLine1], [AddressLine2], [City], [StateProvinceID], [PostalCode]
FROM Person.Address;
-- SQL query
SELECT * FROM dbo.Address_1 WHERE AddressID BETWEEN 19800 AND 19950;

Sample 9: auto-increment column (2012 vs 2014), statistics
-- 2012
DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1B01ECE8F5]);
-- 2014
DBCC SHOW_STATISTICS ('dbo.Address_1', [PK__Address___091C2A1BD70E253B]);

Sample 9: auto-increment column (2012 vs 2014), estimates
-- 2012
DBCC FREEPROCCACHE
GO
SELECT * FROM dbo.Address_1 WHERE AddressID BETWEEN 19800 AND 19950;
-- 2014 (same query; compare the estimated number of rows)

Statistics for beginners: summary
- CE model assumptions differ from the real world
- Statistics are approximate
- Performance depends on up-to-date statistics
- Statistics on non-indexed columns make sense
Q&A

Thanks! azrobok@gmail.com