Indexing
Joe Chang, Jchang6 @ yahoo . com, qdpma.com
About Joe
- SQL Server consultant since 1999
- Query optimizer execution plan cost formulas (2002)
- True cost structure of SQL plan operations (2003?)
- Database with distribution statistics only, no data (2004)
- Decoding statblob/stats_stream: write your own stats
- Disk IO cost structure
- Tools for system monitoring and execution plan analysis

Freelance consultant since 1999, specializing in SQL Server performance. Reverse engineered the SQL Server query optimizer cost formulas (2001). Built a database with no data, but with the data distribution statistics from a production system. Automated index and execution plan cross-reference analysis on www.qdpma.com (ExecStats).

Indexing is one of the foundations of databases and is taught in beginner-level books. Unfortunately, much of what is taught is not entirely correct or on a solid basis, so it is important to learn what is true. What is usually taught is that selectivity is most important. In fact, grouping is just as important.

See http://www.qdpma.com/
Download: http://www.qdpma.com/ExecStatsZip.html
Blog: http://sqlblog.com/blogs/joe_chang/default.aspx
Indexing
- Fundamental topic covered in most introductions to SQL
- The usual rule: the index key must be highly selective, or it won't be used
- But this is not entirely correct
TPC-C database schema
Examples are based on TPC-C tables.
- Warehouse: w_id
- District: d_w_id, d_id
- Customer: c_w_id, c_d_id, c_id
- Orders: o_w_id, o_d_id, o_c_id, o_id
- History: h_c_w_id, h_date, h_c_d_id, h_c_id, h_amount
- Order_line: ol_w_id, ol_d_id, ol_c_id, ol_o_id, ol_id
Cardinalities shown in the diagram: Warehouse to District 1:10, District to Customer 1:3000, Orders to Order_line 1:10.
Nonclustered index details
CREATE CLUSTERED INDEX CX ON Table (Col1, Col2)
CREATE INDEX IX ON Table (Col3, Col4) INCLUDE (C5)
- Explicit keys: Col3, Col4
- Implicit keys: Col1, Col2
- Full key: Col3, Col4, Col1, Col2
- If one or more clustered index key columns are already part of the explicit nonclustered index key, then only the remaining cluster key columns are added implicitly
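
The explicit and implicit keys described above can be sketched in T-SQL. This is a minimal illustration, not taken from the slides: the table name dbo.DemoTable, the index names, and the data types are assumptions; only the column names come from the slide.

```sql
-- Hypothetical demo table; the name and data types are illustrative only.
CREATE TABLE dbo.DemoTable (
    Col1 int NOT NULL,
    Col2 int NOT NULL,
    Col3 int NOT NULL,
    Col4 int NOT NULL,
    C5   varchar(50) NULL
);

-- Cluster key: Col1, Col2
CREATE CLUSTERED INDEX CX_DemoTable ON dbo.DemoTable (Col1, Col2);

-- Explicit keys: Col3, Col4. The cluster key columns Col1, Col2 are carried
-- implicitly, so the full key is Col3, Col4, Col1, Col2.
CREATE INDEX IX_DemoTable_Col3_Col4
ON dbo.DemoTable (Col3, Col4) INCLUDE (C5);

-- Col1 is already an explicit key here, so only Col2 is added implicitly.
CREATE INDEX IX_DemoTable_Col3_Col1
ON dbo.DemoTable (Col3, Col1);

-- Every column referenced is available in IX_DemoTable_Col3_Col4,
-- so this query can be answered without a key lookup.
SELECT Col3, Col4, Col1, Col2, C5
FROM dbo.DemoTable
WHERE Col3 = 1;
```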
Index Seek Examples
- Clustered index seek
- Nonclustered index seek, no key lookup
- Nonclustered index seek plus key lookup for columns not in the nonclustered index
- Table scan, when there is no suitable index (or when forced with a hint)
Index Selectivity – Why?
- Plan cost 9.767, 3000 rows
- Plan cost 91.17, 30000 rows
- Plan cost 102.7 (IO: 101), 136366 pages, 1090MB
The IO portion of the plan cost is approximately 1/320 per key lookup row (random) and 1/1350 per page in a table scan. (See the Execution Plan Cost Formulas slide deck.) The ratio of the cost of a key lookup row to the cost of a table scan page is 4.2:1; with the CPU portion included, it is 3.5:1.
Plan Cost
- Key lookup: IO portion approximately 1/320 per row (random)
- Table scan: IO portion approximately 1/1350 per page
- Ratio of key lookup row to table scan page: 4.2:1
- With IO + CPU portions: 3.5:1
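
The arithmetic behind these ratios can be worked through directly. The sketch below uses only the approximate constants quoted on the slides (1/320 per key lookup row, 1/1350 per scanned page) and the 136366-page table from the earlier example. The variable names are made up for illustration, and the real optimizer applies adjustments that this simple model ignores.

```sql
-- Approximate IO cost constants quoted on the slides (not exact optimizer values).
DECLARE @KeyLookupCostPerRow float = 1.0 / 320;
DECLARE @ScanCostPerPage     float = 1.0 / 1350;
DECLARE @TablePages          int   = 136366;  -- table size from the earlier example

SELECT
    3000  * @KeyLookupCostPerRow   AS KeyLookupIO_3000Rows,   -- about 9.4
    30000 * @KeyLookupCostPerRow   AS KeyLookupIO_30000Rows,  -- about 93.8
    @TablePages * @ScanCostPerPage AS TableScanIO,            -- about 101
    -- Row count at which the key lookup IO estimate equals the scan IO estimate
    @TablePages * @ScanCostPerPage / @KeyLookupCostPerRow AS BreakEvenRows;  -- about 32,300
```

Under this simplified model, a key lookup plan that touches more than roughly 32,000 rows is estimated to cost more IO than scanning the whole table, which is why the optimizer favors selective index keys.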
Loop Joins – similar to key lookup
- Customer2, clustered on c_id only (customers clustered on an identity-like value): plan cost 91.10, 30000 rows
- Customer3, clustered on warehouse, same as orders2: plan cost 13.53, 30000 rows
Index Important Points
- Selectivity is important
- But so is locality (grouping rows into common pages)
  - Applicable when multiple tables have common grouping column(s)
  - Impacts the choice of primary key and/or cluster key
- Key lookup costs (IO portion) are roughly 1/320 per row (with adjustments) in a large table, unless the query optimizer knows the rows are in a limited number of pages
Big Picture
The execution plan links all the elements of performance.
- Diagram elements: SQL; tables (natural keys); indexes; statistics and compile parameters; query optimizer; compile; row estimate propagation errors; recompile; temp table / table variable; execution plan; storage engine; hardware; DOP; memory; parallel plans; index and statistics maintenance; API server cursors (open, prepare, execute, close?); SET NOCOUNT; information messages
- Tables and SQL combined implement business logic
- Natural keys with unique indexes, not SQL
- Index and statistics maintenance policy
- One piece of logic may need more than one execution plan?
- Compile cost versus execution cost?
- Plan cache bloat?
- Index tuning alone has limited value
- Over-indexing can cause problems as well
- The client app is also important
Indexing Objectives
- No such thing as perfect
- Indexing is trade-offs: what is more important?
  - Insert/update/delete performance
  - Select performance (and compile overhead)
  - Maintenance?
- Also need to consider statistics updates and compile parameters
Topics
- Primary key, cluster key
- Nonclustered indexes
  - Included columns
  - Filtered index
- Columnstore: not covered here, see slide decks by Jimmy May
- Special, also not covered here: XML, spatial, hash (memory-optimized tables)
- Related: partitioning
  - Partition to distribute or concentrate
Identity, Primary Key, Cluster Key
These are three different things.
- Primary key: uniquely identifies a row/record
- Identity/row GUID: mechanism for generating a key
  - Identity is useful, but should not always be used
  - GUID: use only when there is absolutely no alternative
  - Consider a natural key for dimension tables
- Cluster key: physical organization of the table
  - Nonclustered indexes implicitly incorporate the cluster key columns
Clustered Index
- Identity or other sequentially increasing value
  - Always inserted into the last page
  - In theory, no fragmentation in the clustered index (or in a nonclustered index that has such a key)
  - B-tree will become unbalanced
- Grouping
  - Good for multi-row SELECT queries
  - Gets fragmented with inserts
Common Grouping Option
- Table A: a_id
- Table B: cluster key a_id, b_id
- Table C: a_id, b_id, c_id; cluster key a_id, b_id, c_id; unique nonclustered index on c_id
- Table D: d_id (unique); cluster key c_id, d_id
If the cluster/primary key is the parent table key plus a local key, does the local key need to be an identity?
Example: Orders – LineItem. The LineItem table key is OrderId + line item sequence.
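
The Orders – LineItem example might look like the following sketch. The table definitions, column names other than OrderId, and data types are assumptions made for illustration; the point is that the child table's local key is a simple per-order sequence rather than an identity.

```sql
-- Hypothetical parent/child tables; names and types are assumptions.
CREATE TABLE dbo.Orders (
    OrderId    int      NOT NULL,
    CustomerId int      NOT NULL,
    OrderDate  datetime NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);

CREATE TABLE dbo.LineItem (
    OrderId  int      NOT NULL,   -- parent table key
    LineSeq  smallint NOT NULL,   -- local key: 1, 2, 3, ... within each order
    ItemId   int      NOT NULL,
    Quantity int      NOT NULL,
    CONSTRAINT PK_LineItem PRIMARY KEY CLUSTERED (OrderId, LineSeq),
    CONSTRAINT FK_LineItem_Orders FOREIGN KEY (OrderId)
        REFERENCES dbo.Orders (OrderId)
);

-- All line items of one order are adjacent in the clustered index,
-- so this multi-row query touches few pages.
SELECT LineSeq, ItemId, Quantity
FROM dbo.LineItem
WHERE OrderId = 42;
```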
Nonclustered Index
- Key columns
- Optional
  - Include columns
  - Filter condition
  - WITH options
- WITH options
  - Row/page compression
  - Fill factor
- Wish list, it would be nice if we could:
  - Specify different fill factors for leaf and upper levels
  - Rebuild only the upper levels, or only the leaf level
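
As a sketch of how these parts fit into one statement, the example below combines key columns, include columns, a filter condition, and WITH options. The table dbo.OrderDemo, its columns, and the chosen fill factor are hypothetical. Note that page compression is an edition-dependent feature in older SQL Server versions.

```sql
-- Hypothetical table; names, types, and option values are illustrative only.
CREATE TABLE dbo.OrderDemo (
    OrderId    int      NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId int      NOT NULL,
    Status     tinyint  NOT NULL,
    OrderDate  datetime NOT NULL,
    Amount     money    NOT NULL
);

CREATE NONCLUSTERED INDEX IX_OrderDemo_Customer_Open
ON dbo.OrderDemo (CustomerId, OrderDate)          -- key columns
INCLUDE (Amount)                                  -- include columns
WHERE Status = 1                                  -- filter condition
WITH (FILLFACTOR = 90, DATA_COMPRESSION = PAGE);  -- WITH options
```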
Index Write Overhead
- Insert: write overhead always
- Update: write overhead only when a modified column is part of the index
  - The index row moves if a key column is updated
- Delete: write overhead always
Takeaway: pay attention to insert/update/delete (IUD) frequency. If updates are frequent, which columns are updated?
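
One way to get a rough view of write activity per index is the index usage DMV. The query below is a sketch rather than a complete analysis: the counters in sys.dm_db_index_usage_stats reset when the instance restarts, and user_updates counts insert, update, and delete operations together.

```sql
-- Reads versus writes per index in the current database (since last restart).
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name                   AS index_name,
    ISNULL(s.user_seeks, 0) + ISNULL(s.user_scans, 0)
        + ISNULL(s.user_lookups, 0) AS reads,
    ISNULL(s.user_updates, 0)       AS writes
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY writes DESC, reads;
```

Indexes near the top with many writes and few reads are candidates for a closer look.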
Nonclustered Index Key Strategy
SELECT xxx
FROM ...
WHERE <selective search arguments>
AND <not so selective or non-selective SARGs>
(GROUP BY xxx)
(ORDER BY xxx)
- The index key should have the important selective SARGs, and possibly either the GROUP BY or the ORDER BY columns
- Less important SARGs can go in the INCLUDE list
Include List
- Including all (selected) columns negates the need for a key lookup, a major cost in the execution plan for multi-row queries
- Considerations
  - Fat include list: almost another copy of the table?
  - Update implications? Leave frequently updated columns out of the include list?
  - More work when the updated column is in the key, less when it is in the include list
- Options
  - If a smaller include list can minimize the need for key lookups, this is good enough
Indexing Scenario
- Query has a moderately selective equality SARG and several additional WHERE clause conditions that are not amenable to an index seek but cumulatively reduce rows
- Sometimes row reduction occurs after a join
- Many columns are needed (impractical to include all)
Option:
- Index key on the important equality SARG
- Other arguments in the INCLUDE list
- Rely on key lookup for the remaining columns
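
A minimal sketch of this option, assuming a hypothetical table dbo.CustomerDemo (all names, types, and predicate values are made up for illustration). The equality SARG is the index key, the columns used by the other predicates are included so those predicates can be evaluated in the index, and the remaining output columns come from a key lookup on the rows that survive.

```sql
-- Hypothetical table; names and types are assumptions.
CREATE TABLE dbo.CustomerDemo (
    CustomerId    int           NOT NULL PRIMARY KEY CLUSTERED,
    RegionId      int           NOT NULL,
    Balance       money         NOT NULL,
    LastOrderDate datetime      NOT NULL,
    Name          nvarchar(100) NOT NULL,
    Address       nvarchar(200) NULL,
    Phone         varchar(20)   NULL
);

-- Key on the equality SARG; columns used by the other WHERE conditions
-- go in the INCLUDE list.
CREATE INDEX IX_CustomerDemo_Region
ON dbo.CustomerDemo (RegionId)
INCLUDE (Balance, LastOrderDate);

SELECT CustomerId, Name, Address, Phone, Balance
FROM dbo.CustomerDemo
WHERE RegionId = 7                    -- moderately selective equality SARG
  AND Balance * 1.1 > 1000            -- not seekable, but reduces rows
  AND YEAR(LastOrderDate) = 2014;     -- not seekable, but reduces rows
```

Name, Address, and Phone are deliberately left out of the index; they are fetched by key lookup only for rows that pass all three conditions.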
B-tree
[Diagram: B-tree with a root page and intermediate levels labeled IL 2 and IL 3]
Index depth: INDEXPROPERTY or sys.dm_db_index_physical_stats
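
Both ways of reading the index depth can be sketched as follows. The table name dbo.Customer and the index name IX_Customer_Name are placeholders; substitute objects that exist in your database.

```sql
-- Depth of one index via INDEXPROPERTY (returns NULL if the object is not found).
SELECT INDEXPROPERTY(OBJECT_ID(N'dbo.Customer'),
                     N'IX_Customer_Name',
                     'IndexDepth') AS index_depth;

-- Depth and page count of every index on the table via the DMF.
-- LIMITED mode is the cheapest scan mode and still reports index_depth.
SELECT i.name AS index_name, ps.index_depth, ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Customer'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;
```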
Temporary Indexing
- Permanent indexes for common operations
- For maintenance or upgrade operations:
  - Drop/disable indexes -> operation -> recreate
  - Or create index -> operation -> drop
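
Both patterns can be sketched in T-SQL. The table dbo.MaintDemo and the index names are hypothetical. Only nonclustered indexes should be disabled this way, since disabling a clustered index makes the table inaccessible.

```sql
-- Hypothetical table and index; names are illustrative only.
CREATE TABLE dbo.MaintDemo (
    Id        int      NOT NULL PRIMARY KEY CLUSTERED,
    GroupId   int      NOT NULL,
    UpdatedAt datetime NOT NULL
);
CREATE INDEX IX_MaintDemo_GroupId ON dbo.MaintDemo (GroupId);

-- Pattern 1: disable -> operation -> rebuild
ALTER INDEX IX_MaintDemo_GroupId ON dbo.MaintDemo DISABLE;
-- ... run the maintenance or upgrade operation here ...
ALTER INDEX IX_MaintDemo_GroupId ON dbo.MaintDemo REBUILD;  -- rebuild re-enables it

-- Pattern 2: create -> operation -> drop
CREATE INDEX IX_Temp_MaintDemo_UpdatedAt ON dbo.MaintDemo (UpdatedAt);
-- ... run the operation that benefits from the temporary index ...
DROP INDEX IX_Temp_MaintDemo_UpdatedAt ON dbo.MaintDemo;
```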
Partitioning
- Can be used to concentrate active rows. Example: date (year, month, day, etc.)
- Can be used to distribute active rows over all partitions. Example: GUID, hash, etc.
- Partitioning trick: the partition key is not the clustered index lead key
  - Example: cluster key OrderId, DateKey (partitioned on date)
  - Query with OrderId only: index seek on all partitions
  - Query on date only: scan of a single partition
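
The partitioning trick might be set up as in the sketch below. The partition function, scheme, table name, integer date key format, and boundary values are all assumptions made for illustration.

```sql
-- Monthly partitions on an integer date key such as 20140215 (hypothetical).
CREATE PARTITION FUNCTION pfDateKey (int)
AS RANGE RIGHT FOR VALUES (20140101, 20140201, 20140301);

CREATE PARTITION SCHEME psDateKey
AS PARTITION pfDateKey ALL TO ([PRIMARY]);

-- Cluster key leads with OrderId; the table is partitioned on DateKey.
CREATE TABLE dbo.OrdersPartitioned (
    OrderId int   NOT NULL,
    DateKey int   NOT NULL,
    Amount  money NOT NULL,
    CONSTRAINT PK_OrdersPartitioned PRIMARY KEY CLUSTERED (OrderId, DateKey)
) ON psDateKey (DateKey);

-- OrderId only: no partition can be ruled out, so each partition gets an index seek.
SELECT OrderId, DateKey, Amount
FROM dbo.OrdersPartitioned
WHERE OrderId = 42;

-- Date only: partition elimination leaves a scan of a single partition.
SELECT OrderId, DateKey, Amount
FROM dbo.OrdersPartitioned
WHERE DateKey >= 20140201 AND DateKey < 20140301;
```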
Summary
- Both selectivity and grouping/locality are important
  - They affect the key lookup cost; the alternative is a table scan
- Indexing is trade-offs; there is no one rule for all cases
  - Consider insert/update/read and maintenance
- Missing Indexes DMV: not intelligent advice!
- Extremely high performance requires verification
Related
- Statistics are recomputed at the first 6 rows modified, at the first 500 rows, then every 20%
- Newer versions of SQL Server auto-recompute at a lower threshold (than 20%) for very large tables
- The default statistics sample is problematic with grouping
- What are the compile parameter values on the first execute after a statistics recompute?
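
To make the thresholds concrete, the sketch below compares the 500 rows + 20% rule with the lower threshold used by newer versions. The SQRT(1000 * rows) formula is not from the slides; it is the dynamic threshold as I understand it from Microsoft's documentation for compatibility level 130 and later, so treat it as an assumption. The table size and the table name dbo.Customer are hypothetical.

```sql
DECLARE @Rows bigint = 100000000;  -- hypothetical table size: 100 million rows

SELECT
    500 + 0.20 * @Rows   AS ClassicThreshold,  -- 20,000,500 modifications
    SQRT(1000.0 * @Rows) AS DynamicThreshold;  -- roughly 316,000 modifications

-- Current modification counts and sample sizes for one table's statistics.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Customer');
```

Comparing rows_sampled with rows also shows how small the default sample is on a large table, which is where grouped data can skew the histogram.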