Understanding Faster-SQL with Indexes

Anjum Niaz Systems Limited

Agenda
Analyzing Performance Problems
Causes
Overview of Query Processing
Primary Key vs. Clustering vs. Nonclustering
SQL: set based expression / serial execution
Intro to execution plans
Table & index access methods
Table storage formats – Heaps & Clustered Indexes
Nonclustered Indexes (on Heaps vs on Clustered Indexes)
Covering Indexes
Join operators
Nested Loops Join, Merge Join & Hash Join
Joins and order of joins
Query tips used with Indexes
Review
Discussion

The challenges and problems of application
Higher speed means more business.
Less resource consumption means more concurrent users.
Problems:
  A constantly-changing environment

What affects performance?
Hardware: disk space, RAM, network, CPU
Operating System settings
Database Server parameter settings
Database Design
Indexes on Database Tables
SQL statement

Why SQL statements affect performance
[Figure: chart showing 90% and 60%; labels not preserved]
It is easy to write functional SQL.
It is harder to write efficient, high performing SQL.

Why is it so hard? Why don’t people tune their SQL?
Too busy now. I’ll do it later.
Isn’t that what the DB optimizer is for?
I’m a Java, not an SQL, programmer. I don’t know how.
Generated by Toplink/LINQ, not my business.
It works. I’ve got my data. I’m happy.

Overview of Query Processing
[Diagram: query processing components: web form / application front end; SQL interface; SQL; Security; Parser; Catalog; Relational Algebra (RA); Optimizer; Executable Plan (RA + algorithms); Concurrency; Plan Executor; Files, Indexes & Access Methods; Crash Recovery; Database, Indexes]

Detail of the top
[Diagram: SQL Query (SELECT …); Query Parser; Relational Algebra Expression (Query Tree); Query Optimizer; Plan Generator; Plan Cost Estimator; Catalog Manager; Query Tree + Algorithms (Plan); Plan Evaluator]

Parsing and Optimization
The Parser:
  Verifies that the SQL query is syntactically correct, that the tables and attributes exist, and that the user has the appropriate permissions.
  Translates the SQL query into a simple query tree (operators: relational algebra plus a few others).
The Optimizer:
  Generates other, equivalent query trees (in practice it builds these trees bottom-up).
  For each query tree generated:
    Selects algorithms for each operator (producing a query plan).
    Estimates the cost of the plan.
  Chooses the plan with the lowest cost (of the plans considered, which is not necessarily all possible plans).

Dynamic Programming
A naive approach to these 4 tasks could take forever. For medium-to-large queries there are millions of plans, and it can take a millisecond to compute each plan's cost, resulting in hours to optimize a query.
This problem was solved in 1979 [668] by Pat Selinger's IBM team using dynamic programming. The trick is to solve the problem bottom-up:
  First optimize all one-table subqueries.
  Then use those optimal plans to optimize all two-table subqueries.
  Use those results to optimize all three-table subqueries, etc.
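The bottom-up idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SQL Server's or System R's actual optimizer: the table sizes, the selectivities, the restriction to left-deep join orders, and the toy cost model (total rows of every intermediate result) are all hypothetical.

```python
from itertools import combinations, permutations

# Hypothetical table sizes and join selectivities (illustrative numbers only).
ROWS = {"A": 100_000, "B": 1_000, "C": 50, "D": 10_000}
SELECTIVITY = {
    frozenset(("A", "B")): 0.001,
    frozenset(("B", "C")): 0.02,
    frozenset(("C", "D")): 0.0001,
}

def cardinality(tables):
    """Estimated row count of joining `tables` (toy model)."""
    size = 1.0
    for name in sorted(tables):
        size *= ROWS[name]
    for pair, selectivity in SELECTIVITY.items():
        if pair <= tables:
            size *= selectivity
    return size

def plan_cost(order):
    """Toy cost of a left-deep join order: rows of every intermediate result."""
    return sum(cardinality(frozenset(order[:k])) for k in range(2, len(order) + 1))

def brute_force(names):
    """Cost every possible join order (n! plans)."""
    return min((plan_cost(order), order) for order in permutations(names))

def dynamic_programming(names):
    """Best k-table plans are built from the best (k-1)-table plans."""
    best = {frozenset([name]): (0.0, (name,)) for name in names}
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            tables = frozenset(subset)
            best[tables] = min(
                (best[tables - {last}][0] + cardinality(tables),
                 best[tables - {last}][1] + (last,))
                for last in subset
            )
    return best[frozenset(names)]

dp_cost, dp_order = dynamic_programming(list(ROWS))
bf_cost, bf_order = brute_force(list(ROWS))
print(dp_order, dp_cost)
```

Both functions find a plan of the same minimal cost, but the dynamic programming version only keeps one best plan per subset of tables instead of costing every permutation.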

Primary Key vs. Clustering vs. Nonclustering
A primary key is a logical concept, not a physical concept.
Indexes are physical concepts, not logical concepts.
There is a strong correlation between the logical concept of a key and the physical concept of an index.
By default, when you define relationships as part of table design, you will build indexes to support the joins / lookups.
By default, when you define a primary key, you will create a unique clustered index on the table.
  Unique is good, clustered isn’t always good.
When you define a clustered index, the server automatically appends the key column(s) (plus a unique identifier, if necessary) to the nonclustered indexes.

Intro to execution plans
Why is understanding execution plans important?
  Provides insight into query execution steps / processing efficiency.
  SQL Server occasionally makes mistakes.
Tune performance problems at the source (query efficiency).
  More effective than tuning hardware.
Other diagnostics only reveal consequences of poor query execution plans.
  High CPU is only a consequence of poorly tuned query plans.
  High disk utilisation is usually just a consequence of poorly tuned query plans.
  Waitstats reveal high resource utilization of poorly tuned query plans.
  Locking is usually just a consequence of poorly tuned query plans.
Tuning query execution plans is a VERY important tuning technique.

SQL: set based expression / serial execution
SQL syntax is based on “set based” expressions (no processing rules).
  The sample query returns CustID, OrderID & OrderDate for orders > 1st Jan 2005.
  No processing rules are included in the SQL statement, just the “set” of data to be returned.
Query execution is serial.
  SQL Server “compiles” the query into a series of sequential steps which are executed one after the other.
  Individual steps also have internal sequential processing (e.g. table scans are processed one page after another, and row by row within each page).
Execution plans display these steps.

Intro to execution plans – a simple example
Using our existing sample query.
Read execution plans from top right to bottom left (loosely).
Note: the plan starts at [SalesOrderHeader] even though [Customer] is named first in the query expression.
Icons:
  A right-angle blue arrow in the table access method icon represents a full scan (bad).
  A stepped blue arrow in the table access method icon represents an index “seek”, which could return either a single row or a range of rows.
Steps:
  1. The plan starts with a Clustered Index Scan of [SalesOrderHeader] (a full scan of the table, as no suitable index exists).
  2. “For Each” row returned from [SalesOrderHeader] (Nested Loops is execution plan terminology for “For Each”, but we’ll come back to this later):
  3. Find the row in [Customer] with matching CustomerID.
  4. Return rows formatted with CustomerID, SalesOrderID & OrderDate columns.

Execution plan node properties
Mouse over an execution plan node to reveal extra properties:
  Number of rows returned (shown in the Actual Execution Plan).
  Ordered / Unordered: displays whether the scan operation follows the page “chain” linked list (next / previous page # in page header) or follows the Index Allocation Map (IAM) page.
  Search predicate: a WHERE filter in this case, but it can also be a join filter.
  Name of the schema object accessed to physically process the query: typically an index, but possibly a heap structure.

“Heap” Table Storage
Table storage structure used when there is no clustered index on the table.
  Rarely used, as CIXs are added to PKs by default.
  Oracle uses heap storage by default (even with PKs).
No physical ordering of table rows (despite this display).
  Stored in order of insertion.
  New pages are added to the end of the “heap” as needed.
No B-tree index nodes (no “index”).
  With no B-tree, no lookup method is available unless other indexes are present. The only option is to scan the heap.
A scan cannot complete just because a row is located. Because the data is not ordered, the scan must continue through to the end of the table (heap).
Query execution example:
  Select FName, Lname, PhNo from Customers where Lname = 'Smith'
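As a rough illustration of why a heap scan cannot stop early, the following Python sketch models a heap as a list of pages in insertion order. The page layout and the sample rows are hypothetical; real storage engines work on 8 KB pages and allocation maps, which are not modelled here.

```python
def heap_scan(pages, predicate):
    """Scan every page of an unordered heap and collect rows matching `predicate`."""
    matches = []
    pages_read = 0
    for page in pages:            # pages are visited in allocation order
        pages_read += 1
        for row in page:          # rows within a page are checked one by one
            if predicate(row):
                matches.append(row)
        # No early exit: another match may sit on any later page.
    return matches, pages_read

# Hypothetical Customers heap: (FName, Lname, PhNo) rows in insertion order.
customers_heap = [
    [("Ann", "Smith", "555-0101"), ("Bob", "Jones", "555-0102")],
    [("Cat", "Brown", "555-0103"), ("Dan", "Khan", "555-0104")],
    [("Eve", "Smith", "555-0105"), ("Fay", "Malik", "555-0106")],
]

smiths, pages_read = heap_scan(customers_heap, lambda row: row[1] == "Smith")
print(smiths, pages_read)
```

The first 'Smith' is found on the first page, yet all three pages are read because a second 'Smith' sits on the last one.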

“Clustered Index” Table Storage
Table rows are stored in physical order of the clustered index key column/s (CustID in this case):
  create clustered index cix_CustID on customers (CustID)
The CIX also provides B-tree lookup pages, similar to a regular index.
  Table rows are stored in the leaf level of the clustered index, in physical order of the index column/s (key/s).
  B-tree index nodes are also created on the indexed columns.
  Each level contains entries based on the “cluster key” value from the first row in each page of the level below.
Default table storage format for tables WITH a primary key.
Can only have one CIX per table (as table storage can only be sorted one way).
Query execution example:
  Select FName, Lname, PhNo From Customers where CustID = 23

Non-Clustered Index (on Heap storage)
  Create nonclustered index ncix_lname on customers (lname)
The B-tree structure contains one leaf row for every row in the base table, sorted by index column values.
Each row contains a “RowID”, an 8 byte “pointer” to heap storage (the RowID actually contains File, Page & Slot data).
If the index does NOT cover the query, RowID lookups are performed to get values for the non-indexed columns.
Query execution example:
  Select Lname, Fname from Customers where Lname = 'Smith'

Non-Clustered Index (on Clustered Index storage)
  create nonclustered index ncix_lname on customers (lname)
The B-tree structure contains one leaf row for every row in the base table, sorted by index column values (the same as when the NCIX is on a heap).
Instead of a RowID, each row’s clustered index “key” value is stored in the index leaf level.
This means RowID lookups cannot be performed (as the RowID is not available). Instead, bookmark lookups are performed, which are considerably more expensive.
Query execution example:
  Select Lname, Fname From Customers Where Lname = 'Smith'

Non-Clustered Index (Covering Index)
  create nonclustered index ncix_lname on customers (lname, fname)
The NCIX now “covers” the query because all columns named in the query are present in the NCIX.
“Covering” indexes significantly reduce query workload by removing bookmark lookups (& RowID lookups).
Query execution example:
  Select Lname, Fname from Customers where Lname = 'Smith'
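The effect of covering can be observed with any engine that exposes its plan. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in, since it runs anywhere; SQLite is not SQL Server, its plan text differs, and the table definition here is an assumption based on the column names used in the slides.

```python
import sqlite3

def plan(conn, sql):
    """Return SQLite's plan text for `sql` (last column of EXPLAIN QUERY PLAN)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

QUERY = "SELECT lname, fname FROM customers WHERE lname = 'Smith'"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (custid INTEGER PRIMARY KEY, fname TEXT, lname TEXT, phno TEXT)"
)

# Index on lname only: fname must still be fetched from the table row.
conn.execute("CREATE INDEX ncix_lname ON customers (lname)")
plan_without_cover = plan(conn, QUERY)

# Index on (lname, fname): every column the query names is in the index.
conn.execute("DROP INDEX ncix_lname")
conn.execute("CREATE INDEX ncix_lname ON customers (lname, fname)")
plan_with_cover = plan(conn, QUERY)

print(plan_without_cover)
print(plan_with_cover)
```

In SQLite the second plan is reported as using a covering index, meaning the table rows themselves are never visited for this query.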

Join Operators (intra-table operators)
Nested Loop Join
  Original & only join operator until SQL Server 7.0.
  “For Each Row…” type operator.
  Takes output from one plan node & executes another operation “for each” output row from that plan node.
Merge Join
  Scans both sides of the join in parallel.
  Ideal for large range scans where joined columns are indexed.
  If joined columns aren’t indexed, requires an expensive sort operation prior to the merge.
Hash Join
  “Hashes” values of join column/s from one side of the join (usually the smaller side).
  “Probes” with the other side (usually the larger side).
  Hash is conceptually similar to building an index for every execution of a query.
    Hash buckets are not shared between executions.
  Worst case join operator.
  Useful for large scale range scans which occur infrequently.
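The three operators can be sketched as plain Python functions over lists of tuples. These are textbook versions for illustration only, not SQL Server's implementations: the nested loops version rescans the inner input instead of seeking an index, and the merge version sorts its inputs itself (the "expensive sort" an index would avoid). The sample rows are hypothetical.

```python
from collections import defaultdict

def nested_loops_join(outer, inner, outer_key, inner_key):
    """For each outer row, look through the inner input for matching rows."""
    return [(o, i) for o in outer for i in inner if o[outer_key] == i[inner_key]]

def merge_join(left, right, left_key, right_key):
    """Walk two inputs sorted on the join column in step with each other."""
    left = sorted(left, key=lambda row: row[left_key])     # sort needed if no index order
    right = sorted(right, key=lambda row: row[right_key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j_end = j                                      # find the run of equal right keys
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            while i < len(left) and left[i][left_key] == lk:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result

def hash_join(build, probe, build_key, probe_key):
    """Hash the (usually smaller) build side, then probe with the other side."""
    buckets = defaultdict(list)
    for b in build:
        buckets[b[build_key]].append(b)
    return [(b, p) for p in probe for b in buckets.get(p[probe_key], [])]

# Hypothetical rows: products are (ProductID, Class); details are (DetailID, ProductID).
products = [(1, "M"), (2, "M"), (3, "L")]
order_details = [(10, 2), (11, 1), (12, 2), (13, 4)]

nl = nested_loops_join(products, order_details, 0, 1)
mj = merge_join(products, order_details, 0, 1)
hj = hash_join(products, order_details, 0, 1)
print(sorted(nl))
```

All three return the same pairs; they differ only in how much work is done to find them, which is what the optimizer weighs when it picks an operator.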

Nested Loops Join (IX range scan + IX seek)
An index seek (3 page reads) is performed against SalesOrderDetail FOR EACH row found in the seek range.
If a large number of rows are involved in the execution plan node (not just in the results), this can be very costly.
  select p.Class, sod.ProductID
  from Production.Product p
  join Sales.SalesOrderDetail sod on p.ProductID = sod.ProductID
  where p.Class = 'M' and sod.SpecialOfferID = 2

Merge Join (IX range scan + IX range scan)
Values on either side of the ranges being merged are compared for (in)equality.
  select p.Class, sod.ProductID
  from Production.Product p
  join Sales.SalesOrderDetail sod on p.ProductID = sod.ProductID
  where p.Class = 'M' and sod.SpecialOfferID = 2

Nested Loops Join vs Merge Join
Nested Loops is often far less efficient than Merge.
  Setting up the Merge operator is more expensive than Nested Loops.
  But cost savings in terms of IO pay off very quickly (only hundreds of rows are required).
  Where only a few rows are involved (tens or perhaps hundreds) there’s little difference.
  Where many rows are involved (per node or resultset), the difference can be huge.
Common misconception: “Merge requires left hand columns in indexes to be the same”.
  Not always true. Note the index definitions for the previous example were:
    create index ix_Product_Class_ProductID on Production.Product (Class, ProductID)
    create index ix_SalesOrderDetail_SpecialOfferID_ProductID on Sales.SalesOrderDetail (SpecialOfferID, ProductID)
  ProductID is on the RIGHT hand side of both indexes to support the JOIN, whilst the left hand columns support the range seek.
Merge can be very costly if a SORT operation is required.
  It is important that well designed indexes exist first, to avoid large scale sorting within the plan.

Joining path matters
  select * from A, B      /* table A has 100,000 records */
  where A.key = B.key     /* table B has 1,000 records */
Path from table A to table B: we open table A and, for each row, use an index to search for matching rows in table B:
  Number of Operations (A→B) = 100,000 * RoundUp(LN(1000) / LN(2)) / 2 = 100,000 * 10 / 2 = 500,000
Path from table B to table A: we open table B and, for each row, use an index to search for matching rows in table A:
  Number of Operations (B→A) = 1000 * RoundUp(LN(100,000) / LN(2)) / 2 = 1000 * 17 / 2 = 8,500
The path B→A is around 59 times faster than A→B.
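The arithmetic can be checked with a short Python sketch. The function name is hypothetical; the formula is the one stated on the slide (outer rows multiplied by half the rounded-up base-2 logarithm of the inner row count), which is itself a rough estimate rather than a real optimizer cost.

```python
import math

def index_probe_operations(outer_rows, inner_rows):
    """Slide's estimate: one index search of the inner table per outer row."""
    return outer_rows * math.ceil(math.log2(inner_rows)) / 2

ROWS_A = 100_000
ROWS_B = 1_000

a_to_b = index_probe_operations(ROWS_A, ROWS_B)   # drive from A, probe B
b_to_a = index_probe_operations(ROWS_B, ROWS_A)   # drive from B, probe A
ratio = a_to_b / b_to_a

print(a_to_b, b_to_a, round(ratio, 1))
```

Driving from the smaller table wins because the row count of the driving table multiplies the cost, while the size of the probed table only contributes logarithmically.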

Optimizer Hints
A hint tells the optimizer to ignore its algorithm in part, for example:
  Order the joins in a certain way.
  Use a particular index.
  Use a type of join for a pair of tables.
Oracle has over 120 possible hints.
SQL Server

Review
SQL syntax is set based but execution is serial.
  No such thing as “set based execution”.
Execution plans describe the execution steps chosen by SQL Server.
  They only describe current behavior; they don’t describe solutions.
  Useful to verify expected behavior rather than to look for answers.
SQL Server can only optimize based on existing indexes.
  Most performance tuning solutions come from designing good indexes.
Other useful tools:
  Set statistics io on: shows table level workload (reads).
  Profiler: capture execution plans at run time.

Indexed Fields
Know your indexes and use them to your advantage.

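Indexed Fields
If you want the index used, don’t perform an operation on the field.
Replace
  SELECT * from A where SALARY
with
  where SALARY -1000
so that the arithmetic (here the -1000) is applied to the comparison value rather than to the SALARY field itself.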

Indexed Fields
The index will not be used when a function is applied to the field.
  SELECT * from A where substr(name, 1, 3) = 'Wil'
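Both tips (no arithmetic and no function on the indexed field) can be observed in a query plan. The sketch below uses SQLite via Python's sqlite3 module only as a convenient stand-in; other engines may behave differently, and the table definition, index names, and the rewritten predicates are assumptions made for illustration.

```python
import sqlite3

def plan(conn, sql):
    """Return SQLite's plan text for `sql` (last column of EXPLAIN QUERY PLAN)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept TEXT)")
conn.execute("CREATE INDEX ix_a_name ON A (name)")
conn.execute("CREATE INDEX ix_a_salary ON A (salary)")

# A function wrapped around the column hides it from the index.
function_plan = plan(conn, "SELECT * FROM A WHERE substr(name, 1, 3) = 'Wil'")
# An equivalent range on the bare column can use the index.
range_plan = plan(conn, "SELECT * FROM A WHERE name >= 'Wil' AND name < 'Wim'")

# Arithmetic on the column has the same effect as a function.
arithmetic_plan = plan(conn, "SELECT * FROM A WHERE salary + 1000 = 5000")
# Moving the arithmetic to the constant side leaves the column bare.
bare_column_plan = plan(conn, "SELECT * FROM A WHERE salary = 5000 - 1000")

for text in (function_plan, range_plan, arithmetic_plan, bare_column_plan):
    print(text)
```

In SQLite the first and third queries are reported as scans, while the second and fourth are reported as index searches.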

Indexed Fields
In the WHERE clause, avoid using:
  <> (not equal to)
  Like '%SA%'

Indexed Fields
Sometimes DO disable the index.
  SELECT * FROM A
  WHERE SALARY + 0 = '10000' AND DEPT = 'IT'

  SELECT * FROM A
  WHERE EMP_SEX = 'm'

Indexed Fields
Do not have the default value set to NULL.
If it is a number field and the lowest value is 0, then replace
  SELECT * FROM A WHERE NUMBER IS NOT NULL
with (normally faster response time)
  WHERE NUMBER > 0

Indexed Fields
Replace Outer Join with Union.
If both A.STATE and B.STATE have a unique index, replace
  SELECT A.CITY, B.CITY FROM A, B WHERE A.STATE = B.STATE
with
  UNION
  SELECT NULL, B.CITY FROM B WHERE NOT EXISTS (SELECT 'X' FROM A WHERE A.STATE = B.STATE)
The outer join on table B will do a full table scan.
Union can take advantage of the indexes.
The table driving path can also be changed.

EXISTS and IN Sub-query
Assume the relationship between tables A and B is one to many.
The following statements have the same results:
  SELECT * FROM A WHERE A.CITY IN (SELECT B.CITY FROM B)
  SELECT * FROM A WHERE EXISTS (SELECT CITY FROM B WHERE A.CITY = B.CITY)
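The equivalence of the two forms can be checked on a small data set. The sketch below runs both statements in SQLite through Python's sqlite3 module; the table definitions and the sample cities are hypothetical, and the check says nothing about which form a given engine executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, city TEXT);
    -- One row in A can match many rows in B (one to many).
    INSERT INTO A (city) VALUES ('Lahore'), ('Karachi'), ('Quetta');
    INSERT INTO B (city) VALUES ('Lahore'), ('Lahore'), ('Karachi');
""")

in_rows = conn.execute(
    "SELECT * FROM A WHERE A.city IN (SELECT B.city FROM B) ORDER BY A.id"
).fetchall()

exists_rows = conn.execute(
    "SELECT * FROM A WHERE EXISTS (SELECT city FROM B WHERE A.city = B.city) ORDER BY A.id"
).fetchall()

print(in_rows)
print(exists_rows)
```

Neither form duplicates the 'Lahore' row of A even though B contains it twice, which is the difference from writing the same condition as a plain join.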

Use IN Sub-query
  SELECT * FROM A WHERE A.CITY IN (SELECT B.CITY FROM B)
A.CITY is indexed, B.CITY is not indexed, and table B has far fewer rows than A.
  SELECT * FROM A WHERE A.CITY IN (SELECT B.CITY FROM B)
A.CITY is indexed, B.CITY is indexed, and table B has far fewer rows than A.

I/O Comparison Case Study
Table1 columns: Id, FirstName, LastName, Designation, X1, … 25 more columns
Scenario 1: Table1 has a Clustered Index on FirstName, LastName and Designation
Scenario 2: Table1 has a Non-Clustered Index on FirstName, LastName and Designation

Query 1:
  Select FirstName, LastName, Designation From Table1 Where FirstName = 'John' AND LastName = 'Walker'
Query 2:
  Select FirstName, LastName, Designation, X1 From Table1 Where FirstName = 'John' AND LastName = 'Walker'
Query 1: which scenario has more I/O?
Query 2: which scenario has more I/O?
Why?
‘Include’ is the answer for the non-clustered index. Use Include very wisely.

Discussions Questions & Answers

References
Most content is taken from different Microsoft presentations.
Microsoft MVP Deep Dives
Kalen Delaney – SQL Server Internals
