Quantitative Performance Analysis Joe Chang

Presentation transcript:

Quantitative Performance Analysis Joe Chang

Objectives
Estimate DB performance early
  What design/architecture decisions impact performance?
  Is the project/architecture feasible?
Production database performance tuning
  Reduces guess work, but there are easier ways
Server Performance Characteristics
  Processor Architecture: Pentium III – Xeon – Itanium 2 – Opteron
  System Architecture: 2, 4, 8, 16-way etc

SPEC CPU 2000 Integer
Pentium M 2.0GHz 90nm, all others 130nm
Xeon 2.4GHz/512K base: 913 (used later in this presentation)
Pentium III Xeon 700MHz/2M: 431

TPC-C Performance – SQL Server # CPUsSystemtpm-C$/tpm-CMem(GB)# Disks 1HP (3.2GHz/2M)35,030$ HP (3.2GHz/2M)60,364$ IBM x365102,667$ IBM x445156,195$ Unisys ES ,869$ Unisys ES ,148$ # CPUsSystemtpm-C$/tpm-CMem(GB)# Disks 4HP DL585115, Xeon 3.2GHz/2M & Xeon MP 3.0GHz/4M Opteron 2.2GHz/1M IA-32 limited max memory (64GB), AWE overhead, bus architecture

Itanium 2 (SQL Server) vs IBM Power 5 # CPUsSystemtpm-C$/tpm-CMem(GB)# Disks 4Pwr5 570 /Oracle194,391$ Pwr5 570 /UDB429,900$ Pwr5 570 /UDB809,144$ Pwr5 695 /Oracle1,601,785$ Pwr5 695 /UDB3,210,540$ IBM Power 4+/5 1.9GHz # CPUsSystemtpm-C$/tpm-CMem(GB)# Disks 4HP rx ,065$ Bull175,366$ Unisys ES ,037$ NEC Express577,531$ HP Superdome786,646$ Itanium 2 1.5GHz/6M

Scaling
$$ \frac{P_n}{P_1} = S^{\log_2 n} $$
P_n: Performance with n processors
P_1: Performance with 1 processor
S: Scale factor
n: Number of processors
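The scaling relation on this slide can be turned into a small calculator. The following is a minimal sketch in Python; the function names and the sample inputs (a 1-processor result of 100 units and a scale factor of 1.5) are illustrative assumptions, not measurements from the presentation.

```python
import math


def scaled_performance(p1: float, scale_factor: float, n: int) -> float:
    """Return P_n = P_1 * S ** log2(n) for n processors."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return p1 * scale_factor ** math.log2(n)


def implied_scale_factor(p1: float, pn: float, n: int) -> float:
    """Solve the same relation for S, given results at 1 and n processors."""
    if n < 2:
        raise ValueError("n must be at least 2 to infer a scale factor")
    return (pn / p1) ** (1.0 / math.log2(n))


if __name__ == "__main__":
    # Hypothetical inputs: P_1 = 100 units, S = 1.5 per doubling of processors.
    for n in (1, 2, 4, 8, 16):
        print(n, round(scaled_performance(100.0, 1.5, n), 1))
```

Under this relation each doubling of the processor count multiplies performance by S, so S = 2 would be perfect linear scaling and S = 1 would be no scaling at all.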

Unit of Measure – CPU-Cycles
Query costs measured in CPU-cycles
  Alternative: CPU-sec
Cost = Runtime (sec) × CPU Util. × Available CPU-cycles ÷ Iterations
Available CPU-cycles = Number of CPUs × Frequency
  Example: 4 x 700MHz = 2.8B cycles/sec
CPU-cycles does not imply CPU instructions; the unit of time is the same as the CPU clock
  1GHz CPU: time unit = 1ns
  2GHz CPU: time unit = 0.5ns
All tests on Windows 2000/2003, SQL Server 2000 SP1-3
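As a worked example of the cost formula on this slide, here is a minimal Python sketch. The function names are hypothetical, and the test-run figures in the demo (60 seconds, 90% CPU, 500,000 iterations) are made-up inputs chosen only to show the arithmetic; the 4 x 700MHz configuration is the one quoted on the slide.

```python
def available_cycles_per_sec(num_cpus: int, frequency_hz: float) -> float:
    """Available CPU-cycles per second = number of CPUs x clock frequency."""
    return num_cpus * frequency_hz


def cost_per_call(runtime_sec: float, cpu_util: float, num_cpus: int,
                  frequency_hz: float, iterations: int) -> float:
    """Cost = runtime x CPU utilization x available CPU-cycles / iterations."""
    cycles = available_cycles_per_sec(num_cpus, frequency_hz)
    return runtime_sec * cpu_util * cycles / iterations


def cycle_time_ns(frequency_hz: float) -> float:
    """Length of one CPU-cycle in nanoseconds."""
    return 1e9 / frequency_hz


if __name__ == "__main__":
    # Slide example: 4 x 700MHz = 2.8B cycles/sec.
    print(available_cycles_per_sec(4, 700e6))
    # Hypothetical run: 60 s at 90% CPU, 500,000 calls on the same system.
    print(cost_per_call(60.0, 0.90, 4, 700e6, 500_000))
    # 1GHz -> 1 ns per cycle, 2GHz -> 0.5 ns per cycle.
    print(cycle_time_ns(1e9), cycle_time_ns(2e9))
```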

CPU-Cycles dependencies
CPU-cycles on one processor architecture has no relation to another – ex. Pentium III, Pentium 4, Itanium, Opteron
Some platform dependencies – cache size, bus speed, SMP
Processor, System/Cache/Mem, Performance, Cost / trans.
  Itanium 2: 4 x 1.5GHz/6M /64G, 121, M
  Xeon: 4 x 3.0GHz/4M /32G, 102, M
  Opteron: 4 x 2.2GHz/1M /32G, 105, M

Cost Structure - Model
Stored Procedure Call Cost =
  RPC cost (once per procedure)
  + Type cost (once per procedure?)
  + Query costs (one or more per procedure)
Query – one or more components
Component Cost =
  Cost for component operation base
  + Cost per additional row or page
Only stored procedures are examined
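The cost structure on this slide can be written down as a small additive model. This is a sketch under stated assumptions rather than the author's tooling: the class and function names are hypothetical, the first unit of a component is assumed to be covered by its base cost, and the numbers in the demo are illustrative placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Component:
    """One component operation: a base cost plus a cost per additional row or page."""
    base_cycles: float
    cycles_per_extra_unit: float
    units: int = 1  # rows or pages touched; assumed: the first unit is in the base

    def cost(self) -> float:
        extra_units = max(self.units - 1, 0)
        return self.base_cycles + self.cycles_per_extra_unit * extra_units


def query_cost(components: Iterable[Component]) -> float:
    """A query is the sum of its component operations."""
    return sum(c.cost() for c in components)


def procedure_call_cost(rpc_cycles: float, type_cycles: float,
                        queries: List[List[Component]]) -> float:
    """RPC cost and type cost are paid once; query costs are paid per query."""
    return rpc_cycles + type_cycles + sum(query_cost(q) for q in queries)


if __name__ == "__main__":
    # Illustrative placeholders: one query with a single one-row component.
    single_row = [Component(base_cycles=40_000, cycles_per_extra_unit=0, units=1)]
    print(procedure_call_cost(250_000, 30_000, [single_row]))
    # The same query shape returning 101 rows at a hypothetical 1,300 cycles per extra row.
    many_rows = [Component(base_cycles=40_000, cycles_per_extra_unit=1_300, units=101)]
    print(procedure_call_cost(250_000, 30_000, [many_rows]))
```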

RPC Cost
Cost of RPC to SQL Server includes:
  1) Network roundtrip
  2) SQL Server handling costs
Calls to SQL Server are made with RPC (not SQL Batch)
  Profiler -> Event Class: RPC
ADO.NET
  Command Type Stored Procedure or Text with parameters
  Command Text without parameters: SQL Batch

Type Cost?
Blank Procedure: ~250,000 CPU-cycles
  CREATE PROC p_ProcBlank AS RETURN
Proc with Single Query: ~320K CPU-cycles
  CREATE PROC p_ProcSingleQuery AS SELECT … FROM TableA WHERE ID
Proc with Two Queries: ~360K CPU-cycles
  CREATE PROC p_ProcTwoQueries AS SELECT … FROM TableA WHERE ID SELECT … FROM TableB WHERE ID
RPC Cost ~250K CPU-cycles, Type Cost ~30K CPU-cycles, Query Cost ~40K CPU-cycles
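The three measurements on this slide are enough to separate the fixed costs by subtraction. The sketch below shows that arithmetic in Python; the function name is hypothetical, and it assumes the two queries cost the same and that costs add linearly, which is how the slide's own breakdown reads.

```python
from typing import Tuple


def decompose_fixed_costs(blank_proc: int, one_query_proc: int,
                          two_query_proc: int) -> Tuple[int, int, int]:
    """Split measured procedure costs into (RPC, type, per-query) costs.

    Assumes a blank procedure costs only the RPC, that the second query
    adds exactly one query cost, and that the type cost is paid once.
    """
    rpc = blank_proc
    query = two_query_proc - one_query_proc
    type_cost = one_query_proc - blank_proc - query
    return rpc, type_cost, query


if __name__ == "__main__":
    # Approximate figures quoted on the slide, in CPU-cycles.
    print(decompose_fixed_costs(250_000, 320_000, 360_000))
```

With the slide's approximate figures this gives 250K for the RPC, 30K for the type cost and 40K per query, matching the summary line.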

Blank RPC Performance
All systems with 2 processors, Xeon 3.2GHz with 800MHz bus
RPCs are expensive relative to a simple query
120K 270K 240K 260K 195K 154K 140K
Costs in CPU-Cycles

RPC Cost Processor PIIIPIII X Xeon Opteron Itanium 2*It2 CPUs RPC cost 140K 200K K155K290K350K 270K (2.xx driver) Type Cost Select20-30K ~5K35-55K ~20K~8K Systems: Pentium III 2x 600MHz/256K, 2x 733MHz/256K, PIII Xeon2x 500MHz/2M, 4x 700MHz/2M, 4x900/2M Xeon (P4) 2x 2.0GHz/512K2x 2.4GHz/512K Opteron2x 2.2GHz/1M Itanium 22x 900MHz/1.5M8x1.5GHz/6M OS: W2K, W2K3, various sp SQL Server 2000, various sp PIII: Intel PRO/100+, Others: Broadcom Gigabit Ethernet driver 5.xx+ *Itanium 2 system booted with 2, 4 or 8 processors (4P config may have had procs from more than 1 cell) Costs in CPU-Cycles

RPC Cost – Fiber versus Threads
PIII Xeon - TCP: 1P, 2P, 4P
  FE-Thread: 105K, 150K, 200K
  FE-Fiber: 95K, 120K, 170K
Xeon - TCP
  GE-Thread: 210K, 250K
  GE-Fiber: 200K, 230K
Xeon - VI
  VI Thread: 190K
  VI Fiber: 160K, 180K
Itanium 2 - TCP: 1P, 2P, 4P, 8P
  Thread: 105K, 155K, 290K, 350K
  Fiber: 95K, 145K, 260K, 300K
Costs in CPU-Cycles
Broadcom Gigabit Ethernet driver 5.xx, 6.xx, 7.xx (270K for 2P 2.xx driver)
VI: QLogic QLA2350, drivers: qla , qlvika 1.1.3

RPC Cost TCP vs Named Pipes
PIII Xeon 4P: TCP, named pipes
  FE-Thread: 200K, 315K
  FE-Fiber: 170K, 370K
Xeon, Thread: 1P, 2P
  GE, TCP: 210K, 250K
  GE, Named Pipes: 320K, 360K
Broadcom Gigabit Ethernet driver 5.xx, 6.xx, 7.xx (270K for 2P 2.xx driver)
VI: QLogic QLA2350, drivers: qla , qlvika
Costs in CPU-Cycles

RPC Costs – owner, case
PIII, PIII X, P4/Xeon
  RPC cost: 140K, 140K?, 250K
  sp_executesql: 210K
Unspecified owner, Ex: user1 calls procedure owned by dbo
  +100K on 4P PIII, +100K on 2P Xeon, 300K on 8P Itanium2
Case mismatch:
  Actual procedure: p_Get_Rows
  Called procedure: p_get_rows
  +100K on 4P PIII, +150K on 2P Xeon, +300K on 8P Itanium2
Costs in CPU-Cycles

Single Row Select Costs
Clustered Index Seek
  Does cost depend on index depth?
  Role of I/O count in cost
Index Seek with Bookmark Lookup
  Does cost depend on table type?
  Heap versus clustered index
Table Scan

Single Row Index Seek vs Index Depth Index depth: 50K rows -2 both, 200K 3-Cl, 2-NC, 500K 3 Pentium III 2x733

Batching Queries into a Single RPC Each query: clustered index seek returning 1 row
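Because the RPC and type costs are paid once per call, the cost attributed to each query should fall as more queries share one RPC. The Python sketch below illustrates that amortization; the function name is hypothetical, and the 250K / 30K / 40K inputs are the approximate fixed and per-query costs quoted on the Type Cost slide, assumed here to add linearly.

```python
def amortized_cost_per_query(rpc_cycles: float, type_cycles: float,
                             query_cycles: float, queries_per_call: int) -> float:
    """Average CPU-cycles per query when several queries share one RPC."""
    if queries_per_call < 1:
        raise ValueError("queries_per_call must be at least 1")
    total = rpc_cycles + type_cycles + query_cycles * queries_per_call
    return total / queries_per_call


if __name__ == "__main__":
    # Approximate costs from the Type Cost slide; linear addition is an assumption.
    for n in (1, 2, 5, 10):
        print(n, amortized_cost_per_query(250_000, 30_000, 40_000, n))
```

In this model the per-query cost approaches the bare query cost as the batch grows, which is the quantitative form of the later remark that fixed costs favor consolidation.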

Single Row Cost Summary
Index depth
  Plan – no, True cost – yes
  Cost versus index depth not fully examined
  Fill factor dependence not tested
Bookmark Lookup – Table type
  Plan cost – no
  True cost higher for clustered index than heap

Multi-row Select Queries
Queries return aggregate - Not a test of network bandwidth
Single Table Example
  = = AVG(randDecimal) FROM M3C_01 WHERE GroupID
  1 Count, 1 AVG on 5 byte decimal
Join Example
  SELECT count(*), AVG(m.randDecimal), min(n.randDecimal) FROM M2A_02 m INNER LOOP JOIN M2B_02 n ON n.ID = m.ID WHERE m.GroupID AND n.GroupID
  1 Count, 2 AVG on 5 byte decimal
Table Scan tests return either
  - a single row
  - 1 Count, 1 aggregate on 5 byte decimal

Index Seek – Aggregates Xeon 2x3.06GHz/512K

Multi-row Bookmark Lookups Table lock escalation at >5K rows ? 2x2.2GHz/1M Opteron 1 Count, 1 AVG on decimal

Table Scan Xeon 3.2GHz/800MHz FSB HT

Table Scan – Component Cost
Total cost = RPC + Type + Base + per page costs
Columns: PIII (X), P4/Xeon, Opteron, It2 2P, It2 8P
Type + Base cost: 60K/40K, 145K, 35K, 90K
Cost per page:
  NOLOCK: 24K, -, 16K, 23-35K
  TABLOCK: 24K, 25K, 16K
  PAGLOCK: 26K, 26K, 20K, 17K, 23-35K
  ROWLOCK: 140K, 250K, 110K, 100K, 150K
Table Scan or Index Scan – Plan Formula
  I/O: per page
  CPU: per row
Measured Table Scan cost formulas for 99 rows per page
Costs in CPU-Cycles per page
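The total-cost formula on this slide is straightforward to evaluate once the per-page cost for a lock hint is known. The Python sketch below is illustrative only: the function name is hypothetical, and the inputs (250K RPC, 145K type plus base, 25K or 250K per page) are placeholder values of the same order of magnitude as the slide's figures rather than a reading of any specific column. The 99 rows per page is the density stated on the slide.

```python
def table_scan_cost(rpc_cycles: int, type_plus_base_cycles: int,
                    cycles_per_page: int, rows: int,
                    rows_per_page: int = 99) -> int:
    """Total cost = RPC + (Type + Base) + pages scanned x cost per page."""
    if rows < 0 or rows_per_page < 1:
        raise ValueError("rows must be >= 0 and rows_per_page >= 1")
    pages = -(-rows // rows_per_page)  # integer ceiling division
    return rpc_cycles + type_plus_base_cycles + cycles_per_page * pages


if __name__ == "__main__":
    # Placeholder inputs: 990 rows at 99 rows per page is 10 pages.
    page_level = table_scan_cost(250_000, 145_000, 25_000, rows=990)
    row_level = table_scan_cost(250_000, 145_000, 250_000, rows=990)
    print(page_level, row_level)
```

With a row-lock cost per page roughly ten times the page-lock or table-lock figure, the per-page term dominates the fixed costs after only a few pages, which is why the slide breaks the cost out by lock hint.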

Table Scan Cost per Page 2x733MHz/256KB Pentium III

Bookmark – Table Scan Cross-over Plan Costs: ≤1GB 2x733MHz/256K Pentium III

Loop, Hash and Merge Join
Loop joins: Case 1, 2, 3 etc
Hash joins: in-memory versus others
Merge: regular versus many-to-many
Merge join with sort operations
Locks – page lock

Joins – Locking Granularity Xeon 2x2.0GHz/512K Count+2 AVG(decimal) Hash join spools to tempdb

Loop, Hash & Merge Joins Xeon 2x3.06GHz/512K(Count)

Processor Architecture Pentium III, Xeon Opteron, Itanium 2

Index Seek COUNT(*) + AVG(Decimal)

Bookmark Lookup Heap, Page Lock, sequential rows 1 Count, 1 AVG on decimal

Table Scan NOLOCK

Hyper-Threading Xeon 3.06GHz/512K (130nm) vs. Xeon 3.2GHz/1M & 800MHz FSB (90nm)

Hyper-Threading
CPU linear with throughput
CPU non-linear with throughput
Exercise caution in extrapolating system performance based on low CPU measurements

Index Seek – Single Row

Index Seek - Multiple Rows

Table Scan - NOLOCK Xeon, Xeon/800 NOLOCK

Bookmark Lookup – Cl. IX, Seq, PL 1 Count, Clustered Index, Sequential rows, Page Lock

Bookmark Lookup – Heap, Seq, PL 1 Count, Heap, Sequential rows, Page Lock

Loop Join

Hash Join

Merge Join

Hyper-Threading Summary
Lower cost RPC
Improves performance of low cost queries
May improve performance on some ops
Degrades performance on others
Xeon 800MHz FSB with 90nm core better than 130nm

Scaling Itanium 2, 1.5GHz/6M, 1, 2, 4, 8, 16 CPUs

Scaling
First System: HP rx way Itanium 2 1.5GHz/6M
  Boot with /NUMPROC = xx
  Query: Count + 1 or 2 AVG(Decimal)
Second System: Unisys ES7000 Aries way Itanium 2 1.5GHz/6M
  Booted with all 16 CPUs, Affinity Mask = 1, 2, …
  Query: Count only
SQL Server 2000, Service Pack 3 (build 760)

Index Seek 9X difference between 16P and 1P

Bookmark Lookup 1-way Minor differences between Table org & lock granularity

Bookmark Lookup – 16 way Major differences! 100X !!

Bookmark Lookup – Clustered Index Really bad news beyond 2P, especially bad ~5K rows

Bookmark Lookup – Cl. IX, PL No scaling beyond 2P, but no disaster at 5K rows

Bookmark Lookup – Heap Performance degradation at 5K rows

Bookmark Lookup – Heap Page Lock Excellent scaling, 8P anomaly?

Joins – 1P Up to 10X difference between loop, hash & merge

Joins – 16P Up to 100X difference between loop, hash & merge

Loop Join No scaling beyond 2P

Hash Join Scales to 8P, peaks at ~5K rows

Merge Join Scales well all around

Table Scans Questionable scaling, peaks at 300 pages, cache size ?

Scaling Summary
Avoid Bookmark Lookups on Tables with Clustered indexes
Use Heap tables
Use NOLOCK, especially ~5000 rows / query
Avoid Loop Joins
Design for Merge or Hash Joins
Avoid table scans (Itanium only?)

Bookmark HP 32P, SP3 build 760/818

Bookmark HP 32P, SP4 build 2000?

Joins HP 32P, SP3 build 760/818

Joins HP 32P, SP4 build 2000?

Table Scan HP 32P,

Bookmark Lookup (PL Sequential)

Index Seek

Loop Join

Hash Join

Merge

Joins HP 32P, SP4 build 2000?

Database Design for Performance
Round-trip minimization – RPC cost
Row count management – Cost per row
Indexes – isolate queried rows to a limited number of adjacent pages quickly, not most selective columns 1st
Design for low cost operations
  Covered index instead of bookmark lookups
  Merge joins, Indexed views
Avoid excessive logic
NOLOCK on non-transactional data

Statistics Accuracy & Relevance
More than keeping statistics up to date
Data queried needs to reflect data in table
Avoid populating database with test data having different distribution than live data

Performance Issues
NC Index Seek + Bookmark Lookup vs. Table Scan
  Query optimizer switches to table scan too soon for in-memory, too late for disk bound data
  Watch table scan lock level
Row count plan versus actual cost issues
  May be related to WHERE clause conditions
Lock hints
Merge and Hash joins vs. Loop joins
Fixed costs favor consolidation
  Both in RPC and queries

Summary
Query cost structure
  Fixed costs identified
  Costs applied once per RPC
Component operations examined
  Base cost and cost per row or page
Lock hints – Row, Page, Table, No Lock

Additional Information
SQL Server Quantitative Performance Analysis
Server System Architecture
Processor Performance
Direct Connect Gigabit Networking
Parallel Execution Plans
Large Data Operations
Transferring Statistics
SQL Server Backup Performance with Imceda LiteSpeed

Quantitative Performance Analysis Backup Slides

Subjects
Cost measurement
  Test procedure, Unit of measure
Query Costs
  Single row, table scan, multi-row, joins
Discrepancies with plan costs
  Logical I/O, Lock Hints, Conditions
Database design implications
  Design, coding and indexes

Test Procedure
Load generator
  Drive DB server CPU utilization as high as possible (80-100%)
Multiple load generators running same query
  Single thread will not fully load DB server
  Network propagation time is significant
Most queries run on single processor
  Server may have more than one processor
Component operation cost derived
  Comparing two queries differing in one op

Index Seek by Aggregates
SELECT COUNT(*)
SELECT COUNT(*), CONVERT int to bigint, etc
SELECT COUNT(*), SUM(int),
SELECT COUNT(*), SUM(Money)

Join Cost Linearity on 32-bit
Hash join cost per row is somewhat linear up to ~5M rows
Duration then jumps due to disk IO

Join Cost Linearity on Itanium 2
Loop & Merge join cost per row independent of row count
Hash join cost per row is not (no spooling to temp)

Hash Join – row size
Optimizer cost: depends on OS size, not IS size
Actual cost: depends on # of columns and OS/IS

Type + Component Base Cost
Processor: PIII, PIII X, Xeon, Opt, Itanium 2
# of CPUs
Frequency ~ M1.5G
Cache: 256K, 2M, 512K, 1M, 1.5M, 6M
Aggregate: 100K, 40K, 135K, 45K, 35K, 66K
Table Scan: 60K, 40K, 145K, 50K, 35K, 40K, 90K
Loop Join: 110K, 24K, 130K, 50K, 45K, 10K, 15K
Hash Join: 500K, 250K, 620K, 250K, 225K, 210K, 320K
Merge Join: 150K, 54K, 170K, 80K, 75K, 55K, 110K
Merge + Sort: 145K, 320K, 140K, 230K
Many-to-Many: 250K, 550K, 200K, 320K
Costs in CPU-Cycles
Big cache lowers base cost?

Cost per Row
Processor: PIII, Xeon, Opt, Itanium 2 (2P, 4P, 8P)
Index Seek: 1.3K, 1.8K, 1.1K, 1.0K
Bookmark LU (CL): 16K, 20K, 12K, 11K, 16K, 27K
BL (CL) page lock: 11K, 15K, 8K, 13K, 18K
Bookmark LU (Heap): 11K, 14K, 9K, 10K, 11K, 16K
BL (Heap) page lock: 7K, 9K, 5K, 6K, 8K
Loop Join: 16K, 25K, 13K, 11K, 16K, 23K
Loop Join, page lock: 13K, 20K, 11.5K, 9K, 14K, 21K
Hash Join: 8.5K, 11K, 7.5K, 7K
Hash Join, page lock: 6.5K, 8K, 6.0K, 5K
Merge Join: 6.5K, 10K, 5.0K, 6K
Merge Join, page lock: 3.0K, 4K, 2.5K, 3K
Merge+Sort: 7.5K, 9K, 6K, 7K
Many-To-Many PL: 32K, 40K, 18K, 31K
Costs in CPU-Cycles

IUD Cost Structure
2xXeon*, Notes
  RPC cost: 240,000 (Higher for threads, owner m/m)
  Type Cost: 130,000 (once per procedure)
  IUD Base: 170,000 (once per IUD statement)
  Single row IUD: 300,000 (Range: 200, ,000)
  Multi-row Insert Cost per row: 90,000 (cost per additional row)
INSERT, UPDATE & DELETE cost structure very similar
Multi-row UPDATE & DELETE not fully investigated
*Use Windows NT fibers on

INSERT Cost Structure
Index and Foreign Key not fully explored
Early measurements:
  50-70,000 per additional index
  50-70,000 per foreign key
Assumes inserts into “sequential” segments
  Cluster on Identity or Cluster Key + Identity
Cluster on row GUID can cause substantial disk loading!