Workload Management for Complex Data Warehousing Environments
Hermann Bär, Data Warehousing Product Management
Why Anglo-German? © 2010 Oracle Corporation
Agenda
  - What is a concurrent (mixed workload) environment?
  - Planning for workload management
  - Tools and methods
  - Resource definition and management
  - Step-by-step workload management
    - Identify workloads
    - Manage system resources
    - Restrict resource usage
    - Curb runaway queries
    - Monitor and tune
A Mixed Workload, 2005: Major Changes for Your Data Warehouse
  - Department A supplies data to the DW daily and runs reports
  - Department B supplies data to the DW daily and runs reports
  - Data marts
  - Daily batch windows
  - Ad-hoc queries
  - Downtime OK
A Mixed Workload, 2011: Major Changes for Your Data Warehouse
  - All departments
  - On-line applications: CEO strategy, finance, marketing, CRM
  - Live systems: stock tracking, direct business impact
  - Real-time feeds into the Enterprise Data Warehouse, with write-backs
  - Classic reporting
  - Deep analytics: long-running reports, heavy analytical content, investigative querying, predictive modeling, scenario analysis, data mining
A Mixed Workload: Sample Requirements
  - Workloads should use critical system resources according to their priority
    - CPU, I/O
  - Tactical workload must run with expected DOP
    - Running at diminished DOP or queuing results in unacceptable performance
  - Full utilization of critical resources
    - Avoid inefficient schemes that require dedicated resources
      - Servers dedicated to services
      - Separate data marts / warehouses
  - Flexible resource allocation
    - E.g. priority of ETL is based on time of day
Workload Management for DW: Three Main Components
  - Workload management: define workload plans, filter exceptions, manage resources, monitor workloads, adjust workload plans
  - Database architecture
  - Hardware architecture
  - Diagram labels: EDW data layers, data mart strategy, sandboxes, active HA/DR strategy, compression strategies, storage media hierarchy
Workload Management for DW: What we are covering today…
  - Activities: define workloads, filter exceptions, manage resources, monitor workloads, adjust plans
  - Cycle: define workload plans, execute workloads, monitor workloads, adjust workload plans
  - Tools: IORM, RAC, OEM, DBRM
  - The RAC piece includes things like:
    - Services
    - Server Pools (Grid Infrastructure) to provide elasticity (add servers to a pool to increase memory)
    - Instance Caging (consolidation)
Tools and Methods
  - Resource allocation
    - Database (processing) nodes
      - Services, Server Pools, and consumer groups
      - Instance caging
    - IO resource management
  - Resource management
    - Consumer Groups
      - Within a single database
      - Across multiple databases
    - Workload-driven database resource management
      - Thresholds and actions
Services
  - Use services to restrict the number of nodes
  - Dynamic allocation and re-routing
  - Example: divide an 8-node cluster (nodes 1 to 8) between Service Gold and Service Silver, where Service Gold is 3 nodes
Services and Server Pools
  - Expand a service by expanding the pool of servers it has access to
  - Example on the 8-node cluster: expand Service Gold to 4 nodes, shrink Service Silver to 4 nodes
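The pool resize described on this slide can be pictured with a small sketch. This is only a bookkeeping model of node membership, not Grid Infrastructure code; the function name `resize_pools` and the node numbering per pool are assumptions made for illustration.

```python
def resize_pools(pools, source, target, count):
    """Move `count` nodes from the `source` pool to the `target` pool.

    `pools` maps a pool name to the list of node numbers it owns.
    Returns a new mapping; the input is left unchanged.
    """
    if count > len(pools[source]):
        raise ValueError("source pool does not have enough nodes")
    moved = pools[source][:count]
    return {
        **pools,
        source: pools[source][count:],
        target: pools[target] + moved,
    }


# Assumed starting point: Gold owns 3 of the 8 nodes, Silver the other 5.
pools = {"gold": [1, 2, 3], "silver": [4, 5, 6, 7, 8]}
resized = resize_pools(pools, source="silver", target="gold", count=1)
print(resized)
```

After the move both services span 4 nodes, which matches the example on the slide.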
Instance Caging
  - Limit (“cage”) the amount of CPU for a given instance
  - Example: divide an 8-node cluster (nodes 1 to 8) so that two databases each get half of the CPUs per node
Sample Instance Caging
  - 4-CPU server
  - Workload is a mix of OLTP transactions, parallel queries, and DML from Oracle Financials
I/O Resource Management on Exadata
  - Global I/O resource management
  - Prioritize multiple individual databases
  - Prioritize workloads within a single database
  - Prioritize a certain type of workload across all databases
    - Prioritize all tactical queries
    - Deprioritize all ad-hoc queries
  - Diagram: Data Mart A, Data Mart B, Enterprise Data Warehouse
Sample I/O Utilization
  - Queries from TPC-H benchmark suite
  - Disk utilization measured via iostat
Database Resource Manager
  - Single framework to do workload management, including:
    - CPU
    - Session control
    - Thresholds
    - I/O (Exadata has I/O Resource Manager)
    - Parallel statement queuing
  - Each consumer group now needs to be managed in terms of parallel statement queuing
  - New settings / screens to control queuing in Enterprise Manager and in DBRM packages
Database Resource Manager (DBRM)
  - Resource management within a single database
  - Divide a system horizontally across nodes
  - Uses Resource Plans and Consumer Groups to model and assign resources
  - Allows for prioritization and flexibility in resource allocation
  - Diagram: Grp 1, Grp 2, and Grp 3 each span nodes 1 to 8
DBRM with Services
  - Resource management within a single database
  - Service-aware resource management
  - Make sure to fully utilize the resources
  - Diagram: Grp 1 to Grp 5 placed within Service Gold and Service Silver across nodes 1 to 8
DBRM with Services and Instance Caging
  - Three individual databases
  - Resource management across the cluster between databases
  - Fine-grained resource management within single databases
  - Diagram: Grp 1 to Grp 7 placed within Service Gold and Service Silver across nodes 1 to 8
Step 1: Understand the Workload
  - Review the workload to find out:
    - Who is doing the work?
    - What types of work are done on the system?
    - When are certain types being done?
    - Where are performance problem areas?
    - What are the priorities, and do they change during a time window?
    - Are there priority conflicts?
Workload Management
  - Diagram: Request, Assign, Queue, Execute for an ad-hoc workload, with Reject and Downgrade as exits
  - Each request executes on a RAC Service
    - Which limits the physical resources
    - Allows scalability across racks
  - Each request is assigned to a consumer group by:
    - OS or DB username
    - Application or module
    - Action within module
    - Administrative function
  - Each consumer group has:
    - Resource allocation (example: 10% of CPU/IO resources)
    - Directives (example: 20 active sessions)
    - Thresholds (example: no jobs longer than 2 min)
Workload Management
  - Diagram: a Request is assigned to one of three workloads (Static Reports, Tactical Queries, Ad-hoc Workload), each with its own Queue, and is then executed; Reject and Downgrade are shown as exits
Step 2: Map the Workload to the System
  - Create the resource consumer groups
    - Map to users or applications
    - Map to estimated execution time
    - Other criteria
  - Create the required resource plans
    - For example: nighttime vs. daytime, online vs. offline
  - Set the overall priorities
    - Which resource group gets most resources
    - Cap max utilizations
  - Drill down into parallelism, queuing and session throttles
Resource Manager User Interface
Database Resource Manager: Session to Consumer Group Mapping Rules
  - Create Consumer Groups for each type of workload
  - Create rules to dynamically map sessions to Consumer Groups
  - Consumer Groups shown: Tactical, Reports, Low-Priority, ETL
  - Mapping rules shown:
    - service = ‘CRM’
    - client program = ‘OBIEE’
    - client program = ‘OBIEE’ && module = ‘AdHoc’
    - client program = ‘Oracle Data Mining’
    - query has been running > 1 hour
    - estimated execution time of query > 12 hours
    - service = ‘ETL’
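A minimal sketch of first-match rule evaluation may help to picture the mapping. It is not the DBRM API. The pairing of each rule with a group is an assumption, since the flattened slide does not say which rule feeds which group. The two runtime-based rules are left out because they act as thresholds on a running query rather than as session attributes. `OTHER_GROUPS` is used here only as a fallback label.

```python
from dataclasses import dataclass


@dataclass
class Session:
    service: str = ""
    client_program: str = ""
    module: str = ""


# Ordered rules: the first matching predicate wins, so the more specific
# OBIEE + AdHoc rule is listed before the plain OBIEE rule.
# The rule-to-group pairing below is assumed for illustration.
RULES = [
    (lambda s: s.service == "CRM", "TACTICAL"),
    (lambda s: s.service == "ETL", "ETL"),
    (lambda s: s.client_program == "OBIEE" and s.module == "AdHoc", "LOW_PRIORITY"),
    (lambda s: s.client_program == "Oracle Data Mining", "LOW_PRIORITY"),
    (lambda s: s.client_program == "OBIEE", "REPORTS"),
]


def map_session(session, default="OTHER_GROUPS"):
    """Return the consumer group of the first rule that matches the session."""
    for predicate, group in RULES:
        if predicate(session):
            return group
    return default


print(map_session(Session(client_program="OBIEE", module="AdHoc")))
```

Ordering the rules from most to least specific is one simple way to resolve overlaps such as the two OBIEE rules.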
Step 3: Manage CPU
  - CPU is a critical resource
    - Even more critical on Exadata
    - Exadata Smart Scan only returns useful data blocks
    - Exadata Flash Cache completes I/Os in microseconds
    - Result is heavy CPU loads
  - Goal
    - Allocate sufficient CPU to Tactical, Reports, and ETL to satisfy performance objectives
    - Allocate excess CPU to Low-Priority workloads
  - Solution
    - Configure CPU allocations in Database Resource Plan
Step 3: Manage CPU
  - Day Time Plan:
      Consumer Group   Level 1   Level 2
      Tactical         60%
      Reports          20%
      ETL              20%
      Low-Priority               100%
  - Any CPU unused by Tactical, Reports, or ETL is allocated to Low-Priority sessions
  - The DBA can create a Night Time Plan that allocates more CPU to ETL
  - Very fine-grained scheduling
    - Resource Manager mimics an OS scheduler
    - Resource Manager schedules at a 100 ms quantum
    - All sessions run, but some sessions run more frequently than others
    - Low-priority session yields to a high-priority session within a quantum
  - Background processes are not managed
    - Backgrounds are high-priority and not CPU-intensive
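The two-level Day Time Plan can be worked through with a little arithmetic. The sketch below is a simplification that follows only the slide's statement that CPU unused by the level 1 groups falls through to Low-Priority; it ignores any redistribution among level 1 groups. The function name and the demand figures are hypothetical.

```python
# Level 1 allocations from the Day Time Plan, in percent of CPU.
LEVEL1 = {"Tactical": 60, "Reports": 20, "ETL": 20}


def low_priority_share(demand):
    """CPU percentage left for Low-Priority (level 2).

    `demand` maps each level 1 group to the CPU percentage it currently
    wants. A group never uses more than its level 1 allocation here.
    """
    used = sum(min(demand.get(group, 0), alloc) for group, alloc in LEVEL1.items())
    return 100 - used


# Hypothetical moment: Tactical wants 30%, Reports 20%, ETL only 5%.
share = low_priority_share({"Tactical": 30, "Reports": 20, "ETL": 5})
print(share)
```

With these demand figures the level 1 groups consume 55% in total, so 45% is left for Low-Priority sessions.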
CPU Scheduling with Resource Manager
  - Sessions wait on the “resmgr:cpu quantum” event in an Oracle-internal CPU queue
  - Resource Plan: Tactical 75%, Reports 25% (Tactical is picked 3 out of 4 times)
  - Sessions are scheduled every 100 ms
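To illustrate how a 75/25 plan leads to Tactical being picked 3 out of 4 times, here is a sketch using smooth weighted round-robin. This algorithm is a stand-in chosen for its determinism; the slides do not describe Resource Manager's internal selection logic, and the function name `schedule` is an assumption.

```python
def schedule(weights, quanta):
    """Pick one consumer group per quantum using smooth weighted round-robin.

    Each quantum, every group's running credit grows by its weight; the
    group with the highest credit runs and pays back the total weight.
    """
    credit = {group: 0 for group in weights}
    total = sum(weights.values())
    order = []
    for _ in range(quanta):
        for group, weight in weights.items():
            credit[group] += weight
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        order.append(chosen)
    return order


# Eight quanta of 100 ms each under the plan from the slide.
order = schedule({"Tactical": 75, "Reports": 25}, quanta=8)
print(order)
```

Over any window of four quanta the Reports group still runs once, which reflects the point that all sessions run, but some run more frequently than others.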
Step 4: Manage I/O
  - Disk bandwidth is a critical resource
    - Key to exceptional query performance
    - One query can utilize a high percentage of each disk’s bandwidth
    - Multiple concurrent parallel queries result in heavy disk loads
  - Goal
    - Allocate sufficient I/O bandwidth to Tactical, Reports, and ETL to satisfy performance objectives
    - Allocate excess I/O bandwidth to Low-Priority workloads
  - Solution
    - Configure I/O allocations in Database Resource Plan
    - Enable Exadata I/O Resource Manager
Exadata I/O Resource Manager
  - Issue enough I/Os to keep each disk busy. Queue the rest.
  - When an I/O completes:
    1) Pick a Consumer Group queue
    2) Issue the I/O request from the head of that queue
  - Diagram: inside the Exadata Storage Cell, the I/O Resource Manager applies the Database Resource Plan to per-group queues (Tactical, Reports, ETL, and Low-Priority I/Os) and feeds the disk's outstanding I/O requests
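The issue-then-queue behavior on this slide can be sketched as follows. This is an illustrative model only: the class name `DiskQueue` is hypothetical, and the strict preference order used to pick a queue is a simplifying assumption standing in for the allocations of a real Database Resource Plan.

```python
from collections import deque


class DiskQueue:
    def __init__(self, max_outstanding, pick_order):
        self.max_outstanding = max_outstanding
        # Groups listed most preferred first (assumed policy).
        self.pick_order = pick_order
        self.queues = {group: deque() for group in pick_order}
        self.outstanding = []  # (group, request) pairs issued to the disk

    def submit(self, group, request):
        """Issue the I/O if the disk has room, otherwise queue it per group."""
        if len(self.outstanding) < self.max_outstanding:
            self.outstanding.append((group, request))
        else:
            self.queues[group].append(request)

    def complete(self, group, request):
        """On completion, pick a queue and issue the request at its head."""
        self.outstanding.remove((group, request))
        for candidate in self.pick_order:
            if self.queues[candidate]:
                self.outstanding.append((candidate, self.queues[candidate].popleft()))
                break


disk = DiskQueue(max_outstanding=2, pick_order=["Tactical", "Reports", "ETL", "Low-Priority"])
disk.submit("Low-Priority", "l1")
disk.submit("Low-Priority", "l2")
disk.submit("Tactical", "t1")      # disk is full, so this waits in its queue
disk.submit("Low-Priority", "l3")  # also waits
disk.complete("Low-Priority", "l1")
print(disk.outstanding)
```

When the first Low-Priority I/O completes, the queued Tactical request is issued ahead of the Low-Priority request that is still waiting.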
Exadata I/O Resource Manager
  - Configure Exadata I/O Resource Manager using the Database Resource Plan
    - Same plan used to manage CPU
    - Specify resource allocations per Consumer Group
    - Resource allocation == disk utilization
  - Background and ASM I/Os automatically managed
    - Critical I/Os prioritized: instance recovery, LGWR, control file, etc.
  - Use IORM metrics to track I/O load per Consumer Group (IOPS, MBPS, disk utilization %)
  - I/O throttling per Consumer Group
Step 5: Manage Parallel Execution
  - Parallel servers are a limited resource
    - Global database limit specified by parallel_max_servers
    - Too many concurrent parallel statements cause thrashing
  - When there are no more parallel servers
    - Critical statements may run serially
    - When parallel servers free up, there is no way to boost the DOP of running statements
  - With 11.2, Oracle automatically decides whether a statement
    - Executes in parallel or not, and what DOP it will use
    - Can execute immediately or will be queued
Parallel Statement Queuing
  - Parallel servers are available: parallel statements run immediately
  - No more parallel servers available: parallel statements are now queued
  - Diagram: available server counts of 128, 64, 32, and 0 as Tactical, Batch, and Ad-Hoc parallel statements start running; later statements wait in the Parallel Statement Queue managed by the Queue Coordinator
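A minimal first-in, first-out model of this behavior is sketched below. The class name, the statement names, and the per-statement server counts are hypothetical; the sketch treats each statement as holding a fixed number of parallel servers and does not model how Oracle actually derives that number from the DOP.

```python
from collections import deque


class ParallelStatementQueue:
    def __init__(self, total_servers):
        self.available = total_servers
        self.running = {}        # statement name -> servers held
        self.waiting = deque()   # (name, servers) in arrival order

    def _start(self, name, servers):
        self.available -= servers
        self.running[name] = servers

    def submit(self, name, servers):
        # Run at once only if nobody is already waiting and enough servers are free.
        if not self.waiting and servers <= self.available:
            self._start(name, servers)
            return "running"
        self.waiting.append((name, servers))
        return "queued"

    def finish(self, name):
        self.available += self.running.pop(name)
        # Dequeue from the head while the head statement fits.
        while self.waiting and self.waiting[0][1] <= self.available:
            self._start(*self.waiting.popleft())


q = ParallelStatementQueue(total_servers=128)
states = [
    q.submit("tactical_1", 64),  # 128 -> 64 available
    q.submit("batch_1", 32),     # 64 -> 32
    q.submit("batch_2", 32),     # 32 -> 0
    q.submit("adhoc_1", 16),     # no servers left, so it is queued
]
print(states, q.available)
```

Calling `q.finish("batch_1")` afterwards frees 32 servers, at which point the queued ad-hoc statement starts.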
Queuing Shown in Enterprise Manager
Ordering Parallel Statements
  - DBAs want to control the order in which parallel queries are dequeued
    - Prioritize tactical queries over batch and ad-hoc queries
    - Impose a user-defined policy for ordering queued parallel statements
  - Solution with 11.2.0.2
    - Separate queues per Consumer Group
    - Resource Plan specifies which queue’s parallel statements are issued next
Ordering Parallel Statements
  - Resource Plan: Priority 1: Tactical; Priority 2, 70%: Batch; Priority 2, 30%: Ad-Hoc
  - When parallel servers become available, the resource plan is used to select a queue. The head parallel statement from that queue is run.
  - Since Tactical is Priority 1, its parallel statements are always selected first.
  - Once there are no more Tactical parallel statements, either Batch or Ad-Hoc is picked. Batch is selected 70% of the time and Ad-Hoc 30% of the time.
  - Diagram: the Parallel Statement Queue Coordinator serves separate Tactical, Batch, and Ad-Hoc queues; available server counts shown are 0, 16, and 64
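The selection policy on this slide can be approximated with a weighted random pick per priority level. This is a sketch under that assumption, not a description of how the database implements the choice; `PLAN` and `pick_queue` are hypothetical names.

```python
import random

# Plan from the slide: one dict per priority level, weights in percent.
PLAN = [
    {"Tactical": 100},            # priority 1
    {"Batch": 70, "Ad-Hoc": 30},  # priority 2
]


def pick_queue(queues, rng):
    """Return the group whose head statement should run next, or None.

    `queues` maps a group name to its list of waiting statements.
    Higher priority levels are always served before lower ones.
    """
    for level in PLAN:
        candidates = {g: w for g, w in level.items() if queues.get(g)}
        if candidates:
            groups = list(candidates)
            weights = [candidates[g] for g in groups]
            return rng.choices(groups, weights=weights)[0]
    return None


rng = random.Random(0)
print(pick_queue({"Tactical": ["t1"], "Batch": ["b1"], "Ad-Hoc": ["a1"]}, rng))
```

If only one of the priority 2 queues has waiting statements, that queue is always chosen, so parallel servers are never left idle because of the split.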
Reserving Parallel Servers for Critical Workloads
  - A flood of batch queries can use up all parallel servers
    - Tactical queries are forced to queue
  - Solution
    - Limit the percentage of parallel servers a Consumer Group can use
      - For example, parallel queries from the Batch Consumer Group can only use 50% of the parallel servers
      - Reserves parallel servers for Tactical queries
    - Limit the degree of parallelism of non-critical workloads
Reserving Parallel Servers for Critical Workloads
  - Resource Plan: Priority 1: Tactical; Priority 2, 70%: Batch; Priority 2, 30%: Ad-Hoc
  - Batch is limited to 50% of the parallel servers
  - Since parallel servers are available, Tactical queries can be run immediately
  - Diagram: available server counts of 64, 48, and 32 as Batch statements start; further Batch statements wait in the Batch queue while Tactical statements run
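The effect of a per-group parallel server limit can be shown with a small admission check. The figure of 64 total servers is taken from the diagram; the function name `can_start` and the shape of its arguments are assumptions for illustration only.

```python
def can_start(group, servers, in_use, total_servers, limits):
    """Decide whether a parallel statement may start now.

    `in_use` maps each group to the parallel servers it currently holds.
    `limits` maps a group to the maximum percentage of all parallel
    servers it may hold; groups without an entry are uncapped.
    """
    free = total_servers - sum(in_use.values())
    cap = total_servers * limits.get(group, 100) / 100
    return servers <= free and in_use.get(group, 0) + servers <= cap


limits = {"Batch": 50}
in_use = {"Batch": 32}  # Batch already holds half of the 64 servers

batch_ok = can_start("Batch", 16, in_use, total_servers=64, limits=limits)
tactical_ok = can_start("Tactical", 16, in_use, total_servers=64, limits=limits)
print(batch_ok, tactical_ok)
```

The next Batch statement has to wait even though 32 servers are free, while a Tactical statement of the same size starts immediately.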
Step 6: Restrict Resource Usage
  - Requirement
    - Consistent, predictable performance for workloads
    - Useful for hosted environments and departmental apps
  - Solution
    - Cap the CPU utilization for a Consumer Group
    - Cap the disk utilization for a Consumer Group
  - Day Time Plan:
      Consumer Group      Allocation   Limit
      Tactical            60%
      Sales Reports       15%          30%
      Marketing Reports   15%          30%
      ETL                 10%
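The difference between an allocation and a limit can be sketched with the Day Time Plan figures. The model below assumes that a group may grow into capacity left idle by others unless a limit stops it; the function name `max_usable` and the idle percentages are hypothetical.

```python
# Day Time Plan from the slide: allocation and optional limit, in percent.
PLAN = {
    "Tactical": {"allocation": 60, "limit": None},
    "Sales Reports": {"allocation": 15, "limit": 30},
    "Marketing Reports": {"allocation": 15, "limit": 30},
    "ETL": {"allocation": 10, "limit": None},
}


def max_usable(group, idle_percent):
    """Upper bound on a group's utilization.

    `idle_percent` is the share of the machine that the other groups
    are not using right now. A limit caps the result even if more is idle.
    """
    entry = PLAN[group]
    ceiling = min(entry["allocation"] + idle_percent, 100)
    if entry["limit"] is not None:
        ceiling = min(ceiling, entry["limit"])
    return ceiling


# Everything else is idle: Sales Reports still stops at its 30% limit.
print(max_usable("Sales Reports", idle_percent=85))
print(max_usable("ETL", idle_percent=90))
```

Capping Sales Reports at 30% keeps its performance consistent whether or not the rest of the system is busy, which is the requirement stated on the slide.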
Step 7: Manage Runaway Queries
  - Runaway queries are caused by
    - Missing indexes
    - Unexpected inputs
    - Bad execution plans
  - Severely impact performance of well-behaved queries
  - Very hard to completely eradicate!
  - Chart: query time for Query 1 to Query 4
Manage Runaway Queries
  - Define runaway queries:
    - Estimated execution time
    - Actual execution time
    - Actual number of I/Os (11.1)
    - Actual bytes of I/O (11.1)
  - Manage runaway queries:
    - Switch to another consumer group
      - Lower-priority consumer group
      - Consumer group with max CPU utilization limit (11.2)
    - Abort call
    - Kill session
Manage Runaway Queries
  - For the Tactical consumer group, runaway means: 30+ sec
    - Switch to “Low Priority” consumer group!
  - For the Reports consumer group, runaway means: 32GB+ I/Os
    - Abort query!
  - For the Ad-Hoc consumer group, runaway means: 24+ hour estimated execution time
    - Don’t execute!
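The three example policies can be expressed as a small decision function. This is a sketch only: the action labels, the `QueryStats` fields, and the reading of 32GB as 32 * 1024**3 bytes are assumptions, and real thresholds would be configured as resource plan directives rather than evaluated in application code.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3  # assumed interpretation of "GB" on the slide


@dataclass
class QueryStats:
    group: str
    elapsed_seconds: float = 0.0
    io_bytes: int = 0
    estimated_seconds: Optional[float] = None


def runaway_action(q):
    """Return the action for a query under the example policies."""
    if q.group == "Tactical" and q.elapsed_seconds >= 30:
        return "switch_to_low_priority"
    if q.group == "Reports" and q.io_bytes >= 32 * GIB:
        return "abort_call"
    if (q.group == "Ad-Hoc" and q.estimated_seconds is not None
            and q.estimated_seconds >= 24 * 3600):
        return "do_not_execute"
    return "continue"


print(runaway_action(QueryStats(group="Tactical", elapsed_seconds=45)))
```

Note that the Ad-Hoc rule works on the estimate before execution starts, whereas the other two react to what a running query has actually consumed.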
Consumer Group Settings Overview
Step 8: Run and Adjust the Workload
  - Run a workload for a period of time and look at the results
  - DBRM adjustments:
    - Overall priorities
    - Scheduling of switches in plans
    - Queuing
  - System adjustments:
    - How many PX statements
    - PX queuing levels vs. utilization levels (should we queue less?)
Resource Manager - End to End
  - Test scenario: 2 workloads in a data warehouse
    - Short tactical queries
    - Long-running deep (batch) analysis
  - Goal: run batch and tactical analysis concurrently
    - Don’t impact response time of tactical queries!
Resource Manager - End to End
Questions
Additional Information
  - Instance Caging: http://www.oracle.com/technetwork/database/features/performance/instance-caging-wp-166854.pdf
  - Resource Manager: http://www.oracle.com/technetwork/database/features/performance/resource-manager-twp-133705.pdf