Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC Grant McAlister Senior Database Engineer Amazon.com Paper 32110
Agenda Why Oracle on Linux and RAC The Tests Scaling Performance Availability Choice of Interconnect Conclusion
Why Linux Lower Total Cost of Ownership Near commodity hardware and support Multiple O/S and hardware vendors Common platform (IA-32) for entire enterprise Unix look and feel New enterprise kernel No database conversions when changing Linux hardware or O/S
Why RAC on Linux Cost Ability to use near commodity systems (2-4 processors) Lower level of support needed on system units The need for availability Young and rapidly evolving O/S Near commodity hardware and support The need to scale database beyond 8 processors The need for large amounts of memory > 32GBytes
The Tests Real life workloads Not modified or partitioned to support RAC Used automatic space management Workload #1 Simple workload of small queries with little locking. Workload #2 Typical nasty workload with many inserts, updates and select for updates causing a lot of locking and blocking.
Workload #1 Single Instance Profile

Load Profile                   Per Second    Per Transaction
Redo size:                     77,…
Logical reads:                 4,…
Block changes:
Physical reads:
Physical writes:
User calls:                    11,…
Parses:
Sorts:
Executes:
Transactions:

% Blocks changed per Read:
Recursive Call %:              0.68
Rollback per transaction %:    0.82
Rows per Sort:

Top 5 Wait Events on a single instance
Event                          Waits
db file sequential read        560,…
log file sync                  180,…
log file parallel write        188,…
latch free                     87,584
db file parallel write         5,794
Workload #2 Single Instance Profile

Load Profile                   Per Second    Per Transaction
Redo size:                     244,…
Logical reads:                 14,…
Block changes:                 1,…
Physical reads:
Physical writes:
User calls:                    2,…
Parses:
Sorts:
Executes:
Transactions:

% Blocks changed per Read:
Recursive Call %:              4.16
Rollback per transaction %:    0.96
Rows per Sort:

Top 5 wait events on a single instance
Event                          Waits
db file sequential read        346,…
enqueue
free buffer waits
db file scattered read         141,…
log file sync                  207,…
The Hardware and Software
Software
  Oracle
  Red Hat Advanced Server 2.1 (2.4.9-e.3)
Hardware
  3 types of clusters that each have 4 nodes
    2 Pentium III Xeon 1.126GHz & 5 Gbytes of RAM
    2 Pentium 4 Xeon DP 2.4GHz & 4 Gbytes of RAM
    4 Pentium 4 Xeon MP 1.6GHz & 10 Gbytes of RAM
  Database files were on raw partitions
Scaling The ability to produce higher transactional volumes when adding additional processors or additional nodes.
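One way to put a number on this definition is scaling efficiency: the throughput measured on n nodes divided by n times the single-node throughput. The sketch below assumes that definition, since the slides do not state how their scaling percentages were computed, and the throughput figures in it are hypothetical.

```python
def scaling_efficiency(single_node_tps: float, cluster_tps: float, nodes: int) -> float:
    """Return the fraction of ideal linear scaling achieved by a cluster.

    Assumed definition (not taken from the slides):
        efficiency = cluster throughput / (nodes * single-node throughput)
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if single_node_tps <= 0:
        raise ValueError("single_node_tps must be positive")
    return cluster_tps / (single_node_tps * nodes)


# Hypothetical measurements, for illustration only.
single_node_tps = 100.0   # transactions per second on one node
four_node_tps = 360.0     # transactions per second on a 4-node cluster

efficiency = scaling_efficiency(single_node_tps, four_node_tps, nodes=4)
print(f"Scaling efficiency on 4 nodes: {efficiency:.0%}")
```

An efficiency of 1.0 would mean perfectly linear scaling; anything lower is throughput lost to cluster overhead such as global cache traffic.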
Scaling of workload #1
Scaling of workload #2
Some workloads scale better
Some of the differences

Top 5 workload #1 timed events
Event                        Waits        Time (s)    % Total Elapsed Time
CPU time                                  2,…
global cache null to x       62,646       2,…
db file sequential read      391,474      1,…
buffer busy global cache     15,…
log file sync                158,…

Top 5 workload #2 timed events
Event                        Waits        Time (s)    % Total Elapsed Time
global cache cr request      1,324,756    19,…
buffer busy global cache     53,411       11,…
enqueue                      38,795       11,…
global cache null to x       88,908       6,…
CPU time                                  5,…
Performance The time taken to perform a query is important Execution time influences transactional volume Can cause dramatic changes in the end user response time Stock Exchange Internet Retailer Bank Only you know what is reasonable for your database and application
Execution times for workload #1
Execution times for workload #2
Some ways to improve Make sure your database is well tuned for single instance operation Consider using different block sizes for hot indexes Hash partition hot tables and indexes Partition the workload
Availability Minimize failures by building clusters with as few single points of failure as possible. Set up your RAC cluster to recover from node and instance failure as quickly as possible.
Redundant RAC Configuration
Instance recovery time

Step                      MTTR Target=120    MTTR Target=240    MTTR Target Not Set
Cluster Reconfigured      2                  2                  2
Recovery Started          9                  10                 12
Redo Log First Pass       1113 (column boundaries lost)
Redo Log Second Pass
Total Time

fast_start_mttr_target is the key
Node failure recovery time
Recovery Time = Failure detection + Instance recovery
Failure detection = (MissCount * 1 second)
The MissCount parameter is found in cmcfg.ora
When MissCount = 20 and fast_start_mttr_target = 120, all workload #2 processing resumed in less than 1 minute after crashing a node.
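The arithmetic on this slide can be sketched in a few lines. The function names are illustrative, and the 30-second instance recovery figure is a hypothetical input, not a measurement from the tests; only the MissCount of 20 and the 1-second multiplier come from the slide.

```python
def failure_detection_seconds(miss_count: int, seconds_per_miss: float = 1.0) -> float:
    """Failure detection = MissCount * 1 second, as stated on the slide."""
    if miss_count < 0:
        raise ValueError("miss_count must be non-negative")
    return miss_count * seconds_per_miss


def node_recovery_seconds(miss_count: int, instance_recovery_s: float) -> float:
    """Recovery time = failure detection + instance recovery."""
    return failure_detection_seconds(miss_count) + instance_recovery_s


# MissCount = 20 is the value used on the slide.
detection = failure_detection_seconds(20)

# Hypothetical instance recovery time, chosen only to illustrate the sum.
assumed_instance_recovery_s = 30.0
total = node_recovery_seconds(20, assumed_instance_recovery_s)

print(f"Failure detection: {detection:.0f} s")
print(f"Total node recovery: {total:.0f} s")
```

With MissCount = 20, detection alone takes 20 seconds, so instance recovery has to finish in under 40 seconds for the whole recovery to stay below one minute.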
Impact of a single node failure [Figure: timeline annotated with Node failed, CM ejects node, Cluster Reconfigured, Recovery Complete]
Choice of Interconnect
1000Mbit (Gigabit) Ethernet
  Latency ~ 0.07 ms
  Transfer Rate MBytes per second
  More expensive but becoming common with the advent of gigabit over copper.
100Mbit Ethernet
  Latency ~ 0.20 ms
  Transfer Rate - 10 MBytes per second
  Common and inexpensive
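To see why both latency and transfer rate matter for cache transfers between nodes, a rough estimate of the time to ship one database block is latency plus block size divided by transfer rate. The sketch below uses the latencies from the slide and the 10 MBytes per second rate for 100Mbit Ethernet. The 8 KB block size and the 100 MBytes per second gigabit rate are assumptions made here for illustration, and 1 MByte is taken as 1,000,000 bytes.

```python
def block_transfer_ms(block_bytes: int, latency_ms: float, rate_mbytes_per_s: float) -> float:
    """Estimate one-way time to ship a block: latency + size / transfer rate."""
    if rate_mbytes_per_s <= 0:
        raise ValueError("rate_mbytes_per_s must be positive")
    transfer_ms = block_bytes / (rate_mbytes_per_s * 1_000_000) * 1000
    return latency_ms + transfer_ms


BLOCK_BYTES = 8192  # assumed database block size (8 KB), not stated in the slides

# 100Mbit Ethernet: latency and transfer rate as given on the slide.
fast_ethernet_ms = block_transfer_ms(BLOCK_BYTES, latency_ms=0.20, rate_mbytes_per_s=10.0)

# Gigabit Ethernet: latency from the slide; the 100 MBytes/s rate is an assumption.
gigabit_ms = block_transfer_ms(BLOCK_BYTES, latency_ms=0.07, rate_mbytes_per_s=100.0)

print(f"100Mbit: {fast_ethernet_ms:.3f} ms per block")
print(f"Gigabit: {gigabit_ms:.3f} ms per block")
```

This simple model ignores protocol overhead and contention, so it gives only a lower bound on per-block transfer time.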
100Mbit vs. Gigabit
Conclusions RAC scaled at 90% on a simple workload RAC scaled consistently at 55+% on a complex workload There is an impact to query performance depending on your workload You can recover from failures in less than 1 minute When configured correctly a RAC cluster can scale, perform and be highly available.
Q & A: Questions and Answers