Reading an AWR Report
Jonathan Lewis
jonathanlewis.wordpress.com
www.jlcomp.demon.co.uk
Who am I?
Independent Consultant; 31+ years in IT, 26+ using Oracle
Strategy, Design, Review, Briefings, Educational, Trouble-shooting
Oracle author of the year 2006
Select Editor's choice 2007
UKOUG Inspiring Presenter 2011
ODTUG 2012 Best Presenter (d/b)
UKOUG Inspiring Presenter 2012
UKOUG Lifetime Award (IPA) 2013
Member of the Oak Table Network
Oracle ACE Director
O1 visa for USA
Jonathan Lewis © 2014
Why? (When you could be reading ADDM)
Maybe you're using Statspack (not licensed / using SE)
But it's not tracking a user problem
Don't you have application instrumentation?
Can't you enable extended trace?
What about sampling through ASH?
What's slowing me down?

Good run (30 sec):
NAME                           VALUE
session logical reads        151,302
CPU used by this session       2,744
DB time                        3,054
physical reads                     3

NAME                      WAITS  WAIT_SEC  AVG_CS  MAX_CS
db file sequential read       3      0.02     .57       5

Bad run (3:19):
NAME                           VALUE
session logical reads        151,302
CPU used by this session       3,038
DB time                       19,924
physical reads                79,184

NAME                      WAITS  WAIT_SEC  AVG_CS  MAX_CS
db file sequential read  79,184    166.29     .21       5

Even very good instrumentation of key processes may not tell you WHY the process varies in performance beyond the fact that "the load on the machine was different". Sometimes the only way to find out why MY process is slow is to scan "the system" for other processes that are denying me the resources I need.
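As a quick sanity check (not from the slides - just arithmetic on the figures above), the AVG_CS column can be re-derived from the totals: total wait seconds divided by wait count, expressed in centiseconds.

```python
# Re-derive average wait time (centiseconds) from the "bad run" totals
# quoted above: 79,184 waits taking 166.29 seconds in all.
def avg_wait_cs(total_wait_sec: float, waits: int) -> float:
    """Average wait per event, in centiseconds (as AWR's AVG_CS column)."""
    return round(total_wait_sec / waits * 100, 2)

print(avg_wait_cs(166.29, 79_184))  # 0.21 -- matches the AVG_CS column
```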
What's slowing me down?

Argh! (14:50):
NAME                           VALUE
session logical reads        151,302
CPU used by this session       3,401
DB time                       88,858
physical reads               139,623

NAME                      WAITS  WAIT_SEC  AVG_CS  MAX_CS
db file sequential read 139,623    851.70     .61       5

Variation in performance:
 0:30       Everything is in the Oracle cache
 3:20       A lot of data is read by Oracle, but it's in the SAN cache
14:50       Most blocks come from the SAN, from disk reads
(not shown) It's all on disc, and the discs are overloaded
What's the source?
Time gives us a fourth dimension. Trouble-shooting means picking and aggregating the right slice.

System:  v$sysstat, v$system_event, v$sysmetric_history, v$sys_time_model
Session: v$sesstat, v$session_event, v$sess_time_model
Service: v$service_stats, v$service_event, v$servicemetric_history, v$serv_mod_act_stats
         (from 10g, and important for consolidation, esp. 12c)
Cursors: Cursor 1, Cursor 2, Cursor 3, Cursor 4
Also: Latches, Segments, Locks, etc.
AWR 12c - report sections:
Load Profile; Instance Efficiency Percentages; Top 10 Foreground Events by Total Wait Time; Wait Classes by Total Wait Time; Host CPU; Instance CPU; IO Profile; Memory Statistics; Cache Sizes; Shared Pool Statistics; Time Model Statistics; Operating System Statistics; Operating System Statistics - Detail; Foreground Wait Class; Foreground Wait Events; Background Wait Events; Wait Event Histogram (x4); Service Statistics; Service Wait Class Stats; SQL ordered by (x10); Key Instance Activity Stats; Instance Activity Stats; Instance Activity Stats - Absolute Values; Instance Activity Stats - Thread Activity; IOStat by Function summary; IOStat by Filetype summary; IOStat by Function/Filetype summary; Tablespace IO Stats; File IO Stats; Buffer Pool Statistics; Checkpoint Activity; Instance Recovery Stats; MTTR Advisory; Buffer Pool Advisory; PGA Aggr Summary; PGA Aggr Target Stats; PGA Aggr Target Histogram; PGA Memory Advisory; Shared Pool Advisory; SGA Target Advisory; Streams Pool Advisory; Java Pool Advisory; Buffer Wait Statistics; Enqueue Activity; Undo Segment Summary; Undo Segment Stats; Latch Activity; Latch Sleep Breakdown; Latch Miss Sources; Mutex Sleep Summary; Parent Latch Statistics; Child Latch Statistics; Segments by Logical Reads (14); In-Memory Segments by (x4); Dictionary Cache Stats; Library Cache Activity; Memory Dynamic Components; Memory Resize Operations Summary; Memory Resize Ops; Process Memory Summary; SGA Memory Summary; SGA breakdown difference; Replication System Resource Usage; Replication SGA Usage; GoldenGate Capture; GoldenGate Capture Rate; GoldenGate Apply Reader; GoldenGate Apply Coordinator; GoldenGate Apply Server; GoldenGate Apply Coordinator Rate; GoldenGate Apply Reader and Server Rate; XStream Capture; XStream Capture Rate; XStream Apply Reader; XStream Apply Coordinator; XStream Apply Server; XStream Apply Coordinator Rate; XStream Apply Reader and Server Rate; Table Statistics by DML Operations; Table Statistics by Conflict Resolutions; Replication Large Transaction Statistics; Replication Long Running Transaction Statistics; Streams CPU/IO Usage; Streams Capture; Streams Capture Rate; Streams Apply; Streams Apply Rate; Buffered Queues; Buffered Queue Subscribers; Rule Set; Persistent Queues; Persistent Queues Rate; Persistent Queue Subscribers; Resource Limit Stats; Shared Servers Activity; Shared Servers Rates; Shared Servers Utilization; Shared Servers Common Queue; Shared Servers Dispatchers; init.ora Parameters; init.ora Multi-Valued Parameters; ASH report; ADDM Report
Strategy
State your intention
Know the environment
Know the application
Check:
  Load profile
  Top N waits
  Time model
  OSStats
Follow the clues
Meaning (11.2.0.3)

Load Profile              Per Second    Per Transaction    Per Exec    Per Call
~~~~~~~~~~~~           ---------------  ---------------   ----------  ----------
      DB Time(s):                9.5               0.2         0.00        0.03
       DB CPU(s):                2.3               0.1         0.00        0.01
       Redo size:        2,337,609.3          57,989.6
   Logical reads:           50,080.9           1,242.4
   Block changes:            9,547.3             236.8
  Physical reads:            3,644.2              90.4
 Physical writes:              324.6               8.1
      User calls:              279.6               6.9
          Parses:              607.2              15.1
     Hard parses:               18.4               0.5
W/A MB processed:                3.4               0.1
          Logons:                7.4               0.2
        Executes:            3,087.4              76.6
       Rollbacks:                0.0               0.0
    Transactions:               40.3

This is a little busy (2M of redo per second is not trivial). 3,000 executes per second suggests code that might be doing lots of little bits of activity rather than using efficient bulk-processing statements. Parses > user calls tells us that users are (probably) calling pl/sql procedures to get the work done and the pl/sql does more parsing and executing internally.

Key point - if you don't know what the numbers mean, how do you interpret them as "high", "acceptable" or "low"? E.g. what does "Physical reads:" actually cover? We'll be coming back to this set of figures later.
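The "pl/sql doing the work" observation can be made quantitative from the Load Profile alone. A small sketch (not from the slides, just ratios of the per-second figures above):

```python
# Per-second figures from the 11.2.0.3 Load Profile above.
executes_per_sec   = 3_087.4
user_calls_per_sec =   279.6
parses_per_sec     =   607.2

# Each user call drives roughly 11 internal executions - consistent
# with pl/sql procedures doing many statements per call.
ratio = round(executes_per_sec / user_calls_per_sec, 1)
print(ratio)                                 # ~11 executes per user call
print(parses_per_sec > user_calls_per_sec)   # True: more parses than user calls
```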
Meaning (11.2.0.4)

Load Profile                  Per Second   Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~~~~            --------------- ---------------  ---------  ---------
             DB Time(s):             0.1              0.2        0.00       0.10
              DB CPU(s):             0.1              0.1        0.00       0.06
      Redo size (bytes):        39,133.7         64,071.1
  Logical read (blocks):         1,936.5          3,170.5
          Block changes:           106.1            173.7
 Physical read (blocks):            36.0             59.0
Physical write (blocks):            12.3             20.1
       Read IO requests:             5.0              8.2
      Write IO requests:             2.1              3.4
           Read IO (MB):             0.3              0.5
          Write IO (MB):             0.1              0.2
             User calls:             1.3              2.1
           Parses (SQL):            47.8             78.2
      Hard parses (SQL):            14.5             23.8
     SQL Work Area (MB):             2.0              3.3
                 Logons:             0.1              0.2
         Executes (SQL):           216.2            354.0
              Rollbacks:             0.0              0.0
           Transactions:             0.6

11.2.0.4 tries to be a little more informative in the Load Profile. "Physical read" is counting blocks, not read requests. We still have to check whether the Read IO figures cover JUST the data blocks, or include control file, redo log, etc.
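One benefit of the extra 11.2.0.4 lines is that blocks and requests can be compared directly. A quick derived figure (my calculation, not from the slides):

```python
# From the 11.2.0.4 Load Profile above: blocks read per second vs
# read requests per second.
phys_read_blocks_per_sec = 36.0
read_io_requests_per_sec = 5.0

# ~7 blocks per read request tells us multiblock reads dominate;
# at an 8KB block size that's roughly a 57KB average read.
blocks_per_request = round(phys_read_blocks_per_sec / read_io_requests_per_sec, 1)
print(blocks_per_request)  # 7.2
```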
Meaning (detail 1)

Instance Activity Stats  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> Ordered by statistic name

Statistic                                  Total      per Second     per Trans
physical read IO requests              2,616,625           289.8           7.2
physical read bytes              269,581,369,344    29,853,086.2     740,571.5
physical read total IO requests        2,709,009           300.0           7.4
physical read total bytes        292,207,811,584    32,358,708.7     802,729.0
physical read total multi block        1,970,223           218.2           5.4
physical reads                        32,907,882         3,644.2          90.4
physical reads cache                  32,850,160         3,637.8          90.2
physical reads cache prefetch         30,247,730         3,349.6          83.1
physical reads direct                     57,722             6.4           0.2
physical reads direct (lob)                    0             0.0           0.0
physical reads direct temporary           53,528             5.9           0.2
physical reads prefetch warmup                 0             0.0           0.0

Lots of different types of reading go into the final "Physical reads" figure (and some of the reading mechanisms don't).

Parameter Name                  Begin value    End value (if different)
db_file_multiblock_read_count   16
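The component statistics above reconcile exactly, which is a useful habit to check when reading any AWR excerpt. A verification sketch (my cross-check, using only the figures in the table):

```python
# Figures from the Instance Activity Stats excerpt above.
physical_reads        = 32_907_882
physical_reads_cache  = 32_850_160
physical_reads_direct = 57_722
physical_read_bytes   = 269_581_369_344
block_size            = 8_192            # Std Block Size: 8K

# Cache reads plus direct reads account for the whole figure...
assert physical_reads_cache + physical_reads_direct == physical_reads
# ...and "physical read bytes" is exactly physical reads * block size.
assert physical_read_bytes == physical_reads * block_size
print("figures reconcile")
```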
Meaning (detail 2)

Instance Activity Stats  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> Ordered by statistic name

Statistic                                  Total      per Second     per Trans
table fetch by rowid                  69,345,113         7,679.2         190.5
table fetch continued row             14,419,654         1,596.8          39.6
table scan blocks gotten             121,994,885        13,509.6         335.1
table scan rows gotten             6,946,407,158       769,236.0      19,082.6
table scans (direct read)                      0             0.0           0.0
table scans (long tables)                 89,451             9.9           0.3
table scans (rowid ranges)                   890             0.1           0.0
table scans (short tables)             3,552,179           393.4           9.8
index fast full scans (full)               5,265             0.6           0.0

The db_file_multiblock_read_count = 16, so for a tablescan, prefetch = 15/16 of the blocks.
Almost all the reads are tablescan reads.

Cache Sizes                  Begin        End
~~~~~~~~~~~               ----------  ----------
       Buffer Cache:        10,240M     10,240M   Std Block Size:     8K
   Shared Pool Size:        14,336M     14,336M       Log Buffer: 71,232K
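The "prefetch = 15/16" expectation can be tested against the detail-1 figures: if every cached read were a full 16-block multiblock read, 15 of every 16 blocks would be counted as prefetch. A check (my arithmetic, not from the slides):

```python
# From the detail-1 slide: cached reads and the prefetch subset.
physical_reads_cache    = 32_850_160
physical_reads_prefetch = 30_247_730

# Observed fraction: 0.92 -- close to, but below, the 15/16 = 0.9375
# we'd see if every cached read were a full 16-block multiblock read
# (single-block reads pull the fraction down).
frac = round(physical_reads_prefetch / physical_reads_cache, 2)
print(frac)  # 0.92
print(15 / 16)  # 0.9375
```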
Meaning (detail 3)

Buffer Pool Statistics  DB/Inst: xxxxxxxx/xxxxxxxx  Snaps: nnnnn-nnnnn
                                                               Free  Writ    Buffer
     Number of  Pool  Buffer        Physical     Physical      Buff  Comp      Busy
P      Buffers  Hit%  Gets          Reads        Writes        Wait  Wait     Waits
--- ----------  ----  ------------  -----------  ----------  ------ ------ ---------
D    1,209,600    93   451,789,797   32,849,808   2,715,974       0      0  2.87E+06
          -------------------------------------------------------------

Segments by Table Scans  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> Total Table Scans: 94,716
-> Captured Segments account for 28.7% of Total

           Tablespace                       Subobject  Obj.        Table
Owner      Name        Object Name          Name       Type        Scans  %Total
---------- ----------  -------------------- ---------- ----- -----------  ------
xxxxxx_AA  xxxxxx_CC   XXXXXXXXXXXXXXX      P20140709  TABLE      25,990   27.44
SYS        SYSTEM      I_USER2                         INDEX         603     .64
SYS        SYSTEM      I_OBJ2                          INDEX         350     .37
xxxxxx_BB  xxxxxx_DAT  YYYYYYYYYYYYYYY                 TABLE          92     .10
xxxxxx_CC  xxxxxx_CC   PK_HOLIDAYS                     INDEX          32     .03

Adding cache doesn't (often) stop tablescans. This set of segment stats is actually limited to long tablescans and index fast full scans.
Meaning (detail 4)

SQL ordered by Executions  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> %CPU - CPU Time as a percentage of Elapsed Time
-> %IO  - User I/O Time as a percentage of Elapsed Time
-> Total Executions: 27,880,039
-> Captured SQL account for 15.7% of Total

                                              Elapsed
  Executions  Rows Processed  Rows per Exec  Time (s)   %CPU   %IO  SQL Id
------------  --------------  -------------  --------  -----  ----  -------------
   9,585,991       9,585,801            1.0     353.5   93.1     0  2m8jv10f2nmn9
Module: ABCDEF
begin :con := "MY_VPD_POLICY"."GET_PREDICATE_X"(:sn, :on); end;

   7,604,028       7,603,739            1.0     307.9   93.6     0  3r4tsa6t7c9z1
begin :con := "MY_VPD_POLICY"."GET_PREDICATE_Y"(:sn, :on); end;

   1,527,965       1,527,366            1.0     509.6   95.2     0  d3xqdu13ufh4r
SELECT SOME_COLUMN FROM SOME_TABLE WHERE X_CODE=:B2 AND Y_CODE=:B1

While checking for a possible source of 3,500,000 short tablescans we can see:
a) the SQL that is inside pl/sql, which might account for 1.5M of them, possibly due to poor indexing;
b) 17M executions of calls used by Row Level Security (Virtual Private Database), which might be reduced by making the security functions context sensitive;
c) only a small fraction of the total execution count comes from the SQL we've captured - have we lost a lot of SQL executions in the interval?
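Execution counts this large make the per-execution cost worth deriving, since tiny per-call times multiplied by millions of calls become significant. A sketch (my arithmetic from the figures above):

```python
# Top statement above: the GET_PREDICATE_X VPD call.
elapsed_sec = 353.5
executions  = 9_585_991

# Average cost per execution in microseconds - individually cheap,
# but ~17M VPD calls in the interval still add up to minutes of time.
avg_us = round(elapsed_sec / executions * 1e6, 1)
print(avg_us)  # ~37 microseconds per call
```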
Begin at the Beginning

DB Name      DB Id       Instance    Inst Num  Startup Time     Release     RAC
------------ ----------- ----------- --------  ---------------  ----------  ---
xxxxxxxx     1242197329  xxxxxxxxx          3  08-Mar-14 05:04  11.2.0.3.0  YES

Host Name        Platform                          CPUs  Cores  Sockets  Memory(GB)
---------------- --------------------------------  ----  -----  -------  ----------
xxxxxxxxxxxx     Linux x86 64-bit                    24     24        4      125.47

              Snap Id  Snap Time            Sessions  Curs/Sess
            ---------  -------------------  --------  ---------
Begin Snap:     nnnnn  19-Mar-14 12:30:15        234        6.5
  End Snap:     nnnnn  19-Mar-14 15:00:45        260        7.6
   Elapsed:              150.50 (mins)
   DB Time:            1,434.42 (mins)

Available CPU time = CPU count * elapsed time in seconds = 24 * 9,030 = 216,720
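It's worth also comparing DB Time against that CPU budget to gauge overall load. A short sketch (my derivation, using only the header figures above):

```python
# From the report header above.
cpu_count   = 24
elapsed_min = 150.50
db_time_min = 1_434.42

available_cpu_sec = cpu_count * elapsed_min * 60
db_time_sec       = db_time_min * 60

print(available_cpu_sec)  # 216,720 CPU-seconds available in the interval
# DB time is about 40% of the theoretical CPU capacity - busy, but the
# machine is not necessarily CPU-bound (DB time includes wait time).
print(round(db_time_sec / available_cpu_sec * 100, 1))
```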
"Top 5" Top 5 Timed Foreground Events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Avg wait % DB Event Waits Time(s) (ms) time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- resmgr:cpu quantum 79,070 34,560 437 40.2 Scheduler DB CPU 20,722 24.1 db file scattered read 2,160,917 8,987 4 10.4 User I/O read by other session 1,935,872 8,617 4 10.0 User I/O PX Nsq: PQ load info query 20,437 4,027 197 4.7 Other Resmgr: cpu quantum says: "you're out of CPU", but we know we're not using much CPU. Jonathan Lewis © 2014
Service Stats

Service Statistics  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> ordered by DB Time
                                                          Physical      Logical
Service Name                  DB Time (s)   DB CPU (s)    Reads (K)    Reads (K)
----------------------------  -----------  -----------  -----------  -----------
xxxxxx_AA_P1                       36,455       12,341          912      141,262
xxxxxx_BB_P1                       18,145        1,886       29,359       59,667
SYS$USERS                          12,600          675            1        1,247
HH_P2                               7,968        5,712        2,629      247,643
xxxxxx_CC_P1                        4,390            5            0          326
xxxxxx_DD_P1                        1,855            2            0          142
xxxxxx_EE_P1                        1,499            2            0           83
xxxxxx_FF_P1                        1,476            2            0           83
xxxxxx_GG_P1                        1,310            2            0          105
HH_P1                                 334           90            1        1,079
          -------------------------------------------------------------

The answer is resource manager being used to limit services - one service is trying to go far beyond its CPU allocation while the machine is not otherwise CPU-loaded.
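A useful cross-check when reading the service breakdown is that the per-service DB CPU figures should sum back to (roughly) the instance-wide DB CPU in the Time Model. My arithmetic, from the table above:

```python
# DB CPU (s) per service, top to bottom of the Service Statistics table.
service_cpu = [12_341, 1_886, 675, 5_712, 5, 2, 2, 2, 2, 90]

total = sum(service_cpu)
# 20,717 - agrees, to rounding, with the 20,722 s of "DB CPU" reported
# in the Time Model section, so no significant service is missing.
print(total)
```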
Service Wait Class Stats

Service Wait Class Stats  DB/Inst: xxxxxxxx/xxxxxxxxx  Snaps: nnnnn-nnnnn
-> Wait Class info for services in the Service Statistics section.
-> Total Waits and Time Waited displayed for the following wait classes:
   User I/O, Concurrency, Administrative, Network
-> Time Waited (Wt Time) in seconds

Service Name   User I/O   User I/O  Concurcy  Concurcy  Admin      Admin    Network    Network
               Total Wts  Wt Time   Total Wts Wt Time   Total Wts  Wt Time  Total Wts  Wt Time
xxxxxx_AA_P1     326316      1027     225878      989          0        0    2359008       27
xxxxxx_BB_P1    3821827     16129      38917        6          0        0          0        0
SYS$USERS          7774         8      48816       54          4        1       2219        0
HH_P2            434126      1696      75466       37          0        0     342376        4
...

The stats reported for services are very limited - we can't add up the various times reported for the xxxxxx_AA_P1 service to get its total time. This could be a problem for users of 12c multi-tenant, where each PDB gets its own service: there's not a lot of information in the AWR report about how a service is handling its load.
Time Model

Time Model Statistics  DB/Inst: xxxxxxxx/xxxxxxxx3  Snaps: nnnnn-nnnnn
-> Total time in database user-calls (DB Time): 86064.9s

Statistic Name                                    Time (s)  % of DB Time
sql execute elapsed time                          59,190.6          68.8
DB CPU                                            20,721.7          24.1
PL/SQL execution elapsed time                     20,238.3          23.5
parse time elapsed                                19,954.0          23.2
hard parse elapsed time                           18,331.1          21.3
failed parse elapsed time                            145.8            .2
hard parse (sharing criteria) elapsed time            88.5            .1
inbound PL/SQL rpc elapsed time                       26.2            .0
connection management call elapsed time               26.0            .0
PL/SQL compilation elapsed time                       25.8            .0
repeated bind elapsed time                             4.4            .0
sequence load elapsed time                             3.6            .0
hard parse (bind mismatch) elapsed time                1.9            .0
DB time                                           86,064.9
background elapsed time                            3,397.3
background cpu time                                  769.4

Time model stats don't often highlight anything interesting - but the parse times shown here are unusually high.
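"Unusually high" can be pinned down: parsing accounts for nearly a quarter of all DB time, and almost all of it is hard parsing. My arithmetic from the figures above (remember the time model lines overlap, so percentages don't sum to 100):

```python
db_time_sec        = 86_064.9
parse_elapsed      = 19_954.0
hard_parse_elapsed = 18_331.1

print(round(parse_elapsed / db_time_sec * 100, 1))       # 23.2% of DB time parsing
print(round(hard_parse_elapsed / db_time_sec * 100, 1))  # 21.3% in hard parses alone
```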
Instance Efficiency

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.37        Redo NoWait %:  100.00
            Buffer Hit %:      92.73     In-memory Sort %:  100.00
            Library Hit %:     99.31         Soft Parse %:   96.97
         Execute to Parse %:   80.33          Latch Hit %:   99.75
Parse CPU to Parse Elapsd %:   52.36      % Non-Parse CPU:   49.91

Statistic                                     Total    per Second   per Trans
--------------------------------  -----------------  ------------  ----------
parse count (describe)                           12           0.0         0.0
parse count (failures)                      118,538          13.1         0.3
parse count (hard)                          166,302          18.4         0.5
parse count (total)                       5,483,250         607.2        15.1
parse time cpu                            1,037,917         114.9         2.9
parse time elapsed                        1,982,122         219.5         5.5
execute count                            27,880,039       3,087.4        76.6

Percentages are rarely useful because they hide scale. But if you know the scale (and the derivation) you may spot a surprising result and check further details.
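"Know the derivation" is the key phrase: the headline percentages can be recomputed from the raw counts in the same report. A sketch using the standard AWR derivations (the formulas are my statement of them, not quoted from the slides):

```python
# Raw counts from the Instance Activity Stats excerpt above.
parse_total   = 5_483_250    # parse count (total)
parse_hard    = 166_302      # parse count (hard)
executes      = 27_880_039   # execute count
parse_cpu     = 1_037_917    # parse time cpu
parse_elapsed = 1_982_122    # parse time elapsed

soft_parse_pct = round((1 - parse_hard / parse_total) * 100, 2)
exec_to_parse  = round((1 - parse_total / executes) * 100, 2)
parse_cpu_pct  = round(parse_cpu / parse_elapsed * 100, 2)

print(soft_parse_pct)  # 96.97 -> matches Soft Parse %
print(exec_to_parse)   # 80.33 -> matches Execute to Parse %
print(parse_cpu_pct)   # 52.36 -> matches Parse CPU to Parse Elapsd %
```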
Puzzle

DB Name      DB Id       Instance  Inst Num  Startup Time     Release     RAC
xxxxx        0123456789  xxxxxx           1  17-Aug-14 04:26  11.2.0.3.0  NO

Host Name    Platform                     CPUs  Cores  Sockets  Memory(GB)
xxxxxxxx     AIX-Based Systems (64-bit)    248     62             1229.00

Cache Sizes                  Begin        End
~~~~~~~~~~~               ----------  ----------
       Buffer Cache:       258,048M    258,048M   Std Block Size:       8K
   Shared Pool Size:        20,480M     20,480M       Log Buffer: 550,172K

SGA breakdown difference  DB/Inst: xxxxxx/xxxxxx  Snaps: nnnnnn-nnnnnn
-> ordered by Pool, Name

Pool    Name                       Begin MB    End MB   % Diff
java    free memory                 3,554.2   3,554.2     0.00
large   free memory                 3,524.6   3,524.6     0.00
shared  private strands               216.1     216.1     0.00

Big SGAs are allocated in 512MB granules, which can lead to a lot of "lost" memory.
Anomaly

Instance Activity Stats  DB/Inst: xxxxxx/xxxxxx  Snaps: nnnnnn-nnnnnn
-> Ordered by statistic name

Statistic                                    Total    per Second   per Trans
HSC Compressed Segment Block Cha                 0           0.0         0.0
HSC Heap Segment Block Changes          23,426,994      25,991.9        65.5
Heap Segment Array Inserts                 134,301         149.0         0.4
LOB table id lookup cache misses                 5           0.0         0.0
Number of read IOs issued                  110,605         122.7         0.3

Statistic                                    Total    per Second   per Trans
IMU CR rollbacks                             3,816           0.5         0.0
IMU Flushes                                148,709          20.6         0.0
IMU Redo allocation size               339,798,648      47,026.3        44.1
IMU commits                              7,425,789       1,027.7         1.0
IMU contention                               2,296           0.3         0.0
IMU pool not allocated                     154,567          21.4         0.0
IMU recursive-transaction flush              1,536           0.2         0.0
IMU undo allocation size            16,757,286,384   2,319,116.2     2,175.8
IMU- failed to get a private str           154,567          21.4         0.0

Why are there no IMU stats? (Why isn't private redo being used?)

11.2.0.4 splits Instance Activity Stats into "Key Instance Activity Stats" and "Other Instance Activity Stats".