Published by Eustacia Horton. Modified over 6 years ago.
1
Part II: Wait Events and the Geeks Who Love Them
Kyle Hailey
2
Wait Events
Copyright 2006 Kyle Hailey
3
And the Geeks Who Love Them
4
In this Presentation:
Introduction to Waits
Tuning Methodology
Plan of Action
Statspack, AWR or OEM for Collecting Data Based on Waits
Using Waits to Solve Bottlenecks
New and Exciting Tuning Methodology
Hold on to the seat of your pants, because we are also looking at Oracle OEM 10g
5
Database is Hung!
Everybody blames the database
Yet 9 out of 10 DBAs agree it's not the database
How do you prove it to management?
On the off chance it's the database, what do we do?

The database is so often blamed for performance problems. Problem points are easily lost in today's multilayer applications. These applications are often used by hundreds or thousands of users, and performance problems can mean large monetary losses, not to mention problems that could arise at hospitals or in other critical industries. As a DBA it is important to be able, first, to determine whether the database is running correctly or not. This first step of assessing that the database is actually "OK" is often missed, and the debate rages for days, weeks or months inside companies as to where the problem is. Being able to pin the problem clearly on the application when the application is actually the problem can save companies a lot of time and money. If the database is the problem, then being able to quickly isolate and resolve it becomes paramount to the DBA.
6
Database: Guilty until proven innocent
Where is the problem?
How do you find out?
In the user's mind, the application can do no wrong, so it must be the database.
7
Oracle's Defense
After years of false accusations, Oracle took action and created a defense system:
WAIT EVENTS
To the rescue

Oracle is the best instrumented database on the market, which can save time and money on development and tuning. After years of false accusations, Oracle has created a systematic defense strategy to not only prove its innocence but also redirect the blame toward the real culprit, the application. Well, sometimes.
8
Oracle Instrumentation
[Diagram: instrumented areas: CPU, Locks, Redo, Lib Cache, Buffer Cache, Network, IO]
9
Waits
Introduced in v7
Revolutionized tuning
Changed from ratio guesswork to an empirical measure of time lost to bottlenecks
10g added the crucial addition: ASH
Not only identifies bottlenecks but also
  Who (session, service, package, procedure)
  Where (CPU, Wait)
  When (time)
  What (SQL statement)
10
Tuning Methodology
Machine
  Run queue (CPU): check other applications, reduce CPU usage or add CPUs
  Paging: reduce memory usage or add memory
Oracle
  Waits + CPU > available CPU: tune waits
  CPU 100%: tune SQL
  Else (low waits, available CPU): it's the application

In order to tune an Oracle database, the first step in a complete analysis is to verify the machine, because there are a couple of factors that can only be clearly determined by looking at machine statistics. Those two factors are
  Memory usage
  CPU usage
Memory and CPU problems have telltale repercussions on Oracle performance statistics and thus can often be deduced from the Oracle statistics alone, but it is clearer to start with the machine statistics.

CPU
For CPU, we check the "run queue", which is the number of processes that are ready to run but have to wait for the CPU. A machine free of CPU contention would have a run queue of 0 and could have CPU usage near 100% at the same time. High CPU usage can be a good sign that the system is being fully utilized. A high run queue, on the other hand, indicates that there is more demand for CPU than CPU power available. A high run queue can be detected via Oracle statistics by looking at ASH data and seeing whether more sessions are marked as being on the CPU than the number of CPUs available. For example, if on average there are 4 active sessions on CPU in the ASH data and only 2 CPUs, then the machine is CPU bound. Solutions for a high run queue are either to add more processors or to reduce the load on the CPU. If the CPU is mainly being used by Oracle, then that means tuning the application and its SQL queries.

Memory
If the machine is paging out to disk, there is a memory crunch, which can dramatically slow down Oracle. Oracle will sometimes indicate a paging problem through a spike in "latch free" waits, but the only guaranteed method of diagnosing this problem is looking at the machine statistics. Machines have statistics for paging and free memory. Often there can be some free memory even when there is paging out, because machines start paging out before memory is completely filled. Solutions if the machine is paging out are either to add more memory or to reduce memory usage. Memory usage can be reduced by reducing Oracle cache sizes or reducing Oracle session memory usage.

We are going to concentrate here on WAITS
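The decision flow on this slide can be sketched as a small function. This is only an illustrative sketch under stated assumptions, not the author's tooling: the function name, the argument names, the "run queue above 0" threshold and the ordering of the Oracle checks are all assumptions. Here `aas_cpu` and `aas_wait` stand for the average number of active sessions on CPU and in waits, for example as sampled from ASH.

```python
def diagnose(run_queue, paging_out, aas_cpu, aas_wait, cpu_count):
    """Return a list of tuning hints following the slide's checklist.

    All names and thresholds here are illustrative assumptions.
    """
    hints = []

    # Machine checks come first.
    if run_queue > 0:
        hints.append("CPU: check other applications, reduce CPU usage or add CPUs")
    if paging_out:
        hints.append("Memory: reduce memory usage or add memory")

    # Oracle checks: compare average active sessions with the available CPUs.
    if aas_cpu >= cpu_count:
        hints.append("Oracle: CPU bound, tune SQL")
    elif aas_cpu + aas_wait > cpu_count:
        hints.append("Oracle: tune waits")
    else:
        hints.append("Low waits and available CPU: it's the application")
    return hints


if __name__ == "__main__":
    # The notes' example: 4 sessions on CPU on average, but only 2 CPUs.
    for hint in diagnose(run_queue=3, paging_out=False,
                         aas_cpu=4, aas_wait=0.5, cpu_count=2):
        print(hint)
```

The CPU-bound test is placed before the waits test so that the notes' example (4 sessions on CPU, 2 CPUs) lands on "tune SQL"; that ordering is one reading of the slide rather than something it states.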
11
Dependable Tuning Strategy
Determine AAS (Average Active Sessions):
  Run Statspack or AWR Report
    Top 5 Timed Events (~50 lines down from top)
    Need available CPU: Elapsed Time, CPU_COUNT
  ASH Report: ashrpt.sql
  OEM 10g Performance Page does everything
If there is a wait bottleneck, tune the wait
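As a rough illustration of "determine AAS", the time reported for each entry in Top 5 Timed Events can be divided by the report's elapsed time and the total compared with CPU_COUNT. The sketch below assumes times in seconds; the function names and the sample figures are hypothetical and are not taken from any real report.

```python
def average_active_sessions(timed_events, elapsed_seconds):
    """Convert per-event time totals (seconds) into average active sessions."""
    return {event: seconds / elapsed_seconds
            for event, seconds in timed_events.items()}


def has_bottleneck(timed_events, elapsed_seconds, cpu_count):
    """True when total load (CPU + waits) exceeds the available CPUs."""
    total_aas = sum(timed_events.values()) / elapsed_seconds
    return total_aas > cpu_count


if __name__ == "__main__":
    # Hypothetical "Top 5 Timed Events" figures for a one-hour report.
    top_events = {
        "CPU time": 1800,
        "db file sequential read": 3600,
        "log file sync": 5400,
    }
    for event, aas in average_active_sessions(top_events, 3600).items():
        print(f"{event:30s} {aas:5.2f}")
    print("bottleneck:", has_bottleneck(top_events, 3600, cpu_count=2))
```

With these made-up numbers the total load is 3 average active sessions on a 2-CPU machine, so the largest wait ("log file sync" here) would be the first thing to tune.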
12
Tuning Methodology Graphics
[Screenshot callouts: Maximum CPU line; ADDM report (folder with checkmark); Run ADDM Now; Run ASH Report; Top Activity; CPU Used; Wait Classes]
Relax, it's the application
Get to Work!
13
Waits beyond OEM
OEM identifies wait problems
Provides solutions with ADDM, sometimes
But
  What do you do when ADDM isn't sufficient?
  What do you do if you don't have OEM 10g?
Need to know about waits
  How they work
  How to analyze them
14
Wait Areas
We'll discuss waits in these logical database areas:
  Buffer Cache
  I/O
  Locks
  Library Cache
  Redo
  SQL*Net
15
Wait Tree
[Diagram: tree of wait categories rooted at Waits. Labels: Write IO, Read IO, Rollback, Buffer Busy, Free lists, IO, Cache Latches, IO Read, Buffer Cache, Library Cache, Shared Pool, Lock, TX Row Lock, TX ITL Lock, HW Lock, Redo, Log File, Log Buffer, Log File Sync, SQL Net]
16
v$active_session_history
When ADDM fails, or we don't have ADDM, we can collect the necessary information from v$active_session_history:
  Session (user, service, client, package, procedure, etc.)
  SQL statement
  For IO-related waits: CURRENT_OBJ#, CURRENT_FILE#, CURRENT_BLOCK#
  BLOCKING_SESSION
  P1, P2, P3
17
What are P1, P2, P3?
Each wait has 3 parameters: P1, P2, P3
  Give detailed information
  Meaning is different for each wait
  Meanings are defined in V$EVENT_NAME

  select name, parameter1, parameter2, parameter3
  from v$event_name;

  col parameter1 for a10
  col parameter2 for a10
  col parameter3 for a10
  select parameter1, parameter2, parameter3
  from v$event_name
  where name = '&1';
18
Wait Arguments Example
  select name, parameter1, parameter2, parameter3
  from v$event_name;

  NAME                           PARAMETER1      PARAMETER2      PARAMETER3
  latch: cache buffers chains    address         number          tries
  free buffer waits              file#           block#          set-id#
  buffer busy waits              file#           block#          class#
  latch: redo copy               address         number          tries
  log buffer space
  switch logfile command
  log file sync                  buffer#
  db file sequential read        file#           block#          blocks
  enq: TM - contention           name|mode       object #        table/partition
  undo segment extension         segment#
  enq: TX - row lock contention  name|mode       usn<<16 | slot  sequence
  row cache lock                 cache id        mode            request
  library cache pin              handle address  pin address     100*mode+namesp
  library cache load lock        object address  lock address    100*mask+namesp
  pipe put                       handle address  record length   timeout
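One way to use this listing is to keep the parameter names next to the raw P1, P2, P3 values when reading wait samples. The sketch below is a hypothetical helper, not part of any Oracle API: the dictionary holds a hand-copied subset of the rows above, and `label_wait` is an assumed name.

```python
# Parameter names hand-copied from the v$event_name listing (a subset only).
EVENT_PARAMETERS = {
    "buffer busy waits": ("file#", "block#", "class#"),
    "db file sequential read": ("file#", "block#", "blocks"),
    "row cache lock": ("cache id", "mode", "request"),
    "log file sync": ("buffer#", "", ""),
}


def label_wait(event, p1, p2, p3):
    """Pair raw P1/P2/P3 values with their meaning for the given event.

    Unknown events fall back to the generic names p1, p2, p3;
    parameters with no defined meaning are dropped.
    """
    names = EVENT_PARAMETERS.get(event, ("p1", "p2", "p3"))
    return {name: value for name, value in zip(names, (p1, p2, p3)) if name}


if __name__ == "__main__":
    print(label_wait("buffer busy waits", 4, 129, 1))
    print(label_wait("log file sync", 55, 0, 0))
```

In practice the dictionary would be loaded from v$event_name rather than typed in by hand, since the meanings differ between versions.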
19
Wait Analysis requires P1, P2, P3
Of the top 30 wait events, 8 can be solved without ASH:
  free buffer waits
  log buffer space
  log file switch (archiving needed)
  log file switch (checkpoint incomplete)
  log file switch completion
  log file sync
  switch logfile command
  write complete waits
The rest need SQL_ID and/or P1, P2, P3
20
Difficult Waits
These 4 waits have multiple causes:
Latches
  p2 = latch # (p1 = address, p3 = tries)
Locks
  p1 = lock type and mode (p2 = id1, p3 = id2)
Buffer Busy
  p3 = block class# (p1 = file, p2 = block; in 9i p3 was the bbw type)
Row Cache Lock
  p1 = cache id (p2 = mode, p3 = request)
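For locks, P1 packs the lock type and the requested mode into one number. The sketch below assumes the commonly described layout, in which the two high-order bytes hold the two-letter lock type as ASCII and the low 16 bits hold the mode; the function name is hypothetical, and the layout should be checked against your own version before relying on it.

```python
def decode_enqueue_p1(p1):
    """Split an enqueue wait's P1 into (lock type, requested mode).

    Assumed layout: bits 24-31 and 16-23 are the two ASCII letters of
    the lock type, bits 0-15 are the lock mode.
    """
    lock_type = chr((p1 >> 24) & 0xFF) + chr((p1 >> 16) & 0xFF)
    mode = p1 & 0xFFFF
    return lock_type, mode


if __name__ == "__main__":
    # 1415053318 is 0x54580006: 'T' (0x54), 'X' (0x58), mode 6.
    print(decode_enqueue_p1(1415053318))
    # 1414332419 is 0x544D0003: 'T' (0x54), 'M' (0x4D), mode 3.
    print(decode_enqueue_p1(1414332419))
```

Under that layout the two sample values decode to a TX lock requested in mode 6 and a TM lock requested in mode 3.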
21
Wait Analysis
Find SQL waiting
  Most often the tuning answer lies in looking at what the application is doing, and changing it
Find extended wait information
  Parameter1, Parameter2, Parameter3
  Sometimes the wait events that are found are not in the documentation, and it takes some educated guesswork to figure out the problem
22
Waits we will Ignore
One thing that makes waits difficult is knowing which ones to look at and which ones to ignore.
  Background
  Idle
  Resource Manager
23
Background Processes
[Diagram: the SGA (Library Cache, Buffer Cache, Log Buffer) with background processes PMON, SMON, DBWR and LGWR, user sessions User1, User2 and User3, REDO Log Files and Data Files]
24
Background & Foreground
Background Processes
  DBWR
  LGWR
  PMON
  SMON
  Etc.
Foreground Processes
  SQL*Plus
  Pro*C
  SQL*Forms
  Oracle applications
Only interested in foreground waits
25
Background Waits
Avoid background waits
ASH:
  select … from v$active_session_history
  where session_type = 'FOREGROUND'
V$SESSION_WAIT joined to V$SESSION:
  select … from v$session s, v$session_wait w
  where w.sid = s.sid
    and s.type = 'USER'
26
Idle Waits
Filtered out of ASH by default
10g
  where wait_class != 'Idle'
  select name from v$event_name where wait_class = 'Idle';
9i
  Create a list
    with the documentation
    from the list created in 10g
    from STATS$IDLE_EVENTS in Statspack
Example: SQL*Net message from client
27
Parallel Query Waits
Filter out
  Parallel Query wait events are unusable
  Same waits are both idle and waits
  Parallel Query waits start with 'PX' or 'KX'
    PX Deq: Par Recov Reply
    PX Deq: Parse Reply
28
Resource Manager Waits
Resource manager throttles user
  Creates wait
  Obfuscates problems
10g
  select name from v$event_name where wait_class = 'Scheduler';
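The "waits we will ignore" rules (background sessions, idle waits, parallel query waits and resource manager waits) can be combined into one predicate. This is a minimal sketch that assumes each sample is a dictionary with `session_type`, `wait_class` and `event` keys modelled on the ASH column names; the helper name and the constants are illustrative assumptions.

```python
# Wait classes and event prefixes that the slides say to ignore.
IGNORED_WAIT_CLASSES = {"Idle", "Scheduler"}
IGNORED_EVENT_PREFIXES = ("PX", "KX")


def is_interesting(sample):
    """Return True for samples worth analysing.

    A sample is kept only if it comes from a foreground session, is not
    in an ignored wait class, and is not a parallel query (PX/KX) wait.
    Samples with no event (sessions on CPU) are kept.
    """
    if sample.get("session_type") != "FOREGROUND":
        return False
    if sample.get("wait_class") in IGNORED_WAIT_CLASSES:
        return False
    event = sample.get("event") or ""
    return not event.startswith(IGNORED_EVENT_PREFIXES)


if __name__ == "__main__":
    samples = [
        {"session_type": "FOREGROUND", "wait_class": "User I/O",
         "event": "db file sequential read"},
        {"session_type": "BACKGROUND", "wait_class": "System I/O",
         "event": "log file parallel write"},
        {"session_type": "FOREGROUND", "wait_class": "Idle",
         "event": "SQL*Net message from client"},
        {"session_type": "FOREGROUND", "wait_class": "Other",
         "event": "PX Deq: Parse Reply"},
    ]
    for sample in samples:
        if is_interesting(sample):
            print(sample["event"])
```

Only the first of the four hypothetical samples survives the filter; the other three are a background wait, an idle wait and a parallel query wait.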
29
RAC Waits
RAC waits are certainly interesting but will be covered outside of this presentation.
  You are on your own
  Check the documentation
  If you are not using RAC then no worries
10g
  select name from v$event_name where wait_class = 'Cluster';
9i
  RAC and OPS waits usually contain the word "global"
30
Latches
Protect areas of memory from concurrent use
Lightweight locks
  Bit in memory
  Atomic processor call
  Fast and cheap
  Gone if memory is lost
Often used in cache coherency management
  Changes to a data block
Exclusive, generally
  Shared reading has been introduced for some latches
31
Finding Latches
"latch free"
  Covers many latches; find the problem latch by
    select name from v$latchname where latch# = p2;
  OR
  Find the highest sleeps in the Statspack latch section
In 10g, important latches have their own wait event
  latch: cache buffers chains
  latch: shared pool
  latch: library cache
32
Enqueues aka Locks
"enqueue" wait covers all locks pre-10g
Protect data against concurrent changes
Lock info written into data structures
  Block headers
  Data blocks
  Written in cache structures
Shareable in compatible modes
33
Locks 10g
10g breaks all enqueues out

  EVENT                            WAIT CLASS
  enq: HW - contention             Configuration
  enq: TM - contention             Application
  enq: TX - allocate ITL entry     Configuration
  enq: TX - index contention       Concurrency
  enq: TX - row lock contention    Application
  enq: UL - contention             Application
34
Row Cache Lock
Need p1 to see the cache type

  SQL> select cache#, parameter from v$rowcache;

  CACHE# PARAMETER
       1 dc_free_extents
       4 dc_used_extents
       2 dc_segments
       0 dc_tablespaces
       5 dc_tablespace_quotas
       6 dc_files
       7 dc_users
       3 dc_rollback_segments
       8 dc_objects
      17 dc_global_oids
      12 dc_constraints
35
Row Cache Lock Statspack
Dictionary Cache Stats for DB: ORA9  Instance: ora9  Snaps: 1 -2
->"Pct Misses" should be very low (< 2% in most cases)
->"Cache Usage" is the number of cache entries being used
->"Pct SGA" is the ratio of usage to allocated size for that cache

Columns: Cache, Get Requests, Pct Miss, Scan Reqs, Pct Miss, Mod Reqs, Final Usage
Caches listed (values omitted): dc_object_ids, dc_objects, dc_segments, dc_tablespaces, dc_usernames, dc_sequences, dc_users
36
Additional Support
AWR tables (on disk for 7 days by default)
  DBA_HIST_ACTIVE_SESS_HISTORY: 1 in 10 ASH samples
  DBA_HIST_SEG_STAT: good for ITL and buffer busy waits
  DBA_HIST_SYSTEM_EVENT: important for getting avg wait times
  DBA_HIST_SQLSTAT: SQL execution deltas
  DBA_HIST_SYSMETRIC_SUMMARY: statistics avg, max, min
Metric tables (in-memory deltas)
  V$EVENTMETRIC
37
All Events over 7 days
  select count(*), event
  from (
         select event
         from DBA_HIST_ACTIVE_SESS_HISTORY
         where sample_time < ( select min(sample_time)
                               from v$active_session_history )
         union all
         select event
         from v$active_session_history
       )
  group by event
  order by event
  /

  COUNT(*) EVENT
       342 Data file init write
         3 L1 validation
         3 LGWR wait for redo copy
         4 Log file init write
       200 PX Deq Credit: send blkd
        22 SGA: allocation forcing component growth
         3 SQL*Net break/reset to client
         1 SQL*Net more data to client
        14 Streams AQ: qmn coordinator waiting for slave to start
      3284 buffer busy waits
         2 buffer deadlock
        74 buffer exterminate
       780 control file parallel write
         9 control file sequential read
     12674 db file parallel write
      1537 db file scattered read
      3831 db file sequential read
        41 db file single write
         8 direct path read
        31 direct path write
        47 direct path write temp
         5 enq: CF - contention
         3 enq: CI - contention
       805 enq: FB - contention
       944 enq: HW - contention
         1 enq: IM - contention for blr
       476 enq: RO - fast object reuse
        32 enq: SQ - contention
        34 enq: TC - contention
     18972 enq: TM - contention
      1851 enq: TX - allocate ITL entry
        90 enq: TX - contention
       402 enq: TX - index contention
     11587 enq: TX - row lock contention
      2278 enq: UL - contention
      1962 free buffer waits
        31 inactive session
         4 kksfbc child completion
      1069 latch free
         1 latch: In memory undo latch
      1071 latch: cache buffers chains
       241 latch: cache buffers lru chain
        43 latch: library cache
         9 latch: library cache pin
         1 latch: shared pool
         7 library cache load lock
        94 library cache lock
        93 library cache pin
        99 local write wait
       555 log buffer space
       879 log file parallel write
       340 log file switch (checkpoint incomplete)
        98 log file switch completion
       453 log file sync
        50 null event
       121 os thread startup
        53 rdbms ipc reply
      1236 read by other session
         2 reliable message
        12 row cache lock
       180 wait for a undo record
        28 wait for stopper event to be increased
       127 wait list latch free
        25 write complete waits
38
Example ASH Query
  select ash.p1,
         ash.p2,
         CURRENT_OBJ#||' '||o.object_name objn,
         o.object_type otype,
         CURRENT_FILE# filen,
         CURRENT_BLOCK# blockn,
         ash.SQL_ID,
         w.class||' '||to_char(ash.p3) block_type
  from v$active_session_history ash,
       ( select rownum class#, class from v$waitstat ) w,
       all_objects o
  where event = 'buffer busy waits'
    and w.class#(+) = ash.p3
    and o.object_id (+) = ash.CURRENT_OBJ#
    and ash.session_state = 'WAITING'
    and ash.sample_time > sysdate - &1/(60*24)
  order by sample_time

  col p1 for
  col p2 for
  col p3 for
  select ash.p1, ash.p2, ash.p3, ash.SQL_ID, count(*) cnt, w.class
  from v$ash ash,
       ( select rownum class#, class from v$waitstat ) w
  where event = 'buffer busy waits'
    and w.class# = ash.p3
  group by ash.p1, ash.p2, ash.p3, ash.SQL_ID, w.class;

  P1 P2 OBJN OTYPE FILEN BLOCKN SQL_ID BLOCK_TYPE
  BBW_INDEX_VAL_I INDEX avm49ys4k7t6 data block 1
  BBW_INDEX_VAL_I INDEX wqps1quuxqr4 data block 1
  BBW_INDEX_VAL_I INDEX wqps1quuxqr4 data block 1
  BBW_INDEX_VAL_I INDEX wqps1quuxqr4 data block 1
39
Average Wait Times Historic
  select btime,
         (time_ms_end - time_ms_beg) / nullif(count_end - count_beg, 0) avg_ms
  from (
         select to_char(s.BEGIN_INTERVAL_TIME, 'DD-MON-YY HH24:MI') btime,
                total_waits count_end,
                time_waited_micro/1000 time_ms_end,
                Lag (e.time_waited_micro/1000)
                  OVER ( PARTITION BY e.event_name ORDER BY s.snap_id ) time_ms_beg,
                Lag (e.total_waits)
                  OVER ( PARTITION BY e.event_name ORDER BY s.snap_id ) count_beg
         from DBA_HIST_SYSTEM_EVENT e,
              DBA_HIST_SNAPSHOT s
         where s.snap_id = e.snap_id
           and e.event_name = '&1'
         order by begin_interval_time
       )
  order by btime;

  column avg_ms for 999,

  BTIME            AVG_MS
  08-JAN-08 01:
  08-JAN-08 02:
  08-JAN-08 03:
  08-JAN-08 04:
  08-JAN-08 05:
  08-JAN-08 06:
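The query works because DBA_HIST_SYSTEM_EVENT stores cumulative counters, so the average wait for an interval is the change in time divided by the change in wait count between consecutive snapshots. The same arithmetic is sketched below with hypothetical data; the function name and the tuple layout are assumptions made for illustration.

```python
def interval_averages(snapshots):
    """Average wait time (ms) per snapshot interval from cumulative counters.

    snapshots: list of (label, total_waits, time_waited_micro), oldest first.
    Returns (label, avg_ms) for every snapshot after the first; avg_ms is
    None when no waits occurred in the interval, mirroring nullif(..., 0).
    """
    results = []
    for previous, current in zip(snapshots, snapshots[1:]):
        label, count_end, micro_end = current
        _, count_beg, micro_beg = previous
        waits = count_end - count_beg
        if waits == 0:
            results.append((label, None))
        else:
            results.append((label, (micro_end - micro_beg) / 1000 / waits))
    return results


if __name__ == "__main__":
    # Hypothetical cumulative counters for one event at three snapshots.
    snaps = [
        ("01:00", 100, 500_000),
        ("02:00", 300, 2_500_000),
        ("03:00", 300, 2_500_000),
    ]
    for label, avg_ms in interval_averages(snaps):
        print(label, avg_ms)
```

Unlike the SQL, which returns a row with a null average for the first snapshot, this sketch simply skips it, since there is no earlier snapshot to subtract.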
40
Avg Wait Times Now
  select en.name,
         10 * time_waited / nullif(wait_count, 0) avg_ms,  -- time_waited is in centiseconds
         wait_count
  from v$eventmetric e,
       v$event_name en
  where e.event# = en.event#
    and en.name like '%&1%';

  NAME                       AVG_MS  WAIT_COUNT
  db file sequential read
  db file scattered read
  db file parallel write
41
Object Translation
  Object ID
  File # and Block #
42
Wait Interface Weaknesses
Logons
  EM 10g shows these on the perf page
  Time model helps
    V$SYS_TIME_MODEL: connection management call elapsed time
I've had problems
Paging/Memory issues
CPU starvation
Null Events
Bugs: read external table reports CPU

From Tanel Poder:

Advanced Oracle Troubleshooting Guide: When the wait interface is not enough [part 1]

Welcome to read my first real post on this blog! If I ever manage to post any more entries, the type and style of content will be pretty much as this one: some Oracle problem diagnosis and troubleshooting techniques with some OS and hardware touch in it. Mmm… internals ;-) Nevertheless I am also a fan of systematic approaches and methods so I plan to propose some less known OS and Oracle techniques for reducing guesswork in advanced troubleshooting even further.

Ok, to the topic. Troubleshooting. Troubleshooting = finding out what is going on. This post covers one unexplained issue I once had with Oracle external tables, which eventually turned out to be a problem with Oracle wait interface instrumentation. I used some of these "what's going on" techniques to find out… what's going on.

Solaris 10 x64 / Oracle

I worked on a project for which I needed to read data through an external table from a Unix pipe (ever wanted to load compressed flat file contents to Oracle on-the-fly? ;-)

I created a Unix pipe:

  $ mknod /tmp/tmp_pipe p

I created an Oracle external table, reading from that pipe:

  Connected to:
  Oracle Database 10g Enterprise Edition Release Production
  With the Partitioning, OLAP and Data Mining options

  USERNAME INSTANCE_NAME HOST_NAME VER STARTED SID SERIAL# SPID
  TANEL SOL solaris

  CREATE DIRECTORY dir AS '/tmp';

  Directory created.

  CREATE TABLE ext (
    value number
  )
  ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY dir
    ACCESS PARAMETERS (
      FIELDS TERMINATED BY ';'
      MISSING FIELD VALUES ARE NULL
      (value)
    )
    LOCATION ('tmp_pipe')
  )
  ;

  Table created.

  select * from ext;

So far so good… unfortunately this select statement never returned any results. As it turned out later, the gunzip over remote ssh link which should have fed the Unix pipe with flat file data, had got stuck. Without realizing that, I approached this potential session hang condition with first obvious check, a select from V$SESSION_WAIT:

  select sid, event, state, seq#, seconds_in_wait, p1, p2, p3
  from v$session_wait
  where sid = 470;

  SID EVENT                   STATE             SEQ# SECONDS_IN_WAIT P1 P2 P3
  470 db file sequential read WAITED KNOWN TIME
  /
  470 db file sequential read WAITED KNOWN TIME
  470 db file sequential read WAITED KNOWN TIME

The STATE and SECONDS_IN_WAIT columns in V$SESSION_WAIT say we have been crunching the CPU for last two hours, right? (as WAITED… means NOT waiting on any event, in this case the EVENT just shows the last event on which we waited before getting on CPU)

Hmm.. let's check it out:

  $ prstat -p 724
  PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
  724 oracle M 533M sleep :00:00 0.0% oracle/1

prstat reports that this process is currently in sleep state, is not using CPU and has used virtually no CPU during its 2-hour "run" time! Let's check with ps (which is actually a quite powerful tool):

  $ ps -o user,pid,s,pcpu,time,etime,wchan,comm -p 724
  USER PID S %CPU TIME ELAPSED WCHAN COMMAND
  oracle S : :18:08 ffffffff8135cadc oracleSOL01

ps also confirms that the process 724 has existed for over 2 hours 18 minutes (ELAPSED), but has only used roughly 1 second of CPU time (TIME). The state column "S" also indicates the sleeping status.

So, either Oracle V$SESSION_WAIT or standard Unix tools are lying to us. From above evidence it is pretty clear that it's Oracle who's lying (also, in cases like that, lower-level instrumentation always has a better chance to know what's really going on at the upper level than vice versa).

So, let's use truss (or strace on Linux, tusc on HP-UX) to see if our code is making any system calls or is sleeping within a system call…

  $ truss -p 724
  read(14, 0xFFFFFD7FFD6FDE0F, ) (sleeping…)

Hmm, as no followup is printed to this line, it looks like the process is waiting for a read operation on a file descriptor 14 to complete. Which file is this fd 14 about?

  $ pfiles 724
  724: oracleSOL01 (LOCAL=NO)
  ...snip...
  14: S_IFIFO mode:0644 dev:274,2 ino: uid:100 gid:300 size:0
      O_RDONLY|O_LARGEFILE
      /tmp/tmp_pipe
  …snip…

So from here it's already pretty obvious where the problem is. There is no data coming from the tmp_pipe. This led me to check what was my gunzip doing on the other end of the pipe and it was stuck, in turn waiting for ssh to feed more data into it. And ssh had got stuck due some network transport issue.

The baseline is that you can rely on low-level (OS) tools to identify what's really going on when higher level tools (like Oracle wait interface) provide weird or contradicting information, in this case the Oracle wait interface was not recording external table read wait events. I reported this info to Oracle people and I think it has been filed as a bug by now.

This was only a simple demo, identifying a pretty clear case of a session hang, however with use of a pretty intrusive tool (I would not attach truss to a busy production instance process without thinking twice). However there are other options. In the next part of this guide (when I manage to write it) I will deal with more complex problems like what to do when the session is not reporting significant waits and is spinning heavily on CPU. Using Oracle and Unix tools it is quite easy to figure out the execution profile of a spinning server process, even without connecting to Oracle at all (do I hear pstack, mdb and stack tracing? ;-)

As I've just started blogging, I would appreciate any feedback, including about things like blog layout, font sizes, readability, understandability etc. Also I think it will take few days before I manage to post the Part 2 of this troubleshooting guide. Thank you for your patience reading through this :-)
43
Dependable Tuning Strategy
Run Statspack/AWR report
  Top 5 Timed Events (~50 lines down from top)
  Need available CPU: Elapsed Time, CPU_COUNT
OEM 10g Performance Page does everything!
If OEM doesn't solve the problem
  Query v$active_session_history directly
44
Summary
Waits make tuning easy
  Check machine health
  Tune waits
  Tune CPU
    Tune SQL
    Change application architecture
Use
  OEM 10g
  Statspack/AWR, S/ASH
Ignore Background, Idle, Resmgr, PQO
Use ASH if OEM fails
See for more info