Root Cause and Other DBA Urban Legends

Slides:



Advertisements
Similar presentations
Overview of performance tuning strategies Oracle Performance Tuning Allan Young June 2008.
Advertisements

Exadata Distinctives Brown Bag New features for tuning Oracle database applications.
A comparison of MySQL And Oracle Jeremy Haubrich.
Common Mistakes Developers Make By Bryan Oliver SQL Server Mentor at SolidQ.
Complete CompTIA A+ Guide to PCs, 6e Chapter 5: Logical Troubleshooting © 2014 Pearson IT Certification
10 Things Not To Do With SQL SQLBits 7. Some things you shouldn’t do.
Chapter 14 Chapter 14: Server Monitoring and Optimization.
Chapter 8: Systems analysis and design
Sofia, Bulgaria | 9-10 October SQL Server 2005 High Availability for developers Vladimir Tchalkov Crossroad Ltd. Vladimir Tchalkov Crossroad Ltd.
Problem Determination Your mind is your most important tool!
LiveCycle Data Services Introduction Part 2. Part 2? This is the second in our series on LiveCycle Data Services. If you missed our first presentation,
1 Robert Wijnbelt Health Check your Database A Performance Tuning Methodology.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Oracle9i Performance Tuning Chapter 1 Performance Tuning Overview.
The Persistence of Memory (Issues) Brian Hitchcock OCP 8, 8i, 9i DBA Sun Microsystems NoCOUG Brian Hitchcock April.
1099 Why Use InterBase? Bill Todd The Database Group, Inc.
Performance Dash A free tool from Microsoft that provides some quick real time information about the status of your SQL Servers.
Views Lesson 7.
Siebel CRM Unicode Conversion – The DBA Perspective Brian Hitchcock OCP 8, 8i, 9i DBA Sun Microsystems DCSIT Technical.
Copyright © 2005, SAS Institute Inc. All rights reserved. JDBC, SAS and Multi-user Update Mitchel Soltys Manager Open Data Access.
Construction Planning and Prerequisite
Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems SunFed.
MISSION CRITICAL COMPUTING Siebel Database Considerations.
Oracle Applications 11i Concepts II Brian Hitchcock OCP 11i DBA -- OCP 10g DBA Sun Microsystems Brian Hitchcock.
1 Copyright © 2005, Oracle. All rights reserved. Following a Tuning Methodology.
Version Control and SVN ECE 297. Why Do We Need Version Control?
for all Hyperion video tutorial/Training/Certification/Material Essbase Optimization Techniques by Amit.
Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.
Same Plan Different Performance Mauro Pagano. Consultant/Developer/Analyst Oracle  Enkitec  Accenture DBPerf and SQL Tuning Training Tools (SQLT, SQLd360,
Ch. 31 Q and A IS 333 Spring 2016 Victor Norman. SNMP, MIBs, and ASN.1 SNMP defines the protocol used to send requests and get responses. MIBs are like.
You Inherited a Database Now What? What you should immediately check and start monitoring for. Tim Radney, Senior DBA for a top 40 US Bank President of.
1 Copyright © 2005, Oracle. All rights reserved. Oracle Database Administration: Overview.
Cosc 5/4765 Database security. Database Databases have moved from internal use only to externally accessible. –Organizations store vast quantities of.
Emdeon Office Batch Management Services This document provides detailed information on Batch Import Services and other Batch features.
INTRODUCTION TO DESKTOP SUPPORT
Fundamental of Databases
You Inherited a Database Now What?
Mobile Testing - Bug Report
Tips for SQL Server Performance and Resiliency
Cleveland SQL Saturday Catch-All or Sometimes Queries
Debugging Intermittent Issues
Get to know SQL Manager SQL Server administration done right 
Science Fair Tips PJAS Presentations.
Hitting the SQL Server “Go Faster” Button
Informatica PowerCenter Performance Tuning Tips
Debugging Intermittent Issues
Reporting Overview Business Goals Demystify the report menu
Introduction to Operating System (OS)
What is QuickBooks File Doctor The demands of the industry have taken a great leap with advancement in the technology. This advancement has caused various.
Auditing in SQL Server 2008 DBA-364-M
Types of SQL Commands Farrokh Alemi, PhD
Weird Stuff I Saw While ... Supporting a Java Team
From 4 Minutes to 8 Seconds in an Hour (or a bit less)
Hitting the SQL Server “Go Faster” Button
20 Questions with Azure SQL Data Warehouse
Complete CompTIA A+ Guide to PCs, 6e
LESSON 01 Hands-on Training Execution
Four Rules For Columnstore Query Performance
You Inherited a Database Now What?
SOFTWARE TECHNOLOGIES
Journey to the Cloud – Guidance and Lessons Learned
An Introduction to Debugging
The Troubleshooting theory
Database System Architecture
Security Principles and Policies CS 236 On-Line MS Program Networks and Systems Security Peter Reiher.
SQL Performance for DBAs
Lecture 34: Testing II April 24, 2017 Selenium testing script 7/7/2019
Using wait stats to determine why my server is slow
CSE 326: Data Structures Lecture #14
Vendor Software Lessons From Consulting Vendor Software.
Presentation transcript:

Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems brian.hitchcock@sun.com brhora@aol.com www.brianhitchcock.net SunFed DBA Brian Hitchcock October 19, 2006 Page 1

For DBA Issues I'm told that I must find root cause, Can't resolve the issue without root cause Must have testing environment that is Similar enough to production to recreate the issue And the fix Need extensive expertise to solve performance issues 10046 trace Wait events Extents size/number Physical spindles Rebuilding indexes Undocumented init parameters

What is My Experience Centralized DBA support team 2000+ databases Database users open cases with DBA team Cases randomly assigned to DBAs Not assigned based on experience or expertise My cases are probably typical of all the cases

My Experience Finding root cause costs money Building and maintaining test system(s) costs money Test system Needs to be able to recreate production issue Database, network, apps and web servers, load balancers Users around the world Root cause and test system(s) are only worthwhile If having them is less expensive than not having them. Perhaps these are Urban Legends? Root cause Test system(s) Specific expertise required

My Experiences Review major cases I worked on over the last 2 years Cases are presented in chronological order, oldest first For each case What worked, i.e. what was the real-world solution? Was root cause identified? Was a test system available to verify root cause and the solution? What extensive expertise was required?

Keeping Score For each case, look for 3 things Root cause Test system(s) Specific expertise Listed earlier (10046, wait events, extents, etc.) Were any of the following useful in resolving the issue? Open Mind (anything can and will happen) Communication (it's difficult) Simple solutions Any active db users? Reboot one or more components?

Case 1 – CRM Local Language Moved to UTF8, add local language columns SQL ran slower 10046 trace, explain plan Found the single step slowing execution Existing index wasn’t updated to have new local language column Recreated index Performance returned to normal

Case 1 – Three Things Root cause Test system identified the issue Existing indexes not optimal for new schema Corrected indexes fixed the problem Test system identified the issue Specific Expertise 10046 Explain plan Indexes

Case 2 – Storefront Slow App support team reports database slow Check database, 1-3 active users at most Active users gone in 1-2 seconds App users report 30 second response time Have App support person use app Watch database See single active user connect, complete, disconnect 2 seconds maximum Remainder of response time Web servers, load balancers etc. Application server was locked up – reboot fixed it

Case 2 – Three Things Root cause Test Environment Not identified Test Environment Exists, but not even close to production Complex production application environment Accurate test environment Expensive to build and maintain Hard to find qualified testers No one completely understands how production was built Impact of layoffs, outsourcing Hard to recreate what you don’t understand No specific expertise required

Case 3 – Contracts Slow App users report message ‘Problem contacting the database: No available resource…’ Sar shows CPU 90% idle Statspack snapshot for last hour shows top wait events 70% CPU time 20% db file read App server rebooted, performance returns to normal

Case 3 – Three Things Root cause unknown How to determine what caused the app server to hang? Test system does not have all the components of prod Don’t know why production locked up Tough to reproduce in test system Simple solutions – no specific expertise required Reboot app server Low-risk, cheap, quick No special experience required If it doesn’t work, time to put more resources into the issue

Case 4 – Storefront Users report app hanging Few or no active users in database Active users taking much longer than normal Isolate single SQL statement that is running slow Explain plans show good/bad execution plans Changes from good to bad at random times Various attempts to isolate give conflicting results We assume that the problem is stable, it isn’t 10053 trace, watch optimizer choose execution plan Optimizer changes from good to bad plan one column of one table has 2 versus 3 distinct values

Case 4 – Three Things Root cause App developers gone (outsourced) Fix? not sure… App developers gone (outsourced) No one knows how code really works Why are these values appearing and disappearing? App is inserting/deleting the rows with the 3rd distinct value This change causes optimizer to choose bad plan Bug or feature? Fix? Create rows so column always has 3 distinct values

Case 4 – Three Things (cont’d) No test system Test system has small percentage of production data App system too complex to reproduce Multiple strings, load balancers, app servers Hardware, support, upgrades, patches Difficult to get testing resources – people are expensive Changes are tested for functionality Changes aren’t tested for performance Until released in production Specific expertise Explain plan 10053 trace

Case 5 – Customer Demo Users report slow performance Database shows 4 inactive sessions Inactive sessions holding locks Kill these 4 sessions Performance is fine

Case 5 – Three Things Root cause – unknown Test system Why some sessions inactive and holding locks? App developers gone Easier to kill sessions once in a while Real root cause would be expensive to find Test system Don’t have test system – app developed on the cheap Don’t know how app works No specific expertise required Kill database sessions

Case 6 – Executive Dashboard User reports database is sorting dates incorrectly Phone on mute – laugh Users have really good drugs today! Connect to db Verify that indeed dates are being sorted backwards From NLS experience, look at actual bytes of dates There are extra bytes that I can’t explain Go back to the docs for DATE type Expand DATE format to include all the possible fields Suddenly it all makes sense!

Case 6 – What’s the Story? The dates are being correctly sorted – because They are from BC! OK, now what? This is sales data from Fred Flintstone? Is Barney setting up his own IT department? Check the basics App code uses 10g JDBC User says this was a requirement But the database is 9i – it just gets better and better User finds Metalink note 10g JDBC issues with 9i database Doesn’t describe our issue, but it’s a start…

Case 6 – What to Do? Setup a 10g database and test the app code against it No test system, in fact, no other system of any kind Critical executive reporting system, can’t be down for long Very limited disk space Why not upgrade existing 9i database? Upgrades can cause problems If this isn’t a 10g to 9i issue, why risk the db upgrade? Install 10g db, full export 9i db, import into 10g db User tests app code against 10g database No more dates from BC! Remove 9i database, expand 10g database to match User happy Executive’s reports don’t show sales to the Flintstones

Case 6 – Three Things Root cause Test system Expertise required Not clear Tried 10g database and issue went away Did we really identify the root cause? Was it a feature of 10g JDBC? A bug? Test system None Expertise required Minimal configuration control would have prevented this Hardest part was accepting what Oracle was telling us Dates from BC Not at all the way things are supposed to be

Case 7 – Customer Demo Users calls, application is slow Blocking processes, restart database twice in one day App still slow Ask user to connect and start using app I watch database for active users Over 20 seconds before new db user appears Db user is done and gone in less than a second User reports 30 seconds before results appear in browser I tried ping between the db server and the app server 200ms with packet loss Ask network group to investigate Network switch in data center has failed

Case 7 – Three Things Root cause Network switch – no question This was not a database problem But we could have wasted a lot of time with SQL tracing etc. Need to confirm that the database is the problem Then work the issue as a db tuning issue No test system available How would you reproduce the data center network? Expertise required Look for active users in database ping between db and app server

Case 8 – Data Warehouse User reports application slow User can’t truncate a table – command hangs Loading data into warehouse is also hanging Watch database while user starts load process During load User is using third-party app to do the load No active users performing inserts When no load is happening Truncate table runs quickly User contacts app vendor Vendor “changes some parameters for the load process” Data load runs normally, truncate table runs normally

Case 8 – Three Things Root cause Test system Expertise required Looking at the db showed no active inserts during data load Needed to verify that db wasn’t the issue Force vendor to perform Test system This was a dev database so there was a ‘test’ system Data load issues identified, resolved in dev environment Expertise required Quickly check for database problems Communicate with user to understand what is happening Politics – dealing with vendor Vendor really, really wants it to be a database problem

Case 9 – Software Registry User trying to truncate table Part of a data feed process Other users in database Performing deletes on same table Blocking truncate Kill delete processes Truncate still blocked Watch database processes Delete process starts every hour on the hour Cron job starts delete hourly Part of data feed process Why don’t the app owners know this?

Case 9 – Three Things Root cause Test system – exists Unknown Why was cron in place that user didn’t know about? Test system – exists But doesn’t have the production data feeds Expertise required None – basic DBA skills Not a database performance issue Basic app configuration was the issue

Case 10 – CRM Reporting User reports app is too slow Specific selects are taking 10-15 seconds STATSPACK snapshots are being taken Check snapshots over same time for last week Performance of the top SQL hasn’t changed User agrees that performance hasn’t changed Resources not available to make performance better When is a performance problem not a problem? When you don’t want to pay to fix it

Case 10 – Three Things Root cause Test system Expertise required User perception Test system None Expertise required Document that performance hasn’t changed Offer to work the issue if user will get funding

Case 11 – Configurator Users getting error What is causing this? Can’t allocate memory in shared pool What is causing this? Spend some time looking at the database Looks normal Reboot database to clear shared pool Watch database after restart Shared pool doesn’t fill up

Case 11 – Three Things Root cause Test system Expertise needed Don’t know why shared pool fills up Test system How to recreate problem in test system? Expertise needed Basic DBA skills Flush shared pool Problem didn’t reoccur One-time or infrequent set of circumstances

Case 12 – Pricing Database User reports slow performance Check database server CPU and iowait very high Watch SQL executing Explain plan for worst SQL shows full table scans No indexes on tables being scanned Test server Same tables do have indexes While recreating indexes All the needed indexes reappear Cron job for pricing data runs every two weeks Rebuilds indexes and loads data

Case 12 – Three Things Root cause Test system Expertise required What caused indexes to be dropped? Unknown Test system Used to verify that indexes were missing Can’t reproduce whatever dropped the indexes Expertise required Basic DBA skills

Case 13 – Contracts Database User reports error when selecting from table Invalid row id Some queries on same table don’t report error Looking at table and row ids Some queries use index No errors Other queries use table Specific rows have invalid row ids Rebuild table Create table as select * from <table>

Case 13 – Three Things Root cause Test system Expertise required Why did row ids become invalid? Unknown Test system Exists, can’t reproduce cause of invalid row ids Expertise required Basic DBA skills

Case 14 – Software Download User reports application slow in Dev environment SQL so slow, application server times out Sar shows CPU near 0% idle User executes problem SQL Watch database SQL completes, very slowly, database is working Explain plan shows full table scans Look at tables involved Production – last analyzed NULL, indexes, runs fast Test – recently analyzed, indexes, runs slow Drop stats, rebuild indexes, reanalyze Performance returns to normal

Case 14 – Three Things Root cause Test system Expertise required What happened to indexes, statistics in production? Unknown Test system Exists Doesn’t match production – which is correct? Can’t reproduce dropped indexes and/or statistics Expertise required Basic DBA skills

Case 15 – BRIO User reports SQL failing with error Watch database Can’t allocate memory in shared pool Watch database Same SQL runs most of the time Error occurs once in a while Solutions Reduce sort area size Free up more memory for shared pool Increase physical memory assigned to database Error occurred so infrequently Users decided not to change system

Case 15 – Three Things Root cause Test system Expertise required Unknown Just too many users for a brief time? Test system Yes, but how to reproduce this issue (user load)? Expertise required Basic DBA skills

Case 16 – Revenue App Automated alert Examine database Report log switching hung Database stopped for transactions Examine database Can’t find anything wrong Restart database Log switch works No further alerts No user problems

Case 16 – Three Things Root Cause Test system Expertise required Unknown Test system Exists, but no help Expertise required Basic DBA skills

Scoreboard

Observations How many times did we Un-scientific results Identify root cause? Have a complete test system? Need specific expertise? Un-scientific results Root cause – 2 out of 16 = 13% Test system – 3 out of 16 = 19% Specific expertise – indexes 1/16 = 6% Simple solutions worked 94% of the time

Conclusions Brian’s off his meds Urban legends? You do need Ignore him – this isn’t typical Urban legends? You don’t have to find root cause You don’t have to have a complete test environment You don’t need extensive expertise in specific DBA areas You do need DBA experience Open mind – strange stuff happens all the time Communication skills – you don’t know what’s happening Looking at training resources Why become more expert at things that are rarely needed? Why not become familiar with things you know little about?

Opinion Specific expertise can be obtained when needed Based on the results shown 94% of the time, don’t need expertise to be productive 15 of 16 cases solved with general DBA skills Some expertise can be automated or outsourced Tracing, wait events Focus on skills that can’t be put into a GUI Communication Ability to solve problems, not just database issues