Troubleshooting beyond what you understand
Or: How to figure out what’s broken so you can get some help from the real owner, because your stuff never breaks. Right?
Ryan McCauley
#492 – Phoenix 2016
Ryan McCauley
- VB6/VB.NET developer for 10 years
- Full-time DBA/T-SQL dev for 6 years
- Currently employed by Cable ONE as Data and Reporting Manager
- Microsoft Certified Professional (MCTS – SQL 2008 DBA)
- Active on Experts-Exchange and StackOverflow
- Twitter: @SQLRyan
- Blog: www.trycatchfinally.net
- Email: Ryan@KilaniMcCauley.com
SQL SATURDAY | #492 | PHOENIX 2016
It Was a Dark and Stormy Night
- Also, applications are broken somewhere…
- Talk about the rotating DNS issue
- Connections to SQL Server are intermittent
- Information comes in slowly
Agenda
Today:
- Ground rules
- Techniques
- Major symptoms
- Common confusion
- Next steps
Ground Rules
Ground Rules
- Never say “randomly”, say “intermittent”
  - “Intermittent” is something you don’t yet understand, but it always has a cause
  - When you say “random”, you’re saying you can’t own it because it’s not in your control
  - Given the same inputs, a computer’s behavior is always consistent
- It’s not just your components
  - Consider how they interact and what’s around them
  - See everything as something you own and can influence: you’re not helpless
Ground Rules
- Something always changed! Always!
  - Just maybe not on purpose
- Don’t take anything for granted!
  - Both in this class and in troubleshooting
  - Monitoring only has a single perspective
Techniques
Techniques
- Figure out what it’s not
  - If that’s true, what else would be true?
- Make the problem as small as possible
  - You need to isolate it to prove it, especially to others
  - Does it work at all? Where can you connect from?
- Myers-Briggs: S (focus on resolving the examples) vs. N (every example needs to fit the pattern first)
- Reproduce the problem in a second location that differs from the first in as many ways as possible
Techniques
- Is it consistent?
  - Can you find somewhere it’s not broken? Are the same sources always broken?
  - How long does it take when it does run? Is it quick or slow, and does the time vary?
  - Example: DAC FTP issue. 1 server takes 0.5 seconds, the other 7 take 12-14 seconds, even for a failed login
- Which components are shared vs. dedicated?
  - VMs can dramatically complicate this, because everything is shared and live migration is seamless
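One way to answer "does the time vary?" is to time the same connection attempt against each server and compare every result with the fastest one. The sketch below is an illustration, not something from the talk: the function names, the 5x threshold, and the server names in the usage note are assumptions.

```python
import socket
import time


def time_connect(host, port, timeout=5.0):
    """Time one TCP connection attempt; return (connected, seconds)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        connected = False
    return connected, time.perf_counter() - start


def slower_than_fastest(timings, factor=5.0):
    """Return the sorted names whose time exceeds factor x the fastest time.

    The factor of 5 is an arbitrary threshold chosen for illustration.
    """
    fastest = min(timings.values())
    return sorted(name for name, secs in timings.items() if secs > factor * fastest)
```

With timings shaped like the FTP example, such as `{"ftp1": 0.5, "ftp2": 12.0, "ftp3": 14.0}` (hypothetical names and values), `slower_than_fastest` returns `["ftp2", "ftp3"]`. Note that a failed attempt still produces a timing, which matters because the slow servers in the example were slow even for a failed login.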
Simplify everything!
- Things your service depends on
- How they get to your service
- Your service
- Customers
Major symptom – cheat sheet
Major Symptoms, part 1
- Never works
  - Firewall or app not listening
- Intermittently not accessible
  - What’s changing? Load balancer/cluster?
- Always slow but consistent
  - Hardware config/resource
  - Likely not load on shared components
Major Symptoms, part 2
- Intermittent slowness
  - Hardware bottleneck or shared resource?
- Unchanging or predictable
  - More likely configuration
- Shifting or unpredictable
  - More likely capacity somewhere
  - VM as shared component, harder to see the impact
Common Confusion
Common Confusion
- Login failures vs. firewall timeouts
  - Ever used TCPING?
  - Know common ports!
  - Firewall rules: when are they evaluated?
  - If somebody says “Kerberos”, it’s probably not
- Ping isn’t the same as making sure the path is open!
  - Ping doesn’t use a TCP port at all
  - Talk about subnets/VLANs
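The difference between a login failure and a firewall timeout shows up before any login is attempted, at the TCP connection itself. The following is a minimal sketch of a TCPING-style check using only Python's standard library. The function name `probe_tcp` and the three result labels are assumptions made for illustration, and the interpretation of each outcome is a rule of thumb, not a guarantee.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome.

    "open"    - something accepted the connection, so the path is open and
                any later failure (such as a rejected login) is above TCP.
    "refused" - the host answered but nothing is listening on that port
                (or a device actively rejected the connection).
    "timeout" - no answer at all, often a firewall silently dropping packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"error: {exc}"
```

For example, `probe_tcp("sqlhost.example.com", 1433)` (hypothetical host name; 1433 is the default port for a default SQL Server instance) tests the actual port the application uses, which a ping cannot do.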
Slightly less dark and stormy…
- Let’s approach our outage again
- Resolve the DNS issue
- If time, talk about either:
  - Firewall timeouts when we moved reporting servers (5 minutes)
  - Mis-aligned disks on clusters = consistently slow read times
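The slides do not spell out the details of the rotating DNS issue. Assuming it involved one name resolving to several addresses, only some of which accepted connections, a check like the sketch below would turn "intermittent" into a consistent per-address result. The function names are hypothetical and only Python's standard library is used.

```python
import socket


def resolve_all(hostname):
    """Return every distinct IP address the name currently resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def check_each_address(hostname, port, timeout=3.0):
    """Try a TCP connection to each resolved address; return {address: reachable}."""
    results = {}
    for address in resolve_all(hostname):
        try:
            with socket.create_connection((address, port), timeout=timeout):
                results[address] = True
        except OSError:
            results[address] = False
    return results
```

Connecting to each address directly, instead of to the name, removes the resolver's choice of address from the test. If one address is always `False` while the others are always `True`, the problem is no longer intermittent: it is one broken target behind a shared name.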
Next Steps
Next Steps
- Learn about what you don’t know
  - Shadowing, training, ask!
  - Specialized knowledge is not required, but it can help
  - If you don’t understand a concept, ask
- It’s not resolved until you understand why!
  - Root cause analysis is critical
  - Don’t let “root cause analysis” be “it’s not happening anymore” or “it resolved itself”
  - It’s not resolved until you know it’s not going to happen again!
Thanks for attending, and visit the sponsors!