Troubleshooting beyond what you understand
Or: How to figure out what’s broken so you can get some help from the real owner because your stuff never breaks. Right?
Ryan McCauley
#597 – Phoenix 2017
Ryan McCauley
VB6/VB.NET developer for 10 years
Full-time DBA/T-SQL dev for 6 years
Currently Data and Reporting Manager at CableONE
Microsoft Certified Professional (MCTS – SQL 2008 DBA)
Active on Experts-Exchange and StackOverflow
Twitter: @SQLRyan
Blog: www.trycatchfinally.net
Email: Ryan@KilaniMcCauley.com
SQL SATURDAY | #597 | PHOENIX 2017
It Was a Dark and Stormy Night
Also, applications are broken somewhere…
Talk about the rotating DNS (backup NIC issue)
Connections to SQL Server intermittent, but even
Information comes in slowly – learn from it
Agenda Today
Ground rules
Techniques
Major symptoms
Common confusion
Next steps
Ground Rules
Ground Rules
Never say “randomly”, say “intermittent”
It’s not just your components
Consider their interaction and what’s around
You can always influence

Intermittent is something you don't yet understand, but it always has a cause
When you say “random”, you're saying you can't own it because it's not in your control
Given the same inputs, the behavior of computers is always consistent
See everything as something you own and can influence – you’re not helpless
Ground Rules
Something always changed!
  Always! Just maybe not on purpose
Don’t take anything for granted!
  Both in this class and in troubleshooting
  Monitoring only has a single perspective
  Only trust what you’ve verified
Techniques
Techniques
Figure out what it’s not
  If that’s true, what else would be true?
Make the problem as small as possible
  Need to isolate it to prove it
  Does it work at all?
  Where can you connect from?

Myers-Briggs: S (focus on resolving the examples) vs. N (every example needs to fit a pattern first)
Small problem: you need to isolate it to prove it, especially to others
Reproduce the problem in a second location with as many differences as possible
It's hard to test a whole system; test the components instead
Techniques
Is it consistent?
  Can you find somewhere it’s not broken?
Shared vs. dedicated components
  VMs can dramatically complicate things

When it does run, how long does it take? Does the time vary? Is it quick or slow? Are the same sources always broken?
DAC FTP issue: 1 server takes 0.5 seconds, the other 7 take 12-14 seconds, even for a failed login
Which components are shared vs. dedicated?
VMs complicate this issue because everything is shared and live migration is seamless
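The timing questions above (does it vary, and by how much?) can be answered with a handful of repeated measurements. The following is a minimal Python sketch, not something from the original deck: `time_attempts` and `summarize` are hypothetical helper names, and `action` stands for whatever call you are testing (a login, a query, a file transfer).

```python
import statistics
import time


def time_attempts(action, attempts=5):
    """Run action() several times and return the elapsed seconds of each run.

    Failures are timed too: how long a failed login takes is itself a clue.
    `action` is a hypothetical zero-argument callable supplied by the caller.
    """
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            action()
        except Exception:
            pass  # a failure still has a duration worth recording
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to the numbers worth comparing across sources."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Run the same measurement from several sources and compare the summaries. A gap like the one in the DAC FTP example (about 0.5 seconds from one server, 12-14 seconds from the others) is a lead worth following back to whatever the slow sources have in common.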
Simplify everything!
How they get to your service
Things your service depends on
Your service
Customers
Major symptom – cheat sheet
Major Symptoms, part 1
Never works
  Firewall or app not listening
Intermittently not accessible
  What’s changing? Load balancer/cluster?
Always slow but consistent
  Hardware config/resource
  Likely not load on shared components
Major Symptoms, part 2
Intermittent/inconsistent slowness
  Hardware bottleneck or shared resource?
  Unchanging or predictable
    More likely configuration
  Shifting or unpredictable
    More likely capacity somewhere
  VM as shared component, harder to see the impact
Common Confusion
Common Confusion
Login failures vs. firewall timeouts
  Ever used TCPING?
  Know common ports!
Firewall rules – when are they evaluated?
People blame “Kerberos” as a catch-all

Ping isn’t the same as making sure the path is open! Ping uses ICMP, so it doesn’t use a TCP port at all
Talk about subnets/VLANs
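Since ping says nothing about a specific TCP port, a connection attempt to the port itself is the more direct check. The sketch below is a rough stand-in for a TCPING-style test using only Python's standard `socket` module; it is not from the original deck, and `check_tcp_port` is a hypothetical helper name.

```python
import socket


def check_tcp_port(host, port, timeout=3.0):
    """Classify a single TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", or "error: <detail>".
    The interpretations in the comments are rules of thumb, not guarantees.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # The path is open and something is listening. A failure after
            # this point is more likely a login problem than a network one.
            return "open"
    except ConnectionRefusedError:
        # The host answered but rejected the connection: often nothing is
        # listening on that port, or a device is actively rejecting it.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets,
        # or the host being unreachable.
        return "timeout"
    except OSError as exc:
        # Anything else, such as name resolution failure or no route.
        return f"error: {exc}"
```

For example, `check_tcp_port("sqlhost.example.com", 1433)` tests the default SQL Server port on a placeholder host name. An answer of "open" followed by a login failure points toward authentication; "timeout" points toward the network path.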
Slightly less dark and stormy…
Back to the beginning…
Resolve the DNS issue
If time, talk about either:
  Firewall timeouts when we moved reporting servers (5 minutes)
  Mis-aligned disks on clusters = consistently slow read times
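The deck does not spell out the details of the DNS issue, so the following is only a guess at a useful check: assuming a name might resolve to more than one address, resolving it repeatedly and counting the answers shows whether clients could be landing on different targets. This is a Python sketch using the standard `socket.getaddrinfo`; `resolve_addresses` is a hypothetical helper name.

```python
import socket


def resolve_addresses(hostname, rounds=5):
    """Resolve hostname several times and count how often each address appears.

    Assumption: the name may map to more than one address. A local resolver
    cache can hide rotation, so treat a single address as inconclusive.
    """
    seen = {}
    for _ in range(rounds):
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        for _family, _type, _proto, _canonname, sockaddr in infos:
            address = sockaddr[0]  # first element is the IP for IPv4 and IPv6
            seen[address] = seen.get(address, 0) + 1
    return seen
```

If the result contains more than one address, test each address on its own rather than the name, so that an intermittent failure becomes a consistent one tied to a specific target.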
Next Steps
Next Steps
Learn about what you don’t know
  Shadowing, training, ask!
  Specialized knowledge not required, but can help
  If you don’t understand a concept, ask
It’s not resolved until you understand why!
  Root cause analysis is critical

Don’t let “root cause analysis” be “it’s not happening anymore” or “it resolved itself”
It’s not resolved until you know it’s not going to happen again!
Thanks for attending! Please visit the sponsors and complete an evaluation.
The Sponsors!