Windows Azure SQL Database (WASD) Troubleshooting


1 Windows Azure SQL Database (WASD) Troubleshooting
Bob Ward, Principal Architect Escalation Engineer
I will assume basic SQL Server knowledge.
Speaker notes (2 mins): Assume SQL knowledge. We will fly fast. Questions at the end. The last slide has a link to download the slides and scripts.

2 My Goals for You Today
Prepare: What you learn about problems can help you prepare better to develop and deploy Azure database applications.
Prevent: Learning about what could go wrong can help you prevent problems from happening.
React: Learning this information can help you react quickly and efficiently should problems happen.
Speaker notes (1 min).

3 What Will We Cover Today
The Azure Troubleshooting Challenge
Troubleshooting Connectivity
WASD Errors
Query Performance
Practical Advice and Tips
Speaker notes (30 secs): Less on query performance, because troubleshooting it is very similar to the “box”.

4 The Azure Troubleshooting Challenge
WASD is a platform service (PaaS)
  This is not a VM running the SQL Server “box” (IaaS)
Multi-tenant platform
  You are sharing a SQL instance with other databases from other customers
  You are abstracted from the SQL Server instance, Windows, and the computer server
  Fewer admin tasks means lower TCO but also means less access
You are isolated to a specific database
  You have a logical server and a master, but most things are done in your database
  Most things are database scoped (Ex. DMVs)
We make decisions to maximize availability for all databases
  Application design may be required
The service can be updated far quicker than the “box” product
Speaker notes (3 mins).

5 WASD Connectivity Errors
Use a minimum 30 sec login timeout
WASD specific errors
  Firewall blocked in Azure
  Windows authentication not supported
  Invalid login: invalid account or password
  Denial of Service: after a large number of login failures
Network related errors
  “…Server not found”
  Connection Timeout Expired
  Msg 121 “.. The semaphore timeout period has expired”
You could lose connectivity
  Idle connections terminated after 30 minutes (Msg and 10054)
  We may forcibly disconnect on failover, on some errors, or on a change to MAXSIZE
  Retries are something you need to take into account
Speaker notes (3 mins): Only call out Firewall, Windows Auth, and DOS. Only call out Connection Timeout Expired. Talk about everything in the last column. Login timeout of 30 secs: you need this to be able to meet our SLA requirements. Note that I hit many of these network and connection errors while on a flight using gogo wifi. Using a 30 sec login timeout helped me connect in many cases.

6 Example Connectivity Errors
Slide callouts:
  Network latency
  Be sure to give this to support
  40XXX errors unique to WASD
  May see this after deleting a server
  After getting dropped on idle connection
Speaker notes (3 mins): Login is a two-phase protocol for SQL Server (pre-phase and post-phase), so there are more opportunities to time out or lose a connection when going from your client to the cloud. The last error occurred after I deleted the server in the portal and tried to connect. In this situation I could telnet and ping but couldn’t connect. If you use the portal you can see that the server is not listed. Consider this website.

7 Troubleshooting Connectivity
Configuration issues
  WASD Firewall and your firewall
  Allow Windows Azure Service
Is it our service or your internet?
  Windows Azure Management Portal
  Windows Azure Service Dashboard
  Windows Azure SQL Database Connectivity Troubleshooting Guide
General Tools to use
  ping.exe, telnet.exe, tracert.exe
  SQL Server 2012 Management Studio: free with SQL Server 2012 Express
  ostress.exe and sqlcmd.exe (username server name>)
  SQL Database Management Portal
New System Views (Event Tables) in the master database
  sys.event_log
  sys.database_connection_stats
Speaker notes (5 mins):
  tracert.exe: If I don’t see an MS network show up, how do I know whether it is a problem with the MS network or my network? My suggestion is to run tracert.exe when all is well. When you see a network like this one (msn is the keyword) you have hit the MS network successfully: xe ch1-16c-1b.ntwk.msn.net [ ]. The ntwk.msn.net at the end is the key.
  The Windows Azure Portal gets servers from subscriptions, but it gets the list of databases per server by hitting our clusters.
  The SQL Database Management Portal web page hits an endpoint in our clusters and is useful to test connectivity.
  Event Tables: I usually see about a 10-15 minute delay in entries in these tables, but I’ve seen it sometimes take up to 30 minutes.
  History tables: not real time.
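The tracert tip in the notes can be sketched as a small Python helper. This is only an illustration: it assumes the tracert output has already been captured as lines of text, the function name `first_ms_network_hop` is hypothetical, and the sample hop lines are made up except for the ch1-16c-1b.ntwk.msn.net host name quoted in the notes.

```python
from typing import Iterable, Optional

# Suffix the notes call out as the sign that a hop is on the MS network.
MS_NETWORK_SUFFIX = ".ntwk.msn.net"


def first_ms_network_hop(tracert_lines: Iterable[str]) -> Optional[str]:
    """Return the first host name ending in ntwk.msn.net, or None if absent."""
    for line in tracert_lines:
        for token in line.split():
            # tracert may wrap addresses in brackets; host names may end in a dot.
            host = token.strip("[]").rstrip(".").lower()
            if host.endswith(MS_NETWORK_SUFFIX):
                return host
    return None


# Hypothetical capture; only the last host name comes from the notes.
sample = [
    "  1     2 ms     1 ms     1 ms  router.example.net",
    "  2    15 ms    14 ms    16 ms  isp-gateway.example.net",
    "  3    40 ms    41 ms    39 ms  ch1-16c-1b.ntwk.msn.net",
]
print(first_ms_network_hop(sample))
```

Capturing a trace when all is well and comparing it with one taken during an incident shows whether the route still reaches a hop like this.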

8 Demo: Tools for Connectivity
Speaker notes (5 mins):
1. Show the Azure Portal and various config features. Talk about what it means for the WA portal to be successful.
2. Show the Service Dashboard and how to read history.
3. Run telnetme.cmd and show that a blank screen means it “worked”. Note that you need Ctrl+“]” to get out of this.
4. Show the picture of tracert and point out DNS resolution for the server name and which route point shows the MS network. Now use server atrc45thlv.database.windows.net, which is in the West Europe data center in Amsterdam.
5. Connect from the SQL Management Portal and talk about what that means. Show how to take the server name and use it to connect directly to it.
6. From the context of master, show Event Table entries for connectivity using the script in connectivity\event_tables.sql. Point out that if you connect from SSMS and get a login failure, you will not get a database context in the event table. Also point out that a row can be updated after it is first seen if something changes in that interval.
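Where telnet.exe is not available, the same "does the port open" check from the telnet step can be approximated in Python. This is a minimal sketch under assumptions: port 1433 is the usual SQL Server TCP port and is not stated in the slides, the function name `can_reach` is hypothetical, and the 30 second default mirrors the login timeout recommended earlier in the deck.

```python
import socket


def can_reach(host: str, port: int = 1433, timeout_s: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (replace the placeholder with your own logical server name):
# can_reach("yourserver.database.windows.net")
```

A True result only shows that the network path and firewall allow the TCP connection, just like a blank telnet screen; it says nothing about whether a login would succeed.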

9 WASD Errors
Full list here
Failover
Governance and Quota
Throttling Limits
Engine Throttling
“Not supported”
Database copy
Federation
These can result in connection termination and possible future rejection of work
Many “box” errors still apply (Ex. deadlock)
The Msg 40XXX range can be seen in sys.messages in SQL Server 2012
Speaker notes (3 mins): Most of the focus is on governance and throttling. Need to make a comment that we are evolving the service, so governance, quotas, throttling, and throttling limits may change over time to make the service more reliable and predictable.

10 Failover
We may decide to “move you” to a replica of your database on another server
  Your database, the instance, or the computer is “unhealthy”
  We may need to patch the instance and/or computer
What will you see?
  Msg 40197 “..Server not available”
Implement retry logic in your application
Slide callouts: “SHUTDOWN is in progress.” and “The partition is in transition and transactions are being terminated.”
Speaker notes (3 mins).
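The advice to implement retry logic can be sketched as follows. This is an illustration under stated assumptions, not the deck's own code: `SqlError` is a hypothetical stand-in for whatever exception your database driver raises, only Msg 40197 from this slide is treated as transient, and the attempt count and exponential backoff values are arbitrary choices.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Only the failover error named on the slide; extend for your own workload.
TRANSIENT_ERRORS = {40197}


class SqlError(Exception):
    """Hypothetical stand-in for a driver exception that carries a Msg number."""

    def __init__(self, number: int, message: str = "") -> None:
        super().__init__(f"Msg {number}: {message}")
        self.number = number


def run_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlError as err:
            # Re-raise anything non-transient, or the last failed attempt.
            if err.number not in TRANSIENT_ERRORS or attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

Because a failover terminates the connection, `operation` should open a fresh connection on each attempt and redo the whole unit of work rather than resume on the old connection.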

11 Governance
Max number of concurrent worker threads (currently 180) per database
Msg if you exceed the limit
  Connection terminated. Retry when your concurrent work subsides
  Check for blocking problems or inefficient queries
Msg if the overall system has too many workers
  You may get fewer than the 180 max
  Connection terminated. You can retry but it may take longer to stabilize
  Still could be an application issue, but a service issue could also be occurring
Slide callout: Resource ID : 1 = worker threads
Speaker notes (3 mins).

12 Quotas
Quota errors for space used
  Msg when you run out of space for your max size for your db
  Only reads and DELETE/DROP allowed until you free up space
  Use sys.dm_db_partition_stats to find what is consuming space
Solutions
  Increase max size
  Delete data or drop tables/indexes
  Partition out the database
But… freeing up space may not be immediately recognized
Changing MAXSIZE disconnects all users
Speaker notes (2 mins): Don’t spend much time here. Briefly talk about the different message for “out of space” than on the box, the different DMV to find space usage, and that freeing up space may not happen immediately.

13 Throttling Limits
We have a service called a “Watchdog Service” querying the instance for “conditions” to terminate connections to prevent resource problems. We also call these “Watchdog alerts”
We will kill the session with a “reason”. The “reason” is the error message you get
The application gets an error message (high severity) and the connection is terminated (KILL/ROLLBACK status)
Sometimes retry works, but these usually require some change on your part
throttling_long_transaction in sys.event_log
We monitor all databases and look for conditions to prevent problems

Error   Condition
40549   Session blocking a system task for a long period of time (20 secs)
40550   Session is consuming too many locks (1 million)
40551   Session is consuming too much tempdb space (5 GB)
40552   Transaction consuming too much log space or active transaction preventing log truncation
40553   Session consuming memory (16 MB) and there are memory waits (20 secs)

Slide callout: Rebuild index Online
Speaker notes (3 mins): NOTE: In my testing for 40552, I was disconnected with Msg and did not receive this error.

14 Engine Throttling
This is more of a legacy monitoring method used to keep instances healthy
Another external service monitors the health of the instance and computer
  Soft throttling: we have detected a resource issue, so we pick specific databases
  Hard throttling: the entire instance is at risk, so all databases are affected
How it Works
  Existing requests run to completion
  New requests for existing connections and new connections may get Msg and connection terminated depending on type of request
  Reason code in the error has more details on soft vs hard, what will be rejected, and why
  throttling in sys.event_log
Example reason code: 0x8003 (x03 = RejectAll, x80 = Hard Throttling on I/O)
Decode reason codes
Another resource
Speaker notes (3 mins).
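The 0x8003 example can be pulled apart with a few bit operations. The sketch below assumes only what the slide shows: the low byte carries the throttling mode and the bits above it carry the throttling type and resource. The lookup tables contain just the two values named on the slide, and the function and field names are hypothetical; the "Decode reason codes" link remains the authoritative reference for the full encoding.

```python
from typing import NamedTuple

# Only the values shown on the slide; anything else is reported as "unknown".
KNOWN_MODES = {0x03: "RejectAll"}
KNOWN_UPPER_BITS = {0x80: "Hard Throttling on I/O"}


class ReasonCode(NamedTuple):
    mode_byte: int
    upper_bits: int
    mode: str
    throttling: str


def decode_reason_code(code: int) -> ReasonCode:
    """Split a throttling reason code into its low byte and remaining bits."""
    mode_byte = code & 0xFF   # e.g. 0x03 in 0x8003
    upper_bits = code >> 8    # e.g. 0x80 in 0x8003
    return ReasonCode(
        mode_byte,
        upper_bits,
        KNOWN_MODES.get(mode_byte, "unknown"),
        KNOWN_UPPER_BITS.get(upper_bits, "unknown"),
    )


print(decode_reason_code(0x8003))
```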

15 “Not Supported” Errors
USE <db> not supported: specify the database when connecting
ALTER DATABASE supported minimally (Ex. Name, Edition, MAXSIZE, READ_ONLY)
No DBCC commands are supported except DBCC SHOW_STATISTICS
Database scoped DMVs supported
Feature Support for Windows Azure SQL Database
Unsupported Transact-SQL Statements (Windows Azure SQL Database)
Partially Supported Transact-SQL Statements (Windows Azure SQL Database)
Speaker notes (1 min): Go over quickly or skip. Pay attention to this web link.

16 Demo: Using Event Tables to Troubleshoot WASD Errors
Speaker notes (10 mins):
1. Connect with SSMS in the context of master and show sys.event_log using errors\event_log_errors.sql to show the throttling errors and deadlocks I encountered.
2. Talk about how I caused the throttling limit errors.
3. Show the deadlock error and how it is the same deadlock XML as produced by trace flag 1222.
4. Open up the .xdl file in SSMS to show the deadlock graph.

17 WASD and Query Performance
Stick to the basics…
  Running or waiting? Blocking or CPU?
  Is it your application, Windows Azure role, your computer, or queries?
  Is it network latency?
  Differences from when “good”?
  Did the query plan change?
  Proper indexes: avoid scans, large sorts, …
  Auto create and auto update stats are on by default
There are methods to optimize performance specific to Azure
  Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
Inevitably you may have to shard your data
“Chatty” applications don’t usually perform well
Avoid large result sets
Application problems may show up earlier on this platform (Ex. Transaction keeping the log from being truncated)
Speaker notes (3 mins): Another example: trying to return a lot of rows from the service back to “on-premise” may get frequently disconnected due to network issues.

18 WASD Performance Scenarios
Interesting Performance Scenarios
  On-premise clients may see higher ASYNC_NETWORK_IO waits
  Small transactions may result in WRITELOG and SE_REPL* waits
  Deadlocks (Msg 1205) just like the “box”: use sys.event_log to debug
Troubleshooting Query Timeouts
  Could just be blocking
  Trace your queries so you know which one timed out
  Examine the query plan and tune the query/indexes
Speaker notes (3 mins): Show my top wait stats for my database where I fed it a ton of small INSERTs and tried to SELECT back a huge row set.
  WRITELOG: small avg wait time
  SE_REPL_COMMIT_ACK: small avg wait time
  ASYNC_NETWORK_IO: larger than normal wait time
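The "avg wait time" comparison in the notes is total wait time divided by the number of waits. A small Python sketch of that arithmetic follows; it assumes rows shaped like the wait_type, waiting_tasks_count, and wait_time_ms columns of a wait-stats DMV, the function name is hypothetical, and the sample numbers are invented purely to show the calculation (they are not the presenter's results).

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, int]]


def rank_by_avg_wait(rows: List[Row]) -> List[Tuple[str, float]]:
    """Return (wait_type, average wait in ms) pairs, largest average first."""
    ranked = []
    for row in rows:
        count = int(row["waiting_tasks_count"])
        if count > 0:  # skip wait types that never occurred
            ranked.append((str(row["wait_type"]), int(row["wait_time_ms"]) / count))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Invented sample values for illustration only.
sample_rows = [
    {"wait_type": "WRITELOG", "waiting_tasks_count": 50000, "wait_time_ms": 100000},
    {"wait_type": "SE_REPL_COMMIT_ACK", "waiting_tasks_count": 50000, "wait_time_ms": 150000},
    {"wait_type": "ASYNC_NETWORK_IO", "waiting_tasks_count": 2000, "wait_time_ms": 400000},
]
for wait_type, avg_ms in rank_by_avg_wait(sample_rows):
    print(f"{wait_type}: {avg_ms:.1f} ms per wait")
```

Ranking by average rather than total wait time separates many short waits, such as those from small transactions, from fewer long waits, such as a client slowly consuming a large result set.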

19 Dynamic Management Views (DMV) for Performance
Find out currently running requests in your database. Use this to detect blocking
  sys.dm_exec_requests
Find out the performance of queries that have run in your database. Look here for worst performing queries
  sys.dm_exec_query_stats
Display the query plan of a specific query
  sys.dm_exec_query_plan
Aggregation history of waits. Some are new for WASD. Only shows a wait_type with count > 0
  sys.dm_db_wait_stats
Could indexes help query performance?
  “missing index DMVs”
Speaker notes (2 mins): Include pointer to this doc link.

20 A look at WASD Wait Types
2 mins

21 Demo: Troubleshooting Query Performance on WASD
Speaker notes (5 mins):
Run the script slowest_queries.sql to show which query is slowest and what the plan says about it. Notice the slowest query has 200+ seconds but low CPU time.
If you now look at waits you see IO_COMPLETION, a sign of sort spills to disk.
If you look at the plan you see a SORT operator here, but we used a clustered index seek.
Note that other queries in this list actually come from using the portal and Management Studio.
If time permits, show whoisusingio.sql to show the biggest I/O users.
Now show the performance information from the SQL Management Portal and see this query and the plan.

22 Watch Out for These
Keep database copies for “user error”
Be careful dropping servers and databases in the portal
DML may fail if there is no clustered index (temp tables excluded)
DMVs are database scoped
Databases have RCSI on by default, so tables can be larger
DATETIME in all data centers is stored as UTC time
You may not have access to objects that appear in catalog views
Non-supported or partially supported commands/features
System Views Unique to WASD
Speaker notes (2 mins).

23 Before you contact support
Check the Azure forums: MSDN or stackoverflow
Check the service dashboard
Is it Windows Azure? On-premise problem?
Have the exact error message(s) available
Have the TracingID available
Do you know the query?
Do you have application retry logic?
Give us the date and time of the issue with the “observed” timezone
Is this happening now or in the past?
Speaker notes (3 mins): This link has some suggestions on what to retry on. We can do RCA but… it can take some time and we may not have enough history.
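Since the service stores times in UTC while incidents are usually noticed in local time, it can help to report both. The sketch below is one way to do that in Python; the function name is hypothetical and the sample date, time, and UTC-05:00 offset are invented for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def incident_timestamps(observed: datetime) -> Dict[str, str]:
    """Return the observed time with its offset plus the equivalent UTC time."""
    if observed.tzinfo is None:
        # A naive time is ambiguous, which is exactly what support cannot use.
        raise ValueError("observed time must carry its timezone")
    return {
        "observed": observed.isoformat(),
        "utc": observed.astimezone(timezone.utc).isoformat(),
    }


# Invented example: an issue noticed at 09:30 in a UTC-05:00 timezone.
observed = datetime(2013, 4, 10, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
print(incident_timestamps(observed))
```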

24 References
Retry Logic for Transient Failures in Windows Azure SQL Database
Error Messages (Windows Azure SQL Database)
Windows Azure SQL Database Performance and Elasticity Guide
Windows Azure SQL Database Connection Management
sys.event_log documentation
CSS SQL Escalation Blog
Troubleshoot and Optimize Queries with Windows Azure SQL Database

25 Questions? Thank you!

26 The Troubleshooting Checklist
Does the Windows Azure Portal work and list your databases?
Is there a dashboard posting for an outage in your region?
Does the SQL Management Portal work?
Does SQL Server Management Studio work?
Is there an internet provider issue?
Is your firewall configuration correct?
Is the problem Windows Azure vs WASD?
Is there blocking?
Are your queries and indexes tuned?
Is this really an application retry issue?
Governance, quotas, limits, and throttling are “part of this platform”
Have you looked at Event Tables?

