Keeping our websites running - troubleshooting with Appdynamics Benoit Villaumie Lead Architect Guillaume Postaire Infrastructure Manager.


1 Keeping our websites running: troubleshooting with AppDynamics
Benoit Villaumie, Lead Architect
Guillaume Postaire, Infrastructure Manager

2 Introduction: Our Company
Karavel
– Founded 2001
– #1 package travel website in France
– 4 million unique visitors a month
– Mainly B2C, but also B2B
– 15 brands, 10 white labels
– One M&A every year

3 Our Application History
2008 – Monolithic Years
– Tomcat, MySQL
– Expensive to maintain and scale
– ‘Too Big To Fail’
2009 – Distributed SOA
– Tomcat, Web Services and MySQL
– Easier to maintain and scale
– Became incredibly complex to manage
– Design for failure

4 Managing this Complexity
History of Architecture Issues
– Slow SQL queries, timeouts and pool exhaustion
– Slow 3rd-party web services
– Open source framework bugs
– Resource and memory leaks
Long and Painful Firefighting
– Plenty of log files on multiple servers
– Thread and heap dumps
– A few JMX metrics, but never the one we needed
– Lack of historical data

5 Our AppDynamics Experience – Who?
Today 50+ people at Karavel use AppDynamics:
– Product Owners
– Developers
– Architects
– Ops

6 Our AppDynamics Experience – Root Causes
– Memory leakage
– Overconsumption
– Performance regression
– Application bugs
– Architectural changes
– Infrastructure changes

7 Our AppDynamics Experience – Methodology
– Quickly discard wrong hypotheses => wide-spectrum investigation
– Investigate the interesting ones more deeply
– Once under control, create alerts and dashboards
– Communicate the methodology to the team

8 Common Issues

9 Common Issues: Response Time

12 Analyze functionality on cluster / node response time
– cluster mean response time
– node mean response time
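The cluster versus node comparison can be sketched in a few lines. This is only an illustration of the idea, not how AppDynamics computes it: the node names, the sample values and the 1.5x tolerance are all assumptions.

```python
from statistics import mean


def slow_nodes(node_times_ms, tolerance=1.5):
    """Return (cluster_mean, outliers) where outliers maps each node whose
    mean response time exceeds tolerance x the cluster mean to that mean.

    node_times_ms: dict of node name -> list of response times in ms
    (hypothetical input shape, chosen for this sketch).
    """
    node_means = {node: mean(samples) for node, samples in node_times_ms.items()}
    # Cluster mean over all samples, so busy nodes weigh more than idle ones.
    all_samples = [t for samples in node_times_ms.values() for t in samples]
    cluster_mean = mean(all_samples)
    outliers = {n: m for n, m in node_means.items() if m > tolerance * cluster_mean}
    return cluster_mean, outliers


# Made-up samples: node-c is clearly slower than its peers.
samples = {
    "node-a": [120, 130, 110],
    "node-b": [125, 115, 120],
    "node-c": [480, 520, 500],
}
cluster_mean, outliers = slow_nodes(samples)
```

A node that stands out against the cluster mean points at a local problem (that JVM, that host); if every node is slow, the cause is more likely shared, such as a backend.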

13 Common Issues: Response Time

14 Analyze functionality by Business Transaction
– BT mean response time
– All BT mean response time

15 Common Issues: Response Time

16
– related to a resource consumed by the application (databases, web services, …)
– related to a performance regression in the implementation
=> request snapshot & drill-down functionality

17 Common Issues: Response Time

18 Analyze functionality on CPU
GC Time Spent / min (ms) vs CPU Time Spent / min (ms)
– CPU ms / min
– GC CPU ms / min x100 (but it depends on your code)
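As a rough sketch, the two series can be compared as a percentage. The "x100" note is read here as a rule of thumb that CPU time is normally around 100 times GC time (about 1% GC overhead); that reading, the 1% default threshold and the sample figures are assumptions, and the slide itself warns that the ratio depends on the code.

```python
def gc_overhead_percent(gc_ms_per_min, cpu_ms_per_min):
    """GC time as a percentage of CPU time for one minute."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return 100.0 * gc_ms_per_min / cpu_ms_per_min


def gc_heavy_minutes(series, max_percent=1.0):
    """Return the minutes whose GC overhead exceeds max_percent.

    series: iterable of (minute_label, gc_ms, cpu_ms) tuples
    (hypothetical shape, chosen for this sketch).
    """
    return [
        minute
        for minute, gc_ms, cpu_ms in series
        if gc_overhead_percent(gc_ms, cpu_ms) > max_percent
    ]


# Made-up figures: the last minute spends 20% of its CPU time in GC.
series = [
    ("10:00", 300, 40000),
    ("10:01", 350, 42000),
    ("10:02", 9000, 45000),
]
heavy = gc_heavy_minutes(series)
```

The threshold should be tuned per application against its own quiet-period baseline.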

19 Common Issues: Response Time

20
– related to garbage collection overactivity /!\ memory problem
=> Analyze functionality on GC Time Spent / min (ms)
– memory used
– GC Time Spent / min (ms)

21 Common Issues: Response Time

22
– related to a resource leak (CPU, FD, …)
– related to a selfish process that drains server resources (CPU, threads, FD)
=> Analyze functionality
=> Then find the class/method with a thread dump
=> Or use ps, vmstat, top
Number of threads

23 Common Issues: Errors

24 /!\ Errors do not necessarily mean a broken user experience. Example: the weather (météo) block is broken

25 Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates, then choose the error class that has a drop in number

26 Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates > details

27 Common Issues: Memory

28 Memory Problem
=> Monitor > Application Infrastructure > Memory

29 Common Issues: Memory
Memory leak: look at Tenured Gen behavior
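What "look at Tenured Gen behavior" usually means in practice is checking whether the usage left after each collection keeps climbing. The sketch below illustrates that check on a plain list of samples; the sample values, the local-minimum heuristic and the three-floor minimum are assumptions, not something taken from the slides or from AppDynamics.

```python
def post_gc_floors(usage_mb):
    """Local minima of the series, i.e. the usage right after each collection."""
    return [
        cur
        for prev, cur, nxt in zip(usage_mb, usage_mb[1:], usage_mb[2:])
        if cur < prev and cur <= nxt
    ]


def looks_like_leak(usage_mb, min_floors=3):
    """True when every post-GC floor is higher than the one before it."""
    floors = post_gc_floors(usage_mb)
    if len(floors) < min_floors:
        return False  # not enough collections observed to conclude anything
    return all(later > earlier for earlier, later in zip(floors, floors[1:]))


# Made-up tenured-gen samples in MB.
leaking = [200, 400, 600, 250, 450, 650, 320, 520, 700, 400, 600]
healthy = [200, 400, 600, 210, 410, 610, 205, 400, 600, 200, 420]
```

A sawtooth that always returns to the same floor is normal; a floor that rises after every collection means objects are being retained.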

30 Common Issues: Memory
Then, investigate Object Instance Tracking

31 Common Issues: Memory
Memory overconsumption: look at Eden Space

32 Common Issues: Memory
Then, investigate Object Instance Tracking (again)

33 Common Issues: Memory
But sometimes your VM simply needs more memory. Why? Ask the developers. They should know (?)

34 Common Issues: Backend
– C process
– MySQL backend

35 Common Issues: Backend

37 How to monitor a legacy C socket process?
Get minimal info and set alerts from the consumer process
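Measuring from the consumer side means timing every call to the backend and counting timeouts in the calling process, since the C process itself cannot be instrumented. A minimal sketch of that idea follows; `BackendStats` and `query_backend` are hypothetical names, and the host, port, payload and 2-second timeout are placeholders.

```python
import socket
import time


class BackendStats:
    """Collects response times and timeouts as seen by the consumer."""

    def __init__(self):
        self.durations_ms = []
        self.timeouts = 0

    def call(self, fn, *args):
        """Run fn(*args), recording its duration and any socket timeout."""
        start = time.perf_counter()
        try:
            return fn(*args)
        except socket.timeout:
            self.timeouts += 1
            raise
        finally:
            self.durations_ms.append((time.perf_counter() - start) * 1000.0)

    def mean_ms(self):
        if not self.durations_ms:
            return 0.0
        return sum(self.durations_ms) / len(self.durations_ms)

    def max_ms(self):
        return max(self.durations_ms, default=0.0)

    def should_alert(self, max_timeouts=0):
        """Timeouts are not normal behavior, so alert on the first one by default."""
        return self.timeouts > max_timeouts


def query_backend(host, port, payload, timeout_s=2.0):
    """Hypothetical consumer call to the legacy socket process."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(payload)
        return sock.recv(4096)
```

The consumer would then wrap each request, for example `stats.call(query_backend, host, port, payload)`, and expose `mean_ms()`, `max_ms()` and `timeouts` as the metrics to alert on.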

38 Common Issues: Backend
We have a problem: mean response time

39 Common Issues: Backend
– Max response time
– Mean response time
Timeouts are not normal behavior: contact the vendor

40 Common Issues: Backend
– New version
– Vendor forces us to stop monitoring
– Another version
Mean response time

41 Alerts & Dashboards

42 Alerts & Dashboards: proactive detection
=> Reduce Mean Time to Detection
NOC Dashboard > Health status on critical Business Transactions

43 Alerts & Dashboards: proactive detection
Alerts (ops & devs):
=> on response time
=> on errors / min
=> on stalls
Application Health Alerts Criteria
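The three alert criteria amount to a threshold check per minute of metrics. The sketch below shows the shape of such a rule; it is not AppDynamics' health rule format, and the class names, field names and threshold values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class MinuteMetrics:
    mean_response_ms: float
    errors_per_min: int
    stalls_per_min: int


@dataclass
class HealthRule:
    max_response_ms: float
    max_errors_per_min: int
    max_stalls_per_min: int

    def violations(self, m):
        """Return the list of criteria that the given minute violates."""
        found = []
        if m.mean_response_ms > self.max_response_ms:
            found.append("response time")
        if m.errors_per_min > self.max_errors_per_min:
            found.append("errors / min")
        if m.stalls_per_min > self.max_stalls_per_min:
            found.append("stalls")
        return found


# Hypothetical thresholds for one critical business transaction.
rule = HealthRule(max_response_ms=800, max_errors_per_min=5, max_stalls_per_min=0)
quiet = rule.violations(MinuteMetrics(350.0, 1, 0))
noisy = rule.violations(MinuteMetrics(1200.0, 2, 3))
```

Returning the violated criteria rather than a single boolean lets the alert message tell ops and devs which metric tripped.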

44 Alerts & Dashboards: simplify resolution
=> Reduce Mean Time to Resolution
Application Health Dashboard
– cluster response time
– node response time
– node error rate
– node call count

45 Alerts & Dashboards: simplify resolution
=> Reduce Mean Time to Resolution
Infrastructure Health Dashboard
– node memory usage
– node CPU usage
– node thread count

46 Weekly Review
Alerting is fine, BUT some regressions may not be detected
Example: response time degradation over 4 weeks
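A slow drift escapes threshold alerts because each week looks almost like the previous one. A weekly review can catch it by comparing the latest week with the week four weeks earlier, as in this sketch; the 15% limit and the weekly means are made-up values.

```python
def weekly_drift(weekly_means_ms, window=4):
    """Percentage change between the latest week and the week `window` weeks earlier."""
    recent = weekly_means_ms[-(window + 1):]
    baseline, latest = recent[0], recent[-1]
    return 100.0 * (latest - baseline) / baseline


def regressed(weekly_means_ms, max_drift_percent=15.0, window=4):
    """True when response time degraded by more than max_drift_percent over the window."""
    if len(weekly_means_ms) < window + 1:
        return False  # not enough history yet
    return weekly_drift(weekly_means_ms, window) > max_drift_percent


# Made-up weekly mean response times (ms): each step is under 5%,
# but the cumulative drift over 4 weeks is 18%.
weeks = [200, 208, 217, 226, 236]
drift = weekly_drift(weeks)
```

Here no single week-over-week step would trip a 5% alert, yet `regressed(weeks)` flags the cumulative degradation.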

47 Weekly Review
Our dashboard safety belt:
– Weekly Performance Review
– Weekly Error Review (coming soon)
Weekly Performance Dashboard

48 Capacity planning
How to ease:
– software tuning
– hardware renewal
– event planning

49 Capacity planning

51 Next Steps
– Use workflows and automatic remediations
– Integrate Splunk
– Tag deployment events inside AppDynamics
– Improve knowledge sharing among customers

52 Questions?

