Keeping our websites running - troubleshooting with AppDynamics
Benoit Villaumie, Lead Architect
Guillaume Postaire, Infrastructure Manager

Introduction: Our Company
Karavel
- Founded 2001
- #1 package travel website in France
- 4 million unique visitors a month
- Mainly B2C, but also B2B
- 15 brands, 10 white labels
- One M&A every year

Our Application History
2008: Monolithic years
- Tomcat, MySQL
- Expensive to maintain and scale
- "Too Big To Fail"
2009: Distributed SOA
- Tomcat, Web Services and MySQL
- Easier to maintain and scale
- Became incredibly complex to manage
- Design for failure

Managing this Complexity
History of architecture issues
- Slow SQL queries, timeouts and pool exhaustion
- Slow 3rd-party web services
- Open source framework bugs
- Resource and memory leaks
Long and painful firefighting
- Plenty of log files on multiple servers
- Thread and heap dumps
- Few JMX metrics, but never the one needed
- Lack of historical data

Our AppDynamics Experience: Who?
Today 50+ people at Karavel use AppDynamics:
- Product Owners
- Developers
- Architects
- Ops

Our AppDynamics Experience: Root Causes
- Memory leaks
- Memory overconsumption
- Performance regressions
- Application bugs
- Architectural changes
- Infrastructure changes

Our AppDynamics Experience: Methodology
- Quickly discard wrong hypotheses => wide-spectrum investigation
- Investigate the interesting ones more deeply
- Once under control, create alerts and dashboards
- Communicate the methodology to the team

Common Issues

Common Issues: Response Time

Analyze functionality on cluster / node response time
- cluster mean response time
- node mean response time
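The cluster-versus-node comparison can be sketched in a few lines. This is only an illustration of the idea, not AppDynamics functionality: the node names, the sample values and the 1.5x ratio are assumptions made for the example.

```python
from statistics import mean


def slow_nodes(node_times_ms, ratio=1.5):
    """Return the nodes whose mean response time exceeds ratio x the cluster mean.

    node_times_ms maps a node name to a list of response times in ms.
    The cluster mean is computed over every sample from every node.
    """
    all_samples = [t for samples in node_times_ms.values() for t in samples]
    cluster_mean = mean(all_samples)
    return sorted(
        node
        for node, samples in node_times_ms.items()
        if mean(samples) > ratio * cluster_mean
    )


# Hypothetical samples for a three-node cluster.
samples = {
    "node-1": [100, 110, 90],
    "node-2": [105, 95, 100],
    "node-3": [400, 380, 420],
}
print(slow_nodes(samples))  # ['node-3']
```

A node that stands out from the cluster mean points to a node-local cause (resource leak, bad deployment), while a cluster-wide rise points to a shared resource.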

Analyze functionality by Business Transaction (BT)
- BT mean response time
- All BT mean response time

Related to a resource consumed by the application (databases, web services, …)
Related to a performance regression in the implementation
=> request snapshot and drill-down functionality

Analyze functionality on CPU: GC Time Spent / min (ms) vs CPU Time Spent / min (ms)
- CPU ms / min
- GC CPU ms / min x100 (but this depends on your code)
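One way to read the two series together is to express GC time as a percentage of CPU time per minute. The sketch below assumes both series are already exported as ms per minute; the 10% threshold and the sample values are assumptions, and as the slide notes, a normal ratio depends on your code.

```python
def gc_overhead_percent(gc_ms_per_min, cpu_ms_per_min):
    """GC time as a percentage of CPU time for one minute."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return 100.0 * gc_ms_per_min / cpu_ms_per_min


def minutes_over_threshold(gc_series, cpu_series, threshold_percent=10.0):
    """Indexes of the minutes where GC overhead exceeds the threshold."""
    return [
        i
        for i, (gc, cpu) in enumerate(zip(gc_series, cpu_series))
        if gc_overhead_percent(gc, cpu) > threshold_percent
    ]


# Hypothetical per-minute samples: GC activity takes off in the last two minutes.
gc_ms = [200, 300, 9000, 12000]
cpu_ms = [20000, 25000, 30000, 30000]
print(minutes_over_threshold(gc_ms, cpu_ms))  # [2, 3]
```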

Related to garbage-collection overactivity
/!\ memory problem
=> Analyze functionality on GC Time Spent / min (ms)
- memory used
- GC Time Spent / min (ms)

Related to a resource leak (CPU, file descriptors, …)
Related to a greedy process that drains server resources (CPU, threads, file descriptors)
=> Analyze functionality
=> Then find the class/method with a thread dump
=> Or use ps, vmstat, top
- Number of threads

Common Issues: Errors

/!\ Errors do not necessarily mean a broken user experience
- Example: the weather feature is broken

Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates, then choose the error class that has a drop in number

Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates > details

Common Issues: Memory

Memory problem
=> Monitor > Application Infrastructure > Memory

Common Issues: Memory
Memory leak: look at the Tenured Gen behavior

Common Issues: Memory
Then investigate with Object Instance Tracking
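What "look at the Tenured Gen behavior" means in practice can be sketched as a trend check: if the tenured-generation occupancy measured after each full GC keeps rising, the heap is probably leaking. The function names, the sample values and the 1 MB-per-GC threshold below are assumptions for illustration only.

```python
def slope(values):
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(post_gc_tenured_mb, min_growth_mb_per_gc=1.0):
    """True when tenured occupancy after full GCs grows at least at the given rate."""
    return slope(post_gc_tenured_mb) >= min_growth_mb_per_gc


# Hypothetical tenured occupancy (MB) sampled right after successive full GCs.
leaking = [100, 110, 120, 130]
stable = [100, 101, 99, 100]
print(looks_like_leak(leaking), looks_like_leak(stable))  # True False
```

A trend check only says that something accumulates; finding which objects accumulate is still the job of Object Instance Tracking.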

Common Issues: Memory
Memory overconsumption: look at the Eden Space

Common Issues: Memory
Then investigate with Object Instance Tracking (again)

Common Issues: Memory
But sometimes your VM simply needs more memory
Why? Ask the developers. They should know (?)

Common Issues: Backend
- C process
- MySQL backend

Common Issues: Backend

How to monitor a legacy C socket process?
Get minimal info and set alerts from the consumer process
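Measuring from the consumer side can be sketched as a timing wrapper around the call to the backend. The class and function names, the choice of mean, max and timeout count as "minimal info", and the 500 ms alert threshold are all assumptions; the two fake backends stand in for the real socket call.

```python
import socket
import time


class BackendCallStats:
    """Minimal consumer-side view of a backend: count, mean, max, timeouts."""

    def __init__(self):
        self.count = 0
        self.total_ms = 0.0
        self.max_ms = 0.0
        self.timeouts = 0

    def record(self, elapsed_ms, timed_out):
        self.count += 1
        self.total_ms += elapsed_ms
        self.max_ms = max(self.max_ms, elapsed_ms)
        if timed_out:
            self.timeouts += 1

    @property
    def mean_ms(self):
        return self.total_ms / self.count if self.count else 0.0


def timed_call(stats, call, *args, **kwargs):
    """Run call(*args, **kwargs), recording its duration and any timeout."""
    start = time.perf_counter()
    timed_out = False
    try:
        return call(*args, **kwargs)
    except (TimeoutError, socket.timeout):
        timed_out = True
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        stats.record(elapsed_ms, timed_out)


def should_alert(stats, max_mean_ms=500.0):
    """Alert on any timeout or on a mean response time above the threshold."""
    return stats.timeouts > 0 or stats.mean_ms > max_mean_ms


# Hypothetical stand-ins for the real call to the C socket process.
def fake_backend_ok():
    return "pong"


def fake_backend_timeout():
    raise socket.timeout("backend did not answer")


stats = BackendCallStats()
timed_call(stats, fake_backend_ok)
try:
    timed_call(stats, fake_backend_timeout)
except (TimeoutError, socket.timeout):
    pass
print(stats.count, stats.timeouts, should_alert(stats))
```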

Common Issues: Backend
We have a problem
- Mean response time

Common Issues: Backend
- Max response time
- Mean response time
Timeouts are not normal behavior
Contact the vendor

Common Issues: Backend
- New version
- Vendor forces us to stop monitoring
- Another version
- Mean response time

Alerts & Dashboards

Alerts & Dashboards: proactive detection
=> Reduce Mean Time To Detection
NOC Dashboard > Health status on critical Business Transactions
[Figure: NOC Dashboard]

Alerts & Dashboards: proactive detection
Alerts (ops & devs):
- on response time
- on errors/min
- on stalls
[Figure: Application Health Alerts Criteria]
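The three alert criteria can be sketched as a simple per-minute check. This is not how AppDynamics health rules are configured; the type names and every threshold value below are assumptions chosen for the example.

```python
from typing import List, NamedTuple


class MinuteMetrics(NamedTuple):
    mean_response_ms: float
    errors: int
    stalls: int


class Thresholds(NamedTuple):
    max_mean_response_ms: float = 1000.0
    max_errors_per_min: int = 10
    max_stalls_per_min: int = 0


def violations(metrics: MinuteMetrics, limits: Thresholds) -> List[str]:
    """Return the names of the criteria that one minute of metrics violates."""
    found = []
    if metrics.mean_response_ms > limits.max_mean_response_ms:
        found.append("response time")
    if metrics.errors > limits.max_errors_per_min:
        found.append("errors/min")
    if metrics.stalls > limits.max_stalls_per_min:
        found.append("stalls")
    return found


# Hypothetical minute: slow and with one stalled request, but few errors.
print(violations(MinuteMetrics(1500.0, 3, 1), Thresholds()))
```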

Alerts & Dashboards: simplify resolution
=> Reduce Mean Time To Resolution
Application Health Dashboard
- cluster response time
- node response time
- node error rate
- node call number
[Figure: Application Health Dashboard]

Alerts & Dashboards: simplify resolution
=> Reduce Mean Time To Resolution
Infrastructure Health Dashboard
- node memory usage
- node CPU usage
- node thread number
[Figure: Infrastructure Health Dashboard]

Weekly Review
Alerting is fine BUT some regressions may not be detected
- response time degradation over 4 weeks

Weekly Review
Our dashboard safety belt
- Weekly Performance Review
- Weekly Error Review (coming soon)
[Figure: Weekly Performance Dashboard]
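Why a weekly review catches what alerts miss can be shown with a small calculation: each week-over-week step stays under an alert threshold while the total drift over four weeks does not. The weekly means and both thresholds below are made-up values for illustration.

```python
def total_drift_percent(weekly_means_ms):
    """Percentage change between the first and the last weekly mean."""
    first, last = weekly_means_ms[0], weekly_means_ms[-1]
    return 100.0 * (last - first) / first


def max_weekly_step_percent(weekly_means_ms):
    """Largest week-over-week percentage increase."""
    return max(
        100.0 * (b - a) / a
        for a, b in zip(weekly_means_ms, weekly_means_ms[1:])
    )


def slow_regression(weekly_means_ms, step_alert_percent=10.0, drift_alert_percent=15.0):
    """True when no single step would alert but the cumulative drift is significant."""
    return (
        max_weekly_step_percent(weekly_means_ms) < step_alert_percent
        and total_drift_percent(weekly_means_ms) >= drift_alert_percent
    )


# Hypothetical mean response times (ms) for four consecutive weeks.
weeks = [200.0, 210.0, 222.0, 236.0]
print(total_drift_percent(weeks), slow_regression(weeks))  # 18.0 True
```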

Capacity planning
How to ease:
- software tuning
- hardware renewal
- event planning

Capacity planning

Next Steps
- Use workflows and automatic remediation
- Integrate Splunk
- Tag deployment events inside AppDynamics
- Improve knowledge sharing among customers

Questions?
