Keeping our websites running - troubleshooting with AppDynamics
Benoit Villaumie, Lead Architect
Guillaume Postaire, Infrastructure Manager

Introduction: Our Company
Karavel
– Founded 2001
– #1 package travel website in France
– 4 million unique visitors a month
– Mainly B2C, but also B2B
– 15 brands, 10 white labels
– One M&A every year

Our Application History
2008 – Monolithic Years
– Tomcat, MySQL
– Expensive to maintain and scale
– "Too Big To Fail"
2009 – Distributed SOA
– Tomcat, Web Services and MySQL
– Easier to maintain and scale
– Became incredibly complex to manage
– Design for failure

Managing this Complexity
History of architecture issues:
– Slow SQL queries, timeouts and pool exhaustion
– Slow 3rd-party web services
– Open source framework bugs
– Resource and memory leaks
Long and painful firefighting:
– Plenty of log files on multiple servers
– Thread and heap dumps
– Few JMX metrics, but never the one needed
– Lack of historical data

Our AppDynamics Experience – Who?
Today 50+ people at Karavel use AppDynamics:
– Product owners
– Developers
– Architects
– Ops

Our AppDynamics Experience – Root Causes
– Memory leakage and overconsumption
– Performance regressions
– Application bugs
– Architectural changes
– Infrastructure changes

Our AppDynamics Experience – Methodology
– Quickly discard wrong hypotheses => wide-spectrum investigation
– Investigate the interesting ones more deeply
– Once under control, create alerts and dashboards
– Communicate the methodology to the team

Common Issues

Common Issues: Response Time
Analyze functionality on cluster / node response time
[Charts: cluster mean response time, node mean response time]

Common Issues: Response Time
Analyze functionality by Business Transaction
[Charts: BT mean response time, all BT mean response time]

Common Issues: Response Time
– Related to a resource consumed by the application (databases, web services, …)
– Related to a performance regression in the implementation
=> Request snapshot and drill-down functionality

Common Issues: Response Time
Analyze functionality on CPU: GC Time Spent / min (ms) vs CPU Time Spent / min (ms)
CPU ms / min ~ GC CPU ms / min x 100 (but this depends on your code)
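
The CPU vs GC comparison above can be sketched as a small check. This is only an illustration, not AppDynamics code: the function names and the sample values are hypothetical, and the default ratio of 100 is taken from the slide's rule of thumb, which the slide itself says depends on your code.

```python
def cpu_to_gc_ratio(cpu_ms_per_min: float, gc_ms_per_min: float) -> float:
    """Return how many ms of CPU time are spent per ms of GC time."""
    if gc_ms_per_min <= 0:
        return float("inf")  # no GC activity during this minute
    return cpu_ms_per_min / gc_ms_per_min


def gc_overactive(cpu_ms_per_min: float, gc_ms_per_min: float,
                  expected_ratio: float = 100.0) -> bool:
    """Flag a minute where GC takes a larger share of CPU than expected.

    expected_ratio is an assumption (the slide's x100 rule of thumb).
    """
    return cpu_to_gc_ratio(cpu_ms_per_min, gc_ms_per_min) < expected_ratio


# Hypothetical per-minute samples: (CPU ms / min, GC ms / min)
samples = [(30_000, 250), (30_000, 900)]
flags = [gc_overactive(cpu, gc) for cpu, gc in samples]
```

With these made-up samples only the second minute is flagged, because GC time grew while CPU time stayed flat.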

Common Issues: Response Time
Related to garbage collection overactivity
/!\ Memory problem
=> Analyze functionality on GC Time Spent / min (ms)
[Charts: memory used, GC Time Spent / min (ms)]

Common Issues: Response Time
– Related to a resource leak (CPU, file descriptors, …)
– Related to a selfish process that drains server resources (CPU, threads, file descriptors)
=> Analyze functionality
=> Then find the class/method with a thread dump
=> Or use ps, vmstat, top
[Chart: number of threads]

Common Issues: Errors
/!\ Errors do not necessarily mean a broken user experience
Example: weather ("meteo") is broken

Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates, then choose the error class that shows a drop in number

Common Issues: Errors
Identify the error kind and the business transactions
=> Troubleshoot > Error rates > details

Common Issues: Memory
Memory problem
=> Monitor > Application Infrastructure > Memory

Common Issues: Memory
Memory leak: look at Tenured Gen behavior

Common Issues: Memory
Then investigate with Object Instance Tracking
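
One way to read "Tenured Gen behavior" is to check whether the tenured occupancy left after each major GC keeps climbing. The sketch below is an assumption about how that check could be automated, not something shown in the slides: the function names, the sample values, and the 1 MB per collection threshold are all hypothetical.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(tenured_after_gc_mb: Sequence[float],
                    min_growth_mb_per_gc: float = 1.0) -> bool:
    """True when post-GC tenured usage trends upward faster than the threshold.

    The threshold is a hypothetical default; a healthy application returns to
    roughly the same floor after each major collection.
    """
    return slope(tenured_after_gc_mb) >= min_growth_mb_per_gc


# Hypothetical tenured occupancy (MB) measured right after each major GC
leaking = [400, 420, 445, 460, 490]
stable = [400, 405, 398, 402, 400]
```

A rising floor only suggests a leak; Object Instance Tracking is still what identifies the classes responsible.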

Common Issues: Memory
Memory overconsumption: look at Eden Space

Common Issues: Memory
Then investigate with Object Instance Tracking (again)

Common Issues: Memory
But sometimes your VM just needs more memory
Why? Ask the developers. They should know (?)

Common Issues: Backend
[Diagram: C process, MySQL backend]

Common Issues: Backend
How to monitor a legacy C socket process?
Get minimal info and set alerts from the consumer process
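
Measuring from the consumer side means timing each call to the backend and counting timeouts, without touching the legacy process. The sketch below illustrates the idea in plain Python under stated assumptions: `BackendStats` and `timed_call` are hypothetical names, and the callable stands in for whatever socket request the consumer really makes.

```python
import socket
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BackendStats:
    """Response times (ms) and timeout count as seen by the consumer."""
    durations_ms: List[float] = field(default_factory=list)
    timeouts: int = 0

    def mean_ms(self) -> float:
        if not self.durations_ms:
            return 0.0
        return sum(self.durations_ms) / len(self.durations_ms)

    def max_ms(self) -> float:
        return max(self.durations_ms, default=0.0)


def timed_call(stats: BackendStats,
               call: Callable[[], bytes]) -> Optional[bytes]:
    """Run one backend call, recording its duration and any socket timeout."""
    start = time.perf_counter()
    try:
        return call()
    except socket.timeout:
        stats.timeouts += 1
        return None
    finally:
        # Recorded for successes and timeouts alike, so max_ms shows stalls.
        stats.durations_ms.append((time.perf_counter() - start) * 1000.0)
```

Mean response time, max response time, and the timeout count are then enough to set an alert on the consumer.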

Common Issues: Backend
We have a problem
[Chart: mean response time]

Common Issues: Backend
[Chart: max response time, mean response time]
Timeouts are not normal behavior
=> Contact the vendor

Common Issues: Backend
New version: the vendor forces us to stop monitoring
Another version
[Chart: mean response time]

Alerts & Dashboards

Alerts & Dashboards: Proactive Detection
=> Reduce mean time to detection
NOC dashboard > health status on critical Business Transactions
[Screenshot: NOC Dashboard]

Alerts & Dashboards: Proactive Detection
Alerts (ops and devs):
– On response time
– On errors / min
– On stalls
[Screenshot: Application Health Alerts Criteria]
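
The three alert criteria above can be pictured as a simple rule evaluated every minute. The sketch below is illustrative only and is not how AppDynamics health rules are configured: the class names and every threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteMetrics:
    mean_response_ms: float
    errors_per_min: int
    stalls_per_min: int


@dataclass
class AlertCriteria:
    # Hypothetical thresholds; tune them per Business Transaction.
    max_mean_response_ms: float = 2000.0
    max_errors_per_min: int = 10
    max_stalls_per_min: int = 0


def violations(m: MinuteMetrics, c: AlertCriteria) -> List[str]:
    """Return the names of the criteria this minute of metrics violates."""
    found = []
    if m.mean_response_ms > c.max_mean_response_ms:
        found.append("response time")
    if m.errors_per_min > c.max_errors_per_min:
        found.append("errors/min")
    if m.stalls_per_min > c.max_stalls_per_min:
        found.append("stall")
    return found


healthy = violations(MinuteMetrics(850.0, 2, 0), AlertCriteria())
degraded = violations(MinuteMetrics(3200.0, 2, 1), AlertCriteria())
```

An empty list means the Business Transaction is healthy; anything else would notify ops and devs.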

Alerts & Dashboards: Simplify Resolution
=> Reduce mean time to resolution
Application Health Dashboard:
– Cluster response time
– Node response time
– Node error rate
– Node call number
[Screenshot: Application Health Dashboard]

Alerts & Dashboards: Simplify Resolution
=> Reduce mean time to resolution
Infrastructure Health Dashboard:
– Node memory usage
– Node CPU usage
– Node thread count
[Screenshot: Infrastructure Health Dashboard]

Weekly Review
Alerting is fine, BUT some regressions may not be detected
[Chart: response time degradation over 4 weeks]
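
The kind of regression a weekly review catches is a slow drift: no single week jumps enough to fire an alert, yet the total over several weeks is large. The sketch below shows one possible check under stated assumptions: the function names, the 20% thresholds, and the weekly values are hypothetical.

```python
from typing import Sequence


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def slow_drift(weekly_mean_ms: Sequence[float],
               alert_pct: float = 20.0,
               review_pct: float = 20.0) -> bool:
    """True when no week-over-week step would alert, yet total drift is large."""
    if len(weekly_mean_ms) < 2:
        return False
    steps = [percent_change(a, b)
             for a, b in zip(weekly_mean_ms, weekly_mean_ms[1:])]
    total = percent_change(weekly_mean_ms[0], weekly_mean_ms[-1])
    return all(s < alert_pct for s in steps) and total >= review_pct


# Hypothetical weekly mean response times (ms) over 4 weeks
creeping = [500, 540, 585, 630]   # about +8% each week, +26% overall
steady = [500, 505, 498, 502]
sudden = [500, 700, 710, 720]     # a single jump that alerting already catches
```

Only the creeping series is reported, which is the case the weekly dashboard is meant to cover.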

Weekly Review
Our dashboard safety belt:
– Weekly Performance Review
– Weekly Error Review (coming soon)
[Screenshot: Weekly Performance Dashboard]

Capacity Planning
How to ease:
– Software tuning
– Hardware renewal
– Event planning

Capacity Planning

Next Steps
– Use workflows and automatic remediation
– Integrate Splunk
– Tag deployment events inside AppDynamics
– Improve knowledge sharing among customers

Questions?