1 Perspectives from Operating a Large Scale Website Dennis Lee.

Slides:



Advertisements
Similar presentations
1 Perspectives from Operating a Large Scale Website Dennis Lee VP Technical Operations, Marchex.
Advertisements

Reliable Multicast for Time-Critical Systems Mahesh Balakrishnan Ken Birman Cornell University.
Distributed Data Processing
Implementing A Simple Storage Case Consider a simple case for distributed storage – I want to back up files from machine A on machine B Avoids many tricky.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 17 Slide 1 Rapid software development.
Sponsored by the U.S. Department of Defense © 2005 by Carnegie Mellon University 1 Pittsburgh, PA Dennis Smith, David Carney and Ed Morris DEAS.
Keeping our websites running - troubleshooting with Appdynamics Benoit Villaumie Lead Architect Guillaume Postaire Infrastructure Manager.
VoipNow Core Solution capabilities and business value.
How to Optimize Your Existing Regression Testing Arthur Hicken May 2012.
API Design CPSC 315 – Programming Studio Fall 2008 Follows Kernighan and Pike, The Practice of Programming and Joshua Bloch’s Library-Centric Software.
A 100,000 Ways to Fa Al Geist Computer Science and Mathematics Division Oak Ridge National Laboratory July 9, 2002 Fast-OS Workshop Advanced Scientific.
1 CSSE 477 – A bit more on Performance Steve Chenoweth Friday, 9/9/11 Week 1, Day 2 Right – Googling for “Performance” gets you everything from Lady Gaga.
Wade Wegner Windows Azure Technical Evangelist Microsoft Corporation Windows Azure AppFabric Caching.
Enterprise Business Processes and Applications (IS 6006) Masters in Business Information Systems 10 th Feb 2009 Fergal Carton Business Information Systems.
Computer Science 162 Section 1 CS162 Teaching Staff.
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
EEC-681/781 Distributed Computing Systems Lecture 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Live for today as if it is your last day but plan for tomorrow as if it will last forever!
Agile Testing with Testing Anywhere The road to automation need not be long.
SPRING 2011 CLOUD COMPUTING Cloud Computing San José State University Computer Architecture (CS 147) Professor Sin-Min Lee Presentation by Vladimir Serdyukov.
National Manager Database Services
What is Software Architecture?
CONTINUOUS INTEGRATION, DELIVERY & DEPLOYMENT ONE CLICK DELIVERY.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
Business Continuity and Disaster Recovery Chapter 8 Part 2 Pages 914 to 945.
TESTING.
12/02/04www.cis.ksu.edu/~meiyappa Enterprise Resource Planning Meiyappan Thandayuthapani CIS 764.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Performance Concepts Mark A. Magumba. Introduction Research done on 1058 correspondents in 2006 found that 75% OF them would not return to a website that.
1 Software Development Configuration management. \ 2 Software Configuration  Items that comprise all information produced as part of the software development.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Overview of Cloud Computing Sven Rosvall ACCU
Data Structures & Algorithms and The Internet: A different way of thinking.
1 Moshe Shadmon ScaleDB Scaling MySQL in the Cloud.
CONFIDENTIAL INFORMATION CONTAINED WITHIN 9200 – J2EE Performance Tuning How-to  Michael J. Rozlog  Chief Technical Architect  Borland Software Corporation.
SONIC-3: Creating Large Scale Installations & Deployments Andrew S. Neumann Principal Engineer, Progress Sonic.
Eduardo Gutarra Velez. Outline Distributed Filesystems Motivation Google Filesystem Architecture The Metadata Consistency Model File Mutation.
1 Legacy Code From Feathers, Ch 2 Steve Chenoweth, RHIT Right – Your basic Legacy, from Subaru, starting at $ 20,295, 24 city, 32 highway.
Information Technology Division Executive Office for Administration and Finance Service Oriented Architecture An Enterprise Approach to Enabling the Business.
Software Engineering 2004 Jyrki Nummenmaa 1 BACKGROUND There is no way to generally test programs exhaustively (that is, going through all execution.
Lesson 9. * Testing Your browser * Using different browser tools * Using conditional comments with * Dealing with future compatibility problems.
CS 5150 Software Engineering Lecture 2 Software Processes 1.
Ashish Prabhu Douglas Utzig High Availability Systems Group Server Technologies Oracle Corporation.
Lecture 4 Page 1 CS 111 Online Modularity and Virtualization CS 111 On-Line MS Program Operating Systems Peter Reiher.
CloudWay.ro Gives Clients Fast Invoicing, Stock Management, and Resource Planning via Microsoft Azure and Azure SQL Database MICROSOFT AZURE ISV PROFILE:
Component Patterns – Architecture and Applications with EJB copyright © 2001, MATHEMA AG Component Patterns Architecture and Applications with EJB Markus.
Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.
Lecture III: Challenges for software engineering with the cloud CS 4593 Cloud-Oriented Big Data and Software Engineering.
CS223: Software Engineering Lecture 13: Software Architecture.
INFSO-RI Enabling Grids for E-sciencE FTS failure handling Gavin McCance Service Challenge technical meeting 21 June.
Breaking Up Is Hard To Do From Monolith to Microservices.
Chapter 3 : Designing a Consolidation Strategy MCITP Administrator: Microsoft SQL Server 2005 Database Server Infrastructure Design Study Guide (70-443)
(re)-Architecting cloud applications on the windows Azure platform CLAEYS Kurt Technology Solution Professional Microsoft EMEA.
 Cloud Computing technology basics Platform Evolution Advantages  Microsoft Windows Azure technology basics Windows Azure – A Lap around the platform.
Content Management in Windows Azure Thom Robbins, Chief Evangelist, Kentico CMS.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
Jeff Kern NRAO/ALMA.  Scaling and Complexity ◦ SKA is not just a bigger version of existing systems  Higher Expectations  End to End Systems  Archive.
Considerations for operating MS SQL in the cloud, in production, DR, or hybrid scenarios. By Nick Rubtsov.
Maximum Availability Architecture Enterprise Technology Centre.
LECTURE 34: WEB PROGRAMMING FOR SCALE
Software Engineering Introduction to Apache Hadoop Map Reduce
X in [Integration, Delivery, Deployment]
Arrested by the CAP Handling Data in Distributed Systems
LECTURE 32: WEB PROGRAMMING FOR SCALE
LECTURE 33: WEB PROGRAMMING FOR SCALE
Analysis models and design models
P1 : Distributed Bitcoin Miner
LECTURE 33: WEB PROGRAMMING FOR SCALE
Presentation transcript:

1 Perspectives from Operating a Large Scale Website Dennis Lee

2 System Architecture Amazon.com Target.com Amazon.co.uk NBA.com … Catalog Search Customers Ordering Catalog DBSearch IndicesCustomer Ordering Session/Auth Sessions Presentation Tier Services Databases/ Persistent Store Similar structures for fulfillment centers, supply chain, metrics, finance, etc.

3 Goal 99.9% availability –99.9% of the time, anyone who wants to order from Amazon can order –0.1% of $8B = $8MM Lots of things can go wrong: –Deployment, bad machines, bad services, bad software This implies there can be no scheduled downtime

4 Scale makes this more challenging Customers – 40MM+ Traffic MM pages per day Machines – 10,000’s Developers – 1,000’s Websites – 10’s And others: –Distribution center facilities, drop ship partners, lines of code, databases, etc. Fast growth implies that “something” that works today will not work tomorrow.

5 Some General Lessons Decouple – allow for independent evolution –Key Question: can I deploy the pieces independently without breaking the “whole system” –Avoid structures that have to be the same across different systems Pass messages not objects; Avoid shared schemas across different databases – database engines don’t deal well with exceptions and schema changes forces downtime –Run-time versioning is the first step in the road to hell Key symptom: running multiple instances of service to support different API’s Better: one instance of service supporting multiple versions of API –Always prefer many small things to a few big things especially if you don’t fully control the big things. e.g., databases, multicast, code-base Test, test, test - insist on Unit Tests and automated regression suites –Human nature will tend to concentrate on the “happy” case – need both discipline to write the tests and investment in improving test infrastructure –Watch out for unexpected points of failure Failure mode testing is critical – know what your system will do if one of the “non-critical” systems fail. e.g., calls to systems without “fast fail”

6 Some General Lessons Computer systems model the real world – and the real world might be imperfect. –For complicated data structures, getting to 100% consistent state across large distributed systems is very expensive and we probably don’t need it –Time-bounded eventual consistency is usually good enough Example: Deployment, inventory levels –Token passing across systems works better than paying for cross system consistent state Example: order state in “stale” caches

7 Designing for Availability Design for recovery - availability = MTBF x MTTR –Fast undo, fast restart –Can your system start from scratch? i.e., deal with the cold start problem Avoid impulse functions –Reduce operations that require system restarts E.g., software upgrades, configuration changes Watch out for positive feedback loops –Retries can be deadly –Fast-fail is critical for availability Don’t wait for a dead system Don’t beat up a system in trouble –Per client access control and throttling are important - defend against DOS attacks Breaking up a monolithic system can reduce availability if you don’t do it right –e.g., Break up a monolithic system with 99% availablility: Two systems have to be up: Availability = 99% x 99% = 98% One of two systems have to be up: Availability = 1 – (1% x 1%) = 99.99%

8 Operating for Availability (technical) Latency (not just throughput) is critical –Customers care about latency –Latency and availability are related via capacity. Capacity calculations depend on: (1) average latency of dependent systems (2) average offered load by client systems Caches change these in dramatic ways so systems will probably not survive a cache disappearing => caches may need to be Tier-1. –Poor latency usually indicates a badly scaled or badly designed system Details matter –Single machines failures can be costly e.g., least connections and the ‘bad’ machine –Tuning the system is the full-time job – for a team! e.g., configuring the system for new business, watching performance, dealing with implications of upgrades, dealing with scaling limits of your providers –Memory leaks can kill –Getting machines up-to-speed fast simplifies scaling

9 Operating for Availability (management) Deployment rules –Deploy small changes frequently Eases debugging and improves availability Keeps your political options freer (allows you to say no more often) –Clean up all the way after failed deployments (don’t let these interfere with non-related deployments) –Critical to having both correctness, performance, and “quality of software” standards enforced before software gets deployed Final check should go through an automated system –Guarantee to run all checks –Arguments about exceptions become arguments about goodness of principle rather than ‘just let this one through’ Migration – moving from legacy to the “best new system” –Always actively manage migration and budget for it Out of the box, most new systems need to be tuned Legacy systems do not just die – they have to be shot –e.g., search system, presentation tier –Legacy systems have a cost: they have to be managed and have to be scaled Teams rapidly lose expertise in the old system especially as migration to new system progress

10 Contact Info