CSE 124 Networked Services, Fall 2009
B. S. Manoj, Ph.D.
Some of these slides are adapted from various sources/individuals, including but not limited to images and text from the IEEE/ACM digital libraries. Use of these slides other than for pedagogical purposes in CSE 124 may require explicit permission from the respective sources.
Announcements
Midterm: November 5
Help sheet specs (optional)
– Must be computer printed
– Single side of a letter-size sheet of paper
– 1-inch margins on all sides
– 10-point, single-spaced Times New Roman font
Programming Project 2
– Innovation on Networked Services
– Computing services over a cluster (cluster computing project)
– …
Discussion on Giant-Scale Services
Why Giant-Scale Services?
Access anywhere, any time
– Home, office, coffee shop, airport, etc.
Available via multiple devices
– Computers, smart phones, set-top boxes, etc.
Groupware support
– Centralization of services helps
– Calendars, e-vite, etc.
Lower overall cost
– End-user device utilization: 4%
– Infrastructure resource utilization: 80%
– Fundamental cost advantage over stand-alone applications
Simplified service updates
– The most powerful long-term advantage
– No physical distribution of software or hardware is necessary
Key Assumptions
Service provider has limited control over
– Clients
– Network (except the intranet)
Queries drive the traffic
– Web or database queries
– E.g., HTTP, FTP, or RPC
Read-only queries greatly outnumber data updates
– Reads dominate writes
– Product evaluations vs. purchases
– Stock quotes vs. trades
Basic Model of Giant-Scale Services
Clients
– Web browsers and other clients, including XML and SOAP clients
The best-effort IP network
– Provides access to the service
The load manager
– A level of indirection to balance the load
– Routes traffic around faults
Servers
– The system's workers
– Combine CPU, memory, and disks
Persistent data store
– Replicated or partitioned database
– Spread across the servers' disks
– May include network-attached storage, DBMSs, or RAIDs
Services backplane
– Optional system-area network
– Handles inter-server traffic
Clusters in Giant-Scale Services
Clusters
– Collections of commodity servers
Main benefits
– Absolute scalability: many new services must serve a fraction of the world's population
– Cost and performance: a compelling reason for clusters, since bandwidth and operational costs dwarf hardware cost
– Independent components: help in handling faults
– Incremental scalability: helps handle the uncertainty and expense of growing the service
– Typical depreciation lifetime is 3 years, and a unit of rack space quadruples in computing power every 3 years
Load Management
Simple load-management strategy: round-robin DNS
– Distributes the servers' IP addresses in rotation for a single name
– Does not hide down/inactive servers
– A short time-to-live is an option, but many browsers mishandle expired DNS info
Layer-4/Layer-7 switches
– Transport/application-layer switches
– Process higher-layer info at wire speed
– Help with fault tolerance
– High throughput: up to 20 Gbps
– Can detect down nodes via connection status
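As a rough illustration of the simplest strategy above, the sketch below rotates through a fixed pool the way round-robin DNS hands out addresses for one name. The addresses and the RoundRobinManager class are invented for this example, and the down-server check stands in for what a layer-4/7 switch can do but plain DNS cannot.

```python
from itertools import cycle

# Hypothetical server pool; round-robin DNS would return these
# addresses in rotation for a single domain name.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

class RoundRobinManager:
    """Minimal round-robin load manager (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.rotation = cycle(self.servers)
        self.down = set()   # plain round-robin DNS has no such notion

    def mark_down(self, server):
        # A layer-4/7 switch can detect this via connection status;
        # DNS round-robin keeps handing out the dead address until TTL expiry.
        self.down.add(server)

    def pick(self):
        for _ in range(len(self.servers)):
            server = next(self.rotation)
            if server not in self.down:
                return server
        raise RuntimeError("no servers available")

manager = RoundRobinManager(SERVERS)
manager.mark_down("10.0.0.2")
print([manager.pick() for _ in range(4)])   # skips the down server
```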
Load Management (contd.)
Service-specific Layer-7 switches
– Example user: Walmart
– Can track session information
Smart clients
– An end-to-end approach
– The client uses alternative-server info (e.g., obtained via DNS) to pick or switch servers
Key Requirements of Giant-Scale Services
High availability
– Much like other always-available services: water, electricity, telephone, etc.
– Must be maintained in spite of
– Component failures
– Natural disasters
– Growth and evolution
Design points that reduce failures
– Symmetry
– Internal disks
– No people, wires, or monitors on the floor
Offsite clusters
– Contracts limit temperature and power variations
Availability Metrics
Uptime
– Fraction of time the site is available (e.g., 0.9999 = 99.99%, i.e., at most 8.64 seconds of downtime per day)
– MTBF: mean time between failures
– MTTR: mean time to repair
– Uptime is roughly (MTBF − MTTR) / MTBF
Two ways to improve uptime: increase MTBF or reduce MTTR
– MTBF is hard to improve and hard to verify
– Reducing MTTR is preferred: the total time required to verify an improvement is much shorter
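A minimal numeric sketch of the uptime relation on this slide, using the approximation uptime ≈ (MTBF − MTTR)/MTBF; the hour values are made up purely to show that halving MTTR buys as much uptime as doubling MTBF.

```python
def uptime(mtbf_hours, mttr_hours):
    """Approximate availability from mean time between failures
    and mean time to repair: (MTBF - MTTR) / MTBF."""
    return (mtbf_hours - mttr_hours) / mtbf_hours

# Halving MTTR buys as much uptime as doubling MTBF, and is far easier to verify.
print(f"{uptime(1000, 1):.2%}")    # 99.90%
print(f"{uptime(1000, 0.5):.2%}")  # 99.95%
print(f"{uptime(2000, 1):.2%}")    # 99.95%
```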
Availability Metrics (contd.)
Yield
– The fraction of queries offered that are completed
– An availability metric, not a throughput metric
– Similar to uptime, but maps more directly to user experience, because not all seconds have equal value
– A second lost when there are no queries costs nothing
– A second lost during peak hours is a real problem
Availability Metrics (contd.)
Harvest
– A query may be answered partially or fully
– Harvest measures how much of the complete information a response returns
– Can help maintain user satisfaction while handling faults
– E.g., the inbox loads, but the task list or contacts do not
– E.g., eBay auction info loads, but not the user profile
Key point
– We can control how faults affect yield, harvest, or both; total capacity remains the same
– A fault in a replicated system reduces yield
– A fault in a partitioned system reduces harvest
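A small sketch tying yield and harvest to the key point above: yield is taken as completed/offered queries and harvest as available/complete data, and the counter values are hypothetical.

```python
def yield_and_harvest(offered, completed, data_available, data_total):
    """Yield = fraction of offered queries that complete;
    harvest = fraction of the full data reflected in each answer."""
    return completed / offered, data_available / data_total

# Replicated system, one of two nodes down: answers stay complete, capacity halves.
print(yield_and_harvest(offered=1000, completed=500, data_available=100, data_total=100))
# Partitioned system, one of two nodes down: all queries answered, half the data missing.
print(yield_and_harvest(offered=1000, completed=1000, data_available=50, data_total=100))
```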
DQ Principle
Data per query × queries per second → constant
– A useful metric for giant-scale systems
– Represents a physical capacity bottleneck: maximum I/O bandwidth or total disk seeks per second
– At high utilization, a giant-scale system approaches this constant
– Includes all overhead: data copying, presentation layers, and network-bound issues
– Each node (and each system) has its own DQ value; the absolute value matters less than how it changes
– Makes it easy to measure the relative impact of faults on capacity
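The sketch below is only an illustration of the DQ trade-off, not a measurement: for a fixed (made-up) DQ capacity, lowering the data moved per query raises the sustainable query rate proportionally.

```python
DQ_CAPACITY = 1_000_000  # hypothetical node capacity: (bytes/query) * (queries/sec)

def max_queries_per_second(data_per_query_bytes):
    """At saturation, D * Q is roughly constant, so Q scales as capacity / D."""
    return DQ_CAPACITY / data_per_query_bytes

for d in (10_000, 5_000, 2_000):
    print(f"D = {d:>6} bytes/query -> Q <= {max_queries_per_second(d):,.0f} queries/sec")
```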
DQ Value (contd.)
Overall DQ value scales linearly with the number of nodes
– Allows sampling the behavior of the entire system
– A small test cluster can predict the behavior of the full system
– Inktomi used 4-node clusters to predict the impact of software updates on 100-node clusters
– The DQ impact must be evaluated before any proposed HW/SW change
– DQ also degrades linearly with the number of faults
DQ Value (contd.)
Future demand
– Estimate the additional DQ capacity that must be added
Fault impact and DQ
– DQ degrades linearly with the number of faults (failed nodes)
– DQ determines how faults affect uptime, yield, and harvest
DQ applies differently to different service classes
– Data-intensive services: DQ applies mostly to this category (database access; the majority of the top 100 sites are data intensive)
– Computation-intensive services: simulation, supercomputing, etc.
– Communication-intensive services: chat, news, or VoIP
Replication vs. Partitioning
Replication
– The traditional method for improving availability
– Effect of a fault on DQ, e.g., a two-node cluster with one fault:
– Harvest: 100%, yield: 50%
– Maintains D, reduces Q
Partitioning
– Effect of a fault on DQ, e.g., a two-node cluster with one fault:
– Harvest: 50%, yield: 100%
– Reduces D, maintains Q
DQ drops to 50% in both cases (as sketched below)
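A hedged sketch of the comparison above, generalized to k failures out of n nodes: replication keeps harvest at 100% and gives up yield, partitioning does the reverse, and both lose the same fraction of DQ. The function and its inputs are illustrative only.

```python
def fault_impact(n_nodes, failed, replicated=True):
    """Return (harvest, yield) after `failed` of `n_nodes` nodes go down.

    Replicated: every node holds all the data, so answers stay complete but
    capacity (and hence yield at saturation) drops.
    Partitioned: every node holds 1/n of the data, so all queries are served
    but each answer is missing the failed partitions.
    """
    surviving = (n_nodes - failed) / n_nodes
    if replicated:
        return 1.0, surviving        # (harvest, yield)
    return surviving, 1.0

print(fault_impact(2, 1, replicated=True))   # (1.0, 0.5)
print(fault_impact(2, 1, replicated=False))  # (0.5, 1.0)
```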
Load Redirection and Replication
Traditional replication requires provisioning excess capacity
– On faults, load must be redirected
– The surviving replicas absorb the load handled by the failed nodes
– Hard to achieve under high utilization
– Losing k out of n nodes redirects an extra k/(n−k) share of load onto each of the remaining n−k nodes
– Loss of 2 out of 5 nodes implies a redirected load of 2/3 and an overload factor of 5/3 (about 167%)
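The redirected-load arithmetic above as a tiny sketch; the function name is ours, not the slide's.

```python
def overload_after_failures(n, k):
    """Each of the n-k survivors absorbs an extra k/(n-k) share of the load,
    so it ends up running at n/(n-k) of its normal load."""
    redirected_share = k / (n - k)
    load_factor = n / (n - k)
    return redirected_share, load_factor

share, factor = overload_after_failures(n=5, k=2)
print(f"redirected share = {share:.2f}, load factor = {factor:.0%}")  # 0.67, 167%
```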
Replication and DQ
Replicating the data on disk is cheap
– Storage is cheap, but processing is not
– Serving the data still costs DQ points
Partitioning has no real savings over replication in terms of DQ points
– The same DQ points are needed either way
– In some rare cases, replication can even demand more DQ points
Replication and DQ (contd.)
Replication and partitioning can be combined for better control over availability
– Partition the data first into suitably sized pieces
– Then replicate based on the importance of the data
– It is easier to grow the system via replication than via repartitioning
Replication based on the data's importance
– Decides which data is lost in the event of a fault
– Replicating the key data costs only some extra disks
– A fault can still cause a 1/n data loss, but of less important data
– Replication can also be randomized, so the lost harvest is a random subset of the data, avoiding hotspots in the partitions
Examples
– Inktomi search: partial replication
– E-mail systems: full replication
– Clustered Web caches: no replication
Graceful Degradation
Degradation under faults must be trouble free
Graceful degradation is needed because of
– High peak-to-average load ratios: 1.6:1 to 6:1, and sometimes even 10:1
– Single-event bursts: movie ticket sales, football matches, breaking news
– Natural disasters and power failures: DQ can drop sharply, and these events can occur independently of load peaks
Graceful Degradation under Faults
The DQ principle gives new options: either maintain D and limit Q, or reduce D and maintain Q
Admission control (AC)
– Maintains D, reduces Q
– Maintains harvest
Dynamic database reduction (e.g., cut the effective data size by half)
– Reduces D, maintains Q
– Maintains yield
Graceful degradation can be achieved in varying degrees by combining the two
Key question: how should saturation affect uptime, yield, harvest, and quality of service?
Admission Control Strategies
Cost-based AC
– Perform AC based on the estimated cost of each query (in DQ terms)
– Reduces the average D per query
– Denying one expensive query can let many inexpensive queries through: a net gain in queries and harvest
– A related method is probabilistic AC, which helps ensure that retried queries eventually succeed
– Effect: reduced yield, increased harvest
– (A small code sketch of cost-based AC follows this slide)
Priority- or value-based AC
– Datek handles stock-trade queries differently from other queries: they are executed within 60 seconds or no commission is charged
– Drop low-valued queries and save their DQ points
– Effect: reduced yield, increased harvest
Reduced data freshness
– When saturated, a financial site can let stock quotes expire less frequently
– Reduces not only freshness but also the DQ requirement per query
– Effect: increased yield, reduced harvest
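A minimal sketch of cost-based admission control under saturation, assuming each query arrives with an estimated DQ cost; the greedy cheapest-first policy, the cost numbers, and the budget are all invented for illustration and are not prescribed by the slides.

```python
def admit(queries, dq_budget):
    """Greedy cost-based admission control: admit cheap queries first until
    the DQ budget for this interval is exhausted.

    `queries` is a list of (query_id, estimated_dq_cost) pairs; both the cost
    estimates and the budget are assumed inputs for this sketch.
    """
    admitted, rejected = [], []
    for qid, cost in sorted(queries, key=lambda q: q[1]):
        if cost <= dq_budget:
            dq_budget -= cost
            admitted.append(qid)
        else:
            rejected.append(qid)
    return admitted, rejected

queries = [("q1", 5), ("q2", 50), ("q3", 3), ("q4", 40), ("q5", 2)]
# Denying the two expensive queries lets all the cheap ones through.
print(admit(queries, dq_budget=20))
```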
Disaster Tolerance
Disaster: the complete loss of one or more replicas
– Natural disasters can take out all of the replicas in one geographical location
– Fire or other local disasters typically affect only one replica
Disaster tolerance = managing replica groups plus graceful degradation, applied to disasters
Key questions: how many locations, and how many replicas per location?
– E.g., 2 replicas at each of 3 locations: a natural disaster costs 2 of the 6 replicas
– Each remaining location must then handle 50% more traffic
Inktomi
– Current approach: reduce D by 50% at the remaining locations
– Better approach: reduce D to 2/3 of its value, which lets Q grow by 3/2 (see the sketch below)
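A short sketch of the replica/location arithmetic above: if one of the sites is lost, the survivors must absorb its traffic, and by the DQ principle D must shrink by the same factor that Q grows. The function and its inputs are illustrative only.

```python
from fractions import Fraction

def after_site_loss(locations, lost=1):
    """Traffic growth at the surviving sites and the matching D reduction
    implied by the DQ principle (D * Q roughly constant)."""
    q_growth = Fraction(locations, locations - lost)   # e.g. 3/2 with 3 sites
    d_target = 1 / q_growth                            # e.g. 2/3 of the original D
    return q_growth, d_target

growth, d = after_site_loss(locations=3)
print(f"Q must grow by {growth} -> reduce D to {d} of its value")
```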
Disaster Tolerance (contd.)
Load management is another issue in disaster tolerance
– When entire clusters fail, Layer-4 switches do not help
– DNS failover has long response times (up to several hours)
– Smart clients are better suited to quick failover (seconds to minutes)
Evolution and Growth
Giant-scale services need to be updated frequently
– Product revisions, software bug fixes, security updates, or the addition of new services
– Some problems are hard to detect: slow memory leaks, non-deterministic bugs
– A plan for continued growth is essential
Online evolution: an upgrade process with minimal downtime
Because they are updated so frequently, giant-scale services need
– Acceptable-quality software with a target MTBF
– Minimal MTTR
– No cascading failures
Online Evolution Process
Each online evolution phase costs a certain number of DQ points
– For n nodes with an upgrade time of u per node:
  total DQ loss = n × u × (average DQ per node) = DQ × u
– That is, the loss equals the full-system DQ value times the per-node upgrade time
Upgrades
– Software upgrades are quick, and new and old versions can co-exist
– They can be done as a controlled reboot within the MTTR budget
– Hardware upgrades are harder
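The total-DQ-loss relation above in code form; the node count, per-node DQ value, and upgrade time are made-up inputs.

```python
def upgrade_dq_loss(n_nodes, dq_per_node, upgrade_time_per_node_hours):
    """Total DQ loss of an online upgrade: n * u * (DQ per node) = DQ * u.
    The approaches on the following slides spread this loss differently over
    time (all nodes at once, one at a time, or half at a time)."""
    total_dq = n_nodes * dq_per_node
    return total_dq * upgrade_time_per_node_hours

print(upgrade_dq_loss(n_nodes=100, dq_per_node=1_000, upgrade_time_per_node_hours=0.5))
```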
Upgrade Approaches
Fast reboot
– Quickly reboot the whole cluster into the upgraded version
– Downtime cannot be avoided
– The effect on yield can be contained by scheduling the reboot at off-peak hours
– A staging area and automation are essential
– All nodes are upgraded simultaneously
Upgrade Approaches (contd.)
Rolling upgrade
– Upgrades nodes one at a time, in a wave across the cluster
– Only one node is down at any time
– Old and new versions co-exist, so compatibility between them is a must
– Partitioned system: harvest is affected, yield is not
– Replicated system: harvest and yield are both unaffected
– The upgrade proceeds one replica at a time
– Still conducted at off-peak hours so that a concurrent fault does not hurt yield
Upgrade Approaches (contd.)
Big flip
– The most complicated of the three
– Upgrade the cluster one half at a time
– Switch all traffic away from one half, take it down, and upgrade it
– Turn the upgraded half on, direct new traffic to it, and wait for old traffic on the other half to complete before upgrading it
– Only one version (one half) runs at a time
– 50% DQ loss during the flip
– Replicas: 50% loss of Q (yield)
– Partitions: 50% loss of D (harvest)
The big flip is powerful
– Hardware, OS, schema, networking, and even physical relocation can all be changed
– Inktomi has done it twice
– (A rolling-upgrade sketch follows below for comparison)
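A hedged sketch of rolling-upgrade orchestration, for comparison with the big flip; the drain/upgrade/restore steps are placeholders standing in for whatever the real cluster-management tooling would do, and the node names are invented.

```python
import time

NODES = [f"node{i}" for i in range(1, 5)]   # hypothetical 4-node cluster

def drain(node):    print(f"draining traffic from {node}")
def upgrade(node):  print(f"installing new software on {node}")
def restore(node):  print(f"returning {node} to the load manager")

def rolling_upgrade(nodes, soak_seconds=0):
    """Upgrade nodes one at a time in a wave; old and new versions co-exist,
    so they must be mutually compatible (unlike the big flip, which runs only
    one version at a time)."""
    for node in nodes:
        drain(node)
        upgrade(node)
        restore(node)
        time.sleep(soak_seconds)   # optional soak time before the next node

rolling_upgrade(NODES)
```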
Discussion
Summary
Get the basics right
– Use symmetry to simplify analysis and management
Decide on availability metrics
– Yield and harvest are more useful than uptime alone
Focus on MTTR at least as much as on MTBF
– Repair time is easier to affect for an evolving system
Understand load redirection during faults
– Replication alone is insufficient; the extra DQ demand must be provisioned
Graceful degradation
– Intelligent admission control and dynamic database reduction can help
Use DQ analysis for all upgrades
– Evaluate all upgrade options and their DQ demand in advance, and do capacity planning
Automate upgrades as much as possible
– Develop automatic upgrade mechanisms such as rolling upgrades, and ensure a simple way to revert to the old version