Cluster Computing Overview CS241 Winter 01 © Armando Fox
© 2001 Stanford

Today’s Outline
- Clustering: the Holy Grail
  - The Case For NOW
  - Clustering and Internet Services
  - Meeting the Cluster Challenges
- Cluster case studies
  - GLUnix
  - SNS/TACC
  - DDS?

Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
  - Mirror everything (CPU, storage, power supplies…); can tolerate any single fault (later: processor duplexing)
  - “Hot standby” process pair approach
  - What’s the difference between high availability and fault tolerance?
- Noteworthy
  - “Shared nothing”: why?
  - Performance and efficiency costs?
  - Later evolved into Tandem Himalaya, which used clustering for both higher performance and higher availability

Pre-NOW Clustering in the 90’s
- IBM Parallel Sysplex and DEC OpenVMS
  - Targeted at conservative (read: mainframe) customers
  - Shared disks allowed under both (why?)
  - All devices have cluster-wide names (shared everything?)
  - 1500 installations of Sysplex, 25,000 of OpenVMS Cluster
- Programming the clusters
  - All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
  - OpenVMS: cluster support exists even in the single-node OS!
  - An advantage of locking into a proprietary interface

Networks of Workstations: Holy Grail
Use clusters of workstations instead of a supercomputer.
- The case for NOW
  - Difficult for custom designs to track technology trends (e.g., microprocessor performance increases 50%/yr, but design cycles are 2-4 yrs)
  - No economy of scale in 100s => +$
  - Software incompatibility (OS & apps) => +$$$$
  - “Scale makes availability affordable” (Pfister)
  - “Systems of systems” can aggressively use off-the-shelf hardware and OS software
- New challenges (“the case against NOW”):
  - Performance and bug-tracking vs. a dedicated system
  - Underlying system is changing underneath you
  - Underlying system is poorly documented

Clusters: “Enhanced Standard Litany”
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity

Clustering and Internet Services
- Aggregate capacity
  - TB of disk storage, THz of compute power (if we can harness it in parallel!)
- Redundancy
  - Partial failure behavior: only small fractional degradation from loss of one node
  - Availability: industry average across “large” sites during the 1998 holiday season was 97.2% availability (source: CyberAtlas)
  - Compare: mission-critical systems have “four nines” (99.99%)

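
To put the two availability figures above side by side, the sketch below converts each percentage into expected downtime. The function name `downtime_hours` and the 30-day window are illustrative assumptions, not part of the original slides.

```python
def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Expected downtime in hours over a period, given availability in percent."""
    return period_hours * (1.0 - availability_pct / 100.0)


PERIOD = 30 * 24  # assumed 30-day window, in hours

for label, pct in [("industry average (97.2%)", 97.2),
                   ("four nines (99.99%)", 99.99)]:
    hours = downtime_hours(pct, PERIOD)
    print(f"{label}: {hours:.2f} h down per 30 days ({hours * 60:.1f} min)")
```

Over 30 days this works out to roughly 20 hours of downtime at 97.2% versus a little over 4 minutes at 99.99%.
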
Clustering and Internet Workloads
- Internet vs. “traditional” workloads
  - e.g., database workloads (TPC benchmarks)
  - e.g., traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
  - Read mostly
  - Quality of service (best-effort vs. guarantees)
  - Task granularity
- “Embarrassingly parallel”…why?
  - HTTP is stateless with short-lived requests
  - Web’s architecture has already forced app designers to work around this! (not obvious in 1990)

Meeting the Cluster Challenges
- Software & programming models
- Partial failure and application semantics
- System administration
- Two case studies to contrast programming models
  - GLUnix goal: support “all” traditional Unix apps, providing a single system image
  - SNS/TACC goal: simple programming model for Internet services (caching, transformation, etc.), with good robustness and easy administration

Software Challenges
- What is the programming model for clusters?
  - Explicit message passing (e.g. Active Messages)
  - RPC (but remember the problems that make RPC hard)
  - Shared memory/network RAM (e.g. Yahoo! directory)
  - Traditional OOP with object migration (“network transparency”): not relevant for Internet workload?
- Programming model should support decent failure semantics and exploit inherent modularity of clusters
  - Traditional uniprocessor programming idioms/models don’t seem to scale up to clusters
  - Question: Is there a “natural to use” cluster model that scales down to uniprocessors, at least for Internet-like workloads?
  - Later in the quarter we’ll take a shot at this

Partial Failure Management
- What does partial failure mean for…
  - A transactional database?
  - A read-only database striped across cluster nodes?
  - A compute-intensive shared service?
- What are appropriate “partial failure abstractions”?
  - Incomplete/imprecise results?
  - Longer latency?
- What current programming idioms make partial failure hard?
  - Hint: remember the original RPC papers?

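
One way to picture the “incomplete results” abstraction for a read-only database striped across nodes is sketched below. Everything here is hypothetical: the `PartialResult` type, the `striped_query` helper, and the use of `None` to stand in for a failed node are assumptions made for illustration, not a design taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class PartialResult:
    hits: list      # matches from the partitions that answered
    answered: int   # number of partitions that responded
    total: int      # number of partitions queried

    @property
    def coverage(self) -> float:
        """Fraction of the striped data that this answer reflects."""
        return self.answered / self.total if self.total else 0.0


def striped_query(partitions, predicate):
    """Query every partition; a partition of None stands in for a failed node."""
    hits, answered = [], 0
    for part in partitions:
        if part is None:  # hypothetical failure marker
            continue
        answered += 1
        hits.extend(row for row in part if predicate(row))
    return PartialResult(hits, answered, len(partitions))


# Three stripes, the middle node is down.
partitions = [["apple", "avocado"], None, ["apricot", "banana"]]
result = striped_query(partitions, lambda r: r.startswith("a"))
print(result.hits, f"coverage={result.coverage:.2f}")
```

The caller gets an answer immediately, along with an explicit statement of how much of the data it covers, instead of an error or an indefinite wait.
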
System Administration on a Cluster
Thanks to Eric Anderson (1998) for some of this material.
- Total cost of ownership (TCO) way high for clusters due to administration costs
- Previous solutions
  - Pay someone to watch
  - Ignore or wait for someone to complain
  - “Shell Scripts From Hell” (not general; vast repeated work)
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data

System Administration, cont’d.
Extensible Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)
- Relational tables allow properties & queries of interest to evolve as the cluster evolves
- Extensive visualization support allows humans to make sense of masses of data
- Multiple levels of caching decouple data collection from aggregation
- Data updates can be “pulled” on demand or triggered by push

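
As a rough illustration of why relational tables let monitored properties and queries evolve, the sketch below stores samples in an in-memory SQLite table. The `(node, metric, value)` schema, the metric names, and the `aggregate` helper are assumptions chosen for this example; they are not the schema used by Anderson & Patterson’s system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per (node, metric) sample.
conn.execute("CREATE TABLE samples (node TEXT, metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("u1", "cpu_load", 0.9), ("u2", "cpu_load", 0.3),
     ("u1", "disk_free_gb", 1.5), ("u2", "disk_free_gb", 7.0)],
)


def aggregate(metric):
    """Return (average, maximum) of a metric across all nodes."""
    return conn.execute(
        "SELECT AVG(value), MAX(value) FROM samples WHERE metric = ?",
        (metric,),
    ).fetchone()


print(aggregate("cpu_load"))
```

With this layout, tracking a new property is just a matter of inserting rows with a new metric name, and a new question is a new query; neither requires changing the collection code.
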
Visualizing Data: Example
- Display aggregates of various interesting machine properties on the NOW’s
- Note use of aggregation, color

Case Study: The Berkeley NOW
- History and pictures of an early research cluster
  - NOW-0: four HP-735’s
  - NOW-1: 32 headless Sparc-10’s and Sparc-20’s
  - NOW-2: 100 UltraSparc 1’s, Myrinet interconnect
  - inktomi.berkeley.edu: four Sparc-10’s
  - Ultra’s, 200 CPU’s total
  - NOW-3: eight 4-way SMP’s
- Myrinet interconnection
  - In addition to commodity switched Ethernet
  - Originally Sparc SBus, now available on PCI bus

The Adventures of NOW: Applications
- AlphaSort: 8.41 GB in one minute, 95 UltraSparcs
  - Runner-up: Ordinal Systems nSort on SGI Origin, 5 GB
  - Pre-1997 record: 1.6 GB on an SGI Challenge
- 40-bit DES key crack in 3.5 hours
  - “NOW+”: headless and some headed machines
- inktomi.berkeley.edu (now inktomi.com)
  - Now fastest search engine, largest aggregate capacity
- TranSend proxy & Top Gun Wingman Pilot browser
  - ~15,000 users, 3-10 machines

NOW: GLUnix
- Original goals:
  - High availability through redundancy
  - Load balancing, self-management
  - Binary compatibility
  - Both batch and parallel-job support
- I.e., single system image for NOW users
  - Cluster abstractions == Unix abstractions
  - This is both good and bad…what’s missing compared to early 90’s proprietary cluster systems?
- For portability and rapid development, build on top of off-the-shelf OS (Solaris)

GLUnix Architecture
- Master collects load, status, etc. info from daemons
  - Repository of cluster state, centralized resource allocation
  - Pros/cons of this approach?
- Glib app library talks to GLUnix master as app proxy
  - Signal catching, process mgmt, I/O redirection, etc.
  - Death of daemon is treated as a SIGKILL by master
[Figure: one GLUnix master per cluster, connected to a glud daemon on each NOW node.]

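
The centralized design above can be sketched in a few lines: daemons report load to a single master, which keeps the cluster state and picks a node for each job. The `Master` class, its method names, and the “one unit of load per job” rule are hypothetical simplifications for illustration; this is not the GLUnix API.

```python
class Master:
    """Hypothetical stand-in for a centralized cluster master."""

    def __init__(self):
        self.load = {}  # node name -> last reported load

    def report(self, node, load):
        """Called on behalf of a per-node daemon with its current load."""
        self.load[node] = load

    def node_died(self, node):
        """Forget a node whose daemon has died."""
        self.load.pop(node, None)

    def allocate(self):
        """Pick the least-loaded live node for the next job."""
        if not self.load:
            raise RuntimeError("no live nodes")
        node = min(self.load, key=self.load.get)
        self.load[node] += 1.0  # assumption: each job adds one unit of load
        return node


master = Master()
master.report("now-a", 2.0)
master.report("now-b", 0.5)
print(master.allocate())
```

The sketch also makes the trade-off visible: allocation is simple because all state lives in one place, but that one place is a single point of failure and a potential bottleneck.
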
GLUnix Retrospective
- Trends that changed the assumptions
  - SMP’s have replaced MPP’s, and are tougher to compete with for MPP workloads
  - Kernels have become extensible
- Final features vs. initial goals
  - Tools: glurun, glumake (2nd most popular use of NOW!), glups/glukill, glustat, glureserve
  - Remote execution, but not total transparency
  - Load balancing/distribution, but not transparent migration/failover
  - Redundancy for high availability, but not for the “GLUnix master” node
- Philosophy: Did GLUnix ask the right question (for our purposes)?

TACC/SNS
- Specialized cluster runtime to host Web-like workloads
  - TACC: transformation, aggregation, caching and customization: elements of an Internet service
  - Build apps from composable modules, Unix-pipeline-style
- Goal: complete separation of *ility concerns from application logic
  - Legacy code encapsulation, multiple language support
  - Insulate programmers from nasty engineering

TACC Examples
- Simple search engine
  - Query crawler’s DB
  - Cache recent searches
  - Customize UI/presentation
- Simple transformation proxy
  - On-the-fly lossy compression of inline images (GIF, JPG, etc.)
  - Cache original & transformed
  - User specifies aggressiveness, “refinement” UI, etc.
[Figure: the two examples drawn as compositions of T, A, C, and $ modules, with labels “DB” and “html”.]

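
The “composable modules, Unix-pipeline-style” idea behind these examples can be approximated as function composition with a cache wrapped around it. This is a minimal sketch under stated assumptions: `compose`, `cached`, `strip_markup`, and `customize_for` are hypothetical names invented here, and real TACC workers are separate processes on a cluster, not in-process functions.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right, Unix-pipeline-style."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


def cached(stage):
    """Wrap a stage so repeated inputs are served from a cache."""
    store = {}

    def wrapper(data):
        if data not in store:
            store[data] = stage(data)
        return store[data]

    wrapper.store = store  # exposed only so the example can inspect it
    return wrapper


def strip_markup(text):  # hypothetical transformation worker
    return text.replace("<b>", "").replace("</b>", "")


def customize_for(user_prefs):  # hypothetical customization worker
    def stage(text):
        return text.upper() if user_prefs.get("shout") else text
    return stage


pipeline = cached(compose(strip_markup, customize_for({"shout": True})))
print(pipeline("<b>hello</b> cluster"))  # prints "HELLO CLUSTER"
```

Each stage knows nothing about the others, which is what lets concerns like caching be added around application logic rather than inside it.
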
Cluster-Based TACC Server
- Component replication for scaling and availability
- High-bandwidth, low-latency interconnect
- Incremental scaling: commodity PC’s
[Figure: cluster-based TACC server. Labeled components: front ends (FE), caches ($), workers (W), user profile database, load balancing & fault tolerance (LB/FT), and administration interface (GUI), joined by the interconnect.]

“Starfish” Availability: LB Death
- FE detects via broken pipe/timeout, restarts LB
- New LB announces itself (multicast), contacted by workers, gradually rebuilds load tables
- If partition heals, extra LB’s commit suicide
- FE’s operate using cached LB info during failure
[Figure: the cluster diagram (FE, caches, workers, LB/FT, interconnect) with the callouts above.]

SNS Availability Mechanisms
- Soft state everywhere
  - Multicast-based announce/listen to refresh the state
  - Idea stolen from multicast routing in the Internet!
- Process peers watch each other
  - Because of no hard state, “recovery” == “restart”
  - Because of multicast level of indirection, don’t need a location directory for resources
- Load balancing, hot updates, migration are “easy”
  - Shoot down a worker, and it will recover
  - Upgrade == install new software, shoot down old
  - Mostly graceful degradation

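
The soft-state, announce/listen pattern described above can be sketched as a table whose entries expire unless refreshed. This is a toy model under stated assumptions: `SoftStateTable`, its methods, and the explicit `now` argument (used instead of a real clock and real multicast so the example is deterministic) are hypothetical and not taken from the SNS implementation.

```python
class SoftStateTable:
    """Listener-side table: peers stay listed only while they keep announcing."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.last_heard = {}  # peer -> time of last announcement

    def announce(self, peer, now):
        """Record a periodic announcement received from a peer."""
        self.last_heard[peer] = now

    def live_peers(self, now):
        """Drop entries not refreshed within ttl, then list the survivors."""
        self.last_heard = {
            p: t for p, t in self.last_heard.items() if now - t <= self.ttl
        }
        return sorted(self.last_heard)


table = SoftStateTable(ttl=3)
table.announce("worker-1", now=0)
table.announce("worker-2", now=0)
table.announce("worker-1", now=2)   # worker-2 has gone silent
print(table.live_peers(now=4))      # worker-2 has timed out
table.announce("worker-2", now=5)   # a restarted worker simply announces again
print(table.live_peers(now=5))
```

No explicit “remove” or “recover” message exists: a crashed worker disappears by timing out, and a restarted one reappears by announcing, which is why recovery reduces to restart.
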
SNS Availability Mechanisms, cont’d.
- Orthogonal mechanisms
  - Composition without interfaces
  - Example: Scalable Reliable Multicast (SRM) group state management with SNS
  - Eliminates O(n^2) complexity of composing modules
  - State space of failure mechanisms is easy to reason about
- What’s the cost?
- More on orthogonal mechanisms later

Administering SNS
- Multicast means monitor can run anywhere on cluster
- Extensible via self-describing data structures and mobile code in Tcl

Clusters Summary
- Many approaches to clustering, software transparency, failure semantics
  - An end-to-end problem that is often application-specific
  - We’ll see this again at the application level in harvest vs. yield discussion
- Internet workloads are a particularly good match for clusters
  - What software support is needed to mate these two things?
  - What new abstractions do we want for writing failure-tolerant applications in light of these techniques?