1 CS 525 Advanced Distributed Systems Spring 2011 Indranil Gupta (Indy) Distributed Monitoring April 5, 2011 All Slides © IG Acknowledgments: Some slides by Steve Ko, Jin Liang, Ahmed Khurshid, Abdullah Al-Nayeem

2 Distributed Applications run over Many Environments: the Internet, Gnutella peer-to-peer system, PlanetLab, Grids, Datacenters, Cloud Computing. Distributed applications are large scale (1000's of nodes), with unreliable nodes and network, and churn (node join, leave, failure).

3 Management and Monitoring of Distributed Applications Monitoring of nodes, system-wide or per-node. Many applications can benefit from this, e.g., –DNS, cooperative caching, CDN, streaming, etc., on PlanetLab –Web hosting in server farms, data centers, Grid computations, etc. A new and important problem direction for the next decade –[CRA03], [NSF WG 05], [IBM, HP, Google, Amazon] Goal: more end-to-end than cluster or network management. Today typically constitutes 33% of TCO of distributed infrastructures –Will only get worse with consolidation of data centers, and expansion and use of PlanetLab

4 What do sysadmins want? Two types of Monitoring Problems: I. Instant (on-demand) Queries across node population, requiring up-to-date answers –{average, max, min, top-k, bottom-k, histogram, etc.} –for {CPU util, RAM util, disk space util, app. characteristics, etc.} –E.g., max CPU, top-5 CPU, avg RAM, etc. II. Long-term Monitoring of node contribution –availability, bandwidth, computation power, disk space Requirements: Low bandwidth –For instant queries, since infrequent –For long-term monitoring, since node investment Low memory and computation Scalability Addresses failures and churn Good performance and response

5 Existing Solutions: Bird's Eye View [Diagram: existing solutions plotted along a centralized–decentralized axis (decentralized overlays), for both instant queries and long-term monitoring]

6 Existing Monitoring Solutions Centralized/Infrastructure-based Decentralized

7 Existing Monitoring Solutions Centralized/Infrastructure-based: user scripts, CoMon, Tivoli, Condor, server+backend, etc. –Efficient and enable long-term monitoring, but: 1. Provide stale answers to instant queries (data collection throttled due to scale) –CoMon collection: 5-min intervals –HP OpenView: 6 hours to collect data from 6000 servers! 2. Often require infrastructure to be maintained –No in-network aggregation –Could scale better Decentralized

8 Existing Monitoring Solutions Centralized/Infrastructure-based Decentralized: Astrolabe, Ganglia, SWORD, Plush, SDIMS, etc. –Nodes organized in an overlay graph, where nodes maintain neighbors according to overlay rules E.g., distributed hash tables (DHTs) – Pastry-based E.g., hierarchy – Astrolabe –Can answer instant queries but need to be maintained all the time Nodes spend resources on maintaining peers according to overlay rules => Complex failure repair –Churn => node needs to change its neighbors to satisfy overlay rules –(Can you do a quick and dirty overlay, without maintenance? → MON)

9 Another Bird's Eye View [Diagram: solutions plotted by (data) scale vs. (attribute and churn) dynamism – centralized (e.g., DB) and scaling centralized (e.g., replicated DB) approaches for static attributes (e.g., attr: CPU type); decentralized approaches for dynamic attributes (e.g., attr: CPU util.)]

10 MON: Instant Queries for Distributed System Management

11 MON Query Language 1. select avg(<attribute>) [where <condition>] 2. select top k <attribute> [where <condition>] 3. select histo(<attribute>) [where <condition>] 4. select <attribute> [where <condition>] 5. select grep(<pattern>, <file>) [where <condition>] 6. select run(<command>) [where <condition>] 7. count and depth: number of nodes in, and tree-depth of, overlay 8. push <software> <attribute> = either –system metric(s), e.g., CPU, RAM, disk, or –application-based metrics, e.g., average neighbor delay (p2p streaming), number of neighbors (partitionability of overlay), etc. <condition> = any boolean expression over the system resources –E.g., CPU > 50
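As a concrete illustration of in-network aggregation for an instant query, here is a minimal Python sketch (the class, attribute names, and data are illustrative assumptions, not MON's implementation) of how select avg(<attribute>) could be answered by having each node combine its own value with partial (sum, count) results from its children in the on-demand tree:

```python
# Hypothetical sketch: answering "select avg(cpu_util)" by in-network
# aggregation up a query tree. Names and structure are illustrative.

class Node:
    def __init__(self, name, cpu_util, children=None):
        self.name = name
        self.cpu_util = cpu_util          # local system metric
        self.children = children or []    # children in the on-demand tree

    def query_avg_cpu(self):
        """Return (sum, count) for this subtree; parents combine partials."""
        total, count = self.cpu_util, 1
        for child in self.children:
            c_sum, c_count = child.query_avg_cpu()
            total += c_sum
            count += c_count
        return total, count

# Small example tree (k = 2 children per internal node)
leaves = [Node(f"n{i}", cpu_util=10.0 * i) for i in range(4, 8)]
root = Node("n1", 5.0, [Node("n2", 20.0, leaves[:2]),
                        Node("n3", 15.0, leaves[2:])])

s, c = root.query_avg_cpu()
print(f"avg CPU over {c} nodes = {s / c:.1f}%")
```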

12 MON: Management Overlay Networks Supports Instant queries –Need instantaneous answers –Inconsistency ok Push software updates Basic Idea: Ephemeral overlays 1. For each query, build an overlay on-demand 2. Use overlay to complete query 3. Do not maintain on-demand overlay [Diagram: a query injected into a set of nodes n1–n6]

13 Why On-Demand? Maintained overlays, e.g., DHT-based overlays: maintenance bandwidth, complex failure repair. On-demand approach is: Simple Light-weight Suited to management –Sporadic usage –Amenable to overlay reuse [MON Architecture diagram: management commands trigger on-demand overlay construction; each node maintains a small list of neighbors; spectrum from maintained overlays (e.g., DHT-based) to on-demand overlays]

14 Membership by Gossip Partial membership list at each node (asymmetric) Contains a random few other nodes; fixed in size (log(N)) Periodic membership exchange [SCAMP01] [SWIM02] [TMAN04] [CYCLON06] –Measure delay –Detect failure: use age (last-heard-from time) to eliminate old entries – O(log(N)) time for spreading failure information [EGHKK03, DGHIL85] Weakly consistent – may contain stale entries [Figure: example membership list at node n3 with columns node, IP:Port, RTT, containing entries for n4, n5, n6]
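A minimal sketch of one gossip round, under assumed data structures (the Member class, list size, and age threshold are illustrative constants, not the protocol's actual parameters): a node merges a random peer's partial membership list with its own, keeps the freshest entry per node, ages out stale entries, and truncates back to a fixed size:

```python
# Sketch of gossip-based partial membership maintenance (illustrative only).
import random

LIST_SIZE = 8      # ~ log(N) entries, illustrative constant
MAX_AGE = 5        # rounds before an entry is considered stale

class Member:
    def __init__(self, node_id, age=0):
        self.node_id, self.age = node_id, age

def gossip_round(my_list, peer_list):
    """Merge a peer's membership list into ours, keeping the freshest entries."""
    merged = {}
    # entries learned from the peer are one "hop" (round) older to us
    for m in my_list + [Member(p.node_id, p.age + 1) for p in peer_list]:
        if m.node_id not in merged or m.age < merged[m.node_id].age:
            merged[m.node_id] = m
    # drop entries not heard about for too long, then keep a fixed-size sample
    alive = [m for m in merged.values() if m.age <= MAX_AGE]
    random.shuffle(alive)
    return alive[:LIST_SIZE]

mine = [Member("n2", 1), Member("n4", 6)]          # n4 is already stale here
theirs = [Member("n4", 0), Member("n5", 2)]        # peer heard from n4 recently
print([(m.node_id, m.age) for m in gossip_round(mine, theirs)])
```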

15 On-demand trees: Randomized Algorithms Simple algorithm (randk) –Each node randomly selects k children from its membership list –Each child acts recursively Improved Algorithm (twostage) –Membership augmented with list of “nearby” nodes Nearby nodes discovered via gossip –Two stage construction of tree First h hops – select random children After h hops – select local children DAG construction: similar to tree Weakly consistent membership list is ok –Retry prospective children –Or settle for fewer than k children
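The following sketch simulates the simple randk algorithm centrally for clarity (in the real protocol each node contacts its chosen children over the network, and a child already in the tree simply declines); the membership map and node names are assumptions:

```python
# Sketch of "randk" on-demand tree construction: each node picks k random
# children from its (weakly consistent) membership list and recurses.
import random

def build_tree(node, membership, k, in_tree):
    """membership: dict node -> list of known peers; in_tree: nodes already added."""
    candidates = [p for p in membership.get(node, []) if p not in in_tree]
    children = random.sample(candidates, min(k, len(candidates)))
    in_tree.update(children)
    return {"node": node,
            "children": [build_tree(c, membership, k, in_tree) for c in children]}

# Example: 6 nodes, each with a full membership list for simplicity
names = ["n1", "n2", "n3", "n4", "n5", "n6"]
peers = {n: [m for m in names if m != n] for n in names}
print(build_tree("n1", peers, k=2, in_tree={"n1"}))
```

The twostage variant would differ only in the child-selection step: random children for the first h hops, then "nearby" children (discovered via gossip) after that.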

16 Tree Building Example [Diagram: an on-demand tree built over nodes n1–n6 with k=2 children per node]

17 Software Push Receiver-driven, multi-parent download DAG structure helps: bandwidth, latency, failure-resilience
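A rough sketch of the receiver-driven, multi-parent idea (the block/parent bookkeeping here is an assumption for illustration): a node learns which blocks of the update each of its DAG parents holds and pulls each missing block from some parent that has it, so no single parent is a bottleneck or a single point of failure:

```python
# Illustrative sketch of receiver-driven software push over a DAG.
def fetch_update(parents, total_blocks):
    """parents: dict parent_id -> set of block ids that parent currently holds."""
    have = set()
    for block in range(total_blocks):
        for parent, blocks in parents.items():
            if block in blocks:
                have.add(block)          # "download" this block from that parent
                break
    return have == set(range(total_blocks))

# Two parents that together hold the whole 8-block update
print(fetch_update({"p1": {0, 1, 2, 3, 5}, "p2": {3, 4, 5, 6, 7}}, 8))  # True
```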

18 Tree Construction Performance PlanetLab slice of 330 hosts; bandwidth = 10 Bps, 10 s gossip interval, k = 5 children. Median response time is less than 1.5 seconds; 95% of responses in less than 2 seconds; 97% coverage.

19 Software Push Bandwidth DAG median bandwidth is about the same as the tree, but the DAG is faster overall due to replication. (PlanetLab slice of 330 hosts.)

20 Comparison with DHT Tree Scribe/SDIMS trees built over the Pastry DHT, in a PlanetLab slice with 115 nodes. Pastry takes about twice as long to answer queries and does not ensure coverage when there are failures (or spends persistent bandwidth on routing table repair).

21 On-demand vs. Maintained Overlay Choice depends on Query Rate = q per second Maintained overlay –Neighbor Maintenance (up-to-date neighbors) = B Bps –Per-query cost for k-child selection = C Bytes –Bandwidth = B + q·C On-demand approach –Weakly consistent neighbors = B/m Bps (m > 1) –Per-query cost = m·C Bytes –Bandwidth = B/m + q·m·C On-demand preferable when B/m + q·m·C < B + q·C, i.e., q < B/(m·C) In MON, B = 10 Bps and C = 400 Bytes On-demand approach preferable for query rates of under one query every 40.4 seconds –Queries injected by users have think times of several minutes –Query rate far better than centralized collection tools, which have periods of O(minutes) –Yet saves bandwidth
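A small worked example of the break-even condition above, using the slide's B and C; the values of m are illustrative (the slide only states m > 1), and the slide's conclusion of roughly one query every 40 seconds corresponds to m just above 1 under this inequality:

```python
# Break-even query rate: on-demand wins when  B/m + m*q*C < B + q*C,
# i.e.  q < B / (m*C).  B and C are from the slide; m values are assumed.
B = 10.0      # maintenance bandwidth of a maintained overlay (Bytes/s)
C = 400.0     # per-query cost of selecting k children (Bytes)

for m in (1.01, 2.0, 4.0):
    q_max = B / (m * C)               # queries per second
    print(f"m={m}: on-demand wins below {q_max:.4f} queries/s "
          f"(~1 query every {1 / q_max:.0f} s)")
```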

22 Spanning the Spectrum How long can an on-demand overlay survive without maintenance? MON overlays in PlanetLab (325-node slice) can be reused for up to 6-30 minutes, as a function of max_drop = maximum number of dropped nodes from the overlay that the application can tolerate (kept track of by MON). [Spectrum: maintained overlays (e.g., DHT-based) ↔ on-demand overlays]

23 Discussion Using partial DHTs to build better on-demand trees? On-demand DHTs? On-demand data structures for anything? What about groups that do not span the entire system?

24 Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining

25 User Interface Query Datacenter as a Database –SQL queries on the datacenter Each server contributes one or more tuples E.g.: grep for a file name – SELECT COUNT(*) AS file_count FROM files WHERE name = ‘game.db’ A “database” = A “Management Information Base” (MIB) Astrolabe = distributed MIB
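A minimal sketch of the "datacenter as a database" view using an in-memory SQLite table (the schema and rows are illustrative assumptions, not Astrolabe's MIB format): each server conceptually contributes rows, and the operator issues ordinary SQL aggregates such as the slide's example query:

```python
# Illustrative: every server contributes tuples to a logical table,
# and operators run SQL aggregates over it. Schema/data are assumed.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (host TEXT, name TEXT)")
db.executemany("INSERT INTO files VALUES (?, ?)",
               [("n1", "game.db"), ("n2", "notes.txt"), ("n4", "game.db")])

# The slide's example query: grep for a file name across all servers
(count,) = db.execute(
    "SELECT COUNT(*) AS file_count FROM files WHERE name = 'game.db'").fetchone()
print(count)  # 2
```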

26 Astrolabe Zone Hierarchy [Diagram: zone tree rooted at / with internal zones /uiuc, /cornell, /berkeley, /uiuc/cs, /uiuc/ece, /cornell/cs, /berkeley/eecs and leaf zones such as /uiuc/ece/n1, /uiuc/cs/n4, /uiuc/cs/n6, /cornell/n2, /cornell/cs/n3, /berkeley/eecs/n5, /berkeley/eecs/n7, /berkeley/eecs/n8] Zone ~ domain name but customizable Zone hierarchy is determined by administrators Each host runs an Astrolabe agent

27 Astrolabe = Decentralized MIB [Diagram: each zone's MIB row is computed from its children's MIBs with a user-defined aggregation function, e.g., SELECT MIN(Load) AS Load aggregating Load = 0.3 at /uiuc/cs/n4 and Load = 0.5 at /uiuc/cs/n6 up through /uiuc/cs and /uiuc, with per-row timestamps] Other aggregation funcs: MAX(attribute), SUM(attribute), AVG(attribute), FIRST(n, attribute). Updates aggregated lazily up the tree, using a user-defined aggregation function.
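As a sketch of how an attribute propagates up the zone hierarchy (the data layout and function names are assumptions, not Astrolabe's code), the user-defined aggregation function, here MIN(Load), is applied recursively over child zones' MIBs:

```python
# Illustrative: aggregate an attribute up the zone hierarchy with a
# user-defined aggregation function (MIN of Load in this example).
def aggregate(zone, agg=min):
    """zone: {'name': str, 'load': float} for a leaf, or
             {'name': str, 'children': [zones]} for an internal zone."""
    if "children" not in zone:
        return zone["load"]
    return agg(aggregate(child, agg) for child in zone["children"])

uiuc = {"name": "/uiuc", "children": [
    {"name": "/uiuc/cs", "children": [{"name": "/uiuc/cs/n4", "load": 0.3},
                                      {"name": "/uiuc/cs/n6", "load": 0.5}]},
    {"name": "/uiuc/ece", "children": [{"name": "/uiuc/ece/n1", "load": 0.8}]},
]}
print(aggregate(uiuc))   # 0.3 -- the MIN propagated up to /uiuc
```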

28 Astrolabe's workings, a little more complex [Diagram: local MIBs at /uiuc/cs/n4 and /uiuc/cs/n6 with per-node attributes such as system, process, files, Load, Disk, and Service A's version/progress, plus the zone MIBs for /uiuc/cs, /uiuc/ece, /uiuc, /cornell, /berkeley, and /] Agent (/uiuc/cs/n6) has a local copy of these ancestor MIBs + the MIBs of the siblings of these ancestors.

29 Gossip Protocol Conceptually: –Sibling zones gossip and exchange the MIBs of all their sibling zones –This propagates information upwards eventually Leaf zones correspond to actual servers; an internal zone is the collection of servers in that subtree. In reality: Each agent, periodically, selects another agent at random, and exchanges information with it –If the two agents are in the same zone, they exchange MIB information about that zone –If in different zones, they exchange MIB information about their lowest common ancestor zone –And then gossip for all ancestor zones

30 Gossip Protocol (continued) For efficiency, each zone elects a set of leader servers to act as representatives of that zone –Representatives participate in gossip protocol, then propagate information down to other servers in that zone (also via gossip) –Agent may be elected to represent multiple zones, but no more zones than its # ancestors How gossip happens at a representative agent: –Pick a zone (=ancestor) to gossip in –Pick a child of that ancestor –Randomly pick one of the contact agents from that zone –Gossip messages pertaining to MIBs of that zone, and all its ancestor zones (up to the root) –Gossip results in merge of entries, based on timestamps (timestamps assumed global)
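The core of such an exchange is a timestamp-based merge of MIB rows. Here is a small, assumed-layout sketch (not Astrolabe's code) of two representatives of a zone merging their views of that zone's children, with the newer timestamp winning per row:

```python
# Illustrative merge step for one zone's MIB during gossip:
# for each child-zone row, the entry with the newer timestamp wins.
def merge_zone_mibs(mine, theirs):
    """mine/theirs: dict child_zone -> (timestamp, attributes). Returns merged dict."""
    merged = dict(mine)
    for child, (ts, attrs) in theirs.items():
        if child not in merged or ts > merged[child][0]:
            merged[child] = (ts, attrs)
    return merged

# Two representatives of /uiuc exchange rows about /uiuc's children
a = {"/uiuc/cs":  (121, {"Load": 0.3}), "/uiuc/ece": (90,  {"Load": 0.8})}
b = {"/uiuc/cs":  (101, {"Load": 0.6}), "/uiuc/ece": (130, {"Load": 0.2})}
print(merge_zone_mibs(a, b))
# {'/uiuc/cs': (121, {'Load': 0.3}), '/uiuc/ece': (130, {'Load': 0.2})}
```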

31 Etcetera Eventual consistency guarantee for data: each update is eventually propagated; if updates cease, everyone converges. AFCs (aggregation function certificates): programmable aggregation functions, propagated throughout the system. Membership protocol: similar to gossip-style membership, plus techniques akin to Bimodal Multicast. Experimental results: simulations; see paper. Astrolabe (or a variant of it) is rumored to be running inside Amazon's EC2/S3 cloud

32 Discussion Non-leaf zones: have representative leaders vs. make all descendants responsible? –What are the tradeoffs? Up-to-dateness of answers to queries? Truly on-demand querying system? Timestamps assumed to be global – why may this be ok/not ok?

33 Moara: Flexible and Scalable Group- Based Querying System

34 Querying Groups of Nodes [Charts: PlanetLab slice sizes; HP's Utility Computing]

35 Problem and Approach Query groups of nodes –Groups are typically small –Groups are dynamic Groups are specified implicitly, via a predicate –E.g., ((CPU util < 10%) and (rack = R1)) or (Mem util < 500 MB and (rack=R2 or rack=R1)) –That is, logical expressions: (A and B) or (C and (D or E)) –Each expression is composed of basic terms (e.g., CPU util < 10%) One approach: flood the query. Bad! Especially for repeated queries. Moara's approach: –Query a small set of servers that would be a superset of the group (that is, of those matching the predicate) –For repeated queries, maintain overlay –Overlay = collection of trees, one tree per basic term (e.g., CPU util < 10%) –Optimize tree management cost vs. query cost
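To make "implicit groups via a predicate" concrete, here is a tiny illustrative sketch (attribute names and the predicate encoding are assumptions): the group is simply the set of nodes whose attributes satisfy the boolean expression:

```python
# Illustrative: a Moara-style implicit group is the set of nodes whose
# attributes satisfy a boolean predicate over basic terms.
nodes = {
    "n1": {"cpu": 5,  "rack": "R1", "mem_mb": 800},
    "n2": {"cpu": 40, "rack": "R1", "mem_mb": 600},
    "n3": {"cpu": 8,  "rack": "R2", "mem_mb": 450},
}

def in_group(attrs):
    # ((CPU util < 10%) and (rack = R1)) or (Mem util < 500 MB and rack in {R1, R2})
    return ((attrs["cpu"] < 10 and attrs["rack"] == "R1")
            or (attrs["mem_mb"] < 500 and attrs["rack"] in ("R1", "R2")))

print([n for n, a in nodes.items() if in_group(a)])   # ['n1', 'n3']
```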

36 Query Rewriting Rewrite query into conjunctive normal form (CNF) –Provably gives the lowest number of terms Reduce #terms: –Use covers based on tree size to reduce terms: cover(A and B) = min(cover(A), cover(B), cover(A ∪ B)) –Use semantic information E.g., CPU < 10% is a subset of CPU < 20%
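A minimal sketch of the cover rule as stated on the slide (the tree-size estimates are made up for illustration): for a conjunction (A and B), any tree covering A, covering B, or covering A ∪ B is a superset of the target group, so the cheapest of the three is chosen:

```python
# Illustrative cover selection for a conjunction (A and B): pick the
# cheapest valid cover, measured by estimated tree size.
def cover_cost(tree_sizes):
    """tree_sizes: estimated reachable nodes for the trees of A, B, and A∪B."""
    return min(tree_sizes["A"], tree_sizes["B"], tree_sizes["A or B"])

sizes = {"A": 120, "B": 40, "A or B": 140}   # made-up estimates
print(cover_cost(sizes))   # 40: query B's tree, the smallest valid cover
```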

37 Example Moara Tree [Diagram: a tree whose nodes are labeled T (CPU-Util > 50%) or F (CPU-Util <= 50%)] Key idea: Build one tree per term (or reuse the tree for that term if it already exists)

38 Moara's Tree Maintenance Two extreme candidate approaches: –Never update tree Zero management cost, but high query cost (flood) May be ok if query rate < churn rate –Aggressively prune out subtrees that do not satisfy the term Low query cost, but management cost is high if churn rate is high May be ok if churn rate < query rate Query rate is user-based; churn rate is system- and term-dependent. Churn rate differs for each node ⇒ need a decentralized protocol for maintaining each part of a given tree, in order to minimize overall bandwidth utilization ⇒ you get the best of all worlds in terms of bandwidth

39 State Machine at Each Moara node SAT = This subtree satisfies term (some node in it) PRUNE = tell parent to not forward queries to this subtree UPDATE = node will update its PRUNE variable at parent as satisfiability changes Adaptation policy is local at each node Compares bandwidth cost based on: Local rate of change in SAT, and Local rate of queries seen
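A small sketch of that local adaptation decision (the per-message byte costs are illustrative assumptions): a node enables UPDATE mode only when the bandwidth spent propagating SAT changes to its parent is lower than the bandwidth saved by pruning forwarded queries:

```python
# Illustrative local adaptation policy: compare the bandwidth of keeping the
# parent's PRUNE state current against the bandwidth of receiving queries
# that could have been pruned. Constants below are assumptions.
UPDATE_MSG_BYTES = 50    # cost of telling the parent "prune/unprune me"
QUERY_MSG_BYTES = 100    # cost of one forwarded query into this subtree

def should_update(sat_change_rate, query_rate):
    """Rates are events per second, measured locally at the node."""
    update_cost = sat_change_rate * UPDATE_MSG_BYTES     # Bps spent on updates
    prune_savings = query_rate * QUERY_MSG_BYTES         # Bps saved by pruning
    return update_cost < prune_savings

print(should_update(sat_change_rate=0.5,  query_rate=0.1))  # False: too churny
print(should_update(sat_change_rate=0.01, query_rate=0.1))  # True: queries dominate
```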

40 Emulab (500 instances) [Plots: query latency (ms) and messages per query]

41 Simulation (10,000 instances) [Plot: messages per node under group churn and queries, comparing no tree management (one global tree), Moara with aggressive tree management, and Moara with adaptive tree management]

42 Discussion Tree per term, and potentially per predicate: too many trees? –How do you garbage collect entire trees? –What information do you need to maintain the right set of trees? –Interesting optimization problem! What language do sysadmins like to query in? Sysadmins often care about how information is presented visually (e.g., CoMon, Zenoss) –Automatically inferred queries? –Artificial Intelligence/Machine Learning techniques to learn which queries sysadmins want most?

43 Questions?