ACME: a platform for benchmarking distributed applications David Oppenheimer, Vitaliy Vatkovskiy, and David Patterson ROC Retreat 12 Jan 2003.


Motivation

Benchmarking large-scale distributed apps (peer-to-peer, Grid, CDNs, ...) is difficult:
- very large scale (1,000s-10,000s of nodes) => need scalable measurement and control
- nodes and network links will fail => need robust measurement and control
- large variety of possible applications => need standard interfaces for measurement and control

ACME: a platform that developers can use to benchmark their distributed applications

ACME benchmark lifecycle

1. User describes the benchmark scenario: node requirements, workload, faultload, metrics
2. System finds the appropriate nodes and starts up the benchmarked application on those nodes
3. System then executes the scenario: injects workload and faults, collects measurements

Note: the same infrastructure supports self-management (just replace "fault" with "control action" and "benchmark scenario" with "self-management rules" or "recovery actions")

Outline

- Motivation and system environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion

Sensors and actuators

Sensors and actuators are the sources and sinks for monitoring and control.

Application-external (node-level):
- sensors: load, memory usage, network traffic, ...
- actuators: start/kill processes, reboot physical nodes, modify emulated network topology

Application-embedded (application-level); initial application type: peer-to-peer overlay networks
- sensors: number of application-level messages sent/received
- actuators: application-specific fault injection, change parameters of workload generation
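To make the sensor interface concrete, here is a minimal sketch in the spirit of the HTTP-URL-in, CSV-data-out contract shown in the architecture diagrams. The URL path, CSV layout, and the choice of load average as the metric are assumptions for illustration, not ACME's actual wire format.

```python
# Sketch of an application-external, node-level sensor: answer an HTTP GET
# with CSV-formatted monitoring data. Format details are assumed, not ACME's.
import http.server
import os
import threading

class SensorHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Report a single node-level metric (1-minute load average),
        # falling back to 0.0 on platforms without getloadavg.
        load1 = os.getloadavg()[0] if hasattr(os, "getloadavg") else 0.0
        body = f"metric,value\nload,{load1:.2f}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def start_sensor(port=0):
    """Start the sensor on a background thread; return the bound port."""
    server = http.server.HTTPServer(("127.0.0.1", port), SensorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_port
```

A query processor node would then poll such URLs and aggregate the CSV values on the way up the tree.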

Outline

- Motivation and system environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion

Query processor architecture

[Architecture diagram: an ISING query propagates down the SenTree (SenTreeDown); each sensor is polled via an HTTP URL and returns HTTP CSV data; children's values flow back up the tree (SenTreeUp) and are combined into an aggregated response.]

Query processor (cont.)

Scalability: efficiently collect monitoring data from thousands of nodes
- in-network data aggregation and reduction

Robustness: handle failures in the monitoring system and in the monitored application
- query processor based on a self-healing peer-to-peer network
- partial aggregates on failure

Extensibility: easy way to incorporate new monitoring data sources as the system evolves
- sensor interface
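The scalability and robustness points above can be sketched together: aggregate values up a tree, and when a subtree has failed, simply omit it and return a partial aggregate. The tree shape, node names, and the AVG aggregate are illustrative; the real SenTree/ISING protocol details differ.

```python
# Sketch of in-network AVG aggregation with partial aggregates on failure.
def aggregate(node, children, values, failed):
    """Recursively aggregate values up a tree, skipping failed subtrees.

    Returns (sum, count) so averages compose correctly across subtrees.
    """
    if node in failed:
        return (0.0, 0)          # partial aggregate: this subtree is lost
    total, count = float(values[node]), 1
    for child in children.get(node, []):
        s, c = aggregate(child, children, values, failed)
        total += s
        count += c
    return (total, count)

# A small sensor tree rooted at "root".
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
values = {"root": 1.0, "a": 2.0, "a1": 3.0, "a2": 4.0, "b": 5.0, "b1": 6.0}

s, c = aggregate("root", children, values, failed=set())
full_avg = s / c        # average over all 6 nodes -> 3.5

s, c = aggregate("root", children, values, failed={"b"})
partial_avg = s / c     # "b" subtree lost: partial average over the 4 live nodes
```

Returning (sum, count) pairs rather than finished averages is what lets partial results from different subtrees combine without bias.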

Outline

- Motivation and system environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion

Describing a benchmark scenario

The key is usability: we want an easy way to define when and what actions to trigger
- "kill half of the nodes after ten minutes"
- "kill nodes until response latency doubles"

Declarative XML-based rule system: conditions over sensors => invoke actuators
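The condition-over-sensor => actuator idea can be illustrated with a toy imperative sketch of the second example rule. The `read_latency`/`kill_node` callables are hypothetical stand-ins for a sensor and an actuator; ACME expresses this declaratively in XML rather than as a loop.

```python
# Toy sketch of "kill nodes until response latency doubles": poll a latency
# sensor and fire a kill actuator until the sensed value doubles its baseline.
def run_rule(read_latency, kill_node, max_kills=100):
    """Fire the kill actuator until the latency sensor doubles its baseline."""
    baseline = read_latency()
    killed = 0
    while read_latency() < 2 * baseline and killed < max_kills:
        kill_node()
        killed += 1
    return killed

# Simulated system under test: each kill raises observed latency by 20%.
state = {"latency_ms": 100.0}
killed = run_rule(
    lambda: state["latency_ms"],
    lambda: state.__setitem__("latency_ms", state["latency_ms"] * 1.2))
```

A declarative rule system gains over this loop by letting the controller schedule, compose, and re-trigger many such condition/action pairs from one specification.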

"Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles"

<condition type="sensor" ID="oldVal" datatype="double" name="latency"
           hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
           period="10000" sensorAgg="AVG" histSize="1" isSecondary="true"/>
<condition type="sensor" datatype="double" name="latency"
           hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
           period="10000" sensorAgg="AVG" histSize="1" operator="

ACME architecture

[Architecture diagram: an experiment specification / system-management policy feeds the controller; the controller exchanges XML with ISING; queries propagate down the SenTree (SenTreeDown) to sensors and actuators via HTTP URL requests; HTTP CSV data and children's values flow back up (SenTreeUp) into an aggregated response.]

ACME recap

Taken together, the parts of ACME provide:
- application deployment and process management
- data collection infrastructure
- workload generation*
- fault injection*
...all driven by a user-specified policy

Future work (with Stanford):
- scaling down: integrate cluster applications (sensors/actuators for J2EE middleware, targeted toward statistical monitoring)
- use the rule system to invoke recovery routines
- benchmark diagnosis techniques, not just apps
- new, user-friendly policy language, including expressing statistical algorithms

Benchmarking diagnosis techniques

[Architecture diagram: an experiment specification drives the controller, which records an XML history and issues fault-injection commands; rule-based and statistical diagnosis modules exchange diagnosis events and subscription requests over a pub/sub layer with ISING (or another query processor), which supplies monitoring metrics, events, and query results.]

Revamping the language

"Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles"

when (timer_T > 0)
    startNode(number=100);
when ((timer_T > ) AND sensorCond_CompLatency)
    killNode(number=10) repeat(period=180000);
when (timer_T > )
    stopSensor(name=oldVal);

define sensorCond CompLatency { hist1 < 2 * hist2 }
define history hist1 { sensor=lat, size=1 }
define history hist2 { sensor=oldVal, size=1 }
define sensor lat {
    name="latency"
    hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2"
    node="ALL:3333"
    period="10000"
    sensorAgg="AVG"
}
define sensor oldVal lat;

Outline

- Motivation and system environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion

Resource discovery and mapping

When benchmarking, map the desired emulated topology onto the available topology.
- example: "find me 100 P4-Linux nodes with inter-node bandwidth, latency, and loss rates characteristic of the Internet as a whole and that are lightly loaded"

When deploying a service, find the set of nodes on which to execute to achieve the desired performance, cost, and availability.
- example: "find me the cheapest 50 nodes that will give me at least 3 9's of availability, that are geographically well-dispersed, and that have at least 100 Kb/sec of bandwidth between them"

Current RD&M architecture

1. Each node that is offering resources periodically reports to a central server:
   a) single-node statistics
   b) inter-node statistics, expressed as an N-element vector
   The central server builds an NxN "inference matrix" (currently statistic values are generated randomly).
2. When desired, a node issues a resource discovery request to the central server as an MxM "constraint matrix", e.g.:
   [ load=[0,2] latency=[[10ms,20ms],[200ms,300ms]] ]
   [ load=[0,2] latency=[[200ms,300ms],[200ms,300ms]] ]
3. The central server finds the M best nodes and returns them to the querying node.
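The matching step in point 3 can be sketched as interval tests against per-node statistics. This simplifies the MxM constraint matrix to per-node [lo, hi] ranges and takes the first M matches rather than the "best" ones; all node names and values below are made up for illustration.

```python
# Sketch of the central server's matching step: given per-node statistics and
# a constraint of [lo, hi] ranges, return nodes whose stats fall inside every
# range (a simplification of the MxM constraint matrix described above).
def matches(stats, constraint):
    return all(lo <= stats[key] <= hi for key, (lo, hi) in constraint.items())

def discover(nodes, constraint, m):
    """Return up to m node names satisfying the constraint."""
    found = [name for name, stats in nodes.items() if matches(stats, constraint)]
    return found[:m]

nodes = {
    "n1": {"load": 0.5, "latency_ms": 15.0},
    "n2": {"load": 3.0, "latency_ms": 12.0},   # too heavily loaded
    "n3": {"load": 1.1, "latency_ms": 250.0},  # latency out of range
    "n4": {"load": 0.2, "latency_ms": 18.0},
}
constraint = {"load": (0, 2), "latency_ms": (10, 20)}
picked = discover(nodes, constraint, m=2)      # -> ["n1", "n4"]
```

Ranking candidates by fit (rather than taking the first M) and honoring pairwise inter-node constraints is where the NP-hard mapping problem mentioned in the next slide comes in.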

RD&M next steps

- Decentralized resource discovery/mapping: replicate needed statistics close to querying nodes (improves availability and performance over the centralized approach)
- Better mapping functions: the mapping problem is NP-hard; provide the best mapping within cost/precision constraints
- Give the user an indication of accuracy and cost
- Integrate with the experiment description language
- Integrate with PlanetLab resource allocation
- Evaluation

Conclusion

ACME is a platform for benchmarking distributed apps.

Collect metrics and events:
- sensors
- ISING query processor

Describe & implement a benchmark scenario:
- actuators
- controller/rule system: process management, fault injection; XML-based (to be replaced)

Next steps:
- resource discovery/node mapping
- improved benchmark description/resource discovery language
- incorporating Grid applications
- incorporating cluster applications and using ACME to benchmark diagnosis techniques (with Stanford)