Decentralizing Grids. Jon Weissman, University of Minnesota. E-Science Institute, Nov. 8, 2007.

Presentation transcript:


Roadmap
– Background
– The problem space
– Some early solutions
– Research frontier/opportunities
– Wrap-up

Background
Grids are distributed … but also centralized
– Condor, Globus, BOINC, Grid Services, VOs
– Why? client-server based
Centralization pros
– Security, policy, global resource management
Decentralization pros
– Reliability, dynamic, flexible, scalable
– **Fertile CS research frontier**

Challenges
May have to live within the Grid ecosystem
– Condor, Globus, Grid services, VOs, etc.
– First-principles approaches are risky (Legion)
50K-foot view
– How to decentralize Grids yet retain their existing features?
– High performance, workflows, performance prediction, etc.

Decentralized Grid platform
Minimal assumptions about each node
Nodes have associated assets (A)
– basic: CPU, memory, disk, etc.
– complex: application services
– exposed interface to assets: OS, Condor, BOINC, Web service
Nodes may go up or down (churn)
Node trust is not a given (asked to do X, does Y instead)
Nodes may or may not connect to other nodes
Nodes may be aggregates
Grid may be large (> 100K nodes); scalability is key

Grid Overlay
[diagram: overlay connecting heterogeneous nodes – a Grid service, raw OS services, a Condor network, a BOINC network]

Grid Overlay - Join
[diagram: a new node joining the overlay]

Grid Overlay - Departure
[diagram: a node leaving the overlay]

Routing = Discovery
Query contains sufficient information to locate a node: RSL, ClassAd, etc.
Exact match or semantic match
[diagram: a "discover A" query routed through the overlay]

Routing = Discovery
[diagram: the query reaches a node holding asset A – bingo!]

Routing = Discovery
The discovered node returns a handle sufficient for the client to interact with it – perform service invocation, job/data transmission, etc.

Routing = Discovery
Three parties
– initiator of discovery events for A
– client: invocation, health of A
– node offering A
Often the initiator and the client will be the same
Other times the client will be determined dynamically
– if W is a web service and results are returned to a calling client, we want to locate C_W near W
– => discover W, then C_W!

Routing = Discovery
[diagram sequence: a "discover A" query is routed around a failed node (X) until it reaches a node holding A – bingo!; the handle is returned to an outside client; a "discover As" query locates multiple instances]

Grid Overlay
This generalizes …
– Resource query (query contains job requirements)
– Looks like decentralized matchmaking (a minimal matching sketch follows below)
These are the easy cases …
– independent simple queries
   - find a CPU with characteristics x, y, z
   - find 100 CPUs each with x, y, z
– suppose queries are complex or related?
   - find N CPUs with aggregate power = G Gflops
   - locate an asset near a prior discovered asset
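To make the matchmaking step concrete, here is a minimal sketch of exact-match discovery, assuming a simple dictionary representation of queries and node assets; the attribute names are illustrative, not from any Grid schema:

```python
def matches(query, assets):
    """Exact-match discovery: a query matches a node if every requested
    attribute is advertised by the node and meets the required minimum."""
    for attr, required in query.items():
        offered = assets.get(attr)
        if offered is None or offered < required:
            return False
    return True

node_assets = {"cpu_ghz": 2.4, "mem_gb": 8, "disk_gb": 120}
print(matches({"cpu_ghz": 2.0, "mem_gb": 4}, node_assets))  # True
print(matches({"cpu_ghz": 3.0}, node_assets))               # False
```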

Grid Scenarios
Grid applications are more challenging
– Application has a more complex structure: multi-task, parallel/distributed, control/data dependencies
   - an individual job/task needs a resource near a data source
   - workflow queries are not independent
– Metrics are collective, not simply raw throughput: makespan, response, QoS

Related Work
– Maryland/Purdue: matchmaking
– Oregon CCOF: time-zone CAN

Related Work (contd)
None of these approaches address the Grid scenarios (in a decentralized manner)
– Complex multi-task data/control dependencies
– Collective metrics

50K-Ft Research Issues
Overlay Architecture
– structured, unstructured, hybrid – what is the right architecture?
Decentralized control/data dependencies – how to do it?
Reliability – how to achieve it?
Collective metrics – how to achieve them?

Context: Application Model
[diagram legend: component service, request/job/task, data source]

Context: Application Models
– Reliability
– Collective metrics
– Data dependence
– Control dependence

Context: Environment
RIDGE project - ridge.cs.umn.edu
– reliable infrastructure for donation grid environments
Live deployment on PlanetLab - planet-lab.org
– 700 nodes spanning 335 sites and 35 countries
– emulators and simulators
Applications
– BLAST
– Traffic planning
– Image comparison

Application Models
– Reliability
– Collective metrics
– Data dependence
– Control dependence

Reliability Example
[diagram sequence: workflow graph B, C, D, E with component G running on a discovered node; a client node C_G is designated as responsible for G's health; the token "G, loc(C_G)" is propagated (one could also discover G first, then C_G); when the node running G fails (X), C_G re-discovers a node and restarts G]

Client Replication
[diagram sequence: the client responsible for G is replicated as C_G1 and C_G2; loc(G), loc(C_G1), loc(C_G2) are propagated; when one client fails (X), hand-off occurs – client hand-off depends on the nature of G and the interaction]

Component Replication
[diagram: component G is replicated as G1 and G2, both monitored by C_G]

Replication Research
Nodes are unreliable
– crash, hacked, churn, malicious, slow, etc.
How many replicas?
– too many: waste of resources
– too few: application suffers

System Model
Reputation rating r_i
– degree of node reliability
Dynamically size the redundancy based on r_i
Nodes are not connected and check in to a central server
Note: variable-sized groups

Reputation-based Scheduling
Reputation rating
– Techniques for estimating reliability based on past interactions
Reputation-based scheduling algorithms
– Use reliabilities for allocating work
– Rely on a success-threshold parameter

Algorithm Space
How many replicas? (see the sketch below)
– first-fit, best-fit, random, fixed, …
– algorithms compute how many replicas are needed to meet a success threshold
How to reach consensus?
– M-first (better for timeliness)
– Majority (better for byzantine threats)
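As an illustration, here is a minimal sketch of threshold-driven replica sizing, assuming independent node failures and per-node reliability ratings r_i; the Poisson-binomial tail is computed by dynamic programming, and all function and parameter names are hypothetical:

```python
def at_least_m_prob(reliabilities, m):
    """P(at least m replicas return a correct result), assuming independent
    nodes with success probabilities r_i (Poisson-binomial tail)."""
    dp = [1.0] + [0.0] * len(reliabilities)   # dp[k] = P(exactly k successes)
    for r in reliabilities:
        for k in range(len(dp) - 1, 0, -1):
            dp[k] = dp[k] * (1 - r) + dp[k - 1] * r
        dp[0] *= (1 - r)
    return sum(dp[m:])

def size_replica_group(candidates, m, threshold):
    """First-fit sizing: keep adding candidate nodes (id, r_i) until the
    group meets the success threshold under M-first consensus."""
    group, rels = [], []
    for node_id, r in candidates:
        group.append(node_id)
        rels.append(r)
        if at_least_m_prob(rels, m) >= threshold:
            return group
    return None   # threshold unreachable with the available candidates

# three candidates are needed before P(>=1 correct) reaches 0.99
print(size_replica_group([("n1", 0.9), ("n2", 0.7), ("n3", 0.8)], m=1, threshold=0.99))
```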

Experimental Results: correctness
[chart: simulation based on byzantine behavior, majority voting]

Experimental Results: timeliness
[chart: M-first (M=1); best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE]

Next steps
Nodes are decentralized, but trust management is not!
Need a peer-based trust exchange framework
– Stanford: EigenTrust project
– local exchange until the network converges to a global state (see the sketch below)
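For intuition, here is a minimal centralized sketch of the EigenTrust computation, assuming a matrix of local trust scores as input; the actual protocol reaches the same fixed point through purely local peer-to-peer exchanges:

```python
import numpy as np

def eigentrust(local_trust, iters=100, tol=1e-9):
    """Power-iterate t <- C^T t toward the principal eigenvector of the
    row-normalized local-trust matrix C; t[i] is node i's global trust."""
    C = np.asarray(local_trust, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)        # row-normalize local scores
    t = np.full(C.shape[0], 1.0 / C.shape[0])   # uniform prior
    for _ in range(iters):
        t_next = C.T @ t                        # aggregate neighbors' opinions
        if np.abs(t_next - t).sum() < tol:
            break
        t = t_next
    return t

# three nodes; node 2 is poorly rated by its peers
print(eigentrust([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.1],
                  [0.5, 0.5, 0.0]]))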

Application Models
– Reliability
– Collective metrics
– Data dependence
– Control dependence

Collective Metrics
BLAST: throughput is not always the best metric
Response, completion time, application-centric metrics
– makespan, response

Communication Makespan
Nodes download data from replicated data nodes (data download dominates)
– Nodes choose data servers independently (decentralized)
– Minimize the maximum download time over all worker nodes (communication makespan)

Data node selection
Several possible factors
– Proximity (RTT)
– Network bandwidth
– Server capacity
[charts: Download Time vs. RTT – linear; Download Time vs. Bandwidth – exponential]

Heuristic Ranking Function
Query to get candidates, RTT/bandwidth probes
For worker node i and data server node j
– Cost function = rtt_{i,j} * exp(k_j / bw_{i,j}), where k_j captures load/capacity
The least-cost data node is selected independently (see the sketch below)
Three server selection heuristics that use k_j
– BW-ONLY: k_j = 1
– BW-LOAD: k_j = n-minute average load (past)
– BW-CAND: k_j = # of candidate responses in the last m seconds (~ future load)
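A minimal sketch of this ranking, directly encoding the cost function from the slide; the tuple layout and identifiers are illustrative:

```python
import math

def server_cost(rtt, bw, k):
    """cost(i, j) = rtt_{i,j} * exp(k_j / bw_{i,j}); lower is better."""
    return rtt * math.exp(k / bw)

def pick_data_server(candidates):
    """candidates: (server_id, rtt_ms, bandwidth_mbps, k_j) tuples.
    Each worker node picks its least-cost server independently."""
    return min(candidates, key=lambda c: server_cost(c[1], c[2], c[3]))

# BW-ONLY heuristic: k_j = 1 for every server
print(pick_data_server([("s1", 40.0, 10.0, 1.0),
                        ("s2", 25.0, 2.0, 1.0),
                        ("s3", 80.0, 50.0, 1.0)]))   # picks s2
```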

Performance Comparison

Computational Makespan
compute dominates: BLAST

Computational Makespan
[charts: variable-sized vs. equal-sized]

Next Steps
Other makespan scenarios
Eliminate probes for bandwidth and RTT -> estimation
Richer collective metrics
– deadlines: user-in-the-loop

Application Models
– Reliability
– Collective metrics
– Data dependence
– Control dependence

Data Dependence
A data-dependent component needs access to one or more data sources
– data may be large
[diagram: "discover A" query with an attached data source]

Data Dependence (contd)
[diagram: "discover A" – where to run it?]

The Problem
Where to run a data-dependent component?
– determine a candidate set
– select a candidate
It is unlikely that a candidate knows its downstream bandwidth from particular data nodes
Idea: infer bandwidth from neighbor observations with respect to data nodes!

Estimation Technique
Candidate C_1 may have had little past interaction with the data source
– … but its neighbors may have
For each neighbor, generate a download estimate using:
– DT: the neighbor's prior download time from the data source
– RTT: from the candidate and the neighbor to the data source, respectively
– DP: average weighted measure of prior download times for any node to any data source
[diagram: candidates C_1 and C_2 with their neighbors]

Estimation Technique (contd)
Download Power (DP) characterizes the download capability of a node
– DP = average(DT * RTT)
– DT alone is not enough (far-away vs. nearby data source)
An estimate is associated with each neighbor n_i
– ElapsedEst[n_i] = α * β * DT
– α: my_RTT / neighbor_RTT (both to the data source)
– β: neighbor_DP / my_DP
– no active probes: historical data, RTT inference
Combining neighbor estimates
– mean, median, min, …
– median worked the best
Take the min over all candidate estimates (see the sketch below)
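A minimal sketch of this estimator, assuming each neighbor observation is already expressed in the slide's terms; all names are illustrative:

```python
from statistics import median

def elapsed_estimate(my_rtt, nbr_rtt, my_dp, nbr_dp, nbr_dt):
    """ElapsedEst[n_i] = alpha * beta * DT, with
    alpha = my_RTT / neighbor_RTT (both to the data source) and
    beta  = neighbor_DP / my_DP."""
    return (my_rtt / nbr_rtt) * (nbr_dp / my_dp) * nbr_dt

def candidate_estimate(observations):
    """Combine per-neighbor estimates; median worked best in the slides.
    observations: (my_rtt, nbr_rtt, my_dp, nbr_dp, nbr_dt) tuples."""
    return median(elapsed_estimate(*obs) for obs in observations)

def select_candidate(per_candidate_obs):
    """Pick the candidate with the minimum estimated download time.
    per_candidate_obs: {candidate_id: [observation, ...]}."""
    return min(per_candidate_obs,
               key=lambda c: candidate_estimate(per_candidate_obs[c]))
```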

Comparison of Candidate Selection Heuristics
[chart: SELF uses direct observations]

Take Away
Next steps
– routing to the best candidates
Locality between a data source and a component
– scalable, no probing needed
– many uses

Application Models
– Reliability
– Collective metrics
– Data dependence
– Control dependence

The Problem
How to enable decentralized control?
– propagate downstream graph stages
– perform distributed synchronization
Idea:
– distributed dataflow
– token matching
– graph forwarding, futures (Mentat project)

Control Example
[diagram sequence: workflow graph B, C, D, E, G with a control node performing token matching; tokens such as {E, B*C*D}, {C, G}, {D, G} propagate the downstream graph stages; as components complete, tokens are extended with the storage locations of their outputs – {E, B*C*D, loc(S_B)}, {E, B*C*D, loc(S_C)}, {E, B*C*D, loc(S_D)} – where loc(…) is the node where the component ran, the client, or a storage node; the matcher collects the tokens and fires E once all inputs have arrived]
Open question: how to color and route tokens so that they arrive at the same control node?

Open Problems
Support for global operations
– troubleshooting: what happened?
– monitoring: application progress?
– cleanup: application died, clean up state
Load balance across different applications
– routing to guarantee dispersion

Summary
Decentralizing Grids is a challenging problem
Re-think systems, algorithms, protocols, and middleware => fertile research
Keep our eye on the ball
– reliability, scalability, and maintaining performance
Some preliminary progress on point solutions

My visit
Looking to apply some of these ideas to existing UK projects via collaboration
Current and potential projects
– Decentralized dataflow (Adam Barker)
– Decentralized applications: haplotype analysis (Andrea Christoforou, Mike Baker)
– Decentralized control: OpenKnowledge (Dave Robertson)
Goal
– improve reliability and scalability of applications and/or infrastructures

Questions

EXTRAS

Non-stationarity
Nodes may suddenly shift gears
– deliberately malicious, virus, detach/rejoin
– underlying reliability distribution changes
Solution
– window-based rating (see the sketch below)
– adapt/learn target
Experiment: blackout at round 300 (30% affected)
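A minimal sketch of a window-based rating, assuming binary success/failure outcomes per interaction; the class name, window size, and prior are illustrative:

```python
from collections import deque

class WindowedRating:
    """Only the last `window` interactions count toward a node's rating,
    so the rating can track non-stationary behavior (e.g., a node that
    suddenly turns malicious) instead of being anchored by old history."""
    def __init__(self, window=50):
        self.outcomes = deque(maxlen=window)

    def record(self, success):
        self.outcomes.append(1 if success else 0)

    def rating(self):
        if not self.outcomes:
            return 0.5   # uninformative prior for unseen nodes
        return sum(self.outcomes) / len(self.outcomes)

r = WindowedRating(window=10)
for ok in [True] * 8 + [False] * 4:   # node degrades after 8 good rounds
    r.record(ok)
print(r.rating())   # 0.6: the window has begun to forget the good past
```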

Adapting …

Adaptive Algorithm
[charts: success rate; throughput]

[charts: success rate; throughput]

Scheduling Algorithms

Estimation Accuracy
Download Elapsed Time Ratio (x-axis) is the ratio of the estimate to the real measured time
– 1 means perfect estimation
Accept if the estimate is within measured ± (measured * error)
– error = 0.33: 67% of the total are accepted
– error = 0.50: 83% of the total are accepted
Setup: 27 objects (0.5 MB – 2 MB); 130 nodes on PlanetLab; 15,000 downloads from randomly chosen nodes

Impact of Churn
[chart: Global(Prox) mean vs. Random mean]

Estimating RTT
We use distance = (RTT + 1)
Simple RTT inference technique based on the triangle inequality
– Triangle inequality: Latency(a,c) <= Latency(a,b) + Latency(b,c)
– Hence: |Latency(a,b) - Latency(b,c)| <= Latency(a,c) <= Latency(a,b) + Latency(b,c)
Each neighbor yields a [lower bound, upper bound] range; pick the intersected range and take the mean (see the sketch below)
[diagram: inference via neighbors A, B, C; lower and upper bounds; intersected range; final inference]
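A minimal sketch of this inference, assuming each neighbor b contributes the measured pair (RTT(a,b), RTT(b,c)); names are illustrative:

```python
def infer_rtt(neighbor_obs):
    """Each neighbor bounds the unknown RTT(a, c) within
    [|RTT(a,b) - RTT(b,c)|, RTT(a,b) + RTT(b,c)].
    Intersect the ranges (max of lower bounds, min of upper bounds)
    and return the midpoint as the inferred RTT."""
    lows = [abs(ab - bc) for ab, bc in neighbor_obs]
    highs = [ab + bc for ab, bc in neighbor_obs]
    low, high = max(lows), min(highs)
    if low > high:                 # noisy measurements: ranges may not overlap
        low, high = high, low
    return (low + high) / 2

# neighbors' (RTT to us, RTT to the target) in milliseconds
print(infer_rtt([(30, 50), (10, 45), (60, 20)]))   # 47.5
```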

RTT Inference Result
More neighbors, greater accuracy
With 5 neighbors, 85% of inferences have < 16% error

Other Constraints
[diagram: graph A, B, C, D, E with tokens {E, B*C*D}, {C, A, dep-CD}, {D, A, dep-CD}]
C & D interact and should be co-allocated, nearby …
The dep-CD tokens should route to the same control point so a collective query for C & D can be issued

Support for Global Operations
Troubleshooting – what happened?
Monitoring – application progress?
Cleanup – application died, clean up state
Solution mechanism: propagate control-node IPs back to the origin (=> origin IP piggybacked)
Control nodes and matcher nodes report progress (or lack thereof, via timeouts) to the origin
Load balance across different applications

Other Constraints
[diagram: graph A, B, C, D, E with tokens {E, B*C*D}, {C, A}, {D, A}]
C & D interact and should be co-allocated, nearby …

Combining Neighbors Estimation
MEDIAN shows the best results
– using 3 neighbors, the error is within 50% 88% of the time (variation in download times is a factor of 10-20)
3 neighbors gives the greatest bang for the buck

Effect of Candidate Size

Performance Comparison
Parameters: data size 2 MB, replication 10, candidates 5

Computation Makespan (contd)
Now bring in reliability …
[chart: makespan improvement scales well with # components]

Token loss
Can occur between B and the matcher, or between the matcher and the next stage
– the matcher must notify C_B when the token arrives (pass loc(C_B) with B's token)
– the destination (E) must notify C_B when the token arrives (pass loc(C_B) with B's token)
[diagram: B, C, D, E with client C_B]

RTT Inference
>= 90-95% of Internet paths obey the triangle inequality
– RTT(a, c) <= RTT(a, b) + RTT(b, c)
– upper bound: RTT(server, c) <= RTT(server, n_i) + RTT(n_i, c)
– lower bound: |RTT(server, n_i) - RTT(n_i, c)|
Iterate over all neighbors to get max L and min U; return the midpoint