Distributed Systems Laboratory


1 Distributed Systems Laboratory
9/22/2018 MSR HPC visit

2 Lab People - Faculty
Prof. Ran El-Yaniv (Learning, Data Mining)
Prof. Roy Friedman (Distributed Systems, Ad hoc Networks)
Prof. Erez Petrank (Memory Management)
Dr. Avi Mendelson (Computer Architecture)
Prof. Assaf Schuster, Head (Large-Scale Data Processing, Distributed Systems)

3 Lab People
Engineers: Eran Issler, Max Kovgan, David Carmeli, Valentin Kravtchov, Artiom Sharov
About 40 graduate research students (best of breed!)
Dozens of undergraduate and graduate students working on projects each semester
Hundreds of undergraduate students in systems courses

4 Sponsors and Partners

5 Scope
Layers: Applications; Middleware, Virtualization; Hardware
Large Scale Distributed Data Mining
Grid/P2P/Sensor Data Mining
Genetic Linkage Analysis
Distributed Scalable Model Checking
Anonymous and Private Distributed Data Mining
Machine Learning
Sensor Networks
Internet Mining
Light-weight group communication
Fast interconnects for HPC and data processing
Condor – Grid Computing – Research, development, deployment
Software Distributed Shared Memory
System Services for Ad-hoc networks
Locality in large-scale computations
Data Privacy in Distributed Databases
Multilevel caching in storage systems
Scalable Data Race Detection
Highly Available Distributed Java
Computer Architecture: Fine Grain Parallelization

6 The Resource Hierarchy
GLOW - UW Madison
Boinc @HOME

7 EGEE

8 DSL users
Dr. Avi Mendelson – Trace cache
Prof. Ran El-Yaniv – Machine Learning
Prof. Roy Friedman – Group Communication
Prof. Assaf Schuster – Large scale and grid
Prof. Eli Biham – Cryptography
Prof. Dan Geiger – Genetic Linkage Analysis
Prof. Orna Grumberg – Scalable Model Checking
Prof. Uri Weiser – Computer Architecture
Prof. Ron Pinter – Caching Architectures
Prof. Ronny Kimmel – 3D Image processing
Prof. Reuven Cohen – Communication Networks
Prof. Danny Raz – Active Distributed Services
Prof. Idit Keidar – Distributed Systems
Prof. Mooly Sagiv – Compiler Analysis
Prof. Shaul Markovitch – Machine Learning
Prof. Yoram Rosen – High Energy Physics
…

9 Contents - Tools
Multiview – Distributed Shared Memory
Data race detection
Model checking-based DRD
Grid Monitoring System
Decorative HA for grids

10 Contents – Large-Scale Distributed Systems
Peer-to-Peer Data Mining
DataMiningGrid project
QosCosGrid project
Distributed runtime for multithreaded Java
Distributed Model Checking

11 Multiview – Technologies for Distributed Shared Memory
[OSDI’99]

12 See Multiview in a separate presentation

13 Data Race Detection for C++ Programs
[PPOPP’03]

14 See MultiRace in a separate presentation

15 Model Checking-Based Data Race Detection
[PPOPP’05]

16 Difficulties in model checking dataraces
Infinite state space
Huge number of interleavings
Huge transition systems
Size problem
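
To give a feel for the "huge number of interleavings" point, here is a small counting sketch. It assumes a simplified model that is not from the slides: each thread runs a fixed sequence of atomic steps, and any order that preserves every thread's own order is a distinct interleaving. The function name `interleavings` is illustrative only.

```python
from math import factorial


def interleavings(steps_per_thread):
    """Count the orderings of all steps that keep each thread's own order.

    This is the multinomial coefficient (k1 + ... + kn)! / (k1! * ... * kn!),
    where ki is the number of atomic steps in thread i.
    """
    total = factorial(sum(steps_per_thread))
    for k in steps_per_thread:
        total //= factorial(k)  # exact: each partial quotient is an integer
    return total


# Two threads with 2 steps each can interleave in 6 ways,
# two threads with 10 steps each already in 184756 ways.
small = interleavings([2, 2])
larger = interleavings([10, 10])
```

Under this model the count grows factorially with program length and thread count, which is why exhaustive exploration quickly becomes impractical.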

17 Basic idea

18 Hybrid solution
Combine Lockset & Model Checking
Goal: provide witnesses for dataraces
  Rare dataraces
  Dataraces in large programs
Model checking provides witnesses for rare dataraces
Lockset scales to large programs
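
The Lockset half of the hybrid can be sketched as follows. This is a minimal version of the well-known lockset idea (intersect the locks held at every access to a shared variable and warn when the intersection becomes empty). It is not the lab's prototype; the class and method names are hypothetical, and refinements such as ignoring initialization or read-only sharing are left out.

```python
class LocksetChecker:
    """Minimal lockset checker driven by a trace of lock and access events."""

    def __init__(self):
        self.held = {}        # thread id -> set of locks currently held
        self.candidates = {}  # variable -> locks held at every access so far
        self.warnings = []    # variables whose candidate set became empty

    def acquire(self, thread, lock):
        self.held.setdefault(thread, set()).add(lock)

    def release(self, thread, lock):
        self.held.setdefault(thread, set()).discard(lock)

    def access(self, thread, var):
        held = self.held.get(thread, set())
        if var not in self.candidates:
            # First access: every lock held now is a candidate protector.
            self.candidates[var] = set(held)
        else:
            # Later accesses: keep only locks held on all accesses.
            self.candidates[var] &= held
        if not self.candidates[var] and var not in self.warnings:
            self.warnings.append(var)  # no common lock protects var


# Usage: thread t1 protects x with lock L, thread t2 does not.
checker = LocksetChecker()
checker.acquire("t1", "L")
checker.access("t1", "x")
checker.release("t1", "L")
checker.access("t2", "x")  # candidate set for x becomes empty
```

A warning here only says that the locking discipline was violated, not that a race can actually happen, which is where a model checker that produces a concrete witness complements the approach.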

19 Idea and Prototype
[Diagram: Multi-threaded program → Lockset → list of warnings (violations of the locking principle; accesses suspected of racing) → Wolf model checker (find, extend, snapshot) → witness]

20 Benchmark programs
Program      Lines  Description
Tsp          706    Traveling salesman from ETH
Our_tsp      708    Enhanced traveling salesman
mtrt         3751   Multithreaded raytracer from specjvm98
Hedc         29948  Web Crawler Kernel from ETH
SortArray    362    Parallel sort
PrimeFinder  129    Finds prime numbers in a given interval
Elevsim      150    Elevator simulator
DQueries     166    Shared DB simulator

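21 Experimental results
Memory (MB) and time (sec) per program for 4, 3, and 2 threads. Rows with fewer than six values are listed in their original order.
our_tsp: Out of memory, 353, 35069
SortArray: 396, 123, 569.3
PrimeFinder: 168 MB / 4547.1 sec (4 threads), 143 MB / 2645.5 sec (3 threads), 116 MB / 888.7 sec (2 threads)
ElevSim: 48 MB / 147.9 sec (4 threads), 33 MB / 67.92 sec (3 threads), 28 MB / 33.02 sec (2 threads)
DQueries: 136 MB / 585.97 sec (4 threads), 89 MB / 201.8 sec (3 threads), 60 MB / 140.1 sec (2 threads)
Hedc: 17 MB / 9 sec (4 threads), 12 MB / 7.33 sec (3 threads), 11 MB / 2.66 sec (2 threads)
tsp: Out of memory, 377, 35243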

22 Mining for Misconfigured Machines in a Grid System
[KDD’06] Tested with success on a production environment.

23 Grid Batch Systems
Many organizations or administration sites
10,000s of machines
Heterogeneous machines
Non-dedicated
Different installation and configuration
Many potential causes of failures and misbehaviors
  Software bugs, hardware, network, configuration
Current solutions
  Manual diagnosis
  Rule-based expert systems
Data mining
  Limited, if any, prior knowledge
[Diagram: Submission, Resource broker, Execution]

24 Data Acquisition
Non-intrusive
Distributed
[Diagram labels: Data collector, Database, Preprocessing, Data miner, Distributed data miner]

25 Distributed Outlier Detection

26 Distributed Outlier Detection

27 Distributed Outlier Detection

28 Distributed Outlier Detection
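
The transcript keeps only the titles of the outlier detection slides, so the method itself is not recoverable here. As a rough illustration of the general idea, the sketch below scores each machine by its distance to its k-th nearest neighbor in a feature space and reports the highest-scoring machines as suspects. The scoring rule, the feature vectors, and the function names are all assumptions for illustration, and the sketch is centralized rather than distributed.

```python
from math import dist


def knn_outlier_scores(points, k=2):
    """Score each named point by the distance to its k-th nearest neighbor."""
    scores = {}
    for name, p in points.items():
        distances = sorted(
            dist(p, q) for other, q in points.items() if other != name
        )
        scores[name] = distances[k - 1]
    return scores


def top_suspects(points, k=2, n=1):
    """Return the n names with the largest k-NN distance (most isolated)."""
    scores = knn_outlier_scores(points, k)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical per-machine features, e.g. (normalized load, normalized job time).
machines = {
    "m1": (1.0, 1.0),
    "m2": (1.1, 0.9),
    "m3": (0.9, 1.0),
    "m4": (5.0, 5.0),  # behaves very differently from the rest
}
suspects = top_suspects(machines, k=2, n=1)
```

The appeal of this kind of score for a grid is that it needs no prior model of what a fault looks like: a machine is flagged only because it behaves unlike its peers.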

29 Distributed Implementation
[Diagram labels: S1, S2, S3, SG2, SG3, SG]

30 Distributed Implementation
[Diagram labels: S1, S2, S3, SG2, SG3, SG]

31 Distributed Implementation
[Diagram labels: S1, S2, S3, SG1, SG2, SG3, SG]

32 Evaluation on DSL Hardware
3 of the top 4 suspected machines are actually misconfigured.
bh10: unknown reason
i4: loaded by network service
bh13: active HyperThreading
i3: root file system was nearly full

33 Future Work
Fault identification, analysis, classification, prediction
Better resource allocation; better system utilization
Feedback to user on submitted jobs description
Optimizing transparent operation
Collaboration with INTEL NetBatch team

34 HA for large scale grids
[HPDC’06] Production System – Condor distribution

35 The Challenges
WAN backups
  Failure detection is not perfect: no bounded delay
  Network anomalies: links are asymmetric, not transitive
  IP fail-over techniques inapplicable
Lightweight protocols
  Traditional group communication algorithms do not scale well
Autonomous partitions
  Transient failures
Legacy applications without HA
  Grid developers do not want to deal with HA
Random, uniformly chosen, partial membership
  Provides a random representative in every network part

36 The Goal
The goal is to turn HA into a commodity
  “HA out of the box”
  No need to change or adapt your existing service
  HA is provided as a Grid service itself
Solution: Decoration
  Transparent addition of HA to already existing and deployed services
  No changes to the decorated service

37 Application: HA for Condor Central Manager
[Diagram: Central Manager (Collector, Negotiator) with several job queue machines and several execution machines]

38 Solution Architecture

39 Solution Highlights
HAInvocator - High Availability for Negotiator
  Leader election
  Automatic failure detection
  Transparent failover to backup
  “Split brain” reconciliation after network partitions
HAReplicator - Persistency of Negotiator state
  State replication between active and backups
Proxy for multicasting client’s messages to Collector
Loose coupling between replication and HA
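
The combination of failure detection, leader election, and failover can be illustrated with a small heartbeat-based sketch. It is not the HAInvocator protocol: the class `HAGroup`, the timeout rule, and the "lowest live id wins" election are assumptions chosen for brevity, and the sketch models a single observer's view rather than agreement among replicas.

```python
class HAGroup:
    """One observer's view of a group of replicas, based on heartbeats."""

    def __init__(self, node_ids, timeout=3.0):
        self.timeout = timeout
        # Last time (in seconds) a heartbeat was seen from each node.
        self.last_heartbeat = {n: 0.0 for n in node_ids}

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def alive(self, now):
        """Nodes whose last heartbeat is within the timeout, sorted by id."""
        return sorted(
            n for n, t in self.last_heartbeat.items() if now - t <= self.timeout
        )

    def leader(self, now):
        """Elect the live node with the smallest id, or None if none is live."""
        live = self.alive(now)
        return live[0] if live else None


# Usage: cm1 leads until it stops sending heartbeats, then cm2 takes over.
group = HAGroup(["cm1", "cm2", "cm3"], timeout=3.0)
leader_before = group.leader(now=1.0)
group.heartbeat("cm2", now=4.0)
group.heartbeat("cm3", now=4.0)
leader_after = group.leader(now=5.0)  # cm1 has been silent for 5 seconds
```

Because timeouts cannot distinguish a crashed node from a slow or partitioned one, two sides of a partition may each elect a leader, which is the "split brain" case that needs reconciliation once the partition heals.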

40 Status
Passed testing in 2005
Not a single code line of Condor changed
  Except for several bug fixes :-)
Inside the Condor distribution, effective Version 6.8
Some important clients
Some success stories
On-going collaboration with the Condor team

