Distributed Systems Laboratory

Distributed Systems Laboratory
www.cs.technion.ac.il/Labs/dsl
MSR HPC visit

Lab People - Faculty
- Prof. Ran El-Yaniv (Learning, Data Mining)
- Prof. Roy Friedman (Distributed Systems, Ad Hoc Networks)
- Prof. Erez Petrank (Memory Management)
- Dr. Avi Mendelson (Computer Architecture)
- Prof. Assaf Schuster, Head (Large-Scale Data Processing, Distributed Systems)

Lab People
- Engineers: Eran Issler, Max Kovgan, David Carmeli, Valentin Kravtchov, Artiom Sharov
- About 40 graduate research students (best of breed!)
- Dozens of undergraduate and graduate students working on projects each semester
- Hundreds of undergraduate students in systems courses

Sponsors and Partners

Scope
The lab's work spans three layers – applications, middleware and virtualization, and hardware. Project areas include:
- Large-Scale Distributed Data Mining; Grid/P2P/Sensor Data Mining; Genetic Linkage Analysis
- Distributed Scalable Model Checking; Anonymous and Private Distributed Data Mining
- Machine Learning; Sensor Networks; Internet Mining
- Light-weight group communication; Fast interconnects for HPC and data processing
- Condor – Grid Computing: research, development, deployment
- Software Distributed Shared Memory; System Services for Ad-hoc Networks
- Locality in large-scale computations; Data Privacy in Distributed Databases
- Multilevel caching in storage systems; Scalable Data Race Detection
- Highly Available Distributed Java; Computer Architecture: Fine-Grain Parallelization

The Resource Hierarchy
(Diagram: the hierarchy of available resources, including GLOW at UW Madison and BOINC @HOME.)

EGEE

DSL users
- Dr. Avi Mendelson – Trace cache
- Prof. Ran El-Yaniv – Machine Learning
- Prof. Roy Friedman – Group Communication
- Prof. Assaf Schuster – Large scale and grid
- Prof. Eli Biham – Cryptography
- Prof. Dan Geiger – Genetic Linkage Analysis
- Prof. Orna Grumberg – Scalable Model Checking
- Prof. Uri Weiser – Computer Architecture
- Prof. Ron Pinter – Caching Architectures
- Prof. Ronny Kimmel – 3D Image Processing
- Prof. Reuven Cohen – Communication Networks
- Prof. Danny Raz – Active Distributed Services
- Prof. Idit Keidar – Distributed Systems
- Prof. Mooly Sagiv – Compiler Analysis
- Prof. Shaul Markovitch – Machine Learning
- Prof. Yoram Rosen – High Energy Physics
- …

Contents - Tools
- Multiview – Distributed Shared Memory
- Data race detection
- Model checking-based data race detection
- Grid Monitoring System
- Decorative HA for grids

Contents – Large-Scale Distributed Systems
- Peer-to-Peer Data Mining
- DataMiningGrid project
- QosCosGrid project
- Distributed runtime for multithreaded Java
- Distributed Model Checking

Multiview – Technologies for Distributed Shared Memory [OSDI’99]

See Multiview in a separate presentation

Data Race Detection for C++ Programs [PPOPP’03]

See MultiRace in a separate presentation

Model Checking-Based Data Race Detection [PPOPP’05]

Difficulties in model checking data races
- Infinite state space
- Huge number of interleavings
- Huge transition systems
- Size problem

Basic idea

Hybrid solution
Combine Lockset and Model Checking:
- Model Checking provides witnesses for data races, including rare ones
- Lockset scales to large programs
- Model Checking + Lockset: witnesses for rare data races in large programs

Idea and Prototype
The prototype pipeline: the multi-threaded program is first run under Lockset, which produces a list of warnings – accesses that violate the locking principle and are therefore suspected of racing. Each suspicious access is then passed to the Wolf model checker, which searches from a program snapshot and produces a concrete witness when the race is real.
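To make the Lockset half concrete, here is a minimal Java sketch of an Eraser-style locking-principle check that produces such warnings. It is illustrative only – the class, the method names, and the string-based bookkeeping are invented for this example and are not taken from the lab's tool:

    import java.util.*;

    // Illustrative sketch only (names are invented for this example): the
    // Eraser-style locking-principle check that produces warnings which a
    // model checker can later try to confirm.
    class LocksetSketch {
        private final Map<String, Set<String>> candidates = new HashMap<>();
        private final List<String> warnings = new ArrayList<>();

        // Called on every instrumented access with the locks the thread holds.
        void onAccess(String location, Set<String> heldLocks) {
            Set<String> c = candidates.get(location);
            if (c == null) {
                candidates.put(location, new HashSet<>(heldLocks)); // first access
            } else {
                c.retainAll(heldLocks);                 // keep common locks only
                if (c.isEmpty()) {
                    warnings.add("suspicious access to " + location);
                }
            }
        }

        List<String> warnings() { return warnings; }

        public static void main(String[] args) {
            LocksetSketch lockset = new LocksetSketch();
            lockset.onAccess("balance", Set.of("accountLock"));
            lockset.onAccess("balance", Set.of("auditLock"));  // no common lock
            System.out.println(lockset.warnings());            // one warning to check
        }
    }

Each warning is only a suspicion; in the prototype it is the model checker that decides, by searching for an interleaving, whether the suspected access really races.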

Benchmark programs

Program      Lines   Description
Tsp          706     Traveling salesman from ETH
Our_tsp      708     Enhanced traveling salesman
mtrt         3751    Multithreaded raytracer from SPECjvm98
Hedc         29948   Web crawler kernel from ETH
SortArray    362     Parallel sort
PrimeFinder  129     Finds prime numbers in a given interval
Elevsim      150     Elevator simulator
DQueries     166     Shared DB simulator

Experimental results

Time (sec) and memory (MB) as the number of threads grows:

Program      2 threads          3 threads           4 threads
SortArray    888.7 s, 116 MB    2645.5 s, 143 MB    4547.1 s, 168 MB
PrimeFinder  33.02 s, 28 MB     67.92 s, 33 MB      147.9 s, 48 MB
ElevSim      140.1 s, 60 MB     201.8 s, 89 MB      585.97 s, 136 MB
DQueries     2.66 s, 11 MB      7.33 s, 12 MB       9 s, 17 MB

The largest programs (tsp, our_tsp, Hedc) push the approach to its limits: the heaviest runs reach roughly 35,000 seconds at 350–400 MB, and in some configurations the model checker runs out of memory.

Mining for Misconfigured Machines in a Grid System [KDD’06]
Tested successfully in a production environment.

Grid Batch Systems
- Many organizations and administration sites; tens of thousands of machines
- Heterogeneous, non-dedicated machines with differing installations and configurations
- Many potential causes of failures and misbehaviors: software bugs, hardware, network, configuration
- Current solutions: manual diagnosis, rule-based expert systems
- Our approach: data mining with limited, if any, prior knowledge
- Job life cycle: submission → resource broker → execution

Data Acquisition
(Diagram: non-intrusive data collectors feed a distributed database; after preprocessing, distributed data miners analyze the collected data.)

Distributed Outlier Detection
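The animated example on this slide is not reproduced here, but the flavor of distance-based outlier detection split across sites can be shown with a small Java sketch. It is illustrative only – the radius R, the support threshold, and the tiny feature vectors are invented, and the deployed system uses a more elaborate distributed protocol:

    import java.util.*;

    // Illustrative only: a distance-based outlier test split across sites.
    // R, MIN_SUPPORT and the feature vectors are invented for the example.
    class OutlierSketch {
        static final double R = 1.5;        // neighborhood radius
        static final int MIN_SUPPORT = 2;   // neighbors needed to look "normal"

        // How many vectors held at one site lie within R of the candidate.
        static long localSupport(double[] candidate, List<double[]> siteData) {
            return siteData.stream().filter(v -> dist(candidate, v) <= R).count();
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        public static void main(String[] args) {
            // Two sites, each holding a few machine profiles (e.g. load, free memory).
            List<double[]> site1 = List.of(new double[]{1.0, 1.0}, new double[]{1.2, 0.9});
            List<double[]> site2 = List.of(new double[]{0.9, 1.1}, new double[]{8.0, 9.0});
            double[] suspect = {8.0, 9.0};  // profile of a possibly misconfigured machine

            // Each site reports only its local count.
            long support = localSupport(suspect, site1)
                         + localSupport(suspect, site2) - 1;  // do not count itself
            System.out.println(support < MIN_SUPPORT ? "outlier" : "normal");
        }
    }

Only the per-site neighbor counts need to cross the network, which is what makes a test of this kind attractive in a grid setting.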

Distributed Implementation
(Diagram, shown in three animation steps: nodes S1–S3 exchanging messages within groups SG1–SG3.)

Evaluation on DSL Hardware
3 of the top 4 suspected machines were in fact misconfigured:
- bh10: unknown reason
- i4: loaded by a network service
- bh13: HyperThreading active
- i3: root file system nearly full

Future Work
- Fault identification, analysis, classification, and prediction
- Better resource allocation; better system utilization
- Feedback to users on their submitted job descriptions
- Optimizing transparent operation
- Collaboration with the Intel NetBatch team

HA for Large-Scale Grids [HPDC’06]
Production system – in the Condor distribution

The Challenges
- WAN backups: failure detection is not perfect (no bounded delay); network anomalies (links are asymmetric and not transitive); IP fail-over techniques are inapplicable
- Lightweight protocols: traditional group communication algorithms do not scale well
- Autonomous partitions: transient failures
- Legacy applications without HA: grid developers do not want to deal with HA
- Random, uniformly chosen, partial membership provides a random representative in every network part (see the gossip sketch below)
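The last point – random, uniformly chosen partial membership – can be illustrated with a small gossip step in Java. This is a sketch under assumed parameters (the view size, the shuffle rule, and all names are invented and are not the project's protocol): each node keeps a bounded random view of peers and periodically shuffles it with a neighbor, so every network part stays represented without any global membership list:

    import java.util.*;

    // Illustrative gossip-style peer sampling: each node keeps a bounded random
    // view and shuffles it with a random neighbor. Names and sizes are invented.
    class PeerSamplingSketch {
        static final int VIEW_SIZE = 4;
        final String self;
        final List<String> view = new ArrayList<>();   // partial membership
        final Random rnd = new Random();

        PeerSamplingSketch(String self, List<String> bootstrap) {
            this.self = self;
            view.addAll(bootstrap);
        }

        // One gossip round between this node's view and a chosen peer's view.
        void shuffleWith(List<String> peerView) {
            Set<String> merged = new LinkedHashSet<>(view);
            merged.addAll(peerView);
            merged.remove(self);
            List<String> pool = new ArrayList<>(merged);
            Collections.shuffle(pool, rnd);            // uniform random selection
            view.clear();
            view.addAll(pool.subList(0, Math.min(VIEW_SIZE, pool.size())));
        }

        public static void main(String[] args) {
            PeerSamplingSketch a = new PeerSamplingSketch("a", List.of("b", "c"));
            a.shuffleWith(List.of("d", "e", "f"));     // learned about new peers
            System.out.println(a.view);                // random partial view of the pool
        }
    }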

The Goal
Turn HA into a commodity – "HA out of the box":
- No need to change or adapt your existing service
- HA is provided as a Grid service itself
Solution: decoration (see the wrapper sketch below)
- Transparent addition of HA to already existing and deployed services
- No changes to the decorated service
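Decoration is essentially the wrapper idiom applied to availability. The Java sketch below is illustrative only – GridService, HADecorator, and the simple try-the-next-replica rule are invented for the example and are not the project's API – but it shows the point: callers talk to an object with the same interface, and neither they nor the wrapped service change:

    import java.util.List;

    // Illustrative decoration sketch: HA is added by wrapping an unmodified
    // service behind the same interface. All names here are invented.
    interface GridService {
        String handle(String request) throws Exception;
    }

    class HADecorator implements GridService {
        private final List<GridService> replicas;   // primary first, then backups
        HADecorator(List<GridService> replicas) { this.replicas = replicas; }

        @Override
        public String handle(String request) throws Exception {
            Exception last = null;
            for (GridService replica : replicas) {   // transparent failover
                try {
                    return replica.handle(request);
                } catch (Exception e) {
                    last = e;                         // this replica failed, try the next
                }
            }
            throw last;                               // every replica failed
        }
    }

    class Demo {
        public static void main(String[] args) throws Exception {
            GridService broken = req -> { throw new Exception("primary down"); };
            GridService backup = req -> "handled: " + req;
            GridService ha = new HADecorator(List.of(broken, backup));
            System.out.println(ha.handle("job 42"));  // served by the backup
        }
    }

In the project the decorated service is the unmodified Condor Negotiator, as the following slides describe.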

Application: HA for the Condor Central Manager
(Diagram: a Condor pool in which the job queue machines and execution machines all rely on a single Central Manager running the Collector and Negotiator.)

Solution Architecture

Solution Highlights
HAInvocator – high availability for the Negotiator:
- Leader election (see the sketch below)
- Automatic failure detection
- Transparent failover to a backup
- "Split brain" reconciliation after network partitions
HAReplicator – persistency of Negotiator state:
- State replication between the active machine and backups
- Proxy for multicasting client messages to the Collector
- Loose coupling between replication and HA
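As a rough picture of the HAInvocator side, here is a heartbeat-based leader election sketch in Java. It is an illustration under assumed rules – the ranks, the timeout, and the pick-the-lowest-live-rank policy are invented for the example, and the real protocol additionally handles split-brain reconciliation:

    import java.util.*;

    // Illustrative heartbeat-based leader election: the lowest-ranked candidate
    // whose heartbeat is recent enough is considered the leader. All names,
    // ranks and timeouts are invented for the example.
    class LeaderElectionSketch {
        static final long TIMEOUT_MS = 5_000;
        // rank -> time the candidate's last heartbeat was received
        private final SortedMap<Integer, Long> lastHeartbeat = new TreeMap<>();

        void onHeartbeat(int rank, long nowMs) {
            lastHeartbeat.put(rank, nowMs);            // failure-detection input
        }

        // Lowest-ranked candidate that has not timed out wins; -1 if none.
        int leader(long nowMs) {
            for (Map.Entry<Integer, Long> e : lastHeartbeat.entrySet()) {
                if (nowMs - e.getValue() <= TIMEOUT_MS) {
                    return e.getKey();
                }
            }
            return -1;
        }

        public static void main(String[] args) {
            LeaderElectionSketch election = new LeaderElectionSketch();
            election.onHeartbeat(1, 0);      // primary
            election.onHeartbeat(2, 0);      // backup
            System.out.println(election.leader(1_000));   // 1: primary is alive
            election.onHeartbeat(2, 9_000);  // only the backup keeps sending
            System.out.println(election.leader(10_000));  // 2: transparent failover
        }
    }

HAReplicator-style state replication then keeps the backup's Negotiator state current, so the failover stays transparent to clients.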

Status
- Passed testing in 2005
- Not a single line of Condor code changed (except for several bug fixes)
- In the Condor distribution as of version 6.8
- Some important clients; some success stories
- Ongoing collaboration with the Condor team