TreadMarks Presented By: Jason Robey

Cool pic from last semester

TreadMarks Authors (Rice University)
–Cristiana Amza
–Alan Cox (Keleher committee)
–Eyal de Lara (new)
–Sandhya Dwarkadas
–Charlie Hu (new)
–Pete Keleher (Ph.D. thesis author)
–Honghui Lu (message-passing counter-examples)
–Karthick Rajamani
–Weimin Yu
–Willy Zwaenepoel (Keleher committee)

Overview
1. Consistency Model
2. API
3. Protocols and Implementation
4. Applications and Performance
5. Results Analysis
6. Conclusion

What’s the Problem?
–We want to use multiple COTS processors to do our work more quickly
–Shared memory is closer to our normal model of programming than message passing
–DSM systems usually spend too many resources ensuring that bad programs will still work reasonably well
–We should instead give the programmer the ability to specify coherence requirements

Lazy Release Consistency
–Release consistency (RC) exploits the synchronization points that any valid parallel program already contains
–Processors acquire a data region, work on it, and make it available to other processors
–Eager RC: upon completion of work (the release), valid copies are sent to all concerned processors
–Lazy RC: wait until the data is actually accessed

Ordering and Correct Programs
Partial ordering (hb1)
–Maintain sequential consistency per processor
–Releases and acquires happen in order, so that all releases are visible to a subsequent acquire
–Ordering is transitive
Lazy?
–Updates are not made until access

Ordering and Correct Programs
Correct program
–No data races
–Programmer handles synchronization
–Synchronization events can be used to denote releases and acquires
–What is required is to give the programmer a model they can make deterministic with synchronization primitives, not to guess how an update will need to propagate

API
Setup
–Fixed number of processors during runtime
–Startup and exit
–Feels similar to MPI
Synchronization
–Barriers and locks (acquire, release)
–Integer-based, with a fixed number of supported locks and barriers
Memory
–Tmk_malloc/Tmk_free
–Tmk_distribute (new since the paper)

Manual Example struct shared { int sum; int turn; int* array; } *shared; main(int argc, char **argv) { /*…*/ if (Tmk_proc_id==0) { shared = (struct shared *) Tmk_malloc(sizeof(shared)); if (shared==NULL) Tmk_exit(-1); /* share common pointer with all procs */ Tmk_distribute(&shared, sizeof(shared)); shared->array = (int *) Tmk_malloc(arrayDim*sizeof(int)); if (shared->array==NULL) Tmk_exit(-1); shared->turn = 0; shared->sum = 0; } /* … */ if (Tmk_proc_id == 0) { Tmk_free(shared->array); Tmk_free(shared); /*…*/ }}

Paper Example
–Barriers on p. 6, locks on p. 8
–Oversimplified, but shows the use of barriers and locks
–Barrier = wait until all processors are held at the same barrier before continuing
–Lock = make sure no other processor accesses a region protected by this lock until I release it

Protocols and Implementation
–Do not assume specialized hardware
–Do not assume lightweight processes
–Use only one process per processor
–Register signal handlers for asynchronous messaging and shared-memory access

Protocols and Implementation
Init
1. Create the requested number of processes on remote machines
2. Set up full-duplex sockets between each pair of processes
3. Register a SIGIO handler for messaging
4. Allocate one large block for shared memory at the same (VM) address on each machine and mark it non-accessible using mprotect
5. Choose a processor in round-robin fashion to be the manager for each page of the block and for each lock and barrier
6. Register a SIGSEGV handler for shared-memory access
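Steps 4 and 6 above can be sketched locally on a POSIX system: reserve an inaccessible region, then let the SIGSEGV handler grant access on first touch. This is only a demonstration of the mechanism (demo and segv_handler are illustrative names); the real TreadMarks handler would also fetch the page contents or diffs over the network before retrying the access.

```c
/* Fault-driven page mapping: the core mechanism behind steps 4 and 6. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size;
static volatile sig_atomic_t faults;

/* On a fault, map the touched page read-write; the faulting instruction
 * is then retried. TreadMarks would fetch the page's data here too. */
static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)si->si_addr
                          & ~(uintptr_t)(page_size - 1));
    faults++;
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Reserve an inaccessible region, install the handler, touch two pages,
 * and return how many faults were taken. */
int demo(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    faults = 0;
    char *region = mmap(NULL, 4 * page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    region[0] = 'a';          /* fault: page 0 mapped in by the handler */
    region[1] = 'b';          /* no fault: page 0 already mapped        */
    region[page_size] = 'c';  /* fault: page 1 mapped in                */
    munmap(region, 4 * page_size);
    return (int)faults;
}
```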

Protocols and Implementation
Memory (p. 20, [2])
–Four page states: UNMAPPED, READ-ONLY, READ-WRITE, INVALID
–Fault handler:

    if (p is READ-ONLY) then
        allocate twin
        change p to READ-WRITE
    else
        if (cold miss) then get copy from manager
        if (write notices) then retrieve diffs
        if (write miss) then
            allocate twin
            change p to READ-WRITE
        else
            change p to READ-ONLY
    end
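The pseudocode above can be transcribed into plain C so the transitions are checkable. The page struct and twin_allocated flag are simplifications of the real per-page metadata, and the manager fetch / diff retrieval are stubbed out.

```c
/* State transitions of the TreadMarks fault handler, per the slide. */
#include <stdbool.h>

typedef enum { UNMAPPED, READ_ONLY, READ_WRITE, INVALID } page_state;

typedef struct {
    page_state state;
    bool twin_allocated;   /* copy of the page kept for later diffing */
} page;

/* Handle a fault on page p; is_write says whether the access was a write.
 * Network steps (get copy from manager, retrieve diffs) are elided. */
void on_fault(page *p, bool is_write) {
    if (p->state == READ_ONLY) {
        /* first write to a clean page: make a twin, then allow writes */
        p->twin_allocated = true;
        p->state = READ_WRITE;
        return;
    }
    /* UNMAPPED (cold miss) or INVALID: fetch copy / diffs would go here */
    if (is_write) {
        p->twin_allocated = true;
        p->state = READ_WRITE;
    } else {
        p->state = READ_ONLY;
    }
}
```

The twin is what later makes multiple-writer diffing possible: it records the page as it looked before local writes began.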

Protocols and Implementation
Locks
–Lock = acquire, unlock = release
–Each lock has local and held flags
–If the lock request is local, set the flag if the lock is not held
–Otherwise, request it from the manager
–The manager keeps the flag status and, if held, a pointer to the current owner

Protocols and Implementation
Barriers
–Arrival = acquire for the manager, release for the workers
–Exit = release for the manager, acquire for the workers
–Centralized barrier scheme: the manager listens for processors reaching the barrier and sends a release when all are present
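A local sketch of the centralized scheme, assuming POSIX threads: one party counts arrivals and releases everyone once all have checked in. In TreadMarks the arrivals and the release are messages to and from the barrier's manager process rather than condition-variable operations; central_barrier and its fields are illustrative names.

```c
/* Centralized barrier: count arrivals, release all when complete. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t all_present;
    int expected, arrived, generation;
} central_barrier;

void barrier_init(central_barrier *b, int n) {
    pthread_mutex_init(&b->mu, NULL);
    pthread_cond_init(&b->all_present, NULL);
    b->expected = n;
    b->arrived = 0;
    b->generation = 0;
}

void barrier_wait(central_barrier *b) {
    pthread_mutex_lock(&b->mu);
    int gen = b->generation;
    if (++b->arrived == b->expected) {   /* last arrival: the "release" */
        b->arrived = 0;
        b->generation++;                 /* start a new round */
        pthread_cond_broadcast(&b->all_present);
    } else {
        while (gen == b->generation)     /* wait for the release */
            pthread_cond_wait(&b->all_present, &b->mu);
    }
    pthread_mutex_unlock(&b->mu);
}
```

The generation counter is what lets the same barrier be reused round after round without a racer from round n slipping into round n+1.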

Protocols and Implementation
Multiple Writers
–Avoids the ping-pong effect of other VM page-level DSM systems
–Maintain a diff between the current shared version and the processor's version
–When needed, send diffs to other processors to update the shared memory region
–Multiple writers to the same page are allowed, which avoids false sharing
–If the same memory is written by two processors, that is a race condition
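Diff creation and merging can be sketched as follows, assuming word-granularity comparison against the twin. make_diff and apply_diff are illustrative names, and the real system uses a run-length style encoding rather than this simple (offset, value) list.

```c
/* Multiple-writer diffing: record only the words that changed relative
 * to the twin, and merge such diffs into other copies of the page. */
#include <stddef.h>

#define PAGE_WORDS 1024

typedef struct { size_t offset; int value; } diff_entry;

/* Write changed words of page (vs. its twin) into out; return count. */
size_t make_diff(const int *twin, const int *page, diff_entry *out) {
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i]) {
            out[n].offset = i;
            out[n].value = page[i];
            n++;
        }
    return n;
}

/* Merge a diff into a (possibly concurrently modified) copy. */
void apply_diff(int *copy, const diff_entry *d, size_t n) {
    for (size_t i = 0; i < n; i++)
        copy[d[i].offset] = d[i].value;
}
```

Two writers touching disjoint words of the same page produce disjoint diffs, so both merge cleanly: this is exactly how false sharing is avoided. Overlapping diffs mean the program had a data race, as the slide notes.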

Protocols and Implementation
Lazy diffs
–Diffing can be an expensive operation; the worst case is a modification on every other byte
–Instead of sending diffs on releases (eager) or acquires (lazy), send only invalidate messages
–Upon access, the SIGSEGV handler requests the diffs; the diff is computed at that time
–Multiple diffs may then be collapsed into a single delayed diff
–Once a diff has been sent, the memory is eligible for garbage collection
–Typically, diffs are needed from only one processor in lock situations

Protocols and Implementation
Communication (over best-effort protocols)
–Send
  - Kernel trap interrupts the current process
  - Send the message
  - Wait for the appropriate response or request
  - On timeout, retransmit
  - Restart the process
–Receive
  - Interrupt the process and invoke the SIGIO handler
  - Perform the requested operation
  - Send the response
  - Restart the process
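The send path's timeout/retransmit loop can be sketched with the transport abstracted behind a function pointer, so the control flow runs without sockets. send_with_retry and flaky_transport are illustrative names, not TreadMarks functions; flaky_transport stands in for a lossy network.

```c
/* Request/retransmit loop over a best-effort transport. try_send
 * returns nonzero when the expected reply arrived before the timeout. */
typedef int (*transport_fn)(const void *msg, void *reply);

/* Returns the number of attempts used, or -1 after max_retries timeouts. */
int send_with_retry(transport_fn try_send, const void *msg, void *reply,
                    int max_retries) {
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        if (try_send(msg, reply))   /* got the expected response */
            return attempt;
        /* timeout: fall through and retransmit */
    }
    return -1;
}

/* Demo transport that "drops" the first two requests. */
static int flaky_calls;
int flaky_transport(const void *msg, void *reply) {
    (void)msg; (void)reply;
    return ++flaky_calls >= 3;
}
```

Because replies are idempotent page/diff fetches, a duplicated request caused by retransmission is harmless, which is what lets the system run over best-effort protocols.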

Applications and Performance
–Only two major applications were ever built with it: Mixed Integer Programming (MIP) and ILINK (genetic tracing through family trees)
–Tested on 1 to 8 processors
–Speedups of 4 to 7 at 8 processors
–Around 10 universities have purchased it

Results Analysis
–Starting from an efficient serial solution, “the amount of modification to arrive at an efficient parallel code proved to be relatively minor”
  - Usually only the case for problems bordering on trivially parallel
  - The two major applications appear to be in this class
–Even on these, speedups fall off significantly by the time we reach only 8 processors
–It seems a stretch to claim scalability to larger problems and clusters

Results Analysis
–With this system, some things you would do by hand in the message-passing paradigm happen automatically
–This comes at a cost (diffs and other overhead), and the message-passing version can typically be made more efficient
–Sounds similar to the argument about high-level programming vs. assembly programming
–Shared memory does make some things pleasant

Conclusion
–This work optimized away much of the shared-memory overhead
–Results are worse than one would like even at only 8 processors
–Do not expect good speedup at 16, 32, … processors
–Message passing may be better suited to networks of workstations (NOWs)

References
1. C. Amza et al., “TreadMarks: Shared Memory Computing on Networks of Workstations,” Rice University.
2. P. Keleher, “Distributed Shared Memory Using Lazy Release Consistency,” Ph.D. thesis, Rice University, December.
3. TreadMarks API documentation.
4. “The TreadMarks Distributed Shared Memory (DSM) System,” website (iew.html).

Questions, please? No? General-knowledge questions? (In English, please.)