
1 TreadMarks Presented By: Jason Robey

2 Cool pic from last semester

3 TreadMarks Authors (Rice University)
–Cristiana Amza
–Alan Cox (Keleher committee)
–Eyal de Lara (new)
–Sandhya Dwarkadas
–Charlie Hu (new)
–Pete Keleher (Ph.D. thesis author)
–Honghui Lu (message-passing counter-examples)
–Karthick Rajamani
–Weimin Yu
–Willy Zwaenepoel (Keleher committee)

4 Overview
1. Consistency Model
2. API
3. Protocols and Implementation
4. Applications and Performance
5. Results Analysis
6. Conclusion

5 What’s the Problem?
–We want to use multiple COTS processors to get our work done more quickly
–Shared memory is closer to our usual programming model than message passing
–DSM systems usually spend too many resources ensuring that badly written programs still behave reasonably well
–Instead, we should give the programmer the ability to specify coherence requirements

6 Lazy Release Consistency (RC)
–Eager RC exploits the fact that valid parallel programs already contain synchronization points
–Processors acquire a data region, work on it, and then make it available to other processors
–Upon completion of the work, valid copies are sent to all concerned processors
–Lazy RC instead waits until the data is actually accessed

7 Ordering and Correct Programs
–Partial ordering (hb1)
  –Maintain sequential consistency per processor
  –Releases and acquires happen in order, so that all prior releases are visible to a subsequent acquire
  –The ordering is transitive
–Why "lazy"? Updates are not propagated until the data is accessed

8 Ordering and Correct Programs
–A correct program has:
  –No data races
  –Programmer-handled synchronization
  –Synchronization events that can be used to denote releases and acquires
–What is required is to give the programmer a model they can make deterministic with synchronization primitives, not to guess how an update will need to propagate

9 API
–Setup
  –Fixed number of processors during the run
  –Startup and exit calls
  –Feels similar to MPI
–Synchronization
  –Barriers and locks (acquire, release)
  –Integer-indexed, with a fixed number of supported locks and barriers
–Memory
  –Tmk_malloc/Tmk_free
  –Tmk_distribute (added after the paper)
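A minimal usage sketch of this API (not from the slides; the Tmk_* calls follow the examples in [1], but the header name and exact signatures may vary across TreadMarks versions):

    /* Hedged sketch: Tmk_* calls as shown in [1]; exact
       signatures/header may differ by TreadMarks version. */
    #include <stdio.h>
    #include "Tmk.h"

    #define SUM_LOCK 0

    int *shared_sum;                  /* pointer into Tmk-managed memory */

    int main(int argc, char **argv)
    {
        Tmk_startup(argc, argv);      /* fixed process count for the run */

        if (Tmk_proc_id == 0) {
            shared_sum = (int *) Tmk_malloc(sizeof(int));
            *shared_sum = 0;
            Tmk_distribute(&shared_sum, sizeof(shared_sum));
        }
        Tmk_barrier(0);               /* all processes now see the pointer */

        Tmk_lock_acquire(SUM_LOCK);   /* updates made while holding the   */
        *shared_sum += Tmk_proc_id;   /* lock become visible to the next  */
        Tmk_lock_release(SUM_LOCK);   /* acquirer of the same lock        */

        Tmk_barrier(1);
        if (Tmk_proc_id == 0)
            printf("sum = %d\n", *shared_sum);

        Tmk_exit(0);
        return 0;
    }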

10 Manual Example

    struct shared {
        int  sum;
        int  turn;
        int *array;
    } *shared;

    int main(int argc, char **argv)
    {
        /* ... */
        if (Tmk_proc_id == 0) {
            shared = (struct shared *) Tmk_malloc(sizeof(struct shared));
            if (shared == NULL)
                Tmk_exit(-1);
            /* share common pointer with all procs */
            Tmk_distribute(&shared, sizeof(shared));
            shared->array = (int *) Tmk_malloc(arrayDim * sizeof(int));
            if (shared->array == NULL)
                Tmk_exit(-1);
            shared->turn = 0;
            shared->sum  = 0;
        }
        /* ... */
        if (Tmk_proc_id == 0) {
            Tmk_free(shared->array);
            Tmk_free(shared);
            /* ... */
        }
    }

11 Paper Example
–Barriers on p. 6, locks on p. 8 of [1]
–Excessively simplified, but shows the use of barriers and locks
–Barrier = wait until all processors arrive at the same barrier before continuing
–Lock = ensure no other processor accesses a region protected by this lock until I release it

12 Protocols and Implementation
–Do not assume specialized hardware
–Do not assume light-weight processes
–Use only one process per processor
–Register signal handlers for asynchronous messaging and shared-memory access

13 Protocols and Implementation: Init
1. Create the requested number of processes on remote machines
2. Set up full-duplex sockets between each pair of processes
3. Register a SIGIO handler for messaging
4. Allocate one large block for shared memory at the same (virtual) address on each machine and mark it non-accessible using mprotect
5. Choose a processor in round-robin fashion to be the manager for each page of the block and for each lock and barrier
6. Register a SIGSEGV handler for shared-memory access
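A sketch of steps 4 and 6 using POSIX mmap/mprotect/sigaction (SHARED_BASE, SHARED_SIZE, and segv_handler are illustrative names, not TreadMarks source):

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SHARED_BASE ((void *) 0x40000000)   /* same on every node */
    #define SHARED_SIZE (64 * 1024 * 1024)

    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        /* Step 6: page-fault-driven coherence; si->si_addr gives the
           faulting address (state machine on the next slide). */
    }

    void init_shared_memory(void)
    {
        /* Step 4: reserve the block at a fixed address, no access yet */
        void *base = mmap(SHARED_BASE, SHARED_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (base == MAP_FAILED)
            abort();

        struct sigaction sa;
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }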

14 Protocols and Implementation: Memory (p. 20 of [2])
–Four page states: UNMAPPED, READ-ONLY, READ-WRITE, INVALID
–Fault handling (pseudocode):

    if (p is READ-ONLY) then          /* write fault on a valid page */
        allocate twin
        change p to READ-WRITE
    else
        if (cold miss) then
            get copy from manager
        if (write notices) then
            retrieve diffs
        if (write miss) then
            allocate twin
            change p to READ-WRITE
        else
            change p to READ-ONLY
    end
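In C, a fault handler for that state machine might look like the following (page_state, make_twin, fetch_copy_from_manager, and the other helpers are hypothetical, not the TreadMarks source; set_protection is assumed to both mprotect the page and record its new state):

    typedef enum { UNMAPPED, READ_ONLY, READ_WRITE, INVALID } page_state_t;

    extern page_state_t page_state[];   /* one entry per shared page */

    void handle_page_fault(int page, int is_write)
    {
        if (page_state[page] == READ_ONLY) {
            make_twin(page);            /* keep a copy for later diffing */
            set_protection(page, READ_WRITE);
        } else {                        /* UNMAPPED or INVALID */
            if (page_state[page] == UNMAPPED)
                fetch_copy_from_manager(page);   /* cold miss */
            if (have_write_notices(page))
                fetch_and_apply_diffs(page);     /* bring page up to date */
            if (is_write) {
                make_twin(page);
                set_protection(page, READ_WRITE);
            } else {
                set_protection(page, READ_ONLY);
            }
        }
    }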

15 Protocols and Implementation: Locks
–Lock = acquire, unlock = release
–Each lock has local and held flags
–On a local lock request, set the held flag if the lock is not already held
–Otherwise, request the lock from its manager
–The manager keeps the flag status and, while the lock is held, a pointer to the current owner
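An illustrative acquire path for this protocol (struct tmk_lock and the messaging helpers are hypothetical, not TreadMarks source):

    struct tmk_lock {
        int local;   /* lock is cached on this node   */
        int held;    /* lock is currently held        */
        int owner;   /* manager's record of the owner */
    };

    void lock_acquire(struct tmk_lock *lk, int lock_id)
    {
        if (lk->local && !lk->held) {
            lk->held = 1;                 /* fast path: no messages */
            return;
        }
        /* slow path: the manager grants the lock itself or forwards
           the request to the current owner */
        send_lock_request(manager_of(lock_id), lock_id);
        wait_for_grant(lock_id);          /* grant carries the write
                                             notices needed for LRC */
        lk->local = 1;
        lk->held = 1;
    }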

16 Protocols and Implementation: Barriers
–Arrive = acquire for the manager, release for the workers
–Exit = release for the manager, acquire for the workers
–Centralized barrier scheme: the manager listens for processors arriving at the barrier and sends the release once all are present
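The manager side of such a centralized barrier could look like this sketch (recv_arrival and send_departure are hypothetical helpers):

    void barrier_manager(int barrier_id, int nprocs)
    {
        int arrived = 1;                  /* count the manager itself */
        while (arrived < nprocs) {
            recv_arrival(barrier_id);     /* arrival message carries the
                                             worker's write notices */
            arrived++;
        }
        for (int p = 0; p < nprocs; p++)  /* departure message carries */
            if (p != Tmk_proc_id)         /* the merged write notices  */
                send_departure(p, barrier_id);
    }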

17 Protocols and Implementation: Multiple Writers
–Avoid the page ping-pong effect seen in other VM-page-level DSM systems
–Maintain a diff between the current shared version and the processor's version
–When needed, send diffs to other processors to update the shared memory region
–Multiple writers to the same page avoid false sharing
–If the same memory location is written by two processors, that is a race condition anyway
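A sketch of the twin/diff idea: compare the dirty page against its twin and ship only the changed words. The paper describes real diffs as run-length encodings; the simpler (offset, value) list here, along with PAGE_WORDS and struct diff_entry, is for illustration only:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    struct diff_entry { uint32_t offset; uint32_t value; };

    /* Compare the working page against its twin; returns the number
       of (offset, value) entries written into out[]. */
    size_t make_diff(const uint32_t *page, const uint32_t *twin,
                     struct diff_entry *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i])
                out[n++] = (struct diff_entry){ (uint32_t) i, page[i] };
        return n;
    }

    /* Apply another processor's diff to our copy of the page. */
    void apply_diff(uint32_t *page, const struct diff_entry *d, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            page[d[i].offset] = d[i].value;
    }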

18 Protocols and Implementation: Lazy Diffs
–Diffing can be an expensive operation
–The worst case is a modification to every other byte
–Instead of creating diffs on releases (eager) or acquires (lazy), send only invalidate messages
–Upon access, the SIGSEGV handler requests the diffs, so the diff is computed only at that time
–Multiple diffs may then be collapsed into a single delayed diff
–Once a diff has been sent, the memory is eligible for garbage collection
–Typically, diffs are needed from only one processor in lock situations

19 Protocols and Implementation: Comms (over best-effort protocols)
–Send
  –Kernel trap interrupts the current process
  –Send the message
  –Wait for the appropriate response or request
  –On timeout, retransmit
  –Restart the process
–Receive
  –Interrupt the process and invoke the SIGIO handler
  –Perform the requested operation
  –Send the response
  –Restart the process
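A sketch of the send path's timeout-and-retransmit loop over a connected UDP socket (the framing and socket setup are illustrative, not TreadMarks source):

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>

    #define TIMEOUT_SEC 1

    /* Send a request and wait for its reply, retransmitting on timeout. */
    ssize_t request_reply(int sock, const void *req, size_t req_len,
                          void *reply, size_t reply_max)
    {
        struct timeval tv = { TIMEOUT_SEC, 0 };
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        for (;;) {
            send(sock, req, req_len, 0);            /* (re)transmit */
            ssize_t n = recv(sock, reply, reply_max, 0);
            if (n >= 0)
                return n;                           /* got the reply */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;                          /* real error */
            /* timed out: loop and retransmit */
        }
    }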

20 Applications and Performance
–Only two major applications were ever built with this system:
  –Mixed Integer Programming (MIP)
  –ILINK (genetic tracing through family trees)
–Tested on 1 to 8 processors
–Speedups of 4 to 7 at 8 processors
–Around 10 universities have purchased the system

21 Results Analysis
–Starting from an efficient serial solution, "the amount of modification to arrive at an efficient parallel code proved to be relatively minor"
  –That is usually only the case for systems bordering on trivially parallel
  –The two major applications appear to be in this class
–Even on these, speedups fall off significantly by the time we reach only 8 processors
–It seems a stretch to claim scalability to larger problems and clusters

22 Results Analysis
–With this system, some things that you would do by hand in the message-passing paradigm happen automatically
–This comes at a cost (diffs and other overhead), and the message-passing version can typically be made more efficient
–The argument sounds similar to high-level programming vs. assembly programming
–Shared memory does seem to make some things nice

23 Conclusion
–This work optimized away much of the shared-memory overhead
–Results are worse than one would like for as few as 8 processors
–Do not expect good speedup at 16, 32, ... processors
–Message passing may be better suited for networks of workstations (NOWs)

24 References
1. C. Amza et al., "TreadMarks: Shared Memory Computing on Networks of Workstations," Rice University, 1994.
2. P. Keleher, "Distributed Shared Memory Using Lazy Release Consistency," Ph.D. thesis, Rice University, December 1994.
3. TreadMarks API documentation, versions 0.9.8 and 0.10.1.
4. "The TreadMarks Distributed Shared Memory (DSM) System," http://www.cs.rice.edu/~willy/TreadMarks/overview.html

25 Questions, please? No? General-knowledge questions? (In English, please.)

