Slide 1: Performance of Work Stealing in Multiprogrammed Environments
Matthew Hertz
Department of Computer Science, University of Massachusetts, Amherst

Slide 2: Research Goals
- SMP machines are moving onto desktops
- Parallel applications coexist with other work
- Unlike Cilk, seek a multithreaded solution
- Seek a simple solution to maximize throughput
- Develop an effective performance model

Slide 3: Work Overview
- Number of processes: P
- Number of processors: P_A
- Total work of the computation: T_1
- Execution time of the program: T_P (≈ T_1/P)
- Speedup of the system: T_1/T_P (≈ P)
- Processor utilization: T_1/(P_A · T_P)
(A short worked example of these definitions follows.)
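As a quick, purely illustrative instance of these definitions (the numbers below are assumptions chosen for the example, not measurements from the talk):

```latex
% Illustrative numbers only, not measurements from the talk.
% Assume T_1 = 80 s of total work, P_A = 8 processors, and P = 8 processes.
\[
  T_P \approx \frac{T_1}{P} = \frac{80}{8} = 10\text{ s}, \qquad
  \frac{T_1}{T_P} = 8 \ \text{(speedup)}, \qquad
  \frac{T_1}{P_A\,T_P} = \frac{80}{8\cdot 10} = 1.0 \ \text{(utilization)}.
\]
```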

Slide 4: Static Partitioning
- Given P processes and P_A processors
- Run the first P_A processes to completion
- Processes are replaced as they complete
- This very simple mechanism works when P_A ≥ P:
  - Speedup of P
  - Utilization of 1.0
(A minimal code sketch of this scheme follows.)
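A minimal sketch of static partitioning, not code from the talk: the work is split into P equal chunks, at most P_A of them execute at a time, and a finished worker is immediately replaced by the next chunk. The names do_chunk and static_partition are hypothetical placeholders.

```cpp
// Sketch of static partitioning (illustrative only).
#include <atomic>
#include <thread>
#include <vector>

void do_chunk(int chunk) { /* 1/P of the total work T_1 (placeholder) */ }

void static_partition(int P, int P_A) {
    std::atomic<int> next{0};              // index of the next unstarted chunk
    std::vector<std::thread> workers;
    for (int i = 0; i < P_A; ++i) {        // only P_A chunks run concurrently
        workers.emplace_back([&] {
            // When a chunk finishes, this worker "replaces" it with the next one.
            for (int c = next.fetch_add(1); c < P; c = next.fetch_add(1))
                do_chunk(c);
        });
    }
    for (auto& w : workers) w.join();
}
```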

Slide 5: Static Partitioning Problems
- The worst case occurs when P = P_A + 1
- The first P_A processes take time T_1/P
- The last process needs an additional T_1/P
- T_P = 2(T_1/P) = 2T_1/(P_A + 1)
- Utilization = T_1/(P_A · T_P) ≈ ½
- This is not a robust solution!

Slide 6: Static Partitioning Performance

Slide 7: User-Level Thread Scheduling
- Manage parallelism dynamically, not statically
- This solution offers simplicity and portability
- A natural level of parallelism at which to program
- Many OSs allow these plug-in schedulers
- The OS still manages the kernel-level threads
- But assume that the OS is an adversary

Slide 8: Work Stealing
- Improves upon static partitioning by balancing work among processors
- The best work-stealing method is unknown
- The talk considers three "strawman" algorithms:
  - Spinning locks on the deques
  - Blocking locks on the deques
  - Naïve non-blocking deques
- Where these algorithms fail suggests where improvements are possible (a sketch of the generic loop they all share follows)
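To make the structure concrete, here is a hypothetical C++ sketch of the generic work-stealing loop that all three variants share. Deque, Thread, and run_thread are placeholders (the deque's synchronization is exactly what the three strawman designs differ on); this is not code from the talk or from Hood.

```cpp
// Generic work-stealing worker loop (illustrative sketch).
#include <cstdlib>
#include <optional>
#include <vector>

struct Thread;                           // user-level thread descriptor (placeholder)
void run_thread(Thread* t);              // runs a thread; may push new work (placeholder)

struct Deque {
    std::optional<Thread*> pop_bottom(); // owner takes work from its own end
    std::optional<Thread*> steal_top();  // thieves take work from the other end
};

void worker(int self, std::vector<Deque>& deques) {
    for (;;) {
        // Work off our own deque for as long as it has threads.
        if (auto t = deques[self].pop_bottom()) {
            run_thread(*t);
            continue;
        }
        // Out of local work: pick a random victim and try to steal.
        int victim = std::rand() % static_cast<int>(deques.size());
        if (victim == self)
            continue;
        if (auto t = deques[victim].steal_top())
            run_thread(*t);
    }
}
```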

Slide 9: Spinning Locks
- Locking and unlocking require only a few instructions
- Follows the work-first design principle
- But as the number of processes grows beyond P_A:
  - A thread acquires the lock and is then preempted
  - Other threads spin waiting for the lock, keeping the preempted thread from being rescheduled
- Speedup degrades quickly
(A spin-lock deque sketch follows.)
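A minimal sketch of the spin-lock variant, assuming a std::atomic<bool> as the spin lock and std::deque as the underlying container; this is illustrative, not the talk's implementation.

```cpp
// Spin-lock-protected work-stealing deque (illustrative sketch).
#include <atomic>
#include <deque>
#include <optional>

struct Thread;  // user-level thread descriptor (placeholder)

class SpinLockDeque {
    std::atomic<bool> locked_{false};
    std::deque<Thread*> q_;

    void lock()   { while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked_.store(false, std::memory_order_release); }

public:
    void push_bottom(Thread* t) { lock(); q_.push_back(t); unlock(); }

    std::optional<Thread*> pop_bottom() {      // owner's end
        lock();
        std::optional<Thread*> t;
        if (!q_.empty()) { t = q_.back(); q_.pop_back(); }
        unlock();
        return t;
    }

    std::optional<Thread*> steal_top() {       // thieves' end
        lock();
        std::optional<Thread*> t;
        if (!q_.empty()) { t = q_.front(); q_.pop_front(); }
        unlock();
        return t;
    }
    // The failure mode: if the OS preempts a process between lock() and
    // unlock(), every other process spins here, consuming exactly the
    // processor time the lock holder needs in order to run and release it.
};
```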

Slide 10: Naïve Non-Blocking Deques
- Non-blocking deques support concurrent access
- Better performance, since no one spins on a lock
- However, as the number of processes grows beyond P_A:
  - One process must be preempted
  - The other processes attempt to steal, but the only (unblocked) thread with remaining work is the one that was preempted
  - By continually attempting to steal, these processes prevent the preempted thread from running

Slide 11: Blocking Locks
- Uses OS calls to block and unblock threads waiting on the lock
- Improves performance, since little time is spent waiting for locks
- But each lock access can cost a context switch
- Context switches are very expensive
- The number of switches explodes when P > P_A
- This design violates the work-first principle!
(A blocking-lock sketch follows.)
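A minimal sketch of the blocking-lock variant, assuming std::mutex stands in for an OS lock that deschedules (rather than spins) a contended waiter; again illustrative, not the talk's code.

```cpp
// Blocking-lock work-stealing deque (illustrative sketch).
#include <deque>
#include <mutex>
#include <optional>

struct Thread;  // user-level thread descriptor (placeholder)

class BlockingLockDeque {
    std::mutex m_;
    std::deque<Thread*> q_;

public:
    void push_bottom(Thread* t) {
        std::lock_guard<std::mutex> g(m_);  // a contended acquire may block in the kernel
        q_.push_back(t);
    }
    std::optional<Thread*> pop_bottom() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Thread* t = q_.back(); q_.pop_back();
        return t;
    }
    std::optional<Thread*> steal_top() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Thread* t = q_.front(); q_.pop_front();
        return t;
    }
    // Blocking removes the spinning problem, but contended acquires and the
    // releases that wake waiters can each cost a context switch, and the
    // common push/pop path now pays for a heavyweight lock: this is where
    // the work-first principle is violated.
};
```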

Slide 12: The Solution: Hood
- Designed using the work-first principle
- Must avoid the blocking-lock implementation
- But does not want the expense of spinning locks either
- Uses non-blocking deques, but...
- ...adds (rare) calls into the OS to prevent bursts of steal attempts

Slide 13: Hood Is Non-Blocking
- Threads can only run to completion
- Cannot use locks, semaphores, or condition variables
- Cannot block while waiting on a child thread
- Can start out blocked, waiting on another thread
- Threads are added to a deque only when unblocked

Slide 14: Hood Uses Work Stealing
- Parallelism is expressed at thread granularity
- When out of threads, a process becomes a thief
- Before becoming a thief, it lowers its own priority
- After a successful theft, it restores its old priority
- Between steal attempts, the process "yields"
- This follows the work-first design principle:
  - OS calls are made only by thieves
  - The calls occur only when their cost can be amortized
(A sketch of such a thief loop follows.)
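A hypothetical sketch of a yielding, priority-lowering thief loop in the spirit of what this slide describes, not Hood's actual code. NonBlockingDeque and run_thread are placeholders; the OS calls shown (sched_yield, getpriority, setpriority) are standard POSIX, used here on the assumption that they approximate the "yield" and "lower priority" operations the slide refers to.

```cpp
// Yielding, priority-lowering thief loop (illustrative sketch, POSIX).
#include <cstdlib>
#include <optional>
#include <vector>
#include <sched.h>         // sched_yield
#include <sys/resource.h>  // getpriority, setpriority

struct Thread;                          // placeholder
void run_thread(Thread* t);             // placeholder

struct NonBlockingDeque {
    std::optional<Thread*> steal_top(); // never blocks; empty on failure
};

void thief(int self, std::vector<NonBlockingDeque>& deques) {
    // Lower our own priority before stealing, so a preempted process that
    // still has work is scheduled in preference to idle thieves.
    int old_prio = getpriority(PRIO_PROCESS, 0);
    setpriority(PRIO_PROCESS, 0, old_prio + 1);     // higher nice value = lower priority

    for (;;) {
        int victim = std::rand() % static_cast<int>(deques.size());
        if (victim == self) { sched_yield(); continue; }
        if (auto t = deques[victim].steal_top()) {
            setpriority(PRIO_PROCESS, 0, old_prio); // restore priority after a successful theft
            run_thread(*t);
            return;
        }
        // Unlike the naive loop, yield between failed attempts so bursts of
        // steals cannot starve the preempted process that still has work.
        sched_yield();
    }
}
```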

Slide 15: Hood Performance Results
- Six benchmarks considered: mm, lu, barnes, heat, msort, ray
- On a dedicated 8-processor machine:
  - Average overhead: 1.008 (range 0.955 - 1.030)
  - Average speedup: 7.52 (range 7.02 - 7.83)
- Outperforms static partitioning on 3 of the 4 benchmarks on which they were compared

Slide 16: Performance Model
- Analytical results from previous papers can be expanded:
  T_P ≤ c_1 · T_1/P_A + c_∞ · T_∞ · P/P_A   (4 unknowns)
- Simplify by combining with utilization:
  (1 + c · P/(T_1/T_∞))^(-1) ≤ T_1/(P_A · T_P)
(The algebra connecting these two bounds is sketched below.)
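The step from the execution-time bound to the utilization bound is a rearrangement; a sketch of the algebra, under the assumption that the slide's single constant c stands for c_∞ and that c_1 ≈ 1:

```latex
% Rearranging T_P <= c_1 T_1 / P_A + c_inf T_inf P / P_A (all quantities positive):
\[
  P_A T_P \;\le\; c_1 T_1 + c_\infty T_\infty P
  \quad\Longrightarrow\quad
  \frac{T_1}{P_A T_P} \;\ge\; \frac{1}{\,c_1 + c_\infty \dfrac{P}{T_1/T_\infty}\,}
  \;\approx\; \Bigl(1 + c\,\frac{P}{T_1/T_\infty}\Bigr)^{-1}
  \quad\text{when } c_1 \approx 1,\ c = c_\infty .
\]
```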

Slide 17: Performance Model (continued)
- Measured the model against knary
- c_1 = 1 and c_∞ = 1 hold as a lower bound
- c_1 = 1.1 and c_∞ = 2 usually hold
- The bounds continue to hold on non-dedicated machines
- knary violates the first bound occasionally
- knary violates the second bound often
(A worked numeric instance follows.)
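As a purely illustrative instance of the second set of constants (the process count P = 16 and parallelism T_1/T_∞ = 1000 are assumptions, not values from the talk):

```latex
% Hypothetical numbers: c_1 = 1.1, c_inf = 2, P = 16, T_1/T_inf = 1000.
\[
  \frac{T_1}{P_A T_P} \;\ge\; \frac{1}{\,1.1 + 2\cdot\dfrac{16}{1000}\,}
  = \frac{1}{1.132} \approx 0.88 .
\]
```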

Slide 18: Conclusions
- The non-blocking work stealer is efficient and effective
- It performs as well as, or better than, static partitioning on most benchmarks
- The performance improvement continues on non-dedicated machines
- The derived performance model matches expectations

Slide 19: Questions
- How does Hood perform on "real" benchmarks?
- How does Hood compare with other work-stealing implementations?
- Why does Hood have a negative overhead?
- Why not statically partition the other two benchmarks?

Slide 20: Questions (cont.)
- How well does non-blocking work stealing perform in other uses (e.g., Cilk)?
- How will Hood perform on other platforms? Especially the Alpha, where cas costs more.
- How well does Hood compare with co-scheduling or process control?

