1 Communication Optimizations for Parallel Computing Using Data Access Information
Martin Rinard, Department of Computer Science, University of California, Santa Barbara

2 Motivation Communication Overhead Can Substantially Degrade the Performance of Parallel Computations

3 Communication Optimizations
Replication, Locality, Broadcast, Concurrent Fetches, Latency Hiding

4 Applying Optimizations
Programmer (by hand): programming burden, portability problems.
Language implementation (automatically): reduces the programming burden, no portability problems - each implementation is optimized for its hardware platform.

5 Key Questions How does the implementation get the information it needs to apply the communication optimizations? What communication optimization algorithms does the implementation use? How well do the optimized computations perform?

6 Goal of Talk Present Experience Automatically Applying Communication Optimizations in Jade

7 Talk Outline
Jade Language, Message Passing Implementation, Communication Optimization Algorithms, Experimental Results on iPSC/860, Shared Memory Implementation, Experimental Results on Stanford DASH, Conclusion

8 Jade: Portable, Implicitly Parallel Language Based on Data Access Information
The programmer starts with a serial program and uses Jade constructs to provide information about how parts of the program access data. The Jade implementation uses this data access information to automatically extract concurrency, synchronize the computation, and apply communication optimizations.

9 Jade Concepts: Shared Objects, Tasks, Access Specifications
The withonly construct creates a task: withonly { access specification } do ( shared object references ) { computation that reads and writes shared objects }. The access specification contains access declarations (rd, wr) for the objects the task will access.

10-21 Jade Example (animation sequence)
Each of these slides shows the same withonly { access specification } do () { ... } skeleton, with rd and wr access declarations added step by step as the example is built up.
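A minimal Jade-style sketch of the construct, reconstructed from the fragments on these slides (the object names p and q are hypothetical, and the exact syntax may differ in detail):

    /* p and q are references to Jade shared objects (hypothetical names;
       shared-object declaration and allocation are omitted) */
    withonly {
        rd(p);                /* the task will read p  */
        wr(q);                /* the task will write q */
    } do (p, q) {
        q[0] = 2.0 * p[0];    /* task body performs only the declared accesses */
    }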

22 Result
At each point in the execution there is a collection of enabled tasks, each with an access specification. The Jade implementation exploits the information in the access specifications to apply communication optimizations.

23 Message Passing Implementation
Model of Computation for the Implementation, Implementation Overview, Communication Optimizations, Experimental Results for iPSC/860

24 Model of Computation
Each processor has a private memory; processors communicate by sending messages through the network. [Figure: processors with private memories connected by a network.]

25 Implementation Overview
Distributes Objects Across Memories

26 Implementation Overview
Assigns Enabled Tasks to Idle Processors

27 Implementation Overview
Transfers Objects to the Accessing Processor - Replicates Objects that the Task will Read

28 Implementation Overview
Transfers Objects to the Accessing Processor - Migrates Objects that the Task will Write

29 Implementation Overview
When all Remote Objects Arrive, the Task Executes
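The transfer step on slides 27-29 can be summarized in a short sketch. The types and helper names below are hypothetical stand-ins, not the actual Jade runtime: objects the task will only read are replicated on the executing processor, objects it will write are migrated there, and the task runs once every remote object has arrived.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stddef.h>

    enum access_mode { ACCESS_READ, ACCESS_WRITE };

    struct access {
        int object_id;
        enum access_mode mode;
        bool arrived;            /* has a local copy reached the processor? */
    };

    struct task {
        struct access acc[8];
        size_t n_acc;
        int proc;                /* processor the task was assigned to */
    };

    /* Stand-ins for the runtime's messaging layer. */
    static void replicate_object(int obj, int proc)
    {
        printf("replicate object %d on processor %d (read access)\n", obj, proc);
    }

    static void migrate_object(int obj, int proc)
    {
        printf("migrate object %d to processor %d (write access)\n", obj, proc);
    }

    static void fetch_objects(struct task *t)
    {
        for (size_t i = 0; i < t->n_acc; i++) {
            if (t->acc[i].mode == ACCESS_READ)
                replicate_object(t->acc[i].object_id, t->proc);
            else
                migrate_object(t->acc[i].object_id, t->proc);
            t->acc[i].arrived = true;   /* in the real system, set when the message arrives */
        }
    }

    static bool ready(const struct task *t)
    {
        for (size_t i = 0; i < t->n_acc; i++)
            if (!t->acc[i].arrived)
                return false;
        return true;
    }

    int main(void)
    {
        struct task t = { .acc = { { 1, ACCESS_READ,  false },
                                   { 2, ACCESS_WRITE, false } },
                          .n_acc = 2, .proc = 3 };
        fetch_objects(&t);
        if (ready(&t))
            printf("all objects local: task executes\n");
        return 0;
    }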

30 Optimizations: Goal and Mechanism
Adaptive Broadcast - Goal: parallelize communication. Mechanism: broadcast each new version of widely accessed objects.
Replication - Goal: enable tasks to concurrently read the same data. Mechanism: replicate data on reading processors.
Latency Hiding - Goal: overlap computation and communication. Mechanism: assign multiple enabled tasks to the same processor.
Concurrent Fetch - Goal: parallelize communication. Mechanism: concurrently transfer the remote objects that a task will access.
Locality - Goal: eliminate communication. Mechanism: execute tasks on processors that have locally available copies of the accessed objects.
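As an illustration of the adaptive broadcast row above, here is a hypothetical heuristic (not necessarily the Jade implementation's actual policy): count how many processors fetched the previous version of an object, and broadcast each new version once that count is large enough.

    #include <stdio.h>

    #define NPROCS           32
    #define BROADCAST_CUTOFF 8    /* "widely accessed": illustrative threshold */

    struct object_stats {
        int readers_of_last_version;   /* distinct processors that fetched it */
    };

    static int should_broadcast(const struct object_stats *s)
    {
        return s->readers_of_last_version >= BROADCAST_CUTOFF;
    }

    static void publish_new_version(int object_id, struct object_stats *s)
    {
        if (should_broadcast(s))
            printf("object %d: broadcast new version to all %d processors\n",
                   object_id, NPROCS);
        else
            printf("object %d: wait for point-to-point fetch requests\n", object_id);
        s->readers_of_last_version = 0;   /* start counting for the new version */
    }

    int main(void)
    {
        struct object_stats widely = { .readers_of_last_version = 20 };
        struct object_stats narrow = { .readers_of_last_version = 2 };
        publish_new_version(1, &widely);
        publish_new_version(2, &narrow);
        return 0;
    }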

31 Application-Based Evaluation
Water: evaluates forces and potentials in a system of liquid water molecules.
String: computes a velocity model of the geology between two oil wells.
Ocean: simulates the role of eddy and boundary currents in influencing large-scale ocean movements.
Panel Cholesky: a sparse Cholesky factorization algorithm.

32 Impact of Communication Optimizations
[Table: for each application (Water, String, Ocean, Panel Cholesky), which optimizations (Adaptive Broadcast, Replication, Latency Hiding, Concurrent Fetch) had significant impact (+) or negligible impact (-); Replication is marked as required to expose concurrency.]

33 Locality Optimization
Integrated into the online scheduler. The scheduler maintains a pool of enabled tasks and a pool of idle processors, and balances load by assigning enabled tasks to idle processors; the locality algorithm affects this assignment.

34 Locality Concepts
Each object has an owner: the last processor to write the object; the owner has a current copy of the object.
Each task has a locality object: currently the first object in its access specification.
The locality object determines the target processor: the owner of the locality object.
Goal: execute each task on its target processor.

35 When a Task Becomes Enabled
The scheduler checks the pool of idle processors. If the target processor is idle, the target processor gets the task. If some other processor is idle, that processor gets the task. If no processor is idle, the task is held in the pool of enabled tasks.

36 When a Processor Becomes Idle
The scheduler checks the pool of enabled tasks. If the processor is the target of an enabled task, it gets that task. Otherwise, if other enabled tasks exist, it gets one of those tasks. If there are no enabled tasks, the processor stays idle.
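A sketch of the assignment policy on slides 34-36. The data structures and helper names are hypothetical, but the decisions follow the slides: prefer the task's target processor when it is idle, otherwise use any idle processor, and when a processor becomes idle prefer an enabled task that targets it.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define NPROCS   2
    #define MAXTASKS 16

    struct task {
        int id;
        int target;    /* owner of the task's locality object */
    };

    static struct task enabled_pool[MAXTASKS];  /* pool of enabled tasks */
    static size_t n_enabled;
    static bool idle[NPROCS];                   /* pool of idle processors */

    static void assign(struct task t, int proc)
    {
        idle[proc] = false;
        printf("task %d -> processor %d%s\n", t.id, proc,
               proc == t.target ? " (target)" : "");
    }

    /* Slide 35: when a task becomes enabled. */
    static void task_enabled(struct task t)
    {
        if (idle[t.target]) {                    /* target processor is idle */
            assign(t, t.target);
            return;
        }
        for (int p = 0; p < NPROCS; p++)         /* some other processor is idle */
            if (idle[p]) { assign(t, p); return; }
        enabled_pool[n_enabled++] = t;           /* no processor idle: hold the task */
    }

    /* Slide 36: when a processor becomes idle. */
    static void processor_idle(int proc)
    {
        idle[proc] = true;
        size_t pick = n_enabled;
        for (size_t i = 0; i < n_enabled; i++)   /* prefer a task that targets this processor */
            if (enabled_pool[i].target == proc) { pick = i; break; }
        if (pick == n_enabled && n_enabled > 0)
            pick = 0;                            /* otherwise take any enabled task */
        if (pick < n_enabled) {
            struct task t = enabled_pool[pick];
            enabled_pool[pick] = enabled_pool[--n_enabled];
            assign(t, proc);
        }                                        /* else: no enabled tasks, stay idle */
    }

    int main(void)
    {
        for (int p = 0; p < NPROCS; p++) idle[p] = true;
        task_enabled((struct task){ .id = 1, .target = 1 });  /* target idle             */
        task_enabled((struct task){ .id = 2, .target = 1 });  /* other processor idle    */
        task_enabled((struct task){ .id = 3, .target = 0 });  /* no processor idle: held */
        processor_idle(1);                                    /* idle processor takes the held task */
        return 0;
    }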

37 Implementation Versions
Locality: the implementation uses the locality algorithm.
No Locality: first-come, first-served assignment of enabled tasks to idle processors.
Task Placement (Ocean and Panel Cholesky): the programmer assigns tasks to processors.

38 Percentage of Tasks Executed on Target Processor on iPSC/860
[Figure: percentage of tasks executed on the target processor (0-100%) versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

39 Communication to Useful Computation Ratio on iPSC/860 (Mbytes/Second/Processor)
[Figure: communication-to-computation ratio versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

40 Speedup on iPSC/860
[Figure: speedup versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

41 Shared Memory Implementation
Model of Computation, Locality Optimization, Locality Performance Results

42 Model of Computation
A single shared memory composed of memory modules; each memory module is associated with a processor, and each object is allocated in a memory module. Processors communicate by reading and writing objects in the shared memory. [Figure: processors, memory modules, and objects in the shared memory.]

43 Locality Algorithm
Integrated into the online scheduler. The scheduler runs a distributed task queue: each processor has a queue of enabled tasks, and idle processors search the task queues. The locality algorithm affects the task queue algorithm.

44 Locality Concepts
Each object has an owner: the processor associated with the memory module that holds the object; accesses to the object from this processor are satisfied from the local memory module.
Each task has a locality object: currently the first object in its access specification.
The locality object determines the target processor: the owner of the locality object.
Goal: execute each task on its target processor.

45 When a Processor Becomes Idle
If its task queue is not empty, it executes the first task in its own queue. Otherwise, it cyclically searches the other task queues; if a remote task queue is not empty, it executes the last task in that queue.

46 When a Task Becomes Enabled
The locality algorithm inserts the task into the task queue at the owner of its locality object; tasks with the same locality object are adjacent in the queue. Goals: enhance memory locality by executing each task on the owner of its locality object, and enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
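A sketch of the distributed task queue policy on slides 43-46. The queue layout and helper names are hypothetical, but the policy follows the slides: enabled tasks are inserted at the owner of their locality object, an idle processor takes the first task from its own queue, and otherwise cyclically searches the other queues and takes the last task from a non-empty one.

    #include <stdio.h>

    #define NPROCS 4
    #define QCAP   32

    struct queue {
        int tasks[QCAP];    /* task ids; the front of the queue is index 0 */
        int len;
    };

    static struct queue q[NPROCS];   /* one task queue per processor */

    /* Slide 46: insert an enabled task into the queue at the owner of its
     * locality object; appending at that owner keeps tasks with the same
     * locality object adjacent in this sketch. */
    static void task_enabled(int task_id, int locality_owner)
    {
        struct queue *tq = &q[locality_owner];
        tq->tasks[tq->len++] = task_id;
    }

    /* Slide 45: take the first task from the processor's own queue, and
     * otherwise cyclically search the other queues and take the last task
     * from a non-empty one. Returns -1 if there are no enabled tasks. */
    static int next_task(int proc)
    {
        struct queue *own = &q[proc];
        if (own->len > 0) {
            int t = own->tasks[0];
            for (int i = 1; i < own->len; i++)
                own->tasks[i - 1] = own->tasks[i];
            own->len--;
            return t;
        }
        for (int d = 1; d < NPROCS; d++) {
            struct queue *rq = &q[(proc + d) % NPROCS];
            if (rq->len > 0)
                return rq->tasks[--rq->len];
        }
        return -1;
    }

    int main(void)
    {
        task_enabled(1, 0);   /* locality object owned by processor 0 */
        task_enabled(2, 0);
        task_enabled(3, 2);   /* locality object owned by processor 2 */
        printf("processor 0 runs task %d (own queue)\n", next_task(0));
        printf("processor 1 runs task %d (stolen from a remote queue)\n", next_task(1));
        printf("processor 3 runs task %d (stolen from a remote queue)\n", next_task(3));
        printf("processor 1 gets %d (no enabled tasks)\n", next_task(1));
        return 0;
    }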

47 Evaluation
Same set of applications: Water, String, Ocean, Panel Cholesky.
Same locality versions: Locality; No Locality (single task queue); Explicit Task Placement (Ocean and Panel Cholesky).

48 Percentage of Tasks Executed on Target Processor on DASH
[Figure: percentage of tasks executed on the target processor (0-100%) versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

49 Task Execution Time on DASH
[Figure: task execution time versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

50 Speedup on DASH
[Figure: speedup versus number of processors (8-32) for Water, String, Ocean, and Panel Cholesky, comparing the locality, no locality, and task placement versions.]

51 Related Work
Shared Memory: COOL - Chandra, Gupta, Hennessy; Fowler, Kontothanassis.
Message Passing: Tempest - Falsafi, Lebeck, Reinhardt, Schoinas, Hill, Larus, Rogers, Wood; Munin - Carter, Bennett, Zwaenepoel; SAM - Scales, Lam; Olden - Carlisle, Rogers; Prelude - Hsieh, Wang, Weihl.

52 Conclusion
Access specifications enable communication optimizations. The optimizations were implemented for Jade in both the message passing implementation and the shared memory implementation. Experimental, application-based evaluation: replication is required to expose concurrency; locality is significant for Ocean and Panel Cholesky; broadcast is significant for Water; the other optimizations have little or no impact.

