(Superficial!) Review of Uniprocessor Architecture
Parallel Architectures and Related Concepts
CS 433
Laxmikant Kale
University of Illinois at Urbana-Champaign
Department of Computer Science
Parallel Machines: an abstract introduction
Our main focus will be on three kinds of machines:
–Bus-based shared memory machines
–Scalable shared memory machines
  Cache coherent
  Hardware support for remote memory access
–Distributed memory machines
Distributed memory machines: debate
[Figure: processing elements PE0, PE1, ..., PEp-1, each with its own cache and memory (Mem0, Mem1, ..., Memp-1), connected by an interconnection network]
Should this machine support a shared address space?
–If not: coordination by “passing messages”
–If so: how, and whether, to keep caches “coherent”?
This debate is also tied to the debate over programming models.
Writing parallel programs
Programming model
–How should a programmer view the parallel machine?
–Sequential programming: von Neumann model
Parallel programming models:
–Shared memory (shared address space) model
–Message passing model
–Shared objects model
Common to all these models:
–Multiple independent entities communicate, synchronize and coordinate with each other via specific mechanisms provided by the model
Special-purpose models:
–A common case: data-parallel (loop-parallel) models
–Other “domain-specific” models
Shared Address space model
Also sometimes called the shared memory model
–Considered a misnomer by some: shared memory is an architectural concept
Independent entities are called threads (or processes)
–All threads use the same common address space
–When thread i refers to an address A, it is the same location as when thread j refers to address A
Advantages:
–Natural extension of the sequential programming model
  Some people disagree even about this
–Relatively easy to get a “first parallel version” of an existing sequential code
Shared Address space model: Issues
Need hardware support for cache coherence and consistency
–But that is not the concern when we are discussing the efficacy of the programming model
Data being read by one thread may be being modified by another
–Need ways of synchronizing access
–E.g. producer-consumer relationship between threads:
  Producer is to store the result in shared variable X
  When can the consumer thread read it?
Another example: inconsistent modifications
–Suppose two threads are both trying to add 5 to x:
    x := x+5
–In reality, it is not one instruction, but 3:
    ld r1,x; add r1,r1,5; st r1,x
–Now, the 6 instructions (3 from each thread) may interleave in many possible ways, leading to wrong behavior
  E.g. if both threads load the old x before either stores, x ends up increased by 5 instead of 10
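The lost update described above can be reproduced deterministically by simulating the two threads' instruction streams. This is a minimal sketch, not a real multithreaded program: the names `ThreadState`, `step`, and `run_interleaving` are hypothetical, and each "thread" is just a private register plus a program counter.

```c
#include <assert.h>
#include <stddef.h>

/* Each simulated thread runs three steps on the shared x:
   step 0: ld r1,x    step 1: add r1,r1,5    step 2: st r1,x */
typedef struct {
    int r1;   /* private register */
    int pc;   /* index of the next step to execute (0..2) */
} ThreadState;

/* Execute the next instruction of thread t against the shared variable *x. */
static void step(ThreadState *t, int *x) {
    assert(t->pc < 3);
    switch (t->pc) {
    case 0: t->r1 = *x; break;   /* ld  r1,x    */
    case 1: t->r1 += 5; break;   /* add r1,r1,5 */
    case 2: *x = t->r1; break;   /* st  r1,x    */
    }
    t->pc++;
}

/* Run the 6 instructions in the given order. order[i] is 0 or 1 and names
   the thread that executes next. Returns the final value of x. */
int run_interleaving(int x0, const int order[6]) {
    ThreadState th[2] = {{0, 0}, {0, 0}};
    int x = x0;
    for (size_t i = 0; i < 6; i++) {
        assert(order[i] == 0 || order[i] == 1);
        step(&th[order[i]], &x);
    }
    return x;
}
```

With the order 0,0,0,1,1,1 (one thread runs to completion before the other starts) the result is x0+10. With the order 0,1,0,1,0,1 both threads load the old value, and the result is x0+5: one of the two additions is lost.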
SAS model: Locks and Barriers
Solution: Locks
–A lock is a variable
–You can: create a lock, “lock” a lock, and “unlock” a lock
–The implementation guarantees that only one thread can “get” (or “lock”) a lock at a time
Using locks:
–Protect vulnerable shared data using a lock
–Associate a lock with each such variable
  Mentally (there is no construct or call to do the association)
–Before changing the variable, lock its associated lock
  Unlock it as soon as you have finished using the variable
–Remember that this is only a convention
  Nothing prevents a thread from inadvertently changing a lock-protected variable in another part of the code
  Analogy: locking a room with a “post-it” on the door
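As one possible concrete form of the locking convention, the sketch below uses POSIX threads, which the slides do not name; that choice is an assumption. The names `x_lock`, `add_five_repeatedly`, and `locked_add_demo` are hypothetical. Note that nothing in the code ties `x_lock` to `x` except the programmer's discipline.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int x = 0;   /* shared variable */
/* Lock "associated" with x by convention only. */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct { int iters; } WorkerArg;

/* Each thread adds 5 to x, iters times, holding the lock around each update. */
static void *add_five_repeatedly(void *p) {
    const WorkerArg *arg = p;
    for (int i = 0; i < arg->iters; i++) {
        pthread_mutex_lock(&x_lock);     /* "lock" the lock */
        x = x + 5;                       /* ld / add / st now run without interference */
        pthread_mutex_unlock(&x_lock);   /* unlock as soon as we are done with x */
    }
    return NULL;
}

/* Run two threads and return the final x (expected: 2 * 5 * iters). */
int locked_add_demo(int iters) {
    pthread_t t[2];
    WorkerArg arg = { iters };
    x = 0;
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, add_five_repeatedly, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return x;
}
```

Removing the lock and unlock calls turns this back into the racy version, where the final value may fall short of 2 * 5 * iters.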
Matrix multiplication: why people like the SAS model
    for (i=0; i<M; i++)
      for (j=0; j<N; j++)
        for (k=0; k<L; k++)
          C[i][j] += A[i][k]*B[k][j];
In a shared memory style, this program is trivial to parallelize
–Just have each processor deal with a different range of i (or j? or both?)
SAS matrix multiply
Each thread knows its “serial number”: myPE()
    size = M/numPEs();        /* rows per thread; assumes numPEs() divides M */
    myStart = myPE()*size;    /* first row owned by this thread */
    for (i=myStart; i<myStart+size; i++)
      for (j=0; j<N; j++)
        for (k=0; k<L; k++)
          C[i][j] += A[i][k]*B[k][j];
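A runnable version of the row-partitioned multiply might look like the following sketch. It assumes POSIX threads as the thread mechanism, stores matrices in row-major 1-D arrays, and passes each thread its serial number in a struct rather than through `myPE()`. `Work`, `multiply_rows`, `sas_matmul`, and `MAX_PES` are hypothetical names. No lock is needed because each thread writes a disjoint set of rows of C.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PES 64   /* arbitrary upper bound for this sketch */

typedef struct {
    int myPE, numPEs;   /* this thread's serial number and the thread count */
    int M, N, L;        /* A is M x L, B is L x N, C is M x N (row-major) */
    const double *A;
    const double *B;
    double *C;
} Work;

/* Thread body: compute the block of rows [myStart, myStart+size) of C. */
static void *multiply_rows(void *p) {
    const Work *w = p;
    int size = w->M / w->numPEs;
    int myStart = w->myPE * size;
    for (int i = myStart; i < myStart + size; i++)
        for (int j = 0; j < w->N; j++)
            for (int k = 0; k < w->L; k++)
                w->C[i * w->N + j] += w->A[i * w->L + k] * w->B[k * w->N + j];
    return NULL;
}

/* C = A * B using numPEs threads; numPEs must divide M evenly. */
void sas_matmul(int M, int N, int L, const double *A, const double *B,
                double *C, int numPEs) {
    assert(numPEs > 0 && numPEs <= MAX_PES && M % numPEs == 0);
    pthread_t threads[MAX_PES];
    Work work[MAX_PES];
    for (int i = 0; i < M * N; i++)
        C[i] = 0.0;
    for (int pe = 0; pe < numPEs; pe++) {
        work[pe] = (Work){pe, numPEs, M, N, L, A, B, C};
        int rc = pthread_create(&threads[pe], NULL, multiply_rows, &work[pe]);
        assert(rc == 0);
        (void)rc;
    }
    for (int pe = 0; pe < numPEs; pe++)
        pthread_join(threads[pe], NULL);
}
```

All threads read A and B and write C through the same pointers, which is exactly what the shared address space provides; the only per-thread information is the row range.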
Message passing
Parallel entities are processes
–Each with its own address space
Assume that processors have direct access only to their own memory
Each processor typically executes the same executable, but may be running a different part of the program at any given time
Coordination:
–via sending and receiving “messages”: bytes of data
Message passing basics
Basic calls: send and recv
    send(int proc, int tag, int size, char *buf);
    recv(int proc, int tag, int size, char *buf);
–recv may return the actual number of bytes received in some systems
–tag and proc may be wildcarded in a recv:
    recv(ANY, ANY, 1000, &buf);
Broadcast
Other global operations (reductions)
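To illustrate how tags and wildcards select a message, here is a toy, single-process mailbox. It is only a sketch of the matching rule, not a real message passing library: `mp_send`, `mp_recv`, the extra `self` parameter, the value of `ANY`, and the fixed-size `mailbox` are all assumptions made for the illustration. Unlike a typical blocking recv, `mp_recv` returns -1 when nothing matches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ANY (-1)        /* wildcard for proc or tag (assumed value) */
#define MAX_MSGS 16
#define MAX_BYTES 64

typedef struct {
    int used, src, dest, tag, size;
    char data[MAX_BYTES];
} Msg;

static Msg mailbox[MAX_MSGS];   /* stands in for the interconnection network */

/* Process `self` sends `size` bytes from buf to process `proc`.
   Returns 0 on success, -1 if the message is too large or the mailbox is full. */
int mp_send(int self, int proc, int tag, int size, const char *buf) {
    if (size < 0 || size > MAX_BYTES)
        return -1;
    for (int i = 0; i < MAX_MSGS; i++) {
        if (!mailbox[i].used) {
            mailbox[i] = (Msg){1, self, proc, tag, size, {0}};
            memcpy(mailbox[i].data, buf, (size_t)size);
            return 0;
        }
    }
    return -1;
}

/* Process `self` receives a pending message addressed to it whose source
   matches `proc` and whose tag matches `tag` (either may be ANY).
   Copies at most `size` bytes and returns the number copied, or -1 if
   no pending message matches. */
int mp_recv(int self, int proc, int tag, int size, char *buf) {
    for (int i = 0; i < MAX_MSGS; i++) {
        Msg *m = &mailbox[i];
        if (m->used && m->dest == self &&
            (proc == ANY || proc == m->src) &&
            (tag == ANY || tag == m->tag)) {
            int n = m->size < size ? m->size : size;
            memcpy(buf, m->data, (size_t)n);
            m->used = 0;   /* the message is consumed */
            return n;
        }
    }
    return -1;
}
```

For instance, after process 0 sends a 6-byte message with tag 7 to process 1, a recv on process 1 that asks for source 2 or tag 8 finds nothing, while a recv with both fields set to `ANY` returns 6 and consumes the message.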
Parallel Programming
Decomposition: what to do in parallel
–Tasks (loop iterations, functions, ...) that can be done in parallel
Mapping:
–Which processor does each task
Scheduling (sequencing):
–The order of tasks on each processor
Machine dependent expression:
–Express the above decisions for the particular parallel machine
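As a small illustration of the mapping step, the sketch below gives two common ways to answer "which processor does task t?" when the tasks are numbered loop iterations. Block and cyclic mappings are not named in the slides; they are used here only as familiar examples, and `block_owner` and `cyclic_owner` are hypothetical names.

```c
#include <assert.h>

/* Block mapping: each processor gets a contiguous chunk of tasks.
   When numPEs does not divide numTasks, the first (numTasks % numPEs)
   processors get one extra task each. */
int block_owner(int task, int numTasks, int numPEs) {
    assert(numPEs > 0 && task >= 0 && task < numTasks);
    int base = numTasks / numPEs;        /* minimum tasks per processor */
    int extra = numTasks % numPEs;       /* processors holding base+1 tasks */
    int boundary = extra * (base + 1);   /* first task owned by a "small" processor */
    if (task < boundary)
        return task / (base + 1);
    return extra + (task - boundary) / base;
}

/* Cyclic mapping: tasks are dealt out round-robin. */
int cyclic_owner(int task, int numPEs) {
    assert(numPEs > 0 && task >= 0);
    return task % numPEs;
}
```

With 10 tasks on 4 processors, the block mapping assigns tasks 0-2 to processor 0, 3-5 to processor 1, 6-7 to processor 2, and 8-9 to processor 3, while the cyclic mapping assigns task 9 to processor 1.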
Spectrum of parallel languages
[Figure: chart relating specialization to level (what is automated: decomposition, mapping, scheduling (sequencing), machine dependent expression) for MPI/SAS, Charm++, and a parallelizing Fortran compiler]
Shared objects model
Basic philosophy:
–Let the programmer decide what to do in parallel
–Let the system handle the rest:
  Which processor executes what, and when
  With some override control to the programmer, when needed
Basic model:
–The program is a set of communicating objects
–Objects only know about other objects (not processors)
–System maps objects to processors
  And may remap the objects dynamically, for load balancing etc.
Shared objects, not shared memory
–So, in some ways, in between the “shared nothing” of message passing and the “shared everything” of SAS
–More disciplined sharing
–Additional information sharing mechanisms
Charm++
Data driven objects: called chares
Asynchronous method invocation
Prioritized scheduling
Object arrays
Object groups:
–global object with a “representative” on each PE
Information sharing abstractions
–readonly data
–accumulators
–distributed tables
Data Driven Execution
[Figure: scheduler, message queue, and objects]
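The scheduler, message queue, and objects named on this slide can be sketched in plain C as follows. This is not Charm++ code and not its API: `Object`, `Message`, `enqueue`, `run_scheduler`, and `relay` are hypothetical names, the queue is a plain FIFO rather than a prioritized one, and everything runs on a single processor. The point is only that objects execute when a message for them is picked from the queue, and that a method can invoke another object asynchronously by enqueueing a message.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Object Object;
typedef void (*Method)(Object *self, int arg);

struct Object {
    int state;      /* data owned by this object */
    Object *next;   /* another object this one knows about (may be NULL) */
};

/* A message names a target object, the method to run, and an argument. */
typedef struct {
    Object *target;
    Method method;
    int arg;
} Message;

#define QUEUE_CAP 32
static Message msg_queue[QUEUE_CAP];
static size_t head = 0, tail = 0;   /* FIFO indices; tail - head = pending count */

/* Asynchronous invocation: record the request and return immediately.
   Returns 0 on success, -1 if the queue is full. */
int enqueue(Object *target, Method method, int arg) {
    if (tail - head == QUEUE_CAP)
        return -1;
    msg_queue[tail % QUEUE_CAP] = (Message){target, method, arg};
    tail++;
    return 0;
}

/* Scheduler loop: pick the next message and run the method on its target
   until the queue is empty. Returns the number of messages processed. */
int run_scheduler(void) {
    int processed = 0;
    while (head != tail) {
        Message m = msg_queue[head % QUEUE_CAP];
        head++;
        m.method(m.target, m.arg);
        processed++;
    }
    return processed;
}

/* Example method: update local state, then message the next object. */
void relay(Object *self, int arg) {
    self->state += arg;
    if (self->next != NULL)
        enqueue(self->next, relay, arg + 1);
}
```

A usage example: with two objects a and b, where a.next points to b, enqueueing `relay` on a with argument 1 and calling `run_scheduler()` processes two messages and leaves a.state at 1 and b.state at 2.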
Object Arrays
A collection of chares,
–with a single global name for the collection, and
–each member addressed by an index
–Mapping of element objects to processors handled by the system
[Figure: the user’s view shows one array A[0], A[1], A[2], A[3], A[..]; the system view shows individual elements (e.g. A[3], A[0]) placed on processors]
Object Groups
A group of objects (chares)
–with exactly one representative on each processor
–A single ID for the group as a whole
–Invoke methods in one branch (asynchronously), in all branches (broadcast), or in the local branch
Information sharing abstractions
Observation:
–Information is shared in several specific modes in parallel programs
Other models support only a limited set of modes:
–Shared memory: everything is shared (sledgehammer approach)
–Message passing: messages are the only method
Charm++ identifies and supports several modes:
–Readonly / writeonce
–Tables (hash tables)
–Accumulators
–Monotonic variables
Comparing Programming Models
What are the advantages and disadvantages of the models?
–Even at this simple/abstract level of introduction?