October 17
MuPC Run Time System for UPC
Steve Seidel, Phil Merkey, Jeevan Savant, Kian Gap Lee
Department of Computer Science, Michigan Technological University
Brian Wibecan, Program PI
Phil Becker, Program Manager
Kevin Harris, Bruce Trull, and Daniel Christians
Compaq UPC Development

UPC
Designed by Carlson et al.
A “light weight” extension of C for parallelism
A shared memory, multithreaded model
Arrays and pointers can be shared
Array distribution is semi-automatic
Remote references are automatically resolved
Parallel constructs include
– forall
– fence and split barrier
Built-ins for
– memory allocation/free
– locks

Compaq's UPC compiler
UPC to object code:
– front end translates UPC source to EDG IL
– lowering phase converts UPC-specifics to standard EDG IL
– middle end converts EDG IL to GEM-compatible IL
– GEM back end converts GEM IL to Alpha object code
Each of the intermediate phases above has some UPC-specific components.
Alternative: “bail out” after the lowering phase to produce C code that includes calls to a run time system.
Under discussion: EDG front end for UPC

Run Time System Interface
The RTS interface is an evolving set of data objects and methods that captures the semantics of “UPC minus C”.
An RTS “reference implementation” was suggested by Harris.
A publicly available reference implementation will
– promote UPC code base, user base and platform base
– challenge MPI and OpenMP
– foster RTS evolution
– promote support for UPC tools
MuPC is MTU's run time system for UPC

Run Time System Structure
Run time structures describing shared objects and globals are maintained.
References to nonlocal shared objects are made through get and put.
UPC barriers and fences are passed directly to the RTS.
The same is true of UPC calls to other built-in functions that provide locks and dynamic memory allocation.

Available compiler technology
Proprietary Compaq compiler supports a proprietary RTS.
Reference compiler is not currently available, but...
Compaq will provide a compiler that supports the reference RTS.

MuPC Design Goals
Public availability
Wide platform base
Open source maintained by MTU
User-level implementation
Quick delivery
Efficiency is not a primary goal

Available Platforms
MTU (on site):
– Beowulf cluster (64 nodes)
– Sun Enterprise 4500 (12 processors)
– SGI Origin 2000 (4 processors)
– Sun workstation networks (various)
– Linux workstation networks (various)
– AlphaServer and 2 workstations (provided by Compaq)
Remote:
– AlphaServer SC (Compaq)
– T3E (Cray)

Transport vehicle selection
Candidates
– MPI: no one-sided communication
– MPI-2: incomplete implementations
– Pthreads: no multiprocessor support
– OpenMP: expensive, possibly incompatible
– shmem: limited platform base
– VIA: limited platform base
– ARMCI: limited user base
– TCP/IP: too low-level
Selection criteria
– Portability and availability: MPI, Pthreads, TCP/IP
– Technical shortcomings can be overcome

MPI/Pthreads hybrid transport vehicle
MPI provides process control and interprocessor communication.
Pthreads provides multithreading within each process to handle asynchronous remote accesses.
The following are equivalent in MuPC:
– one MPI process
– one UPC thread (from the user’s point of view)
– one user Pthread + one MPI send/recv Pthread
Thread safety is provided by isolating all MPI calls in the send/recv Pthread.

upcrun -np 3 upc-demo
[Figure: the command starts three MPI processes. Each process calls MPI_Init and then pthread_create, which gives it one user UPC thread and one send/recv thread. All three processes end in upc_finalize.]

Example: Nonlocal array reference x = a[k];
    // User shared array
    shared int a[10][THREADS];
    // Front-end-generated temporary pointer
    shared int *UPC_RTS_ptr;
    ...
    // UPC source code:
    //   x = a[k];
    // Front end computes address,
    // phase and thread of remote reference.
    UPC_RTS_ptr = (vaddr, phase, thread);
    // Call is made to get a[k]
    x = MuPC_get_sync_int(UPC_RTS_ptr);

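The address, phase and thread computation mentioned on this slide can be sketched in plain C. This is only an illustration under stated assumptions: the struct `sketch_shared_ptr` and the function `sketch_locate` are hypothetical names, not part of MuPC or the Compaq front end, and `vaddr` is assumed to be a byte offset into the owning thread's local part of the array. The thread and phase formulas follow UPC's round-robin distribution of blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the RTS pointer-to-shared. The real layout
   used by MuPC is not given on the slides. */
typedef struct {
    size_t vaddr;   /* assumed: byte offset within the owning thread's local part */
    size_t phase;   /* position of the element inside its block */
    size_t thread;  /* UPC thread that owns the element */
} sketch_shared_ptr;

/* Map linear element index i of a shared array with the given block size
   onto (vaddr, phase, thread). Blocks are dealt to threads round-robin. */
static sketch_shared_ptr sketch_locate(size_t i, size_t block,
                                       size_t threads, size_t elem_size)
{
    assert(block > 0 && threads > 0);

    sketch_shared_ptr p;
    size_t block_no    = i / block;           /* block that holds element i */
    size_t local_block = block_no / threads;  /* earlier blocks on the same thread */

    p.phase  = i % block;
    p.thread = block_no % threads;
    p.vaddr  = (local_block * block + p.phase) * elem_size;
    return p;
}
```

With the default block size of 1 and 4 threads, element 7 of an int array would land on thread 3 as that thread's second local element.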
x = MuPC_get_sync_int(UPC_RTS_ptr p);
Pthread lock structs: send_lock, recv_lock
MuPC_get_sync_int (user thread):
    send_lock.type = GET
    send_lock.ptr = p
    wait on recv_lock.done
    x = recv_lock.data
Send/Recv thread (requesting process):
    while (threads)
      case ...
        GET:   MPI_Send(p, RECV)
        ...
        REPLY: MPI_Recv(y)
               recv_lock.data = y
               recv_lock.done = T
        ...
    end while
Send/Recv thread (owning process):
    while (threads)
      case ...
        RECV: MPI_Recv(p)
              MPI_Send(*p, REPLY)
        ...
    end while

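The handoff between the user thread and the send/recv thread might look roughly like the sketch below. It is not MuPC code. All names (`sketch_mailbox`, `sketch_service`, `sketch_get_sync_int`, `sketch_stop`) are hypothetical, the two lock structs are folded into one mailbox, and the MPI round trip is replaced by a read from a local array so the example runs without MPI. It also assumes a single user thread with at most one outstanding request.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_req { REQ_NONE, REQ_GET, REQ_STOP };

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    enum sketch_req type;       /* plays the role of send_lock.type */
    int             index;      /* plays the role of send_lock.ptr */
    int             data;       /* plays the role of recv_lock.data */
    bool            done;       /* plays the role of recv_lock.done */
    const int      *remote;     /* stands in for memory on the owning process */
    int             remote_len;
} sketch_mailbox;

static void sketch_mailbox_init(sketch_mailbox *mb, const int *remote, int len)
{
    pthread_mutex_init(&mb->mu, NULL);
    pthread_cond_init(&mb->cv, NULL);
    mb->type = REQ_NONE;
    mb->index = 0;
    mb->data = 0;
    mb->done = false;
    mb->remote = remote;
    mb->remote_len = len;
}

/* Send/recv thread: sleeps until a request is posted, then services it. */
static void *sketch_service(void *arg)
{
    sketch_mailbox *mb = arg;
    pthread_mutex_lock(&mb->mu);
    for (;;) {
        while (mb->type == REQ_NONE)
            pthread_cond_wait(&mb->cv, &mb->mu);
        if (mb->type == REQ_STOP)
            break;
        /* GET: the real loop would MPI_Send the pointer and MPI_Recv the reply. */
        mb->data = mb->remote[mb->index];
        mb->done = true;
        mb->type = REQ_NONE;
        pthread_cond_broadcast(&mb->cv);
    }
    pthread_mutex_unlock(&mb->mu);
    return NULL;
}

/* User thread: post a GET and block until the reply has been filled in. */
static int sketch_get_sync_int(sketch_mailbox *mb, int index)
{
    assert(index >= 0 && index < mb->remote_len);
    pthread_mutex_lock(&mb->mu);
    mb->type = REQ_GET;
    mb->index = index;
    mb->done = false;
    pthread_cond_broadcast(&mb->cv);
    while (!mb->done)
        pthread_cond_wait(&mb->cv, &mb->mu);
    int x = mb->data;
    pthread_mutex_unlock(&mb->mu);
    return x;
}

/* Ask the service thread to exit and wait for it. */
static void sketch_stop(sketch_mailbox *mb, pthread_t tid)
{
    pthread_mutex_lock(&mb->mu);
    mb->type = REQ_STOP;
    pthread_cond_broadcast(&mb->cv);
    pthread_mutex_unlock(&mb->mu);
    pthread_join(tid, NULL);
}
```

A caller would initialize the mailbox, start `sketch_service` with `pthread_create`, issue gets, and finish with `sketch_stop`. Keeping every communication call inside the service thread mirrors the thread-safety argument made for the hybrid design.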
Synthetic Testing
Pseudo-code walkthroughs of all MuPC functions
Synthetic test codes are C/MPI programs that call MuPC RTS routines directly.
Shared data is artificially allocated.
    // THREAD 0
    int a[10];
    ...
    // a[12]=42;
    index=12%10;
    thread=12/10;
    MuPC_put_integer(a,index,thread,42);
    ...

    // THREAD 1
    int a[10];
    ...
    // outcome is
    // a[2]=42;
    ...

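A single-process stand-in for this synthetic test could look like the following. It is a sketch under assumptions: `sketch_put_integer` and `sketch_global_store` are hypothetical helpers, not the MuPC routines, and the per-thread copies of `a[]` are held in one two-dimensional array instead of in separate MPI processes. The index and thread arithmetic is the one shown on the slide.

```c
#include <assert.h>

#define SKETCH_THREADS 2
#define SKETCH_LOCAL   10   /* elements of a[] held by each thread, as on the slide */

/* One private copy of a[] per simulated thread. In the real test each MPI
   process would own one of these. */
static int sketch_mem[SKETCH_THREADS][SKETCH_LOCAL];

/* Hypothetical stand-in for a put: store value into a[index] on the given thread. */
static void sketch_put_integer(int index, int thread, int value)
{
    assert(index >= 0 && index < SKETCH_LOCAL);
    assert(thread >= 0 && thread < SKETCH_THREADS);
    sketch_mem[thread][index] = value;
}

/* Resolve a global index the way the slide does, then issue the put. */
static void sketch_global_store(int global_index, int value)
{
    int index  = global_index % SKETCH_LOCAL;  /* 12 % 10 gives 2 */
    int thread = global_index / SKETCH_LOCAL;  /* 12 / 10 gives 1 */
    sketch_put_integer(index, thread, value);
}
```

Calling `sketch_global_store(12, 42)` writes 42 into element 2 of thread 1's copy and leaves thread 0's copy untouched, which is the outcome the slide describes.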
Integration Testing
Wrap gets, puts and notify/wait to conform to the RTS interface.
Integrate MuPC with front end...
– ... data structures and globals
– ... initialization and finalization
Rewrite synthetic tests in UPC and compare to previous results.
Add built-in functions for
– locks
– memory allocation

Full-scale Testing
MTU test kernels
GWU UPC test suite
Contributed UPC codes

Documentation, Delivery, and Distribution
MuPC source
Front-end binaries for targeted platforms
Makefiles, release notes, etc.
Serve these items from MTU MuPC web site
Publish a description of MuPC

Preliminary Work, Summer, 2001
RTS header files provided by Compaq
MPI-2 one-sided communication proposed as primary transport vehicle, but current implementations do not meet the full standard
MPI/Pthreads hybrid selected
Studied intermediate output of Compaq's UPC front end
Compaq hardware and software delivered
Single-threaded working environment verified
Accounts on AlphaServer SC also provided

August 20-21, Nashua
Participants:
– Bill Carlson, Brian Wibecan, Kevin Harris, Phil Becker, Daniel Christians, Jim Bovay, Savant, Merkey, Seidel
Discussed RTS definition and UPC features per Wibecan's agenda.
Outcomes:
– MPI/Pthreads hybrid design feasible
– MuPC will include upccc and upcrun MPI wrappers
– Agreed on RTS and UPC feature interpretations
– MuPC efficiency and performance not highest priority
– Written meeting summary submitted to Compaq (Sept. 23, 2001)

Current Work
Recent improvements:
– isolating MPI calls for thread safety
– send/recv threads yield control when there are no pending requests
Skeleton implementations of get/put, barrier, fence, and finalize have been scaled to over 30 nodes on MTU’s Beowulf cluster.

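The yielding behavior of the send/recv thread can be illustrated with a small polling loop. This is a sketch only, assuming a simple counter of pending requests. `sketch_queue` and `sketch_poll_loop` are hypothetical names, and the slides do not say which mechanism MuPC uses to yield. Here the POSIX call `sched_yield` gives up the processor whenever nothing is pending.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_int pending;  /* requests posted but not yet handled */
    atomic_int handled;  /* requests serviced so far */
    atomic_int stop;     /* set to 1 to ask the loop to drain and exit */
} sketch_queue;

/* Service loop: handle one request per pass, yield when idle. */
static void *sketch_poll_loop(void *arg)
{
    sketch_queue *q = arg;
    assert(q != NULL);
    for (;;) {
        /* Read stop BEFORE pending: if stop is already visible, every
           request posted before it is visible too, so nothing is lost. */
        int stopping = atomic_load(&q->stop);
        if (atomic_load(&q->pending) > 0) {
            atomic_fetch_sub(&q->pending, 1);   /* "service" one request */
            atomic_fetch_add(&q->handled, 1);
        } else if (stopping) {
            break;                              /* drained and asked to stop */
        } else {
            sched_yield();                      /* nothing to do: give up the CPU */
        }
    }
    return NULL;
}
```

Yielding when idle matters in this design because the user thread and the send/recv thread of one UPC thread may share a single processor.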
Project Work Plan: Start date June 28, 2001
This plan is based on the Project Work Items specified in the March 27 RFP from Compaq and on the March 30 MTU Proposal.

Completed Work Items (per MTU proposal)
1(a): Review implementation methodologies
(b): Identify development platforms
(c): Align resources (staff and platforms)
(d): Identify target platforms
(e): Conclusion memo (sent 9/23/01)
2: Formal Work Plan and Agreement
– (Written version of this document)
4: Initial Design of Run Time System
– Design presented in Nashua on August 20, 2001

Remaining Work Items (w/completion dates)
5: Development of remaining primary components (Jan. 1, 2002)
– (d) locks
– (e) complete gets and puts
– (b) memory allocation
– (f) utility functions
3: Test design and documentation (Feb. 1, 2002)
– This testing will be done concurrently with Item 5 above.
– (a) Synthetic testing
– (b) Integration testing
– (c) Full-scale testing

Remaining Work Items (continued)
6: Public Interface development (April 1, 2002)
– (a) Makefiles, release notes, installation notes, etc.
– (b) Bundle all necessary software
– (c) Provide MTU-authored test codes and results
– (d) Release advance copies for review and comment
7: System Refinement and Delivery (June 1, 2002)
– (a) Release MuPC to the UPC Developers' Group
– (b) Maintain MuPC website at MTU
– (c) Publish description of MuPC
8: Completion Certification (June 28, 2002)
– (a) Final MuPC release by MTU

MuPC Project Staff
Jeevan Savant, M.S. Graduate Student
– MuPC design and implementation
– (Items 5(b,d,e,f), 6(a,d), and 7(a,c) above)
– Support: 9 months, half-time
Kian Gap (Mark) Lee, M.S. Graduate Student
– MuPC testing and platform integration
– (Items 3(a,b,c), 6(b,c), 7(b,c) above)
– Support: 9 months, half-time
Phillip Merkey, Research Assistant Professor
Steven Seidel, Associate Professor

Additional MTU UPC projects
Charles Wallace, Assistant Professor
– UPC memory models
Xiaodi (Lisa) Li, M.S. Graduate Student
– Benchmarking MuPC using one or two NAS parallel benchmarks
Yi (Leon) Liang, M.S. Graduate Student
– Pthreads-only MuPC RTS
Yongsheng Huang, M.S. Graduate Student
– UPC memory models, improving MuPC efficiency
Zhang Zhang, Ph.D. Graduate Student
– UPC memory models, improving MuPC efficiency