Overview of An Efficient Implementation Scheme of Concurrent Object-Oriented Languages on Stock Multicomputers Tony Chen, Sunjeev Sikand, and John Kerwin.


1 Overview of An Efficient Implementation Scheme of Concurrent Object-Oriented Languages on Stock Multicomputers Tony Chen, Sunjeev Sikand, and John Kerwin CSE 291 - Programming Sensor Networks May 23, 2003 Paper by: Kenjiro Taura, Satoshi Matsuoka, and Akinori Yonezawa

2 2 Background Most of the work done on high performance, concurrent object-oriented programming languages (OOPLs) has focused on combinations of elaborate hardware and highly tuned, specially tailored software. These software architectures (the compiler and the runtime system) exploit special features provided by the hardware in order to achieve: Efficient intra-node multithreading. Efficient message passing between objects.

3 3 Special Hardware Features The hardware manages the thread scheduling queue, and automatically dispatches the next runnable thread upon termination of the current thread. Processors and the network are tightly connected. Processors can send a packet to the network within a few machine cycles. Dispatching a task upon packet arrival takes only a few cycles.

4 4 Objective of this Paper Demonstrate software techniques that can be used to achieve comparable intra-node multithreading, and inter-node message passing performance on conventional multicomputers, without special hardware scheduling and message passing facilities.

5 5 System Used to Demonstrate these Techniques The authors developed a runtime environment for a concurrent object-oriented programming language called ABCL/onAP1000. They used Fujitsu Laboratory’s experimental multicomputer called AP1000: 512 SPARC chips running at 25 MHz, interconnected with a 25 MB/s torus network.

6 6 Computation/Programming Model Computation is carried out by message transmissions among concurrent objects, which are units of concurrency that become active when they accept messages. Multiple message transmissions may take place in parallel, so objects may become active simultaneously. When an object receives a message, the message is placed in its message queue, so that messages can be processed one at a time.

7 7 Computation/Programming Model (cont.) Messages can contain mail addresses of concurrent objects in addition to basic values such as numbers and booleans. Each object has its own autonomous single thread of control, and its own encapsulated state variables. Objects can be in dormant mode if they have no messages to process, active mode if they are executing a method, or waiting mode if they are waiting to receive a certain set of messages.

8 8 Possible Actions Within a Method Message sends to other concurrent objects. Past type – sender does not wait for a reply message. Now type – sender waits for a reply message. Reply messages are sent through a third object called a reply destination object, which resumes the original sender upon the reception of the reply message. Creation of concurrent objects.

9 9 Possible Actions Within a Method (cont.) Referencing and updating the contents of state variables. Waiting for a specified set of messages. Standard operations (like arithmetic operations) on values stored in state variables.

10 10 Scheduling Process Scheduling for sequential OOPLs simply involves a method lookup and a stack-based function call. For concurrent OOPLs, scheduling of methods is not necessarily LIFO-based, since methods may be blocked to wait for messages, and resumed upon the arrival of a message. Therefore, a naïve implementation must allocate invocation frames from the heap instead of the stack, and use a scheduling queue to keep track of pending methods.

11 11 Scheduling Process (cont.) In addition, since it may not be possible for a receiver object to immediately process incoming messages, each object must have its own message queue to buffer incoming messages. This can lead to substantial overhead for frame allocation/deallocation, and queue manipulation, for both the scheduling and message queues.

12 12 Example of a Naïve Scheduling Mechanism A naïve implementation of message reception / method invocation for an object would require: 1. Allocation of an invocation frame to hold local variables and message arguments of the method. 2. Buffering a message into the frame. 3. Enqueueing the frame into the object message queue. 4. Enqueueing the object into the scheduling queue (if it is not already there).
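The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' runtime: the names `Frame`, `NaiveObject`, and `NaiveScheduler` are hypothetical, and Python objects stand in for heap-allocated frames.

```python
from collections import deque


class Frame:
    """Hypothetical heap-allocated invocation frame holding one buffered message."""

    def __init__(self, selector, args):
        self.selector = selector
        self.args = args


class NaiveObject:
    """Concurrent object with its own message queue (names are illustrative)."""

    def __init__(self, methods):
        self.methods = methods        # selector -> function(obj, *args)
        self.message_queue = deque()
        self.scheduled = False        # True while the object sits in the scheduling queue
        self.log = []


class NaiveScheduler:
    def __init__(self):
        self.scheduling_queue = deque()

    def send(self, receiver, selector, *args):
        frame = Frame(selector, args)             # steps 1 and 2: allocate frame, buffer message
        receiver.message_queue.append(frame)      # step 3: enqueue frame on the object
        if not receiver.scheduled:                # step 4: enqueue object if not already there
            receiver.scheduled = True
            self.scheduling_queue.append(receiver)

    def run(self):
        # Every method invocation goes through both queues, even when the
        # receiver had nothing else to do.
        while self.scheduling_queue:
            obj = self.scheduling_queue.popleft()
            frame = obj.message_queue.popleft()
            obj.methods[frame.selector](obj, *frame.args)
            if obj.message_queue:
                self.scheduling_queue.append(obj)
            else:
                obj.scheduled = False
```

Even a single message to an idle object pays for one frame allocation and two queue operations here, which is the overhead the stack-based strategy tries to avoid.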

13 13 Key Observation for Intra-node Scheduling Strategy In many cases, this full scheduling mechanism is not necessary, and we can use more efficient stack-based scheduling. If an object is dormant, meaning it has no messages to be processed, its method can be invoked immediately upon message reception, without message buffering or schedule queue manipulation. If it is active, then the message is buffered, and the method is invoked later via the scheduling queue.

14 14 Example of ABCL/onAP1000 Intra-node Scheduling Strategy

15 15 Scheduling Strategy Implementation We need a mechanism to implement this strategy efficiently. We cannot perform a runtime check on every intra-node message send to determine whether or not the receiver is dormant. When a running object becomes blocked on the stack, we must be able to resume other objects.

16 16 Components of an Object

17 17 Virtual Function Tables A Virtual Function Table Pointer (VFTP) points to a Virtual Function Table, which contains the address of each compiled function (method) of the class.

18 18 Key Idea in Object Representation Each class has multiple virtual function tables, each of which roughly corresponds to a mode (dormant, active, and waiting) of an object. When an object is in dormant mode, its Virtual Function Table Pointer (VFTP) points to the table that contains the method bodies. When an object is active, the VFTP points to a virtual function table that holds tiny queueing procedures, which simply allocate a frame, store the message into the frame, and enqueue it on the object’s message queue.
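A rough Python sketch of this idea follows. Dictionaries stand in for virtual function tables, and the names (`BODIES`, `DORMANT_TABLE`, `ACTIVE_TABLE`, `send`) are hypothetical. As a simplification, messages buffered while the object is active are drained in place after the method body, rather than through a separate scheduling queue.

```python
from collections import deque


def add(obj, n):
    obj.total += n


def add_twice(obj, n):
    obj.total += n
    send(obj, "add", n)   # obj is active here, so this send is only queued


# Compiled method bodies of one illustrative class.
BODIES = {"add": add, "add_twice": add_twice}


def dormant_entry(selector):
    """Entry used while dormant: run the method body directly on the stack."""
    def entry(obj, *args):
        obj.vft = ACTIVE_TABLE                 # switch VFTP to active mode
        BODIES[selector](obj, *args)           # execute the method body
        while obj.message_queue:               # check the message queue (simplified drain)
            queued_selector, queued_args = obj.message_queue.popleft()
            BODIES[queued_selector](obj, *queued_args)
        obj.vft = DORMANT_TABLE                # switch VFTP back to dormant mode
    return entry


def queueing_entry(selector):
    """Entry used while active: a tiny procedure that only buffers the message."""
    def entry(obj, *args):
        obj.message_queue.append((selector, args))
    return entry


DORMANT_TABLE = {s: dormant_entry(s) for s in BODIES}
ACTIVE_TABLE = {s: queueing_entry(s) for s in BODIES}


class ConcurrentObject:
    def __init__(self):
        self.vft = DORMANT_TABLE      # plays the role of the VFTP
        self.message_queue = deque()
        self.total = 0


def send(receiver, selector, *args):
    # No explicit "is the receiver dormant?" test: the table lookup decides.
    receiver.vft[selector](receiver, *args)
```

Calling `send(obj, "add_twice", 5)` on a dormant object runs the body immediately; the nested send to the same object lands in the queue because the table pointer was switched before the body ran.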

19 19 Benefits of Multiple Virtual Function Tables With multiple virtual function tables, a sender object does not have to do a runtime check of whether or not the receiver object is dormant. Instead this check is built into the virtual function table look-up, which is already a necessary cost in object-oriented programming languages.

20 20 Benefits of Multiple Virtual Function Tables (cont.) Can be used to implement selective message reception where acceptable messages trigger functions that restore the context of the object, and unacceptable messages trigger queueing procedures. Can also be used to initialize an object’s state variables, by creating a table that points to initialization functions that initialize variables before calling a method body.
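Selective message reception can be sketched the same way. The snippet below is an illustration only: it builds the waiting-mode table at run time, whereas a compiler would presumably emit such tables ahead of time, and the names `WaitingObject`, `wait_for`, `resume_entry`, and `queueing_entry` are hypothetical.

```python
from collections import deque


class WaitingObject:
    def __init__(self):
        self.vft = {}                 # current table, selector -> entry function
        self.message_queue = deque()
        self.saved_context = None     # continuation to run when an acceptable message arrives
        self.result = None


def queueing_entry(selector):
    """Entry for unacceptable messages: buffer them for later."""
    def entry(obj, *args):
        obj.message_queue.append((selector, args))
    return entry


def resume_entry(selector):
    """Entry for acceptable messages: restore the saved context and continue."""
    def entry(obj, *args):
        continuation = obj.saved_context
        obj.saved_context = None
        continuation(obj, selector, *args)
    return entry


def wait_for(obj, acceptable, continuation, all_selectors):
    """Put obj into waiting mode for the given set of acceptable selectors."""
    obj.saved_context = continuation
    obj.vft = {
        s: resume_entry(s) if s in acceptable else queueing_entry(s)
        for s in all_selectors
    }
```

A sender still performs only the ordinary table lookup; whether the message resumes the waiting object or is buffered depends on which table the object currently points to.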

21 21 Combining the Stack with the Scheduling Queue When a method is invoked on a dormant object, an activation frame is allocated on the stack, thereby achieving fast frame allocation/deallocation. If this invocation blocks in the middle of a thread, it allocates another frame on the heap, and saves its context to this frame, which will survive until termination of the method. The scheduling queue is used to schedule preempted objects that saved their context into a heap-allocated frame, or to invoke messages that were buffered in a message queue.
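The interplay between the fast stack path and the scheduling queue can be approximated in Python with generators. This is an analogy under assumptions, not the actual mechanism: a suspended generator stands in for the heap-allocated frame that holds a blocked method's context, and `invoke`, `reply`, `run_scheduler`, and the reply tokens are hypothetical names.

```python
import inspect
from collections import deque

scheduling_queue = deque()   # (saved context, reply value) pairs ready to resume
waiting = {}                 # reply token -> saved context of a blocked method
results = []


def invoke(body, *args):
    """Invoke a method directly; save its context only if it blocks."""
    activation = body(*args)
    if not inspect.isgenerator(activation):
        return "completed"               # plain call: ran to completion on the stack
    try:
        token = next(activation)         # run until the method blocks or finishes
    except StopIteration:
        return "completed"
    waiting[token] = activation          # blocked: keep the context for later
    return "blocked"


def reply(token, value):
    """A reply arrived: make the blocked method schedulable again."""
    scheduling_queue.append((waiting.pop(token), value))


def run_scheduler():
    while scheduling_queue:
        activation, value = scheduling_queue.popleft()
        try:
            token = activation.send(value)   # resume from the saved context
        except StopIteration:
            continue                         # method terminated
        waiting[token] = activation          # blocked again on another reply


def double(x):
    results.append(x * 2)                    # never blocks


def ask_then_store(token):
    value = yield token                      # block until the reply for token arrives
    results.append(value)
```

Methods that never block, such as `double`, never touch the scheduling queue; only `ask_then_store` pays for a saved context.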

22 22 Example of Stack Unwinding

23 23 Inter-node Software Architecture Important for message passing between objects on different nodes, and object creation on a remote node. Assumes the hardware (or message passing libraries) provides an interface to send and receive messages asynchronously. Uses an Active Message-like mechanism, where each message attaches its own self-dispatching message handler, which is invoked immediately after the delivery of the message.

24 24 Customized Message Handlers Providing a customized message handler for each kind of remote message allows the system to achieve low overhead remote task dispatching. Message handlers are classified into the following categories: 1. Normal message transmission between objects 2. Request for remote object creation 3. Reply to remote memory allocation request 4. Other services such as load balancing, garbage collection, etc.
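The self-dispatching idea can be illustrated with a small Python sketch in which each packet carries a reference to its own handler. The handler names, the `SimNode` class, and the packet layout are assumptions made for this example; they cover the first three handler categories listed above.

```python
class SimNode:
    """Hypothetical node state touched by the handlers below."""

    def __init__(self):
        self.objects = {}   # local address -> list of received (selector, args)
        self.stock = []     # replacement addresses received from remote nodes


def handle_object_message(node, address, selector, args):
    """Category 1: normal message transmission between objects."""
    node.objects[address].append((selector, args))


def handle_create_request(node, address):
    """Category 2: request for remote object creation."""
    node.objects[address] = []


def handle_allocation_reply(node, replacement_address):
    """Category 3: reply to a remote memory allocation request."""
    node.stock.append(replacement_address)


def make_packet(handler, *payload):
    # The handler travels with the message, as in Active Messages.
    return (handler, payload)


def deliver(node, packet):
    handler, payload = packet
    handler(node, *payload)   # invoked immediately, with no central type-tag dispatch
```

Because each kind of message has its own specialized handler, the receiving node does no generic decoding before the work starts.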

25 25 Remote Object Creation A mail address of an object is represented as. This provides maximum performance for local object access, and avoids the overhead of export table management. Object creation on a remote node requires a memory allocation on the remote node to generate a remote mail address.

26 26 Remote Object Creation (cont.) Since the latency of remote communication is unpredictable, and the cost of context switching is high, it is unacceptable to wait for the remote node to allocate memory and return a pointer. Therefore the system uses a prefetch scheme, where each node manages predelivered stocks of addresses of memory chunks on remote nodes, and these addresses are used for remote object allocation. A node only has to wait for a remote address to be allocated if its local stock is empty.

27 27 Typical Remote Object Creation Sequence The requester node obtains a unique mail address locally from the stock. It sends a creation request message to the node specified by the mail address. The target node performs class-specific initialization (such as initialization of the virtual function table) of the created object upon receipt of the creation message. The target node allocates a replacement chunk of memory, and returns its address to the requester node. The requester replenishes its stock upon receipt of the replacement address.
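The creation sequence above can be simulated in a few lines of Python. This is a sketch under several assumptions: the `StockNode` class and its method names are hypothetical, a mail address is modeled as a `(node_id, local_address)` pair purely for illustration, and the request and reply are ordinary synchronous calls, whereas the real messages would be asynchronous so the requester does not wait.

```python
from collections import deque


class StockNode:
    def __init__(self, node_id, stock_size=2):
        self.node_id = node_id
        self.next_free = 0        # next unused local "address"
        self.memory = {}          # local address -> object state
        self.stock = {}           # remote node id -> deque of prefetched mail addresses
        self.stock_size = stock_size

    def allocate_chunk(self):
        """Reserve a local chunk and return its (assumed) mail address."""
        address = self.next_free
        self.next_free += 1
        return (self.node_id, address)

    def prefetch_from(self, remote):
        """Pre-deliver a stock of addresses of chunks on the remote node."""
        self.stock[remote.node_id] = deque(
            remote.allocate_chunk() for _ in range(self.stock_size)
        )

    def create_remote(self, remote, class_name):
        stock = self.stock[remote.node_id]
        if not stock:
            # Slow path: only an empty stock forces a wait for a remote allocation.
            stock.append(remote.allocate_chunk())
        mail_address = stock.popleft()                     # obtained locally from the stock
        replacement = remote.handle_creation(mail_address, class_name)
        stock.append(replacement)                          # replenish the stock
        return mail_address

    def handle_creation(self, mail_address, class_name):
        _, address = mail_address
        # Class-specific initialization of the created object.
        self.memory[address] = {"class": class_name, "mode": "dormant"}
        return self.allocate_chunk()                       # replacement chunk for the requester
```

The requester knows the new object's mail address as soon as it pops the stock, so it can start sending messages to that address without waiting for the target node.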

28 28 Costs of Basic Operations
Operation: Time (µs)
Intra-node Message (to Dormant): 2.3
Intra-node Message (to Active): 9.6
Intra-node Creation: 2.1
Latency of Inter-node Message: 8.9

29 29 Breakdown of Intra-node Message to Dormant Object
Step: Instructions
Check Locality: 3
Lookup and Call: 5
Switch VFTP to Active Mode: 3
Execution of Method Body: -
Check Message Queue: 3
Switch VFTP to Dormant Mode: 3
Polling of Remote Message Arrival: 5
Adjusting Stack Pointer and Return: 3
Total: 25

30 30 Comparison of Send/Reply Latency
ABCL/onAP1000: Instruction Counts 160; Real Time 17.8 µs; Cycles 450; Clock Rate 25 MHz
ABCL/onEM4: Instruction Counts 100; Real Time 9 µs; Cycles 110; Clock Rate 12.5 MHz
CST (on J-Machine): Instruction Counts 110; Real Time 4 µs; Cycles 220; Clock Rate 50 MHz
Measured in cycles, send and reply latency for the ABCL/onAP1000 conventional multicomputer is only about 4 times that of the ABCL/onEM4 fine-grain machine, and 2 times that of the CST fine-grain machine.
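As a quick sanity check on these figures, the short Python calculation below converts cycle counts to time and computes the cycle ratios. It assumes the slide's numbers read as 450, 110, and 220 cycles at 25, 12.5, and 50 MHz respectively, which is an interpretation of the flattened table rather than something stated in prose.

```python
# Cycle counts and clock rates as read from the slide (an interpretation).
machines = {
    "ABCL/onAP1000": {"cycles": 450, "clock_mhz": 25.0},
    "ABCL/onEM4": {"cycles": 110, "clock_mhz": 12.5},
    "CST (on J-Machine)": {"cycles": 220, "clock_mhz": 50.0},
}


def latency_us(cycles, clock_mhz):
    """Cycles divided by MHz gives microseconds."""
    return cycles / clock_mhz


times_us = {name: latency_us(**m) for name, m in machines.items()}

ap1000_cycles = machines["ABCL/onAP1000"]["cycles"]
ratio_vs_em4 = ap1000_cycles / machines["ABCL/onEM4"]["cycles"]
ratio_vs_cst = ap1000_cycles / machines["CST (on J-Machine)"]["cycles"]

for name, t in times_us.items():
    print(f"{name}: {t:.1f} us")
print(f"cycle ratio vs EM4: {ratio_vs_em4:.1f}, vs CST: {ratio_vs_cst:.1f}")
```

Under that reading, the computed times (18.0, 8.8, and 4.4 µs) are close to the real-time row, and the "about 4 times" and "2 times" comparison matches the cycle counts rather than the wall-clock times.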

31 31 Benchmark Statistics To evaluate these techniques on real applications, the authors measured the performance of the N-queen exhaustive search algorithm for N = 8 and N = 13. They compared these results to the results of running the same programs on a single-CPU SPARCstation 1+ (SS1+), which uses the same CPU that is used in the AP1000.

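32 32 The Scale of the N-queen Program
N = 8: Number of Solutions 92; Number of Objects Created 2,056; Number of Messages 4,104; Total Memory Used 130 KB; Elapsed Time on SS1+ 84 ms
N = 13: Number of Solutions 73,712; Number of Objects Created 4,636,210; Number of Messages 9,349,765; Total Memory Used 549,463 KB; Elapsed Time on SS1+ 461,955 ms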

33 33 Speedup of the N-queen Program

34 34 The Effect of Stack-based Scheduling To demonstrate the effect of stack-based scheduling, they compared the performance of the N-queen program using stack-based scheduling, to its performance using a naïve scheduling mechanism that always buffers a message in the message queue of the receiver object, and schedules the object through the scheduling queue. In these programs, approximately 75% of local messages are sent to dormant objects. In general, they observed a speedup of approximately 30%.

35 35 The Effect of Stack-based Scheduling (cont.)

36 36 Conclusions The authors proposed a software architecture for concurrent OOPLs on conventional multicomputers that can compete with implementations on special-purpose, fine-grain architectures. Their stack-based intra-node scheduling mechanism significantly reduces the average cost of intra-node method invocation. Their Active Message-like messages, and address prefetch scheme minimize the cost of inter-node message passing, and remote object creation.

37 37 Discussion The eternal question: How does this apply to sensor networks? Low instruction count for intra-node scheduling. Power-efficient remote object creation cuts down on communication.

38 38 Flaws Security problems related to active messages: a user can run any code they desire. Scalability of prefetching: with thousands of nodes, it results in a lot of communication between nodes, and memory becomes a scarce commodity.

