Using CoDeL to rapidly prototype network processor extensions. Nainesh Agarwal and Nikitas J. Dimopoulos, Department of Electrical and Computer Engineering.


1 Using CoDeL to rapidly prototype network processor extensions. Nainesh Agarwal and Nikitas J. Dimopoulos, Department of Electrical and Computer Engineering, University of Victoria.

2 SAMOS IV, July 19-24, 2004. General structure of a “massively” parallel system. [Figure: several nodes, each containing M, P P P …, an Interconnect, and a NIC, connected through a System Interconnect.]

3 Outline: The problem (latency). Prediction. Architectural enhancements. CoDeL and implementation.

4 Latency: faster communications. The slide shows two CG benchmark outputs, with the labels “Switch shared” and “Switch shared, user space MPI over LAPI”.
First output: CG Benchmark Completed. Class = W, Size = 7000, Iterations = 15, Time in seconds = 1.72, Total processes = 8, Compiled procs = 8, Mop/s total = 244.10, Mop/s/process = 30.51, Operation type = floating point, Verification = SUCCESSFUL, Version = 2.3, Compile date = 07 Mar 2001.
Second output: CG Benchmark Completed. Class = W, Size = 7000, Iterations = 15, Time in seconds = .99, Total processes = 8, Compiled procs = 8, Mop/s total = 426.95, Mop/s/process = 53.37, Operation type = floating point, Verification = SUCCESSFUL, Version = 2.3, Compile date = 07 Mar 2001.
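The gain between the two benchmark outputs can be checked with a few lines of arithmetic. The sketch below uses only the figures printed in the outputs; the variable names are illustrative and not part of the original slides.

```python
# Figures copied from the two CG benchmark outputs (Class W, 8 processes).
runs = {
    "first": {"time_s": 1.72, "mops_total": 244.10},
    "second": {"time_s": 0.99, "mops_total": 426.95},
}
processes = 8

# Speedup measured by wall-clock time and by aggregate throughput.
time_speedup = runs["first"]["time_s"] / runs["second"]["time_s"]
mops_speedup = runs["second"]["mops_total"] / runs["first"]["mops_total"]

# Per-process throughput, which should match the reported Mop/s/process.
per_process = {name: r["mops_total"] / processes for name, r in runs.items()}

print(f"time speedup: {time_speedup:.2f}x")
print(f"throughput speedup: {mops_speedup:.2f}x")
for name, value in per_process.items():
    print(f"{name} output: {value:.2f} Mop/s/process")
```

Both ratios come out at roughly 1.7, and the per-process values agree with the Mop/s/process lines in the outputs.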

5 Latency. Minimizing communication latency is crucial in achieving high performance. [Figure labels: Send Process, Send buffer, System buffer, NI, Network, System buffer, Receive buffer, Receive Process.]

6 Latency. Efficiency requires the message to be available to be consumed. [Timeline labels: Sender: Send call. Receiver: Receive call issued; Receive call executed (address resolution); Consumer idle; Copy to consumer space.]

7 Latency. [Timeline labels: Sender: Send call. Receiver thread: Receive call issued; Receive call executed (address resolution). Consumer thread: Cache miss.]

8 Latency. Even when the network delays are minimized (non-existent), receiver synchronization, message copying, and cache misses delay execution.

9 The solution. Ensure that the received message is in the consumer’s cache at the point the consumer needs to consume the message. [Figure labels: P, cache, M.]

10 The solution: enabling mechanisms. In an asynchronous environment where many messages arrive at a node, can we decide which message will be consumed next? How do we place the message to be consumed in the cache? [Figure labels: M, P, cache.]

11 The solution. Learn the pattern of message consumption and use it to decide which message will be consumed next. Develop a hardware environment that facilitates the placement of the message in the consumer’s cache.

12 Receive call predictors. History-based predictors predict subsequent receive calls at a given node in a message-passing application.

13 Locality. Message reception locality: if a certain message reception call has been used, it will be re-used with high probability by a portion of code that is “near” the place where it was used earlier, and it will also be re-used in the near future.

14 Messages vary in size from a few bytes to several kbytes.

15 Predictors. Heuristics that predict the subsequent receive calls based on the past history of communication patterns on a per-node basis: Tag Predictor, Single-cycle Predictor, Tag-cycle Predictor, Tag-better-cycle Predictor.
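The slides name the predictors but do not spell out their algorithms. As a rough illustration of what a history-based heuristic can look like, the sketch below records receive-call identifiers and, once the history repeats with some period, predicts that the cycle continues. The class `CyclePredictor`, its window size, and the identifier strings are all hypothetical; this is not a reconstruction of the Tag, Single-cycle, Tag-cycle, or Tag-better-cycle predictors.

```python
from typing import Hashable, List, Optional


class CyclePredictor:
    """Toy history-based predictor (hypothetical; not the slides' heuristic)."""

    def __init__(self, max_history: int = 128) -> None:
        # Window size is an arbitrary assumption for this sketch.
        self.max_history = max_history
        self.history: List[Hashable] = []

    def record(self, call_id: Hashable) -> None:
        """Append an observed receive call, keeping a bounded window."""
        self.history.append(call_id)
        if len(self.history) > self.max_history:
            self.history.pop(0)

    def predict(self) -> Optional[Hashable]:
        """Return the predicted next call, or None if no cycle is visible."""
        h = self.history
        n = len(h)
        # Find the shortest period p (with at least two full periods in
        # the window) such that every entry equals the one p steps back.
        for p in range(1, n // 2 + 1):
            if all(h[i] == h[i - p] for i in range(p, n)):
                return h[n - p]
        return None


predictor = CyclePredictor()
for call in ["src1", "src2", "src3", "src1", "src2", "src3"]:
    predictor.record(call)
print(predictor.predict())  # prints src1
```

A per-node instance of such a predictor would be fed every receive call issued at that node, matching the per-node basis described above.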

16 Single-cycle Predictor. N = 64 for CG, and 49 for others.

17 What next. Network processor extensions. Achieve zero-copy through re-mapping. Use the predictors to “optimize” size and performance.

18 Architecture. [Figure labels: M, P, Interconnect, NIC, Network cache, cache.]

19 Architectural Enhancements. A separate Network Cache “ties” the Network Memory Space and the Process Memory Space. [Figure: a Network Cache line with fields Network tag, Process tag, Message ID, and cache data line, shown in initial and final states between the Network Memory Space and the Process Memory Space.]

20 Definitions. Network Memory Space: the network buffers, where received messages live while waiting to be bound to the process address space. Process Memory Space: the process address space, where process objects, including bound messages, live.

21 Operation. The Network Cache includes three separate tags. The network tag is associated with the Network Memory Space. The process tag is associated with the Process Memory Space. The message ID tag holds the message ID. All three tags can be searched associatively.

22 Operation cont’d. On message arrival, the message is cached in the network cache. The network tag is set to the address of the buffer in Network Memory Space that is allocated to the message. The message ID tag is set to the message ID.

23 Operation cont’d. The message lives in the network cache and migrates to the Network Memory Space according to a cache replacement policy that replaces the message least likely to be consumed next. The receive-call prediction heuristics are used for this purpose.
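One way to picture a prediction-driven replacement policy is as a victim-selection function. The sketch below assumes a hypothetical predictor that supplies message IDs ordered from most to least imminent; the function name `choose_victim` and the treatment of unpredicted messages are assumptions for illustration, not details from the slides.

```python
from typing import Dict, Sequence


def choose_victim(cached_ids: Sequence[int], predicted_order: Sequence[int]) -> int:
    """Pick the cached message least likely to be consumed next.

    predicted_order lists message IDs from most to least imminent, as a
    hypothetical receive-call predictor might supply them.
    """
    if not cached_ids:
        raise ValueError("cache is empty")
    rank: Dict[int, int] = {}
    for pos, msg_id in enumerate(predicted_order):
        rank.setdefault(msg_id, pos)  # keep the earliest predicted position
    # Messages absent from the prediction are treated as furthest away;
    # max() keeps the first such message on ties.
    return max(cached_ids, key=lambda m: rank.get(m, len(predicted_order)))


# Message 11 is not predicted at all, so it is evicted first.
print(choose_victim([10, 11, 12], [12, 10]))  # prints 11
```

The evicted message would then migrate to its buffer in Network Memory Space, as described above.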

24 Late binding. A receive call invalidates the message ID and network tags and sets the process tag to point to the address of the object destined to receive the message in Process Memory Space. The buffer in Network Memory Space is released and can be garbage collected. From this point onward, the cache line is associated with the Process Memory Space. On cache replacement, the message is written back to its targeted object in Process Memory Space.
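The tag transitions on arrival and on a receive call can be sketched as a small software model. The names `NetworkCacheLine`, `on_arrival`, and `on_receive` are hypothetical, and the model treats a message as fitting in one line; it only illustrates that binding changes tags and leaves the payload in place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetworkCacheLine:
    data: bytes
    network_tag: Optional[int] = None   # buffer address in Network Memory Space
    message_id: Optional[int] = None    # ID of the cached message
    process_tag: Optional[int] = None   # object address in Process Memory Space


def on_arrival(payload: bytes, buffer_addr: int, message_id: int) -> NetworkCacheLine:
    """Message arrival: cache the payload, set network and message ID tags."""
    return NetworkCacheLine(data=payload, network_tag=buffer_addr, message_id=message_id)


def on_receive(line: NetworkCacheLine, object_addr: int) -> int:
    """Late binding: rebind the line to the process space without copying.

    Returns the network buffer address that can now be released.
    """
    if line.network_tag is None:
        raise ValueError("line is already bound to the process space")
    released = line.network_tag
    line.network_tag = None      # invalidate the network tag
    line.message_id = None       # invalidate the message ID tag
    line.process_tag = object_addr
    return released


line = on_arrival(b"hello", buffer_addr=0x1000, message_id=7)
released = on_receive(line, object_addr=0x8000)
```

After `on_receive`, the payload bytes are untouched and only the tags differ, which is the sense in which the binding is zero-copy.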

25 Large Messages. Large messages are not dealt with in this work (TLB techniques would accomplish message re-binding).

26 ISA extensions. network_load and network_store: identical to the standard load and store instructions, except that they cause the network cache to be searched according to the network tag. No other cache is searched.

27 ISA extensions cont’d. Regular load and store instructions target both the normal data cache and the network cache; the network cache is searched according to the process tag.

28 ISA extensions cont’d. remap message_id, new_process_tag: remaps the cache line identified by message_id to new_process_tag. Both message_id and new_process_tag are held in registers.
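The search behavior of the three instructions can be modeled in software as follows. This is a toy model under stated assumptions: `NetworkCacheModel` and its `fill` method are hypothetical names, each line holds a single integer, a miss returns `None`, and `remap` is assumed to invalidate the network and message ID tags in the same way the receive call is described as doing.

```python
from typing import Dict, List, Optional

Line = Dict[str, Optional[int]]


class NetworkCacheModel:
    """Toy software model of the tag searches (hypothetical, one word per line)."""

    def __init__(self) -> None:
        # Each line has keys: network_tag, process_tag, message_id, data.
        self.lines: List[Line] = []
        self.data_cache: Dict[int, int] = {}  # stand-in for the normal data cache

    def fill(self, network_tag: int, message_id: int, data: int) -> None:
        """Model message arrival: network and message ID tags are valid."""
        self.lines.append({"network_tag": network_tag, "process_tag": None,
                           "message_id": message_id, "data": data})

    def _find(self, tag_name: str, value: int) -> Optional[Line]:
        return next((ln for ln in self.lines if ln[tag_name] == value), None)

    def network_load(self, addr: int) -> Optional[int]:
        """Search only the network cache, by network tag."""
        line = self._find("network_tag", addr)
        return None if line is None else line["data"]

    def load(self, addr: int) -> Optional[int]:
        """Search the data cache and the network cache (by process tag)."""
        if addr in self.data_cache:
            return self.data_cache[addr]
        line = self._find("process_tag", addr)
        return None if line is None else line["data"]

    def remap(self, message_id: int, new_process_tag: int) -> bool:
        """Rebind the line holding message_id to new_process_tag."""
        line = self._find("message_id", message_id)
        if line is None:
            return False
        line.update(network_tag=None, message_id=None, process_tag=new_process_tag)
        return True


cache = NetworkCacheModel()
cache.fill(network_tag=0x1000, message_id=7, data=42)
before = (cache.network_load(0x1000), cache.load(0x8000))
cache.remap(7, 0x8000)
after = (cache.network_load(0x1000), cache.load(0x8000))
```

Before the remap the payload is visible only through `network_load`; afterwards it is visible only through the regular `load` at the new process address.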

29 Implementation

30 Implementation--cont’d. The network cache is implemented as m-way associative, with three sections: Process section, MessageID section, and Network Cache section.

31 Implementation--cont’d. The network cache section holds the message payload. The messageID and process sections hold pointers to payloads in the network cache section. The associativity of the messageID and process sections is larger than that of the network cache section to avoid unnecessary cache misses.

32 Implementation--overall

33 CoDeL. CoDeL (Controller Description Language) targets specification and design at the behavioral level. CoDeL is a procedural language in which the order of the statements implicitly represents the sequence of activities. It extracts the data and control flow from the program automatically, assigns the necessary hardware blocks, and exploits inherent parallelism.

34 CoDeL. It is similar to the C programming language and is therefore easy to learn. It includes a library of I/O protocols that simplify (sub)system interaction. The CoDeL compiler produces synthesizable VHDL code, which can be targeted to any technology including PLD, FPGA, or ASIC.

35 CoDeL--Ports and Protocols. CoDeL abstracts module interaction through ports and protocols. Protocols define the sequence of events necessary to transfer information from one module to another.

36 CoDeL--Example

    # Define a 16-bit address
    # in 4 dimensions
    bitstruct mixed_radix_4 {
        (bits) field1[4];
        (bits) field2[4];
        (bits) field3[4];
        (bits) field4[4];
    }

    # Define a 36-bit
    # message header using
    # the above
    bitstruct data_frame {
        (mixed_radix_4) source_address;
        (mixed_radix_4) destn_address;
        (bits) header[4];
    }

    in (data_frame) p1 with input_handshake;
    out (data_frame) p3 with output_handshake;
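The bit widths in the CoDeL example add up as 16 + 16 + 4 = 36. The Python sketch below packs such a frame into a single integer to make the layout concrete. The function names and the bit ordering (first field most significant) are assumptions for illustration; the slides do not state how CoDeL orders the fields of a bitstruct.

```python
from typing import Sequence

FIELD_BITS = 4


def pack_fields(fields: Sequence[int]) -> int:
    """Pack 4-bit fields into one integer, first field most significant."""
    value = 0
    for f in fields:
        if not 0 <= f < (1 << FIELD_BITS):
            raise ValueError(f"field {f} does not fit in {FIELD_BITS} bits")
        value = (value << FIELD_BITS) | f
    return value


def pack_data_frame(source: Sequence[int], destn: Sequence[int], header: int) -> int:
    """Pack source (4 fields), destination (4 fields), and a 4-bit header."""
    if len(source) != 4 or len(destn) != 4:
        raise ValueError("addresses need exactly four 4-bit fields")
    return pack_fields([*source, *destn, header])


# Two 4-field addresses plus one 4-bit header field.
FRAME_BITS = (4 + 4 + 1) * FIELD_BITS  # 16 + 16 + 4 = 36

frame = pack_data_frame([1, 2, 3, 4], [5, 6, 7, 8], 0xF)
print(hex(frame))  # prints 0x12345678f
```

Each hexadecimal digit of the result corresponds to one 4-bit field, so the nine digits account for all 36 bits of the header.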

37 CoDeL--Example Protocol. Example of a handshake protocol.

38 Network Processor Extension Implementation. The register file modules were implemented in VHDL; each required about 60 lines of VHDL code. Each cache line is 32 bytes. The network controller module, written in CoDeL, required about 697 lines of code and generated close to 4011 lines of VHDL code. Under simulation, the network load instruction requires 15 clock cycles, the network store takes 29 cycles, the remap takes 29 cycles, and the load requires 21 cycles.

39 Synthesis. This design has not been synthesized (Xilinx synthesis has failed). We have been able to synthesize other designs (including the 5/3 Le Gall integer-to-integer wavelet).

40 Conclusions. A network processor extension has been proposed and designed using CoDeL. Using CoDeL has allowed the rapid prototyping of the design. CoDeL needs to be extended to enhance parallelism; compiler directives (similar to the technique used in OpenMP) could be used. State collapsing and data forwarding would allow faster design.

41 What next. SMP nodes: a cache-coherent based organization will migrate and bind received messages to the consuming processor. Refine the ISA: Is any more functionality needed? Is the TLB-based re-mapping of very large messages necessary? Can we live with one-sided communications? Performance evaluation!

