Using CoDeL to rapidly prototype network processor extensions Nainesh Agarwal and Nikitas J. Dimopoulos Department of Electrical and Computer Engineering.

Presentation transcript:

Using CoDeL to rapidly prototype network processor extensions
Nainesh Agarwal and Nikitas J. Dimopoulos
Department of Electrical and Computer Engineering, University of Victoria

SAMOS IV, July 19-24,
General structure of a “massively” parallel system
[Figure: several nodes, each with M, P P P … and a NIC on a local Interconnect, joined by a System Interconnect]

Outline
The problem (latency)
Prediction
Architectural enhancements
CoDeL and Implementation

Latency
CG Benchmark Completed.
  Class = W
  Size = 7000
  Iterations = 15
  Time in seconds = 1.72
  Total processes = 8
  Compiled procs = 8
  Mop/s total =
  Mop/s/process =
  Operation type = floating point
  Verification = SUCCESSFUL
  Version = 2.3
  Compile date = 07 Mar 2001
CG Benchmark Completed.
  Class = W
  Size = 7000
  Iterations = 15
  Time in seconds = .99
  Total processes = 8
  Compiled procs = 8
  Mop/s total =
  Mop/s/process =
  Operation type = floating point
  Verification = SUCCESSFUL
  Version = 2.3
  Compile date = 07 Mar 2001
[Slide labels: Switch shared; Switch shared, user space; MPI over LAPI; Faster communications]

Latency
Minimizing communication latency is crucial in achieving high performance.
[Figure: message path with labels Network, Send Process, Receive Process, Send buffer, Receive buffer, System buffer (twice), NI]

Latency
Efficiency requires the message to be available to be consumed
[Figure: Sender/Receiver timeline with labels Send call, Receive call issued, Receive call executed (address resolution), Consumer idle, Copy to consumer space]

Latency
[Figure: Sender/Receiver thread timeline with labels Send call, Receive call issued, Receive call executed (address resolution), Consumer thread, Cache miss]

Latency
Even when the network delays are minimized (non-existent), receiver synchronization, message copying, and cache misses delay execution.

The solution
Ensure that the received message is in the consumer’s cache at the point the consumer needs to consume the message.
[Figure: P, cache, M]

The solution
Enabling mechanisms
In an asynchronous environment where many messages arrive at a node, can we decide which is the message to be consumed next?
How do we place the message to be consumed in the cache?
[Figure: M, P, cache]

The solution
Learn the pattern of message consumption and use this to decide which is the message to be consumed next.
Develop a hardware environment that will facilitate the placement of the message in the consumer’s cache

Receive call predictors
History-based predictors predict subsequent receive calls at a given node in a message-passing application.

Locality
Message reception locality
If a certain message reception call has been used, it will be re-used with high probability by a portion of code that is “near” the place where it was used earlier, and it will also be re-used in the near future

Messages vary in size from a few bytes to several kbytes

Predictors
Heuristics that predict the subsequent receive calls based on the past history of communication patterns on a per node basis.
Tag Predictor
Single-cycle Predictor
Tag-cycle Predictor
Tag-better-cycle Predictor

Single-cycle Predictor
N = 64 for CG, and 49 for others
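The slides name the predictors but do not spell out their algorithms. Purely to illustrate what "history-based" prediction of receive calls can look like, here is a minimal Python sketch. It is not the Tag or Single-cycle heuristic from the talk: the class name `LastSuccessorPredictor`, the window size, and the rule "predict whatever followed this call last time" are all assumptions made for the example.

```python
from collections import deque


class LastSuccessorPredictor:
    """Hypothetical history-based predictor (not the authors' heuristic).

    It predicts that the receive call which followed the current call the
    last time it was seen will follow it again.
    """

    def __init__(self, history_size=16):
        # Bounded window of observed receive-call identifiers.
        # The window size is an arbitrary assumption.
        self.history = deque(maxlen=history_size)

    def observe(self, call_id):
        """Record a receive call actually issued at this node."""
        self.history.append(call_id)

    def predict_next(self):
        """Return the predicted next call id, or None if there is no basis."""
        if not self.history:
            return None
        items = list(self.history)
        current = items[-1]
        # Walk backwards, skipping the newest entry, to find the previous
        # occurrence of the current call and return what came after it.
        for i in range(len(items) - 2, -1, -1):
            if items[i] == current:
                return items[i + 1]
        return None
```

With a repeating pattern of receive calls such as 1, 2, 3, 1, 2, 3, 1 the sketch predicts that call 2 comes next, which is the kind of locality the earlier slide describes.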

What next
Network Processor Extensions
Achieve zero-copy through re-mapping
Use the predictors to “optimize” size and performance.

Architecture
[Figure: M, P, Interconnect, NIC, Network cache, cache]

Architectural Enhancements
Separate Network Cache “ties” the Network Memory Space and the Process Memory Space
[Figure labels: Network Memory Space, Process Memory Space, Network tag, Process tag, cache data line, Message ID, Network Cache, initial, final]

Definitions
Network Memory Space:
  Network buffers
  Where received messages live while waiting to be bound to the process address space.
Process Memory Space:
  Process address space
  Where process objects, including bound messages, live

Operation
The Network Cache includes three separate tags.
The network tag is associated with the Network Memory Space.
The process tag is associated with the Process Memory Space.
The message ID tag holds the message ID.
All three tags can be searched associatively.

Operation cont’d
On message arrival, the message is cached in the network cache.
The network tag is set to the address of the buffer in Network Memory Space that is allocated to the message
The message ID tag is set to the message ID.

Operation cont’d
The message lives in the network cache and migrates to the Network Memory Space according to a cache replacement policy that replaces the message least likely to be consumed next.
The receive-call prediction heuristics are used for this purpose.
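The replacement rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' hardware: `choose_victim`, `NetworkCacheModel`, and the idea that the predictor hands over a most-likely-first ranking of message IDs (`predicted_order`) are all hypothetical names and interfaces.

```python
def choose_victim(cached_ids, predicted_order):
    """Pick the cached message id least likely to be consumed next.

    predicted_order ranks message ids most-likely-first (assumed predictor
    output). Ids missing from the ranking are treated as least likely.
    """
    rank = {mid: i for i, mid in enumerate(predicted_order)}
    return max(cached_ids, key=lambda mid: rank.get(mid, len(predicted_order)))


class NetworkCacheModel:
    """Toy model: a bounded network cache backed by network memory space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}           # message id -> payload held in the network cache
        self.network_memory = {}  # message id -> payload migrated out of the cache

    def on_arrival(self, message_id, payload, predicted_order):
        # When full, migrate the least-likely-to-be-consumed message
        # to network memory space before caching the new arrival.
        if len(self.lines) >= self.capacity:
            victim = choose_victim(list(self.lines), predicted_order)
            self.network_memory[victim] = self.lines.pop(victim)
        self.lines[message_id] = payload
```

For example, with a capacity of two lines holding messages 1 and 2, and a prediction that message 2 will be consumed before message 1, the arrival of message 3 migrates message 1 to network memory.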

Late binding
A receive call invalidates the message ID and network tags and will set the process tag to point to the address of the object destined to receive the message in Process Memory Space.
The buffer in Network Memory Space is released and can be garbage collected.
From this point onward, the cache line is associated with the Process Memory Space. On cache replacement, the message is written back to its targeted object in Process Memory Space
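The tag transitions described in the operation and late-binding slides can be traced with a small Python model. It is a sketch only: the names `NetworkCacheLine`, `on_message_arrival`, `receive` and `write_back`, and the use of dictionaries for the two memory spaces, are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetworkCacheLine:
    data: bytes
    network_tag: Optional[int] = None  # buffer address in network memory space
    message_id: Optional[int] = None   # id of the cached message
    process_tag: Optional[int] = None  # object address in process memory space


def on_message_arrival(buffer_addr, message_id, payload):
    """Arrival: cache the message, set the network tag and message ID tag."""
    return NetworkCacheLine(data=payload, network_tag=buffer_addr,
                            message_id=message_id)


def receive(line, object_addr):
    """Late binding: invalidate network and message ID tags, set process tag.

    Returns the released network buffer address (now free to be collected).
    """
    released = line.network_tag
    line.network_tag = None
    line.message_id = None
    line.process_tag = object_addr
    return released


def write_back(line, network_memory, process_memory):
    """On replacement, write the line back to whichever space it is bound to."""
    if line.process_tag is not None:
        process_memory[line.process_tag] = line.data
    else:
        network_memory[line.network_tag] = line.data
```

The point of the model is that the payload itself never moves during `receive`; only the tags change, which is how the design avoids the copy into consumer space.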

Large Messages
Are not dealt with in this work (TLB techniques would accomplish message re-binding)

ISA extensions
network_load
network_store
Identical to standard load and store instructions with the exception that they cause the network cache to be searched according to the network tag. No other cache is searched.

ISA extensions cont’d
Regular load and store instructions target both the normal data cache and the network cache; the network cache is searched according to the process tag.
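The lookup rules for the two kinds of load can be summarized in a short Python sketch. Everything here is a simplification for illustration: `LookupModel` is a hypothetical name, misses simply return `None` because the miss path is not described in the slides, and searching the data cache before the network cache is an assumption (hardware would presumably probe both together).

```python
class LookupModel:
    """Toy model of which cache each load searches, and by which tag."""

    def __init__(self):
        self.data_cache = {}     # process address -> value (normal data cache)
        self.network_cache = []  # lines: dicts with network_tag, process_tag, data

    def network_load(self, network_addr):
        # Searches only the network cache, matching on the network tag.
        for line in self.network_cache:
            if line["network_tag"] == network_addr:
                return line["data"]
        return None  # miss handling is not modelled

    def load(self, process_addr):
        # A regular load targets the normal data cache and the network cache;
        # the network cache is matched on the process tag.
        if process_addr in self.data_cache:
            return self.data_cache[process_addr]
        for line in self.network_cache:
            if line["process_tag"] == process_addr:
                return line["data"]
        return None  # miss handling is not modelled
```

A message that has only a network tag is therefore visible to `network_load` but not to a regular `load`; once its process tag is set, regular loads to the bound address hit in the network cache.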

ISA extensions cont’d
remap message_id, new_process_tag
Remaps the cache line identified by the message_id to the new_process_tag. The message_id and new_process_tag are in registers.

Implementation

Implementation -- cont’d
Network cache is implemented as m-way associative
Three sections
  Process section
  MessageID section
  Network Cache section

Implementation -- cont’d
The network cache section holds the message payload
The messageID and process sections hold pointers that point to payloads in the network cache section
The associativity of the messageID and process sections is larger than that of the network cache section to avoid unnecessary cache misses.

Implementation--overall

CoDeL
CoDeL (Controller Description Language) targets specification and design at the behavioral level. CoDeL is a procedural language in which the order of the statements implicitly represents the sequence of activities. The compiler extracts the data and control flow from the program automatically, assigns the necessary hardware blocks and exploits inherent parallelism.

CoDeL
CoDeL is similar to the C programming language and is therefore easy to learn. It includes a library of I/O protocols that simplify (sub)system interaction. The CoDeL compiler produces synthesizable VHDL code which can be targeted to any technology including PLD, FPGA or ASIC.

CoDeL--Ports and Protocols
CoDeL abstracts module interaction through ports and protocols.
Protocols define the sequence of events necessary to transfer information from one module to another

CoDeL--Example
    # Define a 16-bit address
    # in 4 dimensions
    bitstruct mixed_radix_4 {
        (bits) field1[4];
        (bits) field2[4];
        (bits) field3[4];
        (bits) field4[4];
    }

    # Define a 36-bit
    # message header using
    # the above
    bitstruct data_frame {
        (mixed_radix_4) source_address;
        (mixed_radix_4) destn_address;
        (bits) header[4];
    }

    in (data_frame) p1 with input_handshake;
    out (data_frame) p3 with output_handshake;
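To check the bit widths in the CoDeL example (four 4-bit fields give a 16-bit address; two addresses plus a 4-bit header give 36 bits), the following Python sketch packs the same fields into an integer. The helper names and the bit ordering (first-declared field in the most significant bits) are assumptions for illustration; the slides do not say how CoDeL lays out a bitstruct.

```python
def pack_fields(values, widths):
    """Concatenate unsigned fields into one integer, first field most significant.

    The field order is an assumed layout, not taken from the CoDeL compiler.
    """
    word = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def pack_mixed_radix_4(digits):
    """Pack the four 4-bit fields of mixed_radix_4 into a 16-bit address."""
    if len(digits) != 4:
        raise ValueError("mixed_radix_4 has exactly four fields")
    return pack_fields(digits, [4, 4, 4, 4])


def pack_data_frame(source_digits, destn_digits, header):
    """Pack source address, destination address and header into 36 bits."""
    return pack_fields(
        [pack_mixed_radix_4(source_digits),
         pack_mixed_radix_4(destn_digits),
         header],
        [16, 16, 4],
    )
```

Under this layout, a source address of digits 1, 2, 3, 4, a destination of 5, 6, 7, 8 and a header of 0xA pack to 0x12345678A, which fits in 36 bits.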

CoDeL--Example Protocol
Example of a handshake protocol

Network Processor Extension Implementation
The register file modules were implemented in VHDL. Each of these required about 60 lines of VHDL code. Each cache line is 32 bytes.
The network controller module, written in CoDeL, required about 697 lines of code, and generated close to 4011 lines of VHDL code.
Under simulation we see that the network load instruction requires 15 clock cycles, the network store takes 29 cycles, the remap takes 29 cycles, while the load requires 21 cycles.

Synthesis
This design has not been synthesized (Xilinx synthesis has failed)
We have been able to synthesize other designs (including the 5/3 Le Gall integer-to-integer wavelet)

Conclusions
A network processor extension has been proposed and designed using CoDeL.
Using CoDeL has allowed the rapid prototyping of the design.
CoDeL needs to be extended to enhance parallelism.
Compiler directives (similar to the technique used in OpenMP) could be used.
State collapsing and data forwarding would allow faster design.

What next
SMP nodes
A cache-coherence-based organization will migrate and bind received messages to the consuming processor
Refine the ISA.
Is there any more functionality needed?
Is the TLB-based re-mapping of very large messages necessary?
Can we live with one-sided communications?
Performance evaluation!!