Presentation on theme: "Vector IRAM Overview: A processor architecture for embedded/portable systems running media applications" — Presentation transcript:

1 Vector IRAM Overview
A processor architecture for embedded/portable systems running media applications
Based on vector processing and embedded DRAM: simple, scalable, and efficient; a good compiler target
Microprocessor prototype with 256-bit vector processor, 16 MBytes DRAM
150 million transistors, 290 mm2
3.2 Gops, 2 W at 200 MHz
Industrial-strength vectorizing compiler
Implemented by 6 graduate students

I will start with what is interesting about Vector IRAM. This is a prototype microprocessor that integrates a vector unit with 256-bit datapaths and a 16-MByte embedded DRAM memory system. The design uses 150 million transistors and occupies nearly 300 square mm. While operating at just 200 MHz, Vector IRAM achieves 3.2 Gops and consumes 2 Watts. Vector IRAM also comes with an industrial-strength vectorizing compiler for software development. It is being implemented by a group of only 6 graduate students, responsible for architecture, design, simulation, and testing. So, if Patterson and Hennessy decide to introduce performance/Watt/man-year as a major processor metric in the next version of their book, this processor will likely be one of the best in its class.

2 The IRAM Team
Hardware: Joe Gebis, Christoforos Kozyrakis, Ioannis Mavroidis, Iakovos Mavroidis, Steve Pope, Sam Williams
Software: Alan Janin, David Judd, David Martin, Randi Thomas
Advisors: David Patterson, Katherine Yelick
Help from: IBM Microelectronics, MIPS Technologies, Cray

3 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for media applications

4 PostPC Processor Applications
Multimedia processing: image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption (narrow data types, streaming data, real-time response)
Embedded and portable systems: notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, set-top boxes (limited chip count, limited power/energy budget)
Significantly different environment from that of workstations and servers

5 Motivation and Goals
Processor features for PostPC systems:
High performance on demand for multimedia without continuous high power consumption
Tolerance to memory latency
Scalable
Mature, HLL-based software model
Design a prototype processor chip:
Complete proof of concept
Explore detailed architecture and design issues
Motivation for software development

The goal of this project has been to explore processor architecture for systems other than personal computers and servers. A large percentage of electronic devices in the near future will be embedded or portable and will run multimedia applications. So what are the requirements for microprocessors in such devices? The first is high performance for multimedia applications at low power consumption; you want to spend the power only when the performance is needed. Next, we want to tolerate the ever-growing gap between processor and memory technology. The processor architecture must be scalable: ideally, you want to generate multiple high-end or low-end chips from a single design database, and you want to move to the next-generation process technology and deliver additional performance without a significant increase in complexity. Finally, for any processor to become widely used, its software development model must rely on high-level languages and compiler technology. We decided to research these issues by implementing a prototype microprocessor chip. Apart from being the ultimate proof of concept, designing a working prototype allowed us to explore detailed architectural and design issues and learn many valuable lessons. A working prototype also provides the motivation and a fast platform for compiler and software development. And, after all, designing hardware can be a lot of fun…

6 Key Technologies
Vector processing:
High performance on demand for media processing
Low power for issue and control logic
Low design complexity
Well-understood compiler technology
Embedded DRAM:
High bandwidth for vector processing
Low power/energy for memory accesses
"System on a chip"

The architecture of Vector IRAM relies on two key technologies: vector processing and embedded DRAM. Vector processing provides the required performance for multimedia applications. With vector architectures, data-level parallelism is described explicitly by every instruction rather than discovered dynamically by hardware, which leads to designs with low power consumption and low complexity. Vector architectures also come with mature compiler technology, in use for over twenty years in vector supercomputers. Embedded DRAM provides the high memory bandwidth a vector processor needs for high performance. By eliminating off-chip memory references and redundant cache lookups, it also reduces the power consumption of the memory system. Finally, integrating a significant amount of memory on the same die as the processor is a step towards a "system on a chip", something highly desirable for embedded and portable applications.

7 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for multimedia applications

8 Vector Instruction Set
Complete load-store vector instruction set
Uses the MIPS64™ ISA coprocessor 2 opcode space
Architecture state: 32 general-purpose vector registers, 32 vector flag registers
Data types supported in vectors: 64b, 32b, 16b (and 8b)
91 arithmetic and memory instructions
Not specified by the ISA: maximum vector register length, functional unit datapath width

Before I describe the microarchitecture of our chip, I will take a few minutes to introduce the instruction set. We have defined a complete load-store vector ISA within the coprocessor 2 opcode space of MIPS64. A vector can contain integer elements that are 64, 32, 16, or 8 bits wide, or floating-point elements of single or double precision. In addition to the architecture state of the MIPS ISA, our architecture introduces a vector register file with 32 general-purpose vector registers. There are also 32 vector flag registers, with a single mask bit per element, used for predicated execution and exception handling, as well as some registers for scalar operands. The vector instruction set defines 91 unique instructions. These include vector arithmetic operations for integer and floating-point data, logical operations, and vector-processing instructions that perform pack and unpack operations on vector registers. Three types of vector load and store instructions are supported: sequential, strided, and indexed. Our instruction set does not define the maximum vector length, which is the maximum number of elements a vector register can hold and hence the maximum number of operations a vector instruction can specify. Nor does it define the width of the functional unit datapaths, which determines the number of element operations executed in parallel every cycle. An implementation can select the maximum vector length and datapath width that match its silicon budget and performance requirements, and all implementations remain binary compatible.
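Binary compatibility across different maximum vector lengths comes from strip-mining: software asks the hardware how many elements to process per iteration instead of hard-coding the number. Below is a minimal C sketch of the idea; the setvl() function is a hypothetical stand-in for the instruction that clamps the requested length to the hardware maximum, not VIRAM's actual API.

    /* Hypothetical: returns min(n, maximum vector length). */
    extern long setvl(long n);

    /* Strip-mined SAXPY: each trip through the while loop maps to a
     * handful of vector instructions (loads, multiply-add, store) of
     * length vl, so the same binary runs on any implementation,
     * whatever its maximum vector length. */
    void saxpy(long n, float a, const float *x, float *y)
    {
        while (n > 0) {
            long vl = setvl(n);           /* hardware picks the length */
            for (long i = 0; i < vl; i++) /* one vectorized batch      */
                y[i] += a * x[i];
            x += vl;
            y += vl;
            n -= vl;
        }
    }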

9 Vector Architecture State
[Diagram: the architecture state, organized as $vlr virtual processors (VP0 … VP$vlr-1): 32 general-purpose vector registers vr0-vr31 with $vpw bits per element, 32 flag registers vf0-vf31 with 1 bit per element, and 16 scalar registers vs0-vs15 of 64b each.]

10 Vector IRAM ISA Summary
Scalar: MIPS64 scalar instruction set
Vector ALU: operations on 8b/16b/32b/64b data; signed/unsigned integer, single/double-precision FP; .v, .vv, .vs, .sv instruction forms
Vector memory: unit-stride, constant-stride, and indexed loads and stores; signed/unsigned integer
ALU operations: integer, floating-point, convert, logical, vector processing, flag processing
91 instructions, 660 opcodes

11 Support for DSP
[Diagram: fixed-point multiply-add datapath; an n/2-bit by n/2-bit multiply x*y produces an n-bit product that is rounded, accumulated with w, and saturated to produce the result a.]
Support for fixed-point numbers, saturation, rounding modes
Simple instructions for intra-register permutations for reductions and butterfly operations
High performance for dot-products and FFT without the complexity of a random permutation

To enable efficient vectorization of multimedia applications, a number of enhancements were made to traditional vector instruction sets. Specifically, we added instructions that handle fixed-point formats. They implement operations like multiply and multiply-add, and features like saturated arithmetic and special fixed-point rounding modes. We also introduced a set of instructions that perform simple element permutations within a vector register. These instructions lead to high performance for kernels like dot-products and FFTs, without the complexity and scaling problems of full, random permutations.
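As an illustration of what one such fixed-point operation does per element, here is a scalar C sketch of a 16-bit saturating multiply-add with round-to-nearest. The function name and the exact rounding behavior are assumptions for illustration, not VIRAM's defined semantics.

    #include <stdint.h>

    /* Clamp a 32b intermediate into the 16b range instead of wrapping. */
    static int16_t sat16(int32_t v)
    {
        if (v >  32767) return  32767;
        if (v < -32768) return -32768;
        return (int16_t)v;
    }

    /* a = sat(w + round(x*y >> shift)); assumes 1 <= shift <= 30. */
    int16_t fixmadd16(int16_t w, int16_t x, int16_t y, int shift)
    {
        int32_t prod = (int32_t)x * (int32_t)y;       /* full 32b product */
        prod = (prod + (1 << (shift - 1))) >> shift;  /* round to nearest */
        return sat16((int32_t)w + prod);              /* saturating add   */
    }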

12 Compiler/OS Enhancements
Compiler support:
Conditional execution of vector instructions, using the vector flag registers
Support for software speculation of load operations
Operating system support:
MMU-based virtual memory
Restartable arithmetic exceptions
Valid and dirty bits for vector registers
Tracking of maximum vector length used

We also added a number of features to assist compiler and operating system tasks. We reduce the frequency of branch instructions by supporting conditional execution using the flag registers; virtually every vector instruction executes under mask control. Operating system support includes MMU-based virtual memory.
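A common case that conditional execution handles is a vectorizable loop with a data-dependent if, as in the C sketch below. Under predication, the comparison would fill a flag register and the store would execute under mask, so no per-element branch is needed; the loop itself is ordinary C, not a VIRAM intrinsic.

    #include <stdint.h>

    /* Clip coefficients above a threshold; a vectorizing compiler can
     * turn the "if" into a vector compare that sets flag bits and a
     * store executed under that mask. */
    void clip(int n, int16_t *coef, int16_t thresh)
    {
        for (int i = 0; i < n; i++)
            if (coef[i] > thresh)
                coef[i] = thresh;
    }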

13 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for multimedia applications

14 VIRAM Prototype Architecture
[Block diagram: MIPS64™ 5Kc core with 8KB instruction cache, 8KB data cache, FPU, TLB, and 64b SysAD interface; a coprocessor interface (CP IF) to the vector unit, which contains an 8KB vector register file, a 512B flag register file, two arithmetic units, two flag units, and a memory unit with 256b paths; a 256b memory crossbar with DMA and JTAG interfaces connecting to eight 2MB DRAM macros (DRAM0-DRAM7).]

We now move to the microarchitecture of our prototype. This slide presents the block diagram of Vector IRAM, which consists of three basic blocks: the scalar core, the vector unit, and the memory system. The slides in your handout include all the details of the system; I will just highlight some of the most interesting components. The scalar core of VIRAM is the MIPS 5Kc, a single-issue, 64-bit processor with a six-stage pipeline. It includes first-level instruction and data caches and a coprocessor interface to which a single-precision FPU has been attached. The scalar core operates at 200 MHz, and its design has been provided by MIPS Technologies in the form of a synthesizable core. The vector unit is also connected to the coprocessor interface of the MIPS processor and runs at 200 MHz. It includes a multiported 8-KByte register file, which allows each vector register to hold 32 64b elements, or 64 32b elements, and so on. There are two functional units for arithmetic operations; both can execute integer and logical operations, but only one can execute floating-point. There are also 2 flag processing units, which provide support for predicated execution and exception handling. Each of the functional units has a 256-bit pipelined datapath: on each cycle, 4 64b operations, 8 32b operations, or 16 16b operations can execute in parallel. To simplify the design and reduce area requirements, our prototype does not implement 8b integer operations or double-precision arithmetic. All operations excluding divides are fully pipelined. The vector coprocessor also includes one memory (load/store) unit. The LSU can exchange up to 256b per cycle with the memory system and has four address generators for strided and indexed accesses. Address translation is performed in a two-level TLB structure. The memory unit is pipelined, and up to 64 independent accesses may be pending at any time. Embedded DRAM is used in Vector IRAM as the main memory for both the scalar and vector units; there is no SRAM cache for the vector unit, which accesses DRAM directly. The memory system consists of 8 DRAM macros. Each has a capacity of 2 MBytes and a 256-bit synchronous interface. Random and page access times for each macro are 25 and 7.5 nsec respectively. The embedded DRAM macros are designed and provided by IBM Microelectronics. The DRAM is connected to the scalar and vector units with a crossbar interconnect, which has a peak bandwidth of 12.8 GBytes/s per direction, load or store, and can transmit up to 5 independent addresses per cycle.

15 Vector Unit Pipeline
Single-issue, in-order pipeline
Efficient for short vectors: pipelined instruction start-up
Full support for instruction chaining, the vector equivalent of result forwarding
Hides long DRAM access latency
Random access latency could lead to stalls due to long load→use RAW hazards
Simple solution: "delayed" vector pipeline

The execution pipeline in the vector unit is simple: single-issue and in-order. Keep in mind that each issued vector instruction specifies a large number of independent operations that can keep a functional unit busy for up to eight cycles. To make the pipeline efficient for applications with short vectors, the start-up of vector instructions is pipelined. We also provide full support for instruction chaining, the vector equivalent of result forwarding or bypassing. The pipeline must also be able to hide the long latency of on-chip DRAM. While embedded DRAM is faster than external memory, it is still slower than SRAM, and several processor cycles are necessary to load a value, especially if a random access is required. In this case, performance can be significantly reduced by stalls due to load→use RAW hazards. To address this issue we use a simple and power-efficient structure called the "delayed" vector pipeline. This is not the typical pipeline organization of vector supercomputers.

16 Delayed Vector Pipeline
[Pipeline diagram: scalar stages F D R E M W; a vld flows through address generation (A), translation (T), and the DRAM access (latency >25ns) before writeback (VW); the following vadd passes through DELAY stages before VR VX VW, and the vst through A T and delayed VR, so the load→add RAW hazard is eliminated.]

This slide shows how the delayed pipeline works. When a vector load is issued by the scalar core to the vector unit, it may take several cycles for data to start arriving from memory. If a following arithmetic instruction that uses that data were executed immediately by the vector unit, we would have a long RAW hazard, and therefore a large number of stall cycles and a performance decrease. To overcome this, we include the latency of a random DRAM access in the pipeline for all operations. In other words, the execution of arithmetic and store operations is delayed by a number of cycles so that the RAW hazard is eliminated or significantly shortened. This eliminates long pipeline stalls for the common loop cases. The resulting pipeline is, of course, deep: in our case it has 15 stages.

Random access latency included in the vector unit pipeline
Arithmetic operations and stores are delayed to shorten RAW hazards
Long hazards eliminated for the common loop cases
Vector pipeline length: 15 stages

17 Handling Memory Conflicts
A single sub-bank DRAM macro can lead to memory conflicts for non-sequential access patterns
Solution 1: address interleaving; selects between 3 address interleaving modes for each virtual page
Solution 2: address decoupling buffer (128 slots); allows scheduling of long indexed accesses without stalling the arithmetic operations executing in parallel

Another potential performance bottleneck is memory conflicts. While sequential accesses, or accesses with a small stride, can be served with a single page access from DRAM, large-strided or indexed accesses require accessing multiple DRAM rows. If those rows are in the same bank, we have DRAM bank conflicts. If the DRAM macro had a multi-bank structure, accesses to different rows could be overlapped and this would not be a significant performance problem. Unfortunately, the macro available to us has a single-bank structure, and accesses to different rows have to be fully serialized. To handle this problem we employ two solutions. First, we use 3 interleaving schemes for mapping addresses to memory locations: the first is optimized for the performance and power consumption of unit-stride accesses, the second for small strides, and the third for large strides. The interleaving scheme can be selected independently for each virtual page. The second solution is a buffer for address decoupling, which allows the hardware to buffer the addresses of one vector memory instruction. These accesses are served by the memory system as soon as possible, without stalling independent instructions executing in the arithmetic units at that time. Hence, properly scheduling instructions that may generate a large number of conflicts makes it possible to maintain high performance. The figure presented here shows the simulated performance of an 8x8 IDCT on Vector IRAM, which uses large strides. We simulated DRAM macros with 1, 2, and 4 sub-banks, as well as with 1 sub-bank and address decoupling support. Address decoupling achieves the performance level of a DRAM macro with two sub-banks, about 60% higher than that of the single-bank structure.
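To see why the interleaving granularity matters, consider the C sketch below: the bank index is taken from a different bit range of the address in each mode, so which strides land in distinct banks depends on the mode. This is only an illustration of the principle; it is not the actual VIRAM address mapping, which the slides do not specify.

    #include <stdint.h>

    /* Illustrative only: with 8 banks, take the 3 bank-select bits from
     * just above an interleaving grain of 2^grain_log2 bytes. A small
     * grain spreads unit-stride streams across banks; a larger grain
     * keeps large-stride streams from all hitting the same bank. */
    unsigned bank_of(uint64_t addr, unsigned grain_log2)
    {
        return (unsigned)((addr >> grain_log2) & 0x7);
    }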

18 Modular Vector Unit Design
[Diagram: four identical 64b lanes under common control, each with a 64b crossbar interface, two integer datapaths, an FP datapath, and its partition of the vector register and flag register elements.]

I will now talk about the implementation of Vector IRAM and discuss why it is scalable and modular. For the heart of the vector unit we designed a single 64b component we call a "lane", which we replicated 4 times. A lane includes one 64b datapath from every functional unit and a partition of the vector register file. This partition stores the vector elements to be processed by the local datapaths during the execution of a vector instruction. This modular approach significantly simplifies design and testing, as only a single, simpler component must be implemented. It also provides a simple scaling path for our architecture: future implementations can provide higher or lower performance, power, and area by scaling the number of lanes, without significant datapath or control redesign. One can generate multiple high-end or low-end implementations from a single design database this way. In addition, most instructions require only local interconnect within a lane for their execution, which reduces the effect of interconnect delay scaling on this design and makes the architecture appropriate for deep submicron processes.

Single 64b "lane" design replicated 4 times
Reduces design and testing time
Provides a simple scaling model (up or down) without major control or datapath redesign
Most instructions require only intra-lane interconnect
Tolerance to interconnect delay scaling

19 Floorplan
Technology: IBM SA-27E, 0.18 µm CMOS, 6 metal layers (copper)
290 mm2 die area (14.5 mm x 20.0 mm); 225 mm2 for memory/logic; DRAM: 161 mm2; vector lanes: 51 mm2
Transistor count: ~150M
Power supply: 1.2V for logic, 1.8V for DRAM
Peak vector performance:
1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)
3.2/6.4/12.8 Gops with multiply-add
1.6 Gflops (single-precision)

This figure presents the floorplan of Vector IRAM: nearly 300 square mm and 150 million transistors in a 0.18 µm CMOS process by IBM. Blue blocks on the floorplan indicate DRAM macros or compiled SRAM blocks. Golden blocks are those designed at Berkeley; they include synthesized logic for control and the FP datapaths, and full-custom logic for register files, integer datapaths, and DRAM. Vector IRAM operates at 200 MHz. The power supply is 1.2V for logic and 1.8V for DRAM. The peak performance of the vector unit is 1.6 Gops for 64-bit integer operations; performance doubles or quadruples for 32b and 16b operations respectively. Peak floating-point performance is 1.6 Gflops. There are several interesting things to notice on the floorplan. First, the overall design modularity and scalability: it consists mostly of replicated DRAM macros and vector lanes connected through a crossbar. Another interesting feature is the percentage of the design directly visible to software. Compilers can control any part of the design that is registers, datapaths, or main memory, simply by scheduling the proper arithmetic or load/store instructions, and the majority of our design is used for main memory, vector registers, and datapaths. If, on the other hand, you look at a processor like the Pentium III, you will see that less than 20% of its area is used for datapaths and registers; the rest is caches and dynamic issue logic. While these usually work for the benefit of applications, they cannot be controlled by the compiler and they cannot be turned off when not needed.

20 Alternative Floorplans (1)
"VIRAM-8MB": 4 lanes, 8 MBytes, 190 mm2, 3.2 Gops at 200 MHz
"VIRAM-2Lanes": 2 lanes, 4 MBytes, 120 mm2, 1.6 Gops at 200 MHz
"VIRAM-Lite": 1 lane, 2 MBytes, 60 mm2, 0.8 Gops at 200 MHz

21 Alternative Floorplans (2)
"RAMless" VIRAM: 2 lanes, 55 mm2, 1.6 Gops at 200 MHz
2 high-bandwidth DRAM interfaces and decoupling buffers
Vector processors need high bandwidth, but they can tolerate latency

22 Power Consumption
Power saving techniques:
Low power supply for logic (1.2 V), possible because of the low clock rate (200 MHz); wide vector datapaths provide high performance
Extensive clock gating and datapath disabling, utilizing the explicit parallelism information of vector instructions and conditional execution
Simple, single-issue, in-order pipeline
Typical power consumption: 2.0 W
MIPS core: 0.5 W
Vector unit: 1.0 W (min ~0 W)
DRAM: 0.2 W (min ~0 W)
Misc.: 0.3 W (min ~0 W)

A number of architectural techniques were used to reduce the power consumption of Vector IRAM. First, the low clock frequency of 200 MHz allowed us to use a 1.2V logic supply, which was the major contributor to the power savings. Performance in this chip comes not from high-frequency operation but from wide datapaths that execute multiple vector element operations in parallel. We also made extensive use of clock gating and datapath disabling, exploiting the explicit parallelism information provided by vector instructions: each instruction describes the resources it will use for the next few cycles. We also used predicated execution to disable additional resources and clocks. Finally, the simple, in-order nature of the vector unit leads to low power overhead for issue and control logic. We have estimated the typical power consumption of Vector IRAM at 2 Watts. About half of that goes to the vector unit, where the major contributor is the large register file; one quarter goes to the scalar core and another quarter to the DRAM and I/O logic. Note that when not in use, most of the components in our design dissipate almost no power, as they are dynamically turned off.

23 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for multimedia applications

24 VIRAM Compiler
[Diagram: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, which drives code generators for the T3D/T3E, C90/T90/SV1, and SV2/VIRAM.]
Based on Cray's PDGCS production environment for vector supercomputers
Extensive vectorization and optimization capabilities, including outer-loop vectorization
No need to use special libraries or variable types for vectorization

Apart from the hardware, we have also worked on software development tools. We have a vectorizing compiler with C, C++, and Fortran front-ends, based on the production compiler Cray uses for its vector supercomputers, which we ported to our architecture. It has extensive vectorization capabilities, including outer-loop vectorization. Using this compiler, one can vectorize applications written in high-level languages without necessarily using optimized libraries or "special" (non-standard) variable types.

25 Compiler Performance
64x64 matrix-matrix multiply, single precision:
Theoretical peak: 1.60 GFLOPS
Handcoded assembly: 1.58 GFLOPS
Compiler: 0.85 GFLOPS
Compiler with outer-loop vectorization: 1.51 GFLOPS
Performance tuning is currently in progress
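For reference, the benchmark above corresponds to a plain C loop nest like the sketch below; no intrinsics or special types are involved. Vectorizing only the innermost loop means vectorizing the k reduction, whereas outer-loop vectorization vectorizes across j and keeps the reduction per element, which is what recovers most of the handcoded performance. The exact source used for the measurement is not shown in the slides; this is an illustrative version.

    /* 64x64 single-precision matrix multiply, C = A*B, written naively. */
    void matmul64(const float A[64][64], const float B[64][64],
                  float C[64][64])
    {
        for (int i = 0; i < 64; i++)
            for (int j = 0; j < 64; j++) {   /* outer-loop vectorization */
                float sum = 0.0f;            /* targets this j loop      */
                for (int k = 0; k < 64; k++) /* inner reduction loop     */
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }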

26 Compiler Challenges
Generating code for variable data-type widths:
The vectorizer starts with the largest width (64b)
At the end, if the largest width actually encountered is smaller, the vectorization is discarded and restarted at the narrower width
For simplicity, a single loop will use the largest width present in it
Consistency between the scalar cache and DRAM:
Problem when the vector unit writes cached data
The vector unit invalidates cache entries on writes
The compiler generates synchronization instructions: vector after scalar, scalar after vector; read after write, write after read, write after write
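The width rule means a single wide-typed variable can pin a whole loop to a wide vectorization, as in this illustrative C loop; the effect on elements per cycle follows from the 256b datapaths described earlier.

    #include <stdint.h>

    /* The pixels are 16b, but the 64b accumulator forces the loop to be
     * vectorized at 64b: 4 elements per cycle per functional unit
     * instead of the 16 that a pure 16b loop would get. */
    uint64_t sum_pixels(int n, const uint16_t *pix)
    {
        uint64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += pix[i];
        return sum;
    }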

27 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for multimedia applications

28 Performance: Efficiency
Kernel               Peak         Sustained    % of Peak
Image Composition    6.4 GOPS     6.40 GOPS    100%
iDCT                 6.4 GOPS     3.10 GOPS    48.4%
Color Conversion     3.2 GOPS     3.07 GOPS    96.0%
Image Convolution    3.2 GOPS     3.16 GOPS    98.7%
Integer VM Multiply  3.2 GOPS     3.00 GOPS    93.7%
FP VM Multiply       1.6 GFLOPS   1.59 GFLOPS  99.6%
Average                                        89.4%

In the next few slides I will talk about the performance of Vector IRAM, evaluated through simulation at this point in the design. This slide shows the theoretical peak and sustained throughput of Vector IRAM for a number of media kernels, such as image composition and convolution. Despite the lack of caches, VIRAM achieves throughput very close to its peak for the majority of these kernels: about 90% of peak on average. iDCT is limited by memory conflicts and by the ability to generate just four addresses per cycle for strided and indexed operations.

29 Performance: Comparison
Kernel              VIRAM   MMX
iDCT                0.75    3.75 (5.0x)
Color Conversion    0.78    8.00 (10.2x)
Image Convolution   1.23    5.49 (4.5x)
QCIF (176x144)      7.1M    33M (4.6x)
CIF (352x288)       28M     140M (5.0x)

QCIF and CIF numbers are in clock cycles per frame; all other numbers are in clock cycles per pixel. MMX results assume no first-level cache misses.

This slide compares Vector IRAM to SIMD multimedia extensions on media kernels. The numbers presented are in cycles per pixel, and all results for the SIMD extensions assume that caches are preloaded. The numbers in parentheses are the speedup of the Vector IRAM architecture. You can see that Vector IRAM outperforms the SIMD extensions by factors of roughly 4.5 to 10, even for applications like iDCT where it does not achieve its peak performance. A processor using these extensions would have to be clocked that many times faster to achieve similar performance, which would lead to much higher power consumption. Vector instructions allow us to keep wide datapaths busy for multiple cycles, whereas a SIMD instruction can utilize a 64b datapath for only a single cycle; to utilize multiple SIMD datapaths, one must fetch, decode, and issue multiple instructions per cycle. VIRAM also supports vector load and store operations without any alignment restriction on the whole vector: only the individual elements have to be aligned, and the vector load and store instructions take care of the alignment of the vector automatically. With SIMD extensions, on the other hand, one has to issue a number of load and shift/rotate instructions to achieve the functionality of a single vector load.

30 Vector vs. SIMD
Vector: one instruction keeps multiple datapaths busy for many cycles. SIMD: one instruction keeps one datapath busy for one cycle.
Vector: wide datapaths can be used without ISA changes or issue-logic redesign. SIMD: wide datapaths require either changing the ISA or changing the issue width.
Vector: strided and indexed vector load and store instructions. SIMD: simple scalar loads; multiple instructions needed to load a vector.
Vector: no alignment restriction for vectors; only individual elements must be aligned to their width. SIMD: short vectors must be aligned in memory; otherwise multiple instructions are needed to load them.

31 Vector vs. SIMD: Example
Simple example: conversion from RGB to YUV
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G + 14221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
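For reference, here is a scalar C version of this conversion with the same fixed-point coefficients (the standard BT.601 constants scaled by 32768, consistent with the values above). For clarity this sketch takes separate R/G/B planes, whereas the VIRAM and MMX listings that follow walk packed RGB data with strided accesses.

    #include <stdint.h>

    /* RGB -> YUV with Q15 coefficients; ">> 15" implements the division
     * by 32768, assuming the usual arithmetic right shift on signed
     * intermediates. */
    void rgb_to_yuv(int n, const uint8_t *r, const uint8_t *g,
                    const uint8_t *b, uint8_t *y, uint8_t *u, uint8_t *v)
    {
        for (int i = 0; i < n; i++) {
            int32_t R = r[i], G = g[i], B = b[i];
            y[i] = (uint8_t)(( 9798*R + 19235*G +  3736*B) >> 15);
            u[i] = (uint8_t)(((-4784*R -  9437*G + 14221*B) >> 15) + 128);
            v[i] = (uint8_t)(((20218*R - 16941*G -  3277*B) >> 15) + 128);
        }
    }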

32 VIRAM Code
RGBtoYUV:
    vlds.u.b    r_v, r_addr, stride3, addr_inc   # load R
    vlds.u.b    g_v, g_addr, stride3, addr_inc   # load G
    vlds.u.b    b_v, b_addr, stride3, addr_inc   # load B
    xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
    xlmadd.u.sv o1_v, t1_s, g_v
    xlmadd.u.sv o1_v, t2_s, b_v
    vsra.vs     o1_v, o1_v, s_s
    xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
    xlmadd.u.sv o2_v, t4_s, g_v
    xlmadd.u.sv o2_v, t5_s, b_v
    vsra.vs     o2_v, o2_v, s_s
    vadd.sv     o2_v, a_s, o2_v
    xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
    xlmadd.u.sv o3_v, t7_s, g_v
    xlmadd.u.sv o3_v, t8_s, b_v
    vsra.vs     o3_v, o3_v, s_s
    vadd.sv     o3_v, a_s, o3_v
    vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y
    vsts.b      o2_v, u_addr, stride3, addr_inc  # store U
    vsts.b      o3_v, v_addr, stride3, addr_inc  # store V
    subu        pix_s, pix_s, len_s
    bnez        pix_s, RGBtoYUV

33 MMX Code (1)
RGBtoYUV:
    movq      mm1, [eax]
    pxor      mm6, mm6
    movq      mm0, mm1
    psrlq     mm1, 16
    punpcklbw mm0, ZEROS
    movq      mm7, mm1
    punpcklbw mm1, ZEROS
    movq      mm2, mm0
    pmaddwd   mm0, YR0GR
    movq      mm3, mm1
    pmaddwd   mm1, YBG0B
    movq      mm4, mm2
    pmaddwd   mm2, UR0GR
    movq      mm5, mm3
    pmaddwd   mm3, UBG0B
    punpckhbw mm7, mm6
    pmaddwd   mm4, VR0GR
    paddd     mm0, mm1
    pmaddwd   mm5, VBG0B
    movq      mm1, 8[eax]
    paddd     mm2, mm3
    movq      mm6, mm1
    paddd     mm4, mm5
    movq      mm5, mm1
    psllq     mm1, 32
    paddd     mm1, mm7
    punpckhbw mm6, ZEROS
    movq      mm3, mm1
    pmaddwd   mm1, YR0GR
    movq      mm7, mm5
    pmaddwd   mm5, YBG0B
    psrad     mm0, 15
    movq      TEMP0, mm6
    movq      mm6, mm3
    pmaddwd   mm6, UR0GR
    psrad     mm2, 15
    paddd     mm1, mm5
    movq      mm5, mm7
    pmaddwd   mm7, UBG0B
    psrad     mm1, 15
    pmaddwd   mm3, VR0GR
    packssdw  mm0, mm1
    pmaddwd   mm5, VBG0B
    psrad     mm4, 15
    movq      mm1, 16[eax]

34 MMX Code (2)
    paddd     mm6, mm7
    movq      mm7, mm1
    psrad     mm6, 15
    paddd     mm3, mm5
    psllq     mm7, 16
    movq      mm5, mm7
    psrad     mm3, 15
    movq      TEMPY, mm0
    packssdw  mm2, mm6
    movq      mm0, TEMP0
    punpcklbw mm7, ZEROS
    movq      mm6, mm0
    movq      TEMPU, mm2
    psrlq     mm0, 32
    paddw     mm7, mm0
    movq      mm2, mm6
    pmaddwd   mm2, YR0GR
    movq      mm0, mm7
    pmaddwd   mm7, YBG0B
    packssdw  mm4, mm3
    add       eax, 24
    add       edx, 8
    movq      TEMPV, mm4
    movq      mm4, mm6
    pmaddwd   mm6, UR0GR
    movq      mm3, mm0
    pmaddwd   mm0, UBG0B
    paddd     mm2, mm7
    pmaddwd   mm4,
    pxor      mm7, mm7
    pmaddwd   mm3, VBG0B
    punpckhbw mm1,
    paddd     mm0, mm6
    movq      mm6, mm1
    pmaddwd   mm6, YBG0B
    punpckhbw mm5,
    movq      mm7, mm5
    paddd     mm3, mm4
    pmaddwd   mm5, YR0GR
    movq      mm4, mm1
    pmaddwd   mm4, UBG0B
    psrad     mm0, 15
    paddd     mm0, OFFSETW
    psrad     mm2, 15
    paddd     mm6, mm5
    movq      mm5, mm7

35 MMX Code (3)
    pmaddwd   mm7, UR0GR
    psrad     mm3, 15
    pmaddwd   mm1, VBG0B
    paddd     mm4, OFFSETD
    packssdw  mm2, mm6
    pmaddwd   mm5, VR0GR
    paddd     mm7, mm4
    psrad     mm7, 15
    movq      mm6, TEMPY
    packssdw  mm0, mm7
    movq      mm4, TEMPU
    packuswb  mm6, mm2
    movq      mm7, OFFSETB
    paddd     mm1, mm5
    paddw     mm4, mm7
    psrad     mm1, 15
    movq      [ebx], mm6
    packuswb  mm4,
    movq      mm5, TEMPV
    packssdw  mm3, mm4
    paddw     mm5, mm7
    paddw     mm3, mm7
    movq      [ecx], mm4
    packuswb  mm5, mm3
    add       ebx, 8
    add       ecx, 8
    movq      [edx], mm5
    dec       edi
    jnz       RGBtoYUV

36 Performance: FFT (1)
This slide presents the performance of VIRAM for floating-point and fixed-point FFT and how it compares to that of various other architectures; the code for all machines on this slide is hand-optimized. First, floating-point FFT: VIRAM outperforms VLIW-based DSP chips like TigerSHARC and VelociTI by factors of 2 to 3, and it comes close to specialized FFT systems like the Pathfinder and Wildstar boards.

37 Performance: FFT (2)
This slide shows similar results for fixed-point FFT. Vector IRAM performs as well as DSP architectures and specialized systems, while it outperforms general-purpose processors by an order of magnitude.

38 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD
Future work on vector processors for multimedia applications

39 Future Work
A platform for ultra-scalable vector coprocessors
Goals:
Balance data-level parallelism and random ILP in the vector design
Add another scaling dimension to vector processors
Work around the scaling problems of a large register file
Allow the generation of numerous configurations for different performance, area (cost), and power requirements
Approach:
Cluster-based architecture within lanes
Local register files for datapaths
Decoupled everything

40 Ultra-scalable Architecture

41 Benefits
Two scaling models:
More lanes, when data-level parallelism is plentiful
More clusters, when random ILP is available
Performance, power, and cost on demand: simple to derive tens of configurations optimized for specific applications
Simpler design: simple clusters, simpler register files, trivial chaining control; no need for strictly synchronous clusters

42 Questions to Answer
Cluster organization:
How many local registers?
Assignment of instructions to clusters
Frequency of inter-cluster communication, and its dependence on the number of clusters, registers per cluster, etc.
Balancing the two scaling methods: scaling the number of lanes vs. scaling the number of clusters
Special ISA support for the clustered architecture
Compiler support for the clustered architecture

43 Conclusions
Vector IRAM: an integrated architecture for media processing
Based on vector processing and embedded DRAM
Simple, scalable, and efficient
One thing to keep in mind: use the most efficient solution to exploit each level of parallelism, and make the best solutions for each level work together. Vector processing is very efficient for data-level parallelism.
[Diagram: levels of parallelism and their efficient solutions. Data: VECTOR; irregular ILP: VLIW? superscalar?; thread: MT? SMT? CMP?; multi-programming: MPP? NOW?]

To conclude, today I have presented Vector IRAM, an integrated architecture for media processing that combines a 256-bit vector unit with 16 MBytes of embedded DRAM. It uses 150 million transistors in 300 square mm. At just 200 MHz, it achieves 3.2 Gops for 32b integers and consumes 2 Watts. It is a simple, scalable design that is efficient in terms of performance, power, and area. The current status of the prototype is the following: we are in the verification and back-end stage of the design; RTL development and the design of several full-custom components have been completed; and we expect to tape out by the end of the year. The compiler is also operational and is being tuned for performance. We are also working on applications for this system.

44 Backup Slides
Before I answer any questions, I would like to acknowledge the help of several companies and individuals with this research project.

45 Architecture Details (1)
MIPS64™ 5Kc core (200 MHz):
Single-issue core with 6-stage pipeline
8 KByte, direct-mapped instruction and data caches
Single-precision scalar FPU
Vector unit (200 MHz):
8 KByte register file (32 64b elements per register)
4 functional units: 2 arithmetic (1 FP), 2 flag processing
256b datapaths per functional unit
Memory unit:
4 address generators for strided/indexed accesses
2-level TLB structure: 4-ported, 4-entry microTLB and single-ported, 32-entry main TLB
Pipelined to sustain up to 64 pending memory accesses

The vector unit is connected to the coprocessor interface of the MIPS processor and runs at 200 MHz. It includes a multiported 8-KByte register file, which allows each of the 32 registers to hold 32 64b elements, or 64 32b elements, and so on. The flag register file has a capacity of half a KByte. There are two functional units for arithmetic operations; both can execute integer and logical operations, but only one can execute floating-point. There are also 2 flag processing units, which provide support for predicated execution and exception handling. Each of the functional units has a 256-bit pipelined datapath: on each cycle, 4 64b operations, 8 32b operations, or 16 16b operations can execute in parallel. To simplify the design and reduce area requirements, our prototype does not implement 8b integer operations or double-precision arithmetic. All operations excluding divides are fully pipelined. The vector coprocessor also includes one memory (load/store) unit. The LSU can exchange up to 256b per cycle with the memory system and has four address generators for strided and indexed accesses. Address translation is performed in a two-level TLB structure: the hardware-managed, first-level microTLB has four entries and four ports, while the main TLB has 32 double-page entries and a single access port and is managed by software. The memory unit is pipelined, and up to 64 independent accesses may be pending at any time.

46 Architecture Details (2)
Main memory system:
No SRAM cache for the vector unit
8 2-MByte DRAM macros; single bank per macro, 2Kb page size
256b synchronous, non-multiplexed I/O interface
25ns random access time, 7.5ns page access time
Crossbar interconnect:
12.8 GBytes/s peak bandwidth per direction (load/store)
Up to 5 independent addresses transmitted per cycle
Off-chip interface:
64b SysAD bus to external chip-set (100 MHz)
2-channel DMA engine

Embedded DRAM is used in Vector IRAM as the main memory for both the scalar and vector units; there is no SRAM cache for the vector unit, which accesses DRAM directly. The memory system consists of 8 DRAM macros. Each has a capacity of 2 MBytes with 2-Kbit pages and a monolithic bank organization. The interface is synchronous at 200 MHz and 256 bits wide. Random and page access times for each macro are 25 and 7.5 nsec respectively. The embedded DRAM macros are designed and provided by IBM Microelectronics. The macros are connected to the scalar and vector units with a crossbar interconnect, which has a peak bandwidth of 12.8 GBytes/s per direction, load or store, and can transmit up to 5 independent addresses per cycle. The external interface of Vector IRAM is the 64b SysAD bus, the standard bus for MIPS-based processors, operating at 100 MHz. A simple 2-channel DMA engine is also included for transferring data between external devices and the on-chip DRAM.

47 Hardware Exposed to Software
Pentium® III: <25% of area for registers and datapaths
The rest is still useful, but not visible to software
Cannot be turned off when not needed

The blocks in blue are those used to execute operations (SSE/MMX included). The rest are used for instruction and data fetching (speculation, reordering, etc.), tolerating latency (two-level caching), and solving dependency problems (write buffers). In other words, 80% of the die is there to keep about 20% of it busy. This is bad engineering. Extra hardware means: (a) larger cost, (b) larger complexity (these transistors must be verified), and (c) higher power (more transistors switch per operation). While speculation, reordering, and caching are very useful and we should definitely keep using them in some form, we cannot rely on hardware alone to keep increasing performance at rates similar to those of the last 10 years. Caching and speculation techniques are running out of steam and becoming less and less efficient in terms of performance benefit versus complexity and die-area increase. Multimedia applications change the game as well: due to real-time performance requirements, out-of-order architectures are more difficult to use, and since temporal locality is not always there, caches can create a performance problem. That is why cache-bypass buffers are built into several processors. Caches are also slow and difficult to manage optimally in hardware when they are large.

