IRAM A Media-oriented Processor with Embedded DRAM

Presentation on theme: "IRAM A Media-oriented Processor with Embedded DRAM"— Presentation transcript:

1 IRAM A Media-oriented Processor with Embedded DRAM
Christoforos Kozyrakis, David Patterson, Katherine Yelick
Computer Science Division, University of California at Berkeley
Today I will present Vector IRAM, a vector microprocessor for multimedia applications, integrated with embedded DRAM. This processor is currently under development at the University of California at Berkeley, in the IRAM research group, led by Professor David Patterson. On the title slide you see the names of the people involved in the design. IBM Microelectronics and MIPS Technologies have provided significant help with this project.

2 IRAM Overview
A processor architecture for embedded/portable systems running media applications
Based on media processing and embedded DRAM
Simple, scalable, and efficient
Good compiler target
Microprocessor prototype with 256-bit media processor, 16 MBytes DRAM
150 million transistors, 290 mm2
3.2 Gops, 2 W at 200 MHz
Industrial-strength compiler
Implemented by 6 graduate students
I will start with what is interesting about Vector IRAM. This is a prototype microprocessor that integrates a vector unit with 256-bit datapaths and a 16 MByte embedded DRAM memory system. The design uses 150 million transistors and occupies nearly 300 square mm. While operating at just 200 MHz, Vector IRAM achieves 3.2 giga ops and consumes 2 Watts. Vector IRAM also comes with an industrial-strength vectorizing compiler for software development. Vector IRAM is being implemented by a group of only 6 graduate students, responsible for architecture, design, simulation, and testing. So, if Patterson and Hennessy decide to introduce performance/watt/man-year as a major processor metric in the new version of their book, this processor will likely be one of the best in its class.

3 The IRAM Team Hardware: Software: Advisors: Help from:
Joe Gebis, Christoforos Kozyrakis, Ioannis Mavroidis, Iakovos Mavroidis, Steve Pope, Sam Williams Software: Alan Janin, David Judd, David Martin, Randi Thomas Advisors: David Patterson, Katherine Yelick Help from: IBM Microelectronics, MIPS Technologies, Cray, Avanti

4 Outline
Motivation and goals
Instruction set
IRAM prototype: microarchitecture and design
Compiler
Performance
Comparison with SIMD

5 PostPC processor applications
Multimedia processing (“90% of cycles on desktop”): image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption
narrow data types, streaming data, real-time response
Embedded and portable systems: notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, set-top boxes
limited chip count, limited power/energy budget
Significantly different environment from that of workstations and servers
And larger: 32-bit microprocessor market: 386 million for embedded, 160 million for PCs

6 Motivation and Goals Processor features for PostPC systems:
High performance on demand for multimedia without continuous high power consumption
Tolerance to memory latency
Scalable
Mature, HLL-based software model
Design a prototype processor chip
Complete proof of concept
Explore detailed architecture and design issues
Motivation for software development
The goal of this project has been to explore processor architecture for systems other than personal computers and servers. A large percentage of electronic devices in the near future will be embedded or portable and will be running multimedia applications. So what are the requirements for microprocessors for such devices? The first one is high performance for multimedia applications at low power consumption, and you want to spend the power only when the performance is needed. Next, we want to be able to tolerate the ever-growing gap between processor and memory technology. The processor architecture must be scalable. Ideally, you want to be able to generate multiple high-end or low-end chips from a single design database. You also want to be able to move to the next generation of processing technology and deliver additional performance without a significant increase in complexity. Finally, for any processor to become widely used, its software development model must rely on high-level languages and compiler technology. We decided to research these issues by implementing a prototype microprocessor chip. Apart from being the ultimate proof of concept, designing a working prototype allowed us to explore detailed architectural and design issues of our ideas and learn many valuable lessons from them. A working prototype also provides the motivation and the fast platform for compiler and software development. And, after all, designing hardware can be a lot of fun…

7 Key Technologies Media processing Embedded DRAM
High performance on demand for media processing
Low power for issue and control logic
Low design complexity
Well-understood compiler technology
Embedded DRAM
High bandwidth for media processing
Low power/energy for memory accesses
“System on a chip”
The architecture of Vector IRAM relies on two key technologies: vector processing and embedded DRAM. Vector processing provides the required performance for multimedia applications. With vector architectures, data-level parallelism is described explicitly by every instruction and not discovered dynamically by hardware. That leads to designs with low power consumption and low complexity. They also come with mature compiler technology, in use for over twenty years in vector supercomputer systems. Embedded DRAM technology provides the high memory bandwidth needed by a vector processor for high performance. By eliminating off-chip memory references and redundant cache lookups, the power consumption in the memory system is also reduced. Finally, integrating a significant amount of memory on the same die as the processor is a step towards a “system-on-a-chip”, something highly desired for embedded and portable applications.

8 Outline
Motivation and goals
Instruction set
IRAM prototype: microarchitecture and design
Compiler
Performance
Comparison with SIMD

9 Potential Multimedia Architecture
“New” model: VSIW = Very Short Instruction Word!
Compact: describe N operations with 1 short instruction
Predictable (real-time) performance vs. statistical performance (cache)
Multimedia ready: choose N*64b, 2N*32b, 4N*16b
Easy to get high performance; N operations:
are independent
use same functional unit
access disjoint registers
access registers in same order as previous instructions
access contiguous memory words or known pattern
hides memory latency (and any other latency)
Compiler technology already developed, for sale!
Why MPP? Best potential performance! Few successes. Operate on vectors of registers. It’s easier to vectorize than parallelize. Scales well: more hardware and slower clock rate. Crazy research.

10 VSIW reduces ops by 1.2X, instructions by 20X!
Operation & Instruction Count: RISC vs. “VSIW” Processor (from F. Quintana, U. Barcelona)
[Table: Spec92fp operations (M) and instructions (M) for RISC vs. VSIW, with R/V ratios, for swim, hydro2d, nasa7, su2cor, tomcatv, wave5, and mdljdp2; the per-program values were not preserved in this transcript.]
VSIW reduces ops by 1.2X, instructions by 20X!

11 Revive Vector (VSIW) Architecture!
Cost: ~$1M each? → Single-chip CMOS MPU/IRAM
Low latency, high BW memory system? → Embedded DRAM
Code density? → Much smaller than VLIW/EPIC
Compilers? → For sale, mature (>20 years)
Vector performance? → Easy to scale speed with technology
Power/Energy? → Parallel to save energy, keep performance
Scalar performance? → Include modern, modest CPU → OK scalar
Real-time? → No caches, no speculation → repeatable speed as inputs vary
Limited to scientific applications? → Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
Supercomputer industry dead? → Very attractive to scale; new class of applications
Before, vector machines had a lousy scalar processor; a modest CPU will do well on many programs, while vectors do great on others.

12 But ... But vectors are in your appendix, not in a chapter
But my professor told me vectors are dead But I know my application doesn’t vectorize (= “but my application is not a dense matrix”) But the latest fashion trend is VLIW, and I don’t want to be out of style Comeback: “If the code is vectorizable, the best architecture is vector” (Jim Smith, ISCA keynote address) Claim is media code is vectorizable, or if not, then scalar architecture is as good as any other solution

13 Vector Surprise Use vectors for inner loop parallelism (no surprise)
One dimension of array: A[0, 0], A[0, 1], A[0, 2], ... think of machine as 32 vector regs each with 64 elements 1 instruction updates 64 elements of 1 vector register and for outer loop parallelism! 1 element from each column: A[0,0], A[1,0], A[2,0], ... think of machine as 64 “virtual processors” (VPs) each with 32 scalar registers! (~ multithreaded processor) 1 instruction updates 1 scalar register in 64 VPs Hardware identical, just 2 compiler perspectives

14 Vector Architecture State
[Figure: vector architecture state. 32 general-purpose vector registers (vr0–vr31), 32 flag registers (vf0–vf31, 1 bit per element), and 16 scalar registers (vs0–vs15, 64b); each vector register holds $vlr elements of width $vpw, which can be viewed as virtual processors VP0–VP$vlr-1.]

15 Vector Multiply with dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i<m; i++) {
  for (j=1; j<n; j++) {
    sum = 0;
    for (t=1; t<k; t++)
      sum += a[i][t] * b[t][j];
    c[i][j] = sum;
  }
}

16 Novel Matrix Multiply Solution
You don't need to do reductions for matrix multiply
You can calculate multiple independent sums within one vector register
You can vectorize the outer (j) loop to perform 32 dot-products at the same time
Or you can think of each of the 32 Virtual Processors as doing one of the dot products
(Assume Maximum Vector Length is 32)
We show it in C source code, but you can imagine the vector assembly instructions generated from it

17 Optimized Vector Example
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i<m; i++) {
  for (j=1; j<n; j+=32) {                    /* Step j 32 at a time. */
    sum[0:31] = 0;                           /* Initialize a vector register to zeros. */
    for (t=1; t<k; t++) {
      a_scalar = a[i][t];                    /* Get scalar from a matrix. */
      b_vector[0:31] = b[t][j:j+31];         /* Get vector from b matrix. */
      prod[0:31] = b_vector[0:31]*a_scalar;  /* Do a vector-scalar multiply. */

18 Optimized Vector Example cont’d
      /* Vector-vector add into results. */
      sum[0:31] += prod[0:31];
    }
    /* Unit-stride store of vector of results. */
    c[i][j:j+31] = sum[0:31];
  }
}

19 Vector Instruction Set
Complete load-store vector instruction set
Uses the MIPS64™ ISA coprocessor 2 opcode space
Ideas work with any core CPU: ARM, PowerPC, ...
Architecture state:
32 general-purpose vector registers
32 vector flag registers
Data types supported in vectors: 64b, 32b, 16b (and 8b)
91 arithmetic and memory instructions
Not specified by the ISA:
Maximum vector register length
Functional unit datapath width
Before I describe the microarchitecture of our chip, I will take a few minutes to introduce the instruction set. We have defined a complete load-store vector ISA within the coprocessor 2 opcode space of the MIPS64 instruction set. A vector can contain integer elements that are 64, 32, 16, or 8 bits wide, or floating-point elements of single or double precision. In addition to the architecture state in the MIPS ISA, our architecture introduces a vector register file with 32 general-purpose vector registers. There are also 32 vector flag registers with a single mask bit per element, used for predicated execution and exception handling, as well as some registers for scalar operands. The vector instruction set defines 91 unique instructions. These include vector arithmetic operations for integer and floating-point data, logical operations, and vector processing instructions that perform pack and unpack operations on vector registers. Three types of vector load and store instructions are supported: sequential, strided, and indexed. Our instruction set does not define the maximum vector length. This number is the maximum number of elements a vector register can hold, and it also defines the maximum number of operations a vector instruction can specify. Nor does it define the width of the functional unit datapaths, which determines the number of element operations executed in parallel every cycle. An implementation can select the proper maximum vector length and datapath width to match the available silicon resources and performance requirements. All implementations are still binary compatible.
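To make the last point concrete, here is a minimal strip-mining sketch in C of how source code stays independent of the maximum vector length. It is an illustration under stated assumptions, not VIRAM compiler output: the mvl_for_int32() helper and the inner loop stand in for the ISA's vector-length mechanism and for the vector instructions the compiler would emit.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in: on real hardware the implementation reports its
 * maximum vector length; VIRAM-1 holds 64 32b elements per vector register. */
static size_t mvl_for_int32(void) { return 64; }

void scaled_add(int32_t *c, const int32_t *a, const int32_t *b, size_t n)
{
    size_t mvl = mvl_for_int32();
    for (size_t i = 0; i < n; i += mvl) {
        size_t vl = (n - i < mvl) ? (n - i) : mvl;  /* vector length for this strip */
        for (size_t j = 0; j < vl; j++)             /* one vector instruction's worth of work */
            c[i + j] = a[i + j] + 2 * b[i + j];
    }
}

Because the strip length is read at run time, the same binary runs correctly on implementations with different maximum vector lengths or datapath widths.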

20 Vector IRAM ISA Summary
Scalar: the MIPS64 scalar instruction set.
Vector ALU: arithmetic operations on s.int, u.int, s.fp, and d.fp data, element widths of 8, 16, 32, or 64 bits, in .v, .vv, .vs, and .sv operand forms.
Vector memory: load and store with unit stride, constant stride, or indexed addressing, for s.int and u.int data.
ALU operations: integer, floating-point, convert, logical, vector processing, flag processing.
In total: 91 instructions, 660 opcodes.

21 Support for DSP
[Figure: fixed-point multiply-add datapath showing rounding (n to n/2 bits) and saturation.]
Support for fixed-point numbers, saturation, rounding modes
Simple instructions for intra-register permutations for reductions and butterfly operations
High performance for dot-products and FFT without the complexity of a random permutation
To enable efficient vectorization of multimedia applications, a number of enhancements were made to traditional vector instruction sets. Specifically, we added instructions that can handle fixed-point formats. They implement operations like multiply and multiply-add, and features like saturated arithmetic and special fixed-point rounding modes. We also introduced a set of instructions that perform simple element permutations within a vector register. These instructions lead to high performance for kernels like dot-products and FFTs, without the complexity and scaling problems of full, random permutations.
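As a rough illustration of the fixed-point semantics described above, the following C sketch shows a saturating multiply-add with rounding on 16-bit (Q15) data. The shift amount, rounding constant, and function names are assumptions chosen for the example; the actual instructions support several rounding modes and scaling amounts.

#include <stdint.h>

/* Saturate a 32-bit intermediate to the 16-bit result range. */
static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Q15 multiply-add: w = sat(z + round((x * y) >> 15)). */
static int16_t q15_madd(int16_t x, int16_t y, int16_t z)
{
    int32_t prod = (int32_t)x * (int32_t)y;   /* full-precision product */
    prod += 1 << 14;                          /* round to nearest before rescaling */
    return sat16((int32_t)z + (prod >> 15));  /* rescale to Q15 and saturate */
}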

22 Compiler/OS Enhancements
Compiler support
Conditional execution of vector instructions using the vector flag registers
Support for software speculation of load operations
Operating system support
MMU-based virtual memory
Restartable arithmetic exceptions
Valid and dirty bits for vector registers
Tracking of maximum vector length used
We also added a number of features to assist compiler and operating system tasks. We reduce the frequency of branch instructions by supporting conditional execution using the flag registers; virtually every vector instruction executes under mask control. Operating system support includes MMU-based virtual memory.
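A hedged example of the kind of conditional loop this enables is sketched below in plain C. Under the vectorizing compiler, the if statement would map to a vector compare that writes a flag register plus vector instructions executed under that mask, rather than a branch per element; the loop itself is an assumed illustration, not taken from the VIRAM benchmarks.

#include <stddef.h>

/* The compare can become a flag (mask) vector; the divide then executes only
 * for elements whose flag is set, so no per-element branch is needed and no
 * division by zero is ever performed. */
void safe_divide(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (b[i] != 0.0f)
            c[i] = a[i] / b[i];
    }
}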

23 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD

24 VIRAM Prototype Architecture
[Block diagram: MIPS64™ 5Kc core with 8KB instruction and data caches, FPU, and coprocessor interface (CP IF); vector unit with an 8KB vector register file, a 512B flag register file, two arithmetic units, two flag units, and a memory unit with TLB, all on 256b datapaths; 64b SysAD interface, DMA engine, and JTAG; memory crossbar connecting to eight 2MB DRAM macros (DRAM0–DRAM7).]
We now move to the microarchitecture of our prototype microprocessor. This slide presents the block diagram of Vector IRAM. It consists of three basic blocks: the scalar core, the vector unit, and the memory system. The slides in your handout include all the details of the system; I will just highlight some of the most interesting components here. The scalar core of VIRAM is the MIPS 5Kc, a single-issue, 64-bit processor with a six-stage pipeline. It includes first-level instruction and data caches and a coprocessor interface to which a single-precision FPU has been attached. The scalar core operates at 200 MHz, and its design has been provided by MIPS Technologies in the form of a synthesizable core. The vector unit and the memory system are described in detail on the next two slides.

25 Architecture Details (1)
MIPS64™ 5Kc core (200 MHz)
Single-issue core with 6-stage pipeline
8 KByte, direct-mapped instruction and data caches
Single-precision scalar FPU
Vector unit (200 MHz)
8 KByte register file (32 64b elements per register)
4 functional units: 2 arithmetic (1 FP), 2 flag processing
256b datapaths per functional unit
Memory unit
4 address generators for strided/indexed accesses
2-level TLB structure: 4-ported, 4-entry microTLB and single-ported, 32-entry main TLB
Pipelined to sustain up to 64 pending memory accesses
The vector unit is also connected to the coprocessor interface of the MIPS processor and works at 200 MHz. It includes a multiported 8 KByte register file. This allows each of the 32 registers to hold 32 64b elements, or 64 32b elements, and so on. The flag register file has a capacity of half a KByte. There are two functional units for arithmetic operations. Both can execute integer and logical operations, but only one can execute floating-point operations. There are also 2 flag processing units, which provide support for predicated execution and exception handling. Each of the functional units has a 256-bit pipelined datapath. On each cycle, 4 64b operations, 8 32b operations, or 16 16b operations can execute in parallel. To simplify the design and reduce area requirements, our prototype does not implement 8b integer operations or double-precision arithmetic. All operations excluding divides are fully pipelined. The vector coprocessor also includes one memory (load/store) unit. The LSU can exchange up to 256b per cycle with the memory system and has four address generators for strided and indexed accesses. Address translation is performed in a two-level TLB structure. The hardware-managed, first-level microTLB has four entries and four ports, while the main TLB has 32 double-page entries and a single access port. The main TLB is managed by software. The memory unit is pipelined, and up to 64 independent accesses may be pending at any time.

26 Architecture Details (2)
Main memory system
No SRAM cache for the vector unit
8 2-MByte DRAM macros
Single bank per macro, 2Kb page size
256b synchronous, non-multiplexed I/O interface
25ns random access time, 7.5ns page access time
Crossbar interconnect
12.8 GBytes/s peak bandwidth per direction (load/store)
Up to 5 independent addresses transmitted per cycle
Off-chip interface
64b SysAD bus to external chip-set (100 MHz)
2-channel DMA engine
Embedded DRAM is used in Vector IRAM as the main memory for both the scalar and vector units. There is no SRAM cache for the vector unit, which accesses DRAM directly. The memory system consists of 8 DRAM macros. Each macro has a capacity of 2 MBytes, with 2 Kbit pages and a monolithic bank organization. Their interface is synchronous at 200 MHz and 256 bits wide. Random and page access times for each macro are 25 and 7.5 nsec respectively. The embedded DRAM macros are designed and provided by IBM Microelectronics. The macros are connected to the scalar and vector units with a crossbar interconnect. The crossbar has a peak bandwidth of 12.8 GBytes/s per direction, load or store, and is able to transmit up to 5 independent addresses per cycle. The external interface of Vector IRAM is the 64b SysAD bus, the standard bus for MIPS-based processors. It operates at 100 MHz. A simple 2-channel DMA engine is also included for transferring data between external devices and the on-chip DRAM.

27 Vector Unit Pipeline Single-issue, in-order pipeline
Efficient for short vectors
Pipelined instruction start-up
Full support for instruction chaining, the vector equivalent of result forwarding
Hides long DRAM access latency
Random access latency could lead to stalls due to long load→use RAW hazards
Simple solution: “delayed” vector pipeline
The execution pipeline in the vector unit is simple, single-issue, and in-order. Keep in mind that each issued vector instruction specifies a large number of independent operations that can keep a functional unit busy for up to eight cycles. To make the pipeline efficient for applications with short vectors, the start-up of vector instructions is pipelined. We also provide full support for instruction chaining, the vector equivalent of result forwarding or bypassing. The pipeline must also be able to hide the long latency of on-chip DRAM. While embedded DRAM is faster than external memory, it is still slower than SRAM, and several processor cycles are necessary to load a value, especially if a random access is required. In this case, performance can be significantly reduced by stalls due to load→use RAW hazards. To address this issue, we use a very simple and power-efficient structure called the “delayed vector pipeline”. This is not the typical pipeline organization of vector supercomputers.

28 Modular Vector Unit Design
[Figure: four identical vector lanes, each containing a 64b crossbar interface, two integer datapaths, an FP datapath, and a partition of the vector and flag register elements, plus shared control; the lanes connect to the 256b memory crossbar.]
I will now talk about the implementation of Vector IRAM and discuss why it is scalable and modular. For the heart of the vector unit we have designed a single 64b component we call a “lane”, which we replicated 4 times. A lane includes one 64b datapath from every functional unit and a partition of the vector register file. This partition stores the vector elements to be processed by the local datapaths during the execution of a vector instruction. This modular approach significantly simplifies design and testing, as a single, simpler component must be implemented. It also provides a simple scaling path for our architecture. Future implementations can provide higher or lower performance, power, and area by scaling the number of lanes in the system, without requiring significant datapath or control redesign. One can generate multiple high-end or low-end implementations from a single design database this way. In addition, most instructions require only local interconnect within a lane for their execution, a fact that reduces the effect of interconnect delay scaling on this design. This makes the architecture appropriate for deep submicron processes.
Single 64b “lane” design replicated 4 times
Reduces design and testing time
Provides a simple scaling model (up or down) without major control or datapath redesign
Most instructions require only intra-lane interconnect
Tolerance to interconnect delay scaling

29 Floorplan Technology: IBM SA-27E 290 mm2 die area
0.18µm CMOS, 6 metal layers (copper)
290 mm2 die area
225 mm2 for memory/logic
DRAM: 161 mm2
Vector lanes: 51 mm2
Transistor count: ~150M
Power supply: 1.2V for logic, 1.8V for DRAM
Peak vector performance
1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)
3.2/6.4/12.8 Gops with multiply-add
1.6 Gflops (single-precision)
Die dimensions: 14.5 mm x 20.0 mm
This figure presents the floorplan of Vector IRAM. It occupies nearly 300 square mm and 150 million transistors in a 0.18µm CMOS process by IBM. Blue blocks on the floorplan indicate DRAM macros or compiled SRAM blocks. Golden blocks are those designed at Berkeley. They include synthesized logic for control and the FP datapaths, and full-custom logic for register files, integer datapaths, and DRAM. Vector IRAM operates at 200 MHz. The power supply is 1.2V for logic and 1.8V for DRAM. The peak performance of the vector unit is 1.6 giga ops for 64-bit integer operations. Performance doubles or quadruples for 32b and 16b operations respectively. Peak floating-point performance is 1.6 Gflops. There are several interesting things to notice on the floorplan. First, the overall design modularity and scalability: it mostly consists of replicated DRAM macros and vector lanes connected through a crossbar. Another very interesting feature is the percentage of the design that is directly visible to software. Compilers can control any part of the design that is registers, datapaths, or main memory, by scheduling the proper arithmetic or load/store instructions. The majority of our design is used for main memory, vector registers, and datapaths. On the other hand, if you take a look at a processor like the Pentium III, you will see that less than 20% of its area is used for datapaths and registers. The rest is caches and dynamic issue logic. While these usually work for the benefit of applications, they cannot be controlled by the compiler, and they cannot be turned off when not necessary.

30 Alternative Floorplans (1)
“VIRAM-8MB”: 4 lanes, 8 MBytes, 190 mm2, 3.2 Gops at 200 MHz (32-bit ops)
“VIRAM-2Lanes”: 2 lanes, 4 MBytes, 120 mm2, 1.6 Gops at 200 MHz
“VIRAM-Lite”: 1 lane, 2 MBytes, 60 mm2, 0.8 Gops at 200 MHz

31 Alternative Floorplans (2)
“RAMless” VIRAM: 2 lanes, 55 mm2, 1.6 Gops at 200 MHz
2 high-bandwidth DRAM interfaces and decoupling buffers
Vector processors need high bandwidth, but they can tolerate latency

32 Power Consumption Power saving techniques
Low power supply for logic (1.2 V)
Possible because of the low clock rate (200 MHz)
Wide vector datapaths provide high performance
Extensive clock gating and datapath disabling
Utilizing the explicit parallelism information of vector instructions and conditional execution
Simple, single-issue, in-order pipeline
Typical power consumption: 2.0 W
MIPS core: 0.5 W
Vector unit: 1.0 W (min ~0 W)
DRAM: 0.2 W (min ~0 W)
Misc.: 0.3 W (min ~0 W)
A number of architectural techniques were used to reduce the power consumption of Vector IRAM. First, the low clock frequency of 200 MHz allowed us to use a 1.2V logic power supply, and that was the major contribution to power saving. Performance on this chip does not come from high-frequency operation but from the use of wide datapaths on which we execute multiple vector element operations in parallel. We also make extensive use of clock gating and datapath disabling. For that we utilize the explicit parallelism information provided by vector instructions: each instruction describes the resources it will use for the next few cycles. We also use predicated execution for disabling additional resources and clocks. Finally, the simple, in-order nature of the vector unit leads to low power overhead for issue and control logic. We have estimated the typical power consumption of Vector IRAM to be 2 Watts. About half of that is for the vector unit, where the major contributor is the large register file. One quarter goes to the scalar core and another quarter to the DRAM and I/O logic. Note that when not used, most of the components in our design dissipate almost zero power, as they are dynamically turned off.

33 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD

34 VIRAM Compiler
[Diagram: C, C++, and Fortran95 frontends feed Cray’s PDGCS optimizer, which feeds code generators for T3D/T3E, C90/T90/SV1, and SV2/VIRAM.]
Based on Cray’s PDGCS production environment for vector supercomputers
Extensive vectorization and optimization capabilities, including outer-loop vectorization
No need to use special libraries or variable types for vectorization
Apart from the hardware, we have also worked on software development tools. We have a vectorizing compiler with C, C++, and Fortran front-ends. It is based on the production compiler by Cray for its vector supercomputers, which we ported to our architecture. It has extensive vectorization capabilities, including outer-loop vectorization. Using this compiler, one can vectorize applications written in high-level languages without necessarily using optimized libraries or “special” (non-standard) variable types.

35 Exploiting On-Chip Bandwidth
The vector ISA + compiler technology uses high bandwidth to mask latency
Compiled matrix-vector multiplication: 2 Flops/element
Easy compilation problem; stresses memory bandwidth
Compare to 304 Mflops (64-bit) for Power3 (hand-coded)
Performance normally scales with the number of lanes
Need more memory banks than the default DRAM macro provides
The IBM Power 3 number is from the latest LAPACK manual and is for BLAS2 (dgemv) performance, presumably hand-coded by IBM experts. The IRAM numbers are from the Cray compiler. This algorithm requires either strided accesses or reduction operations; the compiler uses the strided accesses. (Reductions are worse, because more time is spent with short vectors.) Because of the strided accesses, we start to have bank conflicts with more lanes. I think we had trouble getting the simulator to do anything reasonable with sub-banks, so this reports 16 banks rather than 8 banks with 2 sub-banks per bank. The BLAS numbers for the IBM are better than for most other machines without such expensive memory systems. E.g., the Pentium III is 141, the SGI O2K is 216, the Alpha Miata is 66, and the Sun Enterprise 450 is 267. Only the AlphaServer DS-20, at 372, beats VIRAM-1 (4 lanes, 8 banks) at 312. None of the IRAM numbers use a multiply-add; performance would definitely increase with that.
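For reference, a minimal C version of the matrix-vector kernel being measured might look like the sketch below; it is an illustrative reconstruction (row-major storage assumed), not the benchmark source. The one multiply and one add per matrix element are what the 2 Flops/element figure counts.

#include <stddef.h>

/* y = A * x, with A stored row-major as a[i*n + j]. */
void matvec(double *y, const double *a, const double *x, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];  /* one multiply + one add per element */
        y[i] = sum;
    }
}

As the notes explain, the compiler prefers to vectorize across the i loop, turning the column accesses into strided vector loads, rather than performing a reduction over j.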

36 Compiling Media Kernels on IRAM
The compiler generates code for narrow data widths, e.g., 16-bit integer
Compilation model is simple, more scalable (across generations) than MMX, VIS, etc.
Strided and indexed loads/stores simpler than pack/unpack
Maximum vector length is longer than the datapath width (256 bits); all lane scalings done with a single executable
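A small, assumed example of the narrow-width code this refers to: an averaging loop over 16-bit samples written in plain C. The compiler can vectorize it at the 16-bit element width, so each 256-bit datapath operates on 16 elements per cycle, with no MMX-style pack/unpack in the source.

#include <stdint.h>
#include <stddef.h>

/* Average two 16-bit sample streams; all data in the loop is 16-bit. */
void average16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)((a[i] + b[i]) / 2);
}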

37 Compiler Challenges Generate code for variable data type width
Vectorizer starts with the largest width (64b)
At the end, vectorization is discarded and restarted if the greatest width encountered is smaller
For simplicity, a single loop will use the largest width present in it
Consistency between scalar cache and DRAM
Problem when the vector unit writes cached data
Vector unit invalidates cache entries on writes
Compiler generates synchronization instructions
Vector after scalar, scalar after vector
Read after write, write after read, write after write
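The consistency issue can be seen in a small mixed scalar/vector sequence like the hedged sketch below: the loop is vectorized and its stores go to DRAM directly, while the later scalar read goes through the data cache, so the compiler must insert a scalar-after-vector synchronization (and the vector unit invalidates any cached copies) before the load. The code is an assumed illustration of the hazard, not compiler output.

#include <stddef.h>

float scale_and_peek(float *a, size_t n, float s)
{
    /* Vectorized loop: the vector unit writes a[] straight to DRAM,
     * bypassing the scalar data cache. */
    for (size_t i = 0; i < n; i++)
        a[i] *= s;

    /* Scalar read of data just written by the vector unit (read after write):
     * a scalar-after-vector sync is needed here so the scalar core does not
     * return a stale cached copy of a[0]. */
    return a[0];
}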

38 Outline
Motivation and goals
Vector instruction set
Vector IRAM prototype: microarchitecture and design
Vectorizing compiler
Performance
Comparison with SIMD

39 Performance: Efficiency
Kernel (Peak / Sustained / % of Peak):
Image Composition: 6.4 GOPS / 6.40 GOPS / 100%
iDCT: 6.4 GOPS / 3.10 GOPS / 48.4%
Color Conversion: 3.2 GOPS / 3.07 GOPS / 96.0%
Image Convolution: 3.2 GOPS / 3.16 GOPS / 98.7%
Integer VM Multiply: 3.2 GOPS / 3.00 GOPS / 93.7%
FP VM Multiply: 1.6 GFLOPS / 1.59 GFLOPS / 99.6%
Average: 89.4% of peak
In the next few slides I will talk about the performance of Vector IRAM, evaluated using simulations at this point of the design. This slide shows the theoretical peak and sustained throughput of Vector IRAM for a number of media kernels like image composition and convolution. Despite the lack of caches, VIRAM can achieve throughput very close to its peak for the majority of these kernels, 90% of it on average. iDCT is limited by memory conflicts and the ability to generate just four addresses per cycle for strided and indexed operations. What % of peak is delivered by superscalar or VLIW designs? 50%? 25%?

40 Performance: Comparison
Kernel (clock cycles per pixel, VIRAM / MMX, speedup in parentheses):
iDCT: 0.75 / 3.75 (5.0x)
Color Conversion: 0.78 / 8.00 (10.2x)
Image Convolution: 1.23 / 5.49 (4.5x)
Full frames (clock cycles per frame):
QCIF (176x144): 7.1M / 33M (4.6x)
CIF (352x288): 28M / 140M (5.0x)
MMX results assume no first-level cache misses.
This slide compares Vector IRAM to SIMD multimedia extensions for media kernels. The numbers presented here are in cycles per pixel (cycles per frame for QCIF and CIF). All results for the SIMD extensions assume that caches are preloaded. The numbers in parentheses are the speedups of the Vector IRAM architecture. You can see that Vector IRAM outperforms the SIMD extensions by 4 to 17 times, even for applications like iDCT where it does not achieve its peak performance. That means a processor using these extensions must be clocked 4 to 17 times faster to achieve similar performance, and that would lead to much higher power consumption. Vector instructions allow us to keep wide datapaths busy for multiple cycles, while a SIMD instruction can utilize a 64b datapath for only a single cycle. To utilize multiple SIMD datapaths, one must fetch, decode, and issue multiple instructions per cycle. VIRAM also supports vector load and store operations without any alignment restriction for the whole vector; only the individual elements have to be aligned. Vector load and store instructions take care of the alignment of the vector automatically. With SIMD extensions, on the other hand, one has to issue a number of load operations and shift/rotate instructions to achieve the functionality of a vector load.

41 Vector vs. SIMD
Vector: one instruction keeps multiple datapaths busy for many cycles. SIMD: one instruction keeps one datapath busy for one cycle.
Vector: wide datapaths can be used without changes in the ISA or issue logic redesign. SIMD: wide datapaths can be used either after changing the ISA or after changing the issue width.
Vector: strided and indexed vector load and store instructions. SIMD: simple scalar loads; multiple instructions needed to load a vector.
Vector: no alignment restriction for vectors; only individual elements must be aligned to their width. SIMD: short vectors must be aligned in memory; otherwise multiple instructions are needed to load them.

42 Vector Vs. SIMD: Example
Simple example: conversion from RGB to YUV
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G + 14221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
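A straightforward scalar C version of this conversion, useful as a reference for the VIRAM and MMX listings on the next slides, might look like the sketch below. It assumes 8-bit unsigned channels stored in separate arrays, uses the coefficients from the formulas above, and adds explicit clamping; details such as rounding and saturation may differ from the hand-tuned kernels.

#include <stdint.h>
#include <stddef.h>

static uint8_t clamp_u8(int32_t x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, uint8_t *u, uint8_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t R = r[i], G = g[i], B = b[i];
        y[i] = clamp_u8(( 9798 * R + 19235 * G +  3736 * B) / 32768);
        u[i] = clamp_u8((-4784 * R -  9437 * G + 14221 * B) / 32768 + 128);
        v[i] = clamp_u8((20218 * R - 16941 * G -  3277 * B) / 32768 + 128);
    }
}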

43 VIRAM Code (22 instrs, 16 arith)
RGBtoYUV:
vlds.u.b    r_v, r_addr, stride3, addr_inc    # load R
vlds.u.b    g_v, g_addr, stride3, addr_inc    # load G
vlds.u.b    b_v, b_addr, stride3, addr_inc    # load B
xlmul.u.sv  o1_v, t0_s, r_v                   # calculate Y
xlmadd.u.sv o1_v, t1_s, g_v
xlmadd.u.sv o1_v, t2_s, b_v
vsra.vs     o1_v, o1_v, s_s
xlmul.u.sv  o2_v, t3_s, r_v                   # calculate U
xlmadd.u.sv o2_v, t4_s, g_v
xlmadd.u.sv o2_v, t5_s, b_v
vsra.vs     o2_v, o2_v, s_s
vadd.sv     o2_v, a_s, o2_v
xlmul.u.sv  o3_v, t6_s, r_v                   # calculate V
xlmadd.u.sv o3_v, t7_s, g_v
xlmadd.u.sv o3_v, t8_s, b_v
vsra.vs     o3_v, o3_v, s_s
vadd.sv     o3_v, a_s, o3_v
vsts.b      o1_v, y_addr, stride3, addr_inc   # store Y
vsts.b      o2_v, u_addr, stride3, addr_inc   # store U
vsts.b      o3_v, v_addr, stride3, addr_inc   # store V
subu        pix_s, pix_s, len_s
bnez        pix_s, RGBtoYUV

44 MMX Code (part 1)
RGBtoYUV:
movq mm1, [eax]
pxor mm6, mm6
movq mm0, mm1
psrlq mm1, 16
punpcklbw mm0, ZEROS
movq mm7, mm1
punpcklbw mm1, ZEROS
movq mm2, mm0
pmaddwd mm0, YR0GR
movq mm3, mm1
pmaddwd mm1, YBG0B
movq mm4, mm2
pmaddwd mm2, UR0GR
movq mm5, mm3
pmaddwd mm3, UBG0B
punpckhbw mm7, mm6
pmaddwd mm4, VR0GR
paddd mm0, mm1
pmaddwd mm5, VBG0B
movq mm1, 8[eax]
paddd mm2, mm3
movq mm6, mm1
paddd mm4, mm5
movq mm5, mm1
psllq mm1, 32
paddd mm1, mm7
punpckhbw mm6, ZEROS
movq mm3, mm1
pmaddwd mm1, YR0GR
movq mm7, mm5
pmaddwd mm5, YBG0B
psrad mm0, 15
movq TEMP0, mm6
movq mm6, mm3
pmaddwd mm6, UR0GR
psrad mm2, 15
paddd mm1, mm5
movq mm5, mm7
pmaddwd mm7, UBG0B
psrad mm1, 15
pmaddwd mm3, VR0GR
packssdw mm0, mm1
pmaddwd mm5, VBG0B
psrad mm4, 15
movq mm1, 16[eax]

45 MMX Code (part 2)
paddd mm6, mm7
movq mm7, mm1
psrad mm6, 15
psllq mm7, 16
movq mm5, mm7
psrad mm3, 15
movq TEMPY, mm0
packssdw mm2, mm6
movq mm0, TEMP0
punpcklbw mm7, ZEROS
movq mm6, mm0
movq TEMPU, mm2
psrlq mm0, 32
paddw mm7, mm0
movq mm2, mm6
pmaddwd mm2, YR0GR
movq mm0, mm7
pmaddwd mm7, YBG0B
packssdw mm4, mm3
add eax, 24
add edx, 8
movq TEMPV, mm4
movq mm4, mm6
pmaddwd mm6, UR0GR
movq mm3, mm0
pmaddwd mm0, UBG0B
paddd mm2, mm7
pmaddwd mm4,
pxor mm7, mm7
pmaddwd mm3, VBG0B
punpckhbw mm1,
paddd mm0, mm6
movq mm6, mm1
pmaddwd mm6, YBG0B
punpckhbw mm5,
movq mm7, mm5
paddd mm3, mm4
pmaddwd mm5, YR0GR
movq mm4, mm1
pmaddwd mm4, UBG0B
psrad mm0, 15
paddd mm0, OFFSETW
psrad mm2, 15
paddd mm6, mm5
movq mm5, mm7

46 MMX Code (pt. 3: 121 instrs, 40 arith)
pmaddwd mm7, UR0GR
psrad mm3, 15
pmaddwd mm1, VBG0B
psrad mm6, 15
paddd mm4, OFFSETD
packssdw mm2, mm6
pmaddwd mm5, VR0GR
paddd mm7, mm4
psrad mm7, 15
movq mm6, TEMPY
packssdw mm0, mm7
movq mm4, TEMPU
packuswb mm6, mm2
movq mm7, OFFSETB
paddd mm1, mm5
paddw mm4, mm7
psrad mm1, 15
movq [ebx], mm6
packuswb mm4,
movq mm5, TEMPV
packssdw mm3, mm4
paddw mm5, mm7
paddw mm3, mm7
movq [ecx], mm4
packuswb mm5, mm3
add ebx, 8
add ecx, 8
movq [edx], mm5
dec edi
jnz RGBtoYUV

47 Performance: FFT (1)
This slide presents the performance of VIRAM for floating-point and fixed-point FFT and how it compares to that of various other architectures. The code for all machines on this slide is hand-optimized. First, for floating-point FFT: VIRAM outperforms VLIW-based DSP chips like TigerSHARC and VelociTI by factors of 2 to 3. It also comes close to specialized FFT systems like the Pathfinder and Wildstar boards.

48 Performance: FFT (2)
This slide shows similar results for fixed-point FFT. Vector IRAM performs as well as DSP architectures and specialized systems, while it outperforms general-purpose processors by an order of magnitude.

49 Conclusions (1/2) Vector IRAM Advantages
An integrated architecture for media processing
Based on vector processing and embedded DRAM
Simple, scalable, and efficient
Advantages:
High operation throughput with low instruction bandwidth
Parallel lanes exploit long vector parallelism
Parallel execution allows reduction in clock frequency and voltage
Embedded DRAM provides sufficient bandwidth for the vector architecture
Vector processor tolerates the latency of embedded DRAM
Most of the area is used for resources controlled by the processor
Modularity simplifies design and scaling, and limits long-wire latency

50 Conclusions (2/2) One thing to keep in mind
Use the most efficient solution to exploit each level of parallelism
Make the best solutions for each level work together
Vector processing is very efficient for data-level parallelism
Levels of parallelism and candidate efficient solutions:
Multi-programming: clusters? NUMA? SMP?
Thread: MT? SMT? CMP?
Irregular ILP: VLIW? superscalar?
Data: VECTOR
To conclude my talk, today I have presented to you Vector IRAM. This is an integrated architecture for media processing that combines a 256-bit vector unit with 16 MBytes of embedded DRAM. It uses 150 million transistors and 300 square mm. At just 200 MHz, it achieves 3.2 giga ops for 32b integers and consumes 2 Watts. It is a simple, scalable design that is efficient in terms of performance, power, and area. The current status of the prototype design is the following: we are in the verification and back-end stage of the design. RTL development and the design of several full-custom components have been completed. We expect to tape out the design by the end of the year. The compiler is also operational and is being tuned for performance. We are also working on applications for this system.

51 Backup slides Before I answer any questions I would like to acknowledge the help of several companies and individuals with this research project.

52 Delayed Vector Pipeline
[Pipeline diagram: scalar stages F D R E M W; the vector load pipeline (VLD) with address generation, translation, and DRAM access (latency >25ns); the vector arithmetic pipeline (VADD) with added DELAY stages before its VR, VX, and VW stages; and the vector store pipeline (VST); a load→add RAW hazard is illustrated for a vld/vadd/vst sequence.]
This slide presents how the delayed pipeline works. When a vector load operation is issued by the scalar core to the vector unit, it may take several cycles for data to start arriving from the memory. If a following arithmetic instruction that uses that data were to be executed immediately by the vector unit, we would have a long RAW hazard, and therefore a large number of stall cycles and a performance decrease. To overcome this, we include the latency of a random DRAM access in the pipeline for all operations. In other words, the execution of arithmetic and store operations is delayed by a number of cycles so that the RAW hazard is eliminated or significantly shortened. This eliminates long pipeline stalls for the common loop cases. The resulting pipeline is, of course, deep; in our case it has 15 stages.
Random access latency included in the vector unit pipeline
Arithmetic operations and stores are delayed to shorten RAW hazards
Long hazards eliminated for the common loop cases
Vector pipeline length: 15 stages

53 Handling Memory Conflicts
Single sub-bank DRAM macro can lead to memory conflicts for non-sequential access patterns
Solution 1: address interleaving
Selects between 3 address interleaving modes for each virtual page
Solution 2: address decoupling buffer (128 slots)
Allows scheduling of long indexed accesses without stalling the arithmetic operations executing in parallel
Another potential performance bottleneck is that of memory conflicts. While sequential accesses or accesses with a small stride can be served with a single page access from DRAM, large strided or indexed accesses require accessing multiple DRAM rows. If those rows are in the same bank, then we have DRAM bank conflicts. If the DRAM macro used had a multi-bank structure, then accesses to different rows could be overlapped, and this would not be a significant performance problem. Unfortunately, the macro available to us has a single-bank structure, and accesses to different rows have to be fully serialized. To handle this problem we employ two solutions. First, we use 3 interleaving schemes for mapping addresses to memory locations. The first one is optimized for the performance and power consumption of unit-stride accesses, the second one for small strides, and the third one for large strides. The interleaving scheme can be selected independently for each virtual page. The second solution is to use a buffer for address decoupling. This allows the hardware to buffer the addresses of one vector memory instruction. These accesses are served by the memory system as soon as possible, without stalling independent instructions executing in the arithmetic units at that time. Hence, properly scheduling instructions that may generate a large number of conflicts allows maintaining high performance. The figure here presents the simulated performance of an 8x8 iDCT on Vector IRAM, which uses large strides. We have simulated DRAM macros with 1, 2, and 4 sub-banks, as well as with 1 sub-bank and address decoupling support. Address decoupling achieves the performance level of a DRAM macro with two sub-banks, about 60% higher than that of the single-bank structure.

54 Hardware Exposed to Software
Pentium® III: <25% of area for registers and datapaths
The rest is still useful, but not visible to software
Cannot be turned off if not needed
The blocks in blue are those used to execute operations (SSE/MMX included). The rest are used for instruction and data fetching (speculation, reordering, etc.), tolerating latency (2-level caching), and solving dependency problems (write buffers). In other words, 80% of the die is there to keep about 20% of it busy. This is bad engineering. Extra hardware means: a) larger cost, b) larger complexity (these xtors must be verified), and c) higher power (more transistors switch per operation). While these techniques (speculation, reordering, caching) are very useful and we should definitely keep using them in some form, we cannot rely just on them (just on HW) to keep increasing performance at rates similar to those of the last 10 years. Caching and speculation techniques are running out of steam and are becoming less and less efficient in terms of performance benefit versus complexity and die area increase. Multimedia applications change the name of the game as well. Due to real-time performance requirements, out-of-order architectures are more difficult to use. Since temporal locality is not always there, the caches create a performance problem; that’s why cache bypass buffers are built into several processors (how embarrassing). Caches are also slow and difficult to manage optimally in HW when they are large.

