
1 IRAM and ISTORE Projects
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson
Fall 2000 DIS DARPA Meeting

2 IRAM and ISTORE Vision
- Integrated processor in memory provides efficient access to high memory bandwidth
- Two "Post-PC" applications:
  - IRAM: single-chip system for embedded and portable applications; targets media processing (speech, images, video, audio)
  - ISTORE: building block when combined with a disk for storage and retrieval servers; up to 10K nodes in one rack
- Non-IRAM prototype addresses key scaling issues: availability, manageability, evolution
(Photo from Itsy, Inc.)

3 IRAM Overview
- A processor architecture for embedded/portable systems running media applications
- Based on media processing and embedded DRAM
- Simple, scalable, energy- and area-efficient
- Good compiler target
[Block diagram: MIPS64™ 5Kc core with 8 KB instruction and data caches, FPU, and coprocessor interface; vector unit with 8 KB vector register file, 512 B flag register file, two arithmetic units (Arith 0/1) and two flag units (Flag 0/1) on 256b datapaths; memory unit with TLB; SysAD and JTAG interfaces; DMA; memory crossbar; DRAM0-DRAM7 (2 MB each)]

I will start with what is interesting about Vector IRAM. This is a prototype microprocessor that integrates a vector unit with 256-bit datapaths with a 16 MByte embedded DRAM memory system. The design uses 150 million transistors and occupies nearly 300 square mm. While operating at just 200 MHz, Vector IRAM achieves 3.2 giga-ops and consumes 2 Watts. Vector IRAM also comes with an industrial-strength vectorizing compiler for software development. Vector IRAM is being implemented by a group of only 6 graduate students, responsible for architecture, design, simulation, and testing. So, if Patterson and Hennessy decide to introduce performance/watt/man-year as a major processor metric in the new version of their book, this processor will likely be one of the best in this class.

4 Architecture Details
- MIPS64™ 5Kc core (200 MHz): single-issue scalar core with 8 KByte I & D caches
- Vector unit (200 MHz)
  - 8 KByte register file (32 64b elements per register)
  - 256b datapaths, can be subdivided into 16b, 32b, 64b
  - 2 arithmetic units (1 with single-precision FP), 2 flag-processing units
- Memory unit
  - 4 address generators for strided/indexed accesses
- Main memory system
  - 8 2-MByte DRAM macros
  - 25 ns random access time, 7.5 ns page access time
  - Crossbar interconnect: 12.8 GBytes/s peak bandwidth per direction (load/store)
- Off-chip interface
  - 2-channel DMA engine and 64b SysAD bus

The vector unit is connected to the coprocessor interface of the MIPS processor and runs at 200 MHz. It includes a multiported 8 KByte register file, which allows each of the 32 registers to hold 32 64-bit elements or 64 32-bit elements, and so on. The flag register file has a capacity of half a KByte. There are two functional units for arithmetic operations. Both can execute integer and logical operations, but only one can execute floating-point operations. There are also 2 flag-processing units, which provide support for predicated execution and exception handling. Each of the functional units has a 256-bit pipelined datapath. On each cycle, 4 64b operations or 8 32b operations or 16 16b operations can execute in parallel. To simplify the design and reduce area requirements, our prototype does not implement 8b integer operations or double-precision arithmetic. All operations excluding divides are fully pipelined. The vector coprocessor also includes one memory (load/store) unit. The LSU can exchange up to 256b per cycle with the memory system and has four address generators for strided and indexed accesses. Address translation is performed in a two-level TLB structure. The hardware-managed, first-level microTLB has four entries and four ports, while the main TLB has 32 double-page entries and a single access port. The main TLB is managed by software. The memory unit is pipelined, and up to 64 independent accesses may be pending at any time. The 64b SysAD bus connects to the external chip set at 100 MHz.

5 Floorplan
- Technology: IBM SA-27E, 0.18 µm CMOS, 6 metal layers
- 290 mm2 die area (14.5 mm x 20.0 mm); 225 mm2 for memory/logic
- Transistor count: ~150M
- Power supply: 1.2V for logic, 1.8V for DRAM
- Typical power consumption: 2.0 W total; 0.5 W (scalar), with the remainder split across the vector unit, DRAM, and misc.
- Peak vector performance (derivation sketched below)
  - 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations); 3.2/6.4/12.8 Gops with multiply-add
  - 1.6 Gflops (single-precision)
- Tape-out planned for March '01

This figure presents the floorplan of Vector IRAM. It occupies nearly 300 square mm and uses 150 million transistors in a 0.18 µm CMOS process by IBM. Blue blocks on the floorplan indicate DRAM macros or compiled SRAM blocks. Golden blocks are those designed at Berkeley; they include synthesized logic for control and the FP datapaths, and full-custom logic for register files, integer datapaths, and DRAM. Vector IRAM operates at 200 MHz. The power supply is 1.2V for logic and 1.8V for DRAM. The peak performance of the vector unit is 1.6 giga-ops for 64-bit integer operations; performance doubles or quadruples for 32b and 16b operations, respectively. Peak floating-point performance is 1.6 Gflops. There are several interesting things to notice on the floorplan. First, the overall design modularity and scalability: it mostly consists of replicated DRAM macros and vector lanes connected through a crossbar. Another very interesting feature is the percentage of this design directly visible to software. Compilers can control any part of the design that is registers, datapaths, or main memory; they do that by scheduling the proper arithmetic or load/store instructions. The majority of our design is used for main memory, vector registers, and datapaths. On the other hand, if you look at a processor like the Pentium III, you will see that less than 20% of its area is used for datapaths and registers; the rest is caches and dynamic issue logic. While these usually work to the benefit of applications, they cannot be controlled by the compiler and cannot be turned off when not needed.
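As a sanity check on the peak figures above, here is a back-of-the-envelope derivation (mine, not from the slides), assuming both 256b arithmetic units issue integer operations every cycle while only one executes single-precision floating point, as described on slide 4:

```latex
% Peak ops/s at element width w (a multiply-add counts as 2 ops where noted)
\mathrm{Peak}(w) = 2\ \text{units} \times \frac{256}{w}\ \text{elements} \times 200\,\text{MHz}
% w = 64b: 2 x 4  x 0.2 GHz = 1.6 Gops   (3.2 Gops with multiply-add)
% w = 32b: 2 x 8  x 0.2 GHz = 3.2 Gops
% w = 16b: 2 x 16 x 0.2 GHz = 6.4 Gops
% Single-precision FP uses only one unit: 8 x 0.2 GHz = 1.6 Gflops
```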

6 Alternative Floorplans
- "VIRAM-8MB": 4 lanes, 8 MBytes, 190 mm2, 3.2 Gops at 200 MHz (32-bit ops)
- "VIRAM-2Lanes": 2 lanes, 4 MBytes, 120 mm2, 1.6 Gops at 200 MHz
- "VIRAM-Lite": 1 lane, 2 MBytes, 60 mm2, 0.8 Gops at 200 MHz

7 VIRAM Compiler
[Compiler structure diagram: frontends (C, C++, Fortran95) feed Cray's PDGCS optimizer, which feeds code generators for T3D/T3E, C90/T90/SV1, and SV2/VIRAM]
- Based on Cray's production compiler
- Challenges: narrow data types and scalar/vector memory consistency
- Advantages relative to media extensions: powerful addressing modes and an ISA independent of datapath width

Apart from the hardware, we have also worked on software development tools. We have a vectorizing compiler with C, C++, and Fortran front-ends. It is based on the production compiler by Cray for its vector supercomputers, which we ported to our architecture. It has extensive vectorization capabilities, including outer-loop vectorization. Using this compiler, one can vectorize applications written in high-level languages without necessarily using optimized libraries or "special" (non-standard) variable types; a loop of the kind it handles is sketched below.
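As an illustration (not taken from the original slides), this is the kind of plain C loop such a vectorizing compiler can map onto vector loads, a multiply-add, and vector stores with no intrinsics or special types; the function name and argument layout are hypothetical:

```c
#include <stddef.h>

/* saxpy-style loop: y[i] += a * x[i]
 * Unit-stride accesses and no loop-carried dependence, so the
 * compiler can vectorize it regardless of the hardware's datapath
 * width. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```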

8 Exploiting On-Chip Bandwidth
- Vector ISA uses high bandwidth to mask latency
- Compiled matrix-vector multiplication: 2 flops/element (kernel sketched below)
  - Easy compilation problem; stresses memory bandwidth
  - Compare to 304 Mflops (64-bit) for Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
- Need more memory banks than the default DRAM macro provides for 8 lanes

The IBM Power 3 number is from the latest LAPACK manual and is for BLAS2 (dgemv) performance, presumably hand-coded by IBM experts. The IRAM numbers are from the Cray compiler. This algorithm requires either strided accesses or reduction operations; the compiler uses the strided accesses. (Reductions are worse, because more time is spent with short vectors.) Because of the strided accesses, we start to have bank conflicts with more lanes. We had trouble getting the simulator to do anything reasonable with sub-banks, so this reports 16 banks rather than 8 banks with 2 sub-banks per bank. The BLAS numbers for the IBM are better than for most other machines without such expensive memory systems; e.g., the Pentium III is 141, the SGI O2K is 216, the Alpha Miata is 66, and the Sun Enterprise 450 is 267. Only the AlphaServer DS-20, at 372, beats VIRAM-1 (4 lanes, 8 banks) at 312. None of the IRAM numbers use a multiply-add; performance would definitely increase with that.
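For reference, a minimal C version of a matrix-vector multiply kernel of this kind (2 flops, a multiply and an add, per matrix element). The row-major layout and the function name are illustrative, not the benchmark's exact source:

```c
#include <stddef.h>

/* y = A*x for an m-by-n row-major matrix.
 * Depending on which loop the compiler vectorizes and how A is laid
 * out, the generated code uses strided/unit-stride vector loads of A
 * or a vector reduction for each dot product. */
void mvm(size_t m, size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* 1 multiply + 1 add per element */
        y[i] = sum;
    }
}
```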

9 Compiling Media Kernels on IRAM
- The compiler generates code for narrow data widths, e.g., 16-bit integer
- Compilation model is simple and more scalable (across generations) than MMX, VIS, etc.
  - Strided and indexed loads/stores are simpler than pack/unpack
  - Maximum vector length is longer than the datapath width (256 bits); all lane scalings are handled by a single executable
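To make the contrast with MMX/VIS concrete, here is an illustrative 16-bit media kernel (my example, not from the slides) written as plain C with fixed-width types; a vectorizing compiler can execute it at 16b element width directly, where SIMD extensions would typically need intrinsics and explicit pack/unpack:

```c
#include <stddef.h>
#include <stdint.h>

/* Alpha-blend two 16-bit sample rows: dst = (a*src1 + (256-a)*src2) >> 8,
 * with alpha in [0, 256].  The weights sum to 256, so the result stays
 * within the int16_t range and the int32_t intermediate cannot overflow. */
void blend16(size_t n, uint16_t alpha,
             const int16_t *src1, const int16_t *src2, int16_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        int32_t v = (int32_t)alpha * src1[i]
                  + (int32_t)(256 - alpha) * src2[i];
        dst[i] = (int16_t)(v >> 8);
    }
}
```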

10 IRAM Status
- Chip
  - ISA has not changed significantly in over a year
  - Verilog complete, except SRAM for the scalar cache
  - Testing framework in place
- Compiler
  - Backend code generation complete
  - Continued performance improvements, especially for narrow data widths
- Applications & Benchmarks
  - Hand-coded kernels better than MMX, VIS, and general-purpose DSPs: DCT, FFT, MVM, convolution, image composition, …
  - Compiled kernels demonstrate ISA advantages: MVM, sparse MVM, decrypt, image composition, …
  - Full applications: H.263 encoding (done), speech (underway)

To conclude my talk, today I have presented to you Vector IRAM, an integrated architecture for media processing that combines a 256-bit vector unit with 16 MBytes of embedded DRAM. It uses 150 million transistors and 300 square mm. At just 200 MHz, it achieves 3.2 giga-ops for 32b integers and consumes 2 Watts. It is a simple, scalable design that is efficient in terms of performance, power, and area. The current status of the prototype design is the following: we are in the verification and back-end stage of the design; RTL development and the design of several full-custom components have been completed; and we expect to tape out the design by the end of the year. The compiler is also operational and is being tuned for performance. We are also working on applications for this system.

11 Scaling to 10K Processors
- IRAM + micro-disk offer huge scaling opportunities
- Still many hard system problems (AME):
  - Availability: systems should continue to meet quality-of-service goals despite hardware and software failures
  - Maintainability: systems should require only minimal ongoing human administration, regardless of scale or complexity
  - Evolutionary growth: systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and will only get worse as systems grow

12 Is Maintenance the Key?
- Rule of thumb: maintenance costs 10X the hardware, so over a 5-year product life, ~95% of cost is maintenance
- VAX crashes in '85 and '93 [Murp95], extrapolated to '01: HW/OS accounted for 70% of crashes in '85 and 28% in '93. In '01, 10%?

[Murp95] Murphy, B. and Gent, T. "Measuring system and software reliability using an automated data collection process." Quality and Reliability Engineering International, vol. 11, no. 5, Sept.-Oct. 1995.

13 Hardware Techniques for AME
- Cluster of Storage-Oriented Nodes (SON)
  - Scalable, tolerates partial failures, automatic redundancy
- Heavily instrumented hardware
  - Sensors for temperature, vibration, humidity, power, intrusion
- Independent diagnostic processor on each node
  - Remote control of power; collects environmental data
  - Diagnostic processors connected via an independent network
- On-demand network partitioning/isolation
  - Allows testing and repair of an online system
  - Managed by the diagnostic processor
- Built-in fault-injection capabilities
  - Used for hardware introspection
  - Important for AME benchmarking

14 Storage-Oriented Node “Brick”
ISTORE-1 system
- Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault-injection hardware
  - Intelligence used to collect and filter monitoring data
  - Diagnostics and fault injection enhance robustness
  - Networked to create a scalable shared-nothing cluster
- Scheduled for 4Q 00

ISTORE chassis
- 80 nodes, 8 per tray
- 2 levels of switches: 100 Mb/s and 1 Gb/s
- Environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors, ...

Storage-oriented node ("brick")
- Portable PC CPU: Pentium II/266 + DRAM
- Redundant NICs (4 100 Mb/s links)
- Diagnostic processor
- Disk
- Half-height canister

More CPU per node than most existing NAS devices, and fewer disks per node than most clusters. This gives us research flexibility: it runs existing cluster apps while leaving lots of free CPU for introspection.

15 ISTORE-1 System Layout
- PE1000s: PowerEngines 100 Mb/s switches
- PE5200s: PowerEngines 1 Gb/s switches
- UPSs: "used"
[Rack layout diagram: 4 patch panels, 2 PE5200 switches, 10 brick shelves, and 6 UPSs]

16 ISTORE Brick Node Block Diagram
[Block diagram: Mobile Pentium II module with North Bridge and South Bridge, 256 MB DRAM, BIOS, Super I/O, Flash, RTC, and RAM; SCSI interface to an 18 GB disk; 4x100 Mb/s Ethernets on PCI; and a diagnostic processor (dual UART, monitor & control) on the diagnostic network, with sensors for heat and vibration and control over power to individual nodes]

17 ISTORE Brick Node
- Pentium II / 266 MHz
- 256 MB DRAM
- 18 GB SCSI (or IDE) disk
- 4 x 100 Mb/s Ethernet
- m68k diagnostic processor & CAN diagnostic network
- Packaged in a standard half-height RAID array canister

18 Software Techniques
- Reactive introspection: "mining" available system data
- Proactive introspection: isolation + fault insertion => test recovery code
- Semantic redundancy: use of coding and application-specific checkpoints
- Self-scrubbing data structures: check (and repair?) complex distributed structures
- Load adaptation for performance faults: dynamic load balancing for "regular" computations
- Benchmarking: define quantitative evaluations for AME

19 Network Redundancy
- Each brick node has 4 100 Mb/s Ethernets
- TCP striping used for performance (a simplified sketch appears below)
- Demonstration on a 2-node prototype using 3 links
  - When a link fails, packets on that link are dropped
  - Nodes detect failures using independent pings
- More scalable approach being developed
[Graph: aggregate throughput in Mb/s across the striped links]
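A minimal sketch (not the project's actual code) of the idea behind striping with failure masking: send stripes round-robin across the links and skip any link whose ping-based health flag has been cleared. The link structure and function names are hypothetical, and the real send is reduced to a printf:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINKS 4

struct link {
    int  fd;    /* socket for one physical 100 Mb/s Ethernet link  */
    bool up;    /* cleared when independent pings stop answering   */
};

/* Pick the next usable link, round-robin, skipping links marked down.
 * Returns -1 if every link has failed. */
static int next_link(struct link links[NUM_LINKS], int *rr)
{
    for (int tries = 0; tries < NUM_LINKS; tries++) {
        int i = (*rr + tries) % NUM_LINKS;
        if (links[i].up) {
            *rr = (i + 1) % NUM_LINKS;
            return i;
        }
    }
    return -1;
}

int main(void)
{
    struct link links[NUM_LINKS] = {
        {3, true}, {4, true}, {5, true}, {6, true}   /* placeholder fds */
    };
    int rr = 0;

    links[2].up = false;   /* the ping monitor declared link 2 dead */
    for (int k = 0; k < 6; k++) {
        int i = next_link(links, &rr);
        /* a real implementation would write() the next stripe to links[i].fd */
        printf("stripe %d -> link %d\n", k, i);
    }
    return 0;
}
```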

20 Load Balancing for Performance Faults
- Failure is not always a discrete property
  - Some fraction of components may fail
  - Some components may perform poorly
- Graph shows the effect of "Graduated Declustering" on cluster I/O with disk performance faults

21 Availability benchmarks
- Goal: quantify variation in QoS as fault events occur
- Leverage existing performance benchmarks: to generate fair workloads, and to measure & trace quality-of-service metrics
- Use fault injection to compromise the system
- Results are most accessible graphically (a harness sketch follows)
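A minimal harness sketch of this methodology (my illustration, with stand-in functions, not the project's benchmark code): run the workload continuously, sample a QoS metric per interval, and inject a fault partway through so the trace shows degradation and recovery over time:

```c
#include <stdbool.h>
#include <stdio.h>

#define INTERVALS  60   /* e.g., one QoS sample per second          */
#define FAULT_AT   20   /* inject the fault at the 20th interval    */

/* Stand-ins for the real workload driver and fault injector; a real
 * harness would drive an actual service and the fault-injection hooks. */
static bool faulted = false;

static void inject_fault(const char *kind)
{
    printf("# injecting fault: %s\n", kind);
    faulted = true;
}

static double run_workload_interval(void)
{
    /* placeholder QoS numbers, purely to show the trace format */
    return faulted ? 55.0 : 100.0;
}

int main(void)
{
    printf("interval,requests_per_sec\n");
    for (int t = 0; t < INTERVALS; t++) {
        if (t == FAULT_AT)
            inject_fault("disk-failure");
        printf("%d,%.1f\n", t, run_workload_interval());
    }
    return 0;
}
```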

22 Example: Faults in Software RAID
[Graphs: QoS over time during RAID reconstruction, one panel for Linux and one for Solaris]
- Compares Linux and Solaris reconstruction
  - Linux: minimal performance impact, but a longer window of vulnerability to a second fault
  - Solaris: large performance impact, but restores redundancy quickly

23 Towards Manageability Benchmarks
- Goal is to gain experience with a small piece of the problem: can we measure the time and learning-curve costs for one task?
- Task: handling disk failure in a RAID system, including detection and repair
- Same test systems as the availability case study: Windows 2000/IIS, Linux/Apache, Solaris/Apache
- Five test subjects and a fixed training session (too small to draw statistical conclusions)

24 Sample results: time
Graphs plot human time, excluding wait time.

25 Analysis of time results
- Rapid convergence across all OSs/subjects despite high initial variability
  - The final plateau defines the "minimum" time for the task
  - The plateau is invariant over individuals/approaches
- Clear differences in plateaus between OSs: Solaris < Windows < Linux
  - Note: statistically dubious conclusion given the sample size!

26 ISTORE Status
- Hardware
  - All 80 nodes (boards) manufactured
  - PCB backplane: in layout
  - Finish 80-node system: December 2000
- Software
  - 2-node system running; boots OS
  - Diagnostic processor software and device driver done
  - Network striping done; fault adaptation ongoing
  - Load balancing for performance heterogeneity done
- Benchmarking
  - Availability benchmark example complete
  - Initial maintainability benchmark complete; revised strategy underway

27 BACKUP SLIDES IRAM

28 IRAM Latency Advantage
- 1997 estimate: 5-10x improvement
  - No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins…
  - 30 ns for IRAM (or much lower with a DRAM redesign)
  - Compare to Alpha 600: 180 ns for 128b; 270 ns for 512b
- 2000 estimate: 5x improvement
  - IRAM memory latency is 25 ns for 256 bits (fixed pipeline delay)
  - Alpha 4000/4100: 120 ns
(Latency is one reason to innovate inside the DRAM, even compared to the latest Alpha.)

29 IRAM Bandwidth Advantage
- 1997 estimate: 100x
  - 1024 1-Mbit modules, each 1 Kb wide (1 Gb chip), 40 ns RAS/CAS = 320 GBytes/sec
  - If a crossbar switch or multiple busses deliver 1/3 to 2/3 of the total => … GBytes/sec
  - Compare to: AlphaServer 8400 = 1.2 GBytes/sec (1.1 GBytes/sec delivered)
- 2000 estimate: …x
  - VIRAM-1: 16 MB chip divided into 8 banks => 51.2 GB/s peak from the memory banks
  - The crossbar can consume 12.8 GB/s: 6.4 GB/s from the vector unit, … GB/s from either scalar or I/O
(Bandwidth is a second reason to innovate inside the DRAM; the Alpha Server comparison uses delivered bandwidth.)

30 Power and Energy Advantages
- 1997 case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
  - Cell-size advantages => much larger cache => fewer off-chip references => up to 2X-4X energy efficiency for memory-intensive algorithms
  - Less energy per bit access for DRAM
- Power target for VIRAM-1
  - 2 watt goal
  - Based on preliminary SPICE runs, this looks very feasible today
(Scalar core included; bigger caches or less memory on board; cache in a logic process vs. SRAM in an SRAM process vs. DRAM in a DRAM process. Main reason.)

31 Summary
- IRAM takes advantage of high on-chip bandwidth
- The Vector IRAM ISA utilizes this bandwidth
  - Unit-stride, strided, and indexed memory access patterns supported
  - Exploits fine-grained parallelism, even with pointer chasing
- Compiler
  - Well-understood compiler model, semi-automatic
  - Still some work on code-generation quality
- Application benchmarks
  - Compiled and hand-coded
  - Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing

