IRAM Vision
  Microprocessor & DRAM on a single chip:
    on-chip memory latency 5-10X, bandwidth 50-100X
    improve energy efficiency 2X-4X (no off-chip bus)
    serial I/O 5-10X v. buses
    smaller board area/volume
    adjustable memory size/width
  $B for separate lines for logic and memory
  Single chip: either processor in DRAM or memory in logic fab
  [Figure: conventional system (Proc, $, L2$, Bus, I/O from a logic fab; DRAM from a DRAM fab) v. IRAM (Proc, Bus, DRAM, I/O on one chip)]
IRAM Update
  2 test chips: serial lines (MOSIS) + Embedded DRAM/Crossbar (LG Semicon)
  Simulator/Architecture Manual Completed
  Initial Compiler (“VIC”) Completed
  Partner for scalar processor (Sandcraft/MIPS)
  LG delays, prospects => stick to plan to re-evaluate options for IRAM prototype
    Foundry: TSMC, UMC
    DRAM companies: IBM, Micron, NEC, Toshiba
  Applications: FFT, segmentation, ...
IRAM App: ISTORE (“Intelligent Storage”)
  1 IRAM/DRAM + crossbar switch + fast serial link v. conventional SMP
  Move function to data v. data to CPU
  [Figure: conventional CPU (Proc, $, L2$, Bus, I/O) v. IRAMs connected by a crossbar]
  How does TPC-D scale with dataset size?
    Compare NCR 5100M 20-node system (each node is 8 133 MHz Pentium CPUs), March 28, 1997; 100 GB, 300 GB, 1000 GB
    Per 19 queries, all but 2 go up linearly with database size (3-5 vs. 300, 7-15 vs. 1000)
    e.g., interval time ratios 300/100 = 3.35; 1000/100 = 9.98; 1000/300 = 2.97
  How much memory for IBM SP2 node?
    100 GB: 12 processors with 24 GB
    300 GB: 128 thin nodes with 32 GB total; 256 MB/node (2 boards/processor)
  TPC-D is business analysis vs. business operation
    17 read-only queries; results in queries per Gigabyte Hour
    Scale Factor (SF) multiplies each portion of the data: 10 to 10000
    SF 10 is about 10 GB; indices + temp table increase 3X-5X
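The scaling figures in the TPC-D notes above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only inputs are the interval time ratios and the IBM SP2 memory figures quoted in the notes. A value of 1.0 from `scaling_efficiency` would mean run time grows exactly in proportion to database size.

```python
# Interval-time ratios quoted in the notes (NCR 5100M, TPC-D), keyed by
# (larger database size in GB, smaller database size in GB).
TIME_RATIOS = {(300, 100): 3.35, (1000, 100): 9.98, (1000, 300): 2.97}


def scaling_efficiency(big_gb, small_gb, time_ratio):
    """Return time_ratio / size_ratio; 1.0 means perfectly linear scaling."""
    size_ratio = big_gb / small_gb
    return time_ratio / size_ratio


def mb_per_node(total_gb, nodes):
    """Memory per node in MB, taking 1 GB = 1024 MB."""
    return total_gb * 1024 / nodes


if __name__ == "__main__":
    for (big, small), ratio in TIME_RATIOS.items():
        eff = scaling_efficiency(big, small, ratio)
        print(f"{big} GB vs {small} GB: time/size ratio = {eff:.3f}")
    # 300 GB IBM SP2 configuration: 32 GB spread over 128 thin nodes.
    print(f"MB per thin node: {mb_per_node(32, 128):.0f}")
```

All three ratios land within roughly 12% of 1.0, which is consistent with the note that query time goes up close to linearly with database size, and 32 GB over 128 nodes gives the 256 MB/node figure quoted above.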
Another Vision of ISTORE
  1 IRAM/disk + xbar + fast serial link v. conventional SMP, cluster
  Network latency = f(SW overhead), not link distance
  Move function to data v. data to CPU (scan, sort, join, ...)
  Cost/performance, more scalable
  [Figure: CPU/Memory; arrays of IRAMs connected by crossbars]
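One way to picture "move function to data v. data to CPU" for a scan is to count how many records cross the interconnect in each case. The sketch below is a toy model, not the ISTORE design: the `Node` class, both scan functions, and the record layout are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One hypothetical storage node: a processor next to its own slice of data."""
    records: list  # integers standing in for table rows

    def scan_local(self, predicate):
        # Function moves to the data: filter on the node, return only matches.
        return [r for r in self.records if predicate(r)]


def scan_at_cpu(nodes, predicate):
    """Data moves to the CPU: every record crosses the interconnect first."""
    shipped = [r for node in nodes for r in node.records]
    matches = [r for r in shipped if predicate(r)]
    return matches, len(shipped)


def scan_at_nodes(nodes, predicate):
    """Function moves to the data: only matching records cross the interconnect."""
    matches = [r for node in nodes for r in node.scan_local(predicate)]
    return matches, len(matches)


if __name__ == "__main__":
    # Four nodes holding 1000 records each (made-up sizes for illustration).
    nodes = [Node(list(range(i * 1000, (i + 1) * 1000))) for i in range(4)]

    def is_selected(r):
        return r % 100 == 0

    _, moved_to_cpu = scan_at_cpu(nodes, is_selected)
    _, moved_from_nodes = scan_at_nodes(nodes, is_selected)
    print("records moved, data to CPU:", moved_to_cpu)
    print("records moved, function to data:", moved_from_nodes)
```

Both strategies return the same matches. The difference is the traffic: for a selective predicate, the node-side scan ships only the qualifying records.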
ISTORE Update
  Build prototypes to gain experience, develop software before IRAM chips arrive
  Replace with IRAM chips once available
  ISTORE-0: 2 Sandcraft Development boards + Fast Ethernet + Real-time OS
  ISTORE-1: Design small board (CPU, DRAM, Ethernet) and place inside disk enclosure, build 64-128 node system (Ethernet switch)
  ISTORE-2: “Intelligent SIMM” module based on Mitsubishi M32RXD (DRAM interface + CPU)
IRAM/ISTORE Schedule
  [Figure: schedule chart with rows IRAM, ISTORE/OS, Compiler]
1998 IRAM/ISTORE Presentations, Articles
  MicroDesign Resources Dinner Meeting, 1/8/98
  Embedded Memory Workshop, Japan, 3/15/98
  Stanford Computer Science Colloquium, 5/6/98
  University of Virginia Distinguished Lecture, 5/19/98
  SIGMOD98 Keynote Address, 6/3/98
  Articles
    “New Processor Paradigm: V-IRAM,” Microprocessor Report, 3/9/98, 17-19.
    “A perfect match,” New Scientist, 4/18/98, 36-39.
    “Professor's Idea for Speedy Chip Could Be More Than Academic,” Wall Street Journal, 8/28/98, B1, B4.
VIRAM-1 Specs/Goals
  Technology           0.18-0.20 micron, 5-6 metal layers, fast xtor
  Memory               16-32 MB
  Die size             ≈ 250-300 mm²
  Vector pipes/lanes   4 64-bit (or 8 32-bit or 16 16-bit)

  Target               Low Power                              High Performance
  Serial I/O           4 lines @ 1 Gbit/s                     8 lines @ 2 Gbit/s
  Power (university)   ≈ 2 W @ 1-1.5 volt logic               ≈ 10 W @ 1.5-2 volt logic
  Clock (university)   200 scalar / 200 vector MHz            300 scalar / 300 vector MHz
  Perf (university)    1.6 GFLOPS (64-bit) - 6 GOPS (16-bit)  2.4 GFLOPS (64-bit) - 10 GOPS (16-bit)
  Power (industry)     ≈ 1 W @ 1-1.5 volt logic               ≈ 10 W @ 1.5-2 volt logic
  Clock (industry)     400 scalar / 400 vector MHz            600 scalar / 600 vector MHz
  Perf (industry)      3.2 GFLOPS (64-bit) - 12 GOPS (16-bit) 4 GFLOPS (64-bit) - 16 GOPS (16-bit)
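A rough way to relate the lane counts and clock rates above to the peak performance figures is lanes × operations per lane per cycle × clock. The sketch below assumes 2 operations per lane per cycle (for example, one multiply-add); that factor is an assumption, not something the slide states, and the function names are hypothetical. Under that assumption the model reproduces the 1.6 and 2.4 GFLOPS university entries and gives 6.4 and 9.6 GOPS for 16-bit data, close to the 6 and 10 listed. Other entries in the table may reflect rounding or different assumptions.

```python
# Assumption (not stated on the slide): each lane completes 2 operations
# per cycle, e.g. one multiply-add.
OPS_PER_LANE_PER_CYCLE = 2

# From the slide: 4 lanes of 64 bits, which can also be used as
# 8 lanes of 32 bits or 16 lanes of 16 bits.
DATAPATH_BITS = 4 * 64


def lanes_for_width(element_bits):
    """Number of lanes available for a given element width."""
    return DATAPATH_BITS // element_bits


def peak_gops(clock_mhz, element_bits):
    """Peak billions of operations per second under the assumed model."""
    lanes = lanes_for_width(element_bits)
    return lanes * OPS_PER_LANE_PER_CYCLE * clock_mhz / 1000.0


def serial_io_gbytes_per_s(lines, gbit_per_line):
    """Aggregate serial I/O bandwidth in GB/s (8 bits per byte)."""
    return lines * gbit_per_line / 8.0


if __name__ == "__main__":
    for clock in (200, 300):
        print(f"{clock} MHz: {peak_gops(clock, 64):.1f} GFLOPS (64-bit), "
              f"{peak_gops(clock, 16):.1f} GOPS (16-bit)")
    print(f"Low power serial I/O: {serial_io_gbytes_per_s(4, 1):.1f} GB/s")
    print(f"High performance serial I/O: {serial_io_gbytes_per_s(8, 2):.1f} GB/s")
```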