Toward an Advanced Intelligent Memory System: FlexRAM
Josep Torrellas, University of Illinois

People Involved
Students: Michael Huang, Seung Yoo, Joe Renau, Jaejin Lee
Other faculty: H. V. Jagadish, David Padua, Daniel Reed

Technological Landscape
Merged Logic and DRAM (MLD): IBM, Mitsubishi, Samsung, Toshiba, and others
- 0.18 um process (chips for 1-Gbit DRAM)
- Powerful: e.g., IBM SA-27E ASIC (Feb 99)
  - Logic frequency: 400 MHz
  - IBM PowerPC 603 processor + 16-KB I and D caches = 3% of the chip area
- Further advances on the horizon
Opportunity: how best to exploit MLD?

Terminology
Processor In Memory (PIM) = Intelligent Memory = Intelligent RAM (IRAM)

Key Applications That Benefit from This HW
- Data Mining (decision trees and neural networks)
- Computational Biology (protein sequence matching)
- Molecular Dynamics (short-range forces)
- Financial Modeling (stock options, derivatives)
- Multimedia (MPEG-2)
- Decision Support Systems (TPC-D)
- Speech Recognition
All of these are Data Intensive Applications

Example App: DNA Matching
Problem: find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

How the Algorithm Works
- Pick 4 consecutive amino acids from the sample
- Generate the 50+ most likely mutations

Example App: DNA Matching (continued)
- Compare them to every position in the database DNAs
- If a match is found: try to extend it
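The seed-and-extend loop implied by these two slides might look like the following minimal C sketch; the data layout and the extend_match() helper are hypothetical, not taken from any published FlexRAM code:

```c
#include <stddef.h>
#include <string.h>

#define SEED_LEN 4  /* 4 consecutive amino acids picked from the sample */

/* Hypothetical helper: grow a seed hit at position pos, tolerating mutations. */
extern void extend_match(const char *db, size_t db_len, size_t pos);

/* Scan a database chunk for one seed; the same scan is repeated for each
   of the 50+ likely mutations of the seed. Each P.Array could run this
   loop over the chunk resident in its own 1-Mbyte DRAM bank. */
void match_seed(const char *db, size_t db_len, const char *seed) {
    for (size_t pos = 0; pos + SEED_LEN <= db_len; pos++) {
        if (memcmp(db + pos, seed, SEED_LEN) == 0)
            extend_match(db, db_len, pos);
    }
}
```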

How to Use MLD
1. Main compute engine of the machine
- Add a processor to the DRAM chip
- Include a vector processor or multiple processors
- Incremental gains; hard to program
- Projects: UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories

How to Use MLD (II)
2. Co-processor, special-purpose processor
- ATM switch controller
- Process data beside the disk
- Graphics accelerator
- Projects: Stanford: Imagine; UC Berkeley: ISTORE

How to Use MLD (III)
3. Our approach: take the place of the memory chips in a workstation or server
- The PIM chip processes the memory-intensive parts of the program
- Projects: Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA

Our Solution: Principles
- Extract high bandwidth from DRAM: many simple processing units
- Run legacy codes with high performance:
  - Do not replace the off-the-shelf microprocessor in the workstation
  - Take the place of a memory chip, with the same interface as DRAM
  - Intelligent memory defaults to plain DRAM
- Small increase in cost over DRAM: simple processing units, still dense
- General purpose: do not hardwire any algorithm; no special-purpose hardware

Architecture Proposed

The FlexRAM Memory System
Can exploit multiple levels of parallelism. For a high-end workstation:
- 1 P.Host processor (e.g., Merced, IBM GP)
- 100's of P.Mems in memory (e.g., IBM PowerPC 603)
- 100,000's of very simple P.Arrays in memory

Chip Organization

Memory in One FlexRAM Chip
64 Mbytes of DRAM organized as 16M x 32 bits, in 64 1-Mbyte banks
Each bank:
- Has 1 single port
- Is associated with 1 P.Array
- Has 2 2-Kbyte row buffers (there is no P.Array cache)
P.Array access to memory: 10 ns (row hit) or 20 ns (row miss)
On-chip memory bandwidth: 102 Gbytes/second
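The bandwidth figure is consistent with this organization, assuming each of the 64 single-ported banks can stream one 32-bit word per 2.5-ns logic cycle (the cycle time quoted on the next slide):

64 banks x 4 bytes x 400 MHz = 102.4 Gbytes/second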

Memory in One FlexRAM Chip (continued)
Each group of 4 P.Arrays shares one 8-Kbyte, 4-ported SRAM instruction memory
- Holds the P.Array code
- Small, because the code is short
- Aggressive access time: 1 cycle = 2.5 ns

P.Array
- 64 P.Arrays per chip; not SIMD but SPMD
- 32-bit integer arithmetic; 16 registers
- No caches, no floating point
- 28 different 16-bit instructions
- 4 P.Arrays share one multiplier
- Can access its own 1 Mbyte of DRAM plus the DRAM of its left and right neighbors; the connections form a ring
- Broadcast and notify primitives support barrier synchronization
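A minimal C sketch of what an SPMD P.Array kernel using this ring could look like; my_bank(), left_bank(), right_bank(), and barrier() are hypothetical helpers, not a published FlexRAM API:

```c
#include <stdint.h>

#define BANK_WORDS (1 << 18)   /* 1 Mbyte of DRAM = 2^18 32-bit words */

/* Hypothetical helpers, assumed for illustration: */
extern int32_t *my_bank(void);     /* this P.Array's own 1-Mbyte bank    */
extern int32_t *left_bank(void);   /* left ring neighbor's bank          */
extern int32_t *right_bank(void);  /* right ring neighbor's bank         */
extern void     barrier(void);     /* built on the broadcast/notify pair */

/* 1-D smoothing stencil: every P.Array executes the same code on its own
   data (SPMD), reading boundary words from its ring neighbors. */
void smooth(int32_t *out) {
    int32_t *mine  = my_bank();
    int32_t *left  = left_bank();
    int32_t *right = right_bank();
    for (int i = 0; i < BANK_WORDS; i++) {
        int32_t l = (i == 0)              ? left[BANK_WORDS - 1] : mine[i - 1];
        int32_t r = (i == BANK_WORDS - 1) ? right[0]             : mine[i + 1];
        out[i] = (l + mine[i] + r) / 3;   /* integer only: P.Arrays have no FP */
    }
    barrier();   /* wait for all 64 P.Arrays before results are consumed */
}
```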

P.Mem
- 2-issue static superscalar, like the IBM PowerPC 603; 16-Kbyte I and D caches
- Executes the serial sections of the program
- Communication with the P.Arrays: broadcast/notify, or plain writes/reads to memory
- Communication with other P.Mems:
  - The memory in all chips is visible
  - Access is via the inter-chip network
  - Must flush caches to ensure data coherence

Issues
Communication between P.Mem and P.Host:
- The P.Mem cannot be the master of the bus
- The P.Host starts the P.Mems by writing a register in the Rambus interface
- The P.Host polls a register in the Rambus interface of the master P.Mem
- If the P.Mem is not finished, the memory controller retries; the retries are invisible to the P.Host
Virtual memory:
- P.Mems and P.Arrays use virtual memory
- They share a range of virtual addresses with the P.Host
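A hedged C sketch of the P.Host side of this protocol; the register names, offsets, and MMIO address are illustrative assumptions, not values from the FlexRAM specification:

```c
#include <stdint.h>

/* Assumed MMIO locations in the master P.Mem's Rambus interface: */
#define FLEXRAM_CMD_REG    ((volatile uint32_t *)0xF0000000)
#define FLEXRAM_STATUS_REG ((volatile uint32_t *)0xF0000004)
#define CMD_START   1u
#define STATUS_DONE 1u

/* P.Host starts the P.Mems by writing a register in the Rambus interface. */
void phost_start_pmems(void) {
    *FLEXRAM_CMD_REG = CMD_START;
}

/* P.Host then polls a status register in the master P.Mem's Rambus
   interface; the memory controller transparently retries while the
   P.Mem is not finished, so the retries never reach this loop. */
void phost_wait_pmems(void) {
    while ((*FLEXRAM_STATUS_REG & STATUS_DONE) == 0)
        ;  /* spin until the master P.Mem reports completion */
}
```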

Chip Architecture

Basic Block

Area Estimation (mm²)
- PowerPC 603 + caches: 12
- SRAM instruction memory: 34
- 64 Mbytes of DRAM: 330
- P.Arrays: 96
- Pads + network interface + refresh logic: 20
- Rambus interface: 3.4
- Multipliers: 10
Total = 505, of which 28% is logic, 65% DRAM, and 7% SRAM
VERY CONSERVATIVE
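These figures are mutually consistent; the SRAM instruction-memory entry follows from the stated total and shares:

12 + 34 + 330 + 96 + 20 + 3.4 + 10 = 505.4 ≈ 505 mm²
Logic = 12 + 96 + 20 + 3.4 + 10 = 141.4 mm² ≈ 28%; DRAM = 330 mm² ≈ 65%; SRAM = 34 mm² ≈ 7%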

Evaluation

Utilization
High P.Array utilization; low P.Mem utilization

Utilization
Low P.Host utilization

Speedups
Constant problem size vs. scaled problem size

Speedups
Varying the logic frequency

Programming FlexRAM
FlexRAM is programmed in C plus extensions: C-Flex
Library of Intelligent Memory Operations (IMOs):
- C subroutines that can be called from the main program
- Executed by the P.Arrays or the P.Mem
- Operate on large data sets with poor locality
The library also contains plain versions of the subroutines
The program is linked with either the IMOs or the plain subroutines (see the sketch after the extensions list below)

C-Flex Programming Extensions
- On processor_range: where the following code is executed
- Waitfor processor_range: processors wait for others
- Release object
- Map object to processor_range: mapping of pages
- Flush(object), Flush&Inval(object): flush from the cache
- Broadcast(address), Poll(), Receive(address), Notify()
- FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
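A rough sketch of how these extensions might compose. The slides do not show concrete C-Flex syntax, so the extension statements appear below as comments at the point where they would go; N, p_arrays, and process_local_slice() are assumptions:

```c
/* Plain-C sketch under stated assumptions; only FlexRAM_malloc() is
   named on the slide, and its signature here is assumed. */
enum { N = 1 << 20 };
extern void *FlexRAM_malloc(unsigned long bytes);
extern void process_local_slice(int *data);

void run_on_flexram(void) {
    int *data = FlexRAM_malloc(N * sizeof(int));  /* allocate in FlexRAM DRAM */

    /* Map data to p_arrays;  -- map the object's pages onto the P.Arrays    */
    /* Flush&Inval(data);     -- drop P.Mem cached copies (no HW coherence)  */

    /* On p_arrays: the call below runs as SPMD code on every P.Array        */
    process_local_slice(data);   /* each works on the slice in its own bank  */
    /* Notify();              -- each P.Array signals completion             */

    /* Waitfor p_arrays;      -- the P.Mem blocks until all have notified    */
}
```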

Performance Evaluation
- Hardware performance monitoring embedded in the chip
- Software tools to extract and interpret the performance information

Current Status
- Identified and wrote all the applications
- Designed the architecture based on the applications and feasible technology
- Conceived the ideas behind the language/compiler
Still to do: chip layout and fabrication; development of the compiler
Funds needed for:
- the processor core (P.Mem)
- chip fabrication
- hardware and software engineers

Overall Goal
- Fabricate chips
- Build a workstation with an intelligent memory system
- Demonstrate significant speedups on real applications
- Build a compiler for the intelligent memory system

Conclusion
We have a handle on:
- A promising technology (MLD)
- Key applications of industrial interest
There is a real chance to transform the computing landscape