A solution to the von Neumann bottleneck


In-memory computing: a solution to the von Neumann bottleneck
Sylvain Eudier, MSCS candidate
Union College, 2004

Plan
- Introduction to a new architecture
- Different architectures
- The C-RAM architecture: implications and applications
- Performance
- Conclusion

Introduction
- The von Neumann architecture
- The situation (gap evolution)
- Some improvements were made
- Can we avoid this bottleneck? (graph)

The von Neumann architecture: a von Neumann computer has three parts: a central processing unit (CPU), a store, and a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store). Backus proposed to call this tube the von Neumann bottleneck.

The situation: CPU speed doubles every 18 months, which means bigger dies; pipelining helps, but it increases latencies (cache access, branch-prediction penalties) and the complexity of processor design. Meanwhile the CPU-memory gap keeps widening: memory speed improves by only about 7% a year. The usual mitigations are caching, prefetching, and multithreading. Discuss cache memory (pros and cons).
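To make the bottleneck concrete, here is a minimal C++ micro-benchmark (my addition, not from the slides) contrasting a cache-friendly sequential sum with a dependent pointer chase that defeats the caches and the prefetcher. The array size, stride, and timing method are arbitrary choices.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const size_t n = 1 << 24;  // ~16M elements, much larger than any cache
    std::vector<size_t> next(n);

    // Sequential access: the hardware prefetcher hides most DRAM latency.
    std::iota(next.begin(), next.end(), 0);
    auto t0 = std::chrono::steady_clock::now();
    size_t sum = 0;
    for (size_t i = 0; i < n; ++i) sum += next[i];
    auto t1 = std::chrono::steady_clock::now();

    // Dependent pointer chase with a large co-prime stride: every load
    // must wait for the previous one to come back from memory.
    for (size_t i = 0; i < n; ++i) next[i] = (i + 100003) % n;
    auto t2 = std::chrono::steady_clock::now();
    size_t p = 0;
    for (size_t i = 0; i < n; ++i) p = next[p];  // each load depends on the last
    auto t3 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("sequential: %.1f ms (sum=%zu)\n", ms(t1 - t0).count(), sum);
    std::printf("chase:      %.1f ms (p=%zu)\n", ms(t3 - t2).count(), p);
}
```

On a typical desktop the chase loop runs roughly an order of magnitude slower even though both loops issue the same number of loads; that gap is what the processing-in-memory architectures below attack.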

Different architectures
- IRAM (design)
- RAW (design)
- CRAM (design)

IRAM (UC Berkeley): taped out in October 2002 by IBM (72 chips on a wafer); testing in progress (email). IRAM stands for Intelligent RAM (CPU + RAM). The principle is to group CPU, cache, RAM, and networking onto the same chip, giving a smaller, cheaper, lower-power part aimed at massive vector processing. IRAM is designed to be stand-alone, and the first results are good: 13 MB of DRAM with a 200 MHz processor gives 1.6 GFLOPS (200 MHz x 8 operations per cycle) at 2 W.

RAW (MIT): instead of building one processor on the chip, several processors (tiles) are implemented and connected by a network. A tile is a RISC processor, 128 KB of SRAM, an FPU, and a communication processor; the prototype has 16 tiles. The memory sits at the periphery of the 16 tiles, and all tiles can access it via either (1) a static network, with a 3-cycle latency between nearest tiles plus 1 cycle per hop, or (2) a dynamic network that sends data packets. (The prototype runs at 300 MHz and 3.2 GFLOPS.)

CRAM (University of Alberta, Prof. Elliott, together with Carleton University, Ottawa): first prototype in 1996; four prototypes developed, and the latest is currently being designed. Processing elements (PEs) are added to the sense amplifiers to take advantage of the memory bandwidth available at that point. The PEs are very simple 1-bit serial PEs whose computation is based on the truth table of an 8-to-1 multiplexer: many small PEs rather than one big microprocessor. The design is general-purpose, only 5-10% of the chip area goes to the PEs, and the PEs communicate through a shift-left/shift-right register. (A software sketch of the mux-based PE follows.)
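The mux-based ALU is easy to mimic in software. Below is a minimal C++ sketch (an illustration of the idea, not Elliott's actual circuit; the operand names m, x, y are my assumption) of a 1-bit serial PE: an 8-bit opcode serves as the truth table of an 8-to-1 multiplexer, and the three operand bits select which opcode bit comes out, so a single opcode can realize any Boolean function of three inputs.

```cpp
#include <cstdint>
#include <cstdio>

// One CRAM-style PE step: the 8-bit opcode is the truth table of an
// arbitrary 3-input Boolean function, and (m, x, y) drive the select
// lines of an 8-to-1 multiplexer. m stands for the bit read from the
// memory column; x and y for PE register bits.
inline int pe_step(uint8_t opcode, int m, int x, int y) {
    int select = (m << 2) | (x << 1) | y;  // three select lines -> 0..7
    return (opcode >> select) & 1;         // pick one bit of the truth table
}

int main() {
    // Truth-table opcodes: bit i of the opcode is f(m, x, y) when the
    // select lines encode i.
    const uint8_t PARITY   = 0x96;  // m ^ x ^ y: the sum bit of a full adder
    const uint8_t MAJORITY = 0xE8;  // majority(m, x, y): the carry bit

    // Bit-serial addition of two 8-bit words, LSB first, one bit per step,
    // the way a 1-bit PE would sweep its memory column.
    uint8_t a = 180, b = 99;
    int carry = 0;
    unsigned result = 0;
    for (int i = 0; i < 8; ++i) {
        int x = (a >> i) & 1, y = (b >> i) & 1;
        int sum = pe_step(PARITY, carry, x, y);    // sum = carry ^ x ^ y
        carry   = pe_step(MAJORITY, carry, x, y);  // next carry = majority
        result |= unsigned(sum) << i;
    }
    result |= unsigned(carry) << 8;                // final carry-out
    std::printf("%d + %d = %u\n", a, b, result);   // 180 + 99 = 279
    return 0;
}
```

A word operation therefore costs one cycle per bit, which is why the architecture wins only when thousands of PEs run in lockstep.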

The C-RAM (Computational RAM) architecture
- Applications
- Performance
- Implications: new software design, energy consumption

I chose to focus on the CRAM for its multi-purpose design, not intended to perform in a restricted domain. The architecture is not so new, since the first prototype was taped out in 1996 by D. Elliott at the University of Alberta, Canada.

Applications of the CRAM: very good for parallel computing on parallel-reducible algorithms; the greater the parallel degree of a computation, the better. Because of the PEs, a bigger computation doesn't necessarily mean more time, as long as it fits in the memory: more CRAM means more RAM and more computing power. Even when a problem is not really parallel, it can still be faster because of the bandwidth; however, the job has to be big enough, otherwise moving it into the memory takes longer than the actual computation.

These applications and performance tests were chosen for the different fields they represent (to demonstrate the general-purpose nature of the CRAM), for their different computational complexities and models, and because they are all based on practical problems.

Implications: the CRAM implies a different way of writing programs, and new interfaces. Due to the integration of all the elements on-chip (memory, CPU, bus), consumption is very low.

CRAM applications
- Image processing: low-level adjustments (brightness, average filter…)
- Database searches: equivalence, extremes, between limits…
- Multimedia compression: MPEG motion estimation

Image processing: brightness, thanks to the CRAM design, is very fast: pure computation. The average filter is slower due to the communication between the PEs. Performance therefore depends on the problem and its degree of parallelism. (A sketch contrasting the two follows.)

Database searches: on randomly generated lists, all the searches run at about the same speed, but the uniprocessor/CRAM difference is not so high: linear running time.

Multimedia compression: this algorithm requires shift registers (which slows down the process), but the problem suits parallel processing particularly well: lots of redundant computation on groups of pixels.
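A minimal C++ sketch (mine, not from the talk) of why the two image operations behave differently: brightness is purely element-wise, so every PE can work on its own pixel, while the average filter needs each pixel's neighbors, which on C-RAM means moving values between adjacent PEs.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Element-wise brightness adjustment: no communication at all. On C-RAM,
// every PE would apply this to the pixel(s) in its own memory column.
void brightness(std::vector<uint8_t>& img, int delta) {
    for (auto& p : img)
        p = static_cast<uint8_t>(std::clamp(int(p) + delta, 0, 255));
}

// 1-D three-tap average filter: each output needs its left and right
// neighbors. On C-RAM those neighbors live in adjacent PEs and must be
// moved through the shift-left/shift-right register, which is the
// communication cost mentioned on the slide.
std::vector<uint8_t> average3(const std::vector<uint8_t>& img) {
    std::vector<uint8_t> out(img.size());
    for (size_t i = 0; i < img.size(); ++i) {
        int left  = img[i == 0 ? 0 : i - 1];               // clamp at borders
        int right = img[i + 1 == img.size() ? i : i + 1];
        out[i] = static_cast<uint8_t>((left + int(img[i]) + right) / 3);
    }
    return out;
}
```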

Performance: configurations
- CRAM: 200 MHz, 32 MB, 64K PEs, hosted on a Pentium 133 MHz (simulated)
- Pentium 133 MHz with 32 MB of RAM
- Sun SPARCstation, 167 MHz CPU with 64 MB

The simulator has been validated against tests and results from the prototype; the results are very close.

Performance: basic operations
Because of the small number of available prototypes, the results are based on a simulator, but the tests proved it very precise, very close to the real results. Performance depends on the precision required. This CRAM reaches about 0.01 GFLOPS, which is good considering it runs at only 200 MHz with 32 MB: at least 200 times faster than the equivalent PC. (See the operations-complexity backup slide.)

Performance: comparison
The first case assumes the CRAM is used as the main memory or as video memory, so transfer time doesn't matter. The second case is when the CRAM is used as an extension card, an accelerator for massively parallel computation. The overhead case takes into account the transfer of data from the host to the CRAM and the overhead created during that transfer.
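The two cases can be summarized in one formula (my notation, not the talk's): writing T_host for the host-only runtime, T_cram for the in-CRAM compute time, and T_xfer for the host-to-CRAM transfer,

```latex
S_{\text{main memory}} = \frac{T_{\text{host}}}{T_{\text{cram}}},
\qquad
S_{\text{extension card}} = \frac{T_{\text{host}}}{T_{\text{xfer}} + T_{\text{cram}}}.
```

So the accelerator-style deployment only pays off when the computation is large enough that the transfer is small next to the host runtime, which is the "big enough job" caveat from the applications slide.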

New software design (step 1)
Think parallel (pseudo code on the slide; a reconstruction follows).
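The pseudo code itself did not survive the transcript, so the following is a hedged reconstruction of the kind of recast such a slide typically shows: a serial element-wise loop re-expressed so that one logical PE owns each element and all PEs execute the same instruction stream.

```cpp
#include <cstddef>

// Serial mindset: one processor visits all n elements in turn.
void add_serial(const int* a, const int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];        // n trips through the CPU, one at a time
}

// Parallel (CRAM) mindset, as pseudo code: every PE i holds a[i], b[i],
// c[i] in its own memory column, and all PEs run the same bit-serial
// program in lockstep:
//
//     forall PE i in parallel:
//         c[i] = a[i] + b[i]
//
// Runtime is then set by the word width (bit-serial), not by n, as long
// as the data fits in memory. That is the point of "think parallel".
```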

New software design (step 2)
Use a different language (modified C++).
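The transcript does not show the modified C++ itself, so the fragment below is purely hypothetical: it emulates, on an ordinary host, what such a dialect might feel like. The `cram::array` type and its whole-array operators are my invention and do not come from the actual CRAM toolchain; the idea is only that array-wide expressions would be lowered by the compiler onto the PEs instead of a loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical emulation of a CRAM C++ dialect (illustrative only).
// A cram::array is imagined as one-element-per-PE storage; on real CRAM
// its operators would compile to bit-serial SIMD instruction streams,
// here they are plain host loops.
namespace cram {
struct array {
    std::vector<uint8_t> data;
    // Element-wise saturating add: on CRAM, one broadcast instruction stream.
    array operator+(int delta) const {
        array out{data};
        for (auto& p : out.data)
            p = static_cast<uint8_t>(std::clamp(int(p) + delta, 0, 255));
        return out;
    }
};
}  // namespace cram

int main() {
    cram::array img{{10, 200, 250}};  // three "pixels", one per imagined PE
    img = img + 40;                   // no visible loop: "think parallel"
    for (auto p : img.data) std::printf("%d ", p);  // prints: 50 240 255
    std::printf("\n");
}
```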

New software design (step 3)
Possibly coding in assembly to optimize, or when you want to switch from host to CRAM computing. The comprehensive compiler is not finished yet; the instructions defined for the CRAM are translated into the corresponding host MOVEs.

CRAM energy consumption
- We avoid the use of a bus
- We have direct access to memory
- No overhead in communication
In the end, the CRAM uses 20 times less energy, and therefore produces less heat.

The future…
- Which architecture will be chosen?
- The end of today's architecture?
- A PetaOps is feasible with CRAM
- Blue Gene/P aims at a petaFLOPS (view)

Which architecture will survive? Talk about the new software design and the performance, and compare with the RAMBUS case; it will probably be a matter of sponsors and money.

End of today's architecture: some scientists believe that one day or another we will have to move to these techniques, because of the computing power available, the power reduction, and therefore the heat reduction.

A PetaOps: a study by four scientists from Carleton University in Canada (D. Elliott among them) shows that a PetaOps is feasible with 500 MHz SRAM and a PE for every 512 bytes: you just need 1 TB of CRAM.

Blue Gene: IBM's supercomputer, built from arrays of PIMs; because it uses PIM techniques, it will be air-cooled. Blue Gene/C, released last year for protein folding, ranked 73rd in the Top500 supercomputers at 2 TFLOPS. Blue Gene/L (2005) targets 200/360 TFLOPS, faster than the total computing power of today's Top500 supercomputers. Blue Gene/P targets a petaFLOPS: 1000 times faster than Deep Blue; compared to today's supercomputers, 6 times faster, at 1/15th of the power, and 10 times more compact (just half a tennis court); 128 times bigger than Blue Gene/C, and about 2 million times faster than a desktop computer. Expected for 2007.
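The PetaOps figure can be sanity-checked with back-of-the-envelope arithmetic (my reconstruction, not the study's; in particular the cycles-per-word-operation count is an assumption, since bit-serial PEs need many cycles per word):

```latex
\frac{10^{12}\,\mathrm{B}}{512\,\mathrm{B/PE}} \approx 2\times10^{9}\ \mathrm{PEs},
\qquad
2\times10^{9}\ \mathrm{PEs} \times 5\times10^{8}\ \mathrm{Hz}
  = 10^{18}\ \text{bit-ops/s}.
```

Even if a bit-serial word operation costs on the order of a thousand cycles, that still leaves roughly 10^15 word operations per second, i.e. a PetaOps.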

Questions?

Memory bandwidth in a computer
Based on 256 MB of memory built from 16 Mb, 50 ns DRAM chips, and a 100 MHz CPU with a 64-bit bus. (Log scale.)
Back

IRAM design
Back

RAW design
Back

CRAM design
Back

Operations complexity for CRAM
Back

Blue Gene/P scale

Computing power scale (log scale)
Back

Memory-processor gap evolution
Back