
Gheorghe M. Ștefan
ETTI Colloquia, November 6, 2014

“The semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip – doing so without any clear notion of how such devices would in general be programmed.”
– David Patterson, IEEE Spectrum, July 2010

Outline:
– Little history
– How parallel computing could be restarted
– Kleene’s mathematical model
– Recursive MapReduce abstract model
– Backus’ architectural description
– Programming the MapReduce hierarchy
– Generic one-chip parallel structure
– Concluding remarks

History: mono-core computation
– 1936 – mathematical computational models: Turing, Post, Church, Kleene
– abstract machine models: the Harvard abstract model, the von Neumann abstract model
– 1953 – manufacturing in quantity: IBM
– computer architecture: the concept allows independent evolution of software and hardware
Consequently, we now have a few stable and successful sequential architectures: x86, ARM, PowerPC, …

History: parallel computation
– 1962 – manufacturing in quantity: the first MIMD engine is introduced on the computer market by Burroughs
– 1965 – architectural issues: Dijkstra formulates the first concerns about parallel programming
– abstract machine models: the first abstract models (PRAM models) start to come in after almost two decades of non-systematic experiments
– ? – computation model: it is there, waiting for us
Consequently, “the semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip”.

About PRAM-like models
– Parallel Random Access Machine (PRAM) – bit-vector models in [Pratt et al. 1974] and PRAM models in [Fortune and Wyllie 1978] – is considered a “natural generalization” of the Random Access Machine model.
– Parallel Memory Hierarchy [Alpern et al. 1993] is also a “generalization”, this time of the Memory Hierarchy model applied to the RAM model.
– The Bulk Synchronous Parallel model divides the program into super-steps [Valiant 1990].
– Latency-overhead-gap-Processors (LogP) is designed to model the communication cost [Culler et al. 1991].

How parallel computing could be consistently restarted
1. Use Kleene’s partial recursive functions model as the foundational mathematical framework
2. Define an abstract machine model using meaningful forms derived from Kleene’s model
3. Interface the abstract machine with an architectural (low-level) description based on Backus’ FP Systems
4. Provide the simplest generic parallel structure able to run the functions requested by the architecture
5. Evaluate, using the computational motifs highlighted by Berkeley’s View, the options made in the previous three steps and improve them when needed

Kleene’s mathematical model for parallel computation
From the three rules:
– composition
– primitive recursion
– minimalization
only the first one, composition, is independent:
f(x) = g(h_1(x), …, h_m(x))
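To make the parallelism hidden in the composition rule concrete, here is a minimal Scheme sketch (not from the slides): the component functions h_i are mutually independent, so they can be evaluated concurrently, and g then combines their results. The names compose-parallel, g and hs are illustrative only.

(define (compose-parallel g hs)                       ; hs = list of the h_i
  (lambda (x)
    (apply g (map (lambda (h) (h x)) hs))))           ; each (h x) is independent work

(define f (compose-parallel + (list (lambda (x) (* x x))      ; h_1(x) = x^2
                                    (lambda (x) (+ x 1)))))   ; h_2(x) = x + 1
(f 3)   ; => 13, i.e. f(x) = g(h_1(x), h_2(x)) with g = +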

Integral parallel abstract model: data-parallel

Integral parallel abstract model: reduction-parallel

Integral parallel abstract model: speculative-parallel

Integral parallel abstract model: time-parallel

Integral parallel abstract model: thread-parallel

Putting all forms together: the integral parallel abstract model
The MapReduce abstract model:
– Map covers data, speculative and thread parallelism
– Reduce covers reduction parallelism

From one-chip to cloud: MapReduce recursive abstract model
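Since the original slide is a figure, the following is only a toy sketch of the recursive idea in standard Scheme: a node of the hierarchy is either a leaf cell holding a value, or a list of child nodes (a chip, a board, a data center, …), and the same map-then-reduce pattern is applied at every level. The names hier-mapreduce and fold-right1 are hypothetical.

(define (fold-right1 op lst)                       ; reduce a non-empty list with op
  (if (null? (cdr lst))
      (car lst)
      (op (car lst) (fold-right1 op (cdr lst)))))

(define (hier-mapreduce f op node)
  (if (pair? node)
      (fold-right1 op (map (lambda (n) (hier-mapreduce f op n)) node))  ; inner level
      (f node)))                                   ; leaf cell: apply f locally

(hier-mapreduce (lambda (x) (* x x)) + '((1 2) (3 4)))   ; sum of squares over two "chips" => 30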

Backus’ architectural description
John Backus: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”, Communications of the ACM, August 1978.
Functional Programming Systems:
– primitive functions
– functional forms
– definitions

Functional forms
– Apply to all: αf : x ≡ (x = <x_1, …, x_n>) → <f : x_1, …, f : x_n>
– Construction: [f_1, …, f_p] : x ≡ <f_1 : x, …, f_p : x>
– Threaded construction: [f_1, …, f_p] : x ≡ (x = <x_1, …, x_p>) → <f_1 : x_1, …, f_p : x_p>
– Insert: /f : x ≡ ((x = <x_1, …, x_p>) & (p ≥ 2)) → f : <x_1, /f : <x_2, …, x_p>>
– Composition: (f_q ∘ f_{q-1} ∘ … ∘ f_1) : x ≡ f_q : (f_{q-1} : (f_{q-2} : ( … : (f_1 : x) … )))
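As a minimal Scheme reading of three of these forms (an assumption on my part, using ordinary lists in place of Backus’ sequences <x_1, …, x_n>; the names are illustrative):

(define (apply-to-all f)                     ; αf : apply f to every element
  (lambda (x) (map f x)))
(define (construction fs)                    ; [f_1, ..., f_p] : apply every f to the same x
  (lambda (x) (map (lambda (f) (f x)) fs)))
(define (insert f)                           ; /f : right-to-left reduction
  (lambda (x)
    (if (null? (cdr x))
        (car x)
        (f (car x) ((insert f) (cdr x))))))

((apply-to-all (lambda (x) (* x x))) '(1 2 3))   ; => (1 4 9)
((construction (list car cdr)) '(1 2 3))         ; => (1 (2 3))
((insert +) '(1 2 3 4))                          ; => 10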

Kleene – Backus synergy

MapReduce hierarchy programming
Any level in the hierarchy uses the same programming forms: Map & Reduce.

(define (atom? x) (not (list? x)))               ; helper assumed by the code below (not standard Scheme)

(define (Map funcs args)
  (cond
    ((and (atom? funcs) (atom? args))            ; one funcs, one args
     (funcs args))
    ((and (atom? funcs) (list? args))            ; one funcs, many args
     (if (null? args)
         '()
         (cons (funcs (car args)) (Map funcs (cdr args)))))
    ((and (list? funcs) (atom? args))            ; many funcs, one args
     (if (null? funcs)
         '()
         (cons ((car funcs) args) (Map (cdr funcs) args))))
    ((and (list? funcs) (list? args))            ; many funcs, many args
     (if (or (null? funcs) (null? args))
         '()
         (cons ((car funcs) (car args)) (Map (cdr funcs) (cdr args)))))))
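A few illustrative calls (not on the original slide) showing how the four branches of Map realize the kinds of parallelism listed earlier; the lambdas stand in for arbitrary cell-level functions:

(Map (lambda (x) (* x x)) 5)                      ; one funcs, one args    => 25
(Map (lambda (x) (* x x)) '(1 2 3))               ; data-parallel          => (1 4 9)
(Map (list (lambda (x) (+ x 1))
           (lambda (x) (- x 1))) 10)              ; speculative-parallel   => (11 9)
(Map (list (lambda (x) (+ x 1))
           (lambda (x) (- x 1))) '(10 20))        ; thread-parallel        => (11 19)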

MapReduce hierarchy programming (define(Reduce binaryOp argList) (cond((atom? argList)argList) (#t(binaryOp(car argList) (Reduce binaryOp (cdr argList)))) )) The 0-level functions in the hierarchy are: Add, Sub, Mult, … And, Or, Xor, … Inc, Dec, … Not, … Max, Min, … November 6, 2014ETTI Colloquia20

Generic one-chip parallel structure

The ConnexArray™: BA1024
Last version (March, … nm):
– 9×9 mm² (entire chip)
– 1024 …-bit cells, 1 KB/cell
– 400 MHz
– 400 GOPS
– > 120 GOPS/W
– > 6.25 GOPS/mm²
The first version: 11×11 mm², in 90 nm.

Updated version in 28 nm
– …-bit cells with 8 KB/cell
– 1 MHz
– < 15 W, at T < 85 °C
– 2 TOPS to 500 GFLOPS
– 86 mm²
– < 15 $/chip (mass production)
– 133 GOPS/W to 33 GFLOPS/W (7.5 pJ/OP – 30 pJ/FLOP)
– OP: logic/arithmetic/memory access, 32-bit integer operations
Tianhe-2: 3.08 GFLOPS/W (325 pJ/FLOP)
More than 10× – 40× better in energy efficiency.
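As a quick consistency check (my arithmetic, not on the slide), the energy figures follow directly from the throughput and power numbers:

1 / (133 GOPS/W) ≈ 7.5 pJ/OP and 1 / (33 GFLOPS/W) ≈ 30 pJ/FLOP;
133 / 3.08 ≈ 43 and 33 / 3.08 ≈ 11, hence the roughly 10× – 40× advantage over Tianhe-2.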

Validating the MapReduce architecture
Krste Asanovic et al.: The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, 2006.
Provides 13 computational motifs.

AES: ConnexArray64 vs. Cortex9
– Area & power for Connex64 (16-bit cells) are similar to those of Cortex9
– On Cortex9: 173 cycles/byte
– On 64-cell Connex: 2.1 cycles/byte
– The efficiency in the use of area & power is therefore about 82× better on Connex (173 / 2.1 ≈ 82)

FFT: ConnexArray32 vs. Cortex9
– Area & power for Connex32 (32-bit cells) are similar to those of Cortex9, because Connex32 multiplies sequentially
– The gain in the use of area & power is 18.8× on Connex32 for fewer than 128 × 128 samples
– For a large number of samples the gain is determined by the transpose time

Sorting: ConnexArray64 vs. Cortex9
– Transparently interleaving Sort… and Trans… on two sets of streams to be sorted improves the performance
– Sorting a …-number sequence, the acceleration is 84×
– For longer sequences the transpose operations may become dominant, and the performance may diminish

Concluding remarks
– Kleene’s mathematical computational model fits perfectly as the theoretical foundation for parallel computing.
– The integral parallel abstract machine model is defined as the model of the simplest generic parallel engine.
– Both Kleene’s model and Backus’ architecture promote one-dimensional arrays, thus supporting the simplest hardware configuration.
– MapReduce is a recursive model working from the chip level up to cloud computing.

Concluding remarks (cont.)
Big fallacy: putting together Turing-inspired sequential machines results in a parallel computer which provides high performance.
The cellular approach is successful only if:
– by increasing the number of cells, their size and complexity are reduced
– a network of simple & small engines is interleaved with a network of big & fast memories (the only solution to achieve high performance)
– a recursive growth of the hierarchy is used, supported by functional languages
Separating the complex from the simple is the key.

Concluding remarks (cont.)
Complex computation:
– mono/multi-core, big & complex processor organization
– multi-threaded programming model
– operating-system-oriented design
– cache-based memory hierarchy
Intense computation:
– many small & simple cells organization
– array (vector and/or stream) computing
– high-latency, functional-pipe-oriented system
– multi-buffer-oriented memory hierarchy (the flow of code and data is very predictable)

Concluding remarks (cont.)
– The cellular approach fits perfectly problems related to processes characterized by high locality.
– The growth rates of the size and of the complexity of an integrated circuit are very different: size grows exponentially, while complexity grows polynomially (n^a with a < 1).
– A cellular system must be programmable, because high size implies the marriage of circuit & information.

Thank you. Q&A

Bibliography
– Martin Davis: The Undecidable: Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions, Dover Publications.
– John Backus: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”, Communications of the ACM, August 1978.
– Krste Asanovic et al.: The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, 2006.