
1 Gheorghe M. Ștefan http://arh.pub.ro/gstefan/

2 “The semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip – doing so without any clear notion of how such devices would in general be programmed.” David Patterson, IEEE Spectrum, July 2010. November 6, 2014, ETTI Colloquia

3 Outline:
- Little history
- How parallel computing could be restarted
- Kleene’s mathematical model
- Recursive MapReduce abstract model
- Backus’ architectural description
- Programming the MapReduce hierarchy
- Generic one-chip parallel structure
- Concluding remarks

4 History: mono-core computation
1936 – mathematical computational models: Turing, Post, Church, Kleene
1944-45 – abstract machine models: the Harvard abstract model and the von Neumann abstract model
1953 – manufacturing in quantity: IBM 701
1964 – computer architecture: the concept allows independent evolution of software and hardware
Consequently, we now have a few stable and successful sequential architectures: x86, ARM, PowerPC, …

5 History: parallel computation
1962 – manufacturing in quantity: the first MIMD engine is introduced on the computer market by Burroughs
1965 – architectural issues: Dijkstra formulates the first concerns about parallel programming
1974-76 – abstract machine models: the first abstract models (the PRAM models) start to appear, after almost two decades of unsystematic experiments
? – computation model: it is there, waiting for us
Consequently, “the semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip”

6 About PRAM-like models
- The Parallel Random Access Machine – PRAM – (bit-vector models in [Pratt et al. 1974] and PRAM models in [Fortune and Wyllie 1978]) is considered a “natural generalization” of the Random Access Machine (RAM) model.
- The Parallel Memory Hierarchy [Alpern et al. 1993] is also a “generalization”, this time of the Memory Hierarchy model applied to the RAM model.
- The Bulk Synchronous Parallel model divides the program into super-steps [Valiant 1990].
- LogP (Latency, overhead, gap, Processors) is designed to model the communication cost [Culler et al. 1991].

7 How parallel computing could be consistently restarted
1. Use Kleene’s partial recursive functions model as the foundational mathematical framework
2. Define an abstract machine model using meaningful forms derived from Kleene’s model
3. Interface the abstract machine with an architectural (low-level) description based on Backus’ FP Systems
4. Provide the simplest generic parallel structure able to run the functions requested by the architecture
5. Evaluate, using the computational motifs highlighted by Berkeley’s View, the options made in the previous steps and improve them when needed

8 Kleene’s mathematical model for parallel computation
Of the three rules – composition, primitive recursion, minimalization – only the first one, composition, is independent:
f(x) = g(h1(x), …, hm(x))
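Composition is the parallel opportunity in Kleene’s model: the functions h1, …, hm do not depend on each other, so they can be evaluated concurrently before g combines their results. A minimal sketch of this idea in Python (the particular functions used – square, negate, identity, a summing g – are illustrative placeholders, not from the talk):

```python
# Kleene composition f(x) = g(h_1(x), ..., h_m(x)): the h_i are mutually
# independent, so they can run concurrently; g then combines their results.
from concurrent.futures import ThreadPoolExecutor

def compose(g, hs):
    """Return f where f(x) = g(h_1(x), ..., h_m(x)), with the h_i applied in parallel."""
    def f(x):
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda h: h(x), hs))  # parallel "map" phase
        return g(*results)                                # combining phase
    return f

# Example: f(x) = x*x + (-x) + x
f = compose(lambda a, b, c: a + b + c,
            [lambda x: x * x, lambda x: -x, lambda x: x])
print(f(5))  # 25 - 5 + 5 = 25
```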

9 Integral parallel abstract model: data-parallel

10 Integral parallel abstract model: reduction-parallel

11 Integral parallel abstract model: speculative-parallel

12 Integral parallel abstract model: time-parallel

13 Integral parallel abstract model: thread-parallel

14 Putting all the forms together: the integral parallel abstract model
The MapReduce abstract model:
- Map covers data, speculative and thread parallelism
- Reduce covers reduction parallelism

15 From one chip to the cloud: the MapReduce recursive abstract model

16 Backus’ architectural description
John Backus: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”, Communications of the ACM, August 1978.
Functional Programming Systems:
- primitive functions
- functional forms
- definitions

17 Functional forms
Apply to all: αf : x ≡ (x = ⟨x1, …, xp⟩) → ⟨f:x1, …, f:xp⟩
Construction: [f1, …, fp] : x ≡ ⟨f1:x, …, fp:x⟩
Threaded construction: α[f1, …, fp] : x ≡ (x = ⟨x1, …, xp⟩) → ⟨f1:x1, …, fp:xp⟩
Insert: /f : x ≡ ((x = ⟨x1, …, xp⟩) & (p ≥ 2)) → f : ⟨x1, /f : ⟨x2, …, xp⟩⟩
Composition: (fq ∘ fq-1 ∘ … ∘ f1) : x ≡ fq : (fq-1 : (fq-2 : ( … : (f1 : x) … )))
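As an illustration, the functional forms can be rendered as higher-order functions in Python; the names apply_to_all, construction, threaded, insert and compose below are my own labels for Backus’ αf, [f1,…,fp], α[f1,…,fp], /f and ∘ – Backus of course gives no Python:

```python
# Illustrative Python renderings of Backus' FP functional forms.
from functools import reduce

def apply_to_all(f):            # alpha-f : <x1,...,xp> -> <f:x1,...,f:xp>
    return lambda xs: [f(x) for x in xs]

def construction(*fs):          # [f1,...,fp] : x -> <f1:x,...,fp:x>
    return lambda x: [f(x) for f in fs]

def threaded(*fs):              # alpha[f1,...,fp] : <x1,...,xp> -> <f1:x1,...,fp:xp>
    return lambda xs: [f(x) for f, x in zip(fs, xs)]

def insert(f):                  # /f : <x1,...,xp> -> f:<x1, /f:<x2,...,xp>>  (right fold)
    return lambda xs: reduce(lambda acc, x: f(x, acc), reversed(xs))

def compose(*fs):               # (fq o ... o f1) : x -> fq:(...(f1:x)...)
    return lambda x: reduce(lambda v, f: f(v), reversed(fs), x)

print(apply_to_all(lambda n: n * n)([1, 2, 3]))   # [1, 4, 9]
print(insert(lambda a, b: a + b)([1, 2, 3, 4]))   # 10
print(construction(min, max)([3, 1, 2]))          # [1, 3]
```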

18 Kleene – Backus synergy

19 MapReduce hierarchy programming
Any level in the hierarchy uses the same programming forms: Map & Reduce.

  (define (Map funcs args)
    (cond ((and (atom? funcs) (atom? args))   ; one func, one arg
           (funcs args))
          ((and (atom? funcs) (list? args))   ; one func, many args
           (if (null? args)
               '()
               (cons (funcs (car args)) (Map funcs (cdr args)))))
          ((and (list? funcs) (atom? args))   ; many funcs, one arg
           (if (null? funcs)
               '()
               (cons ((car funcs) args) (Map (cdr funcs) args))))
          ((and (list? funcs) (list? args))   ; many funcs, many args
           (if (or (null? funcs) (null? args))
               '()
               (cons ((car funcs) (car args)) (Map (cdr funcs) (cdr args)))))))

20 MapReduce hierarchy programming (cont.)

  (define (Reduce binaryOp argList)
    (cond ((atom? argList) argList)
          (#t (binaryOp (car argList)
                        (Reduce binaryOp (cdr argList))))))

The 0-level functions in the hierarchy are: Add, Sub, Mult, … And, Or, Xor, … Inc, Dec, … Not, … Max, Min, …
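For readers more comfortable outside Scheme, here is a sketch of the same two forms in Python, mirroring the four cases of Map (one or many functions applied to one or many arguments) and the recursive fold of Reduce; this is my own rendering, not code from the talk:

```python
# Python rendering of the Map and Reduce forms above.
def Map(funcs, args):
    if callable(funcs) and not isinstance(args, list):   # one func, one arg
        return funcs(args)
    if callable(funcs):                                  # one func, many args
        return [funcs(a) for a in args]
    if not isinstance(args, list):                       # many funcs, one arg
        return [f(args) for f in funcs]
    return [f(a) for f, a in zip(funcs, args)]           # many funcs, many args

def Reduce(binary_op, arg_list):
    if not isinstance(arg_list, list):                   # an atom reduces to itself
        return arg_list
    if len(arg_list) == 1:
        return arg_list[0]
    return binary_op(arg_list[0], Reduce(binary_op, arg_list[1:]))

from operator import add
print(Reduce(add, Map(lambda x: x * x, [1, 2, 3, 4])))  # 1+4+9+16 = 30
```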

21 Generic one-chip parallel structure

22 The ConnexArray™: BA1024
Last version, March 2008:
- 65 nm, 9×9 mm² (entire chip)
- 1024 16-bit cells, 1 KB/cell
- 400 MHz, 400 GOPS
- > 120 GOPS/W
- > 6.25 GOPS/mm²
The first version: 11×11 mm², in 90 nm.

23 Updated version in 28 nm
- 2048 32-bit cells with 8 KB/cell, at 1 GHz
- < 15 W, at T < 85°C
- 2 TOPS to 500 GFLOPS
- 86 mm², < $15/chip (mass production)
- 133 GOPS/W to 33 GFLOPS/W (7.5 – 30 pJ per OP/FLOP)
- OP: logic/arithmetic/memory-access 32-bit integer operations
For comparison, Tianhe-2: 3.08 GFLOPS/W (325 pJ/FLOP) – more than a 10x – 40x gain in energy efficiency.
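The picojoule figures follow directly from the efficiency numbers: pJ per operation is the reciprocal of GOPS/W, since (ops/s)/W = ops/J. A quick check:

```python
# pJ/OP is the reciprocal of GOPS/W: (ops/s)/W = ops/J, so J/op = 1/(GOPS/W * 1e9).
def pj_per_op(gops_per_watt):
    return 1e12 / (gops_per_watt * 1e9)

print(round(pj_per_op(133), 1))   # ~7.5 pJ/OP
print(round(pj_per_op(33), 1))    # ~30.3 pJ/FLOP
print(round(pj_per_op(3.08), 1))  # ~324.7 pJ/FLOP (Tianhe-2)
```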

24 Validating the MapReduce architecture
Krste Asanovic et al.: The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, 2006.
Provides 13 computational motifs.

25 AES: ConnexArray64 vs. Cortex-A9
Area & power of Connex64 (16-bit cells) are similar to those of the Cortex-A9.
- On Cortex-A9: 173 cycles/byte
- On 64-cell Connex: 2.1 cycles/byte
The area & power efficiency is 82x better on Connex.

26 FFT: ConnexArray32 vs. Cortex-A9
Area & power of Connex32 (32-bit cells) are similar to those of the Cortex-A9, because Connex32 multiplies sequentially.
The area & power efficiency is 18.8x better on Connex32 for fewer than 128 × 128 samples; for a large number of samples it is determined by the transpose time.

27 Sorting: ConnexArray64 vs. Cortex-A9
Transparently interleaving the Sort… and Trans… operations on two sets of streams to be sorted improves performance.
When sorting 64 16-number sequences, the acceleration is 84x.
For longer sequences, the transpose operations may become dominant and the performance may diminish.

28 Concluding remarks
- Kleene’s mathematical computational model fits perfectly as the theoretical foundation for parallel computing.
- The integral parallel abstract machine model is defined as the model of the simplest generic parallel engine.
- Both Kleene’s model and Backus’ architecture promote one-dimensional arrays, thus supporting the simplest hardware configuration.
- MapReduce is a recursive model working from the chip level to cloud computation.

29 Concluding remarks (cont.)
Big fallacy: putting together Turing-inspired sequential machines results in a parallel computer that provides high performance.
The cellular approach is successful only if:
- as the number of cells increases, their size and complexity are reduced
- a network of simple & small engines is interleaved with a network of big & fast memories – the only solution to achieve high performance
- a recursive growth of the hierarchy is used, supported by functional languages
Separating the complex from the simple is the key.

30 Concluding remarks (cont.)
Complex computation:
- mono/multi-core, big & complex processor organization
- multi-threaded programming model
- operating-system-oriented design
- cache-based memory hierarchy
Intense computation:
- many small & simple cells organization
- array (vector and/or stream) computing
- high-latency, functional-pipe-oriented system
- multi-buffer-oriented memory hierarchy (the flow of code and data is very predictable)

31 Concluding remarks (cont.)
The cellular approach fits perfectly for problems characterized by high locality.
The size and the complexity of an integrated circuit grow at very different rates:
- size grows exponentially
- complexity grows polynomially (n^a with a < 1)
A cellular system must be programmable, because high size implies the marriage of circuit & information.

32 Thank you. Q&A

33 Bibliography
Martin Davis: The Undecidable: Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions, Dover Publications.
John Backus: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”, Communications of the ACM, August 1978.
Krste Asanovic et al.: The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, 2006.

