
Membrane Computing in the Connex Environment
WMC8, June 2007
Gheorghe Stefan, BrightScale Inc., Sunnyvale, CA & Politehnica University of Bucharest

Outline
- Integral Parallel Architecture
- The Connex Chip
- The Connex Architecture
- How to Use the Connex Environment
- Concluding Remarks

Integral Parallel Architecture
- The Ubiquitousness of Parallelism Asks for Integral Parallel Architectures
- Partial Recursive Functions & Parallel Computation
- A Functional Taxonomy of Parallel Computation

Parallelism cannot be avoided anymore
- Intel's approach:
  - Multi-processors: the best approach for multi-threading on MIMD architectures
  - Inefficient on SIMD architectures
  - Ignores the MISD architecture
- Many-processors ask for another taxonomy:
  - They work as accelerators
  - They perform critical functions
- Berkeley's 13 dwarfs are a functional approach to many-processors
- Real applications ask for all kinds of parallelism to solve corner cases – the places where the devil hides

Partial Recursive Functions & Parallel Computation
- Composition Rule & the Basic Parallel Structures
- Primitive Recursive Rule
- Minimalization Rule

Composition & the Associated Structure

f(x0, … xn-1) = g(h0(x0, … xn-1), h1(x0, … xn-1), … hm-1(x0, … xn-1))

[figure: the inputs x0, … xn-1 feed m blocks h0, h1, … hm-1 in parallel; their outputs feed a block computing g, which produces f(x0, … xn-1)]

Data Parallel Composition

X = {x0, … xn-1} → {h(x0), h(x1), … h(xn-1)}

[figure: n identical blocks h, the i-th receiving xi and producing h(xi)]
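The data-parallel composition above can be sketched in plain Python (an illustration of the scheme, not Connex code): every cell applies the same function h to its own component of the vector.

```python
# Data-parallel composition: {x0, ..., xn-1} -> {h(x0), ..., h(xn-1)}.
# On the Connex array each application would run in its own cell, in
# lock-step; the sequential map below is just a stand-in for that.
def data_parallel(h, xs):
    return [h(x) for x in xs]

# Example: square every component of a small vector.
squares = data_parallel(lambda x: x * x, [0, 1, 2, 3])  # [0, 1, 4, 9]
```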

Speculative Composition

function vector: H = [h0, h1, … hn-1], scalar: x
H(x) = {h0(x), h1(x), … hn-1(x)}

[figure: the scalar x feeds n distinct blocks h0, h1, … hn-1 in parallel, producing h0(x), h1(x), … hn-1(x)]

Serial Composition

f(x) = g(h(x)) – time parallelism

The general case: f(x) = g1(g2(g3( … gp(x) … )))

[figure: x feeds block h; its output h(x) feeds block g, which produces f(x) = g(h(x))]
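A minimal sketch of serial composition, again in plain Python rather than anything Connex-specific: the stages g1 … gp form a pipe, with gp applied first.

```python
# Serial (time-parallel) composition: f(x) = g1(g2(...gp(x)...)).
# In hardware each gi would occupy one pipe stage; here the stages
# are ordinary functions applied right to left.
def pipe(stages, x):
    for g in reversed(stages):
        x = g(x)
    return x

# f(x) = g1(g2(x)) with g1(v) = v + 1 and g2(v) = 2 * v.
f = lambda x: pipe([lambda v: v + 1, lambda v: 2 * v], x)  # f(3) == 7
```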

Reduction Composition

f(x0, … xm-1) = g(x0, … xm-1)

[figure: the m inputs x0, x1, … xm-1 are collapsed by a single block g into the scalar g(x0, x1, … xm-1)]
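Reduction composition collapses a vector into a scalar; for an associative operation a log-depth tree is the usual hardware shape. A small Python sketch (the pairwise tree is our illustration, not the actual Connex reduction network):

```python
# Reduction composition: g(x0, ..., xm-1) computed as a pairwise tree,
# mirroring a log-depth hardware reducer for an associative op.
def reduce_tree(op, xs):
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

total = reduce_tree(lambda a, b: a + b, [1, 2, 3, 4, 5])  # 15
```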

Primitive recursive rule

f(x,y) = h(x, f(x, y-1)), where f(x,0) = g(x)
f(x,y) = h(x, h(x, h(x, … h(x, g(x)) … )))

A parallel solution makes sense only if the function must be computed many times.

Implementations:
1. Data parallel composition
2. Loop in a serial composition

Data Parallel Composition for the Primitive Recursive Rule

x, Y = {y0, … yn-1} → {f(x,y0), f(x,y1), … f(x,yn-1)}

[figure: n blocks h, the i-th receiving (x, yi) and producing f(x, yi)]

Serial Composition for the Primitive Recursive Rule

[figure: a chain of h blocks with a selector (sel); x and the partial results circulate through the pipe]

Minimalization rule

f(x) = min(y)[m(x,y) = 0]

Implementations:
1. Speculative composition & reduction composition
2. Serial composition & reduction composition

Speculative Composition & Reduction Composition for Minimalization

[figure: x feeds n blocks computing m(x,0), m(x,1), … m(x,n-1) in parallel; block i emits the pair {m(x,i), i}; a reduction block first{0, i} selects the first index i for which m(x,i) = 0, giving f(x) = i]
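The speculative-plus-reduction scheme for minimalization can be mimicked sequentially: evaluate every candidate m(x, i) (speculation), then reduce to the first index whose value is zero. A hedged Python sketch, with a bounded candidate range standing in for the n parallel units:

```python
# Minimalization: f(x) = min{ y | m(x, y) == 0 }, via the slide's
# speculative composition (evaluate all candidates) followed by a
# reduction (first{0, i}: the first index whose result is zero).
def minimalize(m, x, n):
    speculative = [m(x, i) for i in range(n)]  # n units, one per candidate
    for i, v in enumerate(speculative):        # reduction to the first zero
        if v == 0:
            return i
    return None                                # no witness among 0..n-1

# Example: integer square root of 9, as the least y with y*y - x == 0.
root = minimalize(lambda x, y: y * y - x, 9, 8)  # 3
```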

Serial Composition & Reduction Composition for Minimalization

[figure: an example of dynamic reconfiguration – pipe stages Pi-5 … Pi-1, Pi; a selector driven by a selection code (yi-1, yi-2, … yi-s) feeds fi at the i-th pipe stage Pi]

Functional Taxonomy of Parallel Computation
- Data Parallel Computation: uses SIMD-like machines
- Time Parallel Computation: a very special sort of MIMD, used to compute only one function
- Speculative Computation: a MISD-like machine, completely ignored by current implementations

Integral Parallel Architecture

An Integral Parallel Architecture (IPA) uses all kinds of parallelism to build a real machine, in two versions:
- complex IPA: all types of parallel mechanisms tightly interleaved on the same physical structure (pipelined superscalar speculative general-purpose processors)
- intensive IPA: all types of parallel mechanisms strictly separated, implemented on specific physical structures (accelerators for embedded computation in a SoC approach)

Intensive IPA

Intensive IPAs are used as accelerators for complex IPAs.
1. Monolithic intensive IPA: the same machine works in two modes:
   - Data parallel
   - Time parallel
2. Segregated intensive IPA: two distinct machines are used, one for data parallel computation and the other for time parallel (i.e. speculative) computation

The Connex Chip

The organization of the BA1024:
- multi-core area of 4 MIPS processors
- many-core data parallel area of 1024 simple PEs
- speculative time parallel pipe of 8 PEs
- interfaces (DDR, PCI, video & audio interfaces for 2 HDTV channels)

The Connex System
- Connex Array: 1,024 linearly connected 16-bit Processing Cells
- Sequencer (4KB data & 32Kb program memory): 32-bit stack machine; issues in each cycle (on a 2-stage pipe) one 64-bit instruction for the Connex Array and a 24-bit instruction for itself
- IO Controller (4KB data & 4KB program memory): 32-bit stack machine; controls a 3.2 GB/s IO channel
- Processing Cell: integer unit & data memory & Boolean unit
- The I/O channel works in parallel with code running on the Connex Array

[figure: the Connex system – Sequencer, IO Controller and Connex Array; each Processing Cell with a 16-bit ALU, 16-bit data RAM, address/Boolean/index units and registers R0–R7]

Connex Array Structure
- Processing Cells are linearly connected using only the register R0
- The IO Plan consists of all R1 registers, supervised mainly by the IO Controller
- Conditional execution is based on the state of the Boolean unit
- Integer unit, Boolean unit and data memory execute in each cycle command fields from a 64-bit instruction issued by the Sequencer
- Vector reduction operations deliver scalar results in the TOS of the Sequencer (receiving data from the array of cells through a 3-stage pipe)

[figure: adjacent cells, each with a 16-bit ALU and registers R0–R7; individual cells are marked on/off by their Boolean units]

I/O System

[figure: the I/O Plane of the Connex Array is linked through the IOC to a switch fabric (128-bit word), which connects to the IS (interrupts) and the DDR-DRAM controller with external DRAM]

Configurable Switch Fabric

[figure: the BA1024 organized around a configurable switch fabric: the ConnexArray™ programmable media processor (instruction sequencer, I/O sequencer) performing multi-codec processing, pre-analysis, 3D filtering, scaling, video merge/blend and motion-adaptive de-interlacing; four CPUs (Host, Audio, TS/Sec, Video); a stream accelerator; a DDR-DRAM controller (400 MHz data rate); and interfaces: PCI v2.2 or generic 64-bit wide host I/F, audio in/out (1x-I2S, 4x-I2S, S/PDIF), video in/out (BT.656/1120), flash, external bus, EJTAG, GPIO, I2C, test/ICE]

The Connex Architecture
- Vectors & selections
- Programming Connex
- Performances

Vectors & Selections
- Linear array of processing elements → vectors
- Local data memory in each processing element → array of vectors
- Data-dependent operations at the level of each processing element → selections

Full Line Operations

Line k = Line i OP Line j, where OP ∈ {+, -, *, XOR, etc.} is applied element by element on 16-bit data operands
Line k = Line i OP scalar value (the scalar is repeated for all elements)

Columns Active Based On Repeating Patterns

Line k = Line i OP Line j (+, -, *, XOR, etc.) is applied only in the active columns. Mark all odd columns active, or mark every third column active, or mark every third and fourth column active, etc.

Columns Active Based On Data Content

Line k = Line i OP Line j (+, -, *, XOR, etc.) is applied only in the active columns. Apparently random columns are active, marked based on data-dependent results of previous operations.

8x8 Outer-Loop Parallelism

Example: 128 sets of 8x8 blocks run in parallel in a 1024-cell array.

Programming Connex
- VectorC is an extension/restriction of C++
- Code that operates on scalar data is written in regular C notation
- Connex-specific operators are defined as functions for features not available in C++, e.g. operations on vectors and selections (Boolean vectors)
- VectorC uses sequential operators and control structures on vector and select data-types
- Using VectorC, the Connex Machine is programmed the same way as conventional sequential machines

Vectors are arrays of scalar components. Selections are arrays of Boolean values that dictate which vector components are active.

    int main() {
        vector V1 = 2;                  // V1 = {2, 2, … 2}
        vector V2 = 3;                  // V2 = {3, 3, … 3}
        vector V;                       // V = {0, 0, … 0}
        vector Index = indexvector();   // Index = {0, 1, … }
        V = mm_absdiff(V1, V2);         // V = {1, 1, … 1}
        return 0;
    }

    // Find the absolute difference between two vectors
    vector mm_absdiff(vector V1, vector V2) {
        vector V;
        V = V1 - V2;
        WHERE (V < 0) {
            V = -V;                     // V = abs(V)
        } ENDW
        return V;
    }
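The WHERE/ENDW selection above can be mimicked in plain Python with an explicit Boolean mask; this mask-based mm_absdiff is only a model of the VectorC semantics, not runnable Connex code.

```python
# Model of VectorC selections: a Boolean mask marks the active
# components, and the body of WHERE applies only where the mask holds.
def where(mask, fn, vec):
    return [fn(v) if active else v for active, v in zip(mask, vec)]

def mm_absdiff(v1, v2):
    v = [a - b for a, b in zip(v1, v2)]  # V = V1 - V2
    mask = [x < 0 for x in v]            # WHERE (V < 0)
    return where(mask, lambda x: -x, v)  # V = -V on active components

diff = mm_absdiff([2, 2, 2], [3, 1, 3])  # [1, 1, 1]
```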

Overall performances of the BA1024
- 3.2 GB/sec: external bandwidth
- 400 GB/sec: internal bandwidth
- > 60 GOP/Watt
- > 2 GOP/mm²

Note: 1 OP = 16-bit simple integer operation (excluding multiplication)

How to Use the Connex Environment for Membrane Computation

Example (G. Paun) – the initial configuration: [1 [2 [3 a f c ]3 ]2 ]1 …

Rules:
- R1: e → (e, out), f → f
- R2: b → d, d → de, ff → f, cf → cdδ
- R3: a → ab, a → bδ, f → ff

The first example of processing

Initial vector: (1,[) (2,[) (3,[) (0,a) (0,f) (0,c) (3,]) (2,]) (1,]) …

[[[a f c] ] ] …
a → ab, f → ff:      [[[a b f f c] ] ] …               // 11 clock cycles
a → ab, f → ff:      [[[a b b f f f f c] ] ] …         // 15 clock cycles
a → bδ, f → ff:      [[b b b f f f f f f f f c ] ] …   // 27 clock cycles
b → d, ff → f:       [[d d d f f f f c ] ] …           // 10 clock cycles
d → de, ff → f:      [[d e d e d e f f c ] ] …         // 10 clock cycles
d → de, cf → cdδ:    [d e e d e e d e e d f c ] …      // 10 clock cycles
e → (e, out), f → f: [d d d d f c ] e e e e e e …      // 15 clock cycles

total: 98 clock cycles
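One rewriting step of the trace above can be modeled with a multiset and single-object rules; the Counter-based encoding is our own illustration, not the Connex vector representation (which also tracks membrane brackets and cooperative rules like ff → f).

```python
from collections import Counter

# One maximally parallel step for single-object rules such as a -> ab
# and f -> ff: every occurrence of a rewritable object is rewritten,
# and objects with no applicable rule survive unchanged.
def step(multiset, rules):
    out = Counter()
    for obj, count in multiset.items():
        for product in rules.get(obj, [obj]):
            out[product] += count
    return out

# First step of the trace: [a f c] under a -> ab, f -> ff.
m1 = step(Counter({'a': 1, 'f': 1, 'c': 1}),
          {'a': ['a', 'b'], 'f': ['f', 'f']})
# m1 == {'a': 1, 'b': 1, 'f': 2, 'c': 1}
```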

The second example of processing

Initial vector: (1,[) (2,[) (3,[) (1,a) (1,f) (1,c) (3,]) (2,]) (1,]) …

[[[1a 1f 1c] ] ] …
→ [[[1a 1b 2f 1c] ] ] …    // in 5 clock cycles
→ [[[1a 2b 4f 1c] ] ] …    // in 5 clock cycles
→ [[3b 8f 1c ] ] …         // in 10 clock cycles
→ [[3d 4f 1c ] ] …         // in 7 clock cycles
→ [[3d 3e 2f 1c ] ] …      // in 8 clock cycles
→ [4d 3e 1f 1c ] …         // in 8 clock cycles
→ [4d 1f 1c ] 3e …         // in 5 clock cycles

total: 48 clock cycles

The third example of processing

The third membrane is duplicated, but the contents can differ:

[[[1a 1f 1c] [2a 1f 1c] ] ] …
→ [[[1a 1b 2f 1c] [2a 2b 2f 1c] ] ] …   // in 5 clock cycles
→ [[[1a 2b 4f 1c] [2a 4b 4f 1c] ] ] …   // in 5 clock cycles
→ [[3b 8f 1c 6b 8f 1c ] ] …             // in 10 clock cycles
→ [[3d 4f 1c 6d 4f 1c ] ] …             // in 7 clock cycles
→ [[3d 3e 2f 1c 6d 6e 2f 1c ] ] …       // in 8 clock cycles
→ [4d 3e 1f 1c 7d 6e 1f 1c] …           // in 8 clock cycles
→ [4d 1f 1c 7d 1f 1c ] 9e …             // in 10 clock cycles

total: 53 clock cycles

For up to 200 level-3 membranes the number of clock cycles remains 53.

Concluding Remarks
1. Functional taxonomy vs. Flynn taxonomy
2. The Connex architecture accelerates membrane computation
3. An efficient P-architecture asks for a few additional features on top of the Connex architecture
4. Why not a P-language?

Main technical contributors to the Connex project:
- Emanuele Altieri, BrightScale Inc., CA
- Lazar Bivolarski, BrightScale Inc., CA
- Frank Ho, BrightScale Inc., CA
- Mihaela Malita, St. Anselm College, NH
- Bogdan Mitu, BrightScale Inc., CA
- Dominique Thiebaut, Smith College, MA
- Tom Thomson, BrightScale Inc., CA
- Dan Tomescu, BrightScale Inc., CA

Thank You

Mihaela's webpage on VectorC

Q&A