CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.

Previously
Homogeneous model of computational array
–single word granularity, depth, interconnect
–all post-fabrication programmable
Understand tradeoffs of each

Today
Heterogeneous architectures
–Why?
–How?
  catalog of techniques
  fit in framework
  optimization and mapping

Why?
Why would we be interested in a heterogeneous architecture?
–E.g.

Why?
Applications have a mix of characteristics.
Already accepted:
–seldom can afford to build the most general (unstructured) array
  (bit-level, deep context, p=1)
–=> we are already picking some structure to exploit
May be beneficial to have portions of the computation optimized for different structure conditions.

Examples
Processor + FPGA
Processor or FPGA adds:
–multiplier or MAC unit
–FPU
–motion estimation coprocessor

Optimization Prospect
The composite takes less area-time capacity than either pure architecture
(Ai = area of resource type i, Ti = task run time on that resource alone, T12 = run time on the composite):
–(A1 + A2)·T12 < A1·T1
–(A1 + A2)·T12 < A2·T2

Optimization Prospect Example
Floating point:
–Task: I integer ops + F FP-ADDs
–Aproc = 125 Mλ²
–AFPU = 40 Mλ²
–cycles per FP op on the processor alone = 60
–FPU wins when: 125(I + 60F) ≥ 165(I + F)
  break-even: (7500 − 165)/40 = I/F
  183 ≥ I/F
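To make the arithmetic concrete, here is a minimal Python sketch of the break-even calculation above, using the slide's example numbers (125 Mλ² processor, 40 Mλ² FPU, 60 cycles per emulated FP op); the function name is ours, for illustration only:

# Break-even analysis for adding an FPU to a processor (slide's numbers).
# Capacity is area x time: a smaller product means better use of silicon.
A_PROC = 125        # processor area, Mlambda^2
A_FPU = 40          # FPU area, Mlambda^2
SW_FP_CYCLES = 60   # cycles to emulate one FP op on the processor alone

def fpu_wins(I, F):
    """True if the processor+FPU composite has lower area-time capacity
    than the processor alone on I integer ops and F floating-point ops."""
    alone = A_PROC * (I + SW_FP_CYCLES * F)   # 125(I + 60F)
    composite = (A_PROC + A_FPU) * (I + F)    # 165(I + F)
    return composite < alone

# The FPU pays off whenever I/F < (125*60 - 165)/40 ~= 183:
print(fpu_wins(I=100, F=1))    # True: 1 FP op per 100 integer ops
print(fpu_wins(I=1000, F=1))   # False: FP too rare to justify the area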

How?
Design issues:
–Interconnect: space and time
–Control
–Instructions: configuration path and control
–Mapping
Cost/Benefits:
–Costs: area, power
–Performance: bandwidth, latency

Interconnect
–Bus (degenerate network)
–Memory (shared retiming resource)
–RF/Coproc (traditional processor interface)
–Network

Interconnect: Bus
Minimal physical network
–shared with memory and/or other peripherals
–10s-100s of cycles away from the processor (FPGA)
–low-to-moderate bandwidth
–can handle multiple, different functional units
  but the serial bottleneck of the bus prevents simultaneous communication among devices

Interconnect: Bus Example XC6200

Interconnect: Memory
Use a memory (retiming) block to buffer data between heterogeneous regions:
–DMA (usually implies a shared bus)
–FIFO
–dual-port or shared RAM
Decoupled, moderate latency (10-100 cycles), moderate bandwidth.
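As an illustration of the FIFO-buffered coupling (a sketch, not any particular system's API), the following Python fragment decouples a processor thread from an accelerator thread through a bounded queue standing in for the shared retiming buffer:

# Illustrative only: a FIFO decouples producer (processor) from consumer
# (accelerator); each side runs at its own rate, the buffer absorbs latency.
import threading, queue

fifo = queue.Queue(maxsize=64)   # bounded buffer = shared retiming resource

def processor():
    for block in range(8):
        fifo.put(("data", block))    # blocks if the FIFO is full
    fifo.put(("done", None))

def accelerator():
    while True:
        tag, block = fifo.get()      # blocks while the FIFO is empty
        if tag == "done":
            break
        print("accelerator processing block", block)

t1 = threading.Thread(target=processor)
t2 = threading.Thread(target=accelerator)
t1.start(); t2.start()
t1.join(); t2.join()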

Interconnect: Memory Examples
PAM, SPLASH

Interconnect: RF/Coproc
Coupled directly to the processor datapath:
–low latency (1-2 cycles)
–moderately high bandwidth, limited by RF ports and control

Interconnect: RF/Coproc Examples
GARP, Chimaera
–(more on this case Thursday)

Interconnect: Network
A unified spatial network composing the various heterogeneous components:
–high bandwidth
–latency varies with distance
–supports simultaneous operation and data transfer
–potentially dominant cost: A_interconnect > A_function
–granularity question:
  coarse (large blocks of each type)
  fine (interleaved)

Interconnect: Network (Coarse) Examples
Cheops, Pleiades

HSRA Heterogeneous Blocks

Interconnect: Network, Coarse vs. Fine
Multiplier/FPGA example

Interconnect: Network, Coarse vs. Fine
Fine:
–possibly share interconnect
–locality
–uniform tiling
–if interconnect not shared, may get concentrations of heavily-used/unused interconnect
–limits use as independent resources
–ratio less flexible?
–more difficult design
Coarse:
–flexible ratio
–easier to keep dense homogeneous blocks
–requires its own interconnect
–doesn't disrupt base layout(s)
–non-local route to/from: more/longer wires
–boundaries in the net

Admin
For POWER: update on
–rcore simulation
–HSRA energy
–Jsim size problems? Fix in the works.

Control
As before:
–How many controllers?
–How many pinsts slaved off of each?
Common classes:
–Single controller / lock-step
–Decoupled, datastream
–Autonomous MIMD

Control: Lockstep
Master controller (usually the processor)
–issues an instruction (instruction tag) every cycle
  explicitly determines when the device operates
–Single thread of control
  everything known to be in sync
–Idle while the processor does other tasks
–Ex. VLIW (TriMedia), PRISC, GARP
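A toy model of the lockstep discipline, with hypothetical context names assumed for illustration: the master issues exactly one context tag per cycle, and the slaved array does nothing it was not explicitly told to do:

# Illustrative lockstep model: one master, one slaved array; the array
# executes only the context tag issued this cycle, so there is a single
# thread of control and everything stays in sync.
array_contexts = {
    "ctx0": lambda state: state,        # idle context
    "ctx1": lambda state: state + 1,    # some configured datapath op
    "ctx2": lambda state: state * 2,
}

def run_lockstep(schedule, state=0):
    """The master's per-cycle schedule names exactly which context the
    array executes; nothing runs unless the master issues it."""
    for cycle, tag in enumerate(schedule):
        state = array_contexts[tag](state)
        print(f"cycle {cycle}: issued {tag}, state = {state}")
    return state

run_lockstep(["ctx1", "ctx1", "ctx2", "ctx0"])   # ((0+1)+1)*2 = 4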

Control: Data Stream
Configure, then run on data decoupled from the control processor
–runs in parallel with the processor
  processor runs orthogonal tasks
  maybe several simultaneous tasks running spatially
–unit not typically fed by the processor directly
–need to synchronize data transfer and operation:
  polling, interrupt, semaphore
–Ex. Cheops, PADDI-2, Pleiades, SPLASH
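An illustrative sketch of the datastream discipline (computation and names are hypothetical): the processor launches the configured unit on a block of data, runs an orthogonal task, and synchronizes on a completion flag:

# Illustrative only: the spatial unit runs decoupled; the processor
# synchronizes on completion via an Event (standing in for a semaphore
# or interrupt) before consuming the result.
import threading

done = threading.Event()
result = []

def spatial_unit(data):
    result.append(sum(x * x for x in data))  # the configured computation
    done.set()                               # signal completion

data = list(range(1000))
threading.Thread(target=spatial_unit, args=(data,)).start()

orthogonal = sorted(range(100), reverse=True)  # processor's own, unrelated work

done.wait()                                  # synchronize before using result
print("unit produced", result[0])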

Control: Autonomous (MIMD)
Multiple (potential) control processors
–not necessarily slaved
–distributed control
–more care needed in synchronization
–Ex. Floe (MagicEight)

HSRA Multi-Hetero
Coupling
–unifying networks
Balance
–sequential/spatial
–control units w/ management task

Configuration
Share interface bus:
–config and data: XC6200, PAM, SPLASH
–config and memory path: GARP
Separate path/network:
–VLIW, Pleiades
Explicit:
–XC6200, PAM, SPLASH, ...
Implicit:
–GARP/PRISC

Mapping
–often an option as to where a computation runs
–must sort out what goes where
  faster on one resource?
  ...but limited number of each resource

Mapping: Limited Resource
What runs on the faster, limited resource?
–E.g. Tim's C extraction last time
–General: what is allocated to the resource between reconfigurations
  N candidate ops -> each a choice
–Greedy:
  break into temporal regions
  –local working set and points of reconfiguration
  while resource available:
  –add the op offering the most benefit
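A minimal sketch of the greedy allocation within one temporal region; the ops and their benefit/size numbers are hypothetical, assumed for illustration:

# Greedy allocation of ops to a limited fast resource, per temporal region:
# while capacity remains, take the candidate whose benefit is largest.
def greedy_allocate(candidates, capacity):
    """candidates: list of (op_name, benefit, size); returns the ops mapped
    to the fast resource within this temporal region."""
    chosen, used = [], 0
    for name, benefit, size in sorted(candidates, key=lambda c: -c[1]):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

region = [("mul", 50, 4), ("filter", 80, 6), ("crc", 20, 2), ("fft", 90, 8)]
print(greedy_allocate(region, capacity=12))   # ['fft', 'mul']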

Mapping: Spatial Choice
Different kinds of resources
–(e.g. LUTs, multipliers)
Multiple resources can solve the same problem
Limited number of each resource
Match users with resources

Mapping: Bipartite Partitioning
=> Bipartite matching
–deals with unit resource consumption
–also with regional/interconnect constraints
–does not directly deal with performance...
  post-pass(?) allocate faster resources to the critical path?
  N of R1 vs. M of R2?
Example/Details: Liu FPGA'98
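The matching itself can be illustrated with a standard augmenting-path sketch; the op/resource data here is hypothetical, and Liu FPGA'98 adds the regional/interconnect constraints this toy version omits:

# Minimal bipartite matching: each op lists the resource instances that can
# implement it (unit consumption); augmenting paths build a maximum matching.
def bipartite_match(compat):
    """compat: {op: [resource, ...]}. Returns {resource: op}."""
    match = {}
    def try_assign(op, seen):
        for r in compat[op]:
            if r in seen:
                continue
            seen.add(r)
            # take r if free, or if its current op can be moved elsewhere
            if r not in match or try_assign(match[r], seen):
                match[r] = op
                return True
        return False
    for op in compat:
        try_assign(op, set())
    return match

ops = {"add1": ["lut0", "lut1"], "mul1": ["mult0"], "mul2": ["mult0", "lut1"]}
print(bipartite_match(ops))  # {'lut0': 'add1', 'mult0': 'mul1', 'lut1': 'mul2'}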

Mapping
More common case:
–can solve with: 12 A's and 2 B's, or 4 A's and 4 B's
–common need: 4 A's and 2 B's
–choice: 8 A's vs. 2 B's
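Since the two implementations share the common need of 4 A's and 2 B's, the choice reduces to the incremental cost of 8 A's versus 2 B's; a tiny sketch, with unit areas assumed purely for illustration:

# The slide's choice reduces to incremental cost beyond the common need:
# alternative 1 additionally uses 8 A's, alternative 2 additionally uses
# 2 B's. Which is cheaper depends on the (hypothetical) per-unit areas.
AREA_A, AREA_B = 1.0, 5.0   # assumed unit areas, illustrative only

alt1 = {"A": 12, "B": 2}
alt2 = {"A": 4, "B": 4}
common = {"A": min(alt1["A"], alt2["A"]), "B": min(alt1["B"], alt2["B"])}

def extra_cost(alt):
    return (alt["A"] - common["A"]) * AREA_A + (alt["B"] - common["B"]) * AREA_B

print("extra for 12A+2B:", extra_cost(alt1))   # 8 A's -> 8.0
print("extra for 4A+4B:", extra_cost(alt2))    # 2 B's -> 10.0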

Highlights
Fits into the existing framework
–not that much new here
–new issue: who shares resources, and how
Issues: interconnect, control
+ density when the balance is hit
- efficiency when the balance is mismatched
- harder mapping (resource sharing)