CALTECH cs184c Spring 2001 -- DeHon. CS184c: Computer Architecture [Parallel and Multithreaded], Day 14: May 24, 2001: SCORE.


Presentation transcript:

SCORE

Previously
Interfacing array logic with processors
Single-thread, single-cycle operations
Scaling
–models weak on allowing more active hardware
Can imagine a more general, heterogeneous, concurrent, multithreaded compute model…

Today
SCORE
–scalable compute model
–architecture to support it
–mapping and runtime issues

UCB BRASS RISC+HSRA
Integrate:
–processor
–reconfigurable array
–memory
Key idea:
–best of both worlds: temporal/spatial

Bottom Up
GARP
–interface
–streaming
HSRA
–clocked array block
–scalable network
Embedded DRAM
–high density/bandwidth
–array integration
Good handle on: raw building blocks, tradeoffs

HSRA Architecture (CS184a: Day 16)

Top Down
Questions remained
–How do we control this?
–How do we allow hardware to scale?
What is the higher-level model that
–captures the computation?
–allows scaling?

SCORE
An attempt at defining a computational model for reconfigurable systems
–abstract out physical hardware details, especially size / number of resources, and timing
Goals
–achieve device independence
–approach the density/efficiency of raw hardware
–allow application performance to scale with system resources (without human intervention)

SCORE Basics
Abstract computation is a dataflow graph
–stream links between operators
–dynamic dataflow rates
Allow instantiation/modification/destruction of dataflow during execution
–separate dataflow construction from usage
Break up computation into compute pages
–unit of scheduling and virtualization
–stream links between pages
Runtime management of resources
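The graph-plus-streams structure can be sketched in plain Python (names here are invented for illustration; SCORE programs are actually written in TDF, not Python): operators connected by unbounded FIFO streams, each firing only when its inputs have data.

```python
from collections import deque

class Stream:
    """Unbounded FIFO link between operators (an SFIFO)."""
    def __init__(self):
        self.q = deque()
    def write(self, v): self.q.append(v)
    def read(self): return self.q.popleft()
    def ready(self): return len(self.q) > 0

class Operator:
    """A dataflow operator: fires when all its inputs have data."""
    def __init__(self, fn, ins, outs):
        self.fn, self.ins, self.outs = fn, ins, outs
    def try_fire(self):
        if all(s.ready() for s in self.ins):
            results = self.fn(*(s.read() for s in self.ins))
            for s, v in zip(self.outs, results):
                s.write(v)
            return True
        return False

# Build a tiny graph computing (a + b) * 2 over streams.
a, b, s1, out = Stream(), Stream(), Stream(), Stream()
add = Operator(lambda x, y: (x + y,), [a, b], [s1])
dbl = Operator(lambda x: (2 * x,), [s1], [out])

a.write(3); b.write(4)
while add.try_fire() or dbl.try_fire():  # run until quiescent
    pass
result = out.read()
print(result)  # 14
```

In the real model the graph can also be grown or rewired at runtime (by an STM); this sketch shows only the static firing discipline.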

Stream Links
Sequence of data flowing between operators
–e.g. a vector, list, or image
All elements of a stream share the same
–source
–destination
–processing

Virtual Hardware Model
Dataflow graph is arbitrarily large
Hardware has finite resources
–resources vary from implementation to implementation
Dataflow graph must be scheduled onto the hardware
Must happen automatically (in software)
–physical resources are abstracted in the compute model

Example

Ex: Serial Implementation

Ex: Spatial Implementation

Compute Model Primitives
SFSM
–FA with stream inputs
–each state: a required input set
STM
–may create any of these nodes
SFIFO
–unbounded
–abstracts delay between operators
SMEM
–single owner (user)

SFSM
Model view for an operator or compute page
–e.g. FIR, FFT, Huffman encoder, downsampler
Less powerful than an arbitrary software process
–bounded physical resources (no dynamic allocation)
–only interface to state is through streams
More powerful than an SDF operator
–dynamic input and output rates
–dynamic flow rates

SFSM
Operators are FSMs, not just dataflow graphs
Variable-rate inputs
–FSM state indicates the set of inputs required to fire
Lesson from hybrid dataflow
–control flow is cheaper when the successor is known
DF graph of operators gives task-level parallelism
–GARP and C models are all just one big TM
Gives the programmer the convenience of writing familiar code for an operator
–use well-known techniques in translation to extract ILP within an operator

STM
Abstraction of a process running on the sequential processor
Interfaced to the graph like an SFSM
More restricted/stylized than threads
–cannot side-effect shared state arbitrarily
–stream discipline for data transfer
–single-owner memory discipline

STM
Adds power to allocate memory
–can give it to SFSM graphs
Adds power to create and modify the SCORE graph
–abstraction for allowing the logical computation to evolve and reconfigure
–Note: this differs from the physical reconfiguration of hardware, which happens below the model of computation and is invisible to the programmer, since it is hardware dependent

Model Consistent Across Levels
Abstract computational model
–think about at a high level
Programming model
–what the programmer thinks about
–no visible size limits
–concretized in a language, e.g. TDF
Execution model
–what the hardware runs
–adds fixed-size hardware pages
–primitive/kernel operations (e.g. ISA)

Architecture. Lead: Randy Huang

Processor Architecture for SCORE (block diagram): a processor (I-cache, D-cache, GPRs) coupled to an array of compute pages (CP) and configurable memory blocks (CMB); a global controller maps stream IDs (SID) and process IDs (PID) to locations; memory & DMA controllers carry data and address/control; interfaces shown include the processor-to-array, compute-page, and configurable-memory-block interfaces.

Processor ISA-Level Operations
User operations
–Stream write: STRMWR Rstrm, Rdata
–Stream read: STRMRD Rstrm, Rdata
Kernel operations (not visible to users)
–{Start,stop} {CP,CMB,IPSB}
–{Load,store} {CP,CMB,IPSB} {config,state,FIFO}
–Transfer {to,from} main memory
–Get {array processor, compute page} status
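The user-visible half of the ISA can be modelled as two register operations over named stream channels (a sketch only; the class and method names are invented, and the real processor stalls in hardware rather than returning a flag): STRMWR enqueues a data register's value onto the stream named by a stream register, and STRMRD dequeues from it, stalling when the stream is empty.

```python
from collections import deque

class StreamRegs:
    """Sketch of STRMWR/STRMRD semantics: stream registers name
    connections; a read of an empty stream stalls the processor."""
    def __init__(self, n_streams):
        self.chan = [deque() for _ in range(n_streams)]

    def strmwr(self, rstrm, value):      # STRMWR Rstrm, Rdata
        self.chan[rstrm].append(value)   # single cycle once resident

    def strmrd(self, rstrm):             # STRMRD Rstrm, Rdata
        if not self.chan[rstrm]:
            return None, True            # (no data, would stall)
        return self.chan[rstrm].popleft(), False

sr = StreamRegs(4)
sr.strmwr(2, 99)
val, stall = sr.strmrd(2)
print(val, stall)          # 99 False
_, stall2 = sr.strmrd(2)   # stream now empty
print(stall2)              # True
```

The point of the next slide follows directly from this shape: once the connection is resident, a send or receive is one register operation, with no per-message packet formatting.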

Communication Overhead
Note: a single cycle to send/receive data
No packet/communication overhead
–once a connection is set up and resident
Contrast with the MP machines and network interfaces we saw earlier

SCORE Graph on Hardware
One master application graph
Operators run on the processor and the array
Communicate directly amongst themselves

SCORE OS: Reconfiguration
Array managed by the OS
Only the OS can manipulate the array configuration

SCORE OS: Allocation
Allocation goes through the OS
–similar to sbrk in a conventional API

Performance Scaling: JPEG Encoder


Page Generation (work in progress). Eylon Caspi, Laura Pozzi

SCORE Compilation in a Nutshell
Programming model: graph of TDF FSMD operators
–unlimited size and number of IOs
–no timing constraints
Execution model: graph of page configurations
–fixed size and number of IOs
–timed, single-cycle firing
Compilation maps TDF operators, streams, and memory segments onto compute pages, streams, and memory segments.

How Big is an Operator? (operator-size data over benchmarks: Wavelet Encode/Decode, JPEG Encode/Decode, MPEG Encode (I and P frames), IIR)

Unique Synthesis / Partitioning Problem
Inter-page stream delay is not known by the compiler, due to:
–hardware implementation
–page placement
–virtualization
–data-dependent token emission rates
Partitioning must retain the stream abstraction
–also gives us freedom in timing
Synchronous array hardware

Clustering is Critical
Inter-page communication latency may be long
Inter-page feedback loops are slow
Cluster to:
–fit feedback loops within a page
–fit feedback loops on the device

Pipeline Extraction
Hoist uncontrolled feed-forward dataflow out of the FSMD
Benefits:
–shrinks the cyclic FSM core
–the extracted pipeline has more freedom for scheduling and partitioning
Example: state foo(i): acc = acc + 2*i becomes state foo(two_i): acc = acc + two_i, with the *2 multiply hoisted out of the control-flow core into a feed-forward dataflow pipeline.
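The slide's example can be written out as a source-to-source sketch (Python standing in for the TDF transformation; the function names are invented): the stateless *2 multiply is hoisted into a separate feed-forward stage, so the stateful cyclic core that remains in the FSMD is smaller.

```python
# Before extraction: the FSMD both multiplies and accumulates.
def fsmd_before(istream):
    acc = 0
    for i in istream:
        acc = acc + 2 * i      # state foo(i): acc = acc + 2*i
    return acc

# After extraction: the uncontrolled *2 dataflow becomes a
# feed-forward pipeline stage streaming into the FSMD.
def pipeline(istream):
    for i in istream:
        yield 2 * i            # extracted stage: two_i = 2*i

def fsmd_after(two_i_stream):
    acc = 0
    for two_i in two_i_stream:
        acc = acc + two_i      # state foo(two_i): acc = acc + two_i
    return acc

data = [1, 2, 3]
before = fsmd_before(data)
after = fsmd_after(pipeline(data))
print(before, after)  # 12 12
```

Behavior is unchanged, but the extracted stage has no feedback, so the partitioner may place and schedule it independently of the FSM's cyclic core.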

Pipeline Extraction: Extractable Area (chart over benchmarks: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR)

Page Generation
Pipeline extraction
–removes the dataflow we can freely extract from FSMD control
Still have to partition potentially large FSMs
–approach: turn it into a clustering problem

State Clustering
Start: consider each state to be a unit
Cluster states into page-size sub-FSMDs
–inter-page transitions become streams
Possible clustering goals:
–minimize delay (inter-page latency)
–minimize IO (inter-page bandwidth)
–minimize area (fragmentation)

State Clustering to Minimize Inter-Page State Transfer
Inter-page state transfer is slow
Cluster to:
–contain feedback loops
–minimize the frequency of inter-page state transfer
Similar techniques previously used in:
–VLIW trace scheduling [Fisher '81]
–FSM decomposition for low power [Benini/De Micheli ISCAS '98]
–VM/cache code placement
–GarpCC code selection [Callahan '00]
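One way to see the clustering problem concretely (an illustrative greedy sketch under assumed inputs, not the actual BRASS algorithm): pack FSM states into fixed-capacity pages, merging along the most frequent transitions first, so that hot loops stay inside one page and rarely cross a page boundary.

```python
def cluster_states(areas, edges, capacity):
    """Greedy state clustering.
    areas: state -> area; edges: (src, dst) -> transition frequency.
    Merge along the most frequent transitions first, subject to a
    page area capacity; returns state -> page id."""
    page = {s: s for s in areas}           # each state starts alone
    size = dict(areas)
    for (s, t), _freq in sorted(edges.items(), key=lambda e: -e[1]):
        ps, pt = page[s], page[t]
        if ps != pt and size[ps] + size[pt] <= capacity:
            for v in page:                 # merge page pt into ps
                if page[v] == pt:
                    page[v] = ps
            size[ps] += size.pop(pt)
    return page

# A <-> B is a hot feedback loop; B -> C is a rare transition.
areas = {"A": 2, "B": 2, "C": 3}
edges = {("A", "B"): 100, ("B", "C"): 5}
pages = cluster_states(areas, edges, capacity=4)
print(pages["A"] == pages["B"], pages["A"] == pages["C"])  # True False
```

The hot A/B loop lands on one page, so its state transfer is intra-page; only the infrequent transition to C becomes an inter-page stream.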

Scheduling (work in progress). Lead: Yury Markovskiy

Scheduling
Time-multiplex the operators onto the hardware
To exploit scaling:
–page capacity is a late-bound parameter
–cannot do scheduling at compile time
To exploit dynamic data:
–want to look at application and data characteristics

Scheduling: First Try (Fully Dynamic)
Time-sliced
List-scheduling based
Very expensive:
–on the order of 100,000 cycles
–to schedule 30 virtual pages
–onto 10 physical pages
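The cost structure of that first try can be caricatured as follows (an illustrative sketch only; the function, parameters, and readiness rule are assumptions, not the Berkeley implementation): every timeslice, a list scheduler fills the physical array with ready virtual pages and pays a fixed scheduling/reconfiguration overhead before any useful work runs.

```python
def timeslice_schedule(ready_counts, phys_pages, overhead, slice_len):
    """Time-sliced scheduling cost model.
    ready_counts[t] = number of virtual pages with input data in
    timeslice t.  Each non-empty timeslice pays a fixed scheduling/
    reconfiguration overhead plus the useful slice itself."""
    total = 0
    for ready in ready_counts:
        resident = min(ready, phys_pages)  # list scheduler fills array
        if resident:
            total += overhead + slice_len
    return total

# 30 virtual pages on 10 physical pages: at least 3 timeslices per
# pass, each paying the (large, assumed) per-slice overhead.
cycles = timeslice_schedule([10, 10, 10], phys_pages=10,
                            overhead=100_000, slice_len=25_000)
print(cycles)  # 375000
```

With per-slice overhead this large relative to the slice itself, scheduling dominates runtime, which motivates the move toward static, load-time scheduling in the slides that follow.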

Overhead Effects

Overhead Costs

Scheduling: Why Different, Challenging
Distributed memory vs. uniform memory
–placement/shuffling matters
Multiple memory ports
–increase bandwidth
–fixed limit on the number of ports available
Schedule subgraphs
–reduce latency and memory

Scheduling: Taxonomy (How Dynamic?)
Where does the static/dynamic boundary lie for each of:
–placement
–sequence
–rate
–timing

Dynamic → Load-Time Scheduling (dynamic scheduler vs. static scheduler)

Static Scheduler Overhead

Static Scheduler Performance

Anomalies and How Dynamic?
Anomalies on the previous graph
–early stall on stream data
–arise from assuming a fixed-timeslice model
Solve by
–dynamic epoch termination
–detecting when it is appropriate to advance the schedule

Static Scheduler w/ Early Stall Detection

More Heterogeneous Programmable SoC

Broader Programmable SoC Applicability
Model potentially valuable beyond a homogeneous array
Already introduced the idea of different page types

Heterogeneous Pages
Small conceptual step to generalize:
–memory (CMB)
–processor
–FPGA (vary granularity, vary depth)
–IO
–custom (e.g. FPU)

Summary
Advantage and value for programmable spatial computing components
Need a compute model
–to permit device scaling
–while preserving human effort
SCORE model captures the parallelism and freedom in these applications
Believe it can be efficient
Starting to get a handle on hardware/compiler/runtime support

Additional Information
SCORE:
–
–especially see the “Introduction and Tutorial”
CALTECH:
–

Big Ideas
Model
–basis for virtualization
–basis for scaling
–allows common-case optimizations
–supports the kinds of computations which exploit this architecture: spatial composition of computing blocks

Big Ideas
Expose parallelism
–hidden by sequential control flow in ISA-based models
Communication to an operator
–not to a resource (à la GARP)
Support spatial composition
–contrast: sequential composition in an ISA
Data presence [self-timed!]
–tolerant to timing and resource variations

Big Ideas
Persistent dataflow
–separate creation and use
–use many times (amortize the cost of creation)
Persistent communication
–separate setup/allocation from use
–amortize out the cost of routing/negotiation/setup