CS184b: Computer Architecture (Abstractions and Optimizations)

Presentation transcript:

CS184b: Computer Architecture (Abstractions and Optimizations) Day 19: May 20, 2003 SCORE Caltech CS184 Spring2003 -- DeHon

Previously: interfacing compute blocks with processors; reconfigurable, specialized; single-thread, single-cycle operations. The scaling models were weak on allowing more active hardware. We can imagine a more general, heterogeneous, concurrent, multithreaded compute model…

Today: SCORE, a scalable compute model; an architecture to support mapping it; and runtime issues.

Processor + Reconfigurable Array. Integrate processor, reconfigurable array, and memory. Key idea: the best of both worlds, temporal and spatial. Since bit-level spatial architectures are complementary to word-level temporal architectures, one focus of the BRASS project is to combine these two architectures to build a robust system device. Central to this effort is the development of a compute model for reconfigurable computing that allows us to virtualize the physical array resources and then schedule their use at run time.

Bottom up: GARP, HSRA, embedded DRAM. Interface streaming; clocked array blocks; a scalable network; embedded DRAM for high-density, high-bandwidth array integration. We have a good handle on the raw building blocks and their tradeoffs.

Top down: questions remained. What is the higher-level model? How do we control this, and how do we allow the hardware to scale? What higher-level model captures the computation and allows scaling?

SCORE: an attempt at defining a computational model for reconfigurable systems that abstracts out physical hardware details, especially the size and number of resources, and timing. Goals: achieve device independence, approach the density/efficiency of raw hardware, and allow application performance to scale with system resources (without human intervention).

SCORE Basics. The abstract computation is a dataflow graph: persistent stream links between operators, with dynamic dataflow rates. Dataflow may be instantiated, modified, and destroyed during execution, separating dataflow construction from usage (compare TAM dataflow unfolding). The computation is broken up into compute pages, the unit of scheduling and virtualization, with stream links between pages. Resources are managed at run time.
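
To make the graph-of-operators idea concrete, here is a minimal Python sketch (all names hypothetical, not the SCORE API): operators joined by persistent stream FIFOs, fired by a trivial scheduler until the streams drain.

```python
from collections import deque

class Stream:
    """Persistent, conceptually unbounded FIFO link between operators."""
    def __init__(self):
        self.q = deque()
    def put(self, x): self.q.append(x)
    def get(self): return self.q.popleft()
    def ready(self): return len(self.q) > 0

class Operator:
    """A compute-page operator: fires when all its required inputs are present."""
    def __init__(self, fn, ins, outs):
        self.fn, self.ins, self.outs = fn, ins, outs
    def try_fire(self):
        if all(s.ready() for s in self.ins):
            results = self.fn(*[s.get() for s in self.ins])
            for s, v in zip(self.outs, results):
                s.put(v)
            return True
        return False

# Build a tiny graph: scale -> add1, linked by persistent streams.
a, b, c = Stream(), Stream(), Stream()
scale = Operator(lambda x: (2 * x,), [a], [b])
add1  = Operator(lambda x: (x + 1,), [b], [c])

for x in [1, 2, 3]:
    a.put(x)
# A trivial scheduler: keep firing any ready operator until nothing can fire.
while any(op.try_fire() for op in (scale, add1)):
    pass
print(list(c.q))  # [3, 5, 7]
```

Note how the graph structure (who talks to whom) is set up once and then reused for every token, which is the point of making the stream links persistent.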

Stream Links. A stream link carries a sequence of data flowing between operators, e.g. the elements of a vector, list, or image, all with the same source, the same destination, and the same processing.

Virtual Hardware Model. The dataflow graph is arbitrarily large (remember the "0, 1, infinity" design rule). Hardware has finite resources, and the resources vary from implementation to implementation, so the dataflow graph must be scheduled onto the hardware. This must happen automatically (in software), since the physical resources are abstracted away in the compute model.

Example

Ex: Serial Implementation

Ex: Spatial Implementation

Compute Model Primitives. SFSM: a finite automaton with stream inputs; each state names its required input set. STM: may create any of these nodes. SFIFO: unbounded; abstracts the delay between operators. SMEM: single owner (user).
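
A hedged sketch of the SFSM primitive, assuming a simplified dict-based API (the real model is expressed in TDF, not Python): each state names the inputs it must see before it can fire, and the state's action chooses the successor.

```python
class SFSM:
    """Streaming FSM sketch: each state lists the inputs required to fire."""
    def __init__(self, states, start):
        self.states = states          # state -> (required input names, action)
        self.state = start
    def step(self, streams):
        required, action = self.states[self.state]
        if not all(streams[name] for name in required):
            return False              # blocked: required inputs not present
        args = {name: streams[name].pop(0) for name in required}
        self.state = action(args)     # action consumes tokens, returns next state
        return True

out = []
# A merge-like SFSM: alternate between reading stream 'a' and stream 'b'.
fsm = SFSM({
    "readA": (["a"], lambda v: (out.append(v["a"]), "readB")[1]),
    "readB": (["b"], lambda v: (out.append(v["b"]), "readA")[1]),
}, "readA")

streams = {"a": [1, 3], "b": [2, 4]}
while fsm.step(streams):
    pass
print(out)  # [1, 2, 3, 4]
```

The per-state required input set is what distinguishes an SFSM from a pure SDF operator: the firing rule can change dynamically with the state.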

SFSM. The model view for an operator or compute page (e.g. FIR, FFT, Huffman encoder, downsample). It is less powerful than an arbitrary software process in a multithreaded model: bounded physical resources (no dynamic allocation), and the only interface to its state is through streams. It is more powerful than an SDF operator: dynamic input and output rates, and dynamic flow rates.

SFSM. Operators are FSMs, not just dataflow graphs. Variable-rate inputs: the FSM state indicates the set of inputs required to fire. The lesson from hybrid dataflow is that control flow is cheaper when the successor is known. The dataflow graph of operators gives task-level parallelism; the GARP and C models are all just one big TM. This gives the programmer the convenience of writing familiar code for an operator, while letting us use well-known translation techniques to extract ILP within an operator.

STM. An abstraction of a process running on the sequential processor, interfaced to the graph like an SFSM. It is more restricted and stylized than threads: it cannot side-effect shared state arbitrarily, and it follows the stream discipline for data transfer and the single-owner memory discipline, so the computation remains deterministic.

STM. Adds the power to allocate memory (which it can give to SFSM graphs) and the power to create and modify the SCORE graph, an abstraction for allowing the logical computation to evolve and reconfigure. Note that this is different from physical reconfiguration of the hardware, which happens below the model of computation and is invisible to the programmer, since it is hardware dependent.

Model consistent across levels. Abstract computational model: what we think about at a high level. Programming model: what the programmer thinks about; no visible size limits; concretized in a language, e.g. TDF. Execution model: what the hardware runs; adds fixed-size hardware pages and primitive/kernel operations (analogous to an ISA).

Architecture Lead: Randy Huang

SCORE Processor Architecture. [Block diagram: processor with I-cache and D-cache; global controller (SID, PID, location, process ID); memory and DMA controller; an array of compute pages (CP) and configurable memory blocks (CMB); processor-to-array, compute-page, and configurable-memory-block interfaces; streams carrying instruction, stream IDs, and data.]

In the past, we have told you that the SCORE processor is going to be a hybrid processor. It includes a microprocessor core to handle coarse-grained, irregular computation and a reconfigurable array to handle regular, fine-grained computation. The processor core has the standard instruction and data caches, and the reconfigurable array contains pairs of compute pages and configurable memory blocks, connected together hierarchically.

Today I am going to tell you the things you have not seen. I will start by describing the compute-page interface and how it connects to the inter-page stream network. Next I will describe the CMB interface. Finally, I will give a bit more detail about how the processor and the array communicate with each other, and touch briefly on how the array communicates with primary memory.

Before I start, a few things to point out: (1) the SCORE processor includes a conventional memory controller plus a DMA controller, so we maintain the traditional memory hierarchy; (2) the reconfigurable array can talk to the processor and main memory simultaneously, and since everything is on the same die, we are no longer limited by pin count; (3) the reconfigurable array is controlled by the global controller (the yellow block); (4) the green boxes are the inter-page switch boxes, and the yellow box inside each block represents the interface and controller; (5) array-to-memory operations will not trash the processor caches.

Processor ISA-Level Operations. User operations: stream write (STRMWR Rstrm, Rdata) and stream read (STRMRD Rstrm, Rdata). Kernel operations (not visible to users): {start, stop} {CP, CMB, IPSB}; {load, store} {CP, CMB, IPSB} {config, state, FIFO}; transfer {to, from} main memory; get {array processor, compute page} status.

There are two types of operations the top-level interfaces have to support. One is the kernel operations: operations like start page, stop page, and load page configuration, which control the reconfigurable array. These have to be kernel-level, or else one user application could ruin other user applications. The other type is the user operations, which let user applications communicate with the array; examples are stream write, stream read, and the stream EOS operation.
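
A hedged software model of the two user-level stream instructions (names from the slide; everything else, including the stream-register file, is an illustrative assumption):

```python
from collections import deque

# Hypothetical model: Rstrm selects a stream register, Rdata is the payload.
stream_regs = {0: deque(), 1: deque()}   # stream-register file

def STRMWR(rstrm, data):
    """Stream write: a single-cycle enqueue, no packetization overhead."""
    stream_regs[rstrm].append(data)

def STRMRD(rstrm):
    """Stream read: the real processor would stall if the stream is empty."""
    if not stream_regs[rstrm]:
        raise RuntimeError("stream empty: processor would stall here")
    return stream_regs[rstrm].popleft()

# Processor feeds the array on stream 0; the array's result returns on stream 1.
STRMWR(0, 42)
stream_regs[1].append(STRMRD(0) * 2)     # stand-in for the array computation
print(STRMRD(1))  # 84
```

The point of the sketch is the communication-overhead claim on the next slide: once a stream connection is set up and resident, a transfer is one register-to-register instruction, not a message-passing call.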

Communication Overhead. Note that it takes a single processor cycle to send or receive data; there is no packet/communication overhead once a connection is set up and resident (contrast with the MP machines and network interfaces we saw earlier). Once persistent streams are in the model, we can build an SLB to perform the mapping…

SCORE Graph on Hardware. One master application graph. Operators run on the processor and on the array and communicate directly amongst themselves; the OS does not have to touch each byte.

SCORE OS: Reconfiguration. The array is managed by the OS; only the OS can manipulate the array configuration.

SCORE OS: Allocation. Allocation goes through the OS, similar to sbrk in a conventional API.

Performance Scaling: JPEG Encoder

Performance Scaling: JPEG Encoder

Page Generation (work in progress) Eylon Caspi, Laura Pozzi

SCORE Compilation in a Nutshell. Compilation maps the programming model onto the execution model. Programming model: a graph of TDF FSMD operators, memory segments, and streams; unlimited size and number of IOs; no timing constraints. Execution model: a graph of page configurations, memory segments, and streams; fixed size and number of IOs; timed, single-cycle firing.

How Big is an Operator? [Operator-size data for Wavelet Encode/Decode, JPEG Encode/Decode, MPEG (I), MPEG (P), MPEG Encode, and IIR.]

A Unique Synthesis/Partitioning Problem. Inter-page stream delay is not known by the compiler: it depends on the hardware implementation, page placement, and virtualization, and token emission rates are data dependent. Partitioning must retain the stream abstraction; the stream abstraction is what gives us freedom in timing over the synchronous array hardware.

Clustering is Critical. Inter-page communication latency may be long, so inter-page feedback loops are slow. Cluster to fit feedback loops within a page and to fit feedback loops on the device.

Pipeline Extraction. Hoist uncontrolled feed-forward dataflow out of the FSMD. Benefits: it shrinks the cyclic FSM core, and the extracted pipeline has more freedom for scheduling and partitioning. Example: state foo(i): acc = acc + 2*i becomes an extracted pipeline stage two_i = 2*i feeding state foo(two_i): acc = acc + two_i.
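
The slide's foo example can be mimicked in Python to show that the transformation preserves behavior (a sketch of the idea only; the real extraction operates on TDF FSMDs, not Python functions):

```python
# Before extraction: the multiply lives inside the FSM's cyclic core.
def fsm_before(xs):
    acc = 0
    for i in xs:                     # state foo(i): acc = acc + 2*i
        acc = acc + 2 * i
    return acc

# After extraction: the uncontrolled feed-forward dataflow (2*i) is hoisted
# into a separate pipeline stage; the cyclic core only accumulates.
def pipeline(xs):
    for i in xs:
        yield 2 * i                  # extracted stage: two_i = 2*i

def fsm_after(two_i_stream):
    acc = 0
    for two_i in two_i_stream:       # state foo(two_i): acc = acc + two_i
        acc = acc + two_i
    return acc

xs = [1, 2, 3]
assert fsm_before(xs) == fsm_after(pipeline(xs)) == 12
```

The extracted stage has no feedback, so it can be placed, pipelined, or partitioned independently of the FSM core.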

Pipeline Extraction – Extractable Area. [Extractable-area results for JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, and IIR.]

Page Generation. Pipeline extraction removes the dataflow we can freely extract from the FSMD control. We still have to partition the potentially large FSMs; the approach is to turn this into a clustering problem.

State Clustering. Start by considering each state to be a unit, then cluster states into page-size sub-FSMDs; inter-page transitions become streams. Possible clustering goals: minimize delay (inter-page latency), minimize IO (inter-page bandwidth), or minimize area (fragmentation).
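
A greedy sketch of the clustering idea (illustrative only, not the BRASS algorithm): pack FSM states into page-sized clusters, merging across the hottest transitions first so that frequently taken loops stay inside one page.

```python
def cluster_states(transitions, page_size):
    """transitions: {(src, dst): frequency}; returns a list of state clusters."""
    # Sort edges by frequency so hot feedback loops get merged first.
    edges = sorted(transitions.items(), key=lambda kv: -kv[1])
    cluster_of = {}
    clusters = []
    for (s, d), _freq in edges:
        for st in (s, d):
            if st not in cluster_of:
                clusters.append({st})
                cluster_of[st] = clusters[-1]
        cs, cd = cluster_of[s], cluster_of[d]
        if cs is not cd and len(cs) + len(cd) <= page_size:
            cs |= cd                       # merge: transition becomes intra-page
            for st in cd:
                cluster_of[st] = cs
            clusters.remove(cd)
    return clusters

# Hot loop A<->B should share a page; C is reached rarely, so the cheap
# B->C transition is the one that becomes an inter-page stream.
t = {("A", "B"): 100, ("B", "A"): 100, ("B", "C"): 1}
print(cluster_states(t, page_size=2))    # A and B share a page; C is alone
```

This directly serves the goal on the next slide: the frequent A/B state transfer stays on-page, and only the rare transition pays the inter-page cost.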

State Clustering to Minimize Inter-Page State Transfer. Inter-page state transfer is slow, so cluster to contain feedback loops and to minimize the frequency of inter-page state transfer. Similar formulations were previously used in VLIW trace scheduling [Fisher '81], FSM decomposition for low power [Benini/DeMicheli ISCAS '98], VM/cache code placement, and GarpCC code selection [Callahan '00].

Scheduling (work in progress) Lead: Yury Markovskiy

Scheduling. Time-multiplex the operators onto the hardware. To exploit scaling, page capacity is a late-bound parameter, so we cannot do the scheduling at compile time. To exploit dynamic data, we want to look at application and data characteristics.
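
A minimal sketch of what "late-bound page capacity" means for the scheduler (a naive round-robin residency plan, far simpler than the list scheduler described next; names are hypothetical):

```python
def schedule(virtual_pages, physical_pages):
    """Assign virtual pages to timeslices on `physical_pages` physical pages."""
    timeslices = []
    pending = list(virtual_pages)
    while pending:
        # Each timeslice, make as many virtual pages resident as will fit.
        resident, pending = pending[:physical_pages], pending[physical_pages:]
        timeslices.append(resident)
    return timeslices

pages = [f"p{i}" for i in range(5)]
# The same application graph runs on small or large arrays; only the
# time-multiplexing schedule changes, with no recompilation.
print(schedule(pages, physical_pages=2))  # [['p0', 'p1'], ['p2', 'p3'], ['p4']]
print(schedule(pages, physical_pages=8))  # [['p0', 'p1', 'p2', 'p3', 'p4']]
```

With more physical pages, everything fits in one timeslice and the virtualization overhead disappears, which is exactly the performance-scaling behavior the model is after.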

Scheduling, First Try: Fully Dynamic. Time-sliced and list-scheduling based. Very expensive: 100,000-200,000 cycles to schedule 30 virtual pages onto 10 physical pages.

Overhead Effects. [Notes: reconfiguration plus scheduling overhead; the top line is the no-overhead curve we have shown you all along; the bottom two lines show what happens with overhead. Memorize two points: the percentages at array sizes 6 and 18.]

Overhead Costs. [Notes: the major difference from the last slide is total vs. per-timeslice overhead.]

Scheduling: Why It Is Different and Challenging. Distributed memory vs. uniform memory: placement and shuffling matter. Multiple memory ports increase bandwidth, but there is a fixed limit on the number of ports available. Scheduling subgraphs reduces latency and memory.

Scheduling Taxonomy: How Dynamic? Where does the static/dynamic boundary fall for placement, sequence, rate, and timing? Along the axis from all dynamic to all static sit the dynamic scheduler, superscalar, VLIW, asynchronous circuits, and synchronous circuits.

DynamicLoad Time Scheduling Dynamic Scheduler Static Scheduler hammer, take dyn sched  static overhead, create a script for the scheduler at loadtime read script at runtime [compile]---design[schedgen]---script[applicator] profiling info ---^ schedgen can run any algo: use static info from compile, use dynamic flow rates; first one tried is an adoptation of dyn's frontier algo. Expected behavior =!= we got (similar quality of results but different overhead) Caltech CS184 Spring2003 -- DeHon

Static Scheduler Overhead. [Notes: MainEvents vs. memory commands (roughly the number of CMBs); compare dynamic vs. static overhead.]

Compare. [Notes: compare dynamic vs. static overhead.]

Static Scheduler Performance

Anomalies, and How Dynamic? The anomalies on the previous graph are early stalls on stream data that come from assuming a fixed-timeslice model. Solve this with dynamic epoch termination: detect when it is appropriate to advance the schedule. [Taxonomy figure: placement, sequence, rate, timing; SCORE sits toward the static end.]

Static Scheduler w/ Early Stall Detection

More Heterogeneous Programmable SoC

Broader Programmable SoC Applicability. The model is potentially valuable beyond a homogeneous array; we have already introduced the idea of different page types.

Heterogeneous Pages. A small conceptual step to generalize: memory (CMB), processor, FPGA (varying granularity and depth), IO, and custom units (e.g. an FPU).

Consequence. A uniform compute model gives a general way to integrate additional functional units that operate concurrently, with streams serving as the interconnect/communication abstraction.

Additional Information. SCORE: http://brass.cs.berkeley.edu/SCORE (especially see the "Introduction and Tutorial"). Caltech: http://www.cs.caltech.edu/research/ic/

Admin. Friday back in 74 (ps borrowing our videoconf equipment).

Big Ideas. The model is the basis for virtualization and for scaling; it allows common-case optimizations and supports the kinds of computations that exploit this architecture: spatial composition of computing blocks.

Big Ideas. Expose parallelism: communication to an operator is hidden by sequential control flow in ISA-based models; here communication is to an operator, not to a resource (a la GARP). Support spatial composition, in contrast to sequential composition in an ISA. Data presence [self-timed!] makes execution tolerant to timing and resource variations.

Big Ideas. Persistent dataflow: separate creation from use, and use many times (amortize the cost of creation). Persistent communication: separate setup/allocation from use, and amortize out the cost of routing/negotiation/setup. Both are instances of making the common case fast.