BRASS SCORE: Stream Computations Organized for Reconfigurable Execution. Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, John Wawrzynek, BRASS Research Group, University of California, Berkeley.

SCORE: Stream Computations Organized for Reconfigurable Execution
Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, John Wawrzynek (BRASS Research Group, University of California, Berkeley)
André DeHon (IC Research Group, California Institute of Technology)
SOC 2002 – November 21, 2002

BRASS, 11/21/02, Eylon Caspi, SOC 2002

Protecting Software Investment
• Device sizes are growing: Moore's law, 2x transistors every 18 months
• Product / design cycles are shrinking
• Riding Moore's Law requires reuse
  - Module reuse is here; what about software reuse?
• Need to let software survive and automatically scale to the next-generation device
• Need a stronger model for the HW-SW interface, with better parallelism
[Figure: device size grows over time while design time shrinks]

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

Software is Everywhere
[Figure: programmable system-on-a-chip block diagrams: microprocessor, DSP / MPEG core, FPGA, RAM, and peripherals (VGA, IR, USB, PCI, Audio, FireWire, LED, JTAG, Ethernet)]

Why Doesn't Software Scale?
• Is software reusable if we have…
  - One more processor? One more accelerator core?
  - A larger FPGA? A larger memory?
• Will performance improve? Usually no. Why?
  - Device size is exposed to the programmer: number of processing elements, memory size, bandwidths
  - Algorithmic decisions are locked in early
• Need a better model / abstraction

A Lesson from ISA Processors
• An ISA (Instruction Set Architecture) decouples SW from HW
  - Hides details from SW: number of function units, timing, memory size
  - SW survives to compatible, next-generation devices
  - Performance scales with device speed + size
  - Survival for decades, e.g. IBM 360, x86
• But an ISA cannot scale forever
  - Latency scales with device size (cycles to cross the chip, access memory)
  - Need parallelism to hide latency
  - ILP: expensive to extract + exploit (caches, branch prediction, etc.)
  - Data parallelism (Vector, MMX): limited applicability; MMX not scalable
  - Thread parallelism (MP, multi-threaded): IPC expensive; hard to program
• Gluing together conventional processors is insufficient

What is the Next Abstraction?
• Goal: more hardware yields better performance, without rewriting / recompiling software
• Need to:
  - Abstract device sizes: number of function units / cores, memory size, memory ports
  - Abstract latencies + bandwidths
  - Support rescheduling / reparallelizing a computation to available resources
  - Handle large, heterogeneous systems
  - Have predictable performance

Streams, Process Networks
• Stream = FIFO communication channel with blocking read, non-blocking write, and conceptually unbounded capacity
  - Basic primitive for communication and synchronization
  - Exposed at all levels: application (programming model) and architecture
• Application = graph of stream-connected processes (threads) and memories
  - Kahn process network, 1974
  - Stream semantics ensure determinism regardless of communication timing, thread scheduling, etc. (Kahn continuity)
• Architecture = graph of stream-connected processors (cores) and memories
  - A processor (core) runs one or more processes (threads)
  - Some processes are always present, on fixed cores (e.g. off-chip interfaces)
  - Some processes are sequenced on programmable cores (e.g. DCT on a DSP)
[Figure: process graph with streams and memories]
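The stream semantics above can be sketched in Python, with `queue.Queue` standing in for a hardware FIFO (blocking `get` = blocking read, `put` = non-blocking write); the process names and the `None` end-of-stream marker are conventions of this sketch, not part of the model:

```python
import queue
import threading

def producer(out):
    # Non-blocking write: push tokens into the stream.
    for i in range(5):
        out.put(i)
    out.put(None)  # end-of-stream marker (a convention of this sketch)

def doubler(inp, out):
    # Blocking read: get() waits until a token arrives, then the process fires.
    while True:
        x = inp.get()
        if x is None:
            out.put(None)
            break
        out.put(2 * x)

def run_network():
    a, b = queue.Queue(), queue.Queue()   # streams (unbounded FIFOs)
    ts = [threading.Thread(target=producer, args=(a,)),
          threading.Thread(target=doubler, args=(a, b))]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    result, v = [], b.get()
    while v is not None:
        result.append(v)
        v = b.get()
    return result

print(run_network())  # → [0, 2, 4, 6, 8]
```

Because each process blocks only on its own input stream, the output is the same regardless of how the two threads are interleaved: this is the Kahn-continuity determinism guarantee.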

Stream-Aware Scheduling
• Streams expose inter-process dependencies (data flow)
• Streams enable efficient, flexible schedules
  - Efficient: fewer blocked cycles, shorter run time
  - Automatically schedule to available resources: number of processors, memory size, network bandwidth, etc.
  - E.g. fully spatial, pipelined
  - E.g. time multiplexed with data batching: amortize the cost of a context swap over a larger data set

Stream Reuse
• Persistent streams enable reuse
  - Establish a connection once (network route / buffer)
  - Reuse the connection while the processes are loaded
• Cheap (single-cycle) stream access
• Amortize the per-message cost of communication

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

Components of a Streaming SoC
• Graph-based compute model: streams
• Scheduler: decides where + when processes run and where memory is stored
• Hardware support:
  - Common network interface for all cores; supports stalling and queueing (e.g. OCP)
  - Stream buffer memory
  - Task-swap capability for sequenced cores
  - Controller for core sequencing, DMA, and the online parts of the scheduler (e.g. a microprocessor)

SCORE Compute Model
• Program = data-flow graph of stream-connected threads
  - Kahn process network (blocking read, non-blocking write)
• Compute: thread, a task with local control
• Communication: stream, a FIFO channel with unbounded buffer capacity, blocking read, non-blocking write
• Memory: segment, a memory block with a stream interface (e.g. streaming read)
• Dynamics:
  - Dynamic local thread behavior leads to dynamic flow rates
  - Unbounded resource usage: may need stream buffer expansion
  - Dynamic graph allocation
• The model admits parallelism at multiple levels: ILP, pipeline, data

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

SCORE for Reconfigurable Hardware
• SCORE: Stream Computations Organized for Reconfigurable Execution
• Programmable logic + programmable interconnect, e.g. Field Programmable Gate Arrays (FPGAs)
  - Hardware scales by tiling / duplicating
  - High parallelism; spatial data paths
  - Stylized, high-bandwidth interconnect
• But no abstraction for software survival, to date
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, with specific resource constraints
(Graphics copyright of their respective companies)

Virtual Hardware
• The compute model has unbounded resources: the programmer does not target a particular device size
• Paging
  - "Compute pages" swapped in/out (like virtual memory)
  - Page context = thread (an FSM that can block on stream access)
• Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
  - Requires "working sets" of tightly-communicating pages to fit on the device
[Figure: JPEG pipeline (Transform, Quantize, RLE, Encode) mapped to compute pages with buffers]

SCORE Reconfigurable Hardware Model
• Paged FPGA
• Compute Page (CP)
  - Fixed-size slice of reconfigurable hardware (a fixed number of LUTs)
  - Fixed number of I/O ports
  - Stream interface with input queue
• Configurable Memory Block (CMB)
  - Distributed, on-chip memory (e.g. 2 Mbit)
  - Stream interface with input queue
• High-level interconnect: circuit switched, with valid + back-pressure bits
• Microprocessor: run-time support + user code

Heterogeneous SCORE
• SCORE extends to other processor types
• Network interface
  - Routes traffic to the network or a buffer
  - Blocks on empty/full stream access
[Figure: processor, FPU, and I/O cores connected by streams]

Efficient Streams on a Microprocessor
• Stream instructions: stream_read(reg, idx), stream_write(reg, idx)
[Figure: processor with network interface]

Application: JPEG Encode

JPEG Encode Performance Scaling
[Chart: run time vs. hardware size (CP-CMB pairs), for each scheduling heuristic]

Performance Scaling Observations
• Performance scales predictably with added hardware
• Time sequencing is efficient: an application can run on substantially fewer pages than it has page threads, with negligible performance loss
• Scheduling heuristics work well
• Scheduling analysis can be cheap

CAD with Streams is Hierarchical
• Two-level CAD hierarchy: (1) inside a page, the compiler; (2) between pages, the scheduler
• Architecture is locally synchronous, globally asynchronous
  - Traditional timing closure only inside a page
  - Gracefully accommodates high page-to-page latency
  - Not a free lunch: latency still impacts application performance

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

Page Scheduling
• Where, when, and how long to run each page thread?
• Time-slice model: reconfigure all pages together
• Decisions for each time slice:
  - Temporal partitioning: choose the group of pages to run
  - Resource allocation: allocate stream buffers to/from non-resident page threads ("stitch buffers")
  - Place and route resident pages, buffers, and user segments
  - Reconfiguration
• "Buffer lock" recovery: if deadlocked due to insufficient buffer size, expand buffers
[Figure: time line of slices: schedule, reconfigure, run]
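A minimal sketch of the temporal-partitioning decision, assuming a simple greedy grouping in topological order (the talk's actual heuristics also weigh firing rates, SCC precedence, and CMB constraints):

```python
def temporal_partitions(pages, cp_limit):
    """Group page threads into time-slice partitions under a CP limit.

    Deliberately simple: take pages in topological order and cut
    whenever the device is full. A sketch, not the real scheduler.
    """
    return [pages[i:i + cp_limit] for i in range(0, len(pages), cp_limit)]

# The JPEG Encode example: 11 page threads on a 4-CP device
# yields three time slices of 4, 4, and 3 pages.
groups = temporal_partitions([f"P{i}" for i in range(11)], cp_limit=4)
print(groups)
```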

JPEG Encode (improved)
• 11 page threads
• 5 user segments

JPEG Encode: Temporal Partitions
• Assume the device has 4 CPs, 16 CMBs

JPEG Encode: Stitch Buffers
[Figure: partitioned graph with stitch buffers inserted between time slices]

JPEG Encode: Resource Assignment

Reconfiguration Control
• Between time slices:
  - Halt CPs and CMBs
  - Wait for in-flight communication to drain into input queues
  - Save CP and CMB context to CMB memory
  - Compute / look up the next temporal partition
  - Swap CMB contents to/from off-chip memory, if necessary
  - Load CP and CMB context from CMB memory; reconfigure
  - Reconfigure the interconnect
  - Start CPs and CMBs

Schedule Binding Time
• When to make each scheduling decision?
• Dynamic (at run time)
  - High run-time overhead
  - Responsive to application dynamics
• Static (at load time / install time)
  - Low run-time overhead
  - Possible wasted cycles due to mis-predicted application dynamics

Dynamic Scheduler
• Premise: dynamic-rate applications benefit from dynamic page groups and time-slice duration, e.g. compressor / decompressor stages
• Temporal partitioning: list schedule; order pages by the amount of available, buffered input
• Memory management: allocate 1 Mb stitch buffers and swap off-chip, as necessary, every time slice
• Place + route: every time slice (several tens of pages; possible with HW assist)
• Result: very high run-time overhead for scheduling decisions

Dynamic Scheduler Results
• Scheduler overhead is 36% of total execution time

Dynamic Scheduler Overhead
• Average overhead per time slice = 127K cycles
  - Scheduler overhead = 124K cycles (avg.)
  - Reconfiguration = 3.5K cycles (avg.)
• Mandates a large time slice, 250K cycles
  - A thread may idle for much of it if it blocks or exhausts its input

Quasi-Static Scheduler
• Premise: reduce run-time overhead by iterating a schedule generated off-line
• Temporal partitioning: multi-way graph partitioning constrained by
  - Precedence (of SCCs)
  - Number of CPs, number of CMBs
  - 3 partitioning heuristics: exhaustive search for max utilization; min cut; topological sort
• Memory management: off-line, allocate 1 Mb stitch buffers and schedule off-chip swaps
• Place + route: off-line
• Quasi-static: a time slice can still end early
• Result: low run-time overhead AND better quality

Quasi-Static Scheduler Overhead
• Reduced average overhead per time slice by 7x
  - Scheduler overhead = 14K cycles (avg.)
  - Reconfiguration = 4K cycles (avg.)

Quasi-Static Scheduler Results
• Reduced total execution time by 4.5x (not merely the 36% overhead)
• Because: (1) better partitioning (global view, not greedy); (2) a time slice can end early if everything stalls (the benefit split roughly 50/50 between the two)

Temporal Partitioning Heuristics
• "Exhaustive"
  - Goal: maximize utilization, i.e. non-idle page-cycles
  - How: cluster to avoid rate mismatches
  - Profile average consumption and production rates per firing, for each thread
  - Given a temporal partition (group of pages) + I/O rates, deduce the average firing rate (Synchronous Data Flow balance equations)
  - Firing rate ~ page utilization (% non-idle cycles)
  - Exhaustive search of feasible partitions for max total utilization
  - Tried up to 30 pages (6 hours for a small array, minutes for a large array)
• "Min Cut"
  - FBB flow-based, balanced, multi-way partitioning [Yang+Wong, ACM 1994]
  - CP limit: area constraint
  - CMB limit: I/O cut constraint; every user segment has an edge to the sink so it costs a CMB in the cut
• "Topological"
  - Pre-cluster strongly connected components (SCCs)
  - Pre-cluster pairs of clusters if doing so reduces the pair's I/O
  - Topological sort; partition at the CP limit
[Figure: processes P1, P2 with tokens per firing and expected firings per cycle]
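The SDF balance equations used by the "Exhaustive" heuristic can be illustrated with a small solver; the edge encoding and the propagation scheme are assumptions of this sketch, not the scheduler's actual algorithm:

```python
from fractions import Fraction

def sdf_firing_rates(edges):
    """Solve SDF balance equations r[p] * prod == r[c] * cons by propagation.

    edges: list of (producer, tokens_produced_per_firing,
                    consumer, tokens_consumed_per_firing).
    Assumes a connected, consistent (rate-matched) graph.
    """
    rates = {edges[0][0]: Fraction(1)}  # normalize: first producer fires at rate 1
    changed = True
    while changed:
        changed = False
        for p, n_prod, c, n_cons in edges:
            if p in rates and c not in rates:
                rates[c] = rates[p] * n_prod / n_cons
                changed = True
            elif c in rates and p not in rates:
                rates[p] = rates[c] * n_cons / n_prod
                changed = True
    return rates

# P1 produces 2 tokens per firing; P2 consumes 3 per firing and
# produces 1 into P3, which consumes 2 per firing.
r = sdf_firing_rates([("P1", 2, "P2", 3), ("P2", 1, "P3", 2)])
print(r)  # P2 fires 2/3 as often as P1; P3 fires 1/3 as often
```

Relative firing rates of this kind are what let the heuristic estimate each page's utilization within a candidate partition.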

Partitioning Heuristics Results
[Chart: performance vs. hardware size (CP-CMB pairs), per heuristic]

Partitioning Heuristics Results (2)
[Chart: performance vs. hardware size (CP-CMB pairs), per heuristic]

Scheduling: Future Work
• Buffer sizing based on stream rates
• Software pipelining of time slices
[Figure: original vs. software-pipelined time slices for pages P1–P4]

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

Programming Model: TDF
• TDF = intermediate, behavioral language for:
  - SFSM threads (Streaming Extended Finite State Machine)
  - Static graph composition, hierarchical
• State machine for:
  - Firing signatures (input guards)
  - Control flow (branching)
• Firing semantics: when in state X, wait for X's inputs, then fire (consume, act)

    select (input boolean s,
            input unsigned[8] t,
            input unsigned[8] f,
            output unsigned[8] o)
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o = t; goto S;
      state F (f) : o = f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
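A hypothetical Python rendering of the `select` SFSM's firing semantics, with lists standing in for streams (pop-from-front plays the role of a blocking read in this sketch):

```python
def select_sfsm(s_tokens, t_tokens, f_tokens):
    """Software rendering of the TDF `select` SFSM.

    In each state the machine waits for only that state's inputs
    (its firing signature), consumes them, acts, and transitions.
    """
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s_tokens:                  # no more control tokens: halt
                break
            state = "T" if s_tokens.pop(0) else "F"
        elif state == "T":
            out.append(t_tokens.pop(0))       # o = t; goto S
            state = "S"
        else:                                 # state F
            out.append(f_tokens.pop(0))       # o = f; goto S
            state = "S"
    return out

print(select_sfsm([True, False, True], [10, 11], [20]))  # → [10, 20, 11]
```

Note that the machine never inspects `t_tokens` while in state F (or vice versa): firing signatures make the consumption pattern explicit per state, which is what the compiler exploits later.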

The Compilation Problem
• Programming model: communicating SFSMs (TDF: SFSM threads, streams, memory segments) with unrestricted size, number of I/Os, and timing
• Execution model: communicating page configurations (paged virtual hardware: page threads, streams, memory segments) with fixed size, number of I/Os, and timing
• Compilation is a resource-binding transform on state machines + data-paths

Impact of Communication Delay
• With virtualization, inter-page delay is unknown and sensitive to:
  - Placement
  - Interconnect implementation
  - Page schedule
  - Technology: wire delay is growing
• Inter-page feedback is slow
  - Partition to contain feedback loops within a page
  - Schedule to contain feedback loops on the device

Latency-Sensitive Compilation with Streams
• Pipeline extraction
  - Shrink the SFSM by extracting control-independent functions
  - Helps timing and page partitioning
• Page partitioning: SFSM decomposition
  - State clustering for minimum inter-page transitions
• Page packing
  - Reduce area fragmentation
  - Contain streams

Pipeline Extraction
• Hoist uncontrolled, feed-forward data-flow out of the FSM
• Benefits:
  - Shrinks the FSM's cyclic core
  - The extracted pipeline has more freedom for scheduling + partitioning
• Example: state foo(x): if (x==0)… becomes state foo(xz): if (xz)…, with x==0 computed in an extracted feed-forward pipeline
[Figure: before/after: control flow (CF) and data flow (DF), with x==0 hoisted into a pipeline feeding boolean stream xz]
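The transform can be illustrated with a toy sketch (function names are hypothetical): the control-independent test `x == 0` moves into a feed-forward stage, and the residual FSM consumes the precomputed boolean stream `xz`; the composition is observationally equivalent to the original machine:

```python
def fsm_before(xs):
    # Before extraction: the FSM computes x == 0 inside its state logic.
    out = []
    for x in xs:
        out.append("zero" if x == 0 else "nonzero")
    return out

def extracted_pipeline(xs):
    # Extracted feed-forward stage: uncontrolled dataflow, free to be
    # scheduled or placed independently of the FSM's cyclic core.
    return [x == 0 for x in xs]

def fsm_after(xzs):
    # After extraction: the shrunken FSM branches on the boolean stream xz.
    return ["zero" if xz else "nonzero" for xz in xzs]

xs = [0, 7, 0, 3]
print(fsm_after(extracted_pipeline(xs)))  # → ['zero', 'nonzero', 'zero', 'nonzero']
```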

Pipeline Extraction: SFSM Area
[Chart: SFSM area before/after extraction for Wavelet Decode, Wavelet Encode, JPEG Encode, JPEG Decode, MPEG Encode, MPEG (I), MPEG (P), IIR]

Pipeline Extraction: Extractable Area
[Chart: extractable area fraction for JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

Delay-Oriented SFSM Decomposition
• Indivisible unit: a state (CF + DF)
  - Spatial locality in state logic
• Cluster states into page-size sub-machines
  - Inter-page communication for data flow and state flow
  - Sequential delay is in inter-page state transfer
  - Cluster to maintain local control
  - Cluster to contain state loops
• Similar to:
  - VLIW trace scheduling [Fisher '81]
  - FSM decomposition for low power [Benini/DeMicheli ISCAS '98]
  - VM/cache code placement
  - GarpCC HW/SW partitioning [Callahan '00]

Page Packing
• Cluster SFSMs + pipelines to (1) avoid area fragmentation and (2) contain streams
• Containing stream buffers:
  - A stream buffer implemented as registers inside a page has fixed size and may cause deadlock (buffer-lock)
  - Choice 1: if the stream provably has a bounded buffer, containing it is safe (halting problem: provable or disprovable only in some cases)
  - Choice 2: if it cannot be proven, use buffer-expandable page-to-page I/O
  - Choice 3: if it cannot be proven, do not pack

Outline
• Motivation, Streams
• SCORE
• SCORE for Reconfigurable Systems
• Scheduling
• Compilation
• Summary

Summary
• SCORE enables software to survive and automatically scale to the next-generation device
• Stream + process network abstraction at all levels (application, architecture)
• Demonstrated a scalable, hybrid reconfigurable architecture for SCORE:
  - Applications, programming model
  - Compiler, scheduler
  - Device simulator, architecture
• More info on the web

SUPPLEMENTAL MATERIAL

Device Simulation
• Simulator engine
  - Cycle level
  - Behavioral model (single-step) for each page thread, emitted by the compiler
  - Simplified timing model: page-to-page latency = 1 cycle; CMB access latency = 1 cycle
• Device characteristics
  - FPGA based on HSRA [U.C. Berkeley, FPGA '99]
  - CP = LUTs
  - CMB = 2 Mbit DRAM, 64-bit data interface, fully pipelined
  - Area for CP-CMB pair:
  - Page reconfiguration time, from CMB = 5000 cycles
  - Synchronous, 250 MHz (but some apps not properly timed for 250 MHz)
• Microprocessor: x86 (PIII)

More Dynamic vs. Static Results
[Chart: dynamic vs. static scheduling, performance vs. hardware size (CP-CMB pairs)]

More Dynamic vs. Static Results (2)
[Chart: dynamic vs. static scheduling, performance vs. hardware size (CP-CMB pairs)]