A Streaming Multi-Threaded Model
Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, André DeHon, John Wawrzynek
BRASS Research Group, University of California, Berkeley

Presentation transcript:

Slide 1: A Streaming Multi-Threaded Model
Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, André DeHon, John Wawrzynek
BRASS Research Group, University of California, Berkeley
MSP-3, 12/2/01

Slide 2 (12/2/01, Eylon Caspi, MSP-3): Protecting Software Investment
- Technology trends: bigger, faster
  - Moore's Law: 2x transistors every 18 months
- Device landscape growing
  - Microprocessors, DSPs, FPGAs, communication processors, network processors, PSOCs, etc.
- Need a way to let SW survive and automatically scale to next-gen devices
- Need a strong model for the SW-HW interface with better parallelism

Slide 3: Outline
- Motivation
- SCORE
- SCORE for Reconfigurable Hardware
- SCORE for Microprocessors
- Summary / Future Work

Slide 4: A Lesson from ISA Processors
- ISA (Instruction Set Architecture) decouples SW from HW
  - Survival to compatible, next-generation devices
  - Performance scales with device speed + size
  - Survival for decades, e.g. IBM 360, x86
- An ISA cannot scale forever
  - Latency scales with device size (cycles to cross chip, access mem)
  - Need parallelism to hide latency
    - ILP: expensive to extract + exploit (caches, branch pred., etc.)
    - Data: (Vector, MMX) limited applicability; MMX not scalable
    - Thread: (MP, multi-threaded) IPC expensive; hard to program
- Gluing together conventional processors is insufficient

Slide 5: Streams
- Stream = FIFO communication channel with blocking read, non-blocking write, conceptually unbounded capacity
  - Basic primitive for communication, synchronization
  - Exposed at all levels: programming model, architecture
- Application = data flow graph of threads, memories
  - Kahn process network
  - Stream semantics ensure determinism regardless of communication timing, thread scheduling (Kahn continuity)
[Figure legend: Thread, Mem]
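The stream semantics on this slide can be illustrated with a minimal Python sketch. It is not SCORE code: it assumes that `queue.Queue` with no size limit stands in for a conceptually unbounded stream, and the names `EOS`, `producer`, `scale`, and `run_pipeline` are hypothetical, introduced only for illustration.

```python
import queue
import threading

EOS = object()  # hypothetical end-of-stream marker, not part of the slides


def producer(out, n):
    """Thread that writes 0..n-1 to its output stream."""
    for i in range(n):
        out.put(i)  # unbounded queue: the write never blocks
    out.put(EOS)


def scale(inp, out, k):
    """Thread that multiplies every token by k."""
    while True:
        x = inp.get()  # blocking read: waits until a token is available
        if x is EOS:
            out.put(EOS)
            return
        out.put(k * x)


def run_pipeline(n=5, k=3):
    """Build the graph producer -> scale -> collector and run it."""
    a, b = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, k)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        y = b.get()
        if y is EOS:
            break
        result.append(y)
    for t in threads:
        t.join()
    return result
```

Because each thread only ever blocks on reads, the collected output depends on the graph and its inputs, not on how the operating system interleaves the threads.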

Slide 6: Stream-Aware Scheduling
- Streams expose inter-thread dependencies (data flow)
- Streams enable efficient, flexible schedules
  - Efficient: fewer blocked cycles, shorter run time
  - Automatically schedule to available resources
    - Number of processors, memory size, network bandwidth, etc.
  - E.g. fully spatial, pipelined
  - E.g. time multiplexed with data batching
    - Amortize cost of context swap over larger data set
[Figure legend: Thread, Mem]

Slide 7: Stream Reuse
- Persistent streams enable reuse
  - Establish connection once (network route / buffer)
  - Reuse connection while threads loaded
  - Cheap (single cycle) stream access
- Amortize per-message cost of communication
[Figure legend: Thread, Mem]

Slide 8: SCORE Compute Model
- Program = data flow graph of stream-connected threads
  - Kahn process network (blocking read, non-blocking write)
- Compute: Thread
  - Task with local control
- Communication: Stream
  - FIFO channel, unbounded buffer capacity, blocking read, non-blocking write
- Memory: Segment
  - Memory block with stream interface (e.g. streaming read)
- Dynamics:
  - Dynamic local thread behavior
    - dynamic flow rates
  - Unbounded resource usage: may need stream buffer expansion
  - Dynamic graph allocation
- Model admits parallelism at multiple levels: ILP, pipeline, data

Slide 9: SCORE for Reconfigurable Hardware
- SCORE: Stream Computations Organized for Reconfigurable Execution
- Programmable logic + programmable interconnect
  - E.g. Field Programmable Gate Arrays (FPGAs)
- Hardware scales by tiling / duplicating
- High parallelism; spatial data paths
- But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints

Slide 10: Virtual Hardware
- Compute model has unbounded resources
  - Programmer no longer targets particular device size
- Paging
  - "Compute pages" swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
- Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure: compute pages Transform, Quantize, RLE, Encode, connected by buffers]

Slide 11: SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
    - Fixed-size slice of RC hardware (e.g. LUTs)
    - Fixed number of I/O ports
  - Configurable Memory Block (CMB)
    - Distributed, on-chip memory (e.g. 2 Mbit)
    - Stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code

Slide 12: Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X's inputs, then fire (consume, act)

    select (input boolean s, input unsigned[8] t,
            input unsigned[8] f, output unsigned[8] o)
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
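The firing semantics of the `select` operator can be traced with a small Python sketch. This is an illustrative simulation, not TDF tooling: the function name `run_select` and the use of finite token lists as inputs are assumptions, and "blocking" is modeled by stopping when the current state's input stream is empty.

```python
from collections import deque


def run_select(s_tokens, t_tokens, f_tokens):
    """Simulate the select EFSM until the current state's input is empty."""
    s, t, f = deque(s_tokens), deque(t_tokens), deque(f_tokens)
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s:  # state S waits on stream s only
                break
            state = "T" if s.popleft() else "F"
        elif state == "T":
            if not t:  # state T waits on stream t only
                break
            out.append(t.popleft())  # o = t
            state = "S"
        else:  # state F
            if not f:  # state F waits on stream f only
                break
            out.append(f.popleft())  # o = f
            state = "S"
    return out
```

For example, `run_select([True, False, True], [10, 11], [20])` consumes one token per firing and emits the `t` or `f` token chosen by each `s` value. Note that a token on the unselected stream is left in place, since each state names only the inputs it consumes.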

Slide 13: Page Scheduling
- Schedule = time-sliced eviction / loading
  - Choose pages to run
  - Manage stream buffers (modify page graph; swap memory)
  - Configure CPs, CMBs, network
- Implemented several schedulers
  - Dynamic: dynamic loading order based on buffered input
  - Static: static, repeated loading order
  - Quasi-static: static loading order, dynamic time slice
- Page loading order (static / quasi-static)
  - Topological: dependence order (arbitrary topological sort of page graph)
  - Min-cut: minimize # of live stream buffers (min-cut page graph)
  - Exhaustive: minimize stall cycles based on profiled I/O rates (exhaustively search all topological orders)
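The "Topological" loading order can be sketched in Python with the standard-library `graphlib` module (Python 3.9 or later). This is not the SCORE scheduler: the page names are borrowed from the Transform, Quantize, RLE, Encode figure on the Virtual Hardware slide, the chain-shaped graph is an assumption, and `loading_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def loading_order(page_graph):
    """Return one dependence-respecting page loading order.

    page_graph maps each page to the set of pages that feed it.
    """
    return list(TopologicalSorter(page_graph).static_order())


# Assumed page graph: a simple chain of four compute pages.
jpeg_pages = {
    "Transform": set(),
    "Quantize": {"Transform"},
    "RLE": {"Quantize"},
    "Encode": {"RLE"},
}
```

Calling `loading_order(jpeg_pages)` yields the pages in producer-before-consumer order. For graphs with branches, several valid orders exist, which is the freedom the min-cut and exhaustive variants on the slide exploit.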

Slide 14: Execution Results
[Figure: plot, x-axis Hardware Size (CP-CMB Pairs)]

Slide 15: Heterogeneous SCORE
- SCORE extends to other processor types
- Network interface
  - Route traffic to network or buffer
  - Block on empty/full stream access
[Figure: blocks labeled Processor, FPU, IO]

Slide 16: Microprocessor Stream Support
- Stream instructions:
    stream_read(reg, idx)
    stream_write(reg, idx)
[Figure: Network Interface]
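The slides name the two stream instructions but do not spell out their semantics. The Python sketch below is one possible reading, under loudly stated assumptions: `reg` is a register index, `idx` selects a stream, buffers are bounded by a hypothetical `capacity`, and a blocked access is modeled by returning `False` (a stall) rather than suspending the thread. The class `StreamCore` is invented for illustration.

```python
from collections import deque


class StreamCore:
    """Toy processor model with a register file and indexed stream buffers."""

    def __init__(self, num_regs=8, num_streams=4, capacity=2):
        self.regs = [0] * num_regs
        self.streams = [deque() for _ in range(num_streams)]
        self.capacity = capacity  # assumed per-stream buffer size

    def stream_read(self, reg, idx):
        """Pop one token from stream idx into register reg; stall if empty."""
        if not self.streams[idx]:
            return False  # would block on an empty stream
        self.regs[reg] = self.streams[idx].popleft()
        return True

    def stream_write(self, reg, idx):
        """Push register reg onto stream idx; stall if the buffer is full."""
        if len(self.streams[idx]) >= self.capacity:
            return False  # would block on a full stream
        self.streams[idx].append(self.regs[reg])
        return True
```

A real implementation would presumably route the token through the network interface and retry or suspend on a stall; the boolean return here only makes the block-on-empty/full behavior visible.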

Slide 17: Summary
- Exposing streams at all levels (programming model, architecture) enables software survival + performance scaling in high-capacity architectures
- Demonstrated scalable hybrid reconfigurable architecture; proposed heterogeneous / multi-processor extensions
- Future work
  - Page partitioning for reconfigurable
  - Scheduling with I/O rate matching
- More information
  - SCORE web page
  - FPGA 2002 paper (February 24-26)

Slide 18: Supplemental

Slide 19: Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA '99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - 0.25 µm: 12.9 mm² (1/9 of PII-450)
    - 0.18 µm: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
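The simulation parameters give a feel for how reconfiguration cost is amortized over a time slice. The following back-of-the-envelope Python sketch uses the 5000-cycle page reconfiguration and the 250,000-cycle timer interval from the slide. It assumes, purely for illustration, that a page is reconfigured once per slice and that the reconfiguration is counted inside the slice; scheduler run time and buffer swapping are ignored. The function name is hypothetical.

```python
def reconfig_overhead(reconfig_cycles, slice_cycles, reconfigs_per_slice=1):
    """Fraction of a time slice spent reconfiguring a compute page."""
    return reconfigs_per_slice * reconfig_cycles / slice_cycles


# Slide parameters: 5000-cycle reconfiguration, 250,000-cycle time slice.
overhead = reconfig_overhead(5000, 250_000)
useful = 1.0 - overhead
```

Under these assumptions the reconfiguration overhead is 5000 / 250,000, or 2% of each slice, leaving about 98% of the cycles for computation. A shorter slice raises the overhead proportionally.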

Slide 20: Application: JPEG Encode

Slide 21: Execution Results
[Figure: plot, x-axis Hardware Size (CP-CMB Pairs)]

Slide 22: Execution Results
[Figure: plot, x-axis Hardware Size (CP-CMB Pairs)]

Slide 23: Execution Results
[Figure: plot, x-axis Hardware Size (CP-CMB Pairs)]