CS294-6 Reconfigurable Computing Day 23 November 10, 1998 Stream Processing.

Previously
Computing Requirements
SCORE
–stream-based computing model
–use streams for linking computations instead of shared memory locations
  –expose parallelism
  –freedom of sequential/spatial implementation

Today
Streams moderately well developed for
–sequential atoms in a multithreaded/multiprocessor environment
General DF case
SDF
Expression
…thoughts on adapting these ideas for SCORE-like execution

General Dataflow Case
Dataflow graph exposes parallelism
Operators enabled as soon as data is available
Captures partial ordering for computation
Adaptive/tolerant to latencies in system
=> great for exposing parallelism
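The firing discipline above can be sketched as a toy interpreter (illustrative only, not the SCORE runtime; `Node` and `run` are hypothetical names): an operator fires whenever every one of its input queues holds a token, in any order.

```python
from collections import deque

class Node:
    """A dataflow operator: enabled when every input queue holds a token."""
    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs = []                     # list of (node, input port)

    def ready(self):
        return all(self.inputs)              # every queue non-empty

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        result = self.fn(*args)
        for node, port in self.outputs:
            node.inputs[port].append(result)

def run(nodes):
    # Fire any enabled operator until quiescence: only the partial order
    # of data dependencies constrains execution, not program order.
    progress = True
    while progress:
        progress = False
        for n in nodes:
            if n.ready():
                n.fire()
                progress = True
```

Note how the fine-grained cost shows up directly: every firing decision requires a presence check (`ready`) and a rendezvous on the input queues.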

General Dataflow
Fine-grained
–exposes maximum parallelism
–…but rendezvous/presence overhead for every operator
Who runs when is unpredictable
–variable latencies
–variable consumption/production
–=> forces runtime synchronization/scheduling

General Dataflow
What structure can we exploit to reduce these requirements?

General Dataflow
What structure can we exploit to reduce these requirements?
–Spatial operator locality
  most communication is local (sequential)
–Operation blocks
  only do dataflow presence checks on inputs to a region of code
  sequential/direct computation within the subgraph
  –all local/deterministic computations stay in the subgraph
–Cyclic/predictable dataflow?

Dataflow Multithreading
Original DF:
–synchronize per instruction
Hybrid DF -> TAM
–synchronize on remote memory access (messages)
–run scheduling quanta (several instructions)
Multithreading
–coarse-grain tasks
–synchronize on input data
–(also locking)

What to Watch For
With arbitrary I/O rates
–unbounded buffering requirements

Synchronous Data Flow
Restriction
–the number of tokens produced/consumed per operator firing is constant
–these numbers are known at compile time
–each edge has a predetermined number of initial tokens
Consistent
–admissible and periodic

SDF: Periodic
Periodic
–invoke each operator at least once
–return to initial state (# tokens on each edge)
–can determine by balance equations
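The balance equations — for each edge, (firings of source) × (tokens produced) = (firings of sink) × (tokens consumed) — can be solved mechanically for the smallest integer repetitions vector. A sketch, assuming a connected graph represented as `{(src, dst): (produce, consume)}` (names and representation are illustrative):

```python
from fractions import Fraction
from math import lcm

def repetitions(edges, ops):
    """Solve rate[src]*produce == rate[dst]*consume for the smallest
    positive integer repetitions vector (ops: list of operator names)."""
    rate = {ops[0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates across edges
        changed = False
        for (s, d), (p, c) in edges.items():
            if s in rate and d not in rate:
                rate[d] = rate[s] * p / c
                changed = True
            elif d in rate and s not in rate:
                rate[s] = rate[d] * c / p
                changed = True
    # every edge must balance, or no periodic schedule exists
    for (s, d), (p, c) in edges.items():
        assert rate[s] * p == rate[d] * c, "inconsistent rates"
    L = lcm(*(r.denominator for r in rate.values()))
    return {o: int(r * L) for o, r in rate.items()}
```

For a chain A→B→C where each operator produces 2 tokens per firing and each sink consumes 1, this yields the 1 A, 2 B, 4 C vector used in the example below.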

SDF: Admissible
Admissible
–firing sequence does not yield deadlock

SDF: Inadmissible

SDF: Admissible

Benefits
Periodic schedules
Bounded buffer requirements
–acyclic graphs: optimal algorithm
–cyclic graphs: NP-complete; heuristic algorithm comes close to optimal buffering

SDF Example
By balance equations:
–1 A, 2 B, 4 C
Firing sequences:
–ABCBCCC
–ABCCBCC
–ABBCCCC
Buffer costs:
–5 (AB=2, BC=3)
–4 (AB=2, BC=2)
–6 (AB=2, BC=4)
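These buffer costs can be checked by replaying each firing sequence and tracking the peak token count on each edge. A sketch, assuming the rates implied by the repetitions vector (A produces 2 tokens per firing into B, B consumes 1; likewise B→C):

```python
def buffer_cost(sequence, edges):
    """Replay a firing sequence; return the summed peak occupancy per edge."""
    tokens = {e: 0 for e in edges}
    peak = {e: 0 for e in edges}
    for op in sequence:
        # consume inputs first, then produce outputs
        for (s, d), (p, c) in edges.items():
            if d == op:
                assert tokens[(s, d)] >= c, "fired without enough tokens"
                tokens[(s, d)] -= c
        for (s, d), (p, c) in edges.items():
            if s == op:
                tokens[(s, d)] += p
                peak[(s, d)] = max(peak[(s, d)], tokens[(s, d)])
    return sum(peak.values())

# Edge rates assumed from the example: A -2/1-> B -2/1-> C
edges = {("A", "B"): (2, 1), ("B", "C"): (2, 1)}
```

Replaying the three schedules reproduces the costs 5, 4, and 6 from the slide.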

Scheduling (min buffer)
F = fireable operators
D = deferrable(F) = fireable operators whose every output edge already has enough tokens to fire its sink
While (F ≠ ∅)
–if (F−D ≠ ∅), fire from F−D
–else fire the operator which increases the number of tokens least

Buffer Minimization
Repetitions: 1 A, 2 B, 4 C
F={A}, D=∅ –fire A
F={B}, D=∅ –fire B
F={B,C}, D={B} –fire C
F={B,C}, D={B} –fire C
F={B}, D=∅ –fire B
(two more C firings complete the period)
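The heuristic can be sketched directly in Python (graph representation as before; ties between candidates are broken arbitrarily, and operators with no outputs are treated as never deferrable). On the example graph it reproduces the minimum-buffer schedule ABCCBCC:

```python
def min_buffer_schedule(edges, reps):
    """Greedy min-buffer heuristic: prefer non-deferrable fireable
    operators; among candidates, fire the one adding fewest tokens."""
    remaining = dict(reps)
    tokens = {e: 0 for e in edges}
    schedule = []

    def fireable(op):
        return remaining[op] > 0 and all(
            tokens[e] >= c for e, (p, c) in edges.items() if e[1] == op)

    def deferrable(op):
        # every output edge already holds enough tokens for its sink
        outs = [e for e in edges if e[0] == op]
        return bool(outs) and all(tokens[e] >= edges[e][1] for e in outs)

    def token_delta(op):
        return sum(p for e, (p, c) in edges.items() if e[0] == op) - \
               sum(c for e, (p, c) in edges.items() if e[1] == op)

    while any(remaining.values()):
        F = {op for op in remaining if fireable(op)}
        candidates = {op for op in F if not deferrable(op)} or F
        op = min(candidates, key=token_delta)
        for e, (p, c) in edges.items():       # consume, then produce
            if e[1] == op:
                tokens[e] -= c
        for e, (p, c) in edges.items():
            if e[0] == op:
                tokens[e] += p
        remaining[op] -= 1
        schedule.append(op)
    return "".join(schedule)
```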

SDF  BDF What is SDF missing? –Restricts range of expression –Allows static scheduling

SDF  BDF Sufficient Addition:

SDF  BDF BDF –SDF + switch and select operators BDF is Turing Complete

Expression: Block Diagram
Ptolemy example from Buck ’94

Expression: Stream Language
Function AveragePairs(D: Signal returns Signal)
  stream integer [(D[0]+D[1])/2] || AveragePairs(stream_rest(D))
Ex: Dennis ’94
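An equivalent Python generator (illustrative; the original is a SISAL-style stream function per Dennis ’94). The recursion on stream_rest(D) amounts to a sliding window of overlapping pairs, and generator laziness preserves the unbounded-stream semantics:

```python
def average_pairs(signal):
    """Yield (D[0]+D[1])/2, then recurse on the rest of the stream --
    i.e. the average of each successive overlapping pair."""
    it = iter(signal)
    prev = next(it)
    for x in it:
        yield (prev + x) / 2
        prev = x
```

Because nothing is computed until a consumer demands a token, the same definition works on finite lists and on infinite streams alike.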

Convert to Static Data Flow

Composition of Stream Operators
Function Process(D: ImageStream, w: integer returns MarkStream)
  let R := for I in 1, w
    return array of FourForThree(AveragePairs(D[I]))
  end for
  in PeakDetect(TwoDimFilter(R, w))
  end let
end function

Adapting
How different?

Adapting
How different?
–Expensive to change operators
–Possibility of spatial pipelining of operators
  –operator AT (area-throughput) points
  –operator copies
–Allow dynamic rates… violates fixed firing

SDF: Timeslice
Run multiples of the repetition/firing schedule
–valid for acyclic graphs
–requires greater buffering

SDF: Spatial
Can realize spatially
Repetition/firing schedule
–gives relative throughput rates
–simple cases => suggest Area-Throughput points

Dynamic
Note that adding SWITCH/SELECT gives general, dynamic dataflow
Suggests we can identify:
–static regions (obey SDF restrictions)
–dynamic boundaries (where dynamic operators exist)
Statically schedule the static regions
Dynamic control at the boundary/invocation of static blocks

Dynamic Flow Rates
Cannot schedule completely at compile time
Use feedback to get expected flow rates
–schedule like SDF
–track data presence at dynamic boundaries
–allow additional buffer space (overflow)
–stall the slower operator as necessary
  careful: check possible deadlock conditions
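One way to sketch such a dynamic boundary (illustrative names; the deadlock checking mentioned above is not modeled): a bounded buffer that refuses tokens when full, counting stalls as the feedback signal a scheduler could use to rebalance rates.

```python
from collections import deque

class DynamicBoundary:
    """Bounded buffer at a dynamic boundary: overflow stalls the
    producing region (backpressure) rather than growing without bound."""
    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity
        self.stalls = 0            # feedback for rescheduling decisions

    def push(self, token):
        if len(self.q) >= self.capacity:
            self.stalls += 1       # producer must stall and retry
            return False
        self.q.append(token)
        return True

    def pop(self):
        """Return the oldest token, or None if the buffer is empty."""
        return self.q.popleft() if self.q else None
```

A runtime watching the stall counters at each boundary gets exactly the "expected flow rate" feedback described above, without needing exact rates at compile time.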

Summary
The stream datatype captures computational structure
–good for spatial implementations
–exposes parallelism
Rich experience in DF/DSP to exploit
Static scheduling is powerful where applicable
Can still help schedule “mostly static” cases