A Streaming Multi-Threaded Model
Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, André DeHon, John Wawrzynek
BRASS Research Group, University of California, Berkeley
MSP-3, 12/2/01

Slide 2: Protecting Software Investment
- Technology trends: bigger, faster
  - Moore’s Law: 2x transistors every 18 months
- Device landscape growing
  - Microprocessors, DSPs, FPGAs, communication processors, network processors, PSOCs, etc.
- Need a way to let SW survive, automatically scale to next-gen device
- Need a strong model for SW-HW interface with better parallelism

Slide 3: Outline
- Motivation
- SCORE
- SCORE for Reconfigurable Hardware
- SCORE for Microprocessors
- Summary / Future Work

Slide 4: A Lesson from ISA Processors
- ISA (Instruction Set Architecture) decouples SW from HW
  - Survival to compatible, next generation devices
  - Performance scales with device speed + size
  - Survival for decades, e.g. IBM 360, x86
- An ISA cannot scale forever
  - Latency scales with device size (cycles to cross chip, access mem)
  - Need parallelism to hide latency
    - ILP: expensive to extract + exploit (caches, branch pred., etc.)
    - Data: (Vector, MMX) limited applicability; MMX not scalable
    - Thread: (MP, multi-threaded) IPC expensive; hard to program
- Gluing together conventional processors is insufficient

Slide 5: Streams
- Stream = FIFO communication channel with blocking read, non-blocking write, conceptually unbounded capacity
  - Basic primitive for communication, synchronization
  - Exposed at all levels: programming model, architecture
- Application = data flow graph of threads, memories
  - Kahn process network
  - Stream semantics ensure determinism regardless of communication timing, thread scheduling (Kahn continuity)
[Figure: data flow graph; legend: Thread, Mem]
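The stream semantics on this slide can be illustrated with a minimal Python sketch. It is not SCORE code: `queue.Queue` with its default unbounded size stands in for a stream, the `END` token is a hypothetical end-of-stream marker, and `producer`, `scale`, and `run_pipeline` are names chosen for illustration only.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream token, not part of the slide's model


def producer(out, n):
    """Write 0..n-1 to the output stream."""
    for i in range(n):
        out.put(i)  # unbounded queue, so put() never blocks
    out.put(END)


def scale(inp, out, k):
    """Repeatedly read one token (blocking), multiply by k, write the result."""
    while True:
        x = inp.get()  # blocks until a token is available
        if x is END:
            out.put(END)
            return
        out.put(k * x)


def run_pipeline(n, k):
    """Connect producer -> scale with two streams and collect the output."""
    a, b = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, k)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        y = b.get()
        if y is END:
            break
        result.append(y)
    for t in threads:
        t.join()
    return result
```

Because each thread only reads with a blocking `get()` and never tests whether a stream is empty, the collected output should be the same on every run, whatever order the operating system schedules the threads in.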

Slide 6: Stream-Aware Scheduling
- Streams expose inter-thread dependencies (data flow)
- Streams enable efficient, flexible schedules
  - Efficient: fewer blocked cycles, shorter run time
  - Automatically schedule to available resources
    - Number of processors, memory size, network bandwidth, etc.
  - E.g. Fully spatial, pipelined
  - E.g. Time multiplexed with data batching
    - Amortize cost of context swap over larger data set
[Figure legend: Thread, Mem]
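A toy Python sketch of the "time multiplexed with data batching" case may help. It assumes a linear chain of threads sharing one processor, and it counts one context swap each time a thread is brought in; the function name `run_time_multiplexed` and the chain itself are illustrative assumptions, not part of the presentation.

```python
from collections import deque


def run_time_multiplexed(threads, tokens, batch):
    """Run a linear chain of threads on a single processor.

    For every batch of input tokens, each thread is swapped in once and
    drains its whole input buffer before the next thread is swapped in.
    Returns (outputs, number_of_context_swaps).
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    out = []
    swaps = 0
    for start in range(0, len(tokens), batch):
        buf = deque(tokens[start:start + batch])
        for fn in threads:
            swaps += 1  # one context swap to bring this thread in
            buf = deque(fn(x) for x in buf)  # drain the buffered batch
        out.extend(buf)
    return out, swaps
```

With two threads and eight tokens, a batch size of 1 costs 16 swaps while a batch size of 8 costs 2, and the outputs are identical. Larger batches need larger stream buffers, which is the trade-off the scheduler has to manage.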

Slide 7: Stream Reuse
- Persistent streams enable reuse
  - Establish connection once (network route / buffer)
  - Reuse connection while threads loaded
  - Cheap (single cycle) stream access
- Amortize per-message cost of communication
[Figure legend: Thread, Mem]

Slide 8: SCORE Compute Model
- Program = data flow graph of stream-connected threads
  - Kahn process network (blocking read, non-blocking write)
- Compute: Thread
  - Task with local control
- Communication: Stream
  - FIFO channel, unbounded buffer capacity, blocking read, non-blocking write
- Memory: Segment
  - Memory block with stream interface (e.g. streaming read)
- Dynamics:
  - Dynamic local thread behavior -> dynamic flow rates
  - Unbounded resource usage: may need stream buffer expansion
  - Dynamic graph allocation
- Model admits parallelism at multiple levels: ILP, pipeline, data

Slide 9: SCORE for Reconfigurable Hardware
- SCORE: Stream Computations Organized for Reconfigurable Execution
- Programmable logic + Programmable Interconnect
  - E.g. Field Programmable Gate Arrays (FPGAs)
- Hardware scales by tiling / duplicating
- High parallelism; spatial data paths
- But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints

Slide 10: Virtual Hardware
- Compute model has unbounded resources
  - Programmer no longer targets particular device size
- Paging
  - “Compute pages” swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
- Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure labels: buffers; compute pages Transform, Quantize, RLE, Encode]

Slide 11: SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
    - Fixed-size slice of RC hardware (e.g. 512 4-LUTs)
    - Fixed number of I/O ports
  - Configurable Memory Block (CMB)
    - Distributed, on-chip memory (e.g. 2 Mbit)
    - Stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code

Slide 12: Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X’s inputs, then fire (consume, act)

    select (input boolean s,
            input unsigned[8] t,
            input unsigned[8] f,
            output unsigned[8] o)
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
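The firing semantics of the `select` operator can be mimicked in Python as a rough sketch. This is not a TDF implementation: streams are modeled as `deque` objects, the 8-bit width of `t`, `f`, and `o` is not enforced, and `select_step` and `run_select` are hypothetical helper names.

```python
from collections import deque


def select_step(state, s, t, f, o):
    """Try to fire the operator once from the given state.

    Each state waits only for its own input stream. Returns the next
    state, or None if that input is empty and the operator must wait.
    """
    if state == "S":
        if not s:
            return None
        return "T" if s.popleft() else "F"
    if state == "T":
        if not t:
            return None
        o.append(t.popleft())  # o = t
        return "S"
    if state == "F":
        if not f:
            return None
        o.append(f.popleft())  # o = f
        return "S"
    raise ValueError(f"unknown state {state!r}")


def run_select(s_vals, t_vals, f_vals):
    """Fire repeatedly from state S until the operator would block."""
    s, t, f, o = deque(s_vals), deque(t_vals), deque(f_vals), deque()
    state = "S"
    while True:
        nxt = select_step(state, s, t, f, o)
        if nxt is None:
            return list(o)
        state = nxt
```

For example, `run_select([True, False, True], [10, 11], [20, 21])` yields `[10, 20, 11]`. Only the selected stream is consumed on each firing, so the token `21` stays in `f`; this differs from a multiplexer that reads both data inputs every cycle.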

Slide 13: Page Scheduling
- Schedule = time-sliced eviction / loading
  - Choose pages to run
  - Manage stream buffers (modify page graph; swap memory)
  - Configure CPs, CMBs, network
- Implemented several schedulers
  - Dynamic: Dynamic loading order based on buffered input
  - Static: Static, repeated loading order
  - Quasi-Static: Static loading order, dynamic time slice
- Page loading order (static / quasi-static)
  - Topological: dependence order (arbitrary topological sort of page graph)
  - Min-cut: minimize # of live stream buffers (min-cut page graph)
  - Exhaustive: minimize stall cycles based on profiled I/O rates (exhaustively search all topological orders)
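As a small illustration of the "Topological" loading order, the sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to produce one arbitrary dependence-respecting order for a page graph. The graph encoding (each page mapped to the set of pages feeding it) and the helper names are assumptions for this example; it does not reproduce the schedulers described on the slide.

```python
from graphlib import TopologicalSorter


def load_order(page_graph):
    """Return one topological order of the pages.

    page_graph maps each page to the set of pages that feed it.
    """
    return list(TopologicalSorter(page_graph).static_order())


def respects_dependences(order, page_graph):
    """Check that every producer page appears before its consumers."""
    pos = {page: i for i, page in enumerate(order)}
    return all(
        pos[src] < pos[dst]
        for dst, srcs in page_graph.items()
        for src in srcs
    )
```

For a hypothetical diamond-shaped graph `{"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}`, both `A, B, C, D` and `A, C, B, D` are valid orders. A page graph with a feedback loop has no topological order, and `TopologicalSorter` raises `CycleError` in that case, so such graphs would need separate handling.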

Slide 14: Execution Results
[Figure: execution results plot; axis label: Hardware Size (CP-CMB Pairs)]

Slide 15: Heterogeneous SCORE
- SCORE extends to other processor types
- Network interface
  - Route traffic to network or buffer
  - Block on empty/full stream access
[Figure labels: Processor, FPU, IO]

Slide 16: Microprocessor Stream Support
- Stream instructions:
    stream_read(reg, idx)
    stream_write(reg, idx)
[Figure label: Network Interface]
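The slide gives only the two instruction names and their operands. The Python sketch below shows one plausible reading, under the assumptions that `reg` is a register number, `idx` is a stream index, stream buffers in the network interface are bounded, and an access to an empty or full stream stalls the instruction. The `StreamCPU` class and its `capacity` parameter are hypothetical.

```python
from collections import deque


class StreamCPU:
    """Toy processor model with a register file and indexed stream buffers."""

    def __init__(self, n_regs, n_streams, capacity):
        self.regs = [0] * n_regs
        self.streams = [deque() for _ in range(n_streams)]
        self.capacity = capacity  # assumed per-stream buffer size

    def stream_read(self, reg, idx):
        """Move the head of stream idx into register reg.

        Returns False to signal a stall when the stream is empty.
        """
        q = self.streams[idx]
        if not q:
            return False
        self.regs[reg] = q.popleft()
        return True

    def stream_write(self, reg, idx):
        """Append register reg to stream idx.

        Returns False to signal a stall when the buffer is full.
        """
        q = self.streams[idx]
        if len(q) >= self.capacity:
            return False
        q.append(self.regs[reg])
        return True
```

Returning `False` models a stall that the caller retries later; real hardware would presumably hold the instruction in the pipeline or switch threads instead.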

Slide 17: Summary
- Exposing streams at all levels (programming model, architecture) enables software survival + performance scaling in high-capacity architectures
- Demonstrated scalable hybrid reconfigurable architecture; proposed heterogeneous / multi-processor extensions
- Future work
  - Page partitioning for reconfigurable
  - Scheduling with I/O rate matching
- More Information
  - SCORE web page: http://brass.cs.berkeley.edu/SCORE/
  - FPGA 2002 paper (February 24-26)

Slide 18: Supplemental

Slide 19: Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - 0.25 µm: 12.9 mm² (1/9 of PII-450)
    - 0.18 µm: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
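The two cycle counts on this slide allow a back-of-the-envelope estimate of reconfiguration overhead. The sketch below assumes one reconfiguration per time slice that is not overlapped with computation and is counted against the slice; the slide does not state this, so the result is only indicative.

```python
RECONFIG_CYCLES = 5_000      # page reconfiguration time, from the slide
TIME_SLICE_CYCLES = 250_000  # timer interrupt period, from the slide


def reconfig_overhead(reconfig_cycles, slice_cycles):
    """Fraction of a time slice spent reconfiguring.

    Assumes one non-overlapped reconfiguration per slice.
    """
    return reconfig_cycles / slice_cycles


overhead = reconfig_overhead(RECONFIG_CYCLES, TIME_SLICE_CYCLES)
print(f"{overhead:.1%}")
```

Under these assumptions, reconfiguration takes 5000 / 250,000 = 2% of each slice.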

Slide 20: Application: JPEG Encode
