A Streaming Multi-Threaded Model
Eylon Caspi, Randy Huang, Yury Markovskiy, Joe Yeh, André DeHon, John Wawrzynek
BRASS Research Group, University of California, Berkeley
MSP-3, 12/2/01
Protecting Software Investment
- Technology trends: bigger, faster
  - Moore’s Law: 2x transistors every 18 months
- Device landscape growing
  - Microprocessors, DSPs, FPGAs, communication processors, network processors, PSOCs, etc.
- Need a way to let SW survive and automatically scale to next-gen devices
- Need a strong model for the SW-HW interface with better parallelism
Outline
- Motivation
- SCORE
- SCORE for Reconfigurable Hardware
- SCORE for Microprocessors
- Summary / Future Work
A Lesson from ISA Processors
- ISA (Instruction Set Architecture) decouples SW from HW
  - Survival to compatible, next-generation devices
  - Performance scales with device speed + size
  - Survival for decades, e.g. IBM 360, x86
- An ISA cannot scale forever
  - Latency scales with device size (cycles to cross chip, access mem)
  - Need parallelism to hide latency
    - ILP: expensive to extract + exploit (caches, branch pred., etc.)
    - Data (Vector, MMX): limited applicability; MMX not scalable
    - Thread (MP, multi-threaded): IPC expensive; hard to program
- Gluing together conventional processors is insufficient
Streams
- Stream = FIFO communication channel with blocking read, non-blocking write, conceptually unbounded capacity
  - Basic primitive for communication, synchronization
  - Exposed at all levels: programming model, architecture
- Application = data flow graph of threads, memories
  - Kahn process network
  - Stream semantics ensure determinism regardless of communication timing, thread scheduling (Kahn continuity)
[Figure: data flow graph of threads ("Thread") and memories ("Mem")]
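A minimal sketch of the stream semantics described on this slide, using Python threads as a stand-in for SCORE threads. The names `producer`, `doubler`, `run`, and the `END` marker are hypothetical and not part of the slides; an unbounded `queue.Queue` models the conceptually unbounded FIFO, so `put` never blocks while `get` blocks until a token arrives.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker (an assumption, not from the slides)


def producer(out):
    """Source thread: writes five tokens, then signals end of stream."""
    for x in range(5):
        out.put(x)          # non-blocking write: an unbounded queue never blocks
    out.put(END)


def doubler(inp, out):
    """Worker thread: blocking read, compute, non-blocking write."""
    while True:
        x = inp.get()       # blocking read: waits until a token is available
        if x is END:
            out.put(END)
            return
        out.put(2 * x)


def run():
    """Wire producer -> doubler with two streams and collect the output."""
    a, b = queue.Queue(), queue.Queue()   # maxsize=0 means unbounded capacity
    threads = [
        threading.Thread(target=producer, args=(a,)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        x = b.get()
        if x is END:
            break
        result.append(x)
    for t in threads:
        t.join()
    return result
```

Because each thread only communicates through blocking reads on FIFO streams, the collected output is the same no matter how the operating system interleaves the two threads, which is the determinism property the slide attributes to Kahn process networks.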
Stream-Aware Scheduling
- Streams expose inter-thread dependencies (data flow)
- Streams enable efficient, flexible schedules
  - Efficient: fewer blocked cycles, shorter run time
  - Automatically schedule to available resources
    - Number of processors, memory size, network bandwidth, etc.
  - E.g. fully spatial, pipelined
  - E.g. time multiplexed with data batching
    - Amortize cost of context swap over larger data set
[Figure: data flow graph of threads ("Thread") and memories ("Mem")]
Stream Reuse
- Persistent streams enable reuse
  - Establish connection once (network route / buffer)
  - Reuse connection while threads loaded
  - Cheap (single cycle) stream access
- Amortize per-message cost of communication
[Figure: data flow graph of threads ("Thread") and memories ("Mem")]
SCORE Compute Model
- Program = data flow graph of stream-connected threads
  - Kahn process network (blocking read, non-blocking write)
- Compute: Thread
  - Task with local control
- Communication: Stream
  - FIFO channel, unbounded buffer capacity, blocking read, non-blocking write
- Memory: Segment
  - Memory block with stream interface (e.g. streaming read)
- Dynamics:
  - Dynamic local thread behavior; dynamic flow rates
  - Unbounded resource usage: may need stream buffer expansion
  - Dynamic graph allocation
- Model admits parallelism at multiple levels: ILP, pipeline, data
SCORE for Reconfigurable Hardware
- SCORE: Stream Computations Organized for Reconfigurable Execution
- Programmable logic + programmable interconnect
  - E.g. Field Programmable Gate Arrays (FPGAs)
- Hardware scales by tiling / duplicating
- High parallelism; spatial data paths
- But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints
Virtual Hardware
- Compute model has unbounded resources
  - Programmer no longer targets particular device size
- Paging
  - “Compute pages” swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
- Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure: compute pages Transform, Quantize, RLE, Encode connected by buffers]
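The amortization idea can be sketched as a toy simulation, assuming a single physical compute page that must be reconfigured whenever a different virtual page is loaded. The stage functions below borrow their names from the figure, but their bodies are arbitrary one-token-in, one-token-out stand-ins, not real JPEG stages; `run_batched` and `run_per_token` are hypothetical helper names.

```python
def transform(x):   # stand-in body; not a real transform
    return x * 2

def quantize(x):    # stand-in body; not a real quantizer
    return x // 4

def rle(x):         # stand-in body; real RLE is not one-to-one
    return x + 1

def encode(x):      # stand-in body; not a real entropy coder
    return x ^ 1

STAGES = [transform, quantize, rle, encode]


def run_batched(stages, inputs):
    """Load each stage once and let it drain its entire input buffer."""
    buf = list(inputs)
    reconfigs = 0
    for stage in stages:
        reconfigs += 1                    # one reconfiguration per stage
        buf = [stage(x) for x in buf]     # process the whole buffer before swapping
    return buf, reconfigs


def run_per_token(stages, inputs):
    """Swap stages for every token: same result, far more reconfigurations."""
    out = []
    reconfigs = 0
    for x in inputs:
        for stage in stages:
            reconfigs += 1                # reconfigure for each stage of each token
            x = stage(x)
        out.append(x)
    return out, reconfigs
```

With 8 input tokens and 4 stages, the batched schedule reconfigures 4 times while the per-token schedule reconfigures 32 times, and both produce identical output. The buffers between pages are what make the batched order possible.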
SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
    - Fixed-size slice of RC hardware (e.g. LUTs)
    - Fixed number of I/O ports
  - Configurable Memory Block (CMB)
    - Distributed, on-chip memory (e.g. 2 Mbit)
    - Stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code
Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X’s inputs, then fire (consume, act)

    select (input boolean s,
            input unsigned[8] t,
            input unsigned[8] f,
            output unsigned[8] o)
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
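To make the firing semantics concrete, here is a small Python simulation of the `select` operator above. It is a sketch, not a TDF implementation: finite lists stand in for streams, and an empty list models an input the operator would block on forever, so the simulation simply stops there. The function name mirrors the TDF operator; everything else is an assumption for illustration.

```python
from collections import deque


def select(s, t, f):
    """Simulate the TDF select operator on finite input streams.

    In state S the operator waits for a token on s and branches.
    In state T (or F) it waits for a token on t (or f), copies it to the
    output, and returns to S. Returns the tokens emitted on o before a
    required input runs dry.
    """
    s, t, f = deque(s), deque(t), deque(f)
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s:            # would block waiting for s
                break
            state = "T" if s.popleft() else "F"
        elif state == "T":
            if not t:            # would block waiting for t
                break
            out.append(t.popleft())
            state = "S"
        else:  # state == "F"
            if not f:            # would block waiting for f
                break
            out.append(f.popleft())
            state = "S"
    return out
```

For example, `select([True, False, True], [1, 2], [10])` consumes `t`, then `f`, then `t` again and yields `[1, 10, 2]`. Note that each state consumes only the inputs named in its firing signature, so tokens on the unselected stream are left in place rather than discarded.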
Page Scheduling
- Schedule = time-sliced eviction / loading
  - Choose pages to run
  - Manage stream buffers (modify page graph; swap memory)
  - Configure CPs, CMBs, network
- Implemented several schedulers
  - Dynamic: dynamic loading order based on buffered input
  - Static: static, repeated loading order
  - Quasi-static: static loading order, dynamic time slice
- Page loading order (static / quasi-static)
  - Topological: dependence order (arbitrary topological sort of page graph)
  - Min-cut: minimize # of live stream buffers (min-cut page graph)
  - Exhaustive: minimize stall cycles based on profiled I/O rates (exhaustively search all topological orders)
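The "topological" loading order can be illustrated with a standard Kahn-style topological sort. This is a generic sketch, not the schedulers' actual code: it assumes the page graph is acyclic, and the page names and edges in the usage note are hypothetical.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return one dependence-respecting loading order for a page graph.

    nodes: list of page names.
    edges: list of (producer, consumer) pairs, one per stream.
    Raises ValueError if the graph has a cycle (this sketch assumes none).
    """
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)  # pages with no producers
    order = []
    while ready:
        page = ready.popleft()
        order.append(page)
        for nxt in successors[page]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:       # all producers already scheduled
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("page graph has a cycle; no topological order exists")
    return order
```

For a diamond-shaped graph with pages `A`, `B`, `C`, `D` and streams `A->B`, `A->C`, `B->D`, `C->D`, the function returns an order that starts with `A` and ends with `D`. Either `B` or `C` may legally come second, which is the sense in which the slide calls the sort arbitrary; the min-cut and exhaustive policies differ in how they choose among such valid orders.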
Execution Results
[Figure: execution results plotted against hardware size (CP-CMB pairs)]
Heterogeneous SCORE
- SCORE extends to other processor types
- Network interface
  - Route traffic to network or buffer
  - Block on empty/full stream access
[Figure: Processor, FPU, IO]
Microprocessor Stream Support
- Stream instructions:
  - stream_read(reg,idx)
  - stream_write(reg,idx)
[Figure: Network Interface]
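A toy model of how the two stream instructions might behave, written in Python for illustration. The slide gives only the instruction names and operands, so everything else here is an assumption: `reg` is taken to be a register number, `idx` a stream index, streams are modeled as bounded FIFOs, and a stall is reported by returning `False` instead of actually blocking. The class name `StreamCPU` and the `capacity` parameter are hypothetical.

```python
from collections import deque


class StreamCPU:
    """Hypothetical register file plus stream buffers (assumed semantics)."""

    def __init__(self, num_regs=4, num_streams=2, capacity=2):
        self.regs = [0] * num_regs
        self.streams = [deque() for _ in range(num_streams)]
        self.capacity = capacity      # assumed finite buffer per stream

    def stream_write(self, reg, idx):
        """Append regs[reg] to stream idx; return False (stall) if it is full."""
        fifo = self.streams[idx]
        if len(fifo) >= self.capacity:
            return False
        fifo.append(self.regs[reg])
        return True

    def stream_read(self, reg, idx):
        """Pop the oldest token of stream idx into regs[reg]; False (stall) if empty."""
        fifo = self.streams[idx]
        if not fifo:
            return False
        self.regs[reg] = fifo.popleft()
        return True
```

In this model a stalled instruction leaves all state unchanged, so it can simply be retried later, for example after the thread has been descheduled and resumed.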
Summary
- Exposing streams at all levels (programming model, architecture) enables software survival + performance scaling in high-capacity architectures
- Demonstrated scalable hybrid reconfigurable architecture; proposed heterogeneous / multi-processor extensions
- Future work
  - Page partitioning for reconfigurable
  - Scheduling with I/O rate matching
- More information
  - SCORE web page
  - FPGA 2002 paper (February 24-26)
Supplemental
Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - .25 µm: 12.9 mm² (1/9 of PII-450)
    - .18 µm: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
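As a back-of-the-envelope check on these parameters, the following sketch estimates what fraction of a time slice goes to page reconfiguration. The two constants come from the slide; the model itself is an assumption: it charges one 5000-cycle reconfiguration per time slice (as if all pages reload in parallel from their CMBs) and ignores scheduler run time. The function name and the `reconfigs_per_slice` parameter are hypothetical.

```python
RECONFIG_CYCLES = 5_000        # page reconfiguration from CMB (from the slide)
TIME_SLICE_CYCLES = 250_000    # timer interrupt period (from the slide)


def reconfig_overhead(reconfig_cycles, slice_cycles, reconfigs_per_slice=1):
    """Fraction of a time slice spent reconfiguring (assumed simple model)."""
    return reconfigs_per_slice * reconfig_cycles / slice_cycles


overhead = reconfig_overhead(RECONFIG_CYCLES, TIME_SLICE_CYCLES)
```

Under these assumptions the overhead is 5000 / 250,000 = 2% of each slice. If reconfigurations were serialized rather than overlapped, `reconfigs_per_slice` would grow with the number of pages swapped and the overhead would scale accordingly.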
Application: JPEG Encode
Execution Results
[Figure: execution results plotted against hardware size (CP-CMB pairs)]
Execution Results
[Figure: execution results plotted against hardware size (CP-CMB pairs)]
Execution Results
[Figure: execution results plotted against hardware size (CP-CMB pairs)]