Baring It All to Software: Raw Machines
Waingold, Taylor, et al.
Massachusetts Institute of Technology, Lab. for CS
Presented by Garver Moore for ECE 259: Advanced Computer Architecture II
Prof. D.J. Sorin, Duke University
These three trends . . .
1) Verification Complexity and Constraints
- Superscalar verification
- Dynamic execution structures: area, complexity
- Corner cases++
2) Chip Wire Length Constraints
- Pipelined communication b/w resources
- Clock net limits
- Transmission-line design
3) “Dynamic” Workload Space
- “Changing application workloads”
- Y2K ISA appropriate for Y2K workloads
- E.g. streaming I/D apps (MMX / SSE)?
. . . motivate “Raw” Architectures
Philosophy: Tile machine (a la 128-CMP)
Per tile:
- Instruction stream
- Cache (I$, D$, and memories)
- Functional units (architecturally visible regs, ALU)
- Switch (reprogrammable)
- (Re)configurable units (more on this later)

- Leverage STATIC information
- Provide correctness for dynamic events
Proposed Raw tile
- Point-to-point inter-tile network
- No instruction traverses more than 1 tile width per cycle
- Reconfigurable switch memory enables scheduling directives
- Architecturally visible registers
- ALU operations
- Configurable logic

3 Distinct Approaches: 3 memory/state distribution models
- Raw: memory ports and the register file are distributed among a switched point-to-point network between functional units and state.
- Superscalar: a single memory and state port communicates with distributed functional units over a large, often pipelined, global bus.
- Traditional multiprocessors: memory, state, and functional units are distributed on a switched network, with communication through memory.
Key difference b/w Raw and multiprocessors is the granularity of communication.
(Figure panels: Raw, Superscalar, Multiprocessor)
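As a rough illustration of the nearest-neighbor constraint on this slide, the sketch below estimates the minimum number of cycles a value needs to travel between two tiles. It assumes a 2D mesh with at most one tile crossed per cycle and tiles addressed as (row, column); the function name and coordinate scheme are hypothetical and not taken from the paper.

```python
def min_hop_cycles(src, dst):
    """Lower bound on cycles to move a value between two tiles.

    Assumes a 2D mesh in which a value crosses at most one tile per
    cycle, so the bound is the Manhattan distance between the tiles.
    """
    (src_row, src_col), (dst_row, dst_col) = src, dst
    return abs(src_row - dst_row) + abs(src_col - dst_col)


# Corner to corner on a 4x4 mesh: 3 hops down plus 3 hops across.
print(min_hop_cycles((0, 0), (3, 3)))  # prints 6
```

Under these assumptions the compiler can know communication latency ahead of time, which is the kind of static information the switch schedule is meant to exploit.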
Configurable Logic (CL)
- Do-it-yourself architecture extensions
- Create customized instructions
- Example: Game of Life “benchmark” drops a 22-cycle software sequence to 1 instruction
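To give a feel for the kind of work a customized instruction might fold together, here is a minimal sketch of one Game of Life cell update written as ordinary software operations. The function name and the eight-value neighbor encoding are assumptions for illustration; the slide does not say which operations make up the 22-cycle sequence.

```python
def life_next_state(alive, neighbors):
    """Standard Conway rule for one cell.

    alive:     True if the cell is currently live.
    neighbors: eight 0/1 values, one per surrounding cell.
    """
    count = sum(neighbors)  # in straight-line software: a chain of adds
    # A cell is live next step with exactly 3 live neighbors,
    # or if it is already live and has exactly 2.
    return count == 3 or (alive and count == 2)


print(life_next_state(False, [1, 1, 1, 0, 0, 0, 0, 0]))  # prints True
```

In software, the neighbor sum and the two comparisons each cost separate instructions; configurable logic could, in principle, evaluate the whole rule as one combinational operation.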
Raw vs. Other Architectures I
A[b[i]] = A[c[i]];
Systolic Arrays: (Mark II Colossus)
- Slightly more recently, NuMesh (MIT)
- Almost ZERO support for dynamic events, reconfiguration, patterns
FPGAs:
- Configurable, application specific
VLIW:
- Large register namespace
- Distributed register file
- Massive compiler dependency

Systolic arrays: one of the first computers (Mark II Colossus) was used for code breaking, finding wheel settings for the Lorenz enciphering machine. Dataflow between functional units is directional, e.g. inputs taken from “NW” and outputs given to “SE.” In its simplest form this obviously does not allow for dynamic dependencies.

NuMesh is a packaging and interconnect technology supporting high-bandwidth systolic communications on a 3D nearest-neighbor lattice; the project's goal is to combine Lego-like modularity with supercomputer performance. To date, the primary focus of the project has been the class of applications whose static communication patterns can be precompiled into independent and carefully choreographed finite state machines running on each node. Extensions of the NuMesh to more general communication paradigms have been implemented.

FPGAs: configurability is obvious, but they do not support instruction sequencing and have “onerous” compilation times. The Raw architecture has complex but pre-compiled units (ALUs, etc.).

VLIW: “inspiring” for Raw, with a similar dependence on static information, distributed registers, many registers, etc. However, Raw allows for multiple instruction streams, so it can perform independent but statically scheduled computations in different tiles.
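The indirect assignment `A[b[i]] = A[c[i]]` on this slide presumably illustrates a dynamic dependence: whether one loop iteration reads a value written by an earlier one depends on the run-time contents of `b` and `c`. The sketch below makes that concrete; the helper name is hypothetical, and it checks only read-after-write dependences across iterations.

```python
def has_flow_dependence(b, c):
    """True if some iteration of `A[b[i]] = A[c[i]]` reads an element
    that an earlier iteration wrote (a read-after-write dependence)."""
    written = set()
    for write_idx, read_idx in zip(b, c):
        # The read of A[c[i]] happens before the write to A[b[i]].
        if read_idx in written:
            return True
        written.add(write_idx)
    return False


print(has_flow_dependence([0, 1, 2], [3, 4, 5]))  # prints False
print(has_flow_dependence([0, 1, 2], [3, 0, 5]))  # prints True
```

Because the answer changes with the index arrays, a purely static schedule cannot assume the iterations are independent, which is why some support for dynamic events is needed.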
Raw vs. Other Architectures II
Multiscalar
- “Deceptive” similarity
- Resources unexposed
- E.g. value forwarding
CMP
- Simple replication
- Message startup / synchronization performance issues
IRAM
- “On-chip balance”
- Still, long bitlines and multibanked memory delays
- Might suffice “now” (1997) but in future processes will be exposed

Multiscalar: uses hardware renaming and exposes only 32 architecturally visible registers; Raw gives the compiler more flexibility. Raw allows explicit value forwarding to tiles, which allows for a scalable interconnect; the Multiscalar approach uses a bus for broadcast to tiles.
Results – “RawLogic” FPGA Implementation
- Does not support “general” instruction processing: static control sequences were converted into state machines
- Less flexible, more compilation time
Questions / Discussion I
- Small register name-space problem?
- “Reducing HW support . . . opposes current trends, but [more area] and reduced verification complexity. Taken together, these benefits can make the software synthesis of complex operations competitive with hardware for overall application performance.” (Emphasis mine)
- Limits of do-it-yourself ISA?
- Where is the dynamic limit? I/O? Contexts?
- Along the same vein, appropriate performance evaluation? Or too tailored (e.g. Tarantula)?
- Market size?
Questions / Discussion II
- How have innovations since 1997 affected this?
- Is there a limit to multiple-granularity reconfigurability’s usefulness?
- The Prophecy: “In 10 to 15 years, we believe that [giga-xistor chips] faster switch speeds, and growing compiler sophistication will allow a Raw machine’s performance/cost ratio to surpass that of traditional architectures for future, general-purpose workloads”
- Dynamic event support: too thin?
- “The Google Test”