
1 RAMP Tutorial Introduction/Overview. Krste Asanovic, UC Berkeley. RAMP Tutorial, ASPLOS, Seattle, WA, March 2, 2008

2 Technology Trends: CPU
Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
- End of uniprocessors and of ever-faster clock rates
- Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
Since parallel is more power efficient (dynamic power ≈ CV²f), the new "Moore's Law" is 2X processors ("cores") per socket every 2 years, at the same clock frequency
- Conservative: 4 cores in 2007, 8 cores in 2009, 16 cores in 2011, for embedded, desktop, & server
- A sea change for the HW and SW industries, since it changes the programming model and responsibilities
- The HW/SW industries have bet the farm that parallelism succeeds
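To make the power claim concrete, here is a back-of-the-envelope derivation (an illustrative sketch; the proportional voltage scaling assumed below is optimistic in practice):

```latex
% Dynamic CMOS power: P \approx C V^2 f.
\begin{align*}
P_{\text{1 core}}  &\approx C V^2 f \\
P_{\text{2 cores}} &\approx 2 \cdot C \left(\tfrac{V}{2}\right)^{2} \cdot \tfrac{f}{2}
                    = \tfrac{1}{4}\, C V^2 f
\end{align*}
% Two cores at half frequency (and, optimistically, half voltage) match the
% one fast core's aggregate throughput at roughly a quarter of the power.
```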

3 Problems with the "Manycore" Sea Change
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready for 1000 CPUs per chip.
2. Only companies can build HW, and it takes years.
3. Software people don't start working hard until hardware arrives; 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW.
4. How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion in algorithms, compilers, languages, OS, architectures, ...?
5. Can we avoid waiting years between HW/SW iterations?

4 Vision: Build Research MPP from FPGAs
As ≈16 CPUs will fit in one Field Programmable Gate Array (FPGA), build a 1000-CPU system from ≈64 FPGAs?
- 8 simple 32-bit "soft core" RISC processors ran at 100 MHz in 2004 (Virtex-II)
- FPGA generations arrive every 1.5 yrs: ≈2X CPUs, ≈1.2X clock rate per generation
The HW research community does the logic design ("gate shareware") to create an out-of-the-box MPP
- E.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer @ ≈150 MHz/CPU in 2007
- 6 universities, 10 faculty
- 3rd party sells RAMP 2.0 (BEE3) hardware at low cost
"Research Accelerator for Multiple Processors"

5 Why RAMP Good for Research MPP?

                          SMP          | Cluster      | Custom        | Simulate        | RAMP
Scalability (1k CPUs)     C            | A            | A             | A               | A
Cost (1k CPUs)            F ($20M)     | C ($1M)      | F ($3M)       | A+ ($0M)        | A ($0.1M)
Cost of ownership         A            | D            | A             | A               | A
Power/Space (kW, racks)   D (120, 6)   | D (120, 6)   | A (100, 3)    | A+ (0.1, 0.1)   | A (1.5, 0.3)
Community                 D            | A            | F             | A               | A
Observability             D            | C            | D             | A+              | A+
Reproducibility           B            | D            | B             | A+              | A+
Reconfigurability         D            | C            | D             | A+              | A+
Credibility               A+           | A+           | A-            | F               | B
Performance (clock)       A (2 GHz)    | A (3 GHz)    | B (0.4 GHz)   | F (0 GHz)       | C (0.1 GHz)
GPA                       C            | B-           | C+            | B               | A-

6 Partnerships
Co-PIs: Krste Asanović (UC Berkeley), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
RAMP hardware development activity is centered at the Berkeley Wireless Research Center; a three-year NSF grant funds staff (awarded 3/06); GSRC (Jan Rabaey) has paid for partial staff and some students
Major continuing commitment from Xilinx
Collaboration with MSR (Chuck Thacker) on the BEE3 FPGA platform; Sun and IBM are contributing processor designs; IBM faculty awards
High-speed, high-confidence emulation is widely recognized as a necessary component of multiprocessor research and development; FPGA emulation is the only practical approach

7 BEE3, 1st prototype 11/07
New RAMP systems to be based on the Berkeley Emulation Engine version 3 (BEE3)
BEECube, Inc. (a UC Berkeley spinout startup company)
- To provide manufacturing, distribution, and support to commercial and academic users
- General availability 2Q08
BEE3 design: Chuck Thacker; Chen Chang, UC Berkeley
For small-scale designs, or to get started, use the Xilinx ML505

8 RAMP: An infrastructure to build simulators using FPGAs

9 Run Target Model on Host Platform
[diagram: a target model (CPUs, interconnect network, DRAM) is mapped onto the host platform; the mapping is the hard work]

10 Reduce, Reuse, Recycle
Reduce the effort to build target models
- Users just build components (units); the infrastructure handles connections (the RDL compiler)
Reuse units by having good abstractions
- Across different target models
- Across different host platforms: XUP, Calinx, BEE2, BEE3, ML505, and also Altera platforms
Recycle existing IP for use as simulation models
- Commercial processor RTL is (almost) its own model

11 RAMP Target Model
Units
- Relatively large chunks of functionality, e.g., processor + L1 cache
- User-written in some HDL or software
Channels
- Point-to-point, unidirectional, two kinds:
  - FIFO channel: flow-controlled interface
  - Pipeline channel: simple shift register; bits drop off the end
- Generated by the RAMP infrastructure
[diagram: Units A, B, C connected by a FIFO channel and a pipeline channel]

12 Target Pipeline Channel Parameters
[diagram: a pipeline channel of datawidth D with a forward latency parameter]
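As an illustration of the channel abstraction, here is a minimal software sketch (hypothetical Python; the real channels are hardware generated by the RAMP infrastructure) of a pipeline channel with a forward latency of L target cycles and a datawidth of D bits:

```python
from collections import deque

class PipelineChannel:
    """Toy model of a RAMP pipeline channel: a shift register of
    `latency` stages with no flow control; whatever reaches the far
    end simply drops off (unlike a flow-controlled FIFO channel)."""

    def __init__(self, latency, datawidth):
        self.datawidth = datawidth
        # Pre-fill with empty slots (None = no valid data in a stage).
        self.stages = deque([None] * latency)

    def clock(self, data_in):
        """Advance one target cycle: shift `data_in` in, return whatever
        drops off the end (None if that stage was empty)."""
        if data_in is not None:
            assert 0 <= data_in < (1 << self.datawidth)
        self.stages.append(data_in)
        return self.stages.popleft()
```

For example, PipelineChannel(latency=4, datawidth=32).clock(word) returns each word exactly 4 target cycles after it was sent.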

13 RAMP Description Language (RDL)
The user describes the target model topology, channel parameters, and the (manual) mapping to host platform FPGAs using RDL
The RDL Compiler (RDLC) generates configurations
[diagram: a target model of Units A, B, C is mapped across FPGA1 and FPGA2; RDLC-generated unit wrappers and generated links carry the channels between FPGAs]
[Greg Gibeling, UCB]
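This transcript does not show RDL's actual syntax, so the fragment below is only a hypothetical Python stand-in for the kind of information such a description carries: units, channels with their parameters, and the manual unit-to-FPGA mapping:

```python
# Hypothetical stand-in for an RDL description; real RDL syntax differs.
target_model = {
    "units": ["UnitA", "UnitB", "UnitC"],
    "channels": [
        # (source, dest, kind, forward latency in target cycles, datawidth)
        ("UnitA", "UnitB", "fifo",     1, 32),
        ("UnitB", "UnitC", "pipeline", 4, 64),
    ],
    # Manual mapping of units onto host-platform FPGAs.
    "mapping": {"UnitA": "FPGA1", "UnitB": "FPGA1", "UnitC": "FPGA2"},
}

# A compiler in the role of RDLC would walk this description, generate a
# wrapper per unit, and realize each channel either on-chip or over an
# inter-FPGA link while preserving the target timing semantics.
```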

14 Virtual Target Clock

15 Virtualized RTL Improves FPGA Resource Usage
RAMP allows units to run at varying target-to-host clock ratios to optimize area and overall performance
Example 1: Multiported register file
- E.g., Sun Niagara has 3 read ports and 2 write ports to 6 KB of register storage
- If the RTL is mapped directly, it requires 48K flip-flops (6 KB = 48 Kbits): slow cycle time, large area
- If mapped into block RAMs (one read + one write per cycle), it takes 3 host cycles per target cycle and 3 x 2 KB block RAMs: faster cycle time (~3X) and far fewer resources
Example 2: Large L2/L3 caches
- Current FPGAs have only ~1 MB of on-chip SRAM
- Use the on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses, and fetch the data from off-chip DRAM
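A minimal sketch (hypothetical Python, mirroring Example 1) of how a 3-read/2-write register file can be emulated on storage with one read and one write port by spending 3 host cycles per target cycle:

```python
class VirtualizedRegfile:
    """Emulates a 3R/2W register file on single-read/single-write
    storage (standing in for a block RAM) in 3 host cycles per target
    cycle. Read-before-write ordering within a target cycle is glossed
    over; a real design must handle it explicitly."""

    def __init__(self, num_regs=768):
        self.mem = [0] * num_regs

    def target_cycle(self, read_idxs, writes):
        """read_idxs: 3 register indices; writes: up to 2 (idx, value)
        pairs. Each iteration models one host cycle using at most one
        read port and one write port."""
        assert len(read_idxs) == 3 and len(writes) <= 2
        read_vals = []
        for host_cycle in range(3):
            read_vals.append(self.mem[read_idxs[host_cycle]])
            if host_cycle < len(writes):
                idx, val = writes[host_cycle]
                self.mem[idx] = val
        return read_vals
```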

16 Start/Done Timing Interface
The wrapper generated by RDL asserts "Start" on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
The unit asserts "Done" when it finishes the target cycle and its outputs are ready
The unit can take a variable amount of time
An unvirtualized RTL unit can connect "Done" to "Start" (but must not clock until "Start")
[diagram: a wrapper around a unit, with inputs In1 and In2, output Out, and Start/Done handshake signals]
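The toy loop below (hypothetical Python, not the generated wrapper itself) illustrates the protocol: Start is asserted only when all inputs for the next target cycle are present, and the unit may take a variable number of host cycles before asserting Done:

```python
class ToyUnit:
    """A unit that needs a variable number of host cycles per target cycle."""
    def __init__(self, host_cycles_needed=3):
        self.host_cycles_needed = host_cycles_needed

    def target_cycle(self, inputs):
        """Generator: one yield per host cycle; yields None while busy
        (Done low), then the outputs (Done high)."""
        for _ in range(self.host_cycles_needed - 1):
            yield None
        yield sum(inputs)

def wrapper_run_target_cycle(unit, inputs):
    """Wrapper logic: assert Start only once every input for the next
    target cycle has arrived, then step host cycles until Done."""
    if any(v is None for v in inputs):
        return None                              # inputs not ready: no Start
    for outputs in unit.target_cycle(inputs):    # Start asserted here
        if outputs is not None:
            return outputs                       # Done asserted, outputs valid
```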

17 Distributed Timing Models

18 Distributed Timing Example
A pipeline target channel of latency L is implemented as a distributed FIFO with at least L buffers
[diagram: target view of Units A and B joined by a latency-L channel; host view of the same units with Start/Done handshakes, RDY signals, and ENQ/DEQ operations on the distributed FIFO]
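A minimal sketch (hypothetical Python) of the key idea: pre-filling the distributed FIFO with L empty tokens lets Unit B's target clock run up to L target cycles ahead of Unit A's without ever observing data earlier than the channel's latency allows:

```python
from collections import deque

L = 4                        # channel's forward latency in target cycles
channel = deque([None] * L)  # pre-filled with L "empty" tokens

def unit_a_target_cycle(data):
    channel.append(data)     # ENQ: exactly one token per sender target cycle

def unit_b_target_cycle():
    if not channel:          # RDY deasserted: B must stall this target cycle
        return "stall"
    return channel.popleft() # DEQ: the token for this target cycle
```

Each token corresponds to one of Unit A's target cycles, so B consuming tokens in order reproduces the target-time behavior even though A and B advance at unrelated host rates.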

19 Other Automatically Generated Networks
Control network has a workstation as master and every unit as a slave device
- Memory-mapped interface with block transfers
- Used for initialization, statistics gathering, debugging, and monitoring
Units can connect to DRAM resources outside of the timed target channels
- Used to support emulation and virtualization state
Units can communicate with each other outside of the timed target channels
- Supports arbitrary communication, e.g., for distributed statistics gathering

20 Wide Variety of RAMP Simulators

21 Simulator Design Choices
- Structural analog versus highly virtualized
- Functional-only versus functional + timing
- Timing via (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators

22 Host Multithreading (Zhangxi Tan (UCB); Chung (CMU))
A multithreading emulation engine reduces FPGA resource use and improves emulator throughput
It hides emulation latencies (e.g., communicating across FPGAs)
A single hardware pipeline holds multiple copies of CPU state
[diagram: four target CPUs multiplexed onto one multithreaded host emulation engine on the FPGA, with per-thread PCs and register files (GPRs) feeding shared I$, IR, and D$ pipeline stages]
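To illustrate the idea (a hypothetical Python sketch, far simpler than the real engine): one shared pipeline loop interleaves many target CPUs round-robin, so per-CPU state is replicated while the expensive datapath is shared:

```python
class HostMultithreadedEngine:
    """Toy host-multithreading model: one shared pipeline, N copies of
    per-CPU architectural state, round-robin thread selection."""

    def __init__(self, num_cpus):
        # One state copy per target CPU (PC + general-purpose registers).
        self.state = [{"pc": 0, "regs": [0] * 32} for _ in range(num_cpus)]
        self.next_thread = 0

    def host_cycle(self, execute_one_instruction):
        """Each host cycle issues one instruction from the next target
        CPU in rotation; while one CPU waits on a long-latency event
        (e.g., a cross-FPGA message), the others keep the pipeline busy."""
        execute_one_instruction(self.state[self.next_thread])
        self.next_thread = (self.next_thread + 1) % len(self.state)
```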

23 Split Functional/Timing Models (HAsim: Emer (MIT/Intel); FAST: Chiou (UT Austin))
The functional model executes the CPU ISA correctly and carries no timing information
- Only need to develop the functional model once per ISA
The timing model captures pipeline timing details and does not need to execute code
- Much easier to change the timing model for architectural experimentation
- Without an RTL design, one cannot be 100% certain that the timing is accurate
Many possible splits between the timing and functional models
[diagram: functional model coupled to timing model]
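A schematic sketch of the split (hypothetical Python; the latencies are made up): the functional model computes what each instruction does, while the timing model only accounts for when it completes:

```python
# Functional model: architecturally correct execution, no timing.
def functional_execute(state, instr):
    """Stub of a functional model; a real one implements the full ISA."""
    opclass, dest, src_a, src_b = instr
    if opclass == "alu":
        state["regs"][dest] = state["regs"][src_a] + state["regs"][src_b]
    state["pc"] += 4
    return opclass

# Timing model: tracks cycles, never computes results.
class FixedLatencyTiming:
    LATENCY = {"alu": 1, "load": 3, "branch": 2}  # illustrative numbers

    def __init__(self):
        self.cycle = 0

    def retire(self, opclass):
        """Advance target time by this instruction's latency."""
        self.cycle += self.LATENCY.get(opclass, 1)
        return self.cycle
```

Swapping FixedLatencyTiming for, say, an out-of-order pipeline model changes the reported timing without touching the functional model, which is the point of the split.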

24 Multithreaded Functional & Timing Models (RAMP Gold: Tan, Gibeling, Asanovic, UCB)
An MT-unit multiplexes multiple target units on a single host engine
An MT-channel multiplexes multiple target channels over a single host link
[diagram: a functional model pipeline with architectural state and a timing model pipeline with timing state, combined as an MT-unit and connected by MT-channels]

25 Schedule
9:00-9:45   Welcome/Overview
9:45-10:15  RAMP Blue Overview & Demo
10:15-10:45 Break
10:45-12:30 RAMP White Live Demo; BEE3 Rollout (MSR/BEEcube/Q&A)
12:30-13:30 Lunch
13:30-15:00 ATLAS Transactional Memory (RAMP Red)
15:00-15:15 Break
15:15-16:45 CMU Simics/RAMP Cache Study
16:45       Wrapup

26 RAMP Blue Release
2/25/2008: design available from the RAMP website, ramp.eecs.berkeley.edu

27 RAMP White (Hari Angepat, Derek Chiou, UT Austin)
A scalable coherent shared-memory multiprocessor
Supports standard shared-memory programming models
[diagram: Leon3 processors with master/slave/debug/interrupt interfaces on AHB buses, with DSU, Ethernet, DDR2, an interrupt controller, Leon3 and AHB shims, and intersection units with NIUs connected through a router]

28 [image-only slide]

29 CMU Simics/RAMP Simulator
[diagram: a 16-CPU shared-memory UltraSPARC III server (SunFire 3800) modeled on the BEE2 platform]

30 RAMP Home Page/Repository
ramp.eecs.berkeley.edu
Remotely accessible Subversion repository

31 Thank You! Questions?

