Download presentation
Presentation is loading. Please wait.
1
RAMP Gold RAMPants {rimas,waterman,yunsup}@cs Parallel Computing Laboratory University of California, Berkeley
2
A Survey of μArch Simulation Trends Typical ISCA 2008 papers simulated about about twice as many instructions as those in 1998. So what?
3
A Survey of μArch Simulation Trends Something seems broken here…
4
A Survey of μArch Simulation Trends Something is clearly broken here.
5
Something is Rotten in the State of California A median ISCA ‘08 paper’s simulations run for fewer than four OS scheduling quanta! – We run yesterday’s apps at yesteryear’s timescales – And attempt to model N communicating cores with O(1/N) instructions per core?! The problem is that simulators are too slow – Irony: since performance scales as sqrt(complexity), simulated instructions per wall-clock second falls as processors get faster
6
RAMP Gold: Our Solution RAMP Gold is an FPGA-based, 100 MIPS manycore simulator – Only 100x slower than real-time – Economical: RTL is BSD-licensed; commodity HW Cost Performance (MIPS) Simulations per day Software Simulator $2,0000.1 - 11 RAMP Gold$2,000 + $75050 - 100100
7
Our Target Machine SPARC V8 CORE SPARC V8 CORE I$ D$ DRAM Shared L2$ / Interconnect SPARC V8 CORE SPARC V8 CORE I$ D$ SPARC V8 CORE SPARC V8 CORE I$ D$ SPARC V8 CORE SPARC V8 CORE I$ D$ … 64 cores
8
RAMP Gold Architecture Mapping the target machine directly to an FPGA is inefficient Solution: split timing and functionality – The timing logic decides how many target cycles an instruction sequence should take – Simulating the functionality of an instruction might take multiple host cycles – Target time and host time are orthogonal
9
Function/Timing Split Advantages Flexibility – Can configure target at runtime – Synthesize design once, change target model parameters at will Efficient FPGA resource usage – Example 1: model a 2-cycle FPU in 10 host cycles – Example 2: model a 16MB L2$ using only 256KB host BRAM to store tags/metadata
10
Host Multithreading How are we going to model 64 cores? F F D D X X M M W W F F D D X X M M W W F F D D X X M M W W F F D D X X M M W W … 64 pipelines time F F D D X X M M W W F F D D X X M M W W F F D D X X M M W W Build 64 pipelinesTime-multiplex one pipeline Single hardware pipeline with multiple copies of CPU state No bypass path required Not multithreaded target! … F F D D X X M M W W F F D D X X M M W W
11
Cache Modeling The cache model maintains tag, state, protocol bits internally Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall … tag index offset tag, state = = = = = = hit : don’t stall miss : stall arbitrary cycles max associativity
12
Putting it all together instruction cache ifetch stage decode stage register access stage memory stage data cache exception stage memory controller cache model & performance counters Resource Utilization (XC5VLX110T) LUTs – 14%, BRAM – 23% We can fit 3 pipelines on one FPGA!
13
Infrastructure
14
Our accomplishments this semester Jan 2009Last Night Single Threaded64 Way Host Multithreaded 0.000032GB BRAM2GB DDR2 SDRAM Hello World works (sometimes) ParLab Damascene CBIR App, SPLASH2 + SPEC CPU2000 No Timing Model or Introspection Runtime Configurable Cache Model, Performance Counters No Floating Point Hardware FPU Multiply/Add + Software Emulation
15
HARDware ain’t no joke 500+ Man Hours DDR2 Memory Controller Debugging 100+ Man HoursDMA Engine/Pipeline Issues 150+ Man HoursCAD Tool Issues 00 Pipeline Corner Cases
16
Sample Use Case: L1 D$ Tradeoffs Assume we have a 64-core CMP with private 16KB direct-mapped L1 D$ In the next tech gen, we can fit either of these improved configurations in a clock cycle: – 32KB direct-mapped L1 – 16KB 4-way set-associative L1 Which should we choose?
17
Sample Use Case: L1 D$ Tradeoffs Evidently, the associative cache is superior It took longer to make these slides than to run these 10+ billion instruction simulations
18
Future Directions RAMP Gold closes two critical feedback loops – Expedient HW/SW co-tuning is within our grasp – Simulations can now be run on a thermal timescale, enabling the exploration of temperature-aware scheduling policies We intend to explore both avenues!
19
DEMO: Damascene Image Convert Colorspace Textons: K-means Textons: K-means Texture Gradient Combine Non-max suppression Non-max suppression Intervening Contour Generalized Eigensolver Oriented Energy Combination Combine, Normalize Contours Bg Cga Cgb
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.