RAMP Gold RAMPants Parallel Computing Laboratory University of California, Berkeley.

RAMP Gold RAMPants {rimas,waterman,yunsup}@cs Parallel Computing Laboratory University of California, Berkeley

A Survey of μArch Simulation Trends Typical ISCA 2008 papers simulated about twice as many instructions as those in 1998. So what?

A Survey of μArch Simulation Trends Something seems broken here…

A Survey of μArch Simulation Trends Something is clearly broken here.

Something is Rotten in the State of California
A median ISCA ’08 paper’s simulations run for fewer than four OS scheduling quanta!
– We run yesterday’s apps at yesteryear’s timescales
– And attempt to model N communicating cores with O(1/N) instructions per core?!
The problem is that simulators are too slow
– Irony: since performance scales as sqrt(complexity), the number of simulated instructions per wall-clock second falls as processors get faster

RAMP Gold: Our Solution
RAMP Gold is an FPGA-based, 100 MIPS manycore simulator
– Only 100x slower than real-time
– Economical: RTL is BSD-licensed; commodity HW

                    Cost            Performance (MIPS)   Simulations per day
Software Simulator  $2,000          0.1 - 1              1
RAMP Gold           $2,000 + $750   50 - 100             100
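As a rough check on what these simulation rates mean in practice, the sketch below converts a rate in MIPS into wall-clock time for a run of a given length. The function name is hypothetical, and the 10-billion-instruction run length is an assumption chosen to match the scale of the use case later in the talk; none of this is RAMP Gold code.

```python
def wall_clock_seconds(instructions: float, mips: float) -> float:
    """Seconds needed to simulate `instructions` at `mips` million instructions/sec."""
    if mips <= 0:
        raise ValueError("mips must be positive")
    return instructions / (mips * 1e6)


RUN_LENGTH = 10e9  # assumed run length: 10 billion target instructions

# Rates taken from the Performance (MIPS) column of the table above.
for label, mips in [
    ("software simulator, low end", 0.1),
    ("software simulator, high end", 1.0),
    ("RAMP Gold, low end", 50.0),
    ("RAMP Gold, high end", 100.0),
]:
    secs = wall_clock_seconds(RUN_LENGTH, mips)
    print(f"{label}: {secs:,.0f} s ({secs / 3600:.2f} h)")
```

The gap between hours and minutes per run is what turns a single overnight simulation into many design points per day.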

Our Target Machine
[Diagram: 64 SPARC V8 cores, each with its own I$ and D$, connected through a shared L2$ / interconnect to DRAM]

RAMP Gold Architecture
Mapping the target machine directly to an FPGA is inefficient
Solution: split timing and functionality
– The timing logic decides how many target cycles an instruction sequence should take
– Simulating the functionality of an instruction might take multiple host cycles
– Target time and host time are orthogonal
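The split between function and timing can be illustrated with a minimal software sketch. Everything here is an assumption made for illustration: the three-operation instruction set, the per-operation cycle costs, and the class and method names are hypothetical and do not come from the RAMP Gold RTL. The point is only that host cycles and target cycles are tallied by separate pieces of logic.

```python
from dataclasses import dataclass

# Hypothetical costs, chosen only for illustration.
HOST_CYCLES = {"add": 1, "fmul": 10, "load": 3}   # time the simulator itself spends
TARGET_CYCLES = {"add": 1, "fmul": 2, "load": 1}  # time the modeled machine spends


@dataclass
class SplitSimulator:
    host_cycles: int = 0
    target_cycles: int = 0

    def functional_step(self, op: str) -> None:
        """Carry out the instruction's behavior; this costs host time only."""
        self.host_cycles += HOST_CYCLES[op]

    def timing_step(self, op: str, stall: int = 0) -> None:
        """Decide how many target cycles the instruction takes."""
        self.target_cycles += TARGET_CYCLES[op] + stall

    def run(self, program):
        """program: iterable of (op, extra target stall cycles) pairs."""
        for op, stall in program:
            self.functional_step(op)
            self.timing_step(op, stall)
        return self.host_cycles, self.target_cycles


sim = SplitSimulator()
host, target = sim.run([("add", 0), ("fmul", 0), ("load", 20)])
print(f"host cycles = {host}, target cycles = {target}")
```

Changing an entry in `TARGET_CYCLES` changes the modeled machine without touching the functional path, which is the software analogue of reconfiguring the target without resynthesizing.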

Function/Timing Split Advantages
Flexibility
– Can configure target at runtime
– Synthesize design once, change target model parameters at will
Efficient FPGA resource usage
– Example 1: model a 2-cycle FPU in 10 host cycles
– Example 2: model a 16MB L2$ using only 256KB host BRAM to store tags/metadata
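The arithmetic behind Example 2 can be sketched as follows. The slide does not state the L2 line size, so the line sizes tried here are assumptions, and the function name is hypothetical. The sketch only computes how many bits of tag/state storage each modeled line would get from a fixed host BRAM budget.

```python
KB = 1024
MB = 1024 * KB


def metadata_bits_per_line(cache_bytes: int, line_bytes: int, host_bram_bytes: int) -> float:
    """Bits of host storage available per modeled cache line."""
    num_lines = cache_bytes // line_bytes
    return host_bram_bytes * 8 / num_lines


TARGET_L2 = 16 * MB   # modeled cache capacity, from the slide
HOST_BRAM = 256 * KB  # host storage for tags/metadata, from the slide

print(f"data-to-metadata ratio: {TARGET_L2 // HOST_BRAM}x")
for line_bytes in (64, 128, 256):  # assumed line sizes, not given in the talk
    bits = metadata_bits_per_line(TARGET_L2, line_bytes, HOST_BRAM)
    print(f"{line_bytes}-byte lines: {bits:.0f} bits of metadata per line")
```

Because the timing model never needs the cached data itself, only the metadata has to live on the FPGA.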

Host Multithreading
How are we going to model 64 cores? Build 64 pipelines, or time-multiplex one pipeline?
[Diagram: 64 separate F-D-X-M-W pipelines, versus one F-D-X-M-W pipeline shared by all target cores over time]
– Single hardware pipeline with multiple copies of CPU state
– No bypass path required
– Not multithreaded target!
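One way to see why time-multiplexing removes the need for a bypass path is to simulate the interleaving. The sketch below is an assumption-laden toy: it uses the five stage names drawn on the slide, strict round-robin issue, and one instruction fetched per host cycle, none of which is confirmed as the actual RAMP Gold pipeline organization. It reports the largest number of instructions any single thread ever has in flight at once.

```python
from collections import Counter, deque

STAGES = ["F", "D", "X", "M", "W"]  # stage names as drawn on the slide


def run_pipeline(num_threads: int, host_cycles: int):
    """Round-robin one pipeline across `num_threads` thread contexts.

    Returns (retired, max_in_flight): per-thread retired instruction counts,
    and the most instructions any one thread had in the pipeline at once.
    """
    # Each slot holds the thread id occupying that stage, or None if empty.
    pipeline = deque([None] * len(STAGES), maxlen=len(STAGES))
    retired = [0] * num_threads
    max_in_flight = 0
    for cycle in range(host_cycles):
        leaving = pipeline[-1]  # the instruction in the last stage retires now
        if leaving is not None:
            retired[leaving] += 1
        # Fetch from the next thread; maxlen drops the retiring slot.
        pipeline.appendleft(cycle % num_threads)
        in_flight = Counter(t for t in pipeline if t is not None)
        max_in_flight = max(max_in_flight, max(in_flight.values()))
    return retired, max_in_flight


retired, worst = run_pipeline(num_threads=64, host_cycles=640)
print(f"64 threads: at most {worst} instruction(s) in flight per thread")
_, worst_two = run_pipeline(num_threads=2, host_cycles=100)
print(f"2 threads: at most {worst_two} instruction(s) in flight per thread")
```

When the thread count is at least the pipeline depth, a thread's previous instruction has already left the pipeline before its next one is fetched, so there is no in-flight result to forward.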

Cache Modeling
The cache model maintains tag, state, protocol bits internally
Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
[Diagram: the address is split into tag, index, and offset; the index selects stored (tag, state) entries, which are compared against the address tag in parallel, up to the max associativity. Hit: don’t stall. Miss: stall arbitrary cycles.]
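A tag-only timing model of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the RAMP Gold cache model: LRU replacement, the 32-byte line size, the 20-cycle miss penalty, and all names are hypothetical, and coherence protocol state is omitted.

```python
class CacheTimingModel:
    """Tag-only set-associative cache model; stores no data, only tags."""

    def __init__(self, size_bytes: int, line_bytes: int, ways: int, miss_penalty: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.miss_penalty = miss_penalty
        # One list of tags per set, ordered least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> int:
        """Return the number of target cycles to stall for this access."""
        line = addr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)  # mark as most recently used
            self.hits += 1
            return 0          # hit: don't stall
        if len(tags) == self.ways:
            tags.pop(0)       # evict the least recently used tag
        tags.append(tag)
        self.misses += 1
        return self.miss_penalty


# Contrived ping-pong pattern (not a benchmark): two addresses 32KB apart.
pattern = [0, 32 * 1024] * 8
direct = CacheTimingModel(32 * 1024, 32, ways=1, miss_penalty=20)
assoc = CacheTimingModel(16 * 1024, 32, ways=4, miss_penalty=20)
direct_stall = sum(direct.access(a) for a in pattern)
assoc_stall = sum(assoc.access(a) for a in pattern)
print(f"direct-mapped: {direct_stall} stall cycles, 4-way: {assoc_stall} stall cycles")
```

The pattern is deliberately adversarial for the direct-mapped configuration; it shows how size and associativity become runtime parameters of the model, and says nothing about real workloads.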

Putting it all together
[Diagram: instruction cache, ifetch stage, decode stage, register access stage, memory stage, data cache, exception stage, memory controller, cache model & performance counters]
Resource Utilization (XC5VLX110T): LUTs – 14%, BRAM – 23%
We can fit 3 pipelines on one FPGA!

Infrastructure

Our accomplishments this semester

Jan 2009                           Last Night
Single Threaded                    64 Way Host Multithreaded
0.000032GB BRAM                    2GB DDR2 SDRAM
Hello World works (sometimes)      ParLab Damascene CBIR App, SPLASH2 + SPEC CPU2000
No Timing Model or Introspection   Runtime Configurable Cache Model, Performance Counters
No Floating Point                  Hardware FPU Multiply/Add + Software Emulation

HARDware ain’t no joke
500+ Man Hours   DDR2 Memory Controller Debugging
100+ Man Hours   DMA Engine/Pipeline Issues
150+ Man Hours   CAD Tool Issues
00               Pipeline Corner Cases

Sample Use Case: L1 D$ Tradeoffs
Assume we have a 64-core CMP with private 16KB direct-mapped L1 D$
In the next tech gen, we can fit either of these improved configurations in a clock cycle:
– 32KB direct-mapped L1
– 16KB 4-way set-associative L1
Which should we choose?

Sample Use Case: L1 D$ Tradeoffs
Evidently, the associative cache is superior
It took longer to make these slides than to run these 10+ billion instruction simulations

Future Directions
RAMP Gold closes two critical feedback loops
– Expedient HW/SW co-tuning is within our grasp
– Simulations can now be run on a thermal timescale, enabling the exploration of temperature-aware scheduling policies
We intend to explore both avenues!

DEMO: Damascene
[Pipeline diagram, stages as labeled: Image; Convert Colorspace; Textons: K-means; Texture Gradient; Combine; Non-max suppression; Intervening Contour; Generalized Eigensolver; Oriented Energy Combination; Combine, Normalize; Contours. Intermediate signals: Bg, Cga, Cgb]
