Published by Kaleb Noah. Modified over 9 years ago.
1
Combining Statistical and Symbolic Simulation
Mark Oskin, Fred Chong, and Matthew Farrens
Dept. of Computer Science, University of California at Davis
2
Overview
HLS is a hybrid performance simulator
–Statistical + Symbolic
Fast
Accurate
Flexible
3
Motivation
- I-cache hit rate
- I-cache miss penalty
- Branch mispredict penalty
- Basic block size
- Dispatch bandwidth
4
Motivation
Fast simulation
–seconds instead of hours or days
–ideally interactive
Abstract simulation
–simulate performance of unknown designs
–application characteristics, not applications
5
Outline
- Simulation technologies and HLS
- From applications to profiles
- Validation
- Examples
- Issues
- Conclusion
6
Design Flow with HLS
[Diagram labels: Cycle-by-Cycle Simulation; HLS; Profile; Design Issue; Possible solution; Estimate Performance]
7
Traditional Simulation Techniques
Cycle-by-cycle (SimpleScalar, SimOS, etc.)
+ accurate
– slow
Native emulation / basic block models (Atom, Pixie)
+ fast, complex applications
– useful to a point (no low-level modifications)
8
Statistical / Symbolic Execution
HLS
+ fast (near interactive)
+ accurate / – within regions
+ permits variation of low-level parameters
+ arbitrary design points / – use carefully
9
HLS: A Superscalar Statistical and Symbolic Simulator
[Diagram components: L2 Cache; L1 I-cache; L1 D-cache; Main Memory; Branch Predictor; Fetch Unit; Out-of-order Dispatch Unit; Out-of-order Completion Unit; Out-of-order Execution core. Legend: Statistical, Symbolic]
10
Workflow
[Diagram labels: Code; Binary; sim-stat; sim-outorder; app profile; Stat-binary; HLS; machine-profile; R10k; machine-configuration]
11
Machine Configurations
- Number of functional units (I, F, [L, S], B)
- Functional unit pipeline depths
- Fetch, dispatch, and completion bandwidths
- Memory access latencies
- Mis-speculation penalties
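The parameters above could be gathered into a single record, roughly as sketched below. This is only an illustration: `MachineConfig`, its field names, and every default value are assumptions made for the example and are not taken from HLS or from any real processor.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MachineConfig:
    """Hypothetical container for the machine parameters listed on the slide."""
    # Functional-unit counts: I=integer, F=floating point,
    # LS=combined load/store, B=branch. All values are placeholders.
    functional_units: Dict[str, int] = field(
        default_factory=lambda: {"I": 2, "F": 2, "LS": 1, "B": 1})
    # Pipeline depth (in cycles) of each functional-unit class.
    pipeline_depths: Dict[str, int] = field(
        default_factory=lambda: {"I": 1, "F": 3, "LS": 2, "B": 1})
    fetch_width: int = 4
    dispatch_width: int = 4
    completion_width: int = 4
    l1_latency: int = 1
    l2_latency: int = 6
    memory_latency: int = 30
    branch_mispredict_penalty: int = 3

    def problems(self):
        """Return a list of human-readable problems; empty means consistent."""
        found = []
        if set(self.functional_units) != set(self.pipeline_depths):
            found.append("every functional-unit class needs a pipeline depth")
        for name in ("fetch_width", "dispatch_width", "completion_width",
                     "l1_latency", "l2_latency", "memory_latency"):
            if getattr(self, name) <= 0:
                found.append(f"{name} must be positive")
        if not (self.l1_latency <= self.l2_latency <= self.memory_latency):
            found.append("latencies should grow with distance from the core")
        return found
```

Keeping the configuration in one plain record makes it cheap to vary a single low-level parameter between runs, for example `MachineConfig(fetch_width=8)`.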
12
Profiles
Machine profile:
–cache hit rates => ( )
–branch prediction accuracy => ( )
Application profile:
–basic block size => ( , )
–instruction mix (% of I, F, L, S, B)
–dynamic instruction distance (histogram)
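One possible in-memory shape for the two profiles is sketched below. The parameter symbols on the slide did not survive extraction, so the sketch assumes that the two basic-block-size parameters are a mean and a standard deviation; that reading, together with all class names, field names, and numbers, is an assumption for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


def normalize_histogram(counts: Dict[int, int]) -> Dict[int, float]:
    """Turn raw distance counts into probabilities that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram needs at least one observation")
    return {distance: n / total for distance, n in counts.items()}


@dataclass
class MachineProfile:
    cache_hit_rates: Dict[str, float]       # e.g. {"l1i": 0.98, ...}
    branch_prediction_accuracy: float


@dataclass
class ApplicationProfile:
    block_size_mean: float                  # assumed meaning of the first parameter
    block_size_std: float                   # assumed meaning of the second parameter
    instruction_mix: Dict[str, float]       # fractions of I, F, L, S, B
    distance_histogram: Dict[int, float]    # dynamic instruction distance

    def is_consistent(self, tol: float = 1e-9) -> bool:
        """Check that the mix covers all five classes and both distributions sum to 1."""
        mix_ok = (set(self.instruction_mix) == {"I", "F", "L", "S", "B"}
                  and abs(sum(self.instruction_mix.values()) - 1.0) < tol)
        hist_ok = abs(sum(self.distance_histogram.values()) - 1.0) < tol
        return mix_ok and hist_ok and self.block_size_mean >= 1


# Placeholder numbers, not measurements from any benchmark.
example_app = ApplicationProfile(
    block_size_mean=6.0,
    block_size_std=2.0,
    instruction_mix={"I": 0.40, "F": 0.10, "L": 0.25, "S": 0.10, "B": 0.15},
    distance_histogram=normalize_histogram({1: 3, 2: 1}),
)
```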
13
Statistical Binary
100 basic blocks
Correlated:
–random instruction mix
–random assignment of dynamic instruction distance
–random distribution of cache and branch behaviors
14
Statistical Binary
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
branch (l1 i-cache, l2 i-cache, branch-predictor accr., dep 0, dep 1)
store (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dep 0, dep 1)
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
Annotations on each instruction:
- core functional unit requirements
- cache behavior during I-fetch
- cache behavior during data access
- dynamic instruction distance
- branch predictor behavior
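A statistical binary of this shape could be generated roughly as follows. This is a minimal sketch, not the HLS generator: the constants are placeholders, basic block sizes are assumed to follow a normal distribution, and every block is assumed to end in a branch. `SymbolicInstruction`, `make_instruction`, and `generate_statistical_binary` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Placeholder statistics; none of these numbers come from the slides.
MIX = {"I": 0.45, "F": 0.05, "L": 0.30, "S": 0.20}   # non-branch instruction mix
HIT = {"l1i": 0.98, "l2i": 0.95, "l1d": 0.95, "l2d": 0.90}
BRANCH_ACCURACY = 0.90
DISTANCES = {1: 0.5, 2: 0.3, 4: 0.2}   # dynamic instruction distance histogram


@dataclass
class SymbolicInstruction:
    kind: str                            # "I", "F", "L", "S" or "B"
    l1_icache_hit: bool                  # cache behavior during I-fetch
    l2_icache_hit: bool
    l1_dcache_hit: Optional[bool]        # data access; None for non-memory ops
    l2_dcache_hit: Optional[bool]
    predicted_correctly: Optional[bool]  # None for non-branches
    dependences: Tuple[int, ...]         # distances back to producer instructions


def make_instruction(rng: random.Random, kind: str) -> SymbolicInstruction:
    """Draw one instruction's statistical annotations."""
    is_mem = kind in ("L", "S")
    n_deps = 1 if kind == "L" else 2     # loads carry one dependence on the slide
    deps = tuple(rng.choices(list(DISTANCES),
                             weights=list(DISTANCES.values()), k=n_deps))
    return SymbolicInstruction(
        kind=kind,
        l1_icache_hit=rng.random() < HIT["l1i"],
        l2_icache_hit=rng.random() < HIT["l2i"],
        l1_dcache_hit=(rng.random() < HIT["l1d"]) if is_mem else None,
        l2_dcache_hit=(rng.random() < HIT["l2d"]) if is_mem else None,
        predicted_correctly=(rng.random() < BRANCH_ACCURACY) if kind == "B" else None,
        dependences=deps,
    )


def generate_statistical_binary(n_blocks: int = 100, mean_size: float = 6.0,
                                std_size: float = 2.0,
                                seed: int = 0) -> List[List[SymbolicInstruction]]:
    """Build n_blocks basic blocks, each ending in a branch."""
    rng = random.Random(seed)
    binary = []
    for _ in range(n_blocks):
        size = max(1, round(rng.gauss(mean_size, std_size)))
        body = rng.choices(list(MIX), weights=list(MIX.values()), k=size - 1)
        binary.append([make_instruction(rng, kind) for kind in body + ["B"]])
    return binary
```

Seeding the generator makes a given statistical binary reproducible, so the same synthetic program can be replayed against several machine configurations.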
15
HLS Instruction Fetch Stage
Example instruction stream: integer (...), branch (...), store (...), load (...), integer (...), branch (...), load (...), integer (...)
Similar to conventional instruction fetch:
- has a PC
- has a fetch window
- interacts with caches
- utilizes branch predictor
- passes instructions to dispatch
Differences:
- caches and branch predictor are statistical models
Fetches symbolic instructions and interacts with a statistical memory system and branch predictor model.
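The fetch behavior described above can be sketched as a small loop in which a PC walks a symbolic instruction stream and the cache and branch predictor are replaced by random draws against profile rates. This is an illustrative model under stated assumptions, not the HLS implementation: `simulate_fetch`, its parameters, the default latencies, and the rule that a fetch window ends at a branch are all invented for the example.

```python
import random


def simulate_fetch(kinds, fetch_width, l1i_hit, l2i_hit, bp_accuracy,
                   l2_latency=6, memory_latency=30, mispredict_penalty=3,
                   seed=0):
    """Return the number of cycles needed to fetch every instruction in `kinds`.

    kinds is a list of instruction-class strings such as "I", "L", "S", "B".
    Hit rates and predictor accuracy are probabilities in [0, 1].
    """
    rng = random.Random(seed)
    pc = 0
    cycles = 0
    while pc < len(kinds):
        cycles += 1                          # one fetch attempt per cycle
        for _ in range(fetch_width):         # fill the fetch window
            if pc >= len(kinds):
                break
            kind = kinds[pc]
            pc += 1
            if rng.random() >= l1i_hit:      # statistical L1 I-cache miss
                # The miss is served by L2 or, failing that, by main memory.
                cycles += l2_latency if rng.random() < l2i_hit else memory_latency
            if kind == "B":
                if rng.random() >= bp_accuracy:   # statistical misprediction
                    cycles += mispredict_penalty
                break                        # assumed: the window ends at a branch
    return cycles
```

Because no real cache or predictor state is kept, changing a hit rate or the predictor accuracy is just a change of argument, which is what allows low-level parameters to be varied freely.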
16
Validation - SimpleScalar vs. HLS
17
Validation - R10k vs. HLS
18
HLS Multi-Value Validation with SimpleScalar (Perl)
[Figure panels: HLS; SimpleScalar]
19
HLS Multi-Value Validation with SimpleScalar (Xlisp)
[Figure panels: HLS; SimpleScalar]
20
Example use of HLS (Perl)
An intuitive result: branch prediction accuracy becomes less important (crosses fewer iso-IPC contour lines) as basic block size increases.
21
Example use of HLS (Perl)
Another intuitive result: gains in IPC due to basic block size are front-loaded.
Trade-off between front-end (fetch/dispatch) and back-end (ILP) processor performance.
22
Example use of HLS This space intentionally left blank. (Perl)
23
Related work
- R. Carl and J.E. Smith. Modeling superscalar processors via statistical simulation. PAID Workshop, June 1998.
- N. Jouppi. The non-uniform distribution of instruction-level and machine parallelism and its effect on performance. IEEE Trans., 1989.
- D. Noonburg and John Shen. Theoretical modeling of superscalar processor performance. MICRO27, November 1994.
24
Questions & Future Directions
How important are different well-performing benchmarks anyway?
–easily summarized
–summaries are not precise => yet precise enough
–Will the statistical+symbolic technique work for poorly behaved applications?
Will it extend to deeper pipelines and more real processors (e.g., Alpha, P6 architecture)?
25
Conclusion
HLS: Statistical + Symbolic Execution
–Intuitive design space exploration
Fast
Accurate
–Flexible
Validated against cycle-by-cycle and R10k
Future work: deeper pipelines, more hardware validations, additional domains
Source code at: http://arch.cs.ucdavis.edu/~oskin