1
Using Sampled and Incomplete Profiles
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu
2
Overview
– Trace-based simulation is expensive (caches are getting larger, CPUs and networks are getting faster)
– Approximate results are often sufficient
– How can we utilize a reduced or sampled profile and still obtain accurate modeling/simulation results?
3
How does sampling affect our metrics for evaluating trace collection methodologies?
– Speed – sampled profiles reduce speed requirements
– Memory – sampled profiles may take less space
– Accuracy – sampled profiles are less accurate
– Intrusiveness – sampling is less intrusive
– Completeness – no change
– Granularity – may affect our ability to capture fast events (Nyquist)
– Flexibility – no change
– Portability – clock speed may affect sampling accuracy
– Capacity – sampling should reduce this
– Cost – potentially less cost and less time
4
Application Areas
– Memory system performance – cache simulation, working set models, temporal and spatial locality
– CPU pipeline modeling – instruction frequencies, instruction sequences (2-at-a-time, 3-at-a-time, pipeline snapshots)
– Network simulation – input traffic distribution, queue lengths, throughput, burstiness
5
Memory Systems
– Temporal locality – recently referenced addresses are likely to be referenced again soon
– Spatial locality – addresses near recently referenced ones are likely to be referenced next
– Working set models
  – Belady (1966) – virtual memory page replacement algorithms (optimal replacement defined)
  – Denning (1980) – the pattern of page access over the execution of the program
  – Thiebaut and Stone (1986) – the number of misses incurred due to task switches can be modeled as a binomial distribution
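The working set measure can be made concrete with a short sketch (ours, not from the deck): W(t, τ) is simply the set of distinct pages touched in the most recent τ references. The sketch below assumes the trace is just an in-memory list of page numbers.

```python
from collections import Counter

def working_set_sizes(page_trace, tau):
    """Size of the working set W(t, tau): the number of distinct pages
    referenced in the window of the most recent tau references."""
    window = Counter()          # page -> reference count within the current window
    sizes = []
    for t, page in enumerate(page_trace):
        window[page] += 1
        if t >= tau:            # slide the window: the reference at t - tau falls out
            old = page_trace[t - tau]
            window[old] -= 1
            if window[old] == 0:
                del window[old]
        sizes.append(len(window))
    return sizes

# Toy page reference string (pages identified by small integers).
print(working_set_sizes([1, 2, 1, 3, 2, 4, 1, 1, 2, 5], tau=4))
```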
6
Memory Systems
– Cold start – how do we model behavior when a program starts execution for the first time?
– Warm start – how do we model behavior when a program resumes execution?
– How does memory organization (e.g., set associativity) and typical program behavior affect our ability to utilize sampled profiles effectively?
– Do we sample in time, or can we also sample in space (e.g., a set)?
7
Memory Systems
If we only capture a subset of the important addresses, can we reproduce the full trace?
– Abstract Execution (Larus 1990) – basic block traces
– Trace reduction (Smith 1977, Puzak 1985) – generate a reduced trace that contains the exact same number of misses and writebacks as the original trace (a type of filtering is performed, similar to Agarwal's Block Filter)
– Trace compaction (Samples 1989) – perform a diff on sequential addresses and only capture the important diffs
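As a rough illustration of the diff idea behind trace compaction, here is a delta-encoding sketch (an illustration only, not Samples' actual tool or format): sequential access patterns collapse into runs of small, highly compressible deltas.

```python
def delta_encode(addresses):
    """Replace each address (after the first) with its difference from the
    previous address; sequential accesses become runs of small deltas."""
    deltas = [addresses[0]]
    for prev, curr in zip(addresses, addresses[1:]):
        deltas.append(curr - prev)
    return deltas

def delta_decode(deltas):
    """Invert delta_encode to recover the original address trace."""
    addresses = [deltas[0]]
    for d in deltas[1:]:
        addresses.append(addresses[-1] + d)
    return addresses

trace = [0x1000, 0x1004, 0x1008, 0x2000, 0x2004]
enc = delta_encode(trace)
assert delta_decode(enc) == trace
print(enc)   # [4096, 4, 4, 4088, 4]
```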
8
Can we do something simple? (Laha 1988, Fu 1994)
[Diagram: the reference stream is divided into alternating regions – ignore, sample n, sample n+1, ignore, … – with the sampling interval and the sample size marked]
sampling ratio = sample size / sampling interval
Two types of errors:
1. Sampling errors – is the ratio optimal?
2. Accurately predicting the effects of the ignored portions
But this only suggests when to sample, not what to sample…
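A minimal sketch of the periodic scheme in the diagram, assuming the trace is an in-memory list of references (the function and parameter names are ours):

```python
def periodic_samples(trace, sample_size, sampling_interval):
    """Keep the first `sample_size` references out of every `sampling_interval`
    references and ignore the rest, so the sampling ratio is
    sample_size / sampling_interval."""
    return [trace[start:start + sample_size]
            for start in range(0, len(trace), sampling_interval)]

samples = periodic_samples(list(range(100_000)),
                           sample_size=1_000, sampling_interval=10_000)
print(len(samples), len(samples[0]))   # 10 samples of 1000 references each
```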
9
Sampling Rate: What is the Nyquist Frequency for a Program?
– We must sample a program at a rate of at least twice the frequency of the event of interest
– Half the sampling frequency is termed the Nyquist frequency
– Sampling at a rate lower than twice the frequency of the event of interest can cause aliasing and distortion
– The biggest problem is that not all events of interest in a program exhibit a nice periodic pattern
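As a reminder of why the 2× rate matters: a purely periodic event sampled too slowly "folds" to a lower apparent frequency. The helper below computes that folded frequency (a textbook frequency-folding identity, nothing specific to program profiling):

```python
def apparent_frequency(f_event, f_sample):
    """Frequency at which a purely periodic event of frequency f_event appears
    after sampling at f_sample; it equals f_event only when
    f_sample > 2 * f_event (i.e., when f_event is below the Nyquist frequency)."""
    folded = f_event % f_sample
    return min(folded, f_sample - folded)

print(apparent_frequency(100.0, 250.0))   # 100.0 -> faithfully captured
print(apparent_frequency(100.0, 150.0))   # 50.0  -> aliased to a lower frequency
```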
10
Example: gprof( ) Produces 3 things – A listing of the total execution times and call counts for each of the functions in the program, sorted by decreasing time – The functions sorted according to the time they represent, including the time of their call graph descendents –Total execution in a cycle and the members in that cycle (a cycle is a back edge) gprof samples a program’s execution –Obtains exact call statistics –Does not obtain exact time measurements –Accuracy is obtained through statistical sampling –Sampling reduces the associated overhead
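gprof itself instruments calls and takes timer-driven program-counter samples; the toy sketch below only illustrates the statistical part, assuming we already have a per-tick log of which function was executing (the log and the function names are made up for illustration):

```python
import random
from collections import Counter

def sampled_time_fractions(tick_log, sampling_period):
    """Approximate each function's share of execution time by looking at the
    'currently executing function' only at every sampling_period-th tick,
    instead of timing every call exactly."""
    hits = Counter(tick_log[::sampling_period])
    total = sum(hits.values())
    return {fn: count / total for fn, count in hits.items()}

# Hypothetical per-tick log: compute dominates the run.
ticks = ["main"] * 100 + ["compute"] * 800 + ["io"] * 100
random.shuffle(ticks)
print(sampled_time_fractions(ticks, sampling_period=10))   # roughly 0.1 / 0.8 / 0.1
```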
11
When to sample: Cold start vs. Warm start
12
Sampling dimensions
We can sample in both time and space:
– Time (periodic sampling)
– Filtered sampling (e.g., using address ranges)
Time
– Periodic
– Random
– #misses, #instructions, #loads/stores
Space
– Address ranges – may not be representative of all ranges
– Cache sets – may limit the utility of the sample
– Statically tagged events – focuses on particular instructions and data of interest
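One concrete way to sample in space is set sampling: simulate only a subset of cache sets and treat their miss rate as an estimate for the whole cache. Below is a minimal LRU sketch under simple modulo indexing (the names and parameters are ours):

```python
def sampled_set_miss_rate(trace, n_sets, block_size, assoc, sampled_sets):
    """Simulate only the cache sets in `sampled_sets` (LRU replacement) and
    report the miss rate over the references that map to them."""
    lru = {s: [] for s in sampled_sets}     # per-set LRU stack of block numbers
    refs = misses = 0
    for addr in trace:
        block = addr // block_size
        s = block % n_sets
        if s not in lru:
            continue                        # reference maps outside the sampled sets
        refs += 1
        stack = lru[s]
        if block in stack:
            stack.remove(block)             # hit: move block to the MRU position
        else:
            misses += 1                     # miss: insert block, evict LRU if set is full
            if len(stack) == assoc:
                stack.pop()
        stack.insert(0, block)
    return misses / refs if refs else 0.0

# Toy stream: four sweeps over a 128-block footprint (8 KB with 64 B blocks).
trace = [i * 64 for i in range(128)] * 4
print(sampled_set_miss_rate(trace, n_sets=64, block_size=64, assoc=4,
                            sampled_sets=range(0, 64, 8)))   # 0.25: only the first sweep misses
```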
13
How do we account for unknown references?
– Assume that this behavior does not affect past/future behavior
– Assume that some percentage of past behavior is overwritten by the unknown reference behavior
  – Decay model
  – Footprints in the cache
  – MRU model
– Assume that all of the past behavior is overwritten by the unknown reference behavior
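A simple alternative to the models named above (our sketch, not one of those models) is to bound the answer: classify first-touch ("cold") references, whose residency before the sample is unknown, separately from definite misses, then report the miss rate twice, counting the cold references first as hits and then as misses.

```python
def bounded_miss_rate(sample, capacity_blocks, block_size):
    """Bound the miss rate of a sample whose initial cache contents are unknown,
    using a fully associative LRU cache for simplicity: 'sure' misses were
    brought in during the sample and later evicted; cold first touches are
    ambiguous, so count them as hits for the lower bound and as misses for
    the upper bound."""
    stack, seen = [], set()
    sure_misses = cold_refs = 0
    for addr in sample:
        block = addr // block_size
        if block in stack:
            stack.remove(block)             # definite hit within the sample
        elif block in seen:
            sure_misses += 1                # touched earlier in the sample, since evicted
        else:
            cold_refs += 1                  # first touch: prior residency is unknown
            seen.add(block)
        stack.insert(0, block)
        if len(stack) > capacity_blocks:
            stack.pop()
    n = len(sample)
    return sure_misses / n, (sure_misses + cold_refs) / n

low, high = bounded_miss_rate([i * 64 for i in range(100)] * 3,
                              capacity_blocks=128, block_size=64)
print(low, high)   # lower bound 0.0, upper bound ~0.33
```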
14
How do we model the effects of multiprogramming?
– Flush all tables (caches, branch predictors, TLBs, load buffers, etc.)
– Estimate interference using a model
  – Invalidate some percentage of all entries based on the relative time since last execution
  – Utilize working set models to estimate the effect of the interference
– Allow aliasing to occur where appropriate (branch predictors, but not caches)
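As one deliberately simple reading of the "invalidate some percentage of entries based on time since last execution" option, here is a sketch in which each resident entry survives a context switch with exponentially decaying probability (the decay_rate constant is a made-up knob, not from the slides):

```python
import math
import random

def invalidate_fraction(cache_tags, time_since_last_run, decay_rate):
    """Model multiprogramming interference by invalidating each cache entry
    with a probability that grows with the time the task was switched out
    (an exponential-decay style model)."""
    survive_prob = math.exp(-decay_rate * time_since_last_run)
    return [tag if random.random() < survive_prob else None
            for tag in cache_tags]

# Hypothetical: 1024 resident blocks, task switched out for 5 time units.
tags = list(range(1024))
after = invalidate_fraction(tags, time_since_last_run=5.0, decay_rate=0.2)
print(sum(t is None for t in after), "of", len(after), "entries invalidated")
```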
15
Sampled Instruction Execution Profiles
– Instruction frequencies
  – For SPEC92int programs on an IA32 CPU: 43% ALU, 22% loads, 12% stores, 23% control flow (H&P AQA)
– Instruction sequences
  – Top pairs
  – Top triples
  – Continue up to the average basic block size
– Branches in the pipeline
  – Sampled versus modeled
16
Analytical Models of Workload Behavior Squillante and Kaeli, 1997
– We can capture distributions of the distance between events
– We can then compute the probability of different events occurring in a pipeline of length n (think of n as a window of execution)
– We can also compute the conditional probabilities of multiple events occurring in the pipeline of length n
– We can then assign weights to each of these multiple events to compute the throughput (IPC) of a pipeline
– Our model uses a random marked point process (time between events) and produces very accurate estimates of pipeline throughput
17
Analytical Models of Workload Behavior Squillante and Kaeli, 1997
Some constraints are assumed for the sake of simplicity:
– Inter-branch times are independent
– Inter-branch times are identically distributed
– The delay due to taken branches is constant
18
Analytical Models of Workload Behavior Squillante and Kaeli, 1997
19
– An analytical formula is used to compute approximate CPI and speedup measures for an n-stage pipelined processor
– Traces are captured from benchmark execution
– The distribution of the number of instructions between successive taken branches is computed using a “window of execution” filter
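The following is not the paper's marked-point-process derivation, only a first-order sketch under the same simplifying assumptions (IID inter-branch distances, constant taken-branch delay c): each taken branch adds c stall cycles, so CPI ≈ 1 + c / E[instructions per taken branch]. The trace format (a per-instruction taken-branch flag) and the window cap are assumptions of ours.

```python
from collections import Counter

def interbranch_distribution(taken_flags, window):
    """Distribution of the number of instructions between successive taken
    branches; distances longer than `window` are lumped into the last bucket
    (a crude 'window of execution' cap)."""
    gaps = Counter()
    since_last = 0
    for is_taken_branch in taken_flags:
        since_last += 1
        if is_taken_branch:
            gaps[min(since_last, window)] += 1
            since_last = 0
    total = sum(gaps.values())
    return {gap: count / total for gap, count in gaps.items()}

def approx_cpi(gap_distribution, taken_branch_delay):
    """First-order CPI estimate: one cycle per instruction plus a constant
    delay per taken branch, amortized over the mean inter-branch distance."""
    mean_gap = sum(gap * p for gap, p in gap_distribution.items())
    return 1.0 + taken_branch_delay / mean_gap

# Toy trace: a taken branch retires roughly every 6th instruction.
flags = [(i % 6) == 5 for i in range(6000)]
dist = interbranch_distribution(flags, window=8)
print(approx_cpi(dist, taken_branch_delay=3))   # 1.5
```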
20
Analytical Models of Workload Behavior Squillante and Kaeli, 1997
21
– For a pipeline length of 8, the obtained results are quite precise for three out of the five benchmarks
– For bubblesort and prime, the IID assumption is violated, introducing some inaccuracies in our model
– Future work looks at handling these inaccuracies by incorporating multi-level conditional probabilities into our model
22
Capturing n-length Instruction Sequences
– Sampling over n sequential instructions
– Capturing the most frequently executed sequences
– Utilizing these sequences to drive pipeline design
– Capturing longer profiles may allow us to design hardware trace caches
– Gonzalez, Tubella and Molina describe a mechanism for profiling both instructions and operand values
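A small sketch of harvesting the most frequent length-n sequences from a sampled instruction stream (the opcode mnemonics below are made up; n = 2 gives pairs, n = 3 triples):

```python
from collections import Counter

def top_sequences(opcodes, n, k):
    """Most frequent length-n opcode sequences (n-grams) in a sampled
    instruction stream."""
    grams = Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return grams.most_common(k)

# Hypothetical sampled opcode stream.
stream = ["load", "add", "store", "load", "add", "store", "branch",
          "load", "add", "add", "store", "branch"]
print(top_sequences(stream, n=2, k=3))   # e.g. [(('load', 'add'), 3), ...]
```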