Slide 1 (12-12-2001, CS752): Decoupled Architecture for Data Prefetching
Jichuan Chang, Kai Xu

Slide 2: Outline
Motivation
Design and Evaluation
Results
Conclusions

Slide 3: Motivation
Processor-memory performance gap
  – Prefetching helps, but it has overhead.
Transistors are cheap; will a coprocessor help?
[Diagram: Main Processor, Prefetching CoProcessor, and Cache; labeled connections: Info Flow, L1-L2 Internal Bus, Prefetch Requests, Data]

Slide 4: Why a dedicated coprocessor?
Simple
  – It simplifies the design of the main processor.
Powerful
  – It can (hopefully) exploit complex algorithms;
  – It handles computation overhead (i.e. pattern computation, address computation).
Flexible
  – It can (hopefully) adapt to different situations;
  – It can implement different algorithms.
But are these true?

Slide 5: The Generic Design
[Diagram: Main Processor, Prefetching CoProcessor, and Cache connected by a Bus; labeled connections: Info Flow, Prefetch Requests, Data. Coprocessor components: Stream Buffer, Tables (RPT, PPW, CT, History, …), ALU. Design questions: What? Where? When?]

Slide 6: Data Prefetching Techniques
Regular Access Prefetching
  – Tagged Next Block Lookahead [Smith 82]: exploits sequential access patterns;
  – Stride Prefetching [Baer & Chen 91]: exploits stride access patterns;
Dependency-based Prefetching [Roth, et al 98]
  – Discovers Linked-Data-Structure access patterns
Dead Block Correlation [Lai, et al 01]
  – History-based correlation prediction
Stream Buffer [Jouppi 90]
  – Reduces cache pollution
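
To illustrate how a stream buffer can reduce cache pollution, here is a minimal Python sketch: prefetched blocks wait in a small side buffer and only move into L1 when a demand miss actually asks for them. The class name `StreamBuffer`, the FIFO eviction policy, and the default capacity are assumptions made for this example; the slides do not specify them at this level of detail.

```python
from collections import OrderedDict


class StreamBuffer:
    """Small fully associative side buffer for prefetched blocks (hypothetical sketch)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block number -> None, oldest entry first

    def insert(self, block):
        """Place a prefetched block in the buffer, evicting the oldest if full."""
        if block in self.blocks:
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # FIFO eviction (an assumption)
        self.blocks[block] = None

    def lookup(self, block):
        """On an L1 miss: if the block is here, remove it (it moves to L1) and report a hit."""
        if block in self.blocks:
            del self.blocks[block]
            return True
        return False
```

Because a wrong prefetch only ever displaces another prefetched block, useless prefetches in this sketch never evict demand-fetched data from L1.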

Slide 7: Simulation Settings
SimpleScalar v3.0
  – Modified sim-outorder to implement information sharing between the main processor (MP) and the prefetching coprocessor (PCP);
  – Modified the cache module to implement prefetching schemes (between the L1 and L2 caches), a prefetch queue (length = 16), bus sharing/contention, and a stream buffer.
Memory Parameters
  – L1 Data Cache: 4KB, 32B line, 4-way associative;
  – L2 Cache: 64KB, 64B line, 4-way associative;
  – Stream buffer: 8 entries, fully associative, 1-cycle hit;
  – Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2*);
  – Pipelined bus: bus contention and latency are modeled.
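
As a worked check of the memory parameters above, the following Python sketch derives the number of sets and the address bit fields for the two caches. The helper names `cache_geometry` and `split_address` are hypothetical, and the sketch assumes power-of-two sizes and a conventional tag/index/offset address split.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache."""
    num_lines = size_bytes // line_bytes
    num_sets = num_lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2, valid for powers of two
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split a byte address into (tag, set index, line offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


l1 = cache_geometry(4 * 1024, 32, 4)    # 4KB, 32B line, 4-way -> (32, 5, 5)
l2 = cache_geometry(64 * 1024, 64, 4)   # 64KB, 64B line, 4-way -> (256, 6, 8)
```

With only 32 sets in L1, a few conflicting streams can fill a set quickly, which is one reason prefetched data is kept in a separate stream buffer in this setup.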

Slide 8: Benchmarks
From SPEC95
  – gcc
  – compress
  – swim
  – tomcatv
Microbenchmarks
  – Matrix multiplication (128 x 128 double)
  – Binary tree (1M nodes, similar to treeadd)

Slide 9: Results (IPC)

Slide 10: Results (Miss Ratio)

Slide 11: Results (Prefetch Accuracy)

Slide 12: L1-L2 Traffic Increase

Slide 13: Results (Delay Tolerance)
How many cycles of delay can the PCP tolerate?
  – More delay means:
      Less useful prefetches (data can't get back before demand references)
      More pollution (due to outdated information)
      Fewer prefetches (due to bus contention)
  – To avoid pollution, implement the prefetch queue as a circular buffer and overwrite outdated entries when the queue is full. The major effect of a larger delay will then be fewer prefetches.
Hard to model memory behavior in SimpleScalar
  – Predetermined latency, no wake-up, no MSHR.
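
The circular-buffer prefetch queue described above can be sketched in Python as follows. This is only an illustration under assumptions: the class name `PrefetchQueue` and the `dropped` counter are hypothetical, and the real queue lives inside the modified SimpleScalar cache module. The capacity of 16 matches the queue length given in the simulation settings.

```python
from collections import deque


class PrefetchQueue:
    """Bounded prefetch request queue that overwrites its oldest entry when full."""

    def __init__(self, capacity=16):
        # deque(maxlen=...) silently discards the oldest item on append when full.
        self._q = deque(maxlen=capacity)
        self.dropped = 0  # number of outdated requests overwritten

    def push(self, block_addr):
        """Enqueue a prefetch request, overwriting the oldest one if the queue is full."""
        if len(self._q) == self._q.maxlen:
            self.dropped += 1
        self._q.append(block_addr)

    def pop(self):
        """Return the oldest pending request, or None if the queue is empty."""
        return self._q.popleft() if self._q else None

    def __len__(self):
        return len(self._q)
```

When the bus is busy and requests pile up, the stale requests are the ones that get overwritten, so the visible cost of delay is fewer issued prefetches rather than more pollution.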

Slide 14: Delay tolerance
Preliminary result
  – For almost all schemes on all benchmarks:
  – The PCP can tolerate 8 cycles of delay

Slide 15: Can we integrate different schemes?
Different applications need different schemes
Brute force approach
  – Use both tagged and stride prefetching
  – Good speedup, but much more memory traffic.
Adapt prefetching policy dynamically?
  1. Share the same hardware table
     + Uses similar matching schemes
     – Hard to reconfigure/flush on context switches
  2. Use separate tables
     – More hardware
     – Similar to a tournament predictor (just a thought)
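
The tournament-predictor idea is only floated as a thought on this slide, so the following Python sketch is purely speculative. It borrows the 2-bit saturating chooser from tournament branch predictors to pick between the tagged and stride schemes; the class name `TournamentSelector`, the counter width, and the notion of a per-access "useful" signal are all assumptions for this example, not something the authors evaluated.

```python
class TournamentSelector:
    """2-bit saturating chooser between two prefetch schemes (hypothetical sketch)."""

    def __init__(self):
        self.counter = 2  # range 0..3; values >= 2 favour stride, < 2 favour tagged

    def choose(self):
        """Return the scheme whose prefetches should currently be issued."""
        return "stride" if self.counter >= 2 else "tagged"

    def update(self, tagged_useful, stride_useful):
        """Nudge the counter toward whichever scheme alone would have been useful."""
        if stride_useful and not tagged_useful:
            self.counter = min(3, self.counter + 1)
        elif tagged_useful and not stride_useful:
            self.counter = max(0, self.counter - 1)
        # If both or neither were useful, the counter is left unchanged.
```

Issuing prefetches from only the chosen scheme would avoid the extra memory traffic of the brute force approach, at the cost of keeping both tables trained.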

Slide 16: Conclusions
PCP helps performance! (2-30% speedup)
  – PCP handles prefetching and can tolerate some delay.
Different schemes work for different applications
  – They require different information (from different places);
  – PCP should be placed close to the info source;
  – Not easy to integrate different schemes.
Limitations of our approach
  – PCP is not fully utilized.
  – Relies on tables (caches/queues/buffers): DBCP requires a large history table (7.6 M memory)!
  – Delay is critical to performance: it limits the complexity of prefetch schemes and also determines where to place the PCP.

Slide 17: Future Work
Evaluate more prefetching schemes
  – Dependency-based prefetching, etc.
PCP Running Ahead
  – Probably with the help of a trace cache;
  – To fully utilize the PCP;
  – Needs checkpoint/rollback mechanisms.
CoProcessor to Support Other Functionalities
  – Branch prediction, power management.
PCP for Multiprocessors
  – Suitable for One-Block-Lookahead.
  – Needs changes to the CC protocol.

Slide 18: Thank You! Questions?

Slide 19 (footer: 04-04-2001, Gauges): Backup Slides

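Slide 20: Tagged Prefetching
[Diagram: per-block tag bits at three successive steps; the first block is demand-fetched, later blocks are prefetched]
  Step 1: 0 (demand-fetched), 1 (prefetched), 0, 0, 0
  Step 2: 0 (demand-fetched), 0 (prefetched), 1, 0, 0
  Step 3: 0 (demand-fetched), 0 (prefetched), 0, 1, 0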
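
The tag-bit behaviour in the diagram can be sketched in Python as follows. This is a simplified model under stated assumptions: the cache is unbounded, prefetches complete instantly, and the class name `TaggedPrefetcher` is hypothetical. A tag of 1 marks a block that was prefetched but not yet referenced; the next sequential block is prefetched on a demand miss or on the first reference to a prefetched block.

```python
class TaggedPrefetcher:
    """Tagged next-block prefetching over an idealized, unbounded cache (sketch)."""

    def __init__(self):
        self.tag = {}          # block number -> tag bit (1 = prefetched, not yet referenced)
        self.prefetches = []   # blocks prefetched so far, in issue order

    def _prefetch(self, block):
        """Bring a block in as a prefetch unless it is already cached."""
        if block not in self.tag:
            self.tag[block] = 1
            self.prefetches.append(block)

    def access(self, block):
        """Reference a block; return True on a hit, False on a demand miss."""
        if block not in self.tag:
            self.tag[block] = 0        # demand-fetched blocks get tag 0
            self._prefetch(block + 1)
            return False
        if self.tag[block] == 1:
            self.tag[block] = 0        # first reference to a prefetched block
            self._prefetch(block + 1)
        return True
```

On a purely sequential walk over blocks 0, 1, 2, 3, only the first access misses in this model, and the tag bit of 1 moves one block ahead at each step, as in the three steps of the diagram.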

Slide 21: Stride Prefetching
Reference Prediction Table (RPT)
  – Organized like a cache, indexed by PC
  – Each entry holds (data address, stride, state)
State Machine
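
Since the state machine itself is not preserved in the transcript, the Python sketch below follows the commonly described four-state scheme (initial, transient, steady, no-prediction) and should be read as an assumption rather than the exact machine on the slide. The class name `RPT`, the direct-mapped indexing, the table size, and the choice to prefetch only in the steady state are all simplifications made for this example.

```python
INITIAL, TRANSIENT, STEADY, NO_PRED = "initial", "transient", "steady", "no-pred"


class RPT:
    """PC-indexed stride table with a four-state confidence machine (sketch)."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.table = {}  # index -> [pc, prev_addr, stride, state]

    def access(self, pc, addr):
        """Update the entry for the memory instruction at pc; return a prefetch address or None."""
        idx = (pc >> 2) % self.num_entries  # direct-mapped index (an assumption)
        entry = self.table.get(idx)
        if entry is None or entry[0] != pc:
            self.table[idx] = [pc, addr, 0, INITIAL]  # allocate a fresh entry
            return None
        _, prev, stride, state = entry
        if addr == prev + stride:
            # Correct prediction: gain confidence.
            state = TRANSIENT if state == NO_PRED else STEADY
        elif state == STEADY:
            # One miss in steady state: fall back but keep the stride.
            state = INITIAL
        else:
            # Incorrect prediction: learn the new stride and lose confidence.
            stride = addr - prev
            state = TRANSIENT if state == INITIAL else NO_PRED
        self.table[idx] = [pc, addr, stride, state]
        return addr + stride if state == STEADY else None
```

For a load that walks an array with a stride of 8 bytes, this sketch needs three accesses to reach the steady state and then returns the next address in the stream on every access.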

Slide 22: Dependency-based Prefetching
Potential Producer Window (PPW)
Correlation Table (CT)
One Step Ahead
Jump Pointer Generation/Maintenance
