Published by Georgina Whitehead. Modified over 9 years ago.
1
Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho (Samsung / U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)
32nd Annual International Symposium on Microarchitecture
2
MICRO-32, November 17, 1999. Cho, Yew, and Lee
Big Picture
3
On-Chip D-Cache Bandwidth Problem
4
Wide-Issue Superscalar Processors
• Current Generation
  – Alpha 21264
  – Intel’s Merced
• Future Generation (IEEE Computer, Sept. ’97)
  – Superspeculative Processors
  – Trace Processors
5
Multi-Ported Data Cache
• Replicated Cache
  – Alpha 21164
• Time-Division Multiplexed Cache
  – Alpha 21264
• Interleaved Cache
  – MIPS R10K
6
Window Logic Complexity
• Pointed out as the major hardware complexity (Palacharla et al., ISCA ’97)
• More severe for the memory window
  – Difficult to partition
  – Thick network needed to connect reservation stations (RSs) and load/store units (LSUs)
7
Data Decoupling
8
Data Decoupling: What is it?
• A divide-and-conquer approach
  – Instruction stream partitioned before entering the RSs
  – Narrower networks
  – Fewer ports to each cache
  – Needs a mechanism for proper partitioning
9
Data Decoupling: Operating Issues
• Memory Stream Partitioning
  – Hardware classification
  – Compiler classification
• Load Balancing
  – Enough instructions in different groups?
  – Are they well interleaved?
10
Access Region Locality & Access Region Prediction
11
Access Region: Overview
• Access Region R
  – R = (L, U)
• L: Lower bound on address
• U: Upper bound on address
• If (D < A) or (B < C),
  – Regions R and Q are said to be exclusive, or non-overlapping.
• Locations in exclusive regions are independent.
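The exclusivity test can be sketched in a few lines of Python. This is only an illustration: it assumes the slide's figure labels the two regions R = (A, B) and Q = (C, D), and the class name `Region`, the function name `exclusive`, and the address bounds are all made up for the example.

```python
from typing import NamedTuple


class Region(NamedTuple):
    lower: int  # L: lowest address in the region
    upper: int  # U: highest address in the region


def exclusive(r: Region, q: Region) -> bool:
    """Return True when r and q share no address.

    Mirrors the test (D < A) or (B < C), assuming r = (A, B) and q = (C, D).
    """
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical bounds, chosen only for illustration.
stack = Region(0x7000_0000, 0x7FFF_FFFF)
data = Region(0x1000_0000, 0x1000_FFFF)
print(exclusive(stack, data))
```

Because the bounds are inclusive, two regions that share even a single boundary address are not exclusive under this test.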
12
Access Region and Mem. Instructions
13
Partitioning Memory Space
• One way of partitioning memory space into regions:
  – Data Region / Heap Region / Stack Region
• This work assumes this partitioning.
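As a rough sketch of what this partitioning means for a single address, the snippet below maps an address to one of the three regions. The boundary constants `DATA_END` and `STACK_BASE` are assumptions invented for the example; the real boundaries depend on the operating system, ABI, and program, and the slides do not give them.

```python
from enum import Enum


class AccessRegion(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical layout, not taken from the slides.
DATA_END = 0x1001_0000    # static data assumed to live below this address
STACK_BASE = 0x7000_0000  # stack assumed to live at or above this address


def classify(addr: int) -> AccessRegion:
    """Map an address to Data, Heap, or Stack under the assumed layout."""
    if addr < DATA_END:
        return AccessRegion.DATA
    if addr >= STACK_BASE:
        return AccessRegion.STACK
    return AccessRegion.HEAP


print(classify(0x7FFF_FFF0))
```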
14
Partitioning Memory Space, Cont’d
• Many accesses are toward Data and Stack regions.
• Some programs don’t access the Heap region at all.
[Chart omitted; values in %]
15
Partitioning Memory Space, Cont’d
• Accesses to Data region are less bursty than others.
• Programs such as ijpeg have clustered region accesses.
• Window Size = 32
16
Partitioning Memory Space, Cont’d
• With a large window, Stack accesses become less bursty.
• Data and Stack regions have quite stable, constant demand.
• Window Size = 64
17
Partitioning Memory Space, Cont’d
[Chart over go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, and FP. Avg; data labels: 1.9%, 1.8%, 51.1%, 50.4%, 1.6%, 16.2%, 45.4%, 31.6%]
• Many instructions access a single region (~98%).
• Multi-region-accessing instructions account for 0 to 9.6% of dynamic memory references.
18
Access Region Locality
• “A memory reference instruction typically accesses a single region at run time”
  – Only about 2% of all static memory instructions access more than a single region.
• “(Thus) the region it accesses is highly predictable”
  – Simple predictors with a small look-up table achieve high prediction accuracy.
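One way to measure this kind of locality on a memory trace is sketched below. It is an assumption-laden illustration, not the authors' methodology: the trace format (pairs of instruction address and region name), the function name `multi_region_fraction`, and the toy trace are all hypothetical.

```python
from collections import defaultdict


def multi_region_fraction(trace):
    """Fraction of static memory instructions that touch more than one region.

    trace: iterable of (pc, region) pairs, one per dynamic memory reference.
    """
    regions_by_pc = defaultdict(set)
    for pc, region in trace:
        regions_by_pc[pc].add(region)
    if not regions_by_pc:
        return 0.0
    multi = sum(1 for regions in regions_by_pc.values() if len(regions) > 1)
    return multi / len(regions_by_pc)


# Toy trace: only the instruction at 0x408 touches two regions.
toy_trace = [
    (0x400, "stack"), (0x404, "data"), (0x400, "stack"),
    (0x408, "heap"), (0x408, "data"), (0x40C, "stack"),
]
print(multi_region_fraction(toy_trace))
```

The same per-instruction sets could be weighted by execution count to obtain the dynamic share of multi-region references instead of the static share.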
19
Predicting Regions: Unlimited Case
• One predictor per memory instruction
• Predictor types:
  – 1-bit history saver (0: Data, 1: Stack)
  – 2-bit saturating counter
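The two predictor types can be sketched as follows. The 0/1 encoding follows the slide; the initial states, the threshold used by the 2-bit counter, and the helper `accuracy` are assumptions made for the example.

```python
class OneBitPredictor:
    """Predicts whatever region the instruction accessed last time."""

    def __init__(self):
        self.last = 0  # 0: Data, 1: Stack (initial value is an assumption)

    def predict(self) -> int:
        return self.last

    def update(self, actual: int) -> None:
        self.last = actual


class TwoBitPredictor:
    """2-bit saturating counter; values 2 and 3 predict Stack (assumed)."""

    def __init__(self):
        self.counter = 0

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, outcomes) -> float:
    """Predict, then train, on each outcome; return the hit rate."""
    hits = 0
    for actual in outcomes:
        if predictor.predict() == actual:
            hits += 1
        predictor.update(actual)
    return hits / len(outcomes)


outcomes = [1, 1, 1, 0, 0, 0, 0, 0]  # made-up region sequence for one instruction
print(accuracy(OneBitPredictor(), outcomes), accuracy(TwoBitPredictor(), outcomes))
```

The 1-bit scheme retrains after a single change of region, while the 2-bit counter needs two consecutive accesses to the other region before it flips.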
20
Predicting Regions: Adding Context
• Run-time context
  – Caller’s ID (CID): in Link Register
  – Global Branch History (GBH)
  – Hybrid of above
21
Predicting Regions: Utilizing Static Info.
• Some instructions’ access regions are revealed through architecture and compiler conventions:
  – Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  – Use of Global Pointer ($GP) suggests that the region is non-Stack.
  – For others, assume non-Stack.
• Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
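A minimal sketch of these static rules, keyed on the base register of a load or store, might look like the following. The function name `static_region_hint`, the string register names, and the second return value (whether the convention actually revealed the region or the default was used) are assumptions for illustration.

```python
def static_region_hint(base_register: str):
    """Return (region, revealed) for a memory instruction's base register.

    region is "stack" or "non-stack"; revealed is False when the rule
    falls back to the default guess for other registers.
    """
    reg = base_register.lower()
    if reg in ("$sp", "$fp"):
        return ("stack", True)
    if reg == "$gp":
        return ("non-stack", True)
    return ("non-stack", False)  # default for all other base registers


for reg in ("$sp", "$gp", "$t0"):
    print(reg, static_region_hint(reg))
```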
22
Region Pred. Result: Unlimited Case
[Chart over go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, and FP. Avg; series: Simple 1-bit, w/ GBH, w/ CID, Static, w/ Hybrid]
• 1-bit predictors do better than 2-bit predictors (not shown).
• Hybrid context bits achieve the best prediction rate on average.
23
Predicting Regions: Limited-Size ARPT
• The low n bits of the PC, XORed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
  – Table entries initialized to 0’s
  – 1 to denote stack access
  – Decoding information exploited to save ARPT space
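The indexing scheme can be sketched as a small table of 1-bit entries. The class name `ARPT`, its methods, and the way the context bits are passed in are assumptions; the slides do not say how the hybrid context is formed, so the caller supplies it as an integer here.

```python
class ARPT:
    """Sketch of an Access Region Prediction Table with 1-bit entries."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # all entries start at 0 (non-stack)

    def _index(self, pc: int, context: int) -> int:
        # Low n bits of the PC XORed with the low n bits of the context.
        return (pc ^ context) & self.mask

    def predict(self, pc: int, context: int = 0) -> int:
        return self.table[self._index(pc, context)]  # 1 denotes a stack access

    def update(self, pc: int, actual: int, context: int = 0) -> None:
        self.table[self._index(pc, context)] = actual


arpt = ARPT(index_bits=10)      # 1024 entries, chosen arbitrarily
arpt.update(0x400, 1, context=0x3)
print(arpt.predict(0x400, context=0x3), arpt.predict(0x400, context=0x0))
```

Because the table is untagged, two (PC, context) pairs that map to the same index share an entry, which is one reason smaller tables lose accuracy.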
24
Region Prediction Result: ARPT
[Chart over go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, and FP. Avg; series: Unlimited, 8 KB, 4 KB, 2 KB, 1 KB]
• Over 99.9% accuracy with a 4 KB or larger ARPT, without compiler hints.
• Compiler hints relieve pressure due to smaller sizes.
25
Dynamic Data Decoupling
26
Dynamic Data Decoupling
27
Dynamic Data Decoupling, Cont’d
• Dynamically predict access regions to classify memory instructions:
  – Utilize the Access Region Prediction Table (ARPT).
  – Utilize any region information revealed through instruction decoding.
• Dispatch partitioned memory instructions into separate memory pipelines, connected to separate caches.
• Dynamically verify the region prediction:
  – Let the TLB (i.e., page table) contain verification information so that a memory access is reissued on a misprediction.
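The predict, dispatch, verify, and reissue flow can be sketched as a toy simulation. Everything concrete here is an assumption for illustration: the per-PC dictionary standing in for the ARPT, the function `tlb_is_stack` standing in for the per-page verification bit, the `STACK_BASE` boundary, and the sample operations.

```python
STACK_BASE = 0x7000_0000  # hypothetical boundary, not from the slides


def tlb_is_stack(addr: int) -> int:
    """Stand-in for the verification bit a TLB entry would hold."""
    return int(addr >= STACK_BASE)


def run_decoupled(mem_ops, table):
    """mem_ops: iterable of (pc, addr); table: dict mapping pc -> 0/1 prediction."""
    issued = {"data": 0, "stack": 0}
    reissued = 0
    for pc, addr in mem_ops:
        predicted = table.get(pc, 0)  # unseen instructions default to non-stack
        issued["stack" if predicted else "data"] += 1
        actual = tlb_is_stack(addr)
        if predicted != actual:
            # Misprediction: reissue to the other pipeline and retrain.
            reissued += 1
            issued["stack" if actual else "data"] += 1
            table[pc] = actual
    return issued, reissued


ops = [(0x400, 0x7FFF_0000), (0x404, 0x1000_0000), (0x400, 0x7FFF_0008)]
print(run_decoupled(ops, {}))
```

In this toy run the first access by the instruction at 0x400 is mispredicted and reissued; its second access is steered to the stack pipeline directly.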
28
Base Machine Model
29
Overall Performance
[Chart over go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, and FP. Avg]
• Over (2+0) conf.
30
Conclusions
• Access Region Locality says:
  – Memory instructions access few regions at run time.
  – Accessed regions are accurately predictable.
• Access Region Locality leads to Access Region Prediction techniques.
• Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.
31
Now Any Questions?
32
Impact of LVC Size
[Chart over LVC sizes 0.5K, 1K, 2K, 4K]
• 2KB and 4KB LVCs achieve high hit rates (~99.9%).
• Set associativity is less important if the LVC is 2KB or more.
• A small, simple LVC works well.