1 CoLT: Coalesced Large-Reach TLBs
December 2012
Binh Pham §, Viswanathan Vaidyanathan §, Aamer Jaleel ǂ, Abhishek Bhattacharjee §
§ Rutgers University   ǂ VSSAD, Intel Corporation

2 Address Translation Primer
[Diagram: per-core L1 and L2 TLBs sit beside the LSQ and L1 cache; the last-level cache holds page-table entries, 8 PTEs per cache line]
On a TLB miss:
– x86: 1-4 memory references
– ARM: 1-2 memory references
– Sparc: 1-2 memory references

3 Address Translation Performance Impact
Address translation performance overhead: 10-15%
– Clark & Emer [Trans. on Comp. Sys. 1985]
– Talluri & Hill [ASPLOS 1994]
– Barr, Cox & Rixner [ISCA 2011]
Emerging software trends
– Virtualization: 2D page walks – up to 89% overheads [Bhargava et al., ASPLOS 2008]
Emerging hardware trends
– LLC-capacity-to-TLB-capacity ratios increasing
– Manycore/hyperthreading increases TLB and LLC PTE stress

4 Contiguity & CoLT
Page table (virtual → physical): 0→512, 1→513, …, 511→1023, 512→1027, 513→1025, 514→1030, 515→1031, 516→1032
– High contiguity → large pages: one TLB entry maps the whole run 0:511 → 512:1023
– Low contiguity → TLB organization, prefetching
– Intermediate contiguity → CoLT: low HW/SW overhead, eliminates 40-50% of TLB misses, 14% performance gain
Standard TLB: 512→1027, 513→1025, 514→1030, 515→1031 (one entry per page)
Coalesced TLB: 512→1027, 513→1025, 514:516→1030:1032 (the contiguous run 514-516 collapses into a single entry)
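The coalescing idea on this slide fits in a few lines. A sketch (an illustrative software model of the entry format, not the hardware):

```python
def coalesce(page_table):
    """Collapse a VPN -> PPN mapping into (base_vpn, base_ppn, length)
    entries, merging runs where both VPN and PPN advance by exactly 1."""
    entries = []
    for vpn in sorted(page_table):
        ppn = page_table[vpn]
        if entries:
            base_vpn, base_ppn, length = entries[-1]
            # Extend the current run only if both sides stay contiguous.
            if vpn == base_vpn + length and ppn == base_ppn + length:
                entries[-1] = (base_vpn, base_ppn, length + 1)
                continue
        entries.append((vpn, ppn, 1))
    return entries
```

With the slide's page table, `coalesce({512: 1027, 513: 1025, 514: 1030, 515: 1031, 516: 1032})` yields three entries instead of five, with 514-516 coalesced into `(514, 1030, 3)`.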

5 Intermediate Contiguity: Past Work and Our Goals
Past work
– TLB sub-blocking: Talluri & Hill [ASPLOS 1994]
– SpecTLB: Barr, Cox & Rixner [ISCA 2011]
– Overheads from either HW or SW
– Alignment and special placement required
CoLT goals
– Low-overhead hardware
– No change in software
– No alignment requirements

6 Outline
Intermediate contiguity:
– Why does it exist?
– How much exists in real systems?
– How do you exploit it in hardware?
– How much can it improve performance?
Conclusion

7 Why does Intermediate Contiguity Exist?
– Buddy allocator: serves allocations from power-of-two free lists, so consecutive virtual pages often land in consecutive physical frames
– Memory compaction: migrates movable pages to create contiguous runs of free frames
[Diagram: free lists (List 0-3) over physical memory PFN 0-7; compaction moves movable pages aside, freeing PFN 0/1/2/3 as one contiguous block]
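The buddy-allocator half of this can be sketched concretely. Below is a minimal, illustrative buddy allocator (not the Linux implementation) showing why back-to-back single-page allocations tend to return adjacent frames:

```python
class BuddyAllocator:
    """Minimal buddy allocator sketch. free_lists[k] holds the base frame
    numbers of free blocks of 2**k pages."""
    def __init__(self, total_pages, max_order):
        self.free_lists = {k: [] for k in range(max_order + 1)}
        for base in range(0, total_pages, 1 << max_order):
            self.free_lists[max_order].append(base)

    def alloc(self, order):
        """Return the base frame of a free 2**order block, splitting a
        larger block into buddies when no block of the right size exists."""
        for k in range(order, max(self.free_lists) + 1):
            if self.free_lists[k]:
                base = self.free_lists[k].pop(0)
                while k > order:              # split: keep low half,
                    k -= 1                    # free the high-half buddy
                    self.free_lists[k].append(base + (1 << k))
                return base
        raise MemoryError("out of frames")
```

Starting from one free 8-page block, four successive `alloc(0)` calls return frames 0, 1, 2, 3: the splitting discipline itself produces the physically contiguous runs that CoLT exploits.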

8 Real System Experiments
Study real-system contiguity by varying:
– Superpages on or off
– Memory compaction daemon invocations
– System load
Real system configuration:
– CPU: Intel Core i7, 64-entry L1 TLBs, 512-entry L2 TLB
– Memory: 3 GB
– OS: Fedora 15, kernel 2.6.38

9 How Much Intermediate Contiguity Exists?
[Chart: contiguity measured across the system configurations above]
– Exploitable contiguity exists across all system configurations
– But not enough for superpages

10 How do you Exploit it in Hardware?
Example page table (virtual → physical): 0→10, 1→11, 2→20, 3→21, 4→22, 5→23
On a miss, the page walk fetches a whole cache line of PTEs (VPNs 0 to 7) from the LLC; coalescing logic scans that line for contiguous runs before filling the TLB.
– Standard TLB: one entry per page (0,10; 1,11; 2,20; 3,21; 4,22; 5,23), so each page in the reference stream can miss separately
– Coalesced TLB: fills 0,10:1,11 then 2,20:3,21 then 4,22:5,23 – three entries cover all six pages
[Diagram: reference stream 0-5 driving standard vs. coalesced TLB contents, with per-reference miss counts]
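The fill-time scan can be modeled directly. A sketch of the coalescing logic (an illustrative model of the idea, not the actual circuit), assuming the walker hands it the 8 PTEs of the fetched cache line:

```python
def coalesce_on_fill(ptes, miss_vpn, line_base):
    """Given the PTEs in the fetched cache line (ptes[i] is the PPN for
    VPN line_base + i, or None if invalid), return the largest contiguous
    (base_vpn, base_ppn, length) run containing the missing VPN."""
    i = miss_vpn - line_base
    lo = hi = i
    # Grow the run downward while PPNs stay contiguous...
    while lo > 0 and ptes[lo - 1] is not None and ptes[lo - 1] == ptes[lo] - 1:
        lo -= 1
    # ...and upward likewise.
    while hi < len(ptes) - 1 and ptes[hi + 1] is not None and ptes[hi + 1] == ptes[hi] + 1:
        hi += 1
    return (line_base + lo, ptes[lo], hi - lo + 1)
```

With the slide's page table, a miss on VPN 3 sees the line `[10, 11, 20, 21, 22, 23, None, None]` and fills the single coalesced entry `(2, 20, 4)`, covering VPNs 2-5 → PPNs 20-23 in one TLB slot.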

11 CoLT for Set-Associative TLBs
Hardware complexity
– Modest lookup complexity: combinational logic adds the offset within the run to the stored base physical page
– No additional ports
– Coalesce on fill to keep the logic off the lookup critical path
– No additional page walks
[Diagram: lookup for virtual page 5 (0b101); each entry holds a tag, per-page valid bits, attributes, and a base physical page, with the PTEs for VPNs 0 to 7 arriving from the LLC]
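A coalesced entry's hit check and physical-page computation are simple range arithmetic. A sketch, assuming a (base_vpn, base_ppn, length) entry format as on slide 4 (one plausible software model of the lookup, not the exact entry encoding):

```python
def coalesced_hit(entry, vpn):
    """Return the translated physical page if vpn falls inside the
    coalesced run, else None. The adder on the base physical page is the
    'combinational logic' the slide refers to."""
    base_vpn, base_ppn, length = entry
    if base_vpn <= vpn < base_vpn + length:
        return base_ppn + (vpn - base_vpn)   # base + offset within the run
    return None
```

For the entry `(514, 1030, 3)` from slide 4, looking up VPN 515 returns 1031, while VPN 517 misses.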

12 CoLT Set-Associative Miss Rates
[Chart: miss rates as a function of how many index bits are left-shifted]
– Left-shifting 2 index bits is the best compromise between coalescing opportunity and conflict misses
– Roughly 50% of misses eliminated on average
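The index-bit shift matters because a coalesced entry can only cover pages that index into the same TLB set. A sketch of the trade-off (illustrative parameter names; the shift amount and set count are assumptions, not the paper's exact configuration):

```python
def tlb_set_index(vpn, num_sets, shift=2):
    """Index the TLB with VPN bits left-shifted out by `shift`, so runs of
    up to 2**shift contiguous virtual pages fall into the same set and can
    share one coalesced entry. Larger shifts allow longer runs but map more
    distinct pages to each set, adding conflict misses."""
    return (vpn >> shift) % num_sets
```

With `shift=2`, VPNs 512-515 all index the same set (so one entry can cover the run); with `shift=0` they spread across four sets and cannot coalesce.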

13 Different CoLT Implementations
Set-associative TLB (CoLT-SA)
– Low hardware cost, but the indexing scheme caps coalescing opportunity
Fully-associative TLB (CoLT-FA)
– No indexing scheme, so high coalescing opportunity
– More complex hardware; we use ½ the baseline TLB size
Hybrid scheme (CoLT-All)
– CoLT-SA for limited coalescing, CoLT-FA for high coalescing

14 How Much Can it Improve Performance?
[Chart: performance of the CoLT variants relative to a perfect TLB]
CoLT gets us halfway to a perfect TLB's performance

15 Conclusions
– The buddy allocator, memory compaction, large pages, and system load create intermediate contiguity
– CoLT uses modest hardware to eliminate 40-50% of TLB misses
– Average performance improvement of 14%
CoLT suggests:
– Re-examining highly-associative TLBs?
– CoLT in virtualization? How does the hypervisor allocate physical memory?

16 Thank you!

17 Impact of Increasing Associativity

18 Comparison with Sub-blocking
Sub-blocking
– Uses one TLB entry to keep information about multiple mappings
– Complete sub-blocking: no OS modification, but a big TLB entry
– Partial sub-blocking: OS modification, small TLB entry, special placement required, e.g.:
  VPN(x) / N = VPN(y) / N; PPN(x) / N = PPN(y) / N;
  VPN(x) % N = PPN(x) % N; VPN(y) % N = PPN(y) % N
CoLT
– No alignment or special placement required
– Low-overhead hardware, no software overhead
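The placement constraints above can be written out as predicates, which makes the contrast concrete (a sketch that transcribes the slide's conditions; N=4 is an assumed sub-block size):

```python
def subblock_compatible(x_vpn, x_ppn, y_vpn, y_ppn, n=4):
    """Partial sub-blocking's constraint: both mappings must lie in the same
    virtual and physical block of n pages, and each PPN's offset within its
    block must match its VPN's offset (the alignment requirement)."""
    return (x_vpn // n == y_vpn // n and x_ppn // n == y_ppn // n
            and x_vpn % n == x_ppn % n and y_vpn % n == y_ppn % n)

def colt_compatible(x_vpn, x_ppn, y_vpn, y_ppn):
    """CoLT's constraint: the mappings need only be contiguous -- the VPN
    stride must equal the PPN stride. No alignment, no special placement."""
    return y_vpn - x_vpn == y_ppn - x_ppn
```

For example, the contiguous but unaligned pair 512→1027, 513→1028 is coalescable under CoLT yet fails partial sub-blocking, since 1027 % 4 ≠ 512 % 4.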

