
1 STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution. Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos

2 Shark #1, Shark #2, Starfish

3 Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes

4 Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes

5 1 2 3

6 1 2 3 1 2 3 1 2 3

7 Shark #1, Shark #2, Starfish. Timeline: icing steps 1, 2, 3 on each cake, with an Empty, Wash, Fill of the bag at every color switch.

8 © Islam Atta 8 Sssshhh…

9 Timeline: icing steps 1, 2, 3 on Shark #1, Shark #2, Starfish, with an Empty/Wash/Fill at every color switch. When executing OLTP transactions, processors aren’t as clever.

10 Icing Cakes and OLTP Transactions (diagram: transaction, DB query, DB operations, instruction cache, processor)

11 Transaction #1, Transaction #2, Transaction #3: today’s systems (instruction misses) vs. a better way

12 Unlike Icing Cakes… transaction operations have unclear boundaries, and are repeated, conditional, and different.

13 STREX: a dynamic hardware solution that breaks execution into L1-I-sized sub-problems and time-multiplexes to improve locality. Performance: reduces instruction misses by up to 44% and data misses by up to 37%; improves throughput by 35-55% for 2-16 cores. Robust: non-OLTP workloads remain unaffected.

14 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

15 Online Transaction Processing (OLTP): a $100 billion/yr market, growing ~10% annually. E.g., banking, online purchases, stock markets. Benchmarking: Transaction Processing Performance Council (TPC). TPC-C: wholesale retailer; TPC-E: brokerage market. OLTP drives innovation for HW and DB vendors.

16 Transactions suffer from instruction misses: many concurrent transactions, each with an instruction footprint exceeding the L1-I size, cause instruction stalls due to L1 instruction-cache thrashing.
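The thrashing this slide describes can be reproduced with a toy cache model. The sketch below is purely illustrative: the block counts, the 16-block context-switch quantum, and the fully associative LRU cache are assumptions for the example, not the paper's simulated configuration.

```python
from collections import OrderedDict

def misses(stream, cache_blocks):
    """Count misses for a block-address stream on a fully associative LRU cache."""
    cache, m = OrderedDict(), 0
    for blk in stream:
        if blk in cache:
            cache.move_to_end(blk)          # refresh LRU position on a hit
        else:
            m += 1
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)   # evict the LRU block
            cache[blk] = None
    return m

# Three transactions, each sweeping a 48-block code footprint twice
# (modeling loops/repeated operations); the L1-I holds 64 blocks.
txns = [list(range(i * 48, (i + 1) * 48)) * 2 for i in range(3)]

# Run-to-completion: each 48-block footprint fits, so second sweeps hit.
serialized = [b for t in txns for b in t]
# Round-robin context switching every 16 blocks: the combined 144-block
# working set thrashes the 64-block cache.
interleaved = [b for j in range(0, 96, 16) for t in txns for b in t[j:j + 16]]

print(misses(serialized, 64))    # 144 misses (cold misses only)
print(misses(interleaved, 64))   # 288 misses (every access misses)
```

With these made-up sizes, interleaving doubles the miss count even though each transaction's code fits in the cache on its own, which is the thrashing effect the slide points at.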

17 OLTP Facts: many concurrent transactions; few DB operations (28-65KB); few transaction types (TPC-C: 5, TPC-E: 12); transactions fit in 128-512KB. Operations (R(), U(), I(), D(), IT(), ITP()) overlap within and across different transactions, e.g., Payment and New Order. CMPs’ aggregate L1-I cache is large enough.

18 Temporal Code Redundancy: Payment transactions perform similar operations in a similar sequence over time.

19 Why Is There So Much Instruction Overlap? Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH). New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a loop over OL_CNT and a condition in the figure. Transactions are built from a few DB operations, and similar transactions perform similar operations.
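The overlap claim can be made concrete by intersecting the two operation lists from the slide. The operation names below are the talk's own shorthand labels (R/U/I/IT for read, update, insert, and index operations); treating them as plain strings is just for illustration.

```python
# Operation sequences as listed on the slide (labels copied verbatim).
payment = ["IT(CUST)", "R(DIST)", "R(CUST)", "U(CUST)",
           "U(DIST)", "U(WH)", "I(HIST)", "R(WH)"]
new_order = ["R(DIST)", "I(NORD)", "R(WH)", "U(DIST)", "R(CUST)",
             "R(ITEM)", "R(STO)", "U(STO)", "I(OL)", "I(ORD)"]

# Operations whose code both transaction types execute.
shared = set(payment) & set(new_order)
print(sorted(shared))  # ['R(CUST)', 'R(DIST)', 'R(WH)', 'U(DIST)']
```

Four of Payment's eight operations also appear in New Order, so their instruction footprints share code paths even across transaction types, which is the redundancy STREX exploits.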

20 Transaction #1, Transaction #2, Transaction #3: today’s systems (instruction misses) vs. stratified execution

21 Challenges: unclear operation boundaries; operations that are repeated, conditional, and different.

22 Generalized transaction scheduling is NP-complete, so a heuristic is needed.

23 “When you cannot solve a problem… think of a problem you can solve.” Pikos Apikos, MCMLXXXV

24 Conventional vs. STREX Scheduling of Identical Transactions. Figure: the two interleavings (ABCABCABC and AAABBBCCC) for Transactions A, B, C across Phases 1-3, with miss overhead shown over time.

25 Optimal Scheduling for Identical Transactions. Figure: Transactions A, B, C time-share one L1-I across Phases 1, 2, 3; the rule is to never evict a block touched in the current phase (a red block in the figure).

26 Implementation (Transactions A, B, C time-share one L1-I across Phases 1, 2, 3):
1. Group same-type transactions.
2. The first thread becomes the lead.
3. The phase # starts at one.
4. Touched blocks are marked with the current phase #.
5. If the victim block is tagged with the current phase #, switch threads.
6. The lead thread increments the phase #.
Works well for the general case.
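The six steps on this slide can be sketched in software. This is a minimal model, not the hardware design: it assumes identical same-type transactions, a fully associative LRU L1-I, single-block access granularity, and a software scheduler standing in for the hardware mechanism.

```python
from collections import OrderedDict

def strex_run(streams, cache_blocks):
    """Time-multiplex same-type transactions on one core, STREX-style.

    streams: one instruction-block address list per transaction (step 1:
    a group of same-type transactions). Returns total L1-I misses under
    a fully associative LRU cache.
    """
    cache = OrderedDict()              # block -> phase # that last touched it
    pos = [0] * len(streams)           # next access index per thread
    t, phase, n_miss = 0, 1, 0         # step 2: thread 0 leads; step 3: phase 1
    while any(p < len(s) for p, s in zip(pos, streams)):
        if pos[t] >= len(streams[t]):  # this transaction already finished
            t = (t + 1) % len(streams)
            if t == 0:
                phase += 1             # step 6: control is back at the lead
            continue
        blk = streams[t][pos[t]]
        if blk in cache:
            cache.move_to_end(blk)
            cache[blk] = phase         # step 4: mark with the current phase #
            pos[t] += 1
            continue
        if len(cache) >= cache_blocks:
            victim_phase = next(iter(cache.values()))   # LRU block's tag
            if victim_phase == phase:  # step 5: would evict a current-phase
                t = (t + 1) % len(streams)              # block: switch threads
                if t == 0:
                    phase += 1         # step 6: lead thread bumps the phase #
                continue
            cache.popitem(last=False)  # safe to evict an old-phase block
        n_miss += 1
        cache[blk] = phase             # step 4 again, on the fill path
        pos[t] += 1
    return n_miss

# Three identical 96-block transactions on a 64-block cache: only the
# lead thread misses in each phase, so trailing transactions reuse the
# lead's cache contents.
print(strex_run([list(range(96))] * 3, 64))  # 96 misses
```

In this toy run the stratified schedule takes 96 misses in total, whereas running the same three transactions back-to-back would sweep the too-small cache three times (288 misses), matching the intuition behind the phase rule.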

27 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

28 SLICC Concept. SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads, I. Atta, P. Tözün, A. Ailamaki, A. Moshovos, MICRO-45, December 2012. Technology: a CMP’s aggregate L1 instruction cache capacity is large enough. SLICC migrates threads across multiple L1-I caches over time; it is similar to icing cakes with multiple icing bags. Condition: aggregate cache capacity is sufficient. SLICC was demonstrated on 16 cores.

29 SLICC Needs Enough Cores. With few cores or a larger instruction footprint, the aggregate L1-I capacity falls short. Can these happen in practice? 1. Data-center constraints limit core count. 2. Instruction footprints keep increasing.

30 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

31 Methodology. Simulation: Zesto (x86; thanks to Georgia Tech), 2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per-core L2. Tracing: QTrace (Xin Tong’s QEMU extension). Workloads: Shore-MT.

32 Experimental Evaluation. Effect on instruction and data misses: L1-I (instruction locality) and L1-D (data sharing). Performance impact: are context-switching overheads amortized? Comparison to SLICC: sensitivity to available core count.

33 L1 Misses per Kilo-Instruction (MPKI): Instructions. Baseline: no effort to reduce instruction misses. SLICC: distributes the footprint across CMP cores/caches [Atta, MICRO’12]. Chart compares STREX, SLICC, and Baseline (lower is better).
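MPKI, the metric on these two charts, normalizes miss counts so runs of different lengths are comparable. A one-line helper, with made-up example numbers (not measured results from the talk):

```python
def mpki(misses, instructions):
    """L1 misses per kilo-instruction: the metric plotted on this slide."""
    return 1000.0 * misses / instructions

# Hypothetical illustration: 4,400 misses over 100K instructions.
print(mpki(4400, 100_000))  # 44.0
```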

34 L1 Misses per Kilo-Instruction (MPKI): Data. Baseline: no effort to reduce instruction misses. SLICC: distributes the footprint across CMP cores/caches [Atta, MICRO’12]. Chart compares STREX, SLICC, and Baseline (lower is better).

35 Throughput (chart; higher is better)

36 STREX recap: a dynamic hardware solution that breaks execution into L1-I-sized sub-problems and time-multiplexes to improve locality. Performance: reduces instruction misses by up to 44% and data misses by up to 37%; improves throughput by 35-55% for 2-16 cores. Robust: non-OLTP workloads remain unaffected.

37 Summary. OLTP performance suffers due to instruction stalls. Application opportunity: temporal code redundancy. SLICC (thread migration) is sensitive to runtime core count. STREX (thread stratification) synchronizes transaction execution on a single core, improving L1 instruction (and data) locality. Hybrid: the best of both worlds.

38 Thanks! Email: iatta@eecg.toronto.edu. Website: http://islamatta.com

39 Larger L1-I Caches? [DaMoN’12]

40 STREX with Identical Transactions

41 Replacement Policies

42 Thread Latency Trade-off

43 Detailed Methodology: Zesto (x86), QTrace (QEMU extension), Shore-MT

44 Workloads. Focus on OLTP: an important class of applications where instruction stalls dominate performance. Other workloads (Data Serving, Media Streaming, Web Frontend with SPECweb 2009, Web Backend) behave similarly to OLTP.

45 Hardware Cost

46 Hybrid

