
1 STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution. Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos

2 Shark #1, Shark #2, Starfish

3 Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes

4 Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes

5 1 2 3

6 1 2 3 1 2 3 1 2 3

7 Shark #1, Shark #2, Starfish. Timeline: icing steps 1, 2, 3 on each cake, with an Empty, Wash, Fill of the bag at every color switch.

8 © Islam Atta 8 Sssshhh…

9 Timeline: icing steps 1, 2, 3 on Shark #1, Shark #2, Starfish, with an Empty/Wash/Fill at every color switch. When executing OLTP transactions, processors aren’t as clever.

10 Icing Cakes and OLTP Transactions (diagram: transaction, DB query, DB operations, instruction cache, processor)

11 Transaction #1, Transaction #2, Transaction #3: today’s systems (instruction misses) vs. a better way

12 Unlike Icing Cakes… transaction operations have unclear boundaries, and are repeated, conditional, and different.

13 STREX: a dynamic hardware solution that breaks execution into L1-I-sized sub-problems and time-multiplexes to improve locality. Performance: reduces instruction misses by up to 44% and data misses by up to 37%; improves throughput by 35-55% for 2-16 cores. Robust: non-OLTP workloads remain unaffected.

14 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

15 Online Transaction Processing (OLTP): a $100 billion/yr market, growing ~10% annually. E.g., banking, online purchases, stock markets. Benchmarking: Transaction Processing Performance Council (TPC). TPC-C: wholesale retailer; TPC-E: brokerage market. OLTP drives innovation for HW and DB vendors.

16 Transactions suffer from instruction misses: many concurrent transactions, each with an instruction footprint exceeding the L1-I size, cause instruction stalls due to L1 instruction-cache thrashing.
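The thrashing this slide describes can be reproduced with a toy cache model. The sketch below is purely illustrative: the block counts, the 16-block context-switch quantum, and the fully associative LRU cache are assumptions for the example, not the paper's simulated configuration.

```python
from collections import OrderedDict

def misses(stream, cache_blocks):
    """Count misses for a block-address stream on a fully associative LRU cache."""
    cache, m = OrderedDict(), 0
    for blk in stream:
        if blk in cache:
            cache.move_to_end(blk)          # refresh LRU position on a hit
        else:
            m += 1
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)   # evict the LRU block
            cache[blk] = None
    return m

# Three transactions, each sweeping a 48-block code footprint twice
# (modeling loops/repeated operations); the L1-I holds 64 blocks.
txns = [list(range(i * 48, (i + 1) * 48)) * 2 for i in range(3)]

# Run-to-completion: each 48-block footprint fits, so second sweeps hit.
serialized = [b for t in txns for b in t]
# Round-robin context switching every 16 blocks: the combined 144-block
# working set thrashes the 64-block cache.
interleaved = [b for j in range(0, 96, 16) for t in txns for b in t[j:j + 16]]

print(misses(serialized, 64))    # 144 misses (cold misses only)
print(misses(interleaved, 64))   # 288 misses (every access misses)
```

With these made-up sizes, interleaving doubles the miss count even though each transaction's code fits in the cache on its own, which is the thrashing effect the slide points at.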

17 OLTP Facts: many concurrent transactions; few DB operations (28-65KB); few transaction types (TPC-C: 5, TPC-E: 12); transactions fit in 128-512KB. Operations (R(), U(), I(), D(), IT(), ITP()) overlap within and across different transactions, e.g., Payment and New Order. CMPs’ aggregate L1-I cache is large enough.

18 Temporal Code Redundancy: Payment transactions perform similar operations in a similar sequence over time.

19 Why Is There So Much Instruction Overlap? Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH). New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a loop over OL_CNT and a condition in the figure. Transactions are built from a few DB operations, and similar transactions perform similar operations.
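The overlap claim can be made concrete by intersecting the two operation lists from the slide. The operation names below are the talk's own shorthand labels (R/U/I/IT for read, update, insert, and index operations); treating them as plain strings is just for illustration.

```python
# Operation sequences as listed on the slide (labels copied verbatim).
payment = ["IT(CUST)", "R(DIST)", "R(CUST)", "U(CUST)",
           "U(DIST)", "U(WH)", "I(HIST)", "R(WH)"]
new_order = ["R(DIST)", "I(NORD)", "R(WH)", "U(DIST)", "R(CUST)",
             "R(ITEM)", "R(STO)", "U(STO)", "I(OL)", "I(ORD)"]

# Operations whose code both transaction types execute.
shared = set(payment) & set(new_order)
print(sorted(shared))  # ['R(CUST)', 'R(DIST)', 'R(WH)', 'U(DIST)']
```

Four of Payment's eight operations also appear in New Order, so their instruction footprints share code paths even across transaction types, which is the redundancy STREX exploits.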

20 Transaction #1, Transaction #2, Transaction #3: today’s systems (instruction misses) vs. stratified execution

21 Challenges: unclear operation boundaries; operations that are repeated, conditional, and different.

22 Generalized transaction scheduling is NP-complete, so a heuristic is needed.

23 “When you cannot solve a problem… think of a problem you can solve.” Pikos Apikos, MCMLXXXV

24 Conventional vs. STREX Scheduling of Identical Transactions. Figure: the two interleavings (ABCABCABC and AAABBBCCC) for Transactions A, B, C across Phases 1-3, with miss overhead shown over time.

25 Optimal Scheduling for Identical Transactions. Figure: Transactions A, B, C time-share one L1-I across Phases 1, 2, 3; the rule is to never evict a block touched in the current phase (a red block in the figure).

26 Implementation (Transactions A, B, C time-share one L1-I across Phases 1, 2, 3):
1. Group same-type transactions.
2. The first thread becomes the lead.
3. The phase # starts at one.
4. Touched blocks are marked with the current phase #.
5. If the victim block is tagged with the current phase #, switch threads.
6. The lead thread increments the phase #.
Works well for the general case.
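The six steps on this slide can be sketched in software. This is a minimal model, not the hardware design: it assumes identical same-type transactions, a fully associative LRU L1-I, single-block access granularity, and a software scheduler standing in for the hardware mechanism.

```python
from collections import OrderedDict

def strex_run(streams, cache_blocks):
    """Time-multiplex same-type transactions on one core, STREX-style.

    streams: one instruction-block address list per transaction (step 1:
    a group of same-type transactions). Returns total L1-I misses under
    a fully associative LRU cache.
    """
    cache = OrderedDict()              # block -> phase # that last touched it
    pos = [0] * len(streams)           # next access index per thread
    t, phase, n_miss = 0, 1, 0         # step 2: thread 0 leads; step 3: phase 1
    while any(p < len(s) for p, s in zip(pos, streams)):
        if pos[t] >= len(streams[t]):  # this transaction already finished
            t = (t + 1) % len(streams)
            if t == 0:
                phase += 1             # step 6: control is back at the lead
            continue
        blk = streams[t][pos[t]]
        if blk in cache:
            cache.move_to_end(blk)
            cache[blk] = phase         # step 4: mark with the current phase #
            pos[t] += 1
            continue
        if len(cache) >= cache_blocks:
            victim_phase = next(iter(cache.values()))   # LRU block's tag
            if victim_phase == phase:  # step 5: would evict a current-phase
                t = (t + 1) % len(streams)              # block: switch threads
                if t == 0:
                    phase += 1         # step 6: lead thread bumps the phase #
                continue
            cache.popitem(last=False)  # safe to evict an old-phase block
        n_miss += 1
        cache[blk] = phase             # step 4 again, on the fill path
        pos[t] += 1
    return n_miss

# Three identical 96-block transactions on a 64-block cache: only the
# lead thread misses in each phase, so trailing transactions reuse the
# lead's cache contents.
print(strex_run([list(range(96))] * 3, 64))  # 96 misses
```

In this toy run the stratified schedule takes 96 misses in total, whereas running the same three transactions back-to-back would sweep the too-small cache three times (288 misses), matching the intuition behind the phase rule.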

27 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

28 SLICC Concept. SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads, I. Atta, P. Tözün, A. Ailamaki, A. Moshovos, MICRO-45, December 2012. Technology: a CMP’s aggregate L1 instruction cache capacity is large enough. SLICC migrates threads across multiple L1-I caches over time; it is similar to icing cakes with multiple icing bags. Condition: aggregate cache capacity is sufficient. SLICC was demonstrated on 16 cores.

29 SLICC Needs Enough Cores. With few cores or a larger instruction footprint, the aggregate L1-I capacity falls short. Can these happen in practice? 1. Data-center constraints limit core count. 2. Instruction footprints keep increasing.

30 Roadmap: OLTP characteristics, challenges, opportunities; STREX; SLICC and its limitations; results; summary

31 Methodology. Simulation: Zesto (x86; thanks to Georgia Tech), 2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per-core L2. Tracing: QTrace (Xin Tong’s QEMU extension). Workloads: Shore-MT.

32 Experimental Evaluation. Effect on instruction and data misses: L1-I (instruction locality) and L1-D (data sharing). Performance impact: are context-switching overheads amortized? Comparison to SLICC: sensitivity to available core count.

33 L1 Misses per Kilo-Instruction (MPKI): Instructions. Baseline: no effort to reduce instruction misses. SLICC: distributes the footprint across CMP cores/caches [Atta, MICRO’12]. Chart compares STREX, SLICC, and Baseline (lower is better).
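MPKI, the metric on these two charts, normalizes miss counts so runs of different lengths are comparable. A one-line helper, with made-up example numbers (not measured results from the talk):

```python
def mpki(misses, instructions):
    """L1 misses per kilo-instruction: the metric plotted on this slide."""
    return 1000.0 * misses / instructions

# Hypothetical illustration: 4,400 misses over 100K instructions.
print(mpki(4400, 100_000))  # 44.0
```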

34 L1 Misses per Kilo-Instruction (MPKI): Data. Baseline: no effort to reduce instruction misses. SLICC: distributes the footprint across CMP cores/caches [Atta, MICRO’12]. Chart compares STREX, SLICC, and Baseline (lower is better).

35 Throughput (chart; higher is better)

36 STREX recap: a dynamic hardware solution that breaks execution into L1-I-sized sub-problems and time-multiplexes to improve locality. Performance: reduces instruction misses by up to 44% and data misses by up to 37%; improves throughput by 35-55% for 2-16 cores. Robust: non-OLTP workloads remain unaffected.

37 Summary. OLTP performance suffers due to instruction stalls. Application opportunity: temporal code redundancy. SLICC (thread migration) is sensitive to runtime core count. STREX (thread stratification) synchronizes transaction execution on a single core, improving L1 instruction (and data) locality. Hybrid: the best of both worlds.

38 Thanks! Email: iatta@eecg.toronto.edu. Website: http://islamatta.com

39 Larger L1-I Caches? [DaMoN’12]

40 STREX with Identical Transactions

41 Replacement Policies

42 Thread Latency Trade-off

43 Detailed Methodology: Zesto (x86), QTrace (QEMU extension), Shore-MT

44 Workloads. Focus on OLTP: an important class of applications where instruction stalls dominate performance. Other workloads (Data Serving, Media Streaming, Web Frontend with SPECweb 2009, Web Backend) behave similarly to OLTP.

45 Hardware Cost

46 Hybrid

