Reducing OLTP Instruction Misses with Thread Migration

Presentation on theme: "Reducing OLTP Instruction Misses with Thread Migration"— Presentation transcript:

1 Reducing OLTP Instruction Misses with Thread Migration
Islam Atta Pınar Tözün Anastasia Ailamaki Andreas Moshovos University of Toronto École Polytechnique Fédérale de Lausanne

2 OLTP on an Intel Xeon X5660
Shore-MT, hyper-threading disabled
IPC < 1 on a 4-issue machine
70-80% of stalls are instruction stalls

3 OLTP L1 Instruction Cache Misses
Trace simulation, 4-way L1-I cache, Shore-MT
[Plot: L1-I misses vs. cache size, lower is better; today's most common size is marked]
~512KB is enough for the OLTP instruction footprint

4 Reducing Instruction Stalls at the hardware level
Larger L1-I cache: higher access latency
Different replacement policies: do not really affect OLTP workloads
Advanced prefetching: too much space overhead (~40KB per core)
Simultaneous multi-threading: increases IPC per hardware context, but pollutes the cache

5 Alternative: Thread Migration
Enables use of the aggregate L1-I capacity: a large cache size without increased latency
Can exploit instruction commonality: localizes common transaction instructions
Dynamic hardware solution: more general purpose

6 Transactions Running in Parallel
[Figure: threads T1, T2, T3 each split into instruction parts that can fit into the L1-I; concurrent threads share common instructions]

7 Scheduling Threads: Traditional vs. TMi
[Animation: threads T1-T3 running over time on multiple cores. Under traditional scheduling, every core re-fetches each instruction part into its own L1-I, so total misses keep growing; under TMi, each part is fetched once and threads migrate to the core whose L1-I already holds it (10 vs. 4 total misses in the slide's example)]
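The intuition above can be captured with a toy model (my own illustration, not the paper's simulator): several concurrent threads run the same transaction, whose instruction footprint splits into parts that are each one L1-I in size.

```python
# Toy model of the scheduling comparison on slide 7 (illustrative only).
PARTS = 3          # instruction-footprint segments, one L1-I each
THREADS = 3        # concurrent threads running the same transaction

# Traditional scheduling: each thread is pinned to its own core, so every
# core must fetch every part into its own L1-I -> misses scale with threads.
traditional_misses = PARTS * THREADS

# Thread migration: a part is fetched once on one core; threads then migrate
# to the core whose L1-I already holds the part they need next.
tmi_misses = PARTS

print(traditional_misses, tmi_misses)
```

With 3 threads and 3 parts the model gives 9 misses versus 3, mirroring the slide's point that aggregate L1-I capacity is reused instead of replicated.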

8 TMi: When to Migrate
Group threads
Wait till the L1-I is almost full
Count misses; record the last N misses
Misses > threshold ⇒ migrate
[Animation: transactions A and B moving across cores as their L1-I fills]
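The trigger above can be sketched as follows. This is a hypothetical software model of per-core bookkeeping (the names are mine; the real mechanism is in hardware), using the threshold and last-N values from the experimental setup slide.

```python
# Hypothetical sketch of TMi's migration trigger (illustrative model,
# not the paper's hardware implementation).
from collections import deque

MISS_THRESHOLD = 256   # misses before considering migration (value from the talk)
N_RECORDED = 6         # last-N missed blocks kept per core (value from the talk)

class CoreState:
    def __init__(self):
        self.miss_count = 0
        self.last_misses = deque(maxlen=N_RECORDED)  # rolling last-N record

    def on_imiss(self, block_addr):
        """Called on each L1-I miss once the cache is almost full.
        Returns True when the thread should try to migrate."""
        self.miss_count += 1
        self.last_misses.append(block_addr)
        return self.miss_count > MISS_THRESHOLD

core = CoreState()
for addr in range(300):            # 300 misses to distinct blocks
    migrate = core.on_imiss(addr)
print(migrate, list(core.last_misses))
```

Keeping only the last N misses makes the record cheap while still fingerprinting what the thread was recently fetching.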

9 TMi: Where to Migrate?
Check the last N misses recorded in the other caches:
1) No matching cache ⇒ move to an idle core, if one exists
2) Matching cache ⇒ move to that core
3) Neither ⇒ do not move
[Animation: transaction A's thread moving between cores over time]
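The three-way destination choice can be sketched as a small decision function (again a hypothetical model with names of my choosing, not the paper's hardware):

```python
# Hypothetical sketch of TMi's destination selection (illustrative only).
def pick_destination(my_recent_misses, other_cores, idle_cores):
    """my_recent_misses: blocks this thread recently missed on.
    other_cores: {core_id: set of last-N blocks missed at that core's L1-I}.
    idle_cores: list of currently idle core ids.
    Returns a core id to migrate to, or None to stay put."""
    # A cache whose recorded misses overlap ours likely holds our code:
    # move to that core (case 2 on the slide).
    for core_id, recorded in other_cores.items():
        if set(my_recent_misses) & recorded:
            return core_id
    # No matching cache: an idle core offers a fresh L1-I to fill (case 1).
    if idle_cores:
        return idle_cores[0]
    # Neither a match nor an idle core: do not move (case 3).
    return None

dest = pick_destination([0x40, 0x80], {1: {0x100}, 2: {0x80, 0xC0}}, [3])
print(dest)
```

Here core 2's recorded misses overlap the thread's own, so the thread migrates there rather than to the idle core.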

10 Experimental Setup
Trace simulation; Shore-MT as the storage manager
PIN to extract instruction & data accesses per transaction
16-core system; 32KB 8-way set-associative L1 caches
Miss threshold: 256; last 6 misses kept
Workloads: TPC-C, TPC-E

11 Impact on L1-I Misses
[Plot: lower is better]
Instruction misses reduced by half

12 Impact on L1-D Misses
[Plot: lower is better]
Cannot ignore the increased data misses

13 TMi's Challenges
Dealing with the data left behind: prefetching
Depends on thread identification: software-assisted, or hardware detection (OS support needed)
Disabling OS control over thread scheduling

14 Conclusion
~50% of the time, OLTP stalls on instructions
Spread computation through thread migration: TMi
TMi halves L1-I misses; time-wise, ~30% improvement expected
Data misses should be handled
Thank you!

