SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. Islam Atta, Andreas Moshovos, Pınar Tözün, Anastasia Ailamaki.

Presentation transcript:

SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta, Andreas Moshovos, Pınar Tözün, Anastasia Ailamaki

Online Transaction Processing (OLTP)
A $100 billion/year market, growing 10% annually; e.g., banking, online purchases, stock markets.
Benchmarking: the Transaction Processing Performance Council (TPC). TPC-C: wholesale retailer. TPC-E: brokerage market.
OLTP drives innovation for HW and DB vendors. © Islam Atta 2

Transactions Suffer from Instruction Misses
Many concurrent transactions; each transaction's instruction footprint exceeds the L1-I size, so instructions stall due to L1 instruction cache thrashing.

Even on a CMP, All Transactions Suffer
(figure: cores with L1-I caches serving many transactions over time)
All caches are thrashed with similar code blocks.

Opportunity
Spreading the footprint over multiple cores → reduced instruction misses.
Technology: a CMP's aggregate L1 instruction cache capacity is large enough.
Application behavior: instruction overlap within and across transactions.
(figure: multiple threads sharing multiple L1-I caches over time)

SLICC Overview
A dynamic hardware solution that determines how to divide a transaction, when to move, and where to go.
Performance: reduces instruction misses by 44% (TPC-C) and 68% (TPC-E); improves performance by 60% (TPC-C) and 79% (TPC-E).
Robust: non-OLTP workloads remain unaffected.

Talk Roadmap
Intra/inter-thread instruction locality is high → SLICC concept → SLICC ingredients → results → summary.

OLTP Facts
Many concurrent transactions built from a few DB operations, e.g., R(), U(), I(), D(), IT(), ITP().
Few transaction types: TPC-C has 5, TPC-E has 12.
Transaction footprints fit in tens of KB (28–65KB).
Instructions overlap within and across different transactions (e.g., Payment and New Order share operations).
CMPs' aggregate L1-I cache capacity is large enough.

Instruction Commonality Across Transactions
(figure: code-reuse heatmaps for TPC-C and TPC-E, across all threads and per transaction type; more yellow = more reuse)
Lots of code reuse; even higher across same-type transactions.

Requirements
Enable use of the aggregate L1-I capacity: a large effective cache size without increased latency.
Exploit instruction commonality: localize common transaction instructions.
Dynamic: independent of footprint size or cache configuration.

Talk Roadmap
Intra/inter-thread instruction locality is high → SLICC concept → SLICC ingredients → results → summary.

Example of Concurrent Transactions
(figure: control-flow graphs of transactions T1, T2, and T3, divided into code segments that each fit into the L1-I)

Scheduling Threads
(figure: threads T1, T2, T3 over time on four cores)
Conventional: each thread stays on its core, repeatedly refilling its L1-I; the caches are filled 10 times.
SLICC: threads migrate so that each L1-I keeps one code segment; the caches are filled 4 times.
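The fill-count contrast on this slide can be reproduced with a toy model (the segment traces and round-robin placement below are illustrative assumptions, not the paper's simulator): conventional scheduling refills a thread's own L1-I whenever it enters a different code segment, while SLICC-style scheduling sends the thread to a core that already holds the segment.

```python
# Toy model: count L1-I fills under conventional vs. SLICC-style scheduling.
# Each thread is a sequence of code segments; one core's L1-I holds one segment.
# (Hypothetical traces for illustration only.)

def conventional_fills(threads):
    """Each thread stays on its own core; a fill happens whenever the
    segment it needs differs from what its core currently caches."""
    fills = 0
    for trace in threads:
        cached = None
        for seg in trace:
            if seg != cached:
                fills += 1
                cached = seg
    return fills

def slicc_fills(threads, num_cores):
    """Threads migrate to a core that already caches the needed segment;
    only when no core holds it is a core (round-robin victim) refilled."""
    caches = [None] * num_cores
    fills = 0
    victim = 0
    for trace in threads:
        for seg in trace:
            if seg not in caches:
                caches[victim] = seg          # fill one core's L1-I
                victim = (victim + 1) % num_cores
                fills += 1
    return fills

# Three threads sharing segments A, B, C (overlap within and across transactions).
threads = [list("ABCA"), list("ABC"), list("BCA")]
print(conventional_fills(threads))   # → 10: every segment change refills a cache
print(slicc_fills(threads, 4))       # → 3: each shared segment is filled once
```

Even this crude model shows the mechanism behind the slide's counts: migration converts repeated refills of the same shared code into reuse of already-filled caches.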

Talk Roadmap
Intra/inter-thread instruction locality is high → SLICC concept → SLICC ingredients → results → summary.

Migration Ingredients
When to migrate? Step 1: detect that the cache is full. Step 2: detect a new code segment.
Where to go? Step 3: predict where the next code segment is.

Migration Ingredients (example: thread T1)
(figure: T1 migrates across cores over time, looping within segments, leaving cores idle, and later returning to a core that still holds an earlier segment)

Migration Ingredients (example: thread T2)
(figure: T2 follows the same steps over time, reusing the segments already installed on the cores)

Implementation
Step 1 (detect cache full): a miss counter.
Step 2 (detect a new segment): miss dilution.
Step 3 (where is the next segment?): find signature blocks on remote cores.
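The three steps can be sketched as a per-core agent. This is a minimal illustration, not the paper's hardware: the class name, threshold, and the use of a plain set in place of the signature structure are all assumptions made for the sketch.

```python
# Illustrative per-core migration logic: a miss counter approximates
# "cache is full", and per-core signatures of cached block addresses
# answer "where to go". Thresholds and structures are assumptions.

class CoreAgent:
    FULL_THRESHOLD = 512          # misses before the L1-I is considered full

    def __init__(self, core_id):
        self.core_id = core_id
        self.miss_counter = 0
        self.signature = set()    # stand-in for a (partial) Bloom filter

    def record_fill(self, block_addr):
        """Called on every L1-I fill: count the miss, note the block."""
        self.miss_counter += 1
        self.signature.add(block_addr)

    def cache_full(self):
        """Step 1: enough misses since reset means the cache is full."""
        return self.miss_counter >= self.FULL_THRESHOLD

def pick_target(block_addr, agents, current):
    """Step 3: migrate to a remote core whose signature claims the
    missing block; otherwise stay and keep filling the local cache."""
    for agent in agents:
        if agent.core_id != current and block_addr in agent.signature:
            return agent.core_id
    return current
```

A real design would also implement step 2 (miss dilution: a falling miss rate signals the thread has settled into a cached segment) and would replace the set with a compact Bloom filter, consistent with the under-1KB-per-core storage reported later in the talk.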

Boosting Effectiveness
Overlap is greater across transactions of the same type.
SLICC is transaction-type-oblivious; two type-aware variants: SLICC-Pp (pre-processing detects similar transactions) and SLICC-SW (software provides the type information).

Talk Roadmap
Intra/inter-thread instruction locality is high → SLICC concept → SLICC ingredients → results → summary.

Experimental Evaluation
How does SLICC affect instruction misses? Our primary goal.
How does it affect data misses? Expected to increase, but by how much?
Performance impact: are data misses and migration overheads amortized?

Methodology
Simulation: Zesto (x86) with 16 OoO cores, 32KB L1-I, 32KB L1-D, and 1MB per-core L2; a QEMU extension captures user and kernel space.
Workloads: TPC-C and TPC-E running on Shore-MT.

Effect on Misses
Baseline: no effort to reduce instruction misses.
(figure: I-MPKI and D-MPKI; lower is better)
SLICC reduces I-MPKI by 58% and increases D-MPKI by 7%.

Performance
Next-line: always prefetch the next line. PIF-No-Overhead: an upper bound for Proactive Instruction Fetch [Ferdman, MICRO'11].
(figure: speedups of Next-Line, PIF-No-Overhead, SLICC, and SLICC-SW; higher is better) TPC-C: +60%; TPC-E: +79%.
Storage per core: PIF ~40KB; SLICC <1KB.

Summary
OLTP performance suffers due to instruction stalls.
Technology and application opportunities: the instruction footprint fits in the aggregate L1-I capacity of CMPs; inter- and intra-thread locality.
SLICC: thread migration spreads the instruction footprint over multiple cores.
Reduces I-MPKI by 58%; improves performance vs. the baseline by +70%, vs. next-line by +44%, and vs. PIF by ±2% to +21%.

Website: Thanks!

Why Do Data Misses Increase?
Example: a thread migrates from core A to core B.
Reads on core B miss on data that was fetched on core A; writes on core B invalidate data on core A; and when the thread returns to core A, its cache blocks may have been evicted by other threads.

SLICC Agent per Core
(figure: the per-core SLICC hardware agent)

Detailed Methodology
Zesto (x86), Qtrace (a QEMU extension), Shore-MT.

Hardware Cost

Larger I-Caches?
(figure: sensitivity to larger L1-I sizes; lower is better)

Different Replacement Policies?
(figure: sensitivity to the replacement policy; lower is better)

Parameter Space (1)
(figure: parameter sweep)

Parameter Space (2)
(figure: parameter sweep)

Cache Signature Accuracy
(figure: accuracy of the partial Bloom filter cache signature; higher is better)
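A minimal sketch of the kind of signature this slide evaluates (bit-array size and hash choice are assumptions, not the paper's design): each core summarizes its cached block addresses in a small Bloom filter, which can report false positives but never false negatives, trading accuracy for sub-KB storage.

```python
# Minimal Bloom-filter cache signature for block addresses.
# Bit-array size and hash functions are illustrative assumptions.

class BloomSignature:
    def __init__(self, bits=1024, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.array = 0                    # bit array packed into an int

    def _positions(self, addr):
        # Cheap per-index hashing over the block address.
        for i in range(self.hashes):
            yield hash((addr, i)) % self.bits

    def insert(self, addr):
        """Record a cached block by setting its hash positions."""
        for p in self._positions(addr):
            self.array |= 1 << p

    def maybe_contains(self, addr):
        """False => definitely absent; True => probably present
        (false positives possible, false negatives not)."""
        return all((self.array >> p) & 1 for p in self._positions(addr))
```

With, say, 1024 bits and 2 hash functions per core, the signature costs 128 bytes, in line with the talk's under-1KB-per-core storage claim; a partial signature (covering only some sets or ways) shrinks it further at the cost of accuracy, which is what the figure measures.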