MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases
Rubao Lee, Xiaoning Ding, Feng Chen, Qingda Lu, Xiaodong Zhang

Presentation transcript:

Slide 1: MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases
Rubao Lee (1,2), Xiaoning Ding (2), Feng Chen (2), Qingda Lu (3), Xiaodong Zhang (2)
(1) Inst. of Computing Tech., Chinese Academy of Sciences; (2) Dept. of Computer Sci. & Eng., The Ohio State University; (3) Systems Software Lab., Intel Corporation

Parallel Computing 22 2 R&D of Database Systems Have Been Driven by Moore’s Law IEEE Spectrum, May 2008 dropping DRAM price from 400 $/MB to 0.09 ¢/MB Parallel RDBMS DIRECT, Gamma, Paradise, … R&D of RDBMS to effectively utilize multi-cores 1970 Relational Data Model 1976: System R & Ingres Memory wall Multi-cores Cache-Optimized RDBMS MonetDB, StageDB,EaseDB … After 10,000 time increase in clock rate (400KHz in 1971 to 4GHz, in 2005), we have entered multi-core era by increasing # cores with low clock rate.

Slide 3: A Multi-core Processor Provides Shared Hardware Resources
In a multi-core processor such as the Intel Nehalem, Core 0 through Core 3 share the Last-Level Cache (LLC). LLC utilization is critical to overall execution performance.

Slide 4: The Shared LLC is a Double-edged Sword!
– Good news: constructive data sharing. When two cores access shared data, a line loaded by one core hits in the shared LLC for the other.
– Bad news: cache conflicts. The private data of core 1 and core 2 compete for the same LLC space, causing misses and reloads from memory.

Huge Fact Table 55 A Motivating Example: Hash Join and Index Join p_partkey = lo_partkey ⋈ Lineorder Part Small Dimention Table The query is based on the Star Schema Benchmark. One-time access index join lineorder part B+ Tree lineorder part hash join Hash table One-time access random index lookups seldom data reuse The working set very frequently accessed How can they utilize the shared cache when running alone or together?

Slide 6: L2 Miss Rates of Co-running Hash Join and Index Join
Setup: hardware: Core2Quad Xeon X5355 (4 MB L2 cache shared by two cores); OS: Linux; DBMS: PostgreSQL; measurement tool: Perfmon2.
Measured L2 miss rates (hash join / index join): 6% / 28% running alone, rising to 37% / 39% when co-running with each other.
(1) The two operators have totally different cache behaviors.
(2) Hash join: query execution time is increased by 56%!
(3) Index join: only slight performance degradation.

Slide 7: Challenges of DBMSs Running on Multi-cores
DBMSs have successfully been developed in an architecture/OS-independent mode:
– Buffer pool management (bypassing the OS buffer cache)
– Tablespaces on raw devices (bypassing the OS file system)
– DBMS thread scheduling and multiplexing
But DBMSs are not multicore-aware:
– Shared resources, e.g. the LLC, are managed by the CPU and the OS.
– Interference between co-running queries is out of DBMS control.
– Locality-based scheduling is not automatically handled by the OS.

Operating System 88 Who Knows What? Q1Q2 DBMS core Shared LLC PLAN Access Patterns of Data Objects (Tuples, Indices, Hash Tables, …) Cache Set Index Physical Memory Address Virtual Address Cache Allocation Hardware Control Memory Allocation How can we leverage the DBMS knowledge of query execution to guide query scheduling and cache allocation? data objects The three parties are DISCONNECTED! DBMS doesn’t know cache allocation. OS and Chip don’t know data access patterns. The problem: cache conflicts due to lack of knowledge of query executions

Operating System 99 Our Solution: MCC-DB Q1Q2 DBMS core Shared LLC PLAN Access Patterns of Data Objects (Tuples, Indices, Hash Tables, …) Cache Set Index Physical Memory Address Virtual Address Cache Allocation Hardware Control Memory Allocation Objectives : Multicore-Aware DBMS with communication and cooperation among the three parties. data objects

Slide 10: Outline
– The MCC-DB framework: sources and types of cache conflicts; the three components of MCC-DB; system issues; implementation in both PostgreSQL and the Linux kernel
– Performance evaluation
– Conclusion

Slide 11: Sources and Types of Cache Conflicts
1. Private data structures during query executions (cannot be shared by multiple cores).
2. Different cache sensitivities among various query plans.
3. Inability of the LRU scheme to protect cache-sensitive plans (cache pollution).
4. Limited cache space that cannot hold the working sets of co-running queries (capacity contention).
The locality strength of a query plan is determined by its data access pattern and working set size:
– Strong locality: a small working set (relative to the cache size) that is frequently accessed.
– Moderate locality: a working set comparable to the cache size that is moderately accessed.
– Weak locality: seldom data reuse; one-time accesses over a large data volume.

Slide 12: MCC-DB: A Framework to Minimize Cache Conflicts
Queries flow through three cooperating components, each answering one question:
1. Which plan? The query optimizer (DBMS).
2. Co-schedule? The execution scheduler (DBMS).
3. Cache allocation? Cache partitioning by the OS over the shared last-level cache (architecture).
The locality strength is the key!

Slide 13: Critical Issues
– How to estimate the locality strength of a query plan? (in the DBMS domain)
– How to determine the policies for query execution co-scheduling and cache partitioning? (in the DBMS domain, interfacing to the OS)
– How to partition the shared LLC among multiple cores in an effective way? (in the OS/multicore-processor domain)

Slide 14: Locality Estimation for Warehouse Queries
In the Star Schema Benchmark (Pat O'Neil, Betty O'Neil, Xuedong Chen, UMass/Boston), 87% to 97% of query execution time is spent executing multi-way joins, which share a common structure:
1. A huge fact table joined with small dimension tables.
2. Equal-joins (hash or index) on key/foreign-key relationships.
3. Aggregations and grouping after the joins.
We only consider the first two levels of joins for locality estimation (aggregated hash table size), since only 0.96% to 3.8% of fact table tuples reach the third-level join.
Locality estimation of join operators:
– An index join is estimated to have weak locality, due to random index lookups on the huge fact table.
– A hash join is estimated to have strong, moderate, or weak locality, according to its hash table size (see the paper for details).
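
The slide defers the classification thresholds to the paper; the sketch below illustrates the stated rule with made-up thresholds relative to the LLC size (none of the constants or names come from MCC-DB):

    #include <stddef.h>

    /* Hypothetical sketch of locality-strength estimation as described
     * on the slide; thresholds are illustrative, not from the paper.  */
    typedef enum { LOCALITY_STRONG, LOCALITY_MODERATE, LOCALITY_WEAK } locality_t;

    locality_t estimate_join_locality(int is_index_join,
                                      size_t est_hash_table_bytes,
                                      size_t llc_bytes)
    {
        if (is_index_join)
            return LOCALITY_WEAK;      /* random lookups on the huge fact table */
        if (est_hash_table_bytes < llc_bytes / 2)
            return LOCALITY_STRONG;    /* small, frequently probed working set  */
        if (est_hash_table_bytes <= 2 * llc_bytes)
            return LOCALITY_MODERATE;  /* working set comparable to cache size  */
        return LOCALITY_WEAK;          /* hash table far exceeds the LLC        */
    }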

Slide 15: Interference between Queries with Different Localities
The figure (only part of the experimental results; see the paper) measures performance degradations when co-running various hash joins and index joins: a hash join affected by an index join, a hash join affected by another hash join, and an index join affected by another index join.
– A weak-locality query has stably low performance degradation. Safe!
– Co-running strong-locality queries does not raise capacity contention.
– A moderate-locality query suffers from capacity contention.
– A weak-locality query can easily cause cache pollution.
Cache pollution is much more harmful than capacity contention!

Slide 16: Cache Conflicts
– Cache pollution: a weak-locality plan (a large volume of one-time-accessed data) can easily pollute the shared cache space.
– Capacity contention: co-running moderate-locality plans compete for the shared cache space, and misses are due to the limited space.
Cache pollution is a more severe problem than capacity contention, because the polluting plan loads new cache lines faster. Capacity contention cannot be removed (cache space is limited), but cache pollution can be, by cache partitioning!
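
To see why LRU cannot protect a cache-sensitive plan from a streaming co-runner, and why partitioning restores it, here is a toy simulation (not from the paper; all sizes are made up): a hot set that fits in the cache is interleaved 1:1 with one-time streaming accesses, first in a shared fully associative LRU cache, then in a partitioned one.

    #include <stdio.h>

    /* Toy fully associative LRU cache: tag[0] is the MRU line. */
    #define MAX_WAYS 512
    typedef struct { long tag[MAX_WAYS]; int ways, used; } lru_t;

    static int lru_access(lru_t *c, long tag)
    {
        int pos = -1;
        for (int i = 0; i < c->used; i++)
            if (c->tag[i] == tag) { pos = i; break; }
        int hit = (pos >= 0);
        if (!hit) {                       /* miss: fill a way or evict the LRU */
            if (c->used < c->ways) c->used++;
            pos = c->used - 1;
        }
        for (int i = pos; i > 0; i--)     /* move the line to the MRU position */
            c->tag[i] = c->tag[i - 1];
        c->tag[0] = tag;
        return hit;
    }

    /* A hot set of HOT lines is re-scanned while a co-runner streams
     * one-time lines (never reused) into its cache, interleaved 1:1.  */
    #define HOT 384
    static double hot_hit_rate(lru_t *hot_cache, lru_t *stream_cache, int rounds)
    {
        long next_stream = 1000000, hits = 0, total = 0;
        for (int r = 0; r < rounds; r++)
            for (long t = 0; t < HOT; t++) {
                hits += lru_access(hot_cache, t);
                total++;
                lru_access(stream_cache, next_stream++);
            }
        return (double)hits / total;
    }

    int main(void)
    {
        /* Shared cache: the stream's one-time lines evict the hot set. */
        lru_t shared = { .ways = 512 };
        printf("hot-set hit rate, shared 512-line LRU : %.2f\n",
               hot_hit_rate(&shared, &shared, 10));

        /* Partitioned: confine the stream to 64 lines; the hot set fits. */
        lru_t hot_part = { .ways = 448 }, stream_part = { .ways = 64 };
        printf("hot-set hit rate, 448/64 partitioned  : %.2f\n",
               hot_hit_rate(&hot_part, &stream_part, 10));
        return 0;
    }

In the shared case the hot set's hit rate collapses to 0 (each hot line sees 767 distinct lines between reuses, exceeding the 512-line capacity), while the partitioned case reaches 0.90 (all hits after the first warm-up round).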

Slide 17: Page Coloring for Cache Partitioning
Address translation maps a virtual address (virtual page number + page offset) to a physical address (physical page number + page offset). A physically indexed cache decomposes the physical address into cache tag, set index, and block offset. The page color bits are the bits shared by the physical page number and the set index, and they are under OS control.
A physically indexed cache is thus divided into multiple regions (colors), and all cache lines in a physical page are cached in one of those regions. The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits.

Slide 18: The Shared LLC Can Be Partitioned into Multiple Regions
Physical pages are grouped into different bins based on their page colors (…, i, i+1, i+2, …). Through OS address mapping, process 1 and process 2 receive pages from disjoint color bins, so the shared, physically indexed cache is partitioned between the two processes. Main memory space needs to be partitioned too (co-partitioning).
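
A minimal sketch of the arithmetic behind page coloring, assuming illustrative parameters (a 4 MB, 16-way LLC with 64-byte lines and 4 KB pages; none of these constants come from the slides):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative cache/VM parameters, not taken from the slides.   */
    #define CACHE_SIZE (4u << 20)   /* 4 MB physically indexed LLC    */
    #define ASSOC      16u          /* 16-way set associative         */
    #define LINE_SIZE  64u          /* 64-byte cache lines            */
    #define PAGE_SIZE  4096u        /* 4 KB pages                     */

    /* One color covers the cache sets a single page can map to:
     * #colors = (cache size / associativity) / page size = 64 here.  */
    #define NUM_COLORS (CACHE_SIZE / ASSOC / PAGE_SIZE)

    /* The page color lives in the low bits of the physical page number,
     * i.e. the bits the page number shares with the cache set index.  */
    static unsigned page_color(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
    }

    int main(void)
    {
        /* Adjacent physical pages fall into different colors, so the OS
         * can steer a process toward a cache region by choosing which
         * physical pages (colors) back its virtual pages.             */
        printf("colors=%u color(0x10000)=%u color(0x11000)=%u\n",
               (unsigned)NUM_COLORS, page_color(0x10000), page_color(0x11000));
        return 0;
    }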

Slide 19: Scheduling with and without Cache Partitioning
Scheduling without cache partitioning (a DBMS-only effort) uses SLS: co-scheduling query plans with the same locality strength.
(1) Low interference between weak-locality plans.
(2) Avoids cache pollution by never co-running weak-locality plans with cache-sensitive ones.
(3) But cache allocation is not applied, so performance is sub-optimal.
Scheduling with cache partitioning (DBMS + OS efforts) uses MLS: co-scheduling query plans with mixed locality strengths.
(1) Eliminates cache pollution by limiting the cache space given to weak-locality queries.
(2) Avoids capacity contention by allocating cache space to each query according to its need.
A sketch of the two policies follows below.
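
A hypothetical sketch of the two policies; the names and the color split are illustrative, not MCC-DB's actual interfaces (locality_t mirrors the earlier estimation sketch):

    /* Hypothetical sketch of the SLS and MLS co-scheduling policies.   */
    typedef enum { LOCALITY_STRONG, LOCALITY_MODERATE, LOCALITY_WEAK } locality_t;

    typedef struct { int query_id; locality_t loc; } plan_t;

    /* SLS (DBMS only): co-run two plans only if their locality strengths
     * match, so a weak-locality plan never pollutes a sensitive one.   */
    int sls_can_corun(const plan_t *a, const plan_t *b)
    {
        return a->loc == b->loc;
    }

    /* MLS (DBMS + OS): co-run mixed strengths, then ask the OS to cap
     * the weak-locality plan's cache share via page coloring.  Returns
     * how many of n_colors to give the weak plan (an illustrative 1/8). */
    int mls_colors_for_weak_plan(int n_colors)
    {
        int slice = n_colors / 8;
        return slice > 0 ? slice : 1;
    }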

Slide 20: The Effectiveness of Cache Partitioning
[Figures: L2 miss rate and query execution time (s) of each operator as the LLC space given to the index join query shrinks from 4 MB to 3 MB, 2 MB, 1 MB, and 0.5 MB.]
Co-running a hash join (strong/moderate locality) and an index join (weak locality): cache partitioning reduces the cache space given to the index join operator, maximizing the performance of the strong/moderate-locality query without slowing down the weak-locality query.

Slide 21: SLS (DB Scheduling Only) vs. MLS (DB Scheduling + OS Partitioning)
Co-running hash joins with different hash table sizes against index joins: how much can the performance degradation of the hash joins (relative to running alone) be reduced?
– The worst case is cache pollution.
– SLS (scheduling only) reduces the performance degradation of hash joins when capacity contention is less significant (small hash tables), but brings only a small benefit for large hash tables, where capacity contention remains.
– MLS (scheduling + partitioning) minimizes cache conflicts.
Minimizing cache conflicts needs both scheduling and partitioning!

Slide 22: Summary
The multicore shared LLC is out of the DBMS's management:
– causing cache-pollution-related conflicts,
– under- or over-utilizing cache space,
– significantly degrading overall DBMS execution performance.
MCC-DB makes collaborative efforts between the DBMS and the OS:
– make query plans with mixed locality strengths,
– schedule co-running queries to avoid capacity and conflict misses,
– allocate cache space according to demands,
– all decisions are locality-centric.
Its effectiveness is shown by experiments on a warehouse DBMS. The MCC-DB methodology and principles can be applied to a broad scope of data-intensive applications.

 23