Improving Database Performance on Simultaneous Multithreading Processors

Presentation transcript:

Improving Database Performance on Simultaneous Multithreading Processors
Jingren Zhou (Microsoft Research), John Cieslewicz (Columbia University), Kenneth A. Ross (Columbia University), Mihir Shah (Columbia University)

Simultaneous Multithreading (SMT)
Available on modern CPUs:
- "Hyperthreading" on Pentium 4 and Xeon
- IBM POWER5
- Sun UltraSPARC IV
Challenge: design software to efficiently utilize SMT.
This talk: database software on an Intel Pentium 4 with Hyperthreading.

Superscalar Processor (no SMT)
A single instruction stream flows through a superscalar pipeline that can issue up to 2 instructions per cycle, improving instruction-level parallelism (CPI = 3/4 in the slide's example).

SMT Processor
Two instruction streams share the pipeline, improving thread-level parallelism: more opportunities to keep the processor busy (CPI = 5/8 in the slide's example). But sometimes SMT does not work so well.

Stalls
When one instruction stream stalls, the other can continue to make progress (CPI = 3/4 in the slide's example). Stalls are caused by cache misses (276 cycles for an L2 miss on the experimental platform), branch mispredictions (20-30 cycles), etc.

Memory Consistency
When the two instruction streams make conflicting accesses to a common cache line, the processor flushes the pipeline and synchronizes the cache with RAM: a "MOMC event" on the Pentium 4 (roughly 300+ cycles).

SMT Processor
Exposes multiple "logical" CPUs (one per instruction stream) on one physical CPU (~5% extra silicon to duplicate thread state information).
Better than single threading:
- Increased thread-level parallelism
- Improved processor utilization when one thread blocks
Not as good as two physical CPUs:
- CPU resources are shared, not replicated

SMT Challenges
Resource competition:
- Shared execution units
- Shared cache
Thread coordination:
- Locking, etc. has high overhead
False sharing:
- MOMC events

Approaches to Using SMT
- Ignore it, and write single-threaded code.
- Naïve parallelism: pretend the logical CPUs are physical CPUs.
- SMT-aware parallelism: parallel threads designed to avoid SMT-related interference.
- Use one thread for the algorithm and another to manage resources, e.g., to avoid stalls on cache misses.

Naïve Parallelism
Treat the SMT processor as if it were multi-core. Databases are already designed to utilize multiple processors, so no code modification is needed.
Uses shared processor resources inefficiently:
- Cache pollution / interference
- Competition for execution units

SMT-Aware Parallelism
Exploit intra-operator parallelism: divide the input and use a separate thread to process each part.
- E.g., one thread for even tuples, one for odd tuples.
- No explicit partitioning step is required.
- Sharing the input involves only multiple readers, so there are no MOMC events: two reads don't conflict.
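As a concrete illustration, here is a minimal sketch of the even/odd split using standard C++ threads; the Tuple layout and the process() placeholder are invented for the example and are not the paper's code.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Tuple { int key; int payload; };

// Placeholder for the per-tuple work (hash probe, index lookup, ...).
static void process(const Tuple& t) { (void)t; }

// Thread 0 handles even positions, thread 1 handles odd positions.
// Both threads only read the shared input, so their accesses never
// conflict and no MOMC events occur.
static void probe_strided(const std::vector<Tuple>& input,
                          std::size_t start, std::size_t stride) {
    for (std::size_t i = start; i < input.size(); i += stride)
        process(input[i]);
}

void smt_aware_probe(const std::vector<Tuple>& input) {
    std::thread even(probe_strided, std::cref(input), std::size_t{0}, std::size_t{2});
    std::thread odd (probe_strided, std::cref(input), std::size_t{1}, std::size_t{2});
    even.join();
    odd.join();
}
```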

SMT-Aware Parallelism (cont.)
Sharing output is challenging:
- Thread coordination for output
- Read/write and write/write conflicts on common cache lines (MOMC events)
"Solution": partition the output.
- Each thread writes to a separate memory buffer to avoid memory conflicts.
- An extra merge step is needed in the consumer of the output stream.
- It is difficult to maintain input order in the output.
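Under the same assumptions, the output side might be sketched as follows: each thread appends to its own buffer and the consumer performs the extra merge step (names are illustrative, not the paper's code).

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct OutTuple { int key; int value; };

// Each thread appends to its own, separately allocated buffer, so the two
// threads never write to a common cache line (no MOMC events on output).
static void produce(std::size_t start, std::size_t stride, std::size_t n,
                    std::vector<OutTuple>& out) {
    for (std::size_t i = start; i < n; i += stride)
        out.push_back(OutTuple{static_cast<int>(i), 0});
}

std::vector<OutTuple> run_and_merge(std::size_t n) {
    std::vector<OutTuple> out0, out1;
    std::thread t0(produce, std::size_t{0}, std::size_t{2}, n, std::ref(out0));
    std::thread t1(produce, std::size_t{1}, std::size_t{2}, n, std::ref(out1));
    t0.join();
    t1.join();

    // The extra merge step in the consumer; a plain concatenation like this
    // does not preserve the original input order.
    out0.insert(out0.end(), out1.begin(), out1.end());
    return out0;
}
```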

Managing Resources for SMT
Cache misses are a well-known performance bottleneck for modern database systems: mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
Goal: use a "helper" thread to avoid cache misses in the "main" thread.
- Load future memory references into the cache.
- Use an explicit load, not a prefetch.
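The difference matters in code: an explicit load dereferences the address and therefore cannot be dropped, unlike a prefetch hint. A minimal sketch (the preload helper is our name, not the paper's):

```cpp
// An explicit load actually dereferences the address, so the hardware must
// fetch the cache line; a prefetch is only a hint and may be dropped.
static inline void preload(const void* p) {
    (void)*static_cast<const volatile char*>(p);
}
```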

Data Dependency
Memory references that depend upon a previous memory access exhibit a data dependency. E.g., a hash table lookup chases pointers from the hash bucket, through overflow cells, to the tuple (diagram: Hash Buckets, Overflow Cells, Tuple).
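The pointer chain can be sketched as follows, assuming a chained hash table like the one in the slide (struct and function names are illustrative): each load supplies the address for the next one, so the hardware cannot overlap the misses on its own.

```cpp
#include <cstddef>

struct Tuple { int key; int payload; };

struct HashCell {
    unsigned  key;
    Tuple*    tuple;
    HashCell* next;
};

// Each step depends on the pointer fetched by the previous one
// (bucket header -> overflow cells -> tuple), so the cache misses
// form a serial chain.
Tuple* probe(HashCell* const* buckets, std::size_t nbuckets, unsigned key) {
    HashCell* cell = buckets[key % nbuckets];  // dependent miss 1: bucket header
    while (cell != nullptr) {
        if (cell->key == key)
            return cell->tuple;                // final dependent miss: the tuple
        cell = cell->next;                     // dependent misses 2..k: overflow cells
    }
    return nullptr;
}
```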

Data Dependency (cont.)
Data dependencies make instruction-level parallelism harder. Modern architectures provide prefetch instructions:
- Request that data be brought into the cache
- Non-blocking
Pitfalls:
- Prefetch instructions are frequently dropped
- Difficult to tune
- Too much prefetching can pollute the cache
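For reference, prefetch instructions are usually issued through a compiler intrinsic such as _mm_prefetch (or __builtin_prefetch on GCC/Clang); this is generic x86 usage, not code from the paper:

```cpp
#include <xmmintrin.h>   // _mm_prefetch (SSE)

// A prefetch is a non-blocking hint: it asks for the line to be brought into
// the cache, but the hardware may drop it, and issuing it too early or too
// often simply evicts useful lines.
void hint_prefetch(const void* addr) {
    _mm_prefetch(static_cast<const char*>(addr), _MM_HINT_T0);
}
```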

Staging Computation
For a probe that visits hash bucket A, overflow cells B and C, and then the tuple (assuming each element is a cache line):
1. Preload A.
2. (other work)
3. Process A.
4. Preload B.
5. (other work)
6. Process B.
7. Preload C.
8. (other work)
9. Process C.
10. Preload Tuple.
11. (other work)
12. Process Tuple.
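Reusing the hash-probe and preload sketches above, a simplified staged version over a batch of keys might look like this; it stages only the bucket header of each probe, whereas the full techniques cited on the next slide stage every element of the chain:

```cpp
#include <cstddef>

// Staged probing of a batch of keys, reusing HashCell, Tuple, probe(), and
// preload() from the sketches above. While probe i is being processed, the
// bucket header for probe i+1 is already on its way into the cache, hiding
// part of the miss latency behind useful work.
void probe_batch(HashCell* const* buckets, std::size_t nbuckets,
                 const unsigned* keys, std::size_t n, Tuple** out) {
    if (n == 0) return;
    if (HashCell* first = buckets[keys[0] % nbuckets])
        preload(first);                                // preload bucket for probe 0
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 < n) {
            if (HashCell* nxt = buckets[keys[i + 1] % nbuckets])
                preload(nxt);                          // issue preload for probe i+1
        }
        out[i] = probe(buckets, nbuckets, keys[i]);    // process probe i ("other work")
    }
}
```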

Staging Computation (cont.)
By overlapping memory latency with other work, some cache miss latency can be hidden: many probes are "in flight" at the same time. Algorithms need to be rewritten, e.g., Chen et al. [2004], Harizopoulos et al. [2004].

Work-Ahead Set: Main Thread
- Writes a memory address + computation state to the work-ahead set.
- Retrieves a previous address + state.
- The hope is that the helper thread can preload the data before it is retrieved by the main thread.
- Correct whether or not the helper thread succeeds at preloading the data: the helper thread is read-only.
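A minimal sketch of the work-ahead set and the main thread's post-and-retrieve step; the slot layout and function names are ours, and memory-ordering details are deliberately ignored:

```cpp
#include <cstddef>

// A minimal work-ahead set: a fixed-size ring of (address, state) slots.
// "state" records where the main thread will resume its computation when
// it gets this entry back.
struct Slot {
    void* address = nullptr;
    int   state   = 0;
};

constexpr std::size_t kSlots = 128;   // the size that worked best in the paper
Slot work_ahead[kSlots];

// Main thread: publish the address it will need in the future together with
// its computation state, and take back the oldest entry in exchange. If the
// helper thread reached that entry in time, its data is now cache resident;
// if not, the main thread simply absorbs the miss, so the result is correct
// either way.
Slot post_and_retrieve(std::size_t& pos, void* future_address, int state) {
    Slot previous = work_ahead[pos];
    work_ahead[pos] = Slot{future_address, state};
    pos = (pos + 1) % kSlots;
    return previous;
}
```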

Work-ahead Set Data Structure (diagram): a fixed-size array of (state, address) slots. The main thread fills the slots with entries A1, B1, C1, ... and, once the array is full, cycles around and overwrites them with newer entries G1, H2, I2, ...

Work-Ahead Set: Helper Thread
- Reads memory addresses from the work-ahead set and loads their contents.
- The data becomes cache resident.
- Tries to preload the data before the main thread cycles around.
- If successful, the main thread experiences cache hits.

Work-ahead Set Data Structure (diagram): the helper thread scans the slots and touches each posted address with an explicit read, "temp += *slot[i]", bringing the data into the cache.

Iterate Backwards! The helper thread walks the work-ahead set in reverse order, i = (i - 1) mod size. Why? See the paper.
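Continuing the sketch above, the helper thread's loop could look like this; the backwards iteration and the explicit read mirror the slide:

```cpp
#include <atomic>
#include <cstddef>

std::atomic<bool> operator_done{false};   // set by the main thread at the end

// Helper thread: walk the slots backwards, i = (i - 1) mod size, and touch
// each posted address with a real (volatile) read, as in "temp += *slot[i]".
// The helper only reads operator data, so it can never make the main thread
// incorrect; at worst its loads are wasted. Races on the slot array are
// tolerated by design and ignored in this sketch.
void helper_loop() {
    long temp = 0;
    std::size_t i = 0;
    while (!operator_done.load(std::memory_order_relaxed)) {
        i = (i + kSlots - 1) % kSlots;                  // iterate backwards
        const volatile int* p =
            static_cast<const volatile int*>(work_ahead[i].address);
        if (p != nullptr)
            temp += *p;                                 // explicit load
    }
    (void)temp;
}
```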

Helper Thread Speed
If the helper thread is faster than the main thread (more computation than memory latency):
- The helper thread should not preload twice (wasted CPU cycles).
- See the paper for how to stop redundant loads.
If the helper thread is slower:
- No special tuning is necessary.
- The main thread will absorb some cache misses.
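The paper has its own mechanism for suppressing redundant loads; purely as an illustration of the idea (not the authors' design), one could add a per-slot flag that the main thread clears when posting and the helper sets after loading:

```cpp
// Hypothetical variant of the Slot sketch above; NOT the paper's mechanism.
struct FlaggedSlot {
    void* address = nullptr;
    int   state   = 0;
    bool  loaded  = false;   // cleared on post by the main thread,
                             // set by the helper after preloading
};

// Helper-side check: skip slots that were already preloaded.
inline bool should_preload(const FlaggedSlot& s) {
    return s.address != nullptr && !s.loaded;
}
```

A write by the helper does, however, dirty the slot's cache line and could itself trigger MOMC events, so a real implementation has to weigh that cost; see the paper for the authors' solution.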

Work-Ahead Set Size
- Too large: cache pollution. Preloaded data evicts other preloaded data before it can be used.
- Too small: thread contention. Many MOMC events, because the work-ahead set spans few cache lines.
- Just right: experimentally determined. Use the smallest size within the acceptable range (performance plateaus), so that cache space is available for other purposes; for us, 128 entries.
- The data structure itself is much smaller than the L2 cache.

Experimental Workload
Two operators:
- Probe phase of hash join
- CSB+-tree index join
Operators are run in isolation and in parallel. Intel VTune is used to measure hardware events.
Platform:
- CPU: Pentium 4 with Hyperthreading
- Memory: 2 GB DDR
- L1, L2 size: 8 KB, 512 KB
- L1, L2 cache-line size: 64 B, 128 B
- L1 miss latency: 18 cycles
- L2 miss latency: 276 cycles
- MOMC latency: ~300+ cycles

Experimental Outline 1. Hash join 2. Index lookup 3. Mixed: Hash join and index lookup

Hash Join Comparative Performance

Hash Join L2 Cache Misses Per Tuple

CSB+-Tree Index Join Comparative Performance

CSB+-Tree Index Join L2 Cache Misses Per Tuple

Parallel Operator Performance (chart annotations: 52%, 55%, 20%)

Parallel Operator Performance (chart annotations: 26%, 29%)

Conclusion

                      Naïve parallel   SMT-Aware      Work-Ahead
Impl. effort          Small            Moderate       Moderate
Data format           Unchanged        Split output   Unchanged
Data order            Unchanged        Changed        Unchanged*
Performance (row)                      Moderate       High
Performance (col)     Moderate         High           Moderate
Control of cache      No               No             Yes