Exploiting Multithreaded Architectures to Improve the Hash Join Operation
Layali Rashid, Wessam M. Hassanein, and Moustafa A. Hammad*
The Advanced Computer Architecture Group @ U of C (ACAG)
Department of Electrical and Computer Engineering
*Department of Computer Science
University of Calgary

Outline
- The SMT and the CMP Architectures
- The Hash Join Database Operation
- Motivation
- Architecture-Aware Hash Join
- Experimental Methodology
- Timing and Memory Analysis
- Conclusions

The SMT and the CMP Architectures
Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor core, sharing its execution resources.
Chip Multiprocessor (CMP): more than one processor core is integrated on a single chip.

The Hash Join Database Operation
- The hash join process
- The partition-based hash join algorithm
(The baseline join, the simple nested-loop "for-loop" join, is not shown on this slide.)
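As a reference for the steps above, here is a minimal, self-contained C sketch of a partition-based (Grace-style) hash join on integer keys. It is an illustration only, not the paper's code; the partition count, bucket count, hash function, and sizing are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int key; int payload; } tuple_t;

    #define NUM_PARTS 4        /* illustrative partition count */
    #define BUCKETS   1024     /* hash-table buckets per partition */

    static unsigned hash_key(int key) { return (unsigned)key * 2654435761u; }

    /* Phase 1: scatter a relation into NUM_PARTS partitions by a hash of the
     * key, so each partition pair (R_p, S_p) can be joined independently. */
    static void partition(const tuple_t *rel, size_t n, tuple_t **parts, size_t *counts) {
        for (size_t i = 0; i < n; i++) {
            unsigned p = hash_key(rel[i].key) % NUM_PARTS;
            parts[p][counts[p]++] = rel[i];
        }
    }

    /* Phases 2-3: build a chained hash table on R_p, then probe it with S_p. */
    static size_t build_and_probe(const tuple_t *Rp, size_t nR, const tuple_t *Sp, size_t nS) {
        int *head = malloc(BUCKETS * sizeof *head);
        int *next = malloc((nR ? nR : 1) * sizeof *next);
        for (size_t b = 0; b < BUCKETS; b++) head[b] = -1;
        for (size_t i = 0; i < nR; i++) {                 /* build */
            unsigned b = hash_key(Rp[i].key) % BUCKETS;
            next[i] = head[b];
            head[b] = (int)i;
        }
        size_t matches = 0;
        for (size_t j = 0; j < nS; j++) {                 /* probe */
            unsigned b = hash_key(Sp[j].key) % BUCKETS;
            for (int i = head[b]; i != -1; i = next[i])
                if (Rp[i].key == Sp[j].key) matches++;
        }
        free(head); free(next);
        return matches;
    }

    int main(void) {
        enum { NR = 1000, NS = 2000 };
        tuple_t *R = malloc(NR * sizeof *R), *S = malloc(NS * sizeof *S);
        for (int i = 0; i < NR; i++) R[i] = (tuple_t){ i, i };
        for (int j = 0; j < NS; j++) S[j] = (tuple_t){ j % NR, j };

        tuple_t *Rp[NUM_PARTS], *Sp[NUM_PARTS];
        size_t nRp[NUM_PARTS] = {0}, nSp[NUM_PARTS] = {0};
        for (int p = 0; p < NUM_PARTS; p++) {             /* worst-case sizing, for brevity */
            Rp[p] = malloc(NR * sizeof *Rp[p]);
            Sp[p] = malloc(NS * sizeof *Sp[p]);
        }
        partition(R, NR, Rp, nRp);
        partition(S, NS, Sp, nSp);

        size_t total = 0;
        for (int p = 0; p < NUM_PARTS; p++)
            total += build_and_probe(Rp[p], nRp[p], Sp[p], nSp[p]);
        printf("join produced %zu matches (expected %d)\n", total, NS);
        return 0;
    }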

Motivation: Characterizing the Grace hash join on a multithreaded machine
- Multithreaded architectures create new opportunities for improving essential DBMS operations.
- Hash join is one of the most important operations in current commercial DBMSs.
- The L2 cache load miss rate is a critical factor in main-memory hash join performance.
Therefore, we have two goals:
- Utilize the multiple hardware threads.
- Decrease the L2 miss rate.

Architecture-Aware Hash Join (AA_HJ)
The R-relation index partition phase
- Tuples are divided equally between the threads; each thread scatters its share into its own set of L2-cache-sized clusters, so that each cluster (and the hash table later built on it) fits in the L2 cache.
The build and S-relation index partition phase
- One thread builds a hash table from each key range (this description assumes two threads; the paper generalizes the algorithm to an arbitrary number of threads).
- The other threads index-partition the probe relation.
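A minimal OpenMP sketch of the index partition phase described above, assuming cluster storage is allocated by the caller and using a hash of the key in place of the paper's key-range split; the 2 MB L2 size and all names are illustrative assumptions, not the authors' implementation.

    #include <omp.h>
    #include <stddef.h>

    typedef struct { int key; int payload; } tuple_t;

    /* Assumed L2 capacity; clusters are sized so that one cluster (and the hash
     * table later built on it) fits in the L2 cache. 2 MB is only illustrative. */
    #define L2_BYTES (2u * 1024u * 1024u)

    static size_t num_clusters(size_t ntuples) {
        return (ntuples * sizeof(tuple_t) + L2_BYTES - 1) / L2_BYTES;
    }

    /* Each thread takes an equal slice of R and scatters it into its own private
     * set of clusters (clusters[t][c], with fill counters counts[t][c]), so no
     * synchronization is needed while partitioning. */
    void index_partition(const tuple_t *R, size_t n, int nthreads,
                         tuple_t ***clusters, size_t **counts) {
        size_t nc = num_clusters(n);
        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            size_t lo = n * (size_t)t / nthreads;
            size_t hi = n * (size_t)(t + 1) / nthreads;
            for (size_t i = lo; i < hi; i++) {
                /* a hash of the key stands in for the paper's key-range split */
                size_t c = ((unsigned)R[i].key * 2654435761u) % nc;
                clusters[t][c][counts[t][c]++] = R[i];
            }
        }
    }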

Architecture-Aware Hash Join (cont'd)
The probe phase
- Random accesses into the hash tables while searching for potential matches are the main challenge.
- All threads probe hash tables covering the same key range at the same time, which increases temporal and spatial locality in the shared cache.
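A hedged sketch of the probe-phase scheduling idea: every thread probes the hash table of the same key range at the same time, so that table stays hot in the shared cache. The probe_one helper and the data layout are assumed for illustration; this is not the paper's code.

    #include <omp.h>
    #include <stddef.h>

    typedef struct { int key; int payload; } tuple_t;
    typedef struct hash_table hash_table_t;   /* one table per key range (build phase) */

    /* Assumed helper: probes one hash table with an array of S tuples. */
    size_t probe_one(const hash_table_t *ht, const tuple_t *s, size_t n);

    /* All threads probe the SAME key range at the same time: while key range k
     * is active, its hash table is the hot data in the shared cache, so the
     * threads' accesses reinforce each other (constructive sharing) instead of
     * evicting each other's working sets. */
    size_t probe_all(hash_table_t **tables, size_t nranges,
                     tuple_t ***S_clusters, size_t **S_counts, int nthreads) {
        size_t total = 0;
        for (size_t k = 0; k < nranges; k++) {            /* one key range at a time */
            #pragma omp parallel for num_threads(nthreads) reduction(+:total)
            for (int t = 0; t < nthreads; t++)            /* each thread probes its own S cluster */
                total += probe_one(tables[k], S_clusters[t][k], S_counts[t][k]);
        }
        return total;
    }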

Experimental Methodology
We ran our algorithms on two machines: Machine 1, a Pentium 4 with Hyper-Threading (SMT), and Machine 2, a quad Intel Xeon dual-core server (CMP).

Experimental Methodology (cont'd)
- All algorithms are implemented in C.
- We use OpenMP (through the compiler's built-in C/C++ support) to manage parallelism.
- For Machine 1 we used a 50 MByte build relation and a 100 MByte probe relation; for Machine 2 we used a 250 MByte build relation and a 500 MByte probe relation.
- We used the Intel VTune Performance Analyzer for Linux 9.0 to collect hardware events.
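A small, hypothetical OpenMP driver of the kind such experiments typically use: it sets the thread count and times the join with omp_get_wtime. The aa_hash_join entry point is a placeholder, not the authors' API.

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical entry point for the join under test; not the authors' code. */
    void aa_hash_join(int nthreads);

    int main(void) {
        int threads[] = { 2, 4, 8, 12, 16 };      /* thread counts reported in the results */
        for (int i = 0; i < 5; i++) {
            omp_set_num_threads(threads[i]);      /* let OpenMP manage the worker threads */
            double t0 = omp_get_wtime();
            aa_hash_join(threads[i]);
            printf("%2d threads: %.3f s\n", threads[i], omp_get_wtime() - t0);
        }
        return 0;
    }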

AA_HJ Timing Results
- We achieve speedups ranging from 2 to 4.6 over the Grace hash join on the quad Intel Xeon dual-core server (Machine 2).
- Speedups on the Pentium 4 with HT (Machine 1) range from 2.1 to 2.9 over the Grace hash join.
- Speedups do not improve further beyond 12 threads.
Legend: PT = copy-partitioning hash join, NPT = non-partitioning hash join, Index PT = index-partitioning hash join; the labels 2, 4, 8, 12, and 16 denote the number of threads.

Memory Analysis for Multithreaded AA_HJ
- The decrease in the L2 load miss rate is due to the cache-sized index partitioning, constructive cache sharing, and group prefetching.
- There is a minor increase in the L1 data cache load miss rate, from 1.5% to 4%, on Machine 2.
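Group prefetching, one of the three factors listed above, can be sketched as follows: prefetch the hash-bucket headers for a whole group of probe tuples, then process the group once the cache lines have (hopefully) arrived. This is an illustration assuming a chained hash table and the GCC/Clang __builtin_prefetch intrinsic, not the authors' implementation.

    #include <stddef.h>

    typedef struct { int key; int payload; } tuple_t;

    #define GROUP 16   /* probe tuples per prefetch group (illustrative) */

    /* Stage 1: compute the bucket index for every tuple in a small group and
     * issue a prefetch for each bucket header. Stage 2: revisit the group; by
     * then the headers should already be in cache, hiding the miss latency. */
    size_t probe_with_group_prefetch(const int *head, const int *next,
                                     const tuple_t *build, size_t nbuckets,
                                     const tuple_t *probe, size_t n) {
        size_t matches = 0;
        for (size_t base = 0; base < n; base += GROUP) {
            size_t end = base + GROUP < n ? base + GROUP : n;
            size_t bucket[GROUP];
            for (size_t j = base; j < end; j++) {             /* stage 1: prefetch */
                bucket[j - base] = ((unsigned)probe[j].key * 2654435761u) % nbuckets;
                __builtin_prefetch(&head[bucket[j - base]], 0 /* read */, 1);
            }
            for (size_t j = base; j < end; j++)               /* stage 2: probe */
                for (int i = head[bucket[j - base]]; i != -1; i = next[i])
                    if (build[i].key == probe[j].key) matches++;
        }
        return matches;
    }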

Conclusions
- Revisiting the join implementation to take advantage of state-of-the-art hardware improvements is an important direction for boosting the performance of DBMSs.
- We confirmed previous findings that the hash join is bound by the L2 miss rate, which ranges from 29% to 62% in our measurements.
- We proposed an Architecture-Aware Hash Join (AA_HJ) that relies on sharing critical data structures between working threads at the cache level.
- AA_HJ decreases the L2 cache load miss rate from 62% to 11% and from 29% to 15% for tuple sizes of 20 bytes and 140 bytes, respectively.

The End

Backup: Time Breakdown Comparison (Machine 2)