Exploiting Multithreaded Architectures to Improve Data Management Operations. Layali Rashid, The Advanced Computer Architecture Group (ACAG), Department of Electrical and Computer Engineering, University of Calgary

2 Outline
The SMT and the CMP Architectures
Join (Hash Join): Motivation, Algorithm, Results
Sort (Radix and Quick Sorts): Motivation, Algorithms, Results
Index (CSB+-Tree): Motivation, Algorithm, Results
Conclusions

3 The SMT and the CMP Architectures
Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor core, sharing its execution resources.
Chip Multiprocessor (CMP): more than one processor core is integrated on a single chip.
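To make the software view of these architectures concrete, the short sketch below (our illustration, not part of the slides) launches one worker per hardware thread context; std::thread::hardware_concurrency() reports logical processors, so it counts SMT contexts and CMP cores alike.

// Minimal sketch: one worker per hardware context, whether SMT thread or CMP core.
#include <cstdio>
#include <thread>
#include <vector>

static void worker(unsigned id) {
    // Each worker would run one partition of a database operation.
    std::printf("worker %u running on one hardware context\n", id);
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                 // the call may return 0 if the count is unknown
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker, i);
    for (auto& t : pool) t.join();
    return 0;
}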

4 Hash Join Motivation
Hash join is one of the most important operations in current commercial DBMSs.
The L2 cache load miss rate is a critical factor in main-memory hash join performance.
Our goal is to increase the level of parallelism in hash join.

5 Architecture-Aware Hash Join (AA_HJ)
Build Index Partition Phase: tuples are divided equally between threads, and each thread fills its own set of L2-cache-sized clusters.
Build and Probe Index Partition Phase: one thread builds a hash table for each key range, while the other threads index-partition the probe relation as in the previous phase.
Probe Phase: see figure. A simplified code sketch of the three phases follows below.
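The sketch below is our own minimal rendering of these phases, not the thesis code; Tuple, THREADS, RANGES and the modulo range_of function are placeholders chosen only for illustration.

#include <cstdint>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };
using Cluster = std::vector<Tuple>;
using HashTab = std::unordered_multimap<uint32_t, uint32_t>;   // key -> R payload

constexpr int THREADS = 4;
constexpr int RANGES  = 64;          // chosen so one cluster fits in the L2 cache
static int range_of(uint32_t k) { return k % RANGES; }

// Phase 1 (and the probe-side partitioning of phase 2): one thread partitions
// its equal slice of a relation into its own key-range clusters.
static void index_partition(const std::vector<Tuple>& rel, size_t lo, size_t hi,
                            std::vector<Cluster>& mine) {
    mine.assign(RANGES, {});
    for (size_t i = lo; i < hi; ++i) mine[range_of(rel[i].key)].push_back(rel[i]);
}

int main() {
    std::vector<Tuple> R(1 << 20), S(1 << 20);                  // toy relations
    for (uint32_t i = 0; i < R.size(); ++i) R[i] = {i, i};
    for (uint32_t i = 0; i < S.size(); ++i) S[i] = {i % 4096, i};

    // Phase 1: all threads index-partition R in parallel.
    std::vector<std::vector<Cluster>> partsR(THREADS), partsS(THREADS);
    auto partition_all = [&](const std::vector<Tuple>& rel,
                             std::vector<std::vector<Cluster>>& parts) {
        std::vector<std::thread> pool;
        size_t chunk = rel.size() / THREADS;
        for (int t = 0; t < THREADS; ++t)
            pool.emplace_back(index_partition, std::cref(rel), t * chunk,
                              t == THREADS - 1 ? rel.size() : (t + 1) * chunk,
                              std::ref(parts[t]));
        for (auto& th : pool) th.join();
    };
    partition_all(R, partsR);

    // Phase 2: hash tables are built per key range (one builder thread in AA_HJ)
    // while the remaining threads index-partition S; shown back to back here.
    std::vector<HashTab> tables(RANGES);
    for (auto& part : partsR)
        for (int r = 0; r < RANGES; ++r)
            for (const Tuple& t : part[r]) tables[r].emplace(t.key, t.payload);
    partition_all(S, partsS);

    // Phase 3: threads probe disjoint key ranges against the shared tables.
    std::vector<std::thread> pool;
    std::vector<size_t> match_count(THREADS, 0);
    for (int t = 0; t < THREADS; ++t)
        pool.emplace_back([&, t] {
            for (int r = t; r < RANGES; r += THREADS)
                for (auto& part : partsS)
                    for (const Tuple& s : part[r])
                        match_count[t] += tables[r].count(s.key);
        });
    for (auto& th : pool) th.join();
    return 0;
}

Each thread writes only its own clusters in phase 1, and in phase 3 the threads probe disjoint key ranges, so neither phase needs locks.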

6 AA_HJ Results
We achieve speedups ranging from 2 to 4.6 compared to PT on a quad Intel Xeon dual-core server.
Speedups on the Pentium 4 with HT range from 2.1 to 2.9 compared to PT.

7 Memory Analysis for Multithreaded AA_HJ
The decrease in the L2 load miss rate is due to cache-sized index partitioning, constructive cache sharing, and group prefetching.
There is a minor increase in the L1 data cache load miss rate, from 1.5% to 4%.
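Group prefetching overlaps the cache misses of many probes instead of paying for them one at a time. Below is a minimal sketch of the idea on a chained hash table; the Bucket layout and GROUP size are our assumptions, and __builtin_prefetch is a GCC/Clang builtin (other compilers need their own intrinsic).

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bucket header per directory slot; chains hang off `next`. Empty slots
// are assumed to hold a sentinel key that never matches.
struct Bucket { uint32_t key; uint32_t payload; Bucket* next; };

constexpr size_t GROUP = 16;          // number of probes whose misses we overlap

void probe_with_group_prefetch(const std::vector<uint32_t>& probe_keys,
                               std::vector<Bucket>& dir,      // power-of-two size
                               std::vector<uint32_t>& matches) {
    const size_t mask = dir.size() - 1;
    for (size_t base = 0; base < probe_keys.size(); base += GROUP) {
        const size_t end = std::min(base + GROUP, probe_keys.size());

        // Stage 1: issue a prefetch for every bucket header touched by the group.
        for (size_t i = base; i < end; ++i)
            __builtin_prefetch(&dir[probe_keys[i] & mask], 0, 1);

        // Stage 2: revisit the group; the headers should now be cache-resident,
        // so the misses of up to GROUP probes were serviced concurrently.
        for (size_t i = base; i < end; ++i)
            for (const Bucket* b = &dir[probe_keys[i] & mask]; b != nullptr; b = b->next)
                if (b->key == probe_keys[i]) matches.push_back(b->payload);
    }
}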

8 The Sort Motivation
Some researchers find that sort algorithms suffer from high level-two cache miss rates, whereas others point out that radix sort has high TLB miss rates.
In addition, the fact that most sort algorithms are sequential makes it difficult to derive efficient parallel sort algorithms.
In our work we target radix sort (a distribution-based sort) and quicksort (a comparison-based sort).

9 Our Parallel Sorts
Radix Sort: a hybrid of Partition Parallel Radix Sort and Cache-Conscious Radix Sort. Large destination buckets are repartitioned only when they are significantly larger than the L2 cache size.
Quick Sort: based on Fast Parallel Quick Sort. We dynamically balance the load across threads, improve thread parallelism during the sequential clean-up sort, and stop the recursive partitioning when the subarray size is close to the largest cache size (see the sketch below).
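The following is a hedged sketch of the quicksort side under our own simplifications (it is not the thesis implementation): recursion stops once a piece is at most CACHE_ELEMS elements, a stand-in for the largest cache size, and the resulting pieces are drained from a shared queue so idle threads pick up whatever work remains.

#include <algorithm>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

constexpr size_t CACHE_ELEMS = 1 << 18;    // ~1 MB of ints: "fits in the largest cache"
struct Range { size_t lo, hi; };           // half-open [lo, hi)

// Partition recursively until a piece fits in cache, then queue it for sorting.
static void split(std::vector<int>& a, Range r, std::deque<Range>& q, std::mutex& m) {
    while (r.hi - r.lo > CACHE_ELEMS) {
        int pivot = a[r.lo + (r.hi - r.lo) / 2];
        size_t mid = std::partition(a.begin() + r.lo, a.begin() + r.hi,
                                    [pivot](int x) { return x < pivot; }) - a.begin();
        if (mid == r.lo || mid == r.hi) break;   // degenerate pivot: stop early
        split(a, {mid, r.hi}, q, m);             // recurse on one side,
        r.hi = mid;                              // keep looping on the other
    }
    std::lock_guard<std::mutex> g(m);
    q.push_back(r);
}

int main() {
    std::vector<int> a(1 << 22);
    for (size_t i = 0; i < a.size(); ++i)
        a[i] = static_cast<int>((i * 2654435761u) % a.size());   // toy data

    std::deque<Range> q;
    std::mutex m;
    split(a, {0, a.size()}, q, m);               // splitting phase

    // Pieces are already in key order relative to each other, so each can be
    // sorted independently; pulling work from a shared queue keeps the threads
    // load-balanced even when pieces have unequal sizes.
    auto worker = [&] {
        for (;;) {
            Range r;
            {
                std::lock_guard<std::mutex> g(m);
                if (q.empty()) return;
                r = q.front();
                q.pop_front();
            }
            std::sort(a.begin() + r.lo, a.begin() + r.hi);
        }
    };
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return 0;
}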

10 The Sort Timing for the Random Datasets on the SMT Architecture
Radix sort and quicksort show low L1 and L2 cache miss rates on our machines.
Radix sort has a DTLB store miss rate of up to 26%.
Radix sort achieves only a slight speedup on the SMT architecture, not exceeding 3%, due to its CPU-intensive nature.
Improvements in execution time for quicksort are about 25% to 30%.

11 The Sort Timing for the Random Datasets on the CMP Architecture
Our speedups for radix sort range from 54% with two threads up to 300% as the thread count increases from 2 to 8.
Our speedups for quicksort range from 34% to 417%.

12 The Index Motivation
Although the CSB+-tree achieves a significant speedup over B+-trees, experiments show that a large fraction of its execution time is still spent waiting for data.
The L2 load miss rate for the single-threaded CSB+-tree is as high as 42%.

13 Dual-Threaded CSB+-Tree
One CSB+-tree shared by both threads.
A single thread performs the bulkloading; two threads perform the probing.
Unlike inserts and deletes, search needs no synchronization since it involves reads only. A rough sketch of this scheme follows below.
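The sketch below uses a heavily simplified stand-in for the CSB+-tree (fixed fanout, two levels, toy data) of our own devising; what it illustrates is that the children of a node sit in one contiguous block, and that the two probing threads share the tree without any locks.

#include <cstdint>
#include <thread>
#include <vector>

constexpr int FANOUT = 4;                    // keys per node (toy value)

// A CSB+-style node: children of a node sit in one contiguous block, so the
// node stores a single child pointer instead of one pointer per key.
struct Node {
    int      nkeys = 0;
    uint32_t keys[FANOUT] = {};
    uint32_t payloads[FANOUT] = {};          // used by leaves
    Node*    children = nullptr;             // first of FANOUT+1 contiguous children
    bool     leaf = true;
};

// Read-only search: safe to run from several threads without any locking.
static bool search(const Node* n, uint32_t key, uint32_t& payload_out) {
    while (!n->leaf) {
        int i = 0;
        while (i < n->nkeys && key >= n->keys[i]) ++i;
        n = &n->children[i];                 // child i is found by offset, not pointer chasing
    }
    for (int i = 0; i < n->nkeys; ++i)
        if (n->keys[i] == key) { payload_out = n->payloads[i]; return true; }
    return false;
}

int main() {
    // Bulkload by a single thread: a root over one contiguous block of leaves.
    std::vector<Node> leaves(FANOUT + 1);
    Node root;
    root.leaf = false;
    root.children = leaves.data();
    uint32_t next_key = 0;
    for (int c = 0; c <= FANOUT; ++c) {
        for (int i = 0; i < FANOUT; ++i, ++next_key) {
            leaves[c].keys[i] = next_key;
            leaves[c].payloads[i] = next_key * 10;
        }
        leaves[c].nkeys = FANOUT;
        if (c < FANOUT) root.keys[c] = next_key;   // separator = first key of next leaf
    }
    root.nkeys = FANOUT;

    // Probe phase: the key set is split between two threads sharing the tree.
    std::vector<uint32_t> probes(1000);
    for (uint32_t i = 0; i < probes.size(); ++i) probes[i] = i % next_key;
    auto run = [&](size_t lo, size_t hi) {
        uint32_t payload;
        for (size_t i = lo; i < hi; ++i) search(&root, probes[i], payload);
    };
    std::thread t1(run, size_t{0}, probes.size() / 2);
    std::thread t2(run, probes.size() / 2, probes.size());
    t1.join(); t2.join();
    return 0;
}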

14 Index Results
Speedups for the dual-threaded CSB+-tree range from 19% to 68% compared to the single-threaded CSB+-tree.
Using two threads for this memory-bound operation provides more opportunities to keep the functional units busy.
Sharing one CSB+-tree between the two threads results in constructive cache behaviour and a 6% to 8% reduction in the L2 miss rate.

15 Conclusions
State-of-the-art parallel architectures (SMT and CMP) open opportunities for software to better utilize the underlying hardware resources, and efficient implementations of database operations are essential.
We propose architecture-aware multithreaded algorithms for the most important database operations (joins, sorts and indexes).
We characterize the timing and memory behaviour of these database operations.

16 The End

17 Backup Slides

18 Figure 1-1: The SMT Architecture

19 Figure 1-2: Comparison between the SMT and the Dual Core Architectures

20 Figure 1-3: Combining the SMT and the CMP Architectures

21 Figure 2-1: The L1 Data Cache Load Miss Rate for Hash Join

22 Figure 2-2: The L2 Cache Load Miss Rate for Hash Join

23 Figure 2-3: The Trace Cache Miss Rate for Hash Join

24 Figure 2-4: Typical Relational Table in RDBMS

25 Figure 2-5: Database Join

26 Figure 2-6: Hash Equi-join Process

27 Figure 2-7: Hash Table Structure

28 Figure 2-8: Hash Join Base Algorithm
partition R into R_0, R_1, ..., R_(n-1)
partition S into S_0, S_1, ..., S_(n-1)
for i = 0 to n-1
    use R_i to build hash-table_i
for i = 0 to n-1
    probe S_i using hash-table_i
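For reference, here is a compilable rendering of this base algorithm with simplified types of our own (Tuple and the modulo partitioning function are placeholders, not the thesis code).

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

std::vector<std::pair<uint32_t, uint32_t>>
hash_join_base(const std::vector<Tuple>& R, const std::vector<Tuple>& S, int n) {
    std::vector<std::vector<Tuple>> Rp(n), Sp(n);
    for (const Tuple& r : R) Rp[r.key % n].push_back(r);   // partition R into R_0..R_(n-1)
    for (const Tuple& s : S) Sp[s.key % n].push_back(s);   // partition S into S_0..S_(n-1)

    std::vector<std::pair<uint32_t, uint32_t>> out;
    for (int i = 0; i < n; ++i) {
        std::unordered_multimap<uint32_t, uint32_t> ht;    // build hash-table_i from R_i
        for (const Tuple& r : Rp[i]) ht.emplace(r.key, r.payload);
        for (const Tuple& s : Sp[i]) {                     // probe S_i using hash-table_i
            auto hits = ht.equal_range(s.key);
            for (auto it = hits.first; it != hits.second; ++it)
                out.emplace_back(it->second, s.payload);   // (R payload, S payload)
        }
    }
    return out;
}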

29 Figure 2-9: AA_HJ Build Phase Executed by One Thread

30 Figure 2-10: AA_HJ Probe Index Partitioning Phase Executed by One Thread

31 Figure 2-11: AA_HJ S-Relation Partitioning and Probing Phases

32 Figure 2-12: AA_HJ Multithreaded Probing Algorithm

33 Table 2-1: Machine Specifications

34 Table 2-2: Number of Tuples for Machine 1

35 Table 2-3: Number of Tuples for Machine 2

36 Figure 2-13: Timing for Three Hash Join Partitioning Techniques

37 Figure 2-14: Memory Usage for Three Hash Join Partitioning Techniques

38 Figure 2-15: Timing for the Dual-Threaded Hash Join

39 Figure 2-16: Memory Usage for the Dual-Threaded Hash Join

40 Figure 2-17: Timing Comparison of All Hash Join Algorithms

41 Figure 2-18: Memory Usage Comparison of All Hash Join Algorithms

42 Figure 2-19: Speedups due to the AA_HJ+SMT and the AA_HJ+GP+SMT Algorithms

43 Figure 2-20: Varying the Number of Clusters for the AA_HJ+GP+SMT

44 Figure 2-21: Varying the Selectivity for Tuple Size = 100 Bytes

45 Figure 2-22: Time Breakdown Comparison for the Hash Join Algorithms for Tuple Sizes of 20 Bytes and 100 Bytes

46 Figure 2-23: Timing for the Multi-Threaded Architecture-Aware Hash Join

47 Figure 2-24: Speedups for the Multi-Threaded Architecture-Aware Hash Join

48 Figure 2-25: Memory Usage for the Multi-Threaded Architecture-Aware Hash Join

49 Figure 2-26: Time Breakdown Comparison for Hash Join Algorithms

50 Figure 2-27: The L1 Data Cache Load Miss Rate for NPT and AA_HJ

51 Figure 2-28: Number of Loads for NPT and AA_HJ

52 Figure 2-29: The L2 Cache Load Miss Rate for NPT and AA_HJ

53 Figure 2-30: The Trace Cache Miss Rate for NPT and AA_HJ

54 Figure 2-31: The DTLB Load Miss Rate for NPT and AA_HJ

55 Figure 3-1: The LSD Radix Sort
for (i = 0; i < number_of_digits; i++)
    sort source-array based on digit i;
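As an illustrative sketch of the per-digit pass (a standard textbook form, not the thesis implementation), here is a counting LSD radix sort over 8-bit digits of 32-bit keys.

#include <cstdint>
#include <vector>

void lsd_radix_sort(std::vector<uint32_t>& a) {
    const int DIGIT_BITS = 8, RADIX = 1 << DIGIT_BITS;
    std::vector<uint32_t> buf(a.size());
    for (int shift = 0; shift < 32; shift += DIGIT_BITS) {              // one pass per digit
        uint32_t count[RADIX] = {};
        for (uint32_t x : a) ++count[(x >> shift) & (RADIX - 1)];       // histogram the digit
        uint32_t pos = 0;                                               // prefix sums -> start offsets
        for (int d = 0; d < RADIX; ++d) { uint32_t c = count[d]; count[d] = pos; pos += c; }
        for (uint32_t x : a) buf[count[(x >> shift) & (RADIX - 1)]++] = x;  // stable scatter
        a.swap(buf);                                                    // even pass count: result ends in a
    }
}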

56 Figure 3-2: The Counting LSD Radix Sort Algorithm

57 Figure 3-3: Parallel Radix Sort Algorithm

58 Table 3-1: Memory Characterization for LSD Radix Sort with Different Datasets

59 Figure 3-4: Radix Sort Timing for the Random Datasets on Machine 2

60 Figure 3-5: Radix Sort Timing for the Gaussian Datasets on Machine 2

61 Figure 3-6: Radix Sort Timing for the Zero Datasets on Machine 2

62 Figure 3-7: Radix Sort Timing for the Random Datasets on Machine 1

63 Figure 3-8: Radix Sort Timing for the Gaussian Datasets on Machine 1

64 Figure 3-9: Radix Sort Timing for the Zero Datasets on Machine 1

65 Figure 3-10: The DTLB Store Miss Rate for the Radix Sort on Machine 2 (Random Datasets)

66 Figure 3-11: The L1 Data Cache Load Miss Rate for the Radix Sort on Machine 2 (Random Datasets)

67 Table 3-2: Memory Characterization for Memory-Tuned Quicksort with Different Datasets

68 Figure 3-12: Quicksort Timing for the Random Datasets on Machine 2

69 Figure 3-13: Quicksort Timing for the Random Datasets on Machine 1

70 Figure 3-14: Quicksort Timing for the Gaussian Datasets on Machine 2

71 Figure 3-15: Quicksort Timing for the Gaussian Datasets on Machine 1

72 Figure 3-16: Quicksort Timing for the Zero Datasets on Machine 2

73 Figure 3-17: Quicksort Timing for the Zero Datasets on Machine 1

74 Table 3-3: The Sort Results for Machine 1

75 Table 3-4: The Sort Results for Machine 2

76 Figure 4-1: Search Operation on an Index Tree

77 Figure 4-2: Differences between the B+-Tree and the CSB+-Tree

78 Figure 4-3: Dual-Threaded CSB+-Tree for the SMT Architectures

79 Figure 4-4: Timing for the Single- and Dual-Threaded CSB+-Tree

80 Figure 4-5: The L1 Data Cache Load Miss Rate for the Single- and Dual-Threaded CSB+-Tree

81 Figure 4-6: The Trace Cache Miss Rate for the Single- and Dual-Threaded CSB+-Tree

82 Figure 4-7: The L2 Load Miss Rate for the Single- and Dual-Threaded CSB+-Tree

83 Figure 4-8: The DTLB Load Miss Rate for the Single- and Dual-Threaded CSB+-Tree

84 Figure 4-9: The ITLB Load Miss Rate for the Single- and Dual-Threaded CSB+-Tree