

Sort in GPDB
Feng Tian
Greenplum Inc.

WARNING: NON-TECH SLIDES
- Why (NOW)?
  - Real customers, real problems.
  - About to get the code into MAIN
    - Make Joy/Brian's code reading experience easier.

Outline (Doesn't this look familiar?)
- Motivation
- Review of Current Status
- Improve Sort Performance
- Remaining Work

Sort in Database
- One of the most important operators
  - Order by
  - Group by
  - OLAP
    - Rollup and Cube
    - Window (partition by and order by)
  - Merge join
  - Build index

Sort in GPDB
- One of the most mysterious operators
  - "Sort is slow" vs. "Sort is OK"
  - Fix the planner to avoid sort vs. fix sort

Sort is fun
- One of the most extensively studied algorithms
- In-memory sorting algorithms
  - CK always has some interesting links
  - Jie challenged my interview question
  - Sedgewick: Quicksort is optimal
  - Bentley & McIlroy, 93
- External sort
  - TAOCP

GPDB Sort is funny
- Good
  - Honest TAOCP
  - Honest BM93
- Bad
  - Equal keys
  - Lots of columns
  - Sort strings
- Ugly
  - Combination of the bads

Goal
- Get rid of the ugly part of GPDB Sort.

Outline
- Motivation
- Review of Current Status
- Improve Sort Performance
- Remaining Work

GPDB Sort
- Quicksort if entries fit in memory
- External sort
  - An honest implementation from TAOCP
    - I/O pattern is pretty good
    - Amount of I/O when sorting tuples is OK
  - No compression
  - Sorting datum is terrible, but not a concern at this moment
    - Only used for distinct
    - May eventually be replaced by hash
  - Use heap to merge
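
The split between in-memory sort and external sort with a heap-based merge can be sketched roughly as below. This is only an illustration in Python, not the GPDB code: `make_runs`, `merge_runs`, `external_sort` and `memory_limit` are hypothetical names, runs are plain lists instead of tapes on disk, and runs are produced by sorting fixed-size chunks rather than by the heap-based run generation the slides describe.

```python
import heapq

def make_runs(items, memory_limit):
    """Cut the input into sorted runs of at most memory_limit entries each."""
    runs = []
    for start in range(0, len(items), memory_limit):
        # Each chunk "fits in memory", so it is sorted in one go.
        runs.append(sorted(items[start:start + memory_limit]))
    return runs

def merge_runs(runs):
    """Merge sorted runs using a min-heap keyed on the head of each run."""
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[idx]):
            # Refill the heap from the run that just gave up its head.
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out

def external_sort(items, memory_limit):
    """Sort items as if only memory_limit entries fit in memory at once."""
    return merge_runs(make_runs(items, memory_limit))
```

For example, `external_sort([5, 3, 8, 1, 9, 2, 7], 3)` builds three runs and merges them through the heap.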

GPDB Sort
- Details
  - Cost of comparison
    - Non-trivial overhead
    - (Unicode) string compare is extremely slow
  - strcoll vs. strxfrm + strcmp
  - Cost of memtuple_getattr
    - It is way better than heap_getattr
    - Postgres devs have known this for a long time
- Cache first sort column
  - Sort (1, 'a'), (2, 'a'), (3, 'c')... is fast.
  - Sort (1, 'a'), (1, 'b'), (1, 'c')... is miserably slow.

Outline
- Motivation
- Review of Current Status
- Improve Sort Performance
- Remaining Work

Goal
- It should be “invisible”
  - No API change
  - Keep fast cases fast
- Slow cases? What slow cases?
  - Planner can honestly optimize a query without worrying about “avoiding” sort
  - User can write a query without trying to be creative
  - In cases where a sort cannot be avoided, it may save our neck.

Quicksort Is Optimal (Sedgewick)
- Equal keys
  - Equal keys are good (Bentley & McIlroy)
- Do not special-case small n
  - Why? Not sure. Cache oblivious?
- Multi-column sort keys
  - Comparisons get slower and slower

Quicksort
- As in the old algorithm, cache the first sort column
- Quicksort on the first column
- For each range with an equal first column, cache the second sort column and quicksort the range
- Repeat until all sort columns are processed
  - May stop early.
- Sort (1, 'a'), (2, 'b'), (3, 'c') will not compare strings at all.
- Sort (1, 'a'), (1, 'b'), (1, 'c') will only call memtuple_getattr when necessary.

Example
- (1, ?), (3, ?), (2, ?), (0, ?), (3, ?), (2, ?)
- Choose pivot (2, ?)
- (2, ?), (1, ?), (0, ?) :: (3, ?), (3, ?), (2, ?)
- Swap to middle
- (0, ?), (1, ?) :: (2, ?), (2, ?) :: (3, ?), (3, ?)

Recurse Down
- Quicksort each partition
- For the left and right parts, just quicksort.
- For the middle part, expand to level k+1
  - (2, ?), (2, ?)... (2, ?) to (2, 'a'), (2, 'x'), (2, 'd')... (2, 'z')
  - Of course, only if the middle has not expanded all levels
- NO EXTRA LEVEL EXPANSION
- NO EXTRA COMPARISON
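
A rough sketch of the multi-key quicksort idea in Python. It is an assumption-laden illustration, not the contents of mk_qsort.c: `mk_qsort`, `getattr_fn` and `ncols` are hypothetical names, `getattr_fn` stands in for memtuple_getattr, and the partition builds new lists instead of swapping in place as a Bentley & McIlroy partition would.

```python
def mk_qsort(rows, ncols, getattr_fn=lambda row, col: row[col]):
    """Sort rows by columns 0..ncols-1, fetching column k only for rows
    that tied on columns 0..k-1."""

    def sort_range(part, level):
        # Nothing to do for a single row, or once every column is used up.
        if len(part) <= 1 or level == ncols:
            return part
        # Cache the sort column for this level, once per row in the range.
        keyed = [(getattr_fn(row, level), row) for row in part]
        return qsort3(keyed, level)

    def qsort3(keyed, level):
        if len(keyed) <= 1:
            return [row for _, row in keyed]
        pivot = keyed[len(keyed) // 2][0]
        less = [kr for kr in keyed if kr[0] < pivot]
        equal = [row for key, row in keyed if key == pivot]
        greater = [kr for kr in keyed if kr[0] > pivot]
        # Left and right stay on this level; only the middle expands to level + 1.
        return (qsort3(less, level)
                + sort_range(equal, level + 1)
                + qsort3(greater, level))

    return sort_range(list(rows), 0)
```

With distinct first columns, as in (1, 'a'), (2, 'b'), (3, 'c'), every middle partition holds a single row, so `getattr_fn` is never called for the second column.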

Heapsort
- Used in external sort (both to produce runs and to merge runs)
- Cache the first sort column when inserting into the heap
- Expand to the (n+1)th sort column only when the first n columns equal those of the heap top
- Remember the level (lv) of expansion
  - Maintain an array of datum d
  - entry.sort_column[x] = d[x] if x < lv
- Siftup and siftdown
  - Siftdown hole

Heapsort Continued
- NO EXTRA EXPANSION
- NO EXTRA COMPARISON
- However, the code became more complicated.
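
The lazy column expansion in the heap can be approximated as follows. This is a simplified sketch, not mk_heap.c: `LazyEntry`, `heap_sort` and `fetch` are hypothetical names, and an entry here expands whenever it ties with any entry it is compared against, whereas the slides expand only relative to the heap top and track a shared level.

```python
import heapq

class LazyEntry:
    """Heap entry that fetches sort column k only after columns 0..k-1 tied."""

    def __init__(self, row, fetch):
        self.row = row
        self.fetch = fetch
        self.keys = [fetch(row, 0)]  # cache the first sort column on insert

    def key(self, level):
        # Expand one column at a time, and only on demand.
        while len(self.keys) <= level:
            self.keys.append(self.fetch(self.row, len(self.keys)))
        return self.keys[level]

    def __lt__(self, other):
        for level in range(len(self.row)):
            a, b = self.key(level), other.key(level)
            if a != b:
                return a < b
        return False  # all columns equal

def heap_sort(rows, fetch=lambda row, col: row[col]):
    """Sort rows by pushing them through a heap of lazily expanded entries."""
    heap = [LazyEntry(row, fetch) for row in rows]
    heapq.heapify(heap)
    return [heapq.heappop(heap).row for _ in range(len(heap))]
```

Each entry remembers how far it has been expanded (the length of `keys`), which plays the role of the level the slides keep per entry.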

Handling Strings
- When caching a sort column, cache the strxfrm result
  - Comparison uses strcmp
- Equal strings
  - Collapse equal strings
    - Compare pointer values first
    - Saves memory
- Problems
  - Memory consumption
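
A small sketch of the string handling, using Python's `locale.strxfrm` as a stand-in for the C strxfrm call. `cache_string_keys` and `compare_keys` are hypothetical helpers, and the dictionary used to collapse equal strings is an assumption about how the sharing could be done, not a description of the GPDB code.

```python
import locale

def cache_string_keys(strings):
    """Transform each distinct string once and share that key among equal strings."""
    pool = {}
    keys = []
    for s in strings:
        if s not in pool:
            # Pay the collation cost once per distinct value.
            pool[s] = locale.strxfrm(s)
        keys.append(pool[s])
    return keys

def compare_keys(a, b):
    """Compare two cached keys; identical objects are equal without a scan."""
    if a is b:  # the "compare pointer value first" shortcut
        return 0
    # Plain comparison of transformed keys, the analogue of strcmp.
    return (a > b) - (a < b)
```

The transformed keys are usually longer than the original strings, which is the memory consumption problem the slide mentions; sharing keys among equal strings offsets part of it.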

Minor improvements
- Fast-path some basic types
  - Int, maybe float later
- Limit sort: use heapsort instead of insertion sort
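
One way to picture a heap-based limit sort is a bounded heap that keeps only the best `limit` values seen so far. The sketch below is illustrative only: `limit_sort` is a hypothetical name, and it handles numbers only because it negates values to turn Python's min-heap into a max-heap.

```python
import heapq

def limit_sort(values, limit):
    """Return the `limit` smallest numbers in ascending order."""
    if limit <= 0:
        return []
    heap = []  # negated values, so heap[0] is the largest value kept
    for v in values:
        if len(heap) < limit:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v beats the worst kept value; replace it in O(log limit).
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)
```

Each input value costs at most O(log limit), where an insertion into a sorted array of kept values costs O(limit) in the worst case.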

Outline
- Motivation
- Review of Current Status
- Improve Sort Performance
- Remaining Work

“Honest” Implementation
- Cutting corners in a performance prototype is dangerous
  - Error handling
  - Special cases
- Relatively honest
  - Does not handle unique check etc.
- Passes make installcheck-good.
- Passes TPCH and opperf if hashagg and hashjoin are turned off

TPCH 1G Q1
- Hashagg ~5.7 sec
- Old sort ~15 sec
- New sort ~8 sec
  - Aggregate computing takes ~4 sec
  - Hashagg proper ~1.5 sec
  - New sort generated 3 runs, motioned 6M tuples, and did one more comparison in Agg in less than 4 sec.
    - The extra comparison takes more than 1 sec
    - Sort proper is ~2 sec

Building index
- On ship_instruction, ship_mode, comment
  - Old: all take 24 to 26 sec
  - New: 4 sec, 6 sec, 11 sec
- On two columns
  - Old: 70+ sec
  - New: 16?

OLAP (Cube and Rollup)
- For “big” OLAP CUBE/ROLLUP queries, 10~15% faster
  - Not much on “smaller” ones; some may even see a small regression
    - Unstable timing, regression comes and goes
  - Our OLAP plans have many sorts on 1 or 2 integer columns, so this is expected
- However, we can finish some “machine freezing” queries now

Yahoo
- Hashagg
- Slightly slower
  - Heapsort overhead :-(
  - On par once I fast-pathed int4cmp

Outline
- Motivation
- Review of Current Status
- Improve Sort Performance
- Remaining Work

More Improvements
- We know the level of key change
  - Important for sort agg
  - Important for OLAP
  - Important for merge join
- Take (more) advantage of unique, limit, aggregate.
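
To make "level of key change" concrete: for each row in sorted order, it is the first sort column whose value differs from the previous row. The sketch below recomputes it by comparing neighbours, which is exactly the extra comparison a consumer such as sort agg has to do today; the point of the slide is that the sort already knows this value. `key_change_levels` is a hypothetical helper.

```python
def key_change_levels(sorted_rows, ncols):
    """For each row, return the first sort column that differs from the
    previous row (0 for the first row, ncols if the whole key repeats)."""
    levels = []
    prev = None
    for row in sorted_rows:
        if prev is None:
            level = 0
        else:
            level = next((i for i in range(ncols) if row[i] != prev[i]), ncols)
        levels.append(level)
        prev = row
    return levels
```

A level of 0 starts a new group on the first column, a level of 1 starts a new subgroup within the same first column (useful for rollup), and a level equal to `ncols` marks a duplicate key.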

Improve the code
- Heap code (maybe) is (more) complicated (than necessary); don't know how to improve it yet.
- Memory management.
- Explain analyze accounting and reporting.

Code Review
- Code is at ftian_main_cr2 branch
  - tuplesort.c
    - Should make it tuplesortnew.c, and probably GUC it.
    - Uses memtuple and logtape as before.
    - Uses new quicksort and heapsort.
  - mk_qsort.c
    - Multi-key quicksort. Straightforward.
  - mk_heap.c
    - Multi-key heapsort. A 700-line heapsort :-(
- About time to port into MAIN.

Feedback (Thanks!)
- Ideas, new improvements, and critiques of the approach are welcome.