Published by Nathalie Pew. Modified over 9 years ago.
1
Improving Hash Join Performance By Exploiting Intrinsic Data Skew by Bryce Cutt supervised by Dr. Ramon Lawrence
2
Introduction
Databases are part of our lives
Hash Join is a core database algorithm
  o Very I/O intensive for large databases
Queries may take hours
  o Any performance improvement is significant
Real datasets contain skew
  o Skew is when some values occur more frequently
  o Skew can greatly reduce hash join performance
Skew traditionally considered a bad thing for join algorithms
  o Try to mitigate negative effects of skew
Adapt hash join
  o No longer just mitigate
  o Use foreknowledge of skew
Improve performance
3
Relational Model Definitions
4
Example Relations
Build Relation: Part
Probe Relation: Purchase
5
DHJ Algorithm Build Phase
Hash Function: modulo 5
6
DHJ Algorithm Build Phase, cont.
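
The build-phase slides are mostly diagrams, so here is a minimal sketch of what they appear to show. Only the hash function (modulo 5) comes from the slides. The `(key, payload)` tuple layout, the `memory_limit` counted in tuples, and the rule of freezing the largest in-memory partition are assumptions made for illustration. Plain lists stand in for disk files.

```python
NUM_PARTITIONS = 5  # the slide's hash function is "modulo 5"


def build_phase(build_tuples, memory_limit):
    """Partition build tuples by key % 5, freezing partitions when memory fills.

    build_tuples: iterable of (key, payload) pairs (assumed layout).
    memory_limit: maximum number of tuples kept in memory (stand-in for bytes).
    Returns (in_memory, on_disk): dicts mapping partition id -> list of tuples.
    """
    in_memory = {p: [] for p in range(NUM_PARTITIONS)}
    on_disk = {}  # frozen partitions; lists stand in for disk files
    used = 0
    for key, payload in build_tuples:
        p = key % NUM_PARTITIONS
        if p in on_disk:
            # Partition already frozen: the tuple goes straight to "disk".
            on_disk[p].append((key, payload))
            continue
        in_memory[p].append((key, payload))
        used += 1
        while used > memory_limit:
            # Assumed victim policy: freeze the largest in-memory partition.
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            on_disk[victim] = in_memory.pop(victim)
            used -= len(on_disk[victim])
    return in_memory, on_disk
```

For example, `build_phase([(k, f"part{k}") for k in range(1, 11)], memory_limit=6)` keeps three partitions in memory and freezes two.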
17
Probe Relation
18
DHJ Algorithm Probe Phase
19
DHJ Algorithm Probe Phase, cont.
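
The probe-phase slides can be sketched roughly as follows. This is an illustration under assumptions, not the presenter's code: build tuples are `(key, payload)` pairs, `in_memory` maps a partition id to the build tuples still in memory, and `frozen_ids` names the partitions whose build tuples were written to disk. Probe tuples that hash to a frozen partition are set aside for the cleanup phase.

```python
from collections import defaultdict

NUM_PARTITIONS = 5  # same "modulo 5" hash function as the build phase


def probe_phase(probe_tuples, in_memory, frozen_ids):
    """Join probe tuples against in-memory partitions; spill the rest.

    Returns (results, spilled_probe), where results is a list of
    (key, build_payload, probe_payload) and spilled_probe maps a frozen
    partition id to the probe tuples deferred to the cleanup phase.
    """
    # One hash table per in-memory partition: key -> list of build payloads.
    tables = {}
    for p, tuples in in_memory.items():
        table = defaultdict(list)
        for key, payload in tuples:
            table[key].append(payload)
        tables[p] = table

    results = []
    spilled_probe = defaultdict(list)
    for key, payload in probe_tuples:
        p = key % NUM_PARTITIONS
        if p in frozen_ids:
            # Matching build tuples are on disk, so defer this probe tuple.
            spilled_probe[p].append((key, payload))
        else:
            for build_payload in tables.get(p, {}).get(key, []):
                results.append((key, build_payload, payload))
    return results, dict(spilled_probe)
```

Probe tuples that hit an in-memory partition produce output immediately with no extra I/O, which is why the choice of what stays in memory matters.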
26
DHJ Algorithm Cleanup Phase
27
DHJ Algorithm Cleanup Phase, cont.
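
A minimal sketch of the cleanup phase, assuming that frozen build partitions and the probe tuples deferred during the probe phase are both available as dicts keyed by partition id (lists stand in for disk files, and `(key, payload)` is an assumed tuple layout). Each frozen partition is loaded, hashed, and joined with its matching probe partition.

```python
from collections import defaultdict


def cleanup_phase(build_on_disk, probe_on_disk):
    """Join each frozen build partition with its deferred probe partition.

    build_on_disk, probe_on_disk: dicts mapping partition id -> list of
    (key, payload) tuples. Returns a list of
    (key, build_payload, probe_payload) results.
    """
    results = []
    for p, build_tuples in build_on_disk.items():
        # Load one build partition into an in-memory hash table.
        table = defaultdict(list)
        for key, payload in build_tuples:
            table[key].append(payload)
        # Stream the matching probe partition against it.
        for key, payload in probe_on_disk.get(p, []):
            for build_payload in table.get(key, []):
                results.append((key, build_payload, payload))
    return results
```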
30
Skewed Probe Relation
31
Statistics and Hash Joins
Modern database systems maintain statistics such as histograms for query optimization
What if hash join could use the statistics to choose the best build tuples to keep in memory?
  o Does not have to generate own statistics
32
Histojoin Algorithm General Idea
Same basic form as DHJ
Determines best build tuples from histogram
  o In this case the tuples with partid 2 and 3
Create partitions for the best build tuples
  o In addition to regular partitions
  o Freeze regular partitions first
Perform a highly optimized multi-stage check
  o To determine the partition tuples belong in
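
The idea above can be sketched as two small functions. This is a simplified illustration, not the optimized check the slide refers to: the histogram is assumed to be a plain dict from join-key value to estimated probe-side frequency, `max_keys` is a hypothetical cap on how many values fit in the extra partitions, and the "multi-stage check" is reduced to a set lookup followed by the regular modulo hash.

```python
def choose_privileged_keys(histogram, max_keys):
    """Pick the join-key values that occur most often in the probe relation.

    histogram: dict mapping join-key value -> estimated frequency (assumed
    format; a real system would read this from its own statistics).
    """
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    return {value for value, _ in ranked[:max_keys]}


def assign_partition(key, privileged, num_regular=5):
    """Two-stage check: privileged partition first, then the regular hash."""
    if key in privileged:          # stage 1: is this one of the best keys?
        return "privileged"
    return key % num_regular       # stage 2: regular partition (modulo hash)
```

With a histogram such as `{1: 10, 2: 400, 3: 350, 4: 12, 5: 8}` and `max_keys=2`, the chosen values are 2 and 3, matching the partid values named on the slide.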
33
Histojoin Algorithm Build Phase
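
A rough sketch of how the Histojoin build phase might differ from the DHJ one, assuming the most frequent join-key values (`privileged_keys`) have already been picked from a histogram. The tuple layout, the tuple-count `memory_limit`, and the largest-first victim rule are illustrative assumptions. The one property taken from the slides is that regular partitions are frozen before the partitions holding the best build tuples.

```python
NUM_REGULAR = 5  # regular partitions use the "modulo 5" hash


def histojoin_build(build_tuples, privileged_keys, memory_limit):
    """Keep tuples for privileged keys in memory; freeze regular partitions first.

    Returns (privileged, regular, frozen): a list of privileged build tuples,
    a dict of in-memory regular partitions, and a dict of frozen partitions.
    """
    privileged = []
    regular = {p: [] for p in range(NUM_REGULAR)}
    frozen = {}  # lists stand in for disk files
    used = 0
    for key, payload in build_tuples:
        if key in privileged_keys:
            privileged.append((key, payload))
            used += 1
        else:
            p = key % NUM_REGULAR
            if p in frozen:
                frozen[p].append((key, payload))
                continue
            regular[p].append((key, payload))
            used += 1
        # Only regular partitions are candidates for freezing. If none are
        # left, this sketch simply stops freezing (overflow is not handled).
        while used > memory_limit and regular:
            victim = max(regular, key=lambda q: len(regular[q]))
            frozen[victim] = regular.pop(victim)
            used -= len(frozen[victim])
    return privileged, regular, frozen
```

Because the privileged tuples stay in memory, probe tuples carrying the most frequent values can be joined during the probe phase without being written to disk.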
34
Histojoin Algorithm Probe Phase
35
Implementation Details
Avoided in algorithm description
  o General enough to fit any database system
But ultimately important
  o Core of the algorithm is implementation specific
Implemented in
  o Stand-alone Java app (optimistic implementation)
  o PostgreSQL HHJ (conservative implementation)
36
Inaccurate Statistics
Selections
Multi-join plans
  o Sampling
  o SITs
Handling dependent on implementation
  o PostgreSQL conservative memory usage
37
Experimental Results
TPC-H
  o Database commonly used to test database system performance
  o Skewed versions
  o 1GB dataset used in Java tests
  o 10GB dataset used in PostgreSQL tests
38
Experimental Results, cont. Java, Lineitem/Part, skewed, 1GB Approx. 20% faster
39
Experimental Results, cont. Java, Lineitem/Part, high skew, 1GB Approx. 60% faster
40
Experimental Results, cont. Java, Various Joins, Percent Improvement, 1GB Approx. 20% for skewed and 60% for high skew
41
Experimental Results, cont. Java, Lineitem/Part, Inaccurate Histogram, 1GB
42
Experimental Results, cont. Java, Lineitem/Part/Supplier, high skew, 1GB Approx. 75% faster
43
Experimental Results, cont. PostgreSQL, Lineitem/Part, skewed, 10GB Approx. 10% faster
44
Experimental Results, cont. PostgreSQL, Lineitem/Part, high skew, 10GB Approx. 60% faster
45
Experimental Results, cont. PostgreSQL, Various Joins, Percent Improvement, 10GB 5-10% for skewed and 50-60% for high skew
46
Conclusion
Histojoin
  o Significantly outperforms standard hash joins in the presence of skew
Smart implementation mitigates pitfalls
Two papers have been published from this work
PostgreSQL patch currently in review
  o Will be used by millions of users
47
Thank you Thank you Dr. Lawrence