Data Partitioning in VLDB Tal Olier

1 Data Partitioning in VLDB Tal Olier Tal.olier@hp.com

2 Why am I here? Tal Olier – tal.olier@hp.com ~15 years in various software development positions, all of which involved database practice. I work in HP Software, I love working there and I came to tell you this; the lecture is just an excuse to get me into the building :)

3 Agenda: RDBMS in short (basic terms); SQL reminder; A bit about (RDBMS) architecture; Performance - access paths; What is table join; VLDB - the size factor; VLDB - industry practice; How joins are executed; Summary

4 Relational Database Management System

5 A little history The relational model was invented in 1970 – By Edgar Frank "Ted" Codd – At IBM labs – Oracle was the first to reach the market

6 Basics – a table Rows, columns, primary key

Emp_id  Emp_name  Salary
1       Dany      10,000
2       Yosi      20,000
3       Moshe     30,000
4       Eli       40,000

7 Basics – a relation A foreign key (constraint) A reference – Source table – Source column/s – Target table – Target column/s
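The source/target terminology above can be illustrated with a minimal sketch, assuming SQLite through Python's standard `sqlite3` module. The `employee` and `department` table names are borrowed from the join slides later in the deck; combining them into this schema, and the sample rows, are assumptions made for the example. Here `employee.dept_id` is the source column and `department.dept_id` is the target column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

# Target table and target column: department.dept_id
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
# Source table and source column: employee.dept_id references the target
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " emp_name TEXT,"
    " dept_id INTEGER REFERENCES department (dept_id))"
)

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Dany', 1)")  # valid reference

try:
    # Department 99 does not exist, so the constraint should reject this row
    conn.execute("INSERT INTO employee VALUES (2, 'Yosi', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The constraint is checked by the database on every insert, so the application cannot store an employee that points at a missing department.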

8 People example People: name, height, smoking, father Books read: title, author Schedule details: from, to, activity Resume details: from, to, salary

9 People example

10 Structured Query Language

11 Query language SQL – Structured Query Language; Declarative (vs. procedural); Requires internal optimization

12 SELECT query structure SELECT FROM… JOIN WHERE GROUP BY HAVING ORDER BY
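A query that uses every clause in the order listed above might look like the following sketch. It assumes SQLite through Python's `sqlite3` module; the schema merges the employee table from slide 6 with a `department` table, and that merge, along with the threshold values, is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (emp_name TEXT, salary INTEGER, dept_id INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Engineering');
    INSERT INTO employee VALUES
        ('Dany', 10000, 1), ('Yosi', 20000, 2), ('Moshe', 30000, 2), ('Eli', 40000, 1);
""")

rows = conn.execute("""
    SELECT dept.dept_name, SUM(emp.salary) AS total_salary  -- output columns
    FROM employee emp                                        -- first row source
    JOIN department dept ON emp.dept_id = dept.dept_id       -- second row source
    WHERE emp.salary >= 20000                                -- filter individual rows
    GROUP BY dept.dept_name                                  -- one group per department
    HAVING SUM(emp.salary) > 40000                           -- filter whole groups
    ORDER BY total_salary DESC                               -- sort the result
""").fetchall()
```

The query says what result is wanted, not how to compute it; choosing access paths and a join method is left to the optimizer.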

13 SQL modules DML (+Select) – Data manipulation language; DDL – Data definition language; TC – Transaction controls (commit/rollback); DCL – Data control language (grant/revoke); PE – Procedural extensions

14 A bit about architecture

15 Database server [Diagram: client process and server process; memory: buffer cache, log cache, other cache; I/O system: data files, log files] Everything is blocks

16 IO bound vs. CPU bound CPU – what is it consumed for? IO – what is it consumed for?

17 Performance?

18 FTS – full table scan Scan the whole table, from top to bottom [Figure: table rows labeled 2007, 2008, 2009, 2010, repeating]

19 B Tree index B tree – a large fan-out (many keys per node) keeps the tree height small

20 B+ tree The leaves are organized in a doubly linked list, which allows scanning through all values by traversing the leaf level only

21 Database index Data is sorted according to the index columns. The leaves contain pointers to rows in the table. Search of 1 value in a tree: O(log n). Smaller index height in B+ trees. Index (database) operations: – Add/remove values – Index seek – Index scan

22 Index seek/scan [Figure: table rows labeled 2007, 2008, 2009, 2010, repeating]
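The difference between a full table scan, an index seek and an index scan can be sketched in a few lines of Python. This is a toy model rather than a real B+ tree: a sorted list of `(key, row position)` entries stands in for the leaf level, and binary search stands in for the descent from the root. The `order_id` values are hypothetical.

```python
import bisect

# Table rows in arrival order: (order_id, year). The list position acts as the row pointer.
table = [(1, 2009), (2, 2007), (3, 2010), (4, 2008), (5, 2007), (6, 2009)]

# "Index" on year: sorted (year, row_position) entries, like the leaf level of a B+ tree
index = sorted((year, pos) for pos, (_, year) in enumerate(table))

def full_table_scan(year):
    """Visit every row from top to bottom, keeping the matches."""
    return [row for row in table if row[1] == year]

def index_range_scan(first_year, last_year):
    """Binary-search to the first key, then walk adjacent entries in key order."""
    lo = bisect.bisect_left(index, (first_year, -1))
    hi = bisect.bisect_right(index, (last_year, len(table)))
    return [table[pos] for _, pos in index[lo:hi]]

def index_seek(year):
    """A seek is a range scan over a single key value."""
    return index_range_scan(year, year)
```

The seek touches only the entries for the requested year, while the full scan reads all six rows; on a large table that gap is where the I/O savings come from.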

23 Join (logical)

24 Inner join Uses a join predicate to match rows from 2 tables: A and B. Each row in table A is compared to each row in table B to find the pairs of rows that satisfy the join predicate. Then the column values of each matched pair are combined into a result row

25 Cartesian product

department
dept_id  dept_name
1        Sales
2        Engineering
3        Marketing

employee
emp_name  dept_id
Rina      1
Moshe     2
Shira     2
Yossi     null

Cartesian product
emp_dept_id  emp_name  dept_dept_id  dept_name
1            Rina      1             Sales
1            Rina      2             Engineering
1            Rina      3             Marketing
2            Moshe     1             Sales
2            Moshe     2             Engineering
2            Moshe     3             Marketing
2            Shira     1             Sales
2            Shira     2             Engineering
2            Shira     3             Marketing
null         Yossi     1             Sales
null         Yossi     2             Engineering
null         Yossi     3             Marketing

26 Equi join An inner join that uses an equality comparison in the join predicate. Example: select * from employee emp join department dept on emp.dept_id = dept.dept_id

27 Equi join [Table: the Cartesian product of employee and department (12 rows), with "OK" marking rows where emp_dept_id = dept_dept_id: Rina/Sales, Moshe/Engineering, Shira/Engineering]
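The Cartesian product and the equi join from the last few slides can be reproduced with a short sketch, assuming SQLite through Python's `sqlite3` module. The table contents are the ones shown on the slides; only the `ORDER BY` is added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE employee (emp_name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Engineering'), (3, 'Marketing');
    INSERT INTO employee VALUES ('Rina', 1), ('Moshe', 2), ('Shira', 2), ('Yossi', NULL);
""")

# Every employee row paired with every department row: 4 x 3 combinations
cartesian = conn.execute(
    "SELECT * FROM employee emp CROSS JOIN department dept"
).fetchall()

# Only the pairs that satisfy the equality predicate survive
equi = conn.execute(
    "SELECT emp.emp_name, dept.dept_name"
    " FROM employee emp JOIN department dept ON emp.dept_id = dept.dept_id"
    " ORDER BY emp.emp_name"
).fetchall()
```

Yossi does not appear in the equi join result: a null `dept_id` never compares equal to anything, so none of his three pairings passes the predicate.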

28 RDBMS – summary in a nutshell Tables References Joins Indexes Blocks I/O

29 Very Large Database

30 RDBMS – summary in a nutshell Tables References Indexes Blocks I/O

31 VLDB – a table – size factor

32 Use case: Sales Information Table: – Customer name – Order number – Order date and time – List of items, amount and prices

33 Order details (2007-2010) [Figure: one table with rows labeled 2007, 2008, 2009, 2010, repeating]

34 Remove 2007’s orders [Figure: one table with rows labeled 2007, 2008, 2009, 2010, repeating]

35 Order details kept in 4 tables [Figure: four tables labeled 2007, 2008, 2009, 2010]

36 … 4 tables – remove 2007’s data [Figure: four tables labeled 2007, 2008, 2009, 2010]

37 Union view
SELECT * FROM t2007
UNION ALL
SELECT * FROM t2008
UNION ALL
SELECT * FROM t2009
UNION ALL
SELECT * FROM t2010
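The four-tables-plus-view approach can be tried end to end with a small sketch, assuming SQLite through Python's `sqlite3` module. The table names `t2007` to `t2010` come from the slide; the view name `orders`, the column names and the helper `create_orders_view` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for year in (2007, 2008, 2009, 2010):
    conn.execute(f"CREATE TABLE t{year} (order_no INTEGER, order_year INTEGER)")
    conn.execute(f"INSERT INTO t{year} VALUES (?, ?)", (year * 10, year))

def create_orders_view(years):
    """(Re)build the view as a UNION ALL over one table per year."""
    conn.execute("DROP VIEW IF EXISTS orders")
    selects = " UNION ALL ".join(f"SELECT * FROM t{y}" for y in years)
    conn.execute(f"CREATE VIEW orders AS {selects}")

create_orders_view([2007, 2008, 2009, 2010])
before = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Removing 2007's orders: drop the whole table, then rebuild the view without it
conn.execute("DROP VIEW orders")
conn.execute("DROP TABLE t2007")
create_orders_view([2008, 2009, 2010])
after = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
remaining_years = [r[0] for r in conn.execute(
    "SELECT DISTINCT order_year FROM orders ORDER BY order_year")]
```

Dropping a table avoids deleting rows one by one, but the view has to be maintained by hand every time a year is added or removed.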

38 Order details kept in 4 tables and a view [Figure: four tables labeled 2007, 2008, 2009, 2010]

39 Partitioned table [Figure: four partitions labeled 2007, 2008, 2009, 2010]

40 Get back to: Remove 2007’s data? [Figure: four partitions labeled 2007, 2008, 2009, 2010]

41 Impact on index behavior [Figure: labels 2007, 2008, 2009, 2010, repeating, shown with four partitions labeled 2007, 2008, 2009, 2010]

42 Partitioned index (local index) [Figure: labels 2007, 2008, 2009, 2010, repeating, shown with four partitions labeled 2007, 2008, 2009, 2010]

43 Local indexes Index is bound to its partition; Dropping a partition drops its index; Smaller index heights; Index is always usable; Harder to maintain uniqueness with it
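Why uniqueness is harder with local indexes can be seen in a toy model. In this sketch each partition has its own sorted list of order numbers standing in for a local index; the data and all function names are hypothetical, and the point about probing every partition assumes the unique column is not the partition column.

```python
import bisect

# One sorted list of order numbers per partition stands in for a local index
local_indexes = {
    2008: [101, 105, 109],
    2009: [102, 106],
    2010: [103, 107, 110],
}

def seek(index, key):
    """Binary search within a single (small) local index."""
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key

def seek_with_partition_key(year, order_no):
    """Partition key known: exactly one local index is probed."""
    return seek(local_indexes.get(year, []), order_no)

def order_no_exists(order_no):
    """Partition key unknown, e.g. a uniqueness check on order_no alone:
    every local index has to be probed."""
    return any(seek(index, order_no) for index in local_indexes.values())

def drop_partition(year):
    """Dropping a partition removes its local index; the others are untouched."""
    local_indexes.pop(year, None)
```

After `drop_partition(2010)` the indexes for 2008 and 2009 need no rebuild, which is the sense in which a local index stays usable.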

44 Partitioned table - concepts Partition column is the key for dividing the data; Performance – only relevant partitions are used; Add/drop partition – DDL; Local index – index is bound to a partition
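These concepts can be sketched with a toy in-memory model. It is not how a database stores partitions; the class `PartitionedOrders` and its methods are hypothetical names, used only to show that a query on the partition column reads one partition and that dropping a partition does not visit individual rows.

```python
from collections import defaultdict

class PartitionedOrders:
    """Toy range-partitioned table: one bucket of rows per order year."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key (year) -> rows

    def insert(self, year, order_no):
        # The partition column decides which bucket receives the row
        self.partitions[year].append((year, order_no))

    def select_year(self, year):
        """Partition pruning: only the matching partition is read."""
        return list(self.partitions.get(year, []))

    def drop_partition(self, year):
        """Removes a whole year at once, without visiting individual rows."""
        self.partitions.pop(year, None)

orders = PartitionedOrders()
for year in (2007, 2008, 2009, 2010):
    for n in range(3):
        orders.insert(year, year * 100 + n)

orders.drop_partition(2007)
```

Unlike the union view approach, the table keeps a single name, so queries do not change when partitions are added or dropped.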

45 Star schema

46 Data tables block [Figure: Table-A, Table-B, Table-C]

47 Let’s get back to our partitioned table [Figure: four partitions labeled 2007, 2008, 2009, 2010]

48 Dimension referencing

Fact table
Year  Cust_id  …
…     …        …
2007  1715     …

Dimension table
Cust_id  Name                 Serial signature  …
…        …                    …                 …
1715     Bank of America Ltd  /FFAA23472394-    …

49 Making fact tables thin [Figure: four partitions labeled 2007, 2008, 2009, 2010, referencing a dimension table]
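A thin fact table that references a dimension can be sketched as follows, assuming SQLite through Python's `sqlite3` module. The customer id 1715 and the customer name come from the slide; the table names `order_fact` and `customer_dim`, the `amount` column and its values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: wide descriptive attributes, stored once per customer
    CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact: thin rows that hold only keys and measures
    CREATE TABLE order_fact (
        order_year INTEGER,
        cust_id INTEGER REFERENCES customer_dim (cust_id),
        amount INTEGER
    );
    INSERT INTO customer_dim VALUES (1715, 'Bank of America Ltd');
    INSERT INTO order_fact VALUES (2007, 1715, 250), (2008, 1715, 400);
""")

# The descriptive columns are fetched through the join, not stored per order
rows = conn.execute("""
    SELECT f.order_year, d.name, f.amount
    FROM order_fact f
    JOIN customer_dim d ON f.cust_id = d.cust_id
    ORDER BY f.order_year
""").fetchall()
```

Each fact row carries a small integer instead of the full customer description, so more rows fit in a block and less I/O is needed to scan the fact table.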

50 Join (physical)

51 Join To perform a join, the optimizer needs to make the following decisions: Access path – how to access each table; Join order – if more than 2 tables/views are joined, which join to do first; Join method – for each pair of row sources, how to perform the join

52 Join methods – nested loop One input is the outer loop, the other input is the inner loop. The inner loop is executed for each row in the outer loop. Effective when – The outer loop is small – The inner loop is pre-indexed
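A nested loop join over the employee and department rows from the earlier slides might be sketched like this. A sorted list searched with `bisect` stands in for the existing index on the inner input; the function names are hypothetical.

```python
import bisect

def index_lookup(sorted_index, key):
    """Return all values stored under key in a sorted (key, value) list."""
    if key is None:  # a null never satisfies an equality predicate
        return []
    i = bisect.bisect_left(sorted_index, (key,))
    matches = []
    while i < len(sorted_index) and sorted_index[i][0] == key:
        matches.append(sorted_index[i][1])
        i += 1
    return matches

def nested_loop_join(outer, inner_index):
    """For each outer row, run the inner loop as an index lookup."""
    result = []
    for emp_name, dept_id in outer:                            # outer loop
        for dept_name in index_lookup(inner_index, dept_id):   # inner loop
            result.append((emp_name, dept_name))
    return result

employees = [("Rina", 1), ("Moshe", 2), ("Shira", 2), ("Yossi", None)]
# Stand-in for an index on department.dept_id, already sorted by key
dept_index = sorted([(3, "Marketing"), (1, "Sales"), (2, "Engineering")])

joined = nested_loop_join(employees, dept_index)
```

The cost is one index lookup per outer row, which is why the method works best when the outer input is small.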

53 Join methods – hash The smaller of the 2 inputs is named the build input; the second is the probe input. A hash table is built from the build input: each row in the build input is put in the appropriate bucket. The entire probe input is scanned

54 Join methods – hash cont’ For each row in the probe input, the hash value is calculated. The corresponding hash bucket is scanned to find matching rows from the build input. Good for joining large amounts of data
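The build and probe phases described on the two hash slides can be sketched in a few lines. This is an in-memory illustration only: a Python `dict` does the hashing and bucketing, and `hash_join` is a hypothetical name. The rows are the employee and department data from the earlier slides.

```python
def hash_join(build_input, probe_input):
    """Equi-join two lists of (key, value) rows; build_input should be the smaller one."""
    buckets = {}
    for key, value in build_input:            # build phase: fill the hash table
        if key is not None:                   # null keys can never match
            buckets.setdefault(key, []).append(value)

    result = []
    for key, value in probe_input:            # probe phase: one pass over the other input
        for build_value in buckets.get(key, []):
            result.append((value, build_value))
    return result

departments = [(1, "Sales"), (2, "Engineering"), (3, "Marketing")]       # build input
employees = [(1, "Rina"), (2, "Moshe"), (2, "Shira"), (None, "Yossi")]   # probe input

joined = hash_join(departments, employees)
```

Each input is read once and no index or sort is required, which fits the slide's point that the method suits large inputs.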

55 Join methods – merge There is no concept of a driving table. Both input sources are sorted according to the join key (or a sorted source such as an index is used). The sorted lists are merged together. The merge itself is very fast, but it can be expensive to sort the sources
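A sort-merge join can be sketched as below, again over the employee and department rows from the earlier slides. The function name `merge_join` is hypothetical, and sorting inside the function is a simplification; a database could skip the sort when an index already delivers the rows in key order.

```python
def merge_join(left, right):
    """Equi-join two lists of (key, value) rows by sorting both and merging."""
    # Sort phase: this is the potentially expensive part. Null keys never match.
    left = sorted((r for r in left if r[0] is not None), key=lambda r: r[0])
    right = sorted((r for r in right if r[0] is not None), key=lambda r: r[0])

    result = []
    i = j = 0
    while i < len(left) and j < len(right):    # merge phase: one pass over each input
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of equal keys on the right, then pair it with each left row
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            while i < len(left) and left[i][0] == key:
                for k in range(j, j_end):
                    result.append((left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result

employees = [(2, "Shira"), (1, "Rina"), (None, "Yossi"), (2, "Moshe")]
departments = [(3, "Marketing"), (1, "Sales"), (2, "Engineering")]

joined = merge_join(employees, departments)
```

Neither input drives the other: both cursors only move forward, so the merge itself is linear in the size of the two sorted inputs.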

56 Summary It’s all about I/O; Star schema – facts and dimensions; Partitions + local indexes; SQL joins (probably hash)

57 Q & A

