Ján GENČI, PDT 2009: Database Management System (Systém riadenia bázy dát)


2 Contents: RAID; two-phase multiway sort-merge; physical data organization; indexing; system catalog; relational algebra operations (briefly); implementation of relational algebra operations

3 Contents (not covered for lack of time): transaction processing; parallel processing; recovery after failures

4 References [1] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer D. Widom: Database System Implementation. Prentice Hall, ISBN-10: , pp. 653; Database Systems: The Complete Book, 2001.

5 References [2] Elmasri R., Navathe S. B.: Fundamentals of Database Systems. 4th ed., Pearson Education, th ed. – 2006, pp (ch ; 120 or 220 pages, respectively)

6 References [3] Ramakrishnan R., Gehrke J.: Database Management Systems. McGraw-Hill Science/Engineering/Math; 3rd ed., 2002, pp. 906 (ch. 7-14; 220 pages)

7 References [4] Abraham Silberschatz, Henry Korth, S. Sudarshan: Database System Concepts. McGraw-Hill Science/Engineering/Math; 5th ed., pp. ~920 (ch ; 170 or 290 pages, respectively)

RAID (most figures are from [2])

9 RAID. Originally: Redundant Arrays of Inexpensive Disks. Currently: Redundant Array of Independent Disks. See Chen, Lee, Gibson, Katz, and Patterson, ACM Computing Surveys, Vol. 26, No. 2 (June 1994), a clear and well-illustrated treatment.

10 RAID 0

11 RAID 1, 2

12 RAID 3, 4, 5, 6

13 RAID: further combinations. RAID 10 and RAID 01 are combinations of the basic RAID levels. Performance:
– Block-interleaved distributed-parity disk arrays (RAID 5) have the best small read, large read, and large write performance of any redundant disk array.
– Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring.
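To make the parity idea behind RAID 5 concrete, here is a minimal Python sketch. It assumes byte-wise XOR parity over a single stripe; the function names (`parity`, `reconstruct`, `updated_parity`) and the sample data are illustrative and do not come from the slides. The last function hints at why small writes are comparatively costly: changing one data block needs the old data block and the old parity block before the new data and new parity can be written.

```python
from functools import reduce


def parity(blocks):
    """XOR equally sized byte blocks column by column to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing data block from the survivors and the parity."""
    return parity(list(surviving_blocks) + [parity_block])


def updated_parity(old_parity, old_block, new_block):
    """Small write: new parity = old parity XOR old data XOR new data."""
    return parity([old_parity, old_block, new_block])


# One stripe spread over three data disks (made-up contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Pretend the disk holding data[1] failed and rebuild its block.
rebuilt = reconstruct([data[0], data[2]], p)
```

Here `rebuilt` equals the lost block, because XOR-ing the parity with all surviving blocks cancels them out.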

Two-phase multiway sort-merge (partially based on a presentation by Simonas Šaltenis, Advanced Algorithm Design and Analysis)

15 Purpose of the algorithm: sorting a very large collection of data (data size > memory size). Classic algorithm: Wirth's sort-merge algorithm (Wirth N.: Algoritmy a dátové štruktúry).

16 Principle, phase 1: 1. Create runs (sorted sequences of elements) that are as large as possible, ideally by reading data into the available memory and sorting it there, e.g. with quicksort. 2. Merge the runs.

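17 Principle, phase 2 [Figure: the k = n/m runs (Run 1, Run 2, …, Run k) of file X are merged into file Y. Each run i has an input buffer Bfi with a position pointer pi; the output buffer Bfo has pointer po. The merge repeatedly outputs min(Bf1[p1], Bf2[p2], …, Bfk[pk]), reads the next page of run i when pi = B, and writes Bfo out when it is full.]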

18 Evaluation. Phase 1: O(n), Phase 2: O(n), total: O(n) I/Os! Only files of "limited" size can be sorted (N is the file size, M the memory size):
– Phase 2 can merge at most m-1 runs (m is the number of buffers).
– Hence the number of runs N/M must not exceed m-1.
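The two phases can be sketched in a few lines of Python. This is a simplified, in-memory illustration under stated assumptions: lists stand in for disk files and pages, `memory_size` plays the role of M (counted in records), `num_buffers` plays the role of m, and all function names are hypothetical. The standard-library `heapq.merge` performs the k-way merge that the slides describe with per-run buffers.

```python
import heapq
from itertools import islice


def make_runs(records, memory_size):
    """Phase 1: read memory_size records at a time and sort each chunk into a run."""
    it = iter(records)
    runs = []
    while True:
        chunk = list(islice(it, memory_size))
        if not chunk:
            break
        chunk.sort()
        runs.append(chunk)
    return runs


def merge_runs(runs, num_buffers):
    """Phase 2: one (m-1)-way merge; one buffer is reserved for output."""
    if len(runs) > num_buffers - 1:
        raise ValueError("too many runs for a single merge pass")
    return list(heapq.merge(*runs))


def two_phase_sort(records, memory_size, num_buffers):
    """Run both phases; fails if the input is too large for one merge pass."""
    return merge_runs(make_runs(records, memory_size), num_buffers)


result = two_phase_sort([5, 3, 9, 1, 7, 2, 8], memory_size=3, num_buffers=4)
```

With these inputs phase 1 produces three runs, which is exactly the m-1 = 3 runs that a single merge pass can handle; with `num_buffers=3` the same call raises `ValueError`, mirroring the size limit stated above.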

19 Sorting very large files [Figure: merge tree. Phase 1 produces runs of size M; each further merge pass combines m-1 runs, giving runs of size (m-1)M, then (m-1)^2 M, up to (m-1)^3 M = N.]

20 Questions

DBMS: structures and algorithms


Primary (physical) organizations

24 What we will cover: supported data types; record formation; organization (ordering) of records, both physical and logical; the "placement" of the DBMS within the OS

25 Supported data types: the so-called built-in data types. For storage purposes, what matters to us is the size of the data type (sizeof(type)). The "semantics" of a type is provided by the implementation (in HW or SW) of the relevant operations (out of scope).

26 Storage Record Formats A fixed-length record A record with variable-length fields A variable-field record with separator characters.

27 Storage Record Formats [2]

28 Fixed-length record: the size of each item is recorded in the system catalog

29 Variable-length records: arise when one or more items have variable length

30 NULL value representation: most sources say practically nothing about how it is implemented. In variable-length records, a null pointer to the record item can be used. Oracle's documentation for ORA7 described storing NULL values via a bitmap prefix of the record.

31 Physical organization of records

32 Physical organization of records (2)

33 Placing records into physical blocks: spanned vs. unspanned

34 Logical organizations of records: sequential; hashed; heap. Evaluated in terms of the insert, find, and delete operations.

35 Sequential organization

36 Evaluation: sequential organization.
– Insert: expensive (on average N/2 records must be shifted); overflow areas can help.
– Find: binary search on the ordering attribute is possible, O(log2 N); otherwise O(N), i.e. N/2 on average or N in the worst case.
– Delete: expensive (on average N/2 records must be shifted); alternatively, records can be marked as deleted and the file packed later.

37 Internal hashing

38 Evaluation: hashing.
– Insert: O(1) if collisions are ignored; with collisions, O(N) in the worst case.
– Find: O(1) on the hashing attribute, O(N) on other attributes.
– Delete: O(1).
The structure must be sized for the maximum number of records.

39 External hashing

40 Evaluation: external hashing. Same as internal hashing. Collisions are resolved with overflow blocks (see the next slide).

41 External hashing: overflow blocks

42 Extendible hashing

43 Evaluation: extendible hashing. Same as external hashing. The advantage is that the "size of the hash array" can be expanded dynamically.

44 Heap: records are unordered; there is no ordering attribute. We lose binary search and the primary index (which applies only to an ordering attribute). INSERT is very efficient.

45 Place of the DBMS within the OS [Figure: two layer stacks. Cooked files: DBMS, OS services, filesystem (NTFS), driver. Raw devices: DBMS, OS services, driver.]

46 Questions

Indexing (largely based on [2]; all figures are from [2])

48 Index: an alternative way of accessing data; locating a record by its content

49 Categorization of indexes.
By number of levels:
– single-level
– multilevel
By indexed attribute:
– primary
– clustering
– secondary
By number of indexed records:
– dense: all records are in the index
– sparse: only some of the records are in the index

50 Primary index: indexes the ordering attribute; a sparse index; one anchor record per block; insertion is a problem

51 Clustering index: can also be built on a "non-ordering" attribute. The primary organization is sorted by the given attribute when the index is built.

52 Clustering index: during normal operation the primary organization is not modified; overflow blocks are used instead.

53 Secondary index: an index on a non-ordering (but key) attribute; a dense index

54 Secondary index on a non-key attribute (repeating values)

55 Interim evaluation: single-level indexes only so far. Benefit (N is the number of records, r the number of records per block):
– search on an ordered key: log2 N
– search on a non-ordered key: N/2
– search on a non-key attribute: N
– primary index: log2(N/r)
– secondary index: log2 N (the number of blocks read is substantially smaller, thanks to the higher blocking factor)

56 Example: sequential file (ordering attribute). Ordered file with r = 30,000 records. Block size B = 1024 bytes. Records are of fixed size and are unspanned. Record length R = 100 bytes. The blocking factor is bfr = floor(B/R) = floor(1024/100) = 10 records per block. The number of blocks is b = ceiling(r/bfr) = ceiling(30,000/10) = 3000 blocks. A binary search would need approximately ceiling(log2 b) = ceiling(log2 3000) = 12 block accesses.

57 Primary index (a refresher)

58 Example: primary index. The key field of the file is V = 9 bytes long and a block pointer is P = 6 bytes, so the size of an index entry is R_i = (9 + 6) = 15 bytes and the blocking factor is bfr_i = floor(B/R_i) = floor(1024/15) = 68 entries per block. The total number of index entries r_i equals the number of blocks in the data file, 3000. The number of index blocks is hence b_i = ceiling(r_i/bfr_i) = ceiling(3000/68) = 45 blocks. A binary search on the index file needs ceiling(log2 b_i) = ceiling(log2 45) = 6 block accesses. To search for a record using the index, we need one additional block access to the data file, for a total of 6 + 1 = 7 block accesses.

59 Example: secondary index. As in example 1: r = 30,000, R = 100 bytes, B = 1024 bytes. A linear search would require b/2 = 3000/2 = 1500 block accesses on average (3000 in the worst case). Suppose V = 9 and P = 6, so bfr_i = 68.
– The secondary index is dense, so the total number of index entries r_i equals the number of records, 30,000.
– The number of blocks needed for the index is b_i = ceiling(r_i/bfr_i) = ceiling(30,000/68) = 442 blocks.
– A binary search on this secondary index needs ceiling(log2 b_i) = ceiling(log2 442) = 9 block accesses.
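The arithmetic of the three examples can be checked with a short Python script. The variable names follow the symbols used on the slides (B, R, r, V, P, bfr, b); the names `b_primary`, `b_secondary` and the `*_accesses` variables are introduced here only for illustration.

```python
import math

B, R, r = 1024, 100, 30_000   # block size (bytes), record length (bytes), records
V, P = 9, 6                   # key field size and block pointer size (bytes)

# Ordered data file without an index.
bfr = B // R                                        # records per block
b = math.ceil(r / bfr)                              # data blocks
file_accesses = math.ceil(math.log2(b))             # binary search on the file

# Index entries are much shorter than records, so more fit in a block.
bfr_i = B // (V + P)                                # index entries per block

# Primary (sparse) index: one entry per data block.
b_primary = math.ceil(b / bfr_i)
primary_accesses = math.ceil(math.log2(b_primary)) + 1   # +1 for the data block

# Secondary (dense) index: one entry per record.
b_secondary = math.ceil(r / bfr_i)
secondary_index_accesses = math.ceil(math.log2(b_secondary))

print(bfr, b, file_accesses)                 # 10 3000 12
print(bfr_i, b_primary, primary_accesses)    # 68 45 7
print(b_secondary, secondary_index_accesses) # 442 9
```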

60 Comparison of (single-level) indexes

61 Multi-Level Indexes. Because a single-level index is an ordered file, we can create a primary index to the index itself; in this case, the original index file is called the first-level index and the index to the index is called the second-level index. We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block. A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block.

62 Multilevel indexes: the first level is dense or sparse; all further levels are sparse; the top level is a single block. A search requires approximately log_bfr_i(b_i) block accesses. Insertion is a problem!
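As a rough illustration of how few levels a multilevel index needs, the following Python sketch counts levels by repeatedly building an index on the previous level until a single block remains. The function name `index_levels` and its parameters are hypothetical; the sample numbers reuse the 442-block dense secondary index and the 45-block primary index from the earlier examples, with a fan-out of bfr_i = 68.

```python
import math


def index_levels(first_level_blocks, fan_out):
    """Number of index levels when each level indexes the blocks of the level below."""
    levels, blocks = 1, first_level_blocks
    while blocks > 1:
        blocks = math.ceil(blocks / fan_out)   # one entry per block of the level below
        levels += 1
    return levels


secondary_levels = index_levels(442, 68)   # 442 -> 7 -> 1 blocks
primary_levels = index_levels(45, 68)      # 45 -> 1 blocks
```

A lookup reads one block per level plus one data block, so under these assumptions the secondary index needs 3 + 1 = 4 block accesses instead of the 9 index accesses of the binary search.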

63 Dynamic Multilevel Indexes Using B-Trees and B+-Trees. Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries. These data structures are variations of search trees that allow efficient insertion and deletion of search values. In B-tree and B+-tree data structures, each node corresponds to a disk block. Each node is kept between half-full and completely full.

64 Dynamic Multilevel Indexes Using B-Trees and B+-Trees (contd.) An insertion into a node that is not full is quite efficient; if a node is full the insertion causes a split into two nodes Splitting may propagate to other tree levels A deletion is quite efficient if a node does not become less than half full If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

65 Difference between B-tree and B+-tree. In a B-tree, pointers to data records exist at all levels of the tree. In a B+-tree, all pointers to data records exist at the leaf-level nodes. A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree.

66 B-tree structure

67 B+-tree structure

68 B+-tree example

69 B-tree example - numbers

70 B+-tree example - numbers

71 B-tree – duplicate keys

72 Questions

System catalog (based on a presentation by Ľubomír Miškovič)

74 What is the system catalog? The system catalog stores data that describe each database (metadata). It contains descriptions of:
– items, records, files, and the relationships among them
– the conceptual schema, the external schemas, and the internal schema
It also describes the mappings between schemas at different levels.

75 Simplified model of the database system environment

76 Contents of the system catalog. Catalogs for relational DBMSs contain:
– relation names
– attribute names
– attribute domains
– primary keys
– secondary key attributes
– foreign keys
– constraints

77 Contents of the system catalog (continued). They also contain descriptions of:
– external views
– storage structures and indexes for the internal level
– security and authorization information defining users' access to database views
– login names of the creators or owners of each relation

78 Contents of the system catalog (continued). They store information such as:
– record size
– current number of records
– number of indexes
– name of the creator of each relation

79 Ways of implementing the system catalog. The system catalog can be created separately for each database in the system, or it can be shared by all databases. It can consist of tables whose structure is identical to that of ordinary database tables, or of a special structure.

80 Example system catalog tables for Informix.
– Systables: describes each table in the database. It contains one row for each table, view, or synonym defined in the database, including the system catalog tables themselves.
– Syscolumns: describes each column in the database. There is one row for each column defined in a table or view.
– Sysindex: describes the indexes in the database. It contains one row for each index defined in the database.

81 Systables

82 syscolumns

83 Relationship between the tables

84 Oracle

85 Postgres

86 Questions

Relational algebra (RA) and the implementation of RA operations (based on [2])

88 Relational algebra. A relation is a subset of a Cartesian product: R ⊆ D_1 × ... × D_n. Relational algebra is:
– a formal language for the relational model
– the basic set of operations for retrieval queries

89 Operations of relational algebra: selection σ; projection π; Cartesian product ×; join ⋈ (theta-, equi-, natural); set operations (on union-compatible relations):
– intersection ∩
– union ∪
– difference \

90 Elementary condition EC and condition C. Definition: an elementary (simple) condition EC is a clause of the form <attribute name> <operator> <constant value or attribute name>, where the operator is from the set of relational operators {=, <, <=, >, >=, ≠}. Definition: a condition C is a clause of the form [NOT] EC1 [{OR | AND} [[NOT] EC2] …]

91 Examples
(O1): σ_{SSN=' '}(EMPLOYEE)
(O2): σ_{DNUMBER>5}(DEPARTMENT)
(O3): σ_{DNO=5}(EMPLOYEE)
(O4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
(O5): σ_{ESSN=' ' AND PNO=10}(WORKS_ON)

92 SELECT operation. Definition: σ_c(R) = { t_i ∈ R | c(t_i) } (three-valued logic). Implementation:
– Linear search
– Binary search
– Using a primary index (or hash key)
– Using a primary index to retrieve multiple records
– Using a clustering index to retrieve multiple records
– Using a secondary (B+-tree) index on an equality comparison
– ...

93 S1: Linear search (brute force). Retrieve every record in the file, and test whether its attribute values satisfy the selection condition. for every t_i: if (c(t_i) == TRUE) output(t_i)

94 S2: Binary search. If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used. Example: σ_{SSN=' '}(EMPLOYEE)
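A minimal Python sketch of S2 follows. It assumes the ordered file is modelled as a list of non-empty blocks, each a list of `(key, payload)` tuples sorted by key; the function name `binary_search_blocks`, the sample records, and the blocking factor of 10 are made up for illustration. The search is performed over blocks rather than records, since block accesses are what the cost estimates count.

```python
def binary_search_blocks(blocks, key):
    """Return (record or None, number of block accesses) for an equality search."""
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]            # one simulated block read
        accesses += 1
        if key < block[0][0]:          # key is before this block
            hi = mid - 1
        elif key > block[-1][0]:       # key is after this block
            lo = mid + 1
        else:                          # key can only be inside this block
            for record in block:
                if record[0] == key:
                    return record, accesses
            return None, accesses
    return None, accesses


# 30 records with keys 0..29, 10 records per block (hypothetical data).
records = [(k, f"rec{k}") for k in range(30)]
blocks = [records[i:i + 10] for i in range(0, len(records), 10)]

found, cost = binary_search_blocks(blocks, 25)
```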

95 S3: Using a primary index (or hash key). If the selection condition involves an equality comparison on a key attribute with a primary index (or hash key), use the primary index (or hash key) to retrieve the record. Note that this condition retrieves a single record (at most). Example: σ_{SSN=' '}(EMPLOYEE)

96 S4: Using a primary index to retrieve multiple records. If the comparison condition is >, >=, <, or <= on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent (or preceding) records in the ordered file. Examples: σ_{DNUMBER>5}(DEPARTMENT) (consider selectivity and distribution); σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
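The range case can be sketched in Python with the standard `bisect` module, which stands in here for the index lookup. The sketch assumes the file is a list of `(DNUMBER, name)` tuples sorted by DNUMBER with a parallel list of keys; the function name and the department rows are hypothetical.

```python
from bisect import bisect_right


def select_greater_than(sorted_keys, records, bound):
    """Find the first position with key > bound, then return everything after it."""
    start = bisect_right(sorted_keys, bound)   # plays the role of the index lookup
    return records[start:]                     # sequential scan of the ordered file


# Hypothetical DEPARTMENT file ordered on DNUMBER.
departments = [(1, "D1"), (4, "D4"), (5, "D5"), (7, "D7"), (9, "D9")]
keys = [d[0] for d in departments]

selected = select_greater_than(keys, departments, 5)   # DNUMBER > 5
```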

97 S5: Using a clustering index to retrieve multiple records. If the selection condition involves an equality comparison on a non-key attribute with a clustering index (for example, DNO = 5 in O3), use the index to retrieve all the records satisfying the condition. Example: σ_{DNO=5}(EMPLOYEE) (if clustered on DNO)

98 S6: Using a secondary (B+-tree) index on an equality comparison This search method can be used to retrieve a single record if the indexing field is a key (has unique values) or to retrieve multiple records if the indexing field is not a key. This can also be used for comparisons involving >, >=, <, or <=.

99 S7: Conjunctive selection using an individual index. If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 (binary search) to S6 (B+-tree index), use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.

100 S8: Conjunctive selection using a composite index. If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined fields (for example, if an index has been created on the composite key (ESSN, PNO) of the WORKS_ON file for O5), we can use the index directly.

101 JOIN operation. R ⋈_c S = { t_i ∈ R, t_j ∈ S | c(t_i, t_j) == TRUE }. Implementation:
– Nested-loop join (brute force)
– Single-loop join (using an access structure to retrieve the matching records)
– Sort-merge join
– Hash-join

102 J1. Nested-loop join (brute force). For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition c (including theta-joins). for each t_i in R: for each s_j in S: if (c(t_i, s_j) == TRUE) output(t_i . s_j). Improvement: nested-block join.
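The following Python sketch shows J1 and one possible reading of the nested-block improvement. Relations are modelled as lists of tuples and the join condition as a Python function `c(t, s)`; the function names, the `block_size` parameter, and the sample tuples are assumptions made for this example. The block variant scans S once per block of R instead of once per record of R, which is where the saving comes from.

```python
def nested_loop_join(R, S, c):
    """J1: compare every record of R with every record of S."""
    return [t + s for t in R for s in S if c(t, s)]


def nested_block_join(R, S, c, block_size):
    """Scan S once per block of R; returns the result and the number of scans of S."""
    result, scans_of_S = [], 0
    for i in range(0, len(R), block_size):
        block = R[i:i + block_size]    # one block of R held in memory
        scans_of_S += 1
        for s in S:                    # single pass over S for the whole block
            for t in block:
                if c(t, s):
                    result.append(t + s)
    return result, scans_of_S


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]
equal_first = lambda t, s: t[0] == s[0]   # an equi-join condition; any predicate works

joined = nested_loop_join(R, S, equal_first)
joined_blocked, scans = nested_block_join(R, S, equal_first, block_size=2)
```

Both variants produce the same set of result tuples, possibly in a different order.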

103 J2. Single-loop join (using an access structure to retrieve the matching records). If an index (or hash key) exists for one of the two join attributes (say, B of S), retrieve each record t in R, one at a time (single loop), and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A] (equi-join).
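A minimal Python sketch of J2, assuming a dictionary plays the role of the index (or hash key) on attribute B of S. The helper names `build_index` and `single_loop_join` and the positional attribute arguments are hypothetical.

```python
from collections import defaultdict


def build_index(S, b_pos):
    """Map each value of the join attribute B to the records of S that carry it."""
    index = defaultdict(list)
    for s in S:
        index[s[b_pos]].append(s)
    return index


def single_loop_join(R, a_pos, index_on_S):
    """One loop over R; matching S records come straight from the access structure."""
    return [t + s for t in R for s in index_on_S.get(t[a_pos], [])]


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]

index_on_S = build_index(S, b_pos=0)
joined = single_loop_join(R, a_pos=0, index_on_S=index_on_S)
```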

104 J3. Sort-merge join If the records of R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible. Both files are scanned concurrently in order of the join attributes, matching the records that have the same values for A and B. If the files are not sorted, they may be sorted first by using external sorting.
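The concurrent scan can be sketched in Python as follows. The sketch sorts both inputs first (standing in for external sorting), then advances two cursors; groups of equal join values on both sides are combined pairwise. The function name and the positional attribute arguments `a` and `b` are assumptions for this example.

```python
def sort_merge_join(R, S, a, b):
    """Equi-join of R and S on R[a] == S[b] using one coordinated scan of each input."""
    R = sorted(R, key=lambda t: t[a])
    S = sorted(S, key=lambda s: s[b])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            value = R[i][a]
            j_end = j
            while j_end < len(S) and S[j_end][b] == value:   # group of equal S records
                j_end += 1
            while i < len(R) and R[i][a] == value:           # pair each R record with it
                result.extend(R[i] + s for s in S[j:j_end])
                i += 1
            j = j_end
    return result


R = [(3, "c"), (1, "a"), (3, "d")]
S = [(3, "y"), (2, "x"), (3, "z")]
joined = sort_merge_join(R, S, a=0, b=0)
```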

105 J4. Hash-join The records of files R and S are both hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys. First, a single pass through the file with fewer records (say, R) hashes its records to the hash file buckets (partitioning phase - records of R are partitioned into the hash buckets). In the second phase (probing phase), a single pass through the other file (S) then hashes each of its records to probe the appropriate bucket, and that record is combined with all matching records from R in that bucket.
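The partitioning and probing phases can be sketched in Python like this. A list of in-memory buckets stands in for the hash file, Python's built-in `hash` stands in for the hashing function, and `num_buckets` is an arbitrary illustrative parameter; none of these choices come from the slides.

```python
def hash_join(R, S, a, b, num_buckets=8):
    """Equi-join of R and S on R[a] == S[b] using one pass over each relation."""
    buckets = [[] for _ in range(num_buckets)]

    # Partitioning phase: hash the records of the smaller file R into buckets.
    for t in R:
        buckets[hash(t[a]) % num_buckets].append(t)

    # Probing phase: hash each record of S and compare only within its bucket.
    result = []
    for s in S:
        for t in buckets[hash(s[b]) % num_buckets]:
            if t[a] == s[b]:           # different values can share a bucket
                result.append(t + s)
    return result


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z"), (11, "w")]
joined = hash_join(R, S, a=0, b=0)
```

The explicit equality test in the probing loop matters: with 8 buckets the values 3 and 11 land in the same bucket, but they must not be joined.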

106 PROJECT operation π_<attribute list>(R). Implementation:
– Straightforward if <attribute list> includes a key of relation R; the result has the same number of records as R.
– If <attribute list> does not include a key of R, duplicate tuples must be eliminated (by sorting or hashing).
– An index can be used in some cases.

107 SET operations. The CARTESIAN PRODUCT operation R × S is quite expensive, because its result includes a record for each combination of records from R and S. It can be improved by processing at the block level. UNION, INTERSECTION, and SET DIFFERENCE apply only to union-compatible relations (that have the same number of attributes and the same attribute domains). Implementation: sort-merge technique and hashing.

108 Sort-merge technique (for the SET operations). The two relations are sorted on the same attributes. After sorting, a single scan through each relation is sufficient to produce the result. For example, we can implement the UNION operation, R ∪ S, by scanning and merging both sorted files concurrently, and whenever the same tuple exists in both relations, only one is kept in the merged result. For the INTERSECTION operation, R ∩ S, we keep in the merged result only those tuples that appear in both relations.
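A minimal Python sketch of the merge step for UNION and INTERSECTION, assuming each relation is a list of tuples without internal duplicates. The function names are hypothetical, and the built-in `sorted` stands in for external sorting.

```python
def sorted_union(R, S):
    """R ∪ S by merging two sorted inputs; a tuple present in both is kept once."""
    R, S = sorted(R), sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            out.append(R[i])
            i += 1
        elif R[i] > S[j]:
            out.append(S[j])
            j += 1
        else:                      # same tuple in both relations
            out.append(R[i])
            i += 1
            j += 1
    out.extend(R[i:])              # whatever remains in either input
    out.extend(S[j:])
    return out


def sorted_intersection(R, S):
    """R ∩ S: keep only tuples that appear in both sorted inputs."""
    R, S = sorted(R), sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            out.append(R[i])
            i += 1
            j += 1
    return out


R = [(5,), (1,), (3,)]
S = [(3,), (6,), (4,), (5,)]
union = sorted_union(R, S)
intersection = sorted_intersection(R, S)
```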

109 Hashing (for the SET operations). One table is partitioned and the other is used to probe the appropriate partition. For example, to implement R ∪ S, first hash (partition) the records of R; then, hash (probe) the records of S, but do not insert duplicate records in the buckets. To implement R ∩ S, first partition the records of R to the hash file. Then, while hashing each record of S, probe to check if an identical record from R is found in the bucket, and if so add the record to the result file. To implement R - S, first hash the records of R to the hash file buckets. While hashing (probing) each record of S, if an identical record is found in the bucket, remove that record from the bucket.
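The three procedures translate almost literally into Python if the built-in `set` (itself a hash table) is allowed to stand in for the hash file and its buckets. That substitution, the function names, and the sample tuples are assumptions of this sketch.

```python
def hash_union(R, S):
    """R ∪ S: hash R, then insert the records of S; duplicates are not inserted."""
    table = set(R)
    for s in S:
        table.add(s)
    return table


def hash_intersection(R, S):
    """R ∩ S: hash R, then keep each record of S that finds an identical record."""
    table = set(R)
    return {s for s in S if s in table}


def hash_difference(R, S):
    """R - S: hash R, then remove every record that a record of S matches."""
    table = set(R)
    for s in S:
        table.discard(s)
    return table


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "b"), (4, "d")]

union = hash_union(R, S)
intersection = hash_intersection(R, S)
difference = hash_difference(R, S)
```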

110 Implementing Aggregate Operations The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an entire table, can be computed by a table scan or by using an appropriate index, if available. For example, consider the following SQL query: SELECT MAX(SALARY) FROM EMPLOYEE; If an (ascending) index on SALARY exists for the EMPLOYEE relation, then the optimizer can decide on using the index to search for the largest value by following the rightmost pointer in each index node from the root to the rightmost leaf.

111 Implementing Aggregate Operations The dense index can be used for the COUNT, AVERAGE, and SUM aggregates. The associated computation would be applied to the values in the index.

112 GROUP BY clause. When a GROUP BY clause is used in a query, the aggregate operator must be applied separately to each group of tuples. In this case, the computation is more complex: the table must first be partitioned into subsets of tuples, where each partition (group) has the same value for the grouping attributes. Sorting or hashing is used to partition the file into the appropriate groups. If a clustering index exists on the grouping attributes, then the records are already partitioned (grouped) into the appropriate subsets.
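A minimal Python sketch of hash-based grouping followed by per-group aggregation. It assumes rows are tuples, a dictionary serves as the hash partitioning structure, and the rows below are made-up `(DNO, SALARY)` pairs; the function name and the chosen set of aggregates are illustrative.

```python
from collections import defaultdict


def group_by_aggregate(rows, group_pos, value_pos):
    """Partition rows by the grouping attribute, then aggregate each partition."""
    groups = defaultdict(list)
    for row in rows:                          # hashing step: one pass over the table
        groups[row[group_pos]].append(row[value_pos])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical (DNO, SALARY) rows.
employees = [(5, 30000), (5, 40000), (4, 25000)]
summary = group_by_aggregate(employees, group_pos=0, value_pos=1)
```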

113 Questions