File Processing : Query Processing

Slides:



Advertisements
Similar presentations
Copyright © 2011 Ramez Elmasri and Shamkant Navathe Algorithms for SELECT and JOIN Operations (8) Implementing the JOIN Operation: Join (EQUIJOIN, NATURAL.
Advertisements

SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Query Processing (overview)
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Dr. Kalpakis CMSC 461, Database Management Systems Query Processing.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
Department of Computer Science and Engineering, HKUST Slide Query Processing and Optimization Query Processing and Optimization.
Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.
12.1Database System Concepts - 6 th Edition Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Join Operation Sorting 、 Other.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan Chapter 13: Query Processing.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 13: Query Processing.
Chapter 13: Query Processing Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join Operation Other Operations.
Computing & Information Sciences Kansas State University Tuesday, 03 Apr 2007CIS 560: Database System Concepts Lecture 29 of 42 Tuesday, 03 April 2007.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Query Processing.
Lecture 1- Query Processing Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Chapter 12 Query Processing. Query Processing n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation of Expressions 2.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Chapter 13: Query Processing
Advance Database Systems Query Optimization Ch 15 Department of Computer Science The University of Lahore.
Computing & Information Sciences Kansas State University Monday, 03 Nov 2008CIS 560: Database System Concepts Lecture 27 of 42 Monday, 03 November 2008.
Computing & Information Sciences Kansas State University Wednesday, 08 Nov 2006CIS 560: Database System Concepts Lecture 32 of 42 Monday, 06 November 2006.
13.1 Chapter 13: Query Processing n Overview n Measures of Query Cost n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation.
Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Chapter 13: Query Processing. Overview Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Chapter 13: Query Processing
Chapter 4: Query Processing
CS 440 Database Management Systems
Database Management System
Chapter 13: Query Processing
Chapter 12: Query Processing
Chapter 12: Query Processing
Query Processing.
Chapter 13: Query Processing
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
File Processing : Query Processing
Dynamic Hashing Good for database that grows and shrinks in size
Query Processing and Optimization
Yan Huang - CSCI5330 Database Implementation – Access Methods
Query Processing.
Chapter 13: Query Processing
CPT-S 415 Big Data Yinghui Wu EME B45 1.
Query processing and optimization
Chapter 13: Query Processing
Chapter 12: Query Processing
Chapter 13: Query Processing
Module 13: Query Processing
Chapter 13: Query Processing
Chapter 12: Query Processing
Lecture 2- Query Processing (continued)
Query Processing Overview Measures of Query Cost Selection Operation
Advance Database Systems
Chapter 13: Query Processing
Chapter 12 Query Processing (1)
Overview of Query Evaluation
Chapter 13: Query Processing
Evaluation of Relational Operations: Other Techniques
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
Chapter 13: Query Processing
Chapter 13: Query Processing
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Chapter 13: Query Processing
Chapter 13: Query Processing
Chapter 12: Query Processing
Evaluation of Relational Operations: Other Techniques
Presentation transcript:

File Processing : Query Processing 2018, Spring Pusan National University Ki-Joune Li

Basic Concepts of Query Retrieve records satisfying predicates Types of Query Operators Aggregate Query Sorting

Relational Operators : Select Selection (condition) Retrieve records satisfying predicates Example Find Student where Student.Score > 3.5 score>3.5(Student) Index or Hash Predicate Select

Relational Operators : Project Project (attributes) Extract interesting attributes Example Find Student.name where score > 3.5 name(acore>3.5(Student)) Full Scan Interesting attributes to get Extract

Cartesian Product Cartesian Product () Join ( ) Two Tables : R1  R2 Produce all cross products Join ( ) r11 … r21 r22 r2n r12 r1m r11 r12 … r1m R1 r21 r22 … r2n R2  =

Join Join ( ) Select combined records of cartesian product with same value of a common attribute (Natural Join) Example Student (StudentName, AdvisorProfessorID, Department, Score) Professor(ProfessorName, ProfessorID, Department) Student AdivsorProfessorID=ProfessorID Professor =  AdivsorProfessorID=ProfessorID(Student Professor) Double Scan : Expensive Operation

Relational Algebra Relational Algebra Operand : Table (Relation) Operator : Relational Operator (, , , etc) Example: SQL and relational algebra find Student.Name from Student, Professor where Student.Score > 3.5 and Student.AdvisorProfessorID=Professor.ID and Professor.Department=‘CSE’ student.name(score>3.5(Student) Department=‘CSE’ (Professor) ) Relational Algebra Specifies the sequence of operations   

Query Processing Mechanism Query Processing Steps 1. Parsing and translation 2. Optimization 3. Evaluation

Parsing and Translation Parsing Query Statement (e.g. in SQL) Translation into relational algebra Equivalent Expression For a same query statement several relation algebraic expressions are possible Example name(balance  2500(account )) name (balance  2500(name, balance (account ))) Different execution schedules Query Execution Plan (QEP) Determined by relational algebra Several QEPs may be produced by Parsing and Translation

Query Optimization Choose ONE QEP among QEPs based on Execution Cost of each QEP, where cost means execution time How to find cost of each QEP ? Real Execution Exact but Not Feasible Cost Estimation Types of Operations Number of Records Selectivity Distribution of data

Cost Model : Basic Concepts Cost Model : Number of Block Accesses Cost C = Cindex + Cdata where Cindex : Cost for Index Access Cdata : Cost for Data Block Retrieval Cindex vs. Cdata ? Cindex : depends on index Cdata depends on selectivity Random Access or Sequential Access Selectivity Number (or Ratio) of Objects Selected by Query

Cost Model : Type of Operations Cost model for each type of operations Select Project Join Aggregate Query Query Processing Method for each type of operations Index/Hash or Not

Cost Model : Number of Records Nrecord  Nblocks Number of Scans Single Scan O(N) : Linear Scan O(logN ) : Index Multiple Scans O(NM ) : Multiple Linear Scans O(N logM ) : Multiple Scans with Index

Selectivity Selectivity Affects on Cdata Random Access Scattered on several blocks Nblock  Nselected Sequential Access Contiguously stored on blocks Nblock = Nselected / Bf

Selectivity Estimation Depends on Data Distribution Example Q1 : Find students where 60 < weight < 70 Q2 : Find students where 80 < weight < 90 How to find the distribution Parametric Method e.g. Gaussian Distribution No a priori knowledge Non-Parametric Method e.g. Histogram Smoothing is necessary Wavelet, Discrete Cosine 30 40 50 60 70 80 90 100 Frequency

Select : Linear Search Algorithm : linear search Scan each file block and test all records to see whether they satisfy the selection condition. Cost estimate (number of disk blocks scanned) = br br denotes number of blocks containing records from relation r If selection is on a key attribute (sorted), cost = (br /2) stop on finding record Linear search can be applied regardless of selection condition or ordering of records in the file, or availability of indices

Select : Range Search Algorithm : primary index, comparison Relation is sorted on A For A  V (r) Step 1: use index to find first tuple  v and Step 2: scan relation sequentially For AV (r) just scan relation sequentially till first tuple > v; do not use index Algorithm : secondary index, comparison Step 1: use index to find first index entry  v and Step 2: scan index sequentially to find pointers to records. scan leaf nodes of index finding pointers to records, till first entry > v

Select : Range Search Comparison between Secondary Index Searching with Index and Linear Search Secondary Index retrieval of records that are pointed to requires an I/O for each record Linear file scan may be cheaper if records are scattered on many blocks clustering is important for this reason

Select : Complex Query Conjunction : 1  2 . . . n(r) Algorithm : selection using one index Step 1: Select a condition of i (i (r) ) Step 2: Test other conditions on tuple after fetching it into memory buffer. Algorithm : selection using multiple-key index Use appropriate multiple-attribute index if available. Algorithm : selection by intersection of identifiers Step 1: Requires indices with record pointers. Step 2: Intersection of all the obtained sets of record pointers. Step 3: Then fetch records from file Disjunction : 1 2 . . . n (r) Algorithm : Disjunctive selection by union of identifiers

Join Operation Several different algorithms to implement joins Nested-loop join Block nested-loop join Indexed nested-loop join Merge-join Hash-join Choice based on cost estimate Examples use the following information Number of records of (S)customer: 10,000 (R)depositor: 5000 Number of blocks of customer: 400 depositor: 100 Blocking factors of customer : 250 depositor: 50

Nested-Loop Join Algorithm NLJ the theta join r  s For each tuple tr in r do begin For each tuple ts in s do begin test pair (tr,ts) to see if they satisfy the join condition  if they do, add tr • ts to the result. End End r : outer relation, s : inner relation. No indices, any kind of join condition. Expensive

Example: Nested-Loop Join B1 s1 B1 r1 s2 r2 … … r50 s250 B2 r51 B2 s251 r52 … … s500 r100 … … B400 s9751 B100 r4951 s9752 r4952 … … s10000 r5000

Nested-Loop Join : Performance Worst case the estimated cost is nr  bs + br disk accesses, if not enough memory only to hold one block of each relation, Example 5000  400 + 100 = 2,000,100 disk accesses with depositor as outer relation, and 1000  100 + 400 = 1,000,400 disk accesses with customer as the outer relation. If the smaller relation fits entirely in memory, use that as the inner relation. Reduces cost to br + bs disk accesses. If smaller relation (depositor) fits entirely in memory, cost estimate will be 500 disk accesses.

Block Nested-Loop Join Algoritm BNLJ For each block Br of r do Get Block Br For each block Bs of s do Get Block Bs For each tuple tr in Br do For each tuple ts in Bs do Check if (tr, ts) satisfy the join condition if they do, add tr • ts to the result. End End End End No disk access required Disk access happens here No disk access required

Example: Block-Oriented Nested-Loop Join … … r50 s250 B2 r51 B2 s251 r52 … … s500 r100 … … B400 s9751 B100 r4951 s9752 r4952 … … s10000 r5000

Block Nested-Loop Join : Performance Worst case Estimate: br  bs + br block accesses. Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation) Improvements : If M blocks can be buffered use (M-2) disk blocks as blocking unit for outer relations, use remaining two blocks to buffer inner relation and output Then the cost becomes br / (M-2)  bs + br

Indexed Nested-Loop Join Index lookups can replace file scans if join is an equi-join or natural join and an index is available on the inner relation’s join attribute Can construct an index just to compute a join. Algorithm INLJ For each block Br of r do Get Block Br For each tuple tr in Br do Search Index (IDXr , tr.key) if found, add tr • ts to the result. End End

Indexed Nested-Loop Join : Performance Worst case buffer has space for only one page of r, Cost of the join: br + nr  c Where c is the cost of traversing index and fetching matching tuple Number of matching tuples may be greater than one. If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation

Example of Nested-Loop Join Costs Assume depositor customer, with depositor as the outer relation. customer have a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node. customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data Depositor has 5000 tuples Cost of block nested loops join 400*100 + 100 = 40,100 disk accesses assuming worst case memory Cost of indexed nested loops join 100 + 5000 * 5 = 25,100 disk accesses.

Hash-Join Applicable for equi-joins and natural joins. A hash function h is used to partition tuples of both relations h : A→ { 0, 1, ..., n } r0, r1, . . ., rn : partitions of r tuples s0, s1. . ., sn : partitions of s tuples r tuples in ri need only to be compared with s tuples in si .