Published by Alfred Whitehead. Modified over 9 years ago.
1
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li
2
STEMPNU
Basic Concepts of Query
- Query: retrieve records satisfying predicates
- Types of query
  - Operators
  - Aggregate query
  - Sorting
3
Relational Operators: Select
- Selection: $\sigma_{\text{condition}}$
  - Retrieve records satisfying predicates
- Example: find Student where Student.Score > 3.5
  - $\sigma_{\text{score} > 3.5}(\text{Student})$
- Access method: index or hash on the selection predicate
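The selection example can be sketched in a few lines of Python. This is only an illustration under assumed conventions: a relation is a list of dicts, `select` is a hypothetical helper name, and the sample students are made up. It evaluates the predicate by a full scan, not through an index or hash.

```python
def select(records, predicate):
    """Return the records for which predicate(record) is true (sigma)."""
    return [record for record in records if predicate(record)]

# Hypothetical sample data; names and scores are made up for illustration.
students = [
    {"name": "Kim", "score": 3.8},
    {"name": "Lee", "score": 3.2},
    {"name": "Park", "score": 3.9},
]

# sigma_{score > 3.5}(Student), evaluated by scanning every record
high_scorers = select(students, lambda s: s["score"] > 3.5)
```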
4
Relational Operators: Project
- Project: $\pi_{\text{attributes}}$
  - Extract interesting attributes
- Example: find Student.name where score > 3.5
  - $\pi_{\text{name}}(\sigma_{\text{score} > 3.5}(\text{Student}))$
- Requires a full scan to extract the interesting attributes
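A possible sketch of projection composed with selection, again assuming a relation is a list of dicts. The helper names `select` and `project` and the sample rows are assumptions for illustration. Unlike projection in relational algebra, this sketch does not remove duplicate rows.

```python
def select(records, predicate):
    """Keep the records that satisfy the predicate (sigma)."""
    return [record for record in records if predicate(record)]

def project(records, attributes):
    """Keep only the named attributes of every record (pi)."""
    return [{a: record[a] for a in attributes} for record in records]

# Hypothetical sample data.
students = [
    {"name": "Kim", "score": 3.8},
    {"name": "Lee", "score": 3.2},
    {"name": "Park", "score": 3.9},
]

# pi_{name}(sigma_{score > 3.5}(Student))
names = project(select(students, lambda s: s["score"] > 3.5), ["name"])
```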
5
Cartesian Product
- Cartesian product: $R_1 \times R_2$
  - Two tables: $R_1$, $R_2$
  - Produce all cross products
- Join: $\bowtie$
- With $R_1 = \{r_{11}, r_{12}, \ldots, r_{1m}\}$ and $R_2 = \{r_{21}, r_{22}, \ldots, r_{2n}\}$:
  $R_1 \times R_2 = \{(r_{1i}, r_{2j}) : 1 \le i \le m,\ 1 \le j \le n\}$, that is, $m \cdot n$ combined records
6
Join
- Join: $\bowtie$
  - Select the combined records of the Cartesian product that have the same value of a common attribute (natural join)
- Example
  - Student(StudentName, AdvisorProfessorID, Department, Score)
  - Professor(ProfessorName, ProfessorID, Department)
  - $\text{Student} \bowtie_{\text{AdvisorProfessorID} = \text{ProfessorID}} \text{Professor} = \sigma_{\text{AdvisorProfessorID} = \text{ProfessorID}}(\text{Student} \times \text{Professor})$
- Double scan: expensive operation
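The identity "join = selection over the Cartesian product" can be illustrated as follows. This is a sketch, not an efficient join: the helper names and sample rows are assumptions, and the Department attribute is left out of the sample data so that merging two dicts does not overwrite a shared attribute name.

```python
from itertools import product

def cartesian_product(r1, r2):
    """Pair every record of r1 with every record of r2."""
    return [{**a, **b} for a, b in product(r1, r2)]

def join(r1, r2, condition):
    """Join as a selection applied to the Cartesian product."""
    return [t for t in cartesian_product(r1, r2) if condition(t)]

# Hypothetical sample data.
students = [
    {"StudentName": "Kim", "AdvisorProfessorID": 1},
    {"StudentName": "Lee", "AdvisorProfessorID": 2},
]
professors = [
    {"ProfessorName": "Li", "ProfessorID": 1},
    {"ProfessorName": "Cho", "ProfessorID": 2},
]

advised = join(students, professors,
               lambda t: t["AdvisorProfessorID"] == t["ProfessorID"])
```

The product holds 2 × 2 = 4 combined records, of which the selection keeps 2; every pair is examined, which is why the join is expensive.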
7
Relational Algebra
- Operand: table (relation)
- Operator: relational operator ($\sigma$, $\pi$, $\bowtie$, etc.)
- Example: find Student name where Student score > 3.5 and the advisor professor belongs to the CSE department
  - $\pi_{\text{student.name}}(\sigma_{\text{score} > 3.5}(\text{Student}) \bowtie \sigma_{\text{Department} = \text{'CSE'}}(\text{Professor}))$
- Relational algebra specifies the sequence of operations
8
Query Processing Mechanism
- Query processing steps
  1. Parsing and translation
  2. Optimization
  3. Evaluation
9
Parsing and Translation
- Parsing a query statement (e.g. in SQL)
- Translation into relational algebra
- Equivalent expressions
  - For the same query statement, several relational algebra expressions are possible
  - Example: $\sigma_{\text{balance} < 2500}(\pi_{\text{name}}(\text{account}))$ and $\pi_{\text{name}}(\sigma_{\text{balance} < 2500}(\text{account}))$ (the first form works only if the projection also keeps balance)
  - Different expressions lead to different execution schedules
- Query Execution Plan (QEP)
  - Determined by the relational algebra expression
  - Several QEPs may be produced by parsing and translation
10
Query Optimization
- Choose ONE QEP among the candidate QEPs based on the execution cost of each QEP, where cost means execution time
- How to find the cost of each QEP?
  - Real execution: exact but not feasible
  - Cost estimation, based on
    - Types of operations
    - Number of records
    - Selectivity
    - Distribution of data
11
Cost Model: Basic Concepts
- Cost model: number of block accesses
- Cost $C = C_{\text{index}} + C_{\text{data}}$, where
  - $C_{\text{index}}$: cost of index access
  - $C_{\text{data}}$: cost of data block retrieval
- $C_{\text{index}}$ vs. $C_{\text{data}}$
  - $C_{\text{index}}$ depends on the index
  - $C_{\text{data}}$ depends on selectivity and on random vs. sequential access
- Selectivity: number (or ratio) of objects selected by the query
12
Cost Model: Type of Operations
- Cost model for each type of operation
  - Select
  - Project
  - Join
  - Aggregate query
- Query processing method for each type of operation
  - Index/hash or not
13
Cost Model: Number of Records
- Number of records $N_{\text{record}}$ determines the number of blocks $N_{\text{blocks}}$
- Number of scans
  - Single scan
    - $O(N)$: linear scan
    - $O(\log N)$: index
  - Multiple scans
    - $O(NM)$: multiple linear scans
    - $O(N \log M)$: multiple scans with index
14
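Selectivity
- Affects $C_{\text{data}}$
- Random access
  - Selected records are scattered over several blocks
  - $N_{\text{block}} \approx N_{\text{selected}}$ (up to one block access per selected record)
- Sequential access
  - Selected records are stored contiguously on blocks
  - $N_{\text{block}} = N_{\text{selected}} / Bf$, where $Bf$ is the blocking factor (records per block)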
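The effect of selectivity on the data-access cost can be made concrete with a small calculation. This is a sketch under assumptions: the function names are hypothetical, the random-access case is taken at its upper bound of one block per selected record, and the sequential case rounds up to whole blocks (a rounding the slide formula leaves implicit).

```python
import math

def data_blocks_random(n_selected):
    """Scattered records: up to one block access per selected record."""
    return n_selected

def data_blocks_sequential(n_selected, blocking_factor):
    """Contiguous records: N_selected / Bf, rounded up to whole blocks."""
    return math.ceil(n_selected / blocking_factor)

def total_cost(c_index, c_data):
    """C = C_index + C_data, both counted in block accesses."""
    return c_index + c_data

# Example: 200 selected records, 25 records per block, 3 index block accesses.
cost_scattered = total_cost(3, data_blocks_random(200))
cost_clustered = total_cost(3, data_blocks_sequential(200, 25))
```

With these made-up numbers the scattered case costs 203 block accesses and the clustered case 11, which is the gap the slide attributes to random versus sequential access.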
15
Selectivity Estimation
- Depends on data distribution
  - Example Q1: find students where 60 < weight < 70
  - Example Q2: find students where 80 < weight < 90
- How to find the distribution
  - Parametric method
    - e.g. Gaussian distribution
    - No a priori knowledge
  - Non-parametric method
    - e.g. histogram
    - Smoothing is necessary
    - Wavelet, discrete cosine
[Figure: histogram of frequency over weight, horizontal axis from 30 to 100]
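A histogram-based (non-parametric) estimate for queries such as Q1 and Q2 might look like the sketch below. It assumes an equal-width histogram and a uniform distribution inside each bucket; the function names and the sample weights are made up for illustration, and no smoothing is applied.

```python
def build_histogram(values, low, high, n_buckets):
    """Count the values falling into n_buckets equal-width buckets on [low, high)."""
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        if low <= v < high:
            bucket = min(int((v - low) / width), n_buckets - 1)
            counts[bucket] += 1
    return counts

def estimate_selectivity(counts, low, high, q_low, q_high):
    """Estimate the fraction of records with q_low < value < q_high."""
    total = sum(counts)
    if total == 0:
        return 0.0
    width = (high - low) / len(counts)
    selected = 0.0
    for i, count in enumerate(counts):
        b_low = low + i * width
        b_high = b_low + width
        # Part of this bucket covered by the query range
        overlap = max(0.0, min(b_high, q_high) - max(b_low, q_low))
        selected += count * overlap / width
    return selected / total

weights = [62, 65, 68, 71, 85]            # hypothetical student weights
hist = build_histogram(weights, 30, 100, 7)
q1 = estimate_selectivity(hist, 30, 100, 60, 70)
q2 = estimate_selectivity(hist, 30, 100, 80, 90)
```

Both queries span a range of the same width, yet the estimates differ because the data are not uniformly distributed.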
16
Select: Linear Search
- Algorithm: linear search
  - Scan each file block and test all records to see whether they satisfy the selection condition
- Cost estimate (number of disk blocks scanned) = $b_r$
  - $b_r$ denotes the number of blocks containing records from relation $r$
- If the selection is on a key attribute (sorted), average cost = $b_r / 2$: stop on finding the record
- Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices
17
Select: Range Search
- Algorithm: primary index, comparison (relation is sorted on $A$)
  - For $\sigma_{A \ge V}(r)$
    - Step 1: use the index to find the first tuple $\ge v$
    - Step 2: scan the relation sequentially from there
  - For $\sigma_{A \le V}(r)$
    - Just scan the relation sequentially until the first tuple $> v$; do not use the index
- Algorithm: secondary index, comparison
  - For $\sigma_{A \ge V}(r)$
    - Step 1: use the index to find the first index entry $\ge v$
    - Step 2: scan the index sequentially from there to find pointers to records
  - For $\sigma_{A \le V}(r)$
    - Scan the leaf nodes of the index, finding pointers to records, until the first entry $> v$
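The two primary-index cases can be sketched on a sorted list. This is an illustration under assumptions: the sorted Python list stands in for a file sorted on A, `bisect.bisect_left` stands in for the index lookup, and the function names are hypothetical.

```python
import bisect

def select_ge(sorted_values, v):
    """A >= v: locate the first value >= v, then scan sequentially to the end."""
    start = bisect.bisect_left(sorted_values, v)  # plays the role of the index lookup
    return sorted_values[start:]

def select_le(sorted_values, v):
    """A <= v: scan from the beginning until the first value > v; no index needed."""
    result = []
    for value in sorted_values:
        if value > v:
            break
        result.append(value)
    return result

weights = [55, 60, 62, 70, 70, 88]  # hypothetical file sorted on weight
```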
18
Select: Range Search
- Comparison between searching with an index and linear search
  - Secondary index: retrieving the records that are pointed to requires an I/O for each record
  - A linear file scan may be cheaper if the records are scattered over many blocks
  - Clustering is important for this reason
19
Select: Complex Query
- Conjunction: $\sigma_{\theta_1 \wedge \theta_2 \wedge \ldots \wedge \theta_n}(r)$
  - Algorithm: selection using one index
    - Step 1: select one condition $\theta_i$ and evaluate $\sigma_{\theta_i}(r)$ with its index
    - Step 2: test the other conditions on each tuple after fetching it into the memory buffer
  - Algorithm: selection using a multiple-key index
    - Use an appropriate multiple-attribute index if available
  - Algorithm: selection by intersection of identifiers
    - Step 1: use the index of each condition to obtain a set of record pointers (requires indices with record pointers)
    - Step 2: intersect all the obtained sets of record pointers
    - Step 3: fetch the records from the file
- Disjunction: $\sigma_{\theta_1 \vee \theta_2 \vee \ldots \vee \theta_n}(r)$
  - Algorithm: disjunctive selection by union of identifiers
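Selection by intersection (or union) of identifiers can be sketched as follows. The assumptions are stated loudly: every condition is an equality test, an "index" is a plain dict from attribute value to a set of record positions, and all function names and sample rows are hypothetical.

```python
def build_index(records, attribute):
    """Map each attribute value to the set of record identifiers holding it."""
    index = {}
    for rid, record in enumerate(records):
        index.setdefault(record[attribute], set()).add(rid)
    return index

def pointer_sets(indexes, conditions):
    """One set of record identifiers per (attribute, value) condition."""
    return [indexes[attr].get(value, set()) for attr, value in conditions]

def select_conjunction(records, indexes, conditions):
    """Intersect the identifier sets, then fetch the records."""
    sets = pointer_sets(indexes, conditions)
    rids = set.intersection(*sets) if sets else set(range(len(records)))
    return [records[rid] for rid in sorted(rids)]

def select_disjunction(records, indexes, conditions):
    """Union the identifier sets, then fetch the records."""
    rids = set().union(*pointer_sets(indexes, conditions))
    return [records[rid] for rid in sorted(rids)]

students = [
    {"name": "Kim", "dept": "CSE", "year": 2},
    {"name": "Lee", "dept": "CSE", "year": 3},
    {"name": "Park", "dept": "EE", "year": 3},
]
indexes = {a: build_index(students, a) for a in ("dept", "year")}
both = select_conjunction(students, indexes, [("dept", "CSE"), ("year", 3)])
either = select_disjunction(students, indexes, [("dept", "EE"), ("year", 2)])
```

Records are fetched only after the identifier sets have been combined, so no record that fails a condition is read from the file.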
20
Join Operation
- Several different algorithms to implement joins
  - Nested-loop join
  - Block nested-loop join
  - Indexed nested-loop join
  - Merge-join
  - Hash-join
- Choice based on cost estimate
- Examples use the following information
  - Number of records: customer 10,000; depositor 5,000
  - Number of blocks: customer 400; depositor 100
21
Nested-Loop Join
- Algorithm NLJ: compute the theta join $r \bowtie_{\theta} s$
    for each tuple t_r in r do begin
        for each tuple t_s in s do begin
            test pair (t_r, t_s) to see if they satisfy the join condition
            if they do, add t_r · t_s to the result
        end
    end
- $r$: outer relation, $s$: inner relation
- Needs no indices and works with any kind of join condition
- Expensive
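The NLJ pseudocode translates almost line by line into Python. This is a minimal in-memory sketch: relations are lists of dicts, `theta` is any two-argument predicate, and the sample rows are made up. It says nothing about disk blocks.

```python
def nested_loop_join(r, s, theta):
    """Theta join of r (outer) and s (inner) by testing every pair of tuples."""
    result = []
    for t_r in r:                      # outer relation
        for t_s in s:                  # inner relation, rescanned for every t_r
            if theta(t_r, t_s):
                result.append({**t_r, **t_s})
    return result

# Hypothetical sample data.
depositor = [
    {"customer_name": "Kim", "account": "A-1"},
    {"customer_name": "Lee", "account": "A-2"},
]
customer = [
    {"customer_name": "Kim", "city": "Busan"},
    {"customer_name": "Park", "city": "Seoul"},
]
joined = nested_loop_join(
    depositor, customer,
    lambda d, c: d["customer_name"] == c["customer_name"],
)
```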
22
Nested-Loop Join: Performance
- Worst case: if there is only enough memory to hold one block of each relation, the estimated cost is $n_r \cdot b_s + b_r$ disk accesses
- Example
  - 5,000 × 400 + 100 = 2,000,100 disk accesses with depositor as the outer relation
  - 10,000 × 100 + 400 = 1,000,400 disk accesses with customer as the outer relation
- If the smaller relation fits entirely in memory, use it as the inner relation
  - Reduces the cost to $b_r + b_s$ disk accesses
  - If the smaller relation (depositor) fits entirely in memory, the cost estimate is 400 + 100 = 500 disk accesses
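The nested-loop cost formulas can be checked against the running customer/depositor figures (10,000 and 5,000 records, 400 and 100 blocks). The function names below are hypothetical; the formulas are the ones stated on the slide.

```python
def nlj_cost_worst_case(n_r, b_r, b_s):
    """n_r * b_s + b_r: the inner relation is rescanned for every outer tuple."""
    return n_r * b_s + b_r

def nlj_cost_inner_in_memory(b_r, b_s):
    """b_r + b_s: each relation is read exactly once."""
    return b_r + b_s

# depositor: 5,000 records in 100 blocks; customer: 10,000 records in 400 blocks
depositor_outer = nlj_cost_worst_case(n_r=5000, b_r=100, b_s=400)
customer_outer = nlj_cost_worst_case(n_r=10000, b_r=400, b_s=100)
depositor_in_memory = nlj_cost_inner_in_memory(b_r=400, b_s=100)
```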
23
Block Nested-Loop Join
- Algorithm BNLJ
    for each block B_r of r do
        get block B_r                          (disk access happens here)
        for each block B_s of s do
            get block B_s                      (disk access happens here)
            for each tuple t_r in B_r do       (no disk access required in the tuple loops)
                for each tuple t_s in B_s do
                    check if (t_r, t_s) satisfy the join condition
                    if they do, add t_r · t_s to the result
                end
            end
        end
    end
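A sketch of BNLJ that also counts simulated block reads, so the point about where disk access happens can be checked. Everything here is an assumption for illustration: a "block" is a slice of `block_size` tuples from an in-memory list, and `block_reads` only simulates I/O.

```python
def blocks(relation, block_size):
    """Yield the relation one block (a list of tuples) at a time."""
    for start in range(0, len(relation), block_size):
        yield relation[start:start + block_size]

def block_nested_loop_join(r, s, theta, block_size):
    """Return the join result and the number of simulated block reads."""
    result, block_reads = [], 0
    for block_r in blocks(r, block_size):
        block_reads += 1                   # get block B_r
        for block_s in blocks(s, block_size):
            block_reads += 1               # get block B_s
            for t_r in block_r:            # tuple loops touch memory only
                for t_s in block_s:
                    if theta(t_r, t_s):
                        result.append({**t_r, **t_s})
    return result, block_reads

r = [{"a": i} for i in range(4)]   # 2 blocks of 2 tuples
s = [{"b": i} for i in range(6)]   # 3 blocks of 2 tuples
pairs, reads = block_nested_loop_join(r, s, lambda x, y: x["a"] == y["b"], 2)
```

With 2 outer blocks and 3 inner blocks the counter reaches 2 × 3 + 2 = 8 reads, matching the pattern of one inner scan per outer block.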
24
Block Nested-Loop Join: Performance
- Worst-case estimate: $b_r \cdot b_s + b_r$ block accesses
  - Each block of the inner relation $s$ is read once for each block of the outer relation (instead of once for each tuple of the outer relation)
- Improvement: if $M$ blocks can be buffered
  - Use $M - 2$ disk blocks as the blocking unit for the outer relation, and the remaining two blocks to buffer the inner relation and the output
  - The cost becomes $\lceil b_r / (M - 2) \rceil \cdot b_s + b_r$
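The buffered cost formula is easy to evaluate for different memory sizes. In this sketch the function name is hypothetical and the value M = 52 is an arbitrary choice; the block counts are the running depositor (100 blocks, outer) and customer (400 blocks, inner) figures.

```python
import math

def bnlj_cost(b_r, b_s, m=3):
    """ceil(b_r / (M - 2)) * b_s + b_r block accesses; M = 3 is the worst case."""
    if m < 3:
        raise ValueError("need at least 3 buffer blocks")
    return math.ceil(b_r / (m - 2)) * b_s + b_r

worst_case = bnlj_cost(b_r=100, b_s=400)         # one block per relation plus output
with_52_buffers = bnlj_cost(b_r=100, b_s=400, m=52)
```

With M = 3 the formula reduces to b_r · b_s + b_r; with 52 buffer blocks the inner relation is scanned only twice.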
25
Indexed Nested-Loop Join
- Index lookups can replace file scans if
  - the join is an equi-join or natural join, and
  - an index is available on the inner relation's join attribute
- An index can be constructed just to compute a join
- Algorithm INLJ
    for each block B_r of r do
        get block B_r
        for each tuple t_r in B_r do
            search index (IDX_s, t_r.key), where IDX_s is the index on the inner relation s
            if found, add t_r · t_s to the result
        end
    end
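An indexed nested-loop join can be sketched with a Python dict standing in for the index on the inner relation. That substitution is an assumption of this sketch: a real system would use a B+-tree or hash index on disk, and the function names and sample rows are hypothetical.

```python
def build_index(s, key):
    """Index on the inner relation: join-attribute value -> matching tuples."""
    index = {}
    for t_s in s:
        index.setdefault(t_s[key], []).append(t_s)
    return index

def indexed_nested_loop_join(r, s_index, r_key):
    """For each outer tuple, look up matching inner tuples instead of scanning s."""
    result = []
    for t_r in r:
        for t_s in s_index.get(t_r[r_key], []):   # may match more than one tuple
            result.append({**t_r, **t_s})
    return result

# Hypothetical sample data.
depositor = [
    {"customer_name": "Kim", "account": "A-1"},
    {"customer_name": "Kim", "account": "A-3"},
    {"customer_name": "Lee", "account": "A-2"},
]
customer = [
    {"customer_name": "Kim", "city": "Busan"},
    {"customer_name": "Park", "city": "Seoul"},
]
customer_index = build_index(customer, "customer_name")
joined = indexed_nested_loop_join(depositor, customer_index, "customer_name")
```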
26
Indexed Nested-Loop Join: Performance
- Worst case: the buffer has space for only one page of $r$
- Cost of the join: $b_r + n_r \cdot c$
  - $c$ is the cost of traversing the index and fetching the matching tuples for one tuple of $r$
  - The number of matching tuples may be greater than one
- If indices are available on the join attributes of both $r$ and $s$, use the relation with fewer tuples as the outer relation
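The cost formula b_r + n_r · c can be evaluated directly. In this sketch the function name is hypothetical, and c = 5 assumes a B+-tree of height 4 plus one access to fetch the data, as in the deck's depositor/customer example.

```python
def inlj_cost(b_r, n_r, c):
    """b_r + n_r * c: read the outer relation once, one index probe per outer tuple."""
    return b_r + n_r * c

tree_height = 4
probe_cost = tree_height + 1          # index traversal plus one data access
depositor_outer = inlj_cost(b_r=100, n_r=5000, c=probe_cost)
```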
27
Example of Nested-Loop Join Costs
- Assume depositor $\bowtie$ customer, with depositor as the outer relation
- customer has a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node
- customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
- depositor has 5,000 tuples
- Cost of block nested-loop join
  - 400 × 100 + 100 = 40,100 disk accesses, assuming worst-case memory
- Cost of indexed nested-loop join
  - 100 + 5,000 × 5 = 25,100 disk accesses
28
Hash-Join
- Applicable for equi-joins and natural joins
- A hash function $h$ is used to partition the tuples of both relations
  - $h : A \to \{0, 1, \ldots, n\}$
  - $r_0, r_1, \ldots, r_n$: partitions of the $r$ tuples
  - $s_0, s_1, \ldots, s_n$: partitions of the $s$ tuples
- The $r$ tuples in $r_i$ need only be compared with the $s$ tuples in $s_i$
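The partitioning idea can be sketched in memory as follows. The assumptions: Python's built-in `hash` plays the role of h, `n_partitions` is the number of partitions (n + 1 in the slide's notation), matching inside each partition pair uses a dict, and the function names and sample rows are hypothetical.

```python
def partition(relation, key, n_partitions):
    """Split a relation into partitions by hashing the join attribute."""
    parts = [[] for _ in range(n_partitions)]
    for t in relation:
        parts[hash(t[key]) % n_partitions].append(t)
    return parts

def hash_join(r, s, r_key, s_key, n_partitions=4):
    """Equi-join: tuples in r_i are compared only with tuples in s_i."""
    result = []
    r_parts = partition(r, r_key, n_partitions)
    s_parts = partition(s, s_key, n_partitions)
    for r_i, s_i in zip(r_parts, s_parts):
        table = {}                               # in-memory table built on s_i
        for t_s in s_i:
            table.setdefault(t_s[s_key], []).append(t_s)
        for t_r in r_i:                          # probe with the tuples of r_i
            for t_s in table.get(t_r[r_key], []):
                result.append({**t_r, **t_s})
    return result

# Hypothetical sample data.
depositor = [
    {"customer_name": "Kim", "account": "A-1"},
    {"customer_name": "Lee", "account": "A-2"},
    {"customer_name": "Park", "account": "A-3"},
]
customer = [
    {"customer_name": "Kim", "city": "Busan"},
    {"customer_name": "Park", "city": "Seoul"},
]
joined = hash_join(depositor, customer, "customer_name", "customer_name")
```

Because equal join values always hash to the same partition number, no match can be missed by comparing only r_i with s_i; the order of the result depends on the hash values.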