Distributed Query Processing Using Different Semijoin Operations.
Presentation Outline: 1. Overview. 2. Semijoin operation. 3. Different semijoin operations: a. 2-way semijoin. b. Hash-semijoin.
1.1 What is a distributed database system? A distributed database system is characterized by the distribution of its system components: hardware, control, and data. For this research, a distributed system is a collection of independent computers interconnected via point-to-point communication lines.
1.2 Node characteristics: Each computer, known as a node in the network, has a processing capability and a data storage capability, and is capable of operating autonomously in the system. Each node contains a version of a distributed DBMS.
1.3 What is distributed query processing? The retrieval of data from different sites in a network is known as distributed query processing.
1.4 Phases of distributed query processing with a semijoin operator.
1. Initial local processing: selections and projections are processed at each site.
2. Semijoin processing: a semijoin program is derived from the remaining join operations and executed to reduce the size of the relations in a cost-effective way.
3. Final processing: all relations involved are transmitted to the final site and all joins are performed there. (qs: query site)
2.1 Semijoin: A semijoin from Ri to Rj on attribute A is denoted Rj ⋉ Ri. It is used to reduce the data transmission cost.
Computing steps:
1) Project Ri on attribute A to obtain Ri[A], and ship this projection (the semijoin projection) from the site of Ri to the site of Rj.
2) Reduce Rj to Rj' by eliminating the tuples whose A value does not match any value in Ri[A].
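2.2 Example (semijoin s: R1 -A-> R2):
Site 1 holds R1(A, B): three tuples with A = 1, 2, 3; the tuple with A = 3 is (3, 6).
Site 2 holds R2(A, C): (3, 7), (4, 8), (5, 9).
1) Projection: R1[A] = {1, 2, 3} is shipped from Site 1 to Site 2: Ship(3).
2) Reduce: R2' = {(3, 7)}.
Final processing: R1 is shipped to qs, Ship(6), and R2' is shipped to qs, Ship(2).
Benefit: B(s) = S(R2) - S(R2') = 6 - 2 = 4
Cost: C(s) = S(R1[A]) = 3
Cost-effectiveness: D(s) = B(s) - C(s) = 4 - 3 = 1 > 0, so s is cost-effective.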
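The two computing steps and the benefit/cost figures of this example can be reproduced with a minimal Python sketch. The function names `size` and `semijoin` are illustrative, relations are modelled as lists of dicts, and S(R) is taken to be the number of attribute values, which is the reading that matches the Ship(n) figures on the slide. The B values 4 and 5 in R1 are placeholders assumed for illustration; they do not affect the result.

```python
def size(relation):
    """S(R): number of attribute values (tuples x attributes)."""
    return sum(len(t) for t in relation)


def semijoin(r_i, r_j, attr):
    """Semijoin from r_i to r_j on attr.

    Returns the reduced r_j and the projection that had to be shipped.
    """
    projection = {t[attr] for t in r_i}                   # step 1: R_i[A]
    reduced = [t for t in r_j if t[attr] in projection]   # step 2: R_j'
    return reduced, projection


# B = 4 and B = 5 are assumed placeholder values.
R1 = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 3, "B": 6}]
R2 = [{"A": 3, "C": 7}, {"A": 4, "C": 8}, {"A": 5, "C": 9}]

R2_reduced, shipped = semijoin(R1, R2, "A")
benefit = size(R2) - size(R2_reduced)   # B(s) = 6 - 2
cost = len(shipped)                     # C(s) = S(R1[A])
print(R2_reduced, benefit, cost, benefit - cost)
```

Under these assumptions the sketch prints the reduced relation containing only the tuple (3, 7), followed by B(s) = 4, C(s) = 3 and D(s) = 1.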
3.a.1 Definition of 2-way semijoin. The 2-way semijoin is an extended version of the semijoin. Definition: A 2-way semijoin t of Ri and Rj on attribute A is denoted Ri <-A-> Rj = {Ri -A-> Rj, Rj -A-> Ri}, so t reduces Ri and Rj to Ri' and Rj' respectively.
3.a.2 Properties of 2-way semijoin.
Computing steps:
1) Send Ri[A] from site i to site j.
2) Reduce Rj to Rj' by eliminating the tuples whose A value does not match any value in Ri[A]; at the same time, partition Ri[A] into Ri[A]m (the values that match some value in Rj[A]) and Ri[A]nm (= Ri[A] - Ri[A]m).
3) Send the smaller of Ri[A]m and Ri[A]nm back to site i.
4) Reduce Ri to Ri' using Ri[A]m (keep the matching tuples) or Ri[A]nm (eliminate the non-matching tuples).
Evaluation:
Benefit: B(t) = [S(Ri) - S(Ri')] + [S(Rj) - S(Rj')]
Cost: C(t) = S(Ri[A]) + min[S(Ri[A]m), S(Ri[A]nm)]
If the benefit exceeds the cost (D(t) = B(t) - C(t) > 0), t is called a cost-effective 2-way semijoin.
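3.a.3 2-way semijoin example.
Site 1 holds R1(A, B): three tuples with A = 1, 2, 3; the tuple with A = 3 is (3, 6).
Site 2 holds R2(A, C): (3, 7), (4, 8), (5, 9).
1) Projection: R1[A] = {1, 2, 3} is shipped from Site 1 to Site 2: Ship(3).
2) Reduce and partition at Site 2: R2' = {(3, 7)}; R1[A]m = {3}, R1[A]nm = {1, 2}.
3) R1[A]m, the smaller part, is shipped back to Site 1: Ship(1).
4) Reduce at Site 1: R1' = {(3, 6)}.
Final processing: R1' and R2' (two values each) are shipped to qs: Ship(2) each.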
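The four computing steps and the evaluation formulas for B(t) and C(t) can be sketched in Python on the example data. The names `size` and `two_way_semijoin` are illustrative, S(R) is again taken as the number of attribute values, the B values 4 and 5 are assumed placeholders, and sending the matching part back on a tie is an assumption of this sketch rather than something the slides specify.

```python
def size(relation):
    """S(R): number of attribute values (tuples x attributes)."""
    return sum(len(t) for t in relation)


def two_way_semijoin(r_i, r_j, attr):
    """Return (R_i', R_j', benefit, cost) for a 2-way semijoin on attr."""
    proj_i = {t[attr] for t in r_i}                    # step 1: ship R_i[A]
    values_j = {t[attr] for t in r_j}
    r_j_red = [t for t in r_j if t[attr] in proj_i]    # step 2: reduce R_j
    matched = proj_i & values_j                        # R_i[A]m
    unmatched = proj_i - matched                       # R_i[A]nm

    # Steps 3 and 4: ship back the smaller part and reduce R_i with it.
    if len(matched) <= len(unmatched):                 # tie-break is assumed
        sent_back = matched
        r_i_red = [t for t in r_i if t[attr] in matched]
    else:
        sent_back = unmatched
        r_i_red = [t for t in r_i if t[attr] not in unmatched]

    benefit = (size(r_i) - size(r_i_red)) + (size(r_j) - size(r_j_red))
    cost = len(proj_i) + len(sent_back)
    return r_i_red, r_j_red, benefit, cost


# B = 4 and B = 5 are assumed placeholder values.
R1 = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 3, "B": 6}]
R2 = [{"A": 3, "C": 7}, {"A": 4, "C": 8}, {"A": 5, "C": 9}]

R1_red, R2_red, benefit, cost = two_way_semijoin(R1, R2, "A")
print(R1_red, R2_red, benefit, cost, benefit - cost)
```

With this data the sketch gives B(t) = (6 - 2) + (6 - 2) = 8 and C(t) = 3 + min(1, 2) = 4, so D(t) = 4 > 0.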
3.a.4 Semijoin vs 2-way semijoin. - The 2-way semijoin is an extended version of the semijoin. - It has more reduction power than the semijoin, because it reduces both relations rather than one. - Its reduction effects propagate further than those of the semijoin.
3.b.1 Hash-semijoin operator. Main idea: use a search filter that represents the semijoin projection as a small bit array. Definition: The hash-semijoin of Ri and Rj is denoted Rj ∝ Ri. It is computed as follows: 1) the semijoin projection of Ri is represented as a bit array; 2) this bit array is shipped to the site of Rj; 3) the tuples of Rj are screened by the search filter.
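3.b.2 Hash-semijoin example.
R1(S#, Name): (1, Cindy), (3, Jemal), (4, Sunny), (8, Maggie).
R2(S#, Phone): (2, 222), (3, 333), (4, 444), (5, 555), (6, 666).
Hash function: h(x) = x.
Projection: S#(R1) = {1, 3, 4, 8} is hashed into the bit array Bij, which has a 1 at positions 1, 3, 4 and 8.
Ship(Bij): the bit array is shipped to the site of R2.
Reduce: R2 is screened by Bij, leaving R2' = {(3, 333), (4, 444)}.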
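A minimal Python sketch of this example follows. The helper names `build_filter` and `hash_semijoin` are illustrative, and the bit-array lengths (16 and 4) and the modulo used to keep the index in range are assumptions of the sketch; only the hash function h(x) = x comes from the slide.

```python
def build_filter(relation, attr, hash_fn, n_bits):
    """Represent the semijoin projection of `relation` on attr as a bit array."""
    bits = [0] * n_bits
    for t in relation:
        bits[hash_fn(t[attr]) % n_bits] = 1
    return bits


def hash_semijoin(r_j, attr, bits, hash_fn):
    """Screen r_j with the search filter: keep tuples whose bit is set."""
    n_bits = len(bits)
    return [t for t in r_j if bits[hash_fn(t[attr]) % n_bits]]


def identity(x):
    """h(x) = x, as on the slide."""
    return x


R1 = [{"S#": 1, "Name": "Cindy"}, {"S#": 3, "Name": "Jemal"},
      {"S#": 4, "Name": "Sunny"}, {"S#": 8, "Name": "Maggie"}]
R2 = [{"S#": 2, "Phone": 222}, {"S#": 3, "Phone": 333},
      {"S#": 4, "Phone": 444}, {"S#": 5, "Phone": 555},
      {"S#": 6, "Phone": 666}]

bits = build_filter(R1, "S#", identity, 16)   # built at the site of R1
R2_reduced = hash_semijoin(R2, "S#", bits, identity)   # at the site of R2

small_bits = build_filter(R1, "S#", identity, 4)       # a deliberately small filter
R2_loose = hash_semijoin(R2, "S#", small_bits, identity)
print(R2_reduced)
print(R2_loose)
```

With 16 bits the filter is exact for this data and R2 is reduced to the tuples with S# 3 and 4. With only 4 bits, S# 5 collides with S# 1 and also passes the filter, which illustrates why the result depends on the hash function and array size: a smaller filter is cheaper to ship but may let non-matching tuples through.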
3.b.3 Semijoin vs hash-semijoin. Advantages: - The hash-semijoin is more cost-effective than the semijoin. - The search filter in the hash-semijoin achieves considerable savings in the cost of a semijoin operation. Limitations: - It works only on an execution tree. - It is tightly coupled to the hash functions used.