Published by Marshall Murphy. Modified over 8 years ago.
1
Optimizing Joins in a Map-Reduce Environment
EDBT 2010
Presented by Foto Afrati, Jeffrey D. Ullman
2010-11-12
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2
Copyright 2010 by CEBT
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
Important Special Cases
Experiments
Conclusion
Center for E-Business Technology, IDS Lab. Seminar – 2/33
3
A Model for Cluster Computing
Files: a file is a set of tuples
– It is stored in a file system such as GFS
– Many processes can read and write a file in parallel
Assumption: infinite supply of processors
– Any process (job) can be assigned to any one processor
4
The Cost Measure for MR Algorithms
The communication cost of a process is the size of the input to the process
This paper does not count the output size of a process
– Either the output is the input to at least one other process, where it is already counted
– Or it is the final output, which is assumed to be much smaller than its input
The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
The elapsed communication cost is defined on the acyclic graph of processes
– Consider a path through this graph, and sum the communication costs of the processes along that path
– The maximum sum over all paths is the elapsed communication cost
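The two cost measures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the process names, input sizes, and the dictionary-based graph encoding are hypothetical and do not come from the paper.

```python
from functools import lru_cache


def total_cost(input_size):
    """Total communication cost: sum of the input sizes of all processes."""
    return sum(input_size.values())


def elapsed_cost(input_size, successors):
    """Elapsed communication cost: the maximum, over all paths in the
    acyclic process graph, of the summed input sizes along the path."""

    @lru_cache(maxsize=None)
    def best_from(p):
        # Heaviest path that starts at process p.
        tail = max((best_from(q) for q in successors.get(p, ())), default=0)
        return input_size[p] + tail

    return max(best_from(p) for p in input_size)


# Hypothetical graph: two Map processes feed one Reduce process.
input_size = {"map1": 100, "map2": 60, "reduce": 40}
successors = {"map1": ["reduce"], "map2": ["reduce"]}

print(total_cost(input_size))                # 100 + 60 + 40
print(elapsed_cost(input_size, successors))  # heaviest path: map1 -> reduce
```

The total cost charges every process once, while the elapsed cost follows only the heaviest chain of dependent processes.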
5
In This Paper
We begin an investigation into optimization issues for algorithms implemented in the MR environment
– In particular, we are interested in algorithms that minimize the total communication cost
We begin the study of 2-way and multi-way joins
We introduce the notion of a "share" for each attribute of the map-key
– The product of the shares is a fixed constant k, which is the number of Reduce processes we shall use to implement the join
The heart of the paper explores how to choose the map-key and shares to minimize the communication cost
7
2-Way Join in MapReduce
Join R(A,B) ⋈ S(B,C); data flow: Input → Map → Reduce input → Reduce → Final output
Input R(A,B): (a0, b0), (a1, b1), (a2, b2), …
Input S(B,C): (b0, c0), (b0, c1), (b1, c2), …
Reduce input (K, V): b0 → (a0, R), (c0, S), (c1, S), …
Reduce input (K, V): b1 → (a1, R), (c2, S), …
Final output (A,B,C): (a0, b0, c0), (a0, b0, c1), (a1, b1, c2), …
8
2-Way Join in MapReduce
[Figure: input R(A,B) with tuples (a0, b0), (a1, b1), (a2, b2), …; Reduce input for key b0: (a0, R), (c0, S), (c1, S), …]
Suppose we use k Reduce processes
The output of any Map process with key b is sent to the Reduce process for hash value h(b)
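The 2-way join described above can be simulated in plain Python. This is a minimal single-machine sketch, not Hadoop code: the function name `two_way_join` is hypothetical, Python's built-in `hash` stands in for h, and h(b) mod k selects one of the k Reduce processes.

```python
from collections import defaultdict


def two_way_join(R, S, k, h=hash):
    """Simulate R(A,B) JOIN S(B,C) with k Reduce processes.

    Map: every tuple is keyed by its B-value b and routed to
    Reduce process h(b) mod k. Reduce: within each key, pair every
    R-tuple with every S-tuple.
    """
    # One bucket per Reduce process: key b -> (A-values from R, C-values from S).
    reducers = [defaultdict(lambda: ([], [])) for _ in range(k)]
    for a, b in R:
        reducers[h(b) % k][b][0].append(a)
    for b, c in S:
        reducers[h(b) % k][b][1].append(c)

    output = []
    for groups in reducers:
        for b, (a_values, c_values) in groups.items():
            output.extend((a, b, c) for a in a_values for c in c_values)
    return output


# The tuples shown on the slide.
R = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]
print(sorted(two_way_join(R, S, k=4)))
```

Each input tuple is sent to exactly one Reduce process, so the Reduce-side communication cost is simply |R| + |S| regardless of k.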
9
Joining Several Relations at Once
[Figure: Map-Reduce data flow for R(A,B) ⋈ S(B,C) ⋈ T(C,D): inputs R, S, T → Map → Reduce input → Reduce → Final output]
10
Joining Several Relations at Once
Join: R(A,B) ⋈ S(B,C) ⋈ T(C,D)
Suppose we use $k = m^2$ Reduce processes for some m
Values of B and C will each be hashed to m buckets
Let h be a hash function with range 1, 2, …, m
Each tuple S(b, c) is sent to the Reduce process (h(b), h(c))
11
Joining Several Relations at Once
Join: R(A,B) ⋈ S(B,C) ⋈ T(C,D)
Let h be a hash function with range 1, 2, …, m
– S(b, c) -> (h(b), h(c))
– R(a, b) -> (h(b), all)
– T(c, d) -> (all, h(c))
Each Reduce process computes the join of the tuples it receives
[Figure: 4 × 4 grid of Reduce processes (m = 4, $k = 4^2 = 16$) with rows h(b) = 0, 1, 2, 3 and columns h(c) = 0, 1, 2, 3. An S tuple with h(S.b) = 2 and h(S.c) = 1 goes to the single cell (2, 1); an R tuple with h(R.b) = 2 goes to every cell in row 2; a T tuple with h(T.c) = 1 goes to every cell in column 1]
12
Joining Several Relations at Once
Join: R(A,B) ⋈ S(B,C) ⋈ T(C,D)
Example: h(b) is one of { 0, 1, 2, …, 9 } and h(c) is one of { a, b, c, …, z }
The map-key is then one of { 0a, 0b, …, 0z, 1a, …, 1z, …, 9z }
For relation S
– Each tuple (b, c) is a value, and its key is the single map-key h(b)h(c)
For relation R
– Each tuple (a, b) is replicated, once for each of the keys h(b)a, h(b)b, …, h(b)z
For relation T
– Each tuple (c, d) is replicated, once for each of the keys 0h(c), 1h(c), …, 9h(c)
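The routing rules for the 3-way join can be checked with a small simulation. This is a sketch under assumptions: `three_way_join` is a hypothetical name, Python's `hash` stands in for h, and buckets are numbered 0 … m−1 rather than 1 … m. It also counts how many tuples are shipped to Reduce processes.

```python
from collections import defaultdict


def three_way_join(R, S, T, m, h=hash):
    """Simulate R(A,B) JOIN S(B,C) JOIN T(C,D) on an m x m grid of
    Reduce processes, addressed as (h(b), h(c))."""
    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    sent = 0  # number of tuples shipped to Reduce processes

    for a, b in R:            # R(a, b) -> (h(b), all)
        for j in range(m):
            cells[(h(b) % m, j)]["R"].append((a, b))
            sent += 1
    for b, c in S:            # S(b, c) -> (h(b), h(c))
        cells[(h(b) % m, h(c) % m)]["S"].append((b, c))
        sent += 1
    for c, d in T:            # T(c, d) -> (all, h(c))
        for i in range(m):
            cells[(i, h(c) % m)]["T"].append((c, d))
            sent += 1

    output = []
    for cell in cells.values():
        # Each Reduce process joins only the tuples it received.
        for a, b in cell["R"]:
            for b2, c in cell["S"]:
                if b != b2:
                    continue
                for c2, d in cell["T"]:
                    if c == c2:
                        output.append((a, b, c, d))
    return output, sent


# Hypothetical test data, not from the slides.
R = [("a0", "b0"), ("a1", "b1")]
S = [("b0", "c0"), ("b1", "c0")]
T = [("c0", "d0"), ("c0", "d1")]
result, sent = three_way_join(R, S, T, m=4)
print(sorted(result), sent)
```

Every S tuple lands in exactly one cell, and the matching R and T tuples are replicated into that cell's row and column, so each result tuple is produced exactly once. The shipped-tuple count is m·|R| + |S| + m·|T|.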
13
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
– Formalizing the Optimization Problem
– General Algorithm for Optimization
Important Special Cases
Experiments
Conclusion
14
Formalizing the Optimization Problem
Join: R(A,B) ⋈ S(B,C) ⋈ T(A,C)
The communication cost is rc + sa + tb, where
– r, s, t: # of tuples in relations R, S, T
– a, b, c: # of buckets (shares) for the attributes A, B, C
Why? Consider a tuple (x, y) in relation R(A,B)
– R has no attribute C, so (x, y) must be replicated and sent to c different reducers, which contributes rc
– The terms sa and tb arise in the same way for S (no attribute A) and T (no attribute B)
We must minimize the expression rc + sa + tb subject to the constraint abc = k
Each of a, b, and c must be a positive integer
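Because a, b, and c must be positive integers with abc = k, small instances can be solved by plain enumeration. The following is a brute-force sketch rather than the paper's method; the function name and the relation sizes in the example are hypothetical.

```python
def best_integer_shares(r, s, t, k):
    """Enumerate all positive integer triples (a, b, c) with a*b*c == k and
    return (cost, a, b, c) minimizing r*c + s*a + t*b."""
    best = None
    for a in range(1, k + 1):
        if k % a:
            continue
        rest = k // a
        for b in range(1, rest + 1):
            if rest % b:
                continue
            c = rest // b
            cost = r * c + s * a + t * b
            if best is None or cost < best[0]:
                best = (cost, a, b, c)
    return best


# Equal relation sizes: the shares come out equal as well.
print(best_integer_shares(1000, 1000, 1000, 64))
# A much larger S pushes the share of A (the attribute S lacks) down to 1.
print(best_integer_shares(1000, 8000, 1000, 64))
```

Enumeration is only practical for modest k, but it makes the trade-off visible: the larger a relation, the smaller the share of the attribute it is missing.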
15
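Chapter 11: Nonlinear Programming
Hanbat National University, Department of Industrial and Management Engineering, Prof. Kang Jin-gyu (강진규)
A nonlinear model with n decision variables $(x_1, x_2, \dots, x_n)$ and m equality constraints:
\[
\begin{aligned}
\text{Max. (or Min.)}\quad & f(x_1, x_2, \dots, x_n) \\
\text{s.t.}\quad & g_1(x_1, x_2, \dots, x_n) = 0 \\
& g_2(x_1, x_2, \dots, x_n) = 0 \\
& \quad\vdots \\
& g_m(x_1, x_2, \dots, x_n) = 0
\end{aligned}
\]
▶ Nonlinear programming model under equality constraints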
16
Lagrange multiplier method
Lagrange multipliers are introduced into the original model to build a Lagrange function that links the objective function with the equality constraints. This converts the model into an unconstrained nonlinear programming model, whose extrema are then found.
Letting $\lambda_i$ be the Lagrange multiplier for the i-th constraint, the Lagrange function is
\[
L(x_1, \dots, x_n, \lambda_1, \dots, \lambda_m) = f(x_1, \dots, x_n) + \sum_{i=1}^{m} \lambda_i\, g_i(x_1, \dots, x_n)
\]
17
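Necessary conditions
For $(x_1, x_2, \dots, x_n)$ to be an optimal solution of the original model, the Lagrange function L must satisfy
\[
\frac{\partial L}{\partial x_j} = 0,\quad j = 1, 2, \dots, n
\qquad\text{and}\qquad
\frac{\partial L}{\partial \lambda_i} = 0,\quad i = 1, 2, \dots, m
\]
These are the necessary conditions of the Lagrange multiplier method under equality constraints.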
18
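Example model: production planning for special equipment at S Machinery
– The plan is to build and supply 1,000 units of special equipment over the next 2 years
– The unit production cost is estimated at 100 (in units of 10,000 won) this year and 80 next year
– If this year's and next year's production quantities differ, an additional cost proportional to the square of the difference is incurred
Let $x_1$ be this year's production and $x_2$ next year's production. The additional cost is
\[
C(x_1, x_2) = \frac{(x_1 - x_2)^2}{100}
\]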
19
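Total cost TC = regular production cost + additional cost, which gives the nonlinear programming model
\[
\text{Min.}\quad TC(x_1, x_2) = 100x_1 + 80x_2 + \frac{(x_1 - x_2)^2}{100}
\qquad \text{s.t.}\quad x_1 + x_2 = 1{,}000
\]
With Lagrange multiplier $\lambda$ and the constraint written as $1{,}000 - x_1 - x_2 = 0$, the Lagrange function is
\[
L(x_1, x_2, \lambda) = 100x_1 + 80x_2 + \frac{(x_1 - x_2)^2}{100} + \lambda\,(1{,}000 - x_1 - x_2)
\]
Taking the partial derivatives with respect to $x_1$, $x_2$, and $\lambda$ and setting each to 0 gives the following.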
20
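\[
\begin{aligned}
\frac{\partial L}{\partial x_1} &= 100 + \frac{x_1 - x_2}{50} - \lambda = 0 \\
\frac{\partial L}{\partial x_2} &= 80 - \frac{x_1 - x_2}{50} - \lambda = 0 \\
\frac{\partial L}{\partial \lambda} &= 1{,}000 - x_1 - x_2 = 0
\end{aligned}
\]
Solving these equations gives $x_1 = 250$, $x_2 = 750$, $\lambda = 90$, and TC = 87,500 (in units of 10,000 won).
Confirming that $(x_1, x_2) = (250, 750)$ actually minimizes the total cost requires the second-order partial derivatives.
Meaning of the Lagrange multiplier $\lambda = 90$: at the optimum, producing one more unit of the special equipment costs an additional 90 (this corresponds to the value of a dual variable in LP).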
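The worked example can be verified numerically, because its three first-order conditions are linear in $x_1$, $x_2$, and $\lambda$. The sketch below solves them exactly with rational arithmetic; the helper `solve_linear` is a hypothetical name for a small Gauss-Jordan elimination written only for this illustration.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions.
    Assumes A is square and non-singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[-1] for row in M]


# First-order conditions, rearranged with unknowns (x1, x2, lam) on the left:
#    x1/50 - x2/50 - lam = -100
#   -x1/50 + x2/50 - lam = -80
#    x1    + x2          = 1000
A = [
    [Fraction(1, 50), Fraction(-1, 50), -1],
    [Fraction(-1, 50), Fraction(1, 50), -1],
    [1, 1, 0],
]
b = [-100, -80, 1000]
x1, x2, lam = solve_linear(A, b)
total_cost = 100 * x1 + 80 * x2 + (x1 - x2) ** 2 / 100
print(x1, x2, lam, total_cost)
```

The solver reproduces the values stated on the slide, which also confirms the sign convention used for the multiplier.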
21
Problem Solving
Problem solving using the method of Lagrange multipliers
– Take derivatives with respect to the three variables a, b, c
– Multiply the three equations
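The equations of this slide did not survive extraction. The two steps it names can be worked out as follows for the cost rc + sa + tb under the constraint abc = k. This is a reconstruction from the definitions on the earlier slides, treating a, b, c as real numbers; the sign in front of $\lambda$ is a convention chosen here.

```latex
\begin{align*}
L(a, b, c, \lambda) &= rc + sa + tb - \lambda\,(abc - k) \\[4pt]
\frac{\partial L}{\partial a} &= s - \lambda bc = 0
  \;\Longrightarrow\; sa = \lambda abc = \lambda k \\
\frac{\partial L}{\partial b} &= t - \lambda ac = 0
  \;\Longrightarrow\; tb = \lambda k \\
\frac{\partial L}{\partial c} &= r - \lambda ab = 0
  \;\Longrightarrow\; rc = \lambda k \\[4pt]
(sa)(tb)(rc) &= rst \cdot abc = rstk = \lambda^3 k^3
  \;\Longrightarrow\; \lambda = \sqrt[3]{\frac{rst}{k^2}} \\[4pt]
a = \frac{\lambda k}{s} &= \sqrt[3]{\frac{krt}{s^2}}, \qquad
b = \frac{\lambda k}{t} = \sqrt[3]{\frac{krs}{t^2}}, \qquad
c = \frac{\lambda k}{r} = \sqrt[3]{\frac{kst}{r^2}} \\[4pt]
rc + sa + tb &= 3\lambda k = 3\sqrt[3]{krst}
\end{align*}
```

At the optimum the three terms rc, sa, and tb are equal, and the product of the three shares is k as required. The real-valued shares generally need rounding to positive integers afterwards.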
22
An Example for Understanding
[Figure: example for the join R(A,B) ⋈ S(B,C) ⋈ T(A,C)]
23
General Algorithm for Optimization
Questions
How can we select the map-key attributes?
– Dominated attributes
What is the best # of buckets for each attribute?
– Lagrange multiplier method
See Section 3 of the paper: http://infolab.stanford.edu/~ullman/pub/join-mr.pdf
24
Outline
Introduction
2-Way Join vs. Multi-Way Join
Optimization of Multi-Way Joins
Important Special Cases
– Star Join
– Chain Join
Experiments
Conclusion
25
Special Cases
Star Joins
There is a fact table joined with several dimension tables
– Fact table F: $F(A_1, A_2, \dots, A_n)$
– Dimension tables $D_i$: $D_i(A_i, B_i)$
Chain Joins
A chain join is a join of the form $R_1(A_0, A_1) \bowtie R_2(A_1, A_2) \bowtie \cdots \bowtie R_n(A_{n-1}, A_n)$
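For the star join, the same Lagrange argument used for the 3-way join suggests shares proportional to the dimension-table sizes. The sketch below rests on assumptions that are not stated on the slide: the map-key consists of all fact-table attributes $A_1, \dots, A_n$, each fact tuple goes to exactly one Reduce process, and each tuple of $D_i$ is replicated to $k / a_i$ Reduce processes. The function names are hypothetical, and shares are treated as real numbers.

```python
import math


def star_join_shares(dim_sizes, k):
    """Real-valued shares a_i minimizing sum(d_i * k / a_i) subject to
    prod(a_i) == k. The optimum makes every d_i / a_i equal, so a_i is
    proportional to d_i."""
    n = len(dim_sizes)
    scale = (k / math.prod(dim_sizes)) ** (1.0 / n)
    return [d * scale for d in dim_sizes]


def star_join_cost(fact_size, dim_sizes, shares, k):
    """Communication cost under the assumed model: the fact table is sent
    once, dimension table i is replicated k / a_i times."""
    return fact_size + sum(d * k / a for d, a in zip(dim_sizes, shares))


# Hypothetical sizes: three dimension tables, k = 64 Reduce processes.
dims = [1000, 2000, 4000]
shares = star_join_shares(dims, 64)
print(shares)
print(star_join_cost(10**6, dims, shares, 64))
```

If a computed share falls below 1, this simple formula no longer applies as written, and the shares still need rounding to integers whose product is k.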
26
Star Joins
Example: k = abcd
27
Star Joins (continued)
29
Experimental Settings
Multi-node cluster composed of 4 PCs
– Debian GNU/Linux
– 3.0 GHz dual-core CPU, 1 GB RAM, 160 GB HDD
– 1 Gbps LAN
Tuning Hadoop parameters
– # of Reduce processes: 100
– HDFS block size (max. size of each input split): 128 MB
30
Test Data Sets
[Table: sizes of data sets, intermediate relations, and output (unit: 1 million tuples)]
31
Test Results
[Figure: processing times for the two methods]
33
Conclusion
Proposed an algorithm for multi-way joins that optimizes the communication cost
How can we select the map-key attributes?
– Dominated attributes
What is the best # of buckets for each attribute?
– Lagrange multiplier method
Examined the algorithm with two common kinds of joins
– Star join
– Chain join