Published by Daniel Oliver. Modified over 9 years ago.
1
Low Latency Geo-distributed Data Analytics
Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, Ion Stoica
2
Geo-distributed Data Analytics
Data: Perf. counters, User activities, …
“Centralized” Data Analytics Paradigm: Slow & Wasteful
[Diagram: Seattle, Berkeley, Beijing and London connected over the WAN]
3
A single logical analytics cluster across all sites.
[Diagram: Seattle, Berkeley, Beijing and London connected over the WAN]
4
A single logical analytics system across all sites.
Incorporating WAN bandwidths is key to geo-distributed analytics performance.
[Diagram: Seattle, Berkeley, Beijing and London connected over the WAN]
5
Incorporating WAN bandwidths
Task placement
– Decides the destinations of network transfers
Data placement
– Decides the sources of network transfers
6
Example Analytics Job
SELECT time_window, percentile(latency, 99) GROUP BY time_window
[Diagram labels: Seattle, 40GB, 20GB; London, 40GB; 800 MB/s, 200 MB/s; WAN]
7
Calculating Transfer Time
Table columns: Input Data (GB), Task Fractions, Upload Time (s), Download Time (s); rows: Seattle, London
Values shown: task fraction 0.5 with 40GB: 12.5s and 50s; task fractions 0.2 and 0.8: 20s and 2.5s (2.5x)
[Diagram labels: Seattle 40, 20; London 40]
How to solve the general case, with more sites, BW heterogeneity and data skew?
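The transfer-time calculation on this slide can be sketched in a few lines of Python. This is a minimal sketch under an assumed model, not the authors' code: every site uploads the share of its data that is processed elsewhere and downloads its task fraction of every other site's data, and 1 GB is treated as 1000 MB. The function names and the input numbers are illustrative and are not a reconstruction of the slide's table.

```python
def transfer_times(data_gb, task_frac, up_mbps, down_mbps):
    """Return a list of (upload_s, download_s) pairs, one per site.

    Assumed model: site i runs the fraction task_frac[i] of the tasks,
    so it uploads (1 - task_frac[i]) of its own data and downloads
    task_frac[i] of the data held at all other sites.
    """
    total = sum(data_gb)
    times = []
    for d, r, up, down in zip(data_gb, task_frac, up_mbps, down_mbps):
        upload_mb = (1.0 - r) * d * 1000.0        # GB -> MB
        download_mb = r * (total - d) * 1000.0    # GB -> MB
        times.append((upload_mb / up, download_mb / down))
    return times


def bottleneck(times):
    """The transfer phase ends when the slowest link finishes."""
    return max(max(u, d) for u, d in times)


# Illustrative inputs (assumptions): two sites with 20 GB each,
# one with 800 MB/s links and one with 200 MB/s links.
even = transfer_times([20, 20], [0.5, 0.5], [800, 200], [800, 200])
print(even, bottleneck(even))
```

With these inputs the slower site's link dominates, which is the effect the slide's question about bandwidth heterogeneity and data skew points at.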
8
Task Placement (TP Solver)
Inputs: M sites, N tasks, data matrix (MxN), upload BWs, download BWs
Output: task-to-site assignment, e.g. Task 1 -> London, Task 2 -> Beijing, Task 5 -> London, …
Optimization goal: minimize the longest transfer of all links
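To make the optimization goal concrete, here is a hedged Python sketch that searches for the task fractions minimizing the longest link transfer. It is a brute-force grid search standing in for the TP Solver, whose actual formulation is not given on the slide. It works on per-site task fractions rather than the per-task assignments shown above, and the transfer model, function names and inputs are all assumptions.

```python
from itertools import product


def bottleneck_time(data_gb, frac, up_mbps, down_mbps):
    """Longest upload or download time over all sites (assumed model)."""
    total = sum(data_gb)
    worst = 0.0
    for d, r, up, down in zip(data_gb, frac, up_mbps, down_mbps):
        upload_s = (1 - r) * d * 1000 / up
        download_s = r * (total - d) * 1000 / down
        worst = max(worst, upload_s, download_s)
    return worst


def place_tasks(data_gb, up_mbps, down_mbps, steps=20):
    """Grid search over task fractions that sum to 1.

    Returns (best_time_s, fractions). Cost grows quickly with the
    number of sites, so this is only meant for small illustrations.
    """
    n = len(data_gb)
    best_t, best_frac = float("inf"), None
    for combo in product(range(steps + 1), repeat=n - 1):
        if sum(combo) > steps:
            continue  # fractions would exceed 1
        frac = [c / steps for c in combo]
        frac.append((steps - sum(combo)) / steps)
        t = bottleneck_time(data_gb, frac, up_mbps, down_mbps)
        if t < best_t:
            best_t, best_frac = t, frac
    return best_t, best_frac


# Illustrative inputs (assumptions): skewed data, heterogeneous links.
best_t, best_frac = place_tasks([40, 20], [800, 200], [800, 200])
print(best_t, best_frac)
```

A linear program would solve the same minimize-the-maximum objective exactly and scale to more sites; the grid search is used here only to keep the example dependency-free.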
9
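Another example
Table columns: Input Data (GB), Task Fractions, Upload Time (s), Download Time (s); rows: Seattle, London
Values shown: task fractions 0.2 and 0.8 with 100GB: 50s and 6.25s; 40GB and 160GB with task fractions 0.07 and 0.93: 24s and 6s (2x); 50s
[Diagram labels: Seattle 100, 50; London 100; Query Lag]
How to jointly optimize data and task placement?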
10
Iridium
Approach: jointly optimize data and task placement with a greedy heuristic
Goal: improve query response time
Constraints: bandwidth, query arrivals, etc.
11
Iridium with Single Dataset
Iterative heuristics for joint task-data placement:
1. Identify bottlenecks by solving task placement (TP Solver)
2. Assess: find the amount of data to move to alleviate the current bottleneck
Repeat until the query arrives.
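The two-step loop above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not Iridium's implementation: task placement is a grid search, data moves in fixed-size chunks away from the current bottleneck site, a move takes as long as the slower of the source uplink and destination downlink allows, and moves stop once the (assumed known) query lag is used up. All names and numbers are hypothetical.

```python
from itertools import product


def link_times(data, frac, up, down):
    """Per-site worst of upload and download time (assumed model)."""
    total = sum(data)
    return [max((1 - r) * d * 1000 / u, r * (total - d) * 1000 / dn)
            for d, r, u, dn in zip(data, frac, up, down)]


def solve_tasks(data, up, down, steps=20):
    """Step 1: grid-search task fractions minimizing the slowest link."""
    n = len(data)
    best_t, best_frac = float("inf"), None
    for combo in product(range(steps + 1), repeat=n - 1):
        if sum(combo) > steps:
            continue
        frac = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        t = max(link_times(data, frac, up, down))
        if t < best_t:
            best_t, best_frac = t, frac
    return best_t, best_frac


def greedy_moves(data, up, down, chunk_gb=20.0, lag_s=250.0):
    """Step 2, repeated: move chunks off the bottleneck while it helps."""
    data = [float(d) for d in data]
    best_t, frac = solve_tasks(data, up, down)
    elapsed = 0.0  # time already spent moving data before the query
    while True:
        times = link_times(data, frac, up, down)
        src = times.index(max(times))  # current bottleneck site
        if data[src] < chunk_gb:
            break
        candidate = None
        for dst in range(len(data)):
            if dst == src:
                continue
            move_s = chunk_gb * 1000 / min(up[src], down[dst])
            if elapsed + move_s > lag_s:
                continue  # move would not finish before the query arrives
            trial = list(data)
            trial[src] -= chunk_gb
            trial[dst] += chunk_gb
            t, f = solve_tasks(trial, up, down)
            if t < best_t - 1e-9 and (candidate is None or t < candidate[0]):
                candidate = (t, f, trial, move_s)
        if candidate is None:
            break  # no move improves the bottleneck within the lag
        best_t, frac, data, move_s = candidate
        elapsed += move_s
    return data, frac, best_t


print(greedy_moves([100, 100], [800, 200], [800, 200]))
```

Without the lag budget this loop would simply centralize all data on the fastest site, so the budget is what makes the trade-off between moving data and waiting for the query visible.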
12
Iridium with Multiple Datasets
Prioritize high-value datasets: score = value x urgency / cost
– value = sum(timeReduction) for all queries
– urgency = 1/avg(query_lag)
– cost = amount of data moved
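The scoring rule on this slide translates directly into code. The sketch below assumes time reductions and query lags are given in seconds and the moved data in GB; the function names, the tuple layout and the sample numbers are hypothetical.

```python
def dataset_score(time_reductions_s, query_lags_s, data_moved_gb):
    """score = value x urgency / cost, as defined on the slide."""
    value = sum(time_reductions_s)                 # summed over all queries
    urgency = 1.0 / (sum(query_lags_s) / len(query_lags_s))
    cost = data_moved_gb
    return value * urgency / cost


def rank_datasets(datasets):
    """Return dataset names ordered from highest to lowest score.

    `datasets` maps a name to (time_reductions_s, query_lags_s, moved_gb).
    """
    return sorted(datasets,
                  key=lambda name: dataset_score(*datasets[name]),
                  reverse=True)


# Hypothetical datasets: "logs" saves more time per GB moved, and its
# queries arrive sooner, so it should be moved first.
example = {
    "logs": ([10.0, 20.0], [5.0, 15.0], 3.0),
    "clicks": ([10.0, 20.0], [50.0, 150.0], 6.0),
}
print(rank_datasets(example))
```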
13
Iridium: putting it together
Placement of data
– Before query arrival: prioritize the move of high-value datasets
Placement of tasks
– During query execution: constrained solver (TP Solver)
Not talked about: estimation of query arrivals, contention between data movement and queries, etc.
14
Evaluation
Spark 1.1.0 and HDFS 2.4.1
– Override Spark’s task scheduler with ours
– Data placement creates copies in cross-site HDFS
Geo-distributed EC2 deployment across 8 regions
– Tokyo, Singapore, Sydney, Frankfurt, Ireland, Sao Paulo, Virginia (US) and California (US)
15
Spark jobs, SQL queries and streaming queries
– Conviva: video session parameters
– Bing Edge: running dashboard, streaming
– TPC-DS: decision support queries for retail
– AMP BDB: mix of Hive and Spark queries
Baselines:
– “In-place”: leave data unmoved + Spark’s scheduling
– “Centralized”: aggregate all data onto one site
How well does Iridium perform?
16
Iridium outperforms both baselines: 4x-19x vs. In-place, 3x-4x vs. Centralized
[Figure: Reduction (%) in Query Response Time for Conviva, Bing-Edge, TPC-DS and Big-Data; bar labels 10x, 19x, 7x, 4x (vs. In-place) and 3x, 4x, 3x (vs. Centralized)]
17
Iridium subsumes both baselines!
Median Reduction (%)   Vs. Centralized   Vs. In-place
Task placement         18%               24%
Data placement         38%               30%
Iridium (both)         75%               63%
vs. Centralized: Data placement has higher contribution
vs. In-place: Equal contributions from two techniques
18
Bandwidth Cost
MinBW: a scheme that minimizes bandwidth usage, to Bmin
Iridium: budget the bandwidth usage to be m*Bmin
[Figure: Reduction (%) in WAN Usage; budgets 1xBmin, 1.3xBmin, 1.5xBmin; annotation: (64%, 19%) better]
Iridium can speed up queries while using near-optimal bandwidth cost
19
Related work
JetStream (NSDI’14)
– Data aggregation and adaptive filtering
– Does not support arbitrary queries or optimize task and data placement
WANalytics (CIDR’15), Geode (NSDI’15)
– Optimize BW usage for SQL & general DAG jobs
– Can lead to poor query response time
20
Low Latency Geo-distributed Data Analytics
Data is geographically distributed
– Services with global footprints
– Analyze logs across DCs: “99 percentile movie rating”, “Median Skype call setup latency”
Abstraction: Single logical analytics cluster across all sites
– Incorporating WAN bandwidths
– Reduce response time over baselines by 3x – 19x
[Diagram: Seattle, Berkeley, Beijing and London connected over the WAN]