Download presentation
Presentation is loading. Please wait.
Published byTracey Powers Modified over 8 years ago
1
Scientific days, June 16 th & 17 th, 2014 This work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01) funded by the French program Investissement d’avenir A Control Approach for Performance of Big Data Systems Mihaly Berekmeri, Sara Bouchenak, Damian Serrano, Bogdan Robu, Nicolas Marchand GIPSA - LIG - Grenoble University, France Pervasive Computing Systems, Persyval -Lab
2
Scientific days, June 16 th & 17 th, 2014 Big Data - Big Problems Problem: Vast amounts of data generated daily – Facebook: 3.2 x 10 9 likes and comments/day – CERN’s LHC: Up to 1 PB/s during experiments How do we store it? How do we process it? 2
3
Scientific days, June 16 th & 17 th, 2014 Solution: use the cloud – Cloud computing Fast, on demand assigning of a group of shared hardware resources to applications – “Unlimited” storage and computing capacity 3
4
Scientific days, June 16 th & 17 th, 2014 Challenges in the cloud Challenges: – Current cloud solutions don’t assure performance – Difficult to provision with a changing workload – Interference, concurrency problems: IO, network skews, failures, node heterogeneity Assuring performance objectives poses considerable challenges 4
5
Scientific days, June 16 th & 17 th, 2014 Our approach Develop and apply control strategies to cloud software systems Control theory is everywhere: automotive, robotics, energy, microelectronics etc. “Except” in software systems 5
6
Scientific days, June 16 th & 17 th, 2014 Our approach But why would software need control theory? -> Dealing with the dynamics “The journey just as important as the destination” -> Mathematical tools face safely complexity, guarantee theoretically results, flexibility, robustness 6
7
Scientific days, June 16 th & 17 th, 2014 Challenges in building control theory for software systems No physics behind algorithms, applications difficult to use classical techniques to build models sensors can disappear with a system update Language difficulties: e.g. response time 7
8
Scientific days, June 16 th & 17 th, 2014 Objectives Develop a dynamical model for a distributed software framework dealing with BigData Build a test framework for control strategies Devise new control strategies that improve performance and reliability Consideration: – Implementations evolve rapidly remain agnostic to implementation 8
9
Scientific days, June 16 th & 17 th, 2014 MapReduce Programming model introduced by J. Dean and S. Ghemawat (Google) in 2004 Wide range of applications: log analysis, data mining, web search engines, scientific computing, business intelligence,… Used by the biggest companies: Amazon, eBay, Facebook, LinkedIn, Twitter, Yahoo, Microsoft... Automatic features: data partitioning and replication, task scheduling, fault tolerance 9
10
Scientific days, June 16 th & 17 th, 2014 MapReduce 10
11
Scientific days, June 16 th & 17 th, 2014 State of the Art Existing models – static models not suitable for control using control theory – assume that jobs are isolated don’t deal with concurrent job executions, unlikely in real life scenarios For modeling, we’ve essentially started from 0. 11
12
Scientific days, June 16 th & 17 th, 2014 State of the Art Existing controls – Focus on static, off-line configuration not robust enough – Dedicated cluster or job priorities bad performance for low priority jobs – Job level controllers: off-line profile, online adjustment based on job progress large profile database, modifying schedulers 12
13
Scientific days, June 16 th & 17 th, 2014 Sensors & Actuators Problem: most metrics are not available for measurement or control Solution: we built all the online sensors and actuators – Measure: average performance, availability, throughput in the last time window – Control: number of computing nodes 13
14
Scientific days, June 16 th & 17 th, 2014 The test framework we developed 14
15
Scientific days, June 16 th & 17 th, 2014 MapReduce Benchmark Suite Developed by Sangroya et al. (2012) performance and dependability benchmark Advantages: realistic multiuser workloads comprehensive test data fault injection 15
16
Scientific days, June 16 th & 17 th, 2014 Experimental setup 16 ClusterCPUMemoryStorageNetwork 60 nodes Grid5000 4 cores/CPU Intel 2.53GHz 15GB298GBInfiniband 20G business intelligence benchmark consists of a decision support system for a wholesale supplier requests are typical business queries over a large amount of data (10GB )
17
Scientific days, June 16 th & 17 th, 2014 Modeling challenges & Insights Capturing system dynamics – we define a sliding window over time – take average over window Handle complexity – linearize around an operating point defined by a baseline number of nodes and clients – the point of full utilization is the set-point 17
18
Scientific days, June 16 th & 17 th, 2014 Model structure grey-box modeling technique predicts MapReduce cluster performance based on the number of nodes and the number of clients 18
19
Scientific days, June 16 th & 17 th, 2014 Identification both of the models were identified using step response identification (prediction error estimation) 19
20
Scientific days, June 16 th & 17 th, 2014 Control architecture Control challenges: – Large deadtime – Many point of concurrency and interference 20
21
Scientific days, June 16 th & 17 th, 2014 Baseline experiment 21
22
Scientific days, June 16 th & 17 th, 2014 Relaxed performance– Minimal resource 22
23
Scientific days, June 16 th & 17 th, 2014 Strict performance control 23
24
Scientific days, June 16 th & 17 th, 2014 Conclusions Results: – design, implementation and evaluation of the first dynamic model for MapReduce systems – development and successful implementation of a control framework for assuring service time constraints The control architecture is implemented on a real Hadoop cluster Published at IFAC World Congress 2014 and ComPAS 2014, presented at several international workshops 24
25
Scientific days, June 16 th & 17 th, 2014 Future Work Add other metrics to our model: throughput, availability, reliability Online identification techniques Minimize the number of changes in the control input -> event based control Test with several on-line cloud frameworks and more complex workload scenarios 25
26
Scientific days, June 16 th & 17 th, 2014 Thank you for your attention! Questions? 26
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.