Published by Emma Newton. Modified over 9 years ago.
1
Riza Suminto, Agung Laksono *, Anang Satria *, Thanh Do †, Haryadi Gunawi †*
2
SPV @ HotCloud ’15
3
Users demand high dependability, reliability, and performance stability.
Amazon found that every 100 ms of latency cost them 1% in sales.
Google found that an extra 0.5 seconds in search page generation time dropped traffic by 20%.
4
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SoCC ’14
22%
5
Performance Bug
System Performance Verifier
6
Jobs take several times longer than usual to finish
Improper speculative execution: JCH 1 & TPL 1 & FPL 2 & FTY 1
Unnecessary repeated recovery: TPL 1 & TPL 4 & FTY 4 & TOP 1
7
DLC A: Maps read locally
TPL A: Mappers and reducers in different nodes
JCH A: All-to-all
FPL A: Fault at map node
FTY A: Slow NIC
[Diagram: mappers M1, M2, M3 feed the reducers; the NIC at a map node is slow, so all reducers are slow]
DLC A & TPL A & JCH A & FPL A & FTY A: no straggler = no SpecExec
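The "no straggler = no SpecExec" point can be illustrated with a minimal sketch. It assumes a simplified detector (the class `StragglerSketch` and its threshold rule are hypothetical, not Hadoop's actual heuristic) that flags a task only when its progress rate falls well below the average of its peers. When every reducer is slowed by the same faulty map-side NIC, none stands out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified straggler detector; not Hadoop's real heuristic.
class StragglerSketch {
    // Returns the indices of tasks whose rate is below `fraction` of the mean rate.
    static List<Integer> stragglers(double[] rates, double fraction) {
        double sum = 0;
        for (double r : rates) {
            sum += r;
        }
        double mean = rates.length == 0 ? 0 : sum / rates.length;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < rates.length; i++) {
            if (rates[i] < fraction * mean) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One slow reducer among healthy peers: it is flagged.
        System.out.println(stragglers(new double[] {10.0, 10.0, 1.0}, 0.5));
        // All reducers equally slow: none is below the mean, so nothing is flagged.
        System.out.println(stragglers(new double[] {1.0, 1.0, 1.0}, 0.5));
    }
}
```

With a relative detector like this, a fault that slows every task uniformly never triggers speculation, which is the failure mode the slide describes.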
8
DLC A & TPL A & JCH A & FPL A & FTY A
[Diagram: mappers M1, M2, M3, a datanode (DN), and reducers]
DLC B = read remote: straggler!
9
DLC A & TPL A & JCH A & FPL A & FTY A
[Diagram: mappers M1, M2, M3 and reducers]
FPL B = slow reducer: straggler!
10
TPL A: Mappers and reducers in different nodes
TPL B: Mappers and reducers in different racks
TOP A: Large number of nodes per rack
FTY B: Slow inter-rack switch
[Diagram: mappers (M) and a reducer (R) spread across Rack 1 and Rack 2; the inter-rack link is slow]
TPL A & TPL B & TOP A & FTY B
11
Untriggered speculative execution
MR-70001 = JCH 1 & TPL 1 & FPL 2 & FTY 1
MR-70002 = DSR 1 & DLC 1 & FPL 1 & FTY 1
MR-5533 = FTY 2 & FPL 3 & TPL 3
……
O(n) recovery
MR-5251 = FTY 3 & FPL 3 & FTM 1
MR-5060 = TPL 1 & TPL 3 & FTY 1 & FPL 2
MR-1800 = TPL 1 & TPL 4 & FTY 4 & TOP 1
……
Long lock contention
MR-9191 = FTY 3 & FPL 3 & FTM 1
MR-9292 = TPL 1 & TPL 3 & FTY 1 & FPL 2
MR-9393 = TPL 1 & TPL 4 & FTY 4 & TOP 1
……
Scenario Type: Possible Conditions
DLC (Data Locality): (1) Read from remote disk, (2) read from local disk, ...
DSR (Data Source): (1) Some tasks read from the same datanode, (2) all tasks read from different datanodes, …
JCH (Job Characteristic): Map-reduce is (1) many-to-all, (2) all-to-many, (3) large fan-in, (4) large fan-out, ...
JSZ (Job Size): (1) 1 GB jar file, (2) 1 MB jar file, ...
LSZ (Load Size): (1) Thousands of tasks, (2) small number of tasks, …
FTY (Fault Type): (1) Slow node/NIC, (2) node disconnect/packet drop, (3) disk error/out of space, (4) rack switch, …
FPL (Fault Placement): Slowdown fault injection at the (1) source datanode, (2) mapper, (3) reducer, …
FGR (Fault Granularity): (1) Single disk/NIC, (2) single node (dead node), (3) entire rack (network switch), …
FTM (Fault Timing): (1) During shuffling, (2) during 95% of task completion, …
TOP (Topology): (1) 30 nodes per rack, (2) 3 nodes per rack, …
TPL (Task Placement): (1) Mappers and reducers are in different nodes, (2) AM and reducers in different nodes, (3) mappers are in the same node, (4) most reducers are placed in the same rack, ...
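Each bug above is written as a conjunction of scenario conditions. A minimal sketch of that idea, assuming a scenario is simply a set of condition tags such as "JCH1" (the class `ScenarioSketch` and its method names are hypothetical, and only three of the listed signatures are included):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a bug signature is a conjunction of scenario conditions.
class ScenarioSketch {
    // Signatures copied from the slide; each value is the set of required conditions.
    static final Map<String, Set<String>> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("MR-70001", Set.of("JCH1", "TPL1", "FPL2", "FTY1"));
        SIGNATURES.put("MR-5533", Set.of("FTY2", "FPL3", "TPL3"));
        SIGNATURES.put("MR-1800", Set.of("TPL1", "TPL4", "FTY4", "TOP1"));
    }

    // A signature matches when the scenario contains every one of its conditions.
    static List<String> matching(Set<String> scenario) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : SIGNATURES.entrySet()) {
            if (scenario.containsAll(e.getValue())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A deployment scenario with one extra, unrelated condition (DLC2).
        Set<String> scenario = Set.of("JCH1", "TPL1", "FPL2", "FTY1", "DLC2");
        System.out.println(matching(scenario));
    }
}
```

Because each signature needs all of its conditions at once, a benchmark that varies only one dimension at a time would miss these bugs.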
12
Performance Bug
System Performance Verifier
13
Benchmarking
Hundreds of benchmarks for every scenario
Injecting slowdowns and failures
Takes days to weeks!
14
Four goals in performance verification
Fast
Covers many deployment scenarios
Runs in pre-deployment
Directly checks implementation code
Formal modeling tools!
15
Target system (e.g., Hadoop code):
    @Data
    public class JobInProgress {
        JobID jobId;
        TaskInProgress maps[];
        ...
    }

    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) { ... }
Target system -> SPV Compiler -> auto-generated model (in Colored Petri Net) -> performance verification
The auto-generated model is 20x larger than the hand model.
16
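Places: Tasks, Node, Task to Run
Tokens: Tasks holds (“T1”,map); Node holds A @0; Task to Run receives (A,“T1”,map) @10
Transition: Schedule Task, with arc variables task (from Tasks), node (from Node), and assignment (to Task to Run), and time delay @+10
Transition code:
    input(node,task);
    output(assignment);
    action
      let val (id,type) = task
      in (node,id,type) end;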
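The firing of the Schedule Task transition can be mimicked in a few lines. This is a minimal sketch, assuming a timed token is just a value plus the model time at which it becomes available; the class `ScheduleTaskSketch` and its members are hypothetical and do not come from SPV or any CPN tool.

```java
// Hypothetical sketch of one timed Colored Petri Net transition firing.
class ScheduleTaskSketch {
    // A token carries a value and the model time at which it becomes available.
    static final class Token {
        final String value;
        final int time;

        Token(String value, int time) {
            this.value = value;
            this.time = time;
        }
    }

    // Time delay attached to the transition (the @+10 on the slide).
    static final int DELAY = 10;

    // Consumes one node token and one task token, and produces an assignment
    // token that becomes available DELAY time units after the transition is enabled.
    static Token fire(Token node, Token task) {
        int enabledAt = Math.max(node.time, task.time);
        // task.value is "id,type"; the assignment is "(node,id,type)".
        String assignment = "(" + node.value + "," + task.value + ")";
        return new Token(assignment, enabledAt + DELAY);
    }

    public static void main(String[] args) {
        Token out = fire(new Token("A", 0), new Token("\"T1\",map", 0));
        System.out.println(out.value + " @" + out.time);
    }
}
```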
17
CPN and Java
18
Java to SysJava: data flattening, code modularization, annotation tagging
SysJava to model: compiler
19
Java system states = ArrayList, Map, Tree, …
CPN states = multisets
    List runningJobs;

    public class JobInProgress {
        JobID jobId;
        TaskInProgress maps[];
        ...
    }

    class TaskInProgress {
        TaskID id;
        double progress;
        ...
    }
Flattened CPN places:
Job In Progress: [(1)]
Job Task Mapping: [(1,a),(1,b)]
Task In Progress: [(a,10%),(b,15%)]
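Data flattening can be sketched as turning a nested object graph into flat tuple collections, one per place. The example below is only an illustration under that assumption; `FlattenSketch`, `Job`, `Task`, and `Flat` are hypothetical names, and tuples are rendered as strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: nested Java state flattened into tuple collections.
class FlattenSketch {
    static final class Task {
        final String id;
        final int progressPct;

        Task(String id, int progressPct) {
            this.id = id;
            this.progressPct = progressPct;
        }
    }

    static final class Job {
        final int id;
        final List<Task> tasks;

        Job(int id, List<Task> tasks) {
            this.id = id;
            this.tasks = tasks;
        }
    }

    // One flat collection per place, mirroring the three places on the slide.
    static final class Flat {
        final List<String> jobInProgress = new ArrayList<>();
        final List<String> jobTaskMapping = new ArrayList<>();
        final List<String> taskInProgress = new ArrayList<>();
    }

    // Replaces object nesting with explicit (job, task) mapping tuples.
    static Flat flatten(List<Job> jobs) {
        Flat flat = new Flat();
        for (Job j : jobs) {
            flat.jobInProgress.add("(" + j.id + ")");
            for (Task t : j.tasks) {
                flat.jobTaskMapping.add("(" + j.id + "," + t.id + ")");
                flat.taskInProgress.add("(" + t.id + "," + t.progressPct + "%)");
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Job job = new Job(1, List.of(new Task("a", 10), new Task("b", 15)));
        Flat flat = flatten(List.of(job));
        System.out.println(flat.jobInProgress);
        System.out.println(flat.jobTaskMapping);
        System.out.println(flat.taskInProgress);
    }
}
```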
20
Original function:
    private boolean processHeartbeat(TaskTrackerStatus trackerStats) {
        synchronized (taskTrackers) { ... }
        for (TaskStatus ts : trackerStats) {
            tasks.get(ts.id).updateStatus(ts);
        }
        ...
    }
Modular functions, separating control-flow logic from CRUD logic:
    @ProcessState
    private void initCheck() {
        synchronized (taskTrackers) { ... }
    }

    @ForEach
    private void updateStatuses(TaskTrackerStatus trackerStats) {
        for (TaskStatus ts : trackerStats) { ... }
    }

    @GetState
    private TaskInProgress getTask(TaskID id) {
        return tasks.get(id);
    }

    @UpdateState
    private void tipUpdate(TaskInProgress tip, TaskStatus ts) {
        tip.updateStatus(ts);
    }
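A runnable version of the modularized heartbeat path might look like the sketch below. It assumes simplified stand-ins for the Hadoop types (string task IDs, a list of statuses instead of TaskTrackerStatus), and the annotation declarations are hypothetical markers named after the slide; SPV's real definitions are not shown in the source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one loop (control flow) that delegates to small CRUD helpers.
class ModularSketch {
    // Marker annotations named after the slide; their real definitions are assumed.
    @interface ForEach {}
    @interface GetState {}
    @interface UpdateState {}

    static final class TaskStatus {
        final String id;
        final double progress;

        TaskStatus(String id, double progress) {
            this.id = id;
            this.progress = progress;
        }
    }

    static final class TaskInProgress {
        double progress;

        void updateStatus(TaskStatus ts) {
            progress = ts.progress;
        }
    }

    final Map<String, TaskInProgress> tasks = new HashMap<>();

    // Control-flow logic only: iterate and delegate.
    @ForEach
    void updateStatuses(List<TaskStatus> statuses) {
        for (TaskStatus ts : statuses) {
            tipUpdate(getTask(ts.id), ts);
        }
    }

    // CRUD logic: read one element of the system state.
    @GetState
    TaskInProgress getTask(String id) {
        return tasks.get(id);
    }

    // CRUD logic: update one element of the system state.
    @UpdateState
    void tipUpdate(TaskInProgress tip, TaskStatus ts) {
        tip.updateStatus(ts);
    }

    public static void main(String[] args) {
        ModularSketch m = new ModularSketch();
        m.tasks.put("a", new TaskInProgress());
        m.updateStatuses(List.of(new TaskStatus("a", 0.1)));
        System.out.println(m.tasks.get("a").progress);
    }
}
```

Splitting the body this way gives each function a single role, which is presumably what lets a compiler map it to one model construct.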
21
Annotations assist the compiler
Annotation categories: data structure, I/O, CRUD & process, miscellaneous
    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) { ... }

    @Data
    public class JobInProgress {
        JobID jobId;
        TaskInProgress maps[];
        ...
    }
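How a tool could pick up such tags can be sketched with plain Java annotations. The declarations of `@Data` and `@IO` below are assumptions (the slides show only their use), and runtime reflection is swapped in here to keep the example self-contained; the slides state that SPV itself is built on WALA.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: declaring tag annotations and discovering them by reflection.
class AnnotationSketch {
    // Assumed declaration: marks a class as model data.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Data {}

    // Assumed declaration: marks a method as an I/O entry point.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IO {}

    @Data
    static class JobInProgress {
        int jobId;
    }

    static class JobTracker {
        @IO
        public String heartbeat(String hd) {
            return "ack:" + hd;
        }

        public void helper() {}
    }

    // Lists the names of methods tagged @IO in the given class.
    static List<String> ioMethods(Class<?> c) {
        List<String> names = new ArrayList<>();
        for (Method m : c.getDeclaredMethods()) {
            if (m.isAnnotationPresent(IO.class)) {
                names.add(m.getName());
            }
        }
        return names;
    }

    // Reports whether the given class is tagged @Data.
    static boolean isData(Class<?> c) {
        return c.isAnnotationPresent(Data.class);
    }

    public static void main(String[] args) {
        System.out.println(ioMethods(JobTracker.class));
        System.out.println(isData(JobInProgress.class));
    }
}
```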
22
SPV Compiler -> executable XML
Define configurations, assertions, and specifications
Explore every non-deterministic choice, e.g., task-to-node mapping
[Diagram: with task (“T1”,map) in Tasks and nodes A and B in Node, Schedule Task can fire as "T1 on A", producing (A,“T1”,map) in Task to Run, or as "T1 on B", producing (B,“T1”,map)]
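Exploring every non-deterministic choice amounts to enumerating all branches rather than sampling one. A minimal sketch for the task-to-node mapping, assuming each task may be placed on any node independently (the class `ExploreSketch` is hypothetical, and a real model checker explores full system states, not just placements):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: exhaustive enumeration of task-to-node mappings.
class ExploreSketch {
    // Returns every possible mapping; each mapping lists one "task on node" entry per task.
    static List<List<String>> allMappings(List<String> tasks, List<String> nodes) {
        List<List<String>> result = new ArrayList<>();
        explore(tasks, nodes, 0, new ArrayList<>(), result);
        return result;
    }

    // Depth-first search: branch on every node for task i, then backtrack.
    private static void explore(List<String> tasks, List<String> nodes, int i,
                                List<String> current, List<List<String>> result) {
        if (i == tasks.size()) {
            result.add(new ArrayList<>(current));
            return;
        }
        for (String node : nodes) {
            current.add(tasks.get(i) + " on " + node);
            explore(tasks, nodes, i + 1, current, result);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        // The slide's example: one task, two candidate nodes, two branches.
        System.out.println(allMappings(List.of("T1"), List.of("A", "B")));
    }
}
```

The number of branches grows as nodes to the power of tasks, which is one reason small configurations matter for checking time.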
23
5305 lines of code on top of WALA and Access/CPN
Hadoop MapReduce 1.2.1, with 1067 lines of code changed
Model 20x larger than the hand-made model
34 scenarios, 30 assertion violations, 4 performance bugs
1.5 hours of model checking
Configuration: Value
Worker Node: Node A, B
Data Node: Node A, B, C
Tasks: 2 tasks
Fault Type: Slow data node
Fault Placement: Node C
24
http://ucare.cs.uchicago.edu
25
Is it time for pre-deployment detection of performance bugs?
Bridging system code and formal methods
Future of data-centric languages
Beyond Hadoop
Root cause anatomy of performance bugs
Beyond performance bugs