ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems
Cesar A. Stuardo, Tanakorn Leesatapornwongsa, Riza O. Suminto, Huan Ke, Jeffrey F. Lukman, Wei-Chiu Chuang, Shan Lu, and Haryadi S. Gunawi
New Class of Bugs?
"Classic" critical bugs: concurrency, configuration, performance, security.
Scalability bugs?
Scalability Bugs?
Cassandra
- Bug #3831: "start exploding when start reading 100-300 nodes"
- Bug #6127: "obvious symptom is with 1000 nodes"
- Bug #6409: "With >500 nodes, ... have trouble"
Hadoop/HDFS
- Bug #1073: "1600 data nodes. randomWriter fails to run"
- Bug #3990: "multi-thousand node cluster often results in [problem]"
HBase
- Bug #10209: "with 200+ nodes, [cluster] takes 50mins to start"
- Bug #12139: "with >200 nodes, StochasticLoadBalancer doesn't work"
Riak
- Bug #926: "1000 node cluster is a larger question"
Scalability Bugs
Latent bugs whose symptoms surface in large-scale deployments, but not necessarily in small/medium-scale deployments.
[Figure: symptom severity vs. #nodes]
An Example: Cassandra Bug #3831
Decommissioning a node triggers an "update ring" operation in every node (illustrated with N=6).
An Example: Cassandra Bug #3831
In every node, the ring update runs a triple-nested loop over the cluster:
  for (i = 0 to N) for (j = 0 to N) for (k = 0 to N) ...
an O(N^3), CPU-intensive computation that takes 0.1 to 4 seconds. With N > 200, gossip arrives late, so node A keeps flipping between "B is dead" and "B is alive", in all nodes, leaving the cluster unstable (flapping). An illustrative sketch follows.
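To make the shape of this root cause concrete, below is a small illustrative Java sketch (hypothetical code, not Cassandra's actual ring-update implementation) of a scale-dependent triple-nested loop whose cost grows as O(N^3) with cluster size N:

import java.util.List;

// Illustrative only: a ring-update-style computation whose cost grows as
// O(N^3) with cluster size N, so one invocation can take seconds once the
// cluster exceeds a few hundred nodes.
class RingUpdateSketch {
    // The nodes list and pendingWork matrix are hypothetical stand-ins.
    static void updateRing(List<String> nodes, double[][] pendingWork) {
        int n = nodes.size();
        for (int i = 0; i < n; i++) {          // every node in the ring
            for (int j = 0; j < n; j++) {      // every candidate replica
                for (int k = 0; k < n; k++) {  // every token range to reconcile
                    // CPU-intensive work per (i, j, k) triple; while this runs,
                    // the gossip stage cannot make progress.
                    pendingWork[j][k] += Math.abs(i - k) * 1e-9;
                }
            }
        }
    }
}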
The "Flapping" Bug(s)
[Figure: the bug symptom]
"For Apache Hadoop, testing at thousand-node scale has been one of the most effective ways of finding bugs, but it's both difficult and expensive. It takes considerable expertise to deploy and operate a large-scale cluster, much less debug the issues. Running such a cluster also costs thousands of dollars an hour, making scale testing impossible for the solo contributor. As it stands, we are heavily reliant on test clusters operated by large companies to do scale testing. A way of finding scalability bugs without requiring running a large-scale cluster would be extremely useful" — Andrew Wang (Cloudera and Apache Hadoop PMC Member and Committer)

Q1: What are the root causes of scalability bugs?
Q2: Can we do large-scale testing easily, e.g., in one machine?
ScaleCheck
- Real-scale testing
- Discovering scalability bugs
- Democratizing large-scale testing
Q1: What are the root causes of scalability bugs?
ScaleCheck SFind: scale-dependent loops, i.e., loops iterating over scale-dependent data structures.
SFind automatically finds scale-dependent data structures and loops by checking their in-memory growth tendency over small-scale experiments.
Q2: Can we do large-scale testing easily, e.g., in one machine?
ScaleCheck STest: large-scale testing in a single machine with little source-code modification.
Many challenges: memory overhead, context switching, CPU contention, network overhead, bad modularity.
Techniques: SPC (Single Process Cluster), GEDA (Global Event-Driven Architecture), PIL (Processing Illusion).
Our Results
- Integrated into 4 systems
- Scale-tested 18 protocols
- Accurately reproduced 10 known scalability bugs
- Found 4 new bugs
... all in just a single machine
Outline
- Introduction
- SFind
- STest
- Evaluations and Conclusions
Our Frequent Root Cause… ScaleCheck @ FAST ’19 02/28/19 Our Frequent Root Cause… is not easy to find in… Any distributed system these days has… > 100K LOC > 100x source files > 1000x maps, lists, queues, etc… We can find them manually… /* … holds one entry per peer … */ /* … maps partition to node… */ /* … maps blocks to datanodes… */ How can we find them automatically? Scale Dependent Loops iterating scale dependent data structures
SFind (with JVMTI and Reflection support) monitors a node's in-memory data structures as the experiment advances: e.g., List peers = {A} at step 1, {A, B} at step 2, ..., reaching N elements at step N. A rough sketch of this sampling follows.
[Figure: List "peers" growth tendency, size vs. step]
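As a rough illustration of the idea (hypothetical code; SFind itself also relies on JVMTI), the following Java sketch samples the sizes of an object's collection- and map-typed fields after each experiment step, building a size-over-steps history per field:

import java.lang.reflect.Field;
import java.util.*;

// Minimal sketch of size sampling via reflection; not SFind's implementation.
class GrowthMonitor {
    // For each monitored field name, the recorded size after every step.
    private final Map<String, List<Integer>> sizeHistory = new HashMap<>();

    void sample(Object target) throws IllegalAccessException {
        for (Field f : target.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            Object value = f.get(target);
            Integer size = null;
            if (value instanceof Collection) {
                size = ((Collection<?>) value).size();
            } else if (value instanceof Map) {
                size = ((Map<?, ?>) value).size();
            }
            if (size != null) {
                sizeHistory.computeIfAbsent(f.getName(), k -> new ArrayList<>()).add(size);
            }
        }
    }

    Map<String, List<Integer>> history() { return sizeHistory; }
}

Calling sample(node) once per step yields, for a field like "peers", a series such as [1, 2, ..., N].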
Node A's on-heap data structures: growth tendency after N steps.
Data structures that tend to grow when one or more dimensions of scale grow are scale dependent.
[Figure: size-vs-step plots for List X, List peers, Set Y, Map Z, Map nodes, Set T, Queue Q, Map R]
A simple classification heuristic over the recorded series is sketched below.
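One possible (assumed) criterion for flagging a structure as scale dependent from its size history, not necessarily the exact test SFind applies:

import java.util.List;

// Heuristic sketch: a structure is scale dependent if its size grew noticeably
// and roughly monotonically across the experiment steps.
class ScaleDependenceCheck {
    static boolean isScaleDependent(List<Integer> sizes) {
        if (sizes.size() < 2) return false;
        int increases = 0;
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) > sizes.get(i - 1)) increases++;
        }
        // Thresholds below are assumptions for illustration only.
        boolean grewOverall = sizes.get(sizes.size() - 1) >= 2 * Math.max(1, sizes.get(0));
        boolean mostlyMonotonic = increases >= 0.8 * (sizes.size() - 1);
        return grewOverall && mostlyMonotonic;
    }
}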
Finding many scale-dependent data structures
- Try out different axes of scale: adding files, adding partitions, tables, columns, directories, ... (example structures: inodes, tokens): 26 data structures found (Cassandra, HDFS)
- Try out different workloads: snapshotting directories, removing nodes, migration, rebalance, ... (example structures: snapshots, deadNodes): 18 data structures found (Cassandra, HDFS)
Prioritizing Scale-Testing
- Map the scale-dependent data structures to the loops that iterate them, using data-flow analysis
- Create inter-procedural call graphs for all methods (e.g., applyStateLocally, sendToEndpoints)
- Prioritize the most harmful scalability bottlenecks using metrics such as loop complexity, #IO, and #Lock operations (a ranking sketch follows)
Result: a priority list, e.g., 1. applyStateLocally, 2. sendToEndpoints, ...
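As one possible illustration (the exact ranking metric is an assumption, not taken from the paper), candidate methods could be ordered by the statistics gathered from static analysis:

import java.util.*;

// Sketch of a priority ranking: deeper scale-dependent loops, more I/O, and
// more lock operations rank first. Requires Java 16+ (records, Stream.toList).
class LoopPrioritizer {
    record MethodStats(String name, int loopDepth, int ioOps, int lockOps) {}

    static List<MethodStats> prioritize(List<MethodStats> candidates) {
        return candidates.stream()
            .sorted(Comparator
                .comparingInt(MethodStats::loopDepth)
                .thenComparingInt(MethodStats::ioOps)
                .thenComparingInt(MethodStats::lockOps)
                .reversed())
            .toList();
    }
}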
Outline
- Introduction
- SFind
- STest
- Evaluations and Conclusions
Democratizing Large-Scale Testing with STest
From SFind's priority list (1. applyStateLocally, ...), write scale tests such as TestApplyStateLocally.java:
  public void testApplyStateLocally(...) { // ... }
Many challenges remain: memory overhead, context switching, CPU intensity, network overhead, bad modularity.
Try #1: Naïve Packing (NP)
Deploy nodes as processes in a single machine: one process = one node = one system instance.
Main challenges: per-node runtime overhead, process context switching, event lateness (e.g., a message expected every second is observed only every 2 seconds).
Max colocation factor: 50 nodes. This prohibits large-scale testing!
Try #2: Single Process Cluster (SPC)
Deploy nodes as threads in a single process.
Challenges: lack of modularity (e.g., bad design patterns), automatic per-node isolation (e.g., ClassLoader isolation), a large number of threads.
SPC also simplifies inter-node communication: a send becomes an enqueue into the receiving node's in-memory message queue. A rough sketch of the per-node isolation follows.
Max colocation factor: 120 nodes.
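The per-node isolation idea can be sketched roughly as follows (class names, the NodeDaemon entry point, and the launch protocol are hypothetical; this is not STest's actual loader):

import java.net.URL;
import java.net.URLClassLoader;

// Sketch: each emulated node gets its own ClassLoader so static state in the
// system's classes does not collide with other nodes in the same JVM.
class SpcNodeLauncher {
    static Thread launchNode(int nodeId, URL[] systemClasspath) throws Exception {
        URLClassLoader nodeLoader =
            new URLClassLoader(systemClasspath, null /* isolate from the app loader */);
        Class<?> daemon = nodeLoader.loadClass("org.example.NodeDaemon"); // hypothetical
        Runnable main = () -> {
            try {
                daemon.getMethod("main", String[].class)
                      .invoke(null, (Object) new String[] {"--node-id", String.valueOf(nodeId)});
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        };
        Thread t = new Thread(main, "node-" + nodeId);
        t.setContextClassLoader(nodeLoader);
        t.start();
        return t;
    }
}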
Per-Node Services: a frequent design pattern
Service = worker pool + event queue.
But N nodes mean N * (S_1 + ... + S_s) threads, where S_i is the worker-pool size of service i!
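For a rough sense of scale (the numbers here are hypothetical, not measurements from the paper): with s = 4 services whose worker pools have 8 threads each, N = 512 emulated nodes would need 512 * (8 + 8 + 8 + 8) = 16,384 threads on one machine, far more than a single machine can schedule efficiently.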
Try #3: Global Event-Driven Architecture (GEDA)
One global event handler per service, so the total number of threads equals the number of cores.
Identify event constraints: which events can be handled multi-threaded versus which must be single-threaded, enforcing per-node serial processing where required (a rough sketch follows).
Max colocation factor: 512 nodes.
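Roughly, the GEDA idea could look like the following sketch (a simplification under assumed semantics, not STest's actual implementation):

import java.util.concurrent.*;

// Sketch: all emulated nodes share one event queue and one worker pool per
// service, sized to the number of cores. Events for nodes that require
// single-threaded semantics are serialized through a per-node lock.
// Note: this simplification does not strictly preserve per-node event order.
class GlobalEventHandler {
    private final ExecutorService workers =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    private final ConcurrentMap<Integer, Object> perNodeLocks = new ConcurrentHashMap<>();

    void submit(int nodeId, boolean singleThreadedPerNode, Runnable event) {
        workers.submit(() -> {
            if (singleThreadedPerNode) {
                Object lock = perNodeLocks.computeIfAbsent(nodeId, k -> new Object());
                synchronized (lock) { event.run(); }
            } else {
                event.run();
            }
        });
    }

    void shutdown() { workers.shutdown(); }
}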
Using our Framework for HDFS-9188
Datanodes' incremental block reports (IBRs) are processed serially by the Namenode, so as N grows (32, 64, 128, 256) the Namenode's RPC queue (max size 1000) fills up. A full RPC queue means no other request can be served. Accurate!
[Figure: Namenode RPC queue occupancy at N = 32, 64, 128, 256]
But Meanwhile in CASS-3831… ScaleCheck @ FAST ’19 02/28/19 But Meanwhile in CASS-3831… Inaccurate!
CPU Contention... Try #4: Processing Illusion (PIL)
Replace CPU-intensive computation with sleep: "The key to computation is the execution time and eventual output."
Which code blocks can be safely replaced by sleep? In every node, the O(N^3) computation becomes O(1).
(Easy) CPU-intensive blocks: non-pertinent operations that do not affect processing semantics can be PILed.

Original, O(N^3):

  function intense()
    for (...) { for (...) { for (...) {
      write("hello!")
    }}}

PILed, O(1), with the sleep time obtained from offline profiling:

  function intense()
    if (!ScaleCheck) {
      for (...) { for (...) { for (...) {
        write("hello!")
      }}}
    } else {
      t = getTime(...)   // profiled execution time
      sleep(t)
    }
(Hard) CPU-intensive blocks: pertinent operations that do affect processing semantics can also be PILed, via offline/online memoization.

Original, O(N^3):

  function relevantAndRelevant(input)
    for (...) { for (...) { for (...) {
      modifyClusterState(...)
    }}}

PILed, with execution time and output taken from memoization (a rough sketch follows):

  function relevantAndRelevant(input)
    if (!ScaleCheck) {
      for (...) { for (...) { for (...) {
        modifyClusterState(...)
      }}}
    } else {
      t, output = getNextState(input)   // memoized time and resulting state
      sleep(t)
      clusterState = output
    }
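One possible shape for the memoization side, as a hedged Java sketch (names and API are assumptions, not from the paper's artifact):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a PILed pertinent block looks up both the profiled execution time
// and the memoized output for a given input, sleeping instead of recomputing,
// so both timing and processing semantics are preserved.
class PilMemoizer<I, O> {
    record Memo<T>(long timeMillis, T output) {}
    private final Map<I, Memo<O>> table = new ConcurrentHashMap<>();

    void memoize(I input, long timeMillis, O output) {
        table.put(input, new Memo<>(timeMillis, output));
    }

    O replay(I input) throws InterruptedException {
        Memo<O> memo = table.get(input);
        if (memo == null) {
            throw new IllegalStateException("no memoized result for input: " + input);
        }
        Thread.sleep(memo.timeMillis());
        return memo.output();
    }
}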
How about now?
Outline
- Introduction
- SFind
- STest
- Evaluations and Conclusions
Reproducing 8 known bugs accurately [see paper for details]
Finding New Bugs
Meanwhile, in the latest Cassandra: applyStateLocally is O(N^3), with O(N) I/O and O(N) lock operations.
Colocation Factor
How many nodes can we colocate in a single machine while maintaining accuracy? All techniques are needed for a high colocation factor: Naive, +SPC, +NetStub, +GEDA, and +PIL progressively raise it, up to 512 nodes.
[Figure: colocation factor (64 to 512) for Cassandra decommission, HDFS IBR, and Voldemort rebalance under each technique]
Limitations and Future Work
- Focus on scale-dependent CPU/processing time (e.g., cluster size, number of files); future: scale of load, scale of failure
- Colocation factor limited by a single machine (akin to unit testing); future: multiple machines
- PIL might require real-scale memoization, and offline memoization can take too long (days); future: explore other CPU-contention reduction techniques
- New code = new maintenance costs; the cost is low (about 1% of the code base); future: going for zero-effort emulation
Conclusions
ScaleCheck:
- Real-scale testing
- Discovering scalability bugs
- Democratizing large-scale testing
Thank you! Questions?
http://ucare.cs.uchicago.edu
http://cs.uchicago.edu
https://ceres.uchicago.edu