Control Theory in Log Processing Systems

Control Theory in Log Processing Systems Wei Xu (xuw@cs.berkeley.edu) UC Berkeley Joseph L. Hellerstein IBM T.J. Watson Research Center

Outline Data streams and log processing Applying control theory Controlling queue length Load balancing Lessons learned

Introduction. Goal of our project: a tool and a testbed. Problem: data rate of up to 1 TB a day; distributed infrastructure; how to make the infrastructure itself reliable? Notes: The main goal of our project is to analyze log data of online services such as Amazon or eBay. These systems are very complex and they often fail. ... However, due to the very high data rate and the complexity of the logs, we had problems processing the data.

Example of system log data. Request data: Apache logs, etc. Performance data: CPU, memory, etc. Failure data: detected problems, error messages, reports from operators. Scale: 450 attributes, 11,000 requests a second.

The big picture [diagram labels: Production System, raw log data, Data Collection, Automatic analysis, preprocessing, Repository, Sanitized Data, Failure Detection]. Notes: Add an “AOL” box in front of the orange arrows. Add a feedback loop back into AOL. How would this be used in real life? It is in the critical path of failure recovery, so the speed of “data analysis” is critical for recovery. The speed of preprocessing is also critical ... Also, how do we evaluate this framework? How much delay do we introduce? What happens if a node in the preprocessing step fails? Can we handle that? Put data sanitizing functions into TCQ!

Preprocessing: sanitize the data; put logs into a common format; merge information from various sources; sampling. Needs to be fast. Notes: All the required preprocessing should be done outside the algorithm ... As new data streams -> as new input

Stream processing with Telegraph Continuous Query (TCQ). Log data are data streams; preprocessing tasks are continuous queries. TCQ: SQL queries; adaptive (execution optimized on the fly); performance doesn’t depend on the number of queries. [diagram labels: SLT, query Q] Notes: We think that stream processing is a good data model for system log data.

Data preprocessing architecture [diagram: load splitter, TCQ nodes running query Q, combiner, TCQ node running query R, SLT 1 and SLT 2; events 1 to 6 are split across the nodes and recombined as 6+5+4 and 3+2+1; tiers labeled Intra-Event Processing and Inter-Event Processing]. Can be easily distributed over a cluster of machines; linear scaling. Performance of a TCQ node depends on the data rate, not on the number of queries running, so it can generate many streams. Can be extended or reorganized in any way. Why (at least) two tiers? Sampling should be the first thing to do in the pipeline to reduce the data rate (that’s why we need parallelism). How to support off-line algorithms? Notes: “One machine running TCQ can’t handle 1 TB of data a day, so we need to distribute the processing. At the same time, we also want to extract temporal information from the data and thus we need to process the data in sequence. These contradicting goals ...”
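The split-then-recombine idea on this slide can be sketched as follows. This is only an illustration under stated assumptions: the round-robin policy, the sequence-number tagging, and the function names `split_round_robin` and `combine_in_order` are hypothetical and are not taken from the original system.

```python
import heapq


def split_round_robin(events, n_nodes):
    """Tag each event with a sequence number and deal it to a node (assumed policy)."""
    lanes = [[] for _ in range(n_nodes)]
    for seq, event in enumerate(events):
        lanes[seq % n_nodes].append((seq, event))
    return lanes


def combine_in_order(lanes):
    """Merge per-node outputs back into the original sequence.

    Each lane is already sorted by sequence number, so a k-way merge
    restores the global order needed for inter-event processing.
    """
    merged = heapq.merge(*lanes, key=lambda pair: pair[0])
    return [event for _, event in merged]


events = ["e1", "e2", "e3", "e4", "e5", "e6"]
lanes = split_round_robin(events, 3)      # node 0 gets e1 and e4, and so on
restored = combine_in_order(lanes)
print(lanes)
print(restored)
```

With three nodes, the first node receives events 1 and 4, the second 2 and 5, and the third 3 and 6, which mirrors the numbering in the diagram.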

Problem: performance disturbance. Causes: CPU contention; maintenance tasks; packet drops; other failures; selectivity changes.

The result of disturbance [plot: end-to-end response time (ms) vs. time (s)]

Solution – Control Theory. Treat this as a failure? Not necessary, and too expensive. Use feedback control as a first-tier defense mechanism: try to keep the system stable at least for some time; if that doesn’t work out, try recovery.

Outline Data streams and log processing Applying control theory Controlling queue length Load balancing Lessons learned

The problem [diagram labels: Source, Buffer, TCQ, Result Q]

Why does this happen? TCQ has a complex internal structure. TCQ drops tuples silently if the result queue is full; back pressure is not possible. [diagram labels: Controlled Data Source, Input Buffer, TCQ]

Control Problems. Goal? No dropped tuples. What to control? The result queue length. The knob? Input data rate to the TCQ node.

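Control block diagram [diagram: controller and target system, with the target system model obtained by system identification]. Control law: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$, where $u(k)$ is the data rate in the next interval, $u(k-1)$ the data rate in the last interval, $e(k)$ the error, and $e(k-1)$ the last error.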
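The control law on this slide is the incremental form of a PI controller. A minimal sketch of it follows, assuming a toy plant in which the result queue grows by the input rate minus a drain rate per interval; the plant model, the gains (0.5 and 0.1), the target of 50, and the names `IncrementalPI` and `simulate` are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class IncrementalPI:
    """PI controller: u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1)."""
    kp: float
    ki: float
    u_prev: float = 0.0  # data rate used in the last interval
    e_prev: float = 0.0  # error observed in the last interval

    def step(self, target, measured):
        e = target - measured
        u = self.u_prev + (self.kp + self.ki) * e - self.kp * self.e_prev
        u = max(u, 0.0)  # a data rate cannot be negative
        self.u_prev, self.e_prev = u, e
        return u


def simulate(drains, target=50.0, kp=0.5, ki=0.1):
    """Toy plant (an assumption): queue(k+1) = queue(k) + rate(k) - drain(k)."""
    ctrl = IncrementalPI(kp=kp, ki=ki, u_prev=drains[0])
    queue, history = 0.0, []
    for drain in drains:
        rate = ctrl.step(target, queue)          # knob: input data rate
        queue = max(0.0, queue + rate - drain)   # measured: result queue length
        history.append(queue)
    return history


# The drain rate falls from 100 to 60 tuples per interval halfway through,
# standing in for a disturbance such as CPU contention.
history = simulate([100.0] * 100 + [60.0] * 100)
print(round(history[99], 3), round(history[-1], 3))
```

In this toy setting the queue length returns to the target after the disturbance because the controller lowers the input rate to match the reduced drain rate.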

Result – Under CPU Contention [diagram labels: Source, Buffer, TCQ, Result Q]

Why useful? Original system: input data rate => tuple drop vs. no drop. New system: input data rate => response time. This makes the system ready for load balancing.

Outline System log as data streams Applying control theory Controlling queue length Load balancing Lessons learned

The problem: a barrier in the system. Nodes have different response times, and the end-to-end response time matches the slower node.

The control problem. Goal? Make the response times equal. What to control? Response time on each node. The knob? Tuples assigned to each node. What to monitor? Queue length vs. response time.
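One way to picture this control problem is an integral controller that shifts the share of tuples toward whichever node currently responds faster. The sketch below assumes two nodes and a toy model in which response time is proportional to assigned rate divided by node capacity; the model, the gain, and the function name `balance` are assumptions made for illustration.

```python
def balance(capacities, total_rate=1000.0, gain=0.2, steps=100):
    """Adjust the share of tuples sent to node A until response times match.

    Toy model (an assumption): response time = assigned rate / capacity.
    """
    share = 0.5  # fraction of tuples assigned to node A
    rt_a = rt_b = 0.0
    for _ in range(steps):
        rt_a = share * total_rate / capacities[0]
        rt_b = (1.0 - share) * total_rate / capacities[1]
        # Normalized error: positive when node B is slower, so A gets more.
        error = (rt_b - rt_a) / (rt_a + rt_b)
        share = min(max(share + gain * error, 0.0), 1.0)
    return share, rt_a, rt_b


share, rt_a, rt_b = balance((2000.0, 1000.0))
print(round(share, 4), round(rt_a, 4), round(rt_b, 4))
```

With node A twice as fast as node B, the share settles where both response times are equal, so the barrier no longer waits on a slower node.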

System with control Response time

Control block diagram

Result [plot: end-to-end response time (ms) vs. time (s)]

Outline System log as data streams Applying control theory Controlling queue length Load balancing Lessons learned

Advantages of control theory: performance can be analyzed in terms of stability, accuracy, settling time, and overshoot.
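These properties can also be measured from a recorded step response. A small sketch follows, assuming the response is a list of samples taken once per control interval; the 2% settling band and the function name `step_metrics` are conventional but assumed choices.

```python
def step_metrics(response, target, band=0.02):
    """Compute overshoot, steady-state error, and settling time of a step response."""
    tolerance = band * abs(target)
    # Overshoot: how far the peak exceeds the target, as a fraction of the target.
    overshoot = max(0.0, (max(response) - target) / target)
    # Accuracy: distance between the final value and the target.
    steady_state_error = abs(target - response[-1])
    # Settling time: first interval after which all samples stay inside the band.
    settling_index = 0
    for k, y in enumerate(response):
        if abs(y - target) > tolerance:
            settling_index = k + 1
    return {
        "overshoot": overshoot,
        "steady_state_error": steady_state_error,
        "settling_index": settling_index,  # equals len(response) if never settled
    }


metrics = step_metrics([0.0, 60.0, 55.0, 50.5, 50.0, 50.0], target=50.0)
print(metrics)
```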

Other advantages: simple implementation; encourages good system design; modeling the system; treating the system as a black box; first defense mechanism against disturbances in the system.

Limitations. Not all software systems are designed to be controlled. Finite input can produce unbounded output (e.g., join in TCQ). Useful state is not measurable. Queuing theory helps, but other good theory is lacking. Many variables are binary (failed vs. working correctly).

Other Limitations. Cannot find the cause of a problem. The model for the target system is complex. Lack of a reliable knob (e.g., changing the result queue length of TCQ sometimes crashes it): What is the range you can turn it? How often can you turn it? How long does the system take to respond?

Solution? More advanced modeling and controllers? Adaptive control. Design controller-friendly systems? A simple model; user-configurable parameters -> knobs?

Future Work. As a tool, real users? Scheduling multiple streams; dynamically scaling up/down; other control theory applications.

Backup Slides

Future Work. Load balancer; load control across multiple tiers; scheduling of multiple streams.

System with control [diagram labels: Controlled Output Rate Data Source, Controller, Queue Length Monitor]

Result [diagram labels: Source, Buffer, TCQ, Result Q]

Conclusion. Advantages of feedback control: makes the system more robust under disturbance; allows more time for failure detection; treats complex systems as black boxes; copes with the system’s characteristics instead of having to change them; supports theoretical analysis; implementation is easy. System statistics can also be used for SLT.

What is going on? [diagram labels: Desired Queue Length, Queue Length Controller, Data Rate to TCQ, Controlled Output Thread (Code Reuse), Actual Queue Length]

Theory meets reality [plot: output Y from simulation; queue length vs. time]

Tricky part of parameter estimation. Model evaluation: making the system operate in the desired range. [plot: data rate vs. free space, with a non-linear range marked] Easy for the data source, but queue length ..
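Parameter estimation of this kind is often done by fitting a simple linear model to measured input and output samples. The sketch below assumes a first-order model y(k+1) = a*y(k) + b*u(k) and an ordinary least-squares fit; the model structure, the synthetic data, and the name `fit_first_order` are assumptions, and such a fit is only meaningful for data collected inside the linear operating range.

```python
def fit_first_order(u, y):
    """Least-squares fit of y(k+1) = a*y(k) + b*u(k) via the normal equations."""
    s_yy = s_yu = s_uu = s_ny = s_nu = 0.0
    for k in range(len(y) - 1):
        s_yy += y[k] * y[k]
        s_yu += y[k] * u[k]
        s_uu += u[k] * u[k]
        s_ny += y[k + 1] * y[k]
        s_nu += y[k + 1] * u[k]
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("input does not vary enough to identify the model")
    a = (s_ny * s_uu - s_nu * s_yu) / det
    b = (s_nu * s_yy - s_ny * s_yu) / det
    return a, b


# Synthetic experiment: vary the input and record the output of a known model.
a_true, b_true = 0.8, 0.5
u = [100.0 + 50.0 * ((7 * k) % 5) for k in range(40)]
y = [0.0]
for k in range(39):
    y.append(a_true * y[k] + b_true * u[k])

a_est, b_est = fit_first_order(u, y)
print(round(a_est, 6), round(b_est, 6))
```

Samples taken in the non-linear range would bias both estimates, which is why the experiment has to keep the system inside the desired range.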

Why do we need control? The data source does not provide an accurate data rate.

Control Problems. The data rate is not accurate for various reasons: scheduling, time spent on I/O, etc. Goal: provide an accurate data source using feedback control, by controlling the “desired rate” input.

The Control Architecture [diagram; example values shown: 1500, 1900, 1600]. P controller (with precompensation): $u(k) = K_P\,e(k)$. PI controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$.
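A P controller alone leaves a steady-state error, which precompensation removes by scaling the reference. The sketch below assumes a first-order plant y(k+1) = a*y(k) + b*u(k); the plant, its parameters, the gain, and the names `precompensation_gain` and `run_p_control` are illustrative assumptions rather than the configuration used in the talk.

```python
def precompensation_gain(a, b, kp):
    """Reference scale N that makes the steady-state output equal the reference.

    Derived from the steady state of y = a*y + b*kp*(N*r - y).
    """
    return (1.0 - a + b * kp) / (b * kp)


def run_p_control(a, b, kp, desired, n_gain, steps=50):
    """Simulate u(k) = Kp * e(k) with e(k) = N * desired - y(k)."""
    y = 0.0
    for _ in range(steps):
        u = kp * (n_gain * desired - y)
        y = a * y + b * u  # assumed first-order plant
    return y


a, b, kp, desired = 0.5, 0.4, 0.8, 1500.0
plain = run_p_control(a, b, kp, desired, n_gain=1.0)
compensated = run_p_control(a, b, kp, desired, precompensation_gain(a, b, kp))
print(round(plain, 2), round(compensated, 2))
```

Precompensation depends on knowing the plant parameters accurately, whereas the integral term of a PI controller removes the steady-state error without that knowledge.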

Result – An accurate data source [plots: P Controller with Pre-compensation; PI Controller]

Zoom In. Many small disturbances occur in a Java program, e.g. incremental garbage collection. [plots: P Controller; PI Controller]