A Software-Defined Networking based Approach for Performance Management of Analytical Queries on Distributed Data Stores Pengcheng Xiong (NEC Labs America)

Presentation transcript:

A Software-Defined Networking based Approach for Performance Management of Analytical Queries on Distributed Data Stores Pengcheng Xiong (NEC Labs America) Hakan Hacigumus (NEC Labs America) Jeffrey F. Naughton (Univ. of Wisconsin)

Agenda
- Why? Motivation and background
- How? System architecture and implementation
- So what? Real system and benchmark query evaluation
- Conclusion

Motivation
Data analytics applications and data scientists query data from distributed stores.
- A huge amount of data traffic on the network (e.g., joins)
- Many applications want to share a cluster (data backup, video streaming, etc.)
- Response time is critical (deadline-driven reports)
- Query service differentiation (batch queries, interactive queries)

An example query (TPC-H Q14)
We assume that tables are distributed across relational data stores, and that the data stores are connected by a network.

Network change implies plan performance change
[Figure: plan performance in Phase 1, Phase 2, and Phase 3 as the network status changes]
(1) Huge gap
(2) The best plan can become the worst one

What if?
What if the query optimizer could dynamically monitor the network bandwidth and adaptively choose a plan?
[Figure: Phase 1, Phase 2, Phase 3]
The adaptive plan is chosen and query execution time is kept short.
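The adaptive choice described above can be sketched as follows. This is a minimal illustration, not the authors' optimizer: it assumes each candidate plan is summarized by a fixed local processing time plus the bytes it ships over each link, and that transfers on different links run one after another. The plan names, link names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    local_seconds: float   # hypothetical fixed local processing time
    bytes_per_link: dict   # hypothetical: link name -> bytes shipped over that link


def estimate_seconds(plan, bandwidth_bps):
    """Local time plus transfer time at the currently monitored bandwidth."""
    # Assumption: transfers on different links happen sequentially.
    transfer = sum(
        nbytes * 8 / bandwidth_bps[link]
        for link, nbytes in plan.bytes_per_link.items()
    )
    return plan.local_seconds + transfer


def choose_plan(plans, bandwidth_bps):
    """Pick the plan with the lowest estimated time for this bandwidth snapshot."""
    return min(plans, key=lambda plan: estimate_seconds(plan, bandwidth_bps))


plans = [
    Plan("ship_part_to_lineitem", 10.0, {"link_a": 2e9}),
    Plan("ship_lineitem_to_part", 10.0, {"link_b": 8e9}),
]

# Hypothetical monitored bandwidth (bits per second) in two phases.
phases = {
    "phase1": {"link_a": 1e9, "link_b": 1e9},
    "phase2": {"link_a": 1e8, "link_b": 1e9},  # link_a becomes congested
}

chosen = {phase: choose_plan(plans, bw).name for phase, bw in phases.items()}
print(chosen)
```

With these made-up numbers the cheaper plan flips once link_a becomes congested, which is the effect the slide describes: a plan that is best under one network status can be worse under another.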

Network busy implies no good plan
User: Run the query right now. I need it ASAP to meet my deadline!
Distributed DBMS: Well… I am sorry. None of the candidate plans can meet your deadline because the network is currently busy.

What if?
User: Run the query right now. I need it ASAP to meet my deadline!
Distributed DBMS: OK. Although the network is currently busy, I can control it to prioritize bandwidth for the query.
What if the query optimizer could control the network?

Distributed query optimizer monitors and controls the network?

Sounds like mission impossible
Databases have always treated the underlying network as a black box:
- unable to monitor
- let alone control
With software-defined networking (SDN), the database can:
- inquire about the current status of the network, or
- control the network with directives
Traditional networking: unable to monitor, let alone control. With SDN: able to inquire and control.

Sounds interesting, but how?
[Figure: Ethernet switch/router]

[Figure labels: Data Path (Hardware); Control Path (Software)]

[Figure labels: Data Path (Hardware); Control Path; OpenFlow; OpenFlow Controller; OpenFlow Protocol (SSL/TCP); Dist. Query Optimizer; API; Our contribution]

System architecture

System implementation
NEC PFS5240

Plan generation
[Figure labels: Stores lineitem table; Stores part table]

Cost estimation
Cost model for the network operator:
- Amount of data transferred
- Real-time transfer speed (Monitor)
- Take any bandwidth left (Control)
- Assign the highest priority
- Make a bandwidth reservation
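A rough sketch of how such a cost model might be evaluated, assuming the cost of the network operator is simply the amount of data divided by the speed the query can obtain. Reading the last three items as alternative ways to obtain bandwidth is an interpretation of the slide, and the function names, mode names, and numbers below are hypothetical. The model for the priority mode (the query sees the full link) is a deliberate simplification.

```python
def transfer_seconds(data_bytes, speed_bps):
    """Estimated time to ship data_bytes at speed_bps (bits per second)."""
    if speed_bps <= 0:
        raise ValueError("no bandwidth available for the transfer")
    return data_bytes * 8 / speed_bps


def effective_speed_bps(mode, link_capacity_bps, competing_bps, reserved_bps=0.0):
    """Bandwidth the query is assumed to obtain under each (hypothetical) mode."""
    if mode == "leftover":
        # Take whatever the competing traffic leaves unused.
        return max(link_capacity_bps - competing_bps, 0.0)
    if mode == "priority":
        # Simplification: highest-priority traffic is served before anything else.
        return link_capacity_bps
    if mode == "reservation":
        # A reservation cannot exceed the link capacity.
        return min(reserved_bps, link_capacity_bps)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical example: ship 1 GB over a 1 Gbps link carrying 0.75 Gbps of other traffic.
data_bytes = 1e9
capacity = 1e9
competing = 7.5e8

estimates = {
    mode: transfer_seconds(
        data_bytes,
        effective_speed_bps(mode, capacity, competing, reserved_bps=5e8),
    )
    for mode in ("leftover", "priority", "reservation")
}
print(estimates)
```

In a real deployment the speed for the leftover mode would come from monitoring the switch rather than from a fixed number, and the other two modes would correspond to directives sent to the network.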

Evaluation
Setup
- TPC-H, scale factor 100, Q14
- Small tables (supplier, nation, region) are replicated.
- Other tables are placed at a single data store site.
- Neighbor traffic generator: iperf
- Summary of case studies

Case 1: single user, single thread, iperf
[Figure: Phase 1, Phase 2, Phase 3; bottleneck marked]
Based on SDN, the query optimizer can dynamically monitor the network bandwidth and adaptively choose the best plan.

Case 3: multiple users, multiple threads, no contention traffic, priority queue
Based on SDN, premium queries run faster than regular ones.
Based on SDN, all queries run faster.

Case 5: single user, multiple threads, iperf, weighted-fair queue
Based on SDN, a larger bandwidth reservation makes queries run faster.
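One way to see why a larger reservation helps under a weighted-fair queue is the textbook weighted fair share, where each queue receives link capacity in proportion to its weight. The sketch below assumes every queue always has data to send. The queue names, weights, and capacity are hypothetical and are not the settings used in the evaluation.

```python
def weighted_fair_share(capacity_bps, weights):
    """Split capacity_bps across queues in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {queue: capacity_bps * w / total for queue, w in weights.items()}


# Hypothetical 1 Gbps link shared by query traffic and iperf background traffic.
capacity = 1e9
small = weighted_fair_share(capacity, {"query": 1, "iperf": 1})
large = weighted_fair_share(capacity, {"query": 3, "iperf": 1})
print(small["query"], large["query"])
```

Raising the query queue's weight from 1 to 3 increases its share from half of the link to three quarters, so the same transfer finishes sooner.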

Conclusion
SDN can be effectively exploited for performance management of analytical queries on distributed data stores:
- Directly monitor the network and adaptively pick the best plan.
- Control the priority of network traffic or make network bandwidth reservations to differentiate the query service.
Lots of opportunities

Thanks!