Macro-level Scheduling of ETL Workflows
Anastasios Karagiannis 1, Panos Vassiliadis 1, Alkis Simitsis 2
1 Univ. of Ioannina, Greece
2 HP Labs, USA


Outline
Motivation
Our Solution
– modeling
– algorithms
– system architecture
Evaluation
Conclusions
A. Karagiannis, P. Vassiliadis, A. Simitsis – QDB’11

Example Flow
[Figure: example data flow from information sources (e.g., database tables, files, XML, sensors, Twitter, Facebook, web portals) to target results (e.g., warehouse tables, OLAP/mining tools, data marts, reports, dashboards). Labeled components: streaming data flow & text analytic operators; filters; sensor data and external event streams; primitive event detectors; complex event detector; event stream; real-time correlation; root cause discovery; multivariate TS predictor; streaming data flow & event analytic operators; data cleaning & schema modification operators.]
© 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Background
Scheduling policies
– mostly in stream technology
    e.g., Aurora, Chain, Pipeline scheduling
– undisclosed policies used in commercial ETL tools
    round robin, OS takes over
– research on ETL has not dealt with scheduling
    efforts on efficient loading in real-time ETL workflows

Contribution
Study of scheduling processes for ETL workflows
– implementation of a simple, yet generic and extensible, ETL engine
– enforce scheduling policies in ETL execution
– use of template ETL workflows for experimentation
System characteristics
– pipelining
– zero data loss
– no deadlocks

Our solution

Modeling
[Figure: an activity node v with its input (in) and output (out) queues.]

Modeling
An ETL workflow is a DAG G(V,E)
An activity node v has
– consumption rate, selectivity, in-queues w/ total size queue(v)
A queue q has
– size(q) at time t, MaxMem(q)
Scheduler
– policy P and $T = T_1 \cup \dots \cup T_{LAST}$
– which operator to activate and for how long
– when an operator should stop
– when an operator finishes
– when flow execution ends
[Figure: timeline T divided into intervals $T_1, \dots, T_i, T_{i+1}, \dots, T_{LAST}$; interval $T_i$ is delimited by $T_i.f$ and $T_i.l$, and interval $T_{i+1}$ by $T_{i+1}.f$ and $T_{i+1}.l$.]

Modeling
Problem statement
find a policy P for a workflow G(V,E), s.t.
– P creates a division of T into intervals $T_1 \cup T_2 \cup \dots \cup T_{LAST}$
– $\forall t \in T,\ \forall v \in V,\ \forall q \in Q(v):\ size(q) \le MaxMem(q)$
– minimize $OF_1$ and/or $OF_2$
– $OF_1$: minimize $T_{LAST}$
– $OF_2$: minimize $\max\left(\sum queue_t(v)\right)$ for $t \in T$ and $v \in V$
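The problem statement can be made concrete with a small checker. The following is a minimal sketch, not the authors' engine: it assumes a schedule is recorded as a list of unit-length time steps, that T_LAST equals the number of steps, and that OF 2 is read as the peak over time of the summed queue sizes. The names `evaluate_schedule`, `trace`, and `max_mem` are hypothetical.

```python
def evaluate_schedule(trace, max_mem):
    """Check the memory constraint and compute (OF1, OF2) for a recorded run.

    trace:   list with one entry per time step t; each entry maps an
             activity v to a dict {queue name: size(q) at t}.
    max_mem: dict {queue name: MaxMem(q)}.
    Assumption: unit-length time steps, so OF1 = T_LAST = len(trace).
    """
    peak = 0
    for t, state in enumerate(trace):
        total = 0
        for v, queues in state.items():
            for q, size in queues.items():
                # Constraint: size(q) <= MaxMem(q) for every t, v, q.
                if size > max_mem[q]:
                    raise ValueError(f"queue {q} of {v} exceeds MaxMem at step {t}")
            total += sum(queues.values())  # queue_t(v), summed over all v
        peak = max(peak, total)            # max over t
    return len(trace), peak


# Hypothetical two-activity run over three time steps.
trace = [
    {"v1": {"q1": 3}, "v2": {"q2": 0}},
    {"v1": {"q1": 1}, "v2": {"q2": 4}},
    {"v1": {"q1": 0}, "v2": {"q2": 2}},
]
max_mem = {"q1": 5, "q2": 5}
of1, of2 = evaluate_schedule(trace, max_mem)
```

Two policies can then be compared by running the same workflow under each and comparing the resulting (OF 1, OF 2) pairs.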

Scheduling Algorithms

| Policy | pick next operator based on | when |
|---|---|---|
| Round Robin (RR) | operator id | input queue is exhausted |
| Minimum Cost (MC) | max size of input queue | input queue is exhausted |
| Minimum Memory (MM) | max memory benefit * | time slot |

* MemB(v) = (In(v) - Out(v)) / ExecTime(v) x Queue(v)
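The selection step of the three policies can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the `Operator` class and the `pick_*` function names are hypothetical, the MemB formula is evaluated left to right, and only operators with a non-empty input queue are treated as candidates.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    op_id: int
    queue_size: int    # Queue(v): tuples waiting in the input queue(s)
    tuples_in: float   # In(v)
    tuples_out: float  # Out(v)
    exec_time: float   # ExecTime(v)


def mem_benefit(v):
    # MemB(v) = (In(v) - Out(v)) / ExecTime(v) x Queue(v), read left to right
    return (v.tuples_in - v.tuples_out) / v.exec_time * v.queue_size


def ready(ops):
    # Assumption: an operator with an empty input queue is not a candidate.
    return [v for v in ops if v.queue_size > 0]


def pick_round_robin(ops, last_id):
    # RR: next ready operator by id after the last one run, wrapping around.
    candidates = sorted(ready(ops), key=lambda v: v.op_id)
    if not candidates:
        return None
    for v in candidates:
        if v.op_id > last_id:
            return v
    return candidates[0]


def pick_min_cost(ops):
    # MC: operator with the largest input queue.
    return max(ready(ops), key=lambda v: v.queue_size, default=None)


def pick_min_memory(ops):
    # MM: operator with the largest memory benefit.
    return max(ready(ops), key=mem_benefit, default=None)


# Hypothetical operators for illustration.
ops = [
    Operator(1, queue_size=10, tuples_in=100, tuples_out=50, exec_time=2.0),
    Operator(2, queue_size=40, tuples_in=100, tuples_out=90, exec_time=1.0),
    Operator(3, queue_size=5, tuples_in=100, tuples_out=10, exec_time=1.0),
]
```

Per the table, RR and MC would keep the chosen operator running until its input queue is exhausted, while MM would re-evaluate the choice at the end of every time slot.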

Software Architecture

Evaluation

Template Workflows
[Figure: template workflows: wishbone, tree, fork, primary flow.]

Experiments
Parameters
– workflow size, complexity, selectivity
– data size
Tuning
– stall time
– time slot
– data queue size
– row pack size
Dataset
– TPC-H data

Experiments
[Figure: data size and execution time.]

Experiments
[Figure: data size and memory.]

Conclusions & On-going Work

Lessons Learned
– RR is quite efficient in performance, but lags in memory consumption
– We can devise a scheduling policy (MC) with slightly better performance than RR and observable savings in average memory consumption
– A slower policy (MM) shows significant savings in average memory consumption, using between 1/2 and 1/10 of the memory used by the other policies

Mixed Policy – sketch
Key idea
– split a workflow into subflows s.t.
    simple subflows can use a faster policy such as MC
    complex subflows (w/ memory-consuming tasks and blocking operators) can use MM to save memory
– use the extra memory for boosting faster workflows with parallelization
– workflow segmentation (examples)
    parallelize subflows w/o dependencies on each other
    place pipeline activities into the same subflow
    blocking activities split the workflow into two parts that should be synchronized (allocate resources for the 2nd part only when the 1st finishes)
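The segmentation idea can be illustrated on the simplest case, a linear chain of activities. This is only a sketch under stated assumptions: the slides do not say on which side of a cut a blocking activity falls, so here it closes the first part; the function names and the rule "a subflow with a blocking activity gets MM, any other gets MC" are hypothetical simplifications of the key idea above.

```python
def segment_chain(activities, blocking):
    """Split a linear chain into subflows, cutting after each blocking activity.

    Assumption: the blocking activity belongs to the part it terminates, so
    the following part can start only once that part has finished.
    """
    subflows, current = [], []
    for activity in activities:
        current.append(activity)  # pipelined activities stay in the same subflow
        if activity in blocking:
            subflows.append(current)
            current = []
    if current:
        subflows.append(current)
    return subflows


def assign_policy(subflow, blocking):
    # Hypothetical rule: complex subflows (with a blocking activity) use MM,
    # simple pipelined subflows use the faster MC.
    return "MM" if any(a in blocking for a in subflow) else "MC"


# Hypothetical four-activity chain in which the sort is blocking.
chain = ["filter", "sort", "lookup", "load"]
blocking = {"sort"}
parts = segment_chain(chain, blocking)
policies = [assign_policy(p, blocking) for p in parts]
```

A real workflow is a DAG rather than a chain, so an actual segmentation would also need to detect subflows without mutual dependencies in order to run them in parallel.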

Mixed Policy – first results
Complex workflows based on tree, butterfly, and fork archetypes
[Figure panels: tree, butterfly, fork.]

Conclusions
Summary
– Schedule ETL workflows for improving
    execution time
    memory consumption
    w/o data losses
– Home-grown implementation of an ETL engine
– Minimum Memory improves average memory consumption
– Minimum Cost improves execution time (RR is close)
Future work
– other prioritization schemes due to different SLAs
– scheduling for (near-)real-time ETL
Thank You!

Back-up slides

Example big query

Example big query (cont.)

Scheduling in RW (1)

| Name | Source | Who Is Next | For How Long | Criterion | Decision |
|---|---|---|---|---|---|
| FIFO | [BBDM03], [UrFr01] | next token | until idle / time slot | Fairness | Local |
| Round Robin | [BBDM03], [UrFr01] | next ready token | until idle / time slot | Fairness | Local |
| Equal Time | [UrFr01] | least executed time | until idle / time slot | Fairness | Global |
| Cheapest First | [UrFr01] | least processing cost | until idle | response time | Local |
| Greedy Scheduling | [BBDM03] | least selectivity | time slot | memory consumption | Local |

Scheduling in RW (2)

| Name | Source | Who Is Next | For How Long | Criterion | Decision |
|---|---|---|---|---|---|
| Min Latency | [CCR+03] | largest output size | until idle | response time | Global |
| Rate Based | [UrFr01] | largest output size | until idle | response time | Global |
| Min Cost | [CCR+03] | largest input size | until idle | throughput | Local |
| Min Memory | [CCR+03] | largest data consumption | until idle | memory consumption | Local |
| Chain Scheduling | [BBDM03] | largest data consumption | time slot | memory consumption | Global |