ETL Queues for Active Data Warehousing Alexis Karakasidis Panos Vassiliadis Evaggelia Pitoura Dept. of Computer Science University of Ioannina.

Presentation transcript:

ETL Queues for Active Data Warehousing Alexis Karakasidis Panos Vassiliadis Evaggelia Pitoura Dept. of Computer Science University of Ioannina

IQIS'05, 17 June 2005, Baltimore MD, USA

Forecast
We demonstrate that we can employ queue theory to predict the behavior of an Active ETL process.
We discuss implementation issues in order to achieve several desirable properties: minimal system overhead and high freshness of data.

Contents
Problem description
System Architecture & Theoretical Analysis
Experiments
Conclusions and Future Work


Active Data Warehousing
Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software.
Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data.
Issues that come up:
–How to design an Active DW?
–How can we implement an Active DW?

Issues and Goals of this paper
Smooth upgrade of the software at the source
–The modification of the software configuration at the source side is minimal.
Minimal overhead of the source system
No data losses are allowed
Maximum freshness of data
–The response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
Stable interface at the warehouse side
–The architecture should scale up with respect to the number of sources and data consumers at the DW

Contributions
We set up the architectural framework and the issues that arise for the case of active data warehousing.
We develop the theoretical framework for the problem, by employing queue theory for the prediction of the performance of the system.
–We provide a taxonomy for ETL tasks that allows treating them as black-box tasks.
–Then, standard queue theory techniques can be applied for the design of an ETL workflow.
We provide technical solutions for the implementation of our reference architecture, achieving the aforementioned goals.
We validate our results through extensive experimentation.

Related work
Work in the field of ETL is obviously related
–must be customized for active DW
Streams, due to the nature of the data
–still, all related work is on continuous queries, not updates
Huge amount of work in materialized view refreshment
–orthogonal to our problem
Web services
–because, in our architecture, the DW exports Web services to the sources


ETL workflows
[Figure: an example ETL workflow from the Sources (S1_PARTSUPP, S2_PARTSUPP) through the DSA (FTP, DIFF, Add_SPK, SK, $2€, A2EDate, AddDate, NotNULL, CheckQTY, with rejected rows logged) to the DW (DW.PARTSUPP, Aggregate 1 and Aggregate 2 feeding views V1 and V2)]

Queue Theory for ETL
We can model various kinds of ETL transformations as queues, which we call ETL queues.
Each queue has an incoming arrival rate λ and a mean service time 1/μ.
Little’s Law: N = λ·T, where N is the mean number of records in the system and T is the mean time a record spends in it.
M/M/1 queue (Poisson arrivals, exponentially distributed service times, one server)
–Mean response time W = 1/(μ − λ)
–Mean queue length L = ρ/(1 − ρ), where ρ = λ/μ
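
The M/M/1 formulas on this slide can be checked numerically. The following is a minimal sketch in Python, not the authors' code: the function name `mm1_metrics` and the rates used in the example are illustrative assumptions.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (rho, mean_in_system, mean_response_time) for an M/M/1 queue."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival rate must be >= 0 and service rate > 0")
    if arrival_rate >= service_rate:
        # The formulas only hold in steady state, i.e. when lambda < mu.
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate                       # utilization, rho = lambda / mu
    mean_in_system = rho / (1 - rho)                        # L = rho / (1 - rho)
    mean_response_time = 1 / (service_rate - arrival_rate)  # W = 1 / (mu - lambda)
    return rho, mean_in_system, mean_response_time


# Hypothetical rates in records per second (not taken from the experiments).
rho, mean_in_system, mean_response_time = mm1_metrics(800.0, 1000.0)
print(f"rho={rho:.2f} L={mean_in_system:.2f} W={mean_response_time * 1000:.2f} ms")
```

In this sketch the two results are tied together by Little's Law: multiplying the arrival rate by the mean response time gives the mean number of records in the system.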

Queue Theory for ETL
Queues can be combined to form queue networks
Jackson networks: networks where each queue can be solved independently (under reasonable constraints)
We can use queue theory to predict the behavior of the Active Data Warehouse

How to predict the behavior of the Active Data Warehouse
1. Compose ETL queues in a Jackson network to simulate the implementation of the Active Data Staging Area (ADSA)
2. Then, solve the Jackson network and relate the parameters of the ADSA, specifically:
–Source arrival rate (i.e., rate of record production at the source)
–Overall service time (i.e., time that a record spends in the ADSA)
–Mean queue length (i.e., no. of records in the network)
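
As a rough illustration of these two steps, the sketch below chains single-server ETL queues in a line and relates the three parameters listed above. It is an assumption-laden example, not the ADSA implementation: the activity names, service rates, and selectivities are hypothetical, and each activity is treated as an M/M/1 queue that passes on a fixed fraction of its input.

```python
def solve_tandem(source_rate, activities):
    """Solve a linear chain of M/M/1 ETL queues.

    activities is a list of (name, service_rate, selectivity) tuples, where
    selectivity is the fraction of records passed on to the next activity.
    """
    rate = source_rate
    per_node = []
    for name, service_rate, selectivity in activities:
        if rate >= service_rate:
            raise ValueError(f"{name} is unstable: arrival rate {rate} >= service rate {service_rate}")
        rho = rate / service_rate
        mean_in_node = rho / (1 - rho)            # mean number of records at this node
        mean_time_in_node = 1 / (service_rate - rate)
        per_node.append((name, rate, mean_in_node, mean_time_in_node))
        rate *= selectivity                       # rejected records leave the network here
    total_in_network = sum(node[2] for node in per_node)
    # Little's Law applied to the whole network gives the mean time per source record.
    mean_time_in_network = total_in_network / source_rate
    return per_node, total_in_network, mean_time_in_network


# Hypothetical workflow: a filter that rejects half the rows, a surrogate-key
# lookup, and a load step. All rates are in records per second.
workflow = [("filter", 1000.0, 0.5), ("surrogate_key", 500.0, 1.0), ("load", 1000.0, 1.0)]
nodes, total, mean_time = solve_tandem(500.0, workflow)
for name, rate, in_node, time_in_node in nodes:
    print(f"{name}: arrival rate {rate:.0f}/s, {in_node:.2f} records, {time_in_node * 1000:.2f} ms")
print(f"records in network: {total:.2f}, mean time per source record: {mean_time * 1000:.2f} ms")
```

The mean time is averaged over all source records, including those a filter rejects early, which is why it is obtained from Little's Law rather than by summing the per-node times.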

Taxonomy of ETL transformations
Filters
Transformers
Binary Operators
Generic model

System Architecture


Experimentation environment
Source: an application in C that uses an ISAM library
ADSA implemented in Sun JDK 1.4
Web Services platform:
–Apache Axis 1.1 [AXIS04]
–Xerces XML parser
–Apache Tomcat
DW implemented over MySQL 4.1
Configuration:
–Source: PIII 700MHz with 256MB memory, SuSE Linux 8.1
–DW: Pentium 4 2.8GHz with 1GB memory, Mandrake Linux, ADSA included
–Department’s LAN for the network
Source operates at full capacity

First set of experiments
A first set of experiments over a simple configuration, to determine fundamental architectural choices
Issues
–Smooth upgrade of the source software
–UDP vs TCP
–Source overhead
–Data delay
–Topology

Experimentation results
Smooth upgrade: no more than 100 lines of code modified
UDP resulted in 35% data loss, due to ADSA overflow => TCP is the clear choice
Source overhead is highly dependent on row blocking:
–Source overhead is 1.7% with a source flow regulator, vs 34% without
–WS mode (blocking vs non-blocking) has no effect
–Medium-size packets seem to work better

Data Freshness
We count the time to carry all records from source to DW
We empty the ADSA with 3 policies:
–Immediate transport
–We simulate a slower ADSA by removing 50, 100, 150, 200, 250 and 300 records from the queue every 0.1 sec
–We remove 500, 1000, 1500, 2000, 2500 and 3000 records every 1 sec
Source max rate is about 1250 records / sec
Findings:
–Small packet sizes result in small delays
–There is a threshold (the source rate): when the ADSA removes records more slowly than the source produces them, the queue grows without bound
–We can achieve data freshness time equal to data insertion time when we continuously empty a small-size queue
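
The threshold finding can be illustrated with a small simulation of the timed removal policies. This is a simplified sketch, not the experimental setup: it assumes the source produces records at a perfectly constant rate, and the function name `simulate_drain` is hypothetical. The source rate, batch sizes, and 0.1 sec period are the ones quoted on the slide.

```python
def simulate_drain(source_rate, batch_size, period, duration):
    """Return the queue length right after each removal tick.

    Assumes deterministic arrivals at source_rate records per second and the
    removal of at most batch_size records every period seconds.
    """
    queue = 0.0
    lengths = []
    ticks = int(round(duration / period))
    for _ in range(ticks):
        queue += source_rate * period     # records produced during one period
        queue -= min(queue, batch_size)   # remove at most batch_size records
        lengths.append(queue)
    return lengths


SOURCE_RATE = 1250.0  # records per second, the approximate maximum source rate

# 100 records every 0.1 sec is 1000 records/sec, below the source rate.
slow = simulate_drain(SOURCE_RATE, batch_size=100, period=0.1, duration=10.0)
# 150 records every 0.1 sec is 1500 records/sec, above the source rate.
fast = simulate_drain(SOURCE_RATE, batch_size=150, period=0.1, duration=10.0)

print(f"slow policy, queue after 10 sec: {slow[-1]:.0f} records")
print(f"fast policy, largest queue seen: {max(fast):.0f} records")
```

Under these assumptions the queue keeps growing for as long as the slow policy runs, while the fast policy empties the queue at every tick.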

[Figures: Data Freshness]

Experiments including transformation scenarios
We enrich the previous configuration with several ETL activities in the ADSA
Based on the previous results, we have fixed:
–2-tier architecture, ADSA at the DW
–Source flow regulation with medium-size packets
–TCP for network connection
–Non-blocking calling of the DW Web services

Scenarios to measure data freshness
[Figure: four scenarios, panels (a) to (d)]

Goals of the experiments
Steadiness of the system
–The system is steady whenever the service rate is higher than the arrival rate; transient effects disappear
Source overhead
–Medium-size blocking is still the winner
Throughput for ADSA
–The ADSA is only one packet behind the source
–Avg. delay per row ~0.9 msec for all scenarios
Success of theoretical prediction
–The prediction underestimates the queue length by about half a packet


Conclusions
We can employ queue theory to predict the behavior of an Active ETL process
We have proposed an architectural configuration with
–Minimal source overhead
–No effect on the source due to the operation of an ADSA
–No packet losses, due to the usage of TCP
–Small delay in the ADSA, especially if row blocking in medium-size blocks is used

Future Work
Combine our configuration with results in the optimization of ETL processes (ICDE’05)
Fault tolerance
Experiment with higher client loads at the warehouse side
Scale up the number of sources involved

Thank you!

Backup Slides

Grand View

Jackson’s Theorem and ETL queues
Jackson’s Theorem. If, in an open network, the condition λ_i < μ_i · m_i holds for every i ∈ {1, …, N} (with m_i standing for the number of servers at node i), then the steady-state probability of the network can be expressed as the product of the state probabilities of the individual nodes:
π(k_1, …, k_N) = π_1(k_1) · π_2(k_2) · … · π_N(k_N)
Therefore, we can solve this class of networks in four steps:
1. Solve the traffic equations to find λ_i for each queuing node i.
2. Determine separately for each queuing system i its steady-state probabilities π_i(k_i).
3. Determine the global steady-state probabilities π(k_1, …, k_N).
4. Derive the desired global performance measures.
From step 1, we can derive the mean delay and queue length for each node.
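
The first three steps can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the authors' solver: it restricts every node to a single server (m_i = 1), so that π_i(k) = (1 − ρ_i) · ρ_i^k, it solves the traffic equations λ_i = γ_i + Σ_j λ_j · p_ji by simple fixed-point iteration, and the topology, rates, and function names are hypothetical.

```python
def solve_traffic_equations(external, routing, iterations=1000, tol=1e-12):
    """Step 1: solve lambda_i = external_i + sum_j lambda_j * routing[j][i].

    routing[j][i] is the probability that a record leaving node j goes to node i.
    """
    n = len(external)
    rates = list(external)
    for _ in range(iterations):
        new_rates = [
            external[i] + sum(rates[j] * routing[j][i] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_rates, rates)) < tol:
            return new_rates
        rates = new_rates
    raise RuntimeError("traffic equations did not converge")


def joint_probability(rates, service_rates, state):
    """Steps 2 and 3: product-form probability of state (k_1, ..., k_N).

    Assumes every node is a single-server (M/M/1) queue.
    """
    probability = 1.0
    for arrival_rate, service_rate, k in zip(rates, service_rates, state):
        if arrival_rate >= service_rate:
            raise ValueError("Jackson's condition lambda_i < mu_i is violated")
        rho = arrival_rate / service_rate
        probability *= (1 - rho) * rho ** k   # pi_i(k) for an M/M/1 node
    return probability


# Hypothetical network: two per-source filters (nodes 0 and 1) feeding a shared
# loader (node 2). Node 0 forwards half of its records, node 1 forwards all.
external = [300.0, 200.0, 0.0]
routing = [[0.0, 0.0, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
service_rates = [600.0, 400.0, 700.0]

rates = solve_traffic_equations(external, routing)
print("arrival rates per node:", rates)
print("probability of an empty network:", joint_probability(rates, service_rates, (0, 0, 0)))
```

Fixed-point iteration is used here only to keep the sketch free of dependencies; solving the same linear system directly would work equally well.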

Source Code Alterations
Original routine:
  Open_isam_File(){
    … opening_isam_file_commands …
  }
Altered routine:
  Open_isam_File(){
    … opening_isam_file_commands …
    if(open==success) DWFlowR_socket_open()
  }
Original routine:
  Write_record_to_File(){
    … insert_record_commands …
  }
Altered routine:
  Write_record_to_File(){
    … insert_record_commands …
    if(write==success) write_to_SFlowR()
  }
Original routine:
  Close_isam_File(){
    … closing_isam_file_commands …
  }
Altered routine:
  Close_isam_File(){
    … closing_isam_file_commands …
    if(close==success) DWFlowR_socket_close()
  }

First set of experiments


Source overhead

Topology and source overhead

Second set of experiments

Source overhead

Throughput for ETL operations

Scenarios to measure data freshness

Data Delay

Theoretical prediction vs. actual measurements of average queue length for scenario (c), in packets
[Table: columns Measured, Theoretical Prediction, Difference; rows FILTER_10_, FILTER_02_, SK_, GB_SUM_, WS_GB, WS_GB_UPD; numeric values not preserved]

Theoretical Predictions and Actual Measurements
In most cases, we underestimate the actual queue size by half a packet (i.e., 25 records)
We overestimate the actual queue size when we simulate slow servers, especially with the combination of large timeouts and large packets
Reasons for the discrepancies:
–Simulation of slower rates through timeouts
–Due to the row-blocking approach, the unit of transport is a single packet