29/1/2014 Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma,

Presentation transcript:

Efficient Updates for a Shared Nothing Analytics Platform
Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
{katerina, dtsouma,
Computing Systems Laboratory, National Technical University of Athens
29/1/2014

Motivation
- Large volumes of data
  - Everyday life, science and business domains
- Time-series data
  - Temporally ordered, organized in hierarchies (Day < Month < Year)
  - E.g., date of a credit card purchase, time of a phone call
  - Important for monitoring a process of interest
- On-line processing
  - Fast retrieval: point, range, aggregate queries
  - Detection of real-time changes in trends
    - Intrusion or DoS detection, effects of product promotions
  - Online, cost-efficient updates

Up till now
- Data warehouses
  - Centralized, off-line approaches
- Distributed warehousing systems
  - Functionality remains centralized
- Distributed warehouse-like initiative: Brown Dwarf
  - Distribution of the centralized Dwarf
  - Deployed on shared-nothing, commodity hardware
  - Scalability, fault tolerance, performance
  - No special consideration for time-series data
  - Update procedure is costly, unfit for frequent updates

Our Goals
- Cloud-based, data-warehousing-like system
- Targeted to time-series data
  - Arriving at a high rate
- Store, update, query data at various granularity levels
  - Multidimensional, hierarchical
- Shared-nothing architecture
  - Commodity nodes
- Without use of any proprietary tool
  - Java libraries, socket APIs

Our Contribution
- Complete system for multidimensional time-series data
  - Store with one pass
  - Update online
  - Query efficiently
    - Point, aggregate
    - Various levels of granularity
- Adaptive materialization
  - According to data recency
  - Accelerate cube creation/update
  - Minimize storage consumption

Dwarf
- Dwarf computes, stores, indexes and updates materialized cubes
- Eliminates prefix and suffix redundancies
- Any query (point or aggregate) is answered through traversal of the structure

Brown Dwarf
- Dwarf nodes mapped to overlay nodes
  - UID for each node
  - Hint tables of the form (currAttr, child)
- Insertion
  - One pass over the fact table
  - Gradual structure of hint tables
- Queries
  - Overlay path of d hops
- Incremental updates
- Elasticity through adaptive mirroring
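The hint-table idea on this slide can be illustrated with a minimal single-process sketch. It is an assumption-laden stand-in, not the Brown Dwarf implementation: overlay nodes are simulated by a dictionary keyed by UID, the function names `build_hint_tables` and `query` are hypothetical, and Dwarf's suffix coalescing is omitted, so only prefixes are shared.

```python
from itertools import product

ALL = "ALL"  # wildcard cell that aggregates over a whole dimension

def build_hint_tables(facts):
    """One pass over (dim_1, ..., dim_d, measure) tuples.

    Returns {UID: hint table}. A hint table maps currAttr to a child UID,
    or to the aggregate value at the last dimension. Simplification:
    no suffix coalescing, so this is larger than a real Dwarf.
    """
    nodes = {0: {}}  # UID 0 is the root node
    next_uid = 1
    for *dims, measure in facts:
        # Every dimension value is reachable directly or through ALL.
        for path in product(*[(value, ALL) for value in dims]):
            uid = 0
            for key in path[:-1]:
                table = nodes[uid]
                if key not in table:
                    table[key] = next_uid
                    nodes[next_uid] = {}
                    next_uid += 1
                uid = table[key]
            leaf = nodes[uid]
            leaf[path[-1]] = leaf.get(path[-1], 0) + measure
    return nodes

def query(nodes, keys):
    """Answer a point or aggregate query; returns (value, hops)."""
    entry, hops = 0, 0  # start at the root UID
    for key in keys:
        entry = nodes[entry][key]  # one hint-table lookup = one overlay hop
        hops += 1
    return entry, hops

facts = [("S1", "C1", "P1", 10), ("S1", "C2", "P1", 20), ("S2", "C1", "P2", 5)]
nodes = build_hint_tables(facts)
print(query(nodes, ("S1", ALL, ALL)))  # (30, 3)
```

With d = 3 dimensions, every query consults exactly three hint tables, which mirrors the "overlay path of d hops" on the slide.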

Advantages and Drawbacks
- Store even larger amounts of data
  - Dwarf reduces but may also blow up data
  - High-dimensional, sparse, >1,000 times
- Handle many more requests
- Query the system online
- Accelerate creation (up to 5 times) and querying (up to 60 times)
  - Parallelization
- Update remains costly

Time Series Dwarf (TSD)
- A concept hierarchy characterizes time and any other dimension
- Updates are applied in temporal order
- Temporal granularity of queries is relative to the time of querying
  - More detailed queries for recent events
  - More coarse-grained queries for past events

TSD Operations - Insertion
- Time first in order
- Lack of ALL cell in Time
- Aggregate created after completion of a level
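A rough sketch of "aggregate created after completion of a level", restricted to the Time dimension with a Day < Month < Year hierarchy. The function name `insert_ordered` and the flat dictionaries are hypothetical; the other dimensions and the distributed structure are left out.

```python
def insert_ordered(tuples):
    """Insert (year, month, day, measure) tuples that arrive in temporal order.

    A month aggregate is created only once the first tuple of a later month
    arrives, and likewise for years. There is no ALL cell for Time.
    """
    days, months, years = {}, {}, {}
    prev = None
    for year, month, day, measure in tuples:
        cur = (year, month, day)
        if prev is not None and cur < prev:
            raise ValueError("tuples must arrive in temporal order")
        if prev is not None and (year, month) != prev[:2]:
            # The previous month is complete: create its aggregate now.
            done = prev[:2]
            months[done] = sum(v for (y, m, _), v in days.items() if (y, m) == done)
        if prev is not None and year != prev[0]:
            # The previous year is complete: aggregate its months.
            years[prev[0]] = sum(v for (y, _), v in months.items() if y == prev[0])
        days[cur] = days.get(cur, 0) + measure
        prev = cur
    return days, months, years

stream = [(2013, 12, 30, 1), (2013, 12, 31, 2), (2014, 1, 1, 4),
          (2014, 1, 2, 8), (2014, 2, 1, 16)]
days, months, years = insert_ordered(stream)
print(months)  # {(2013, 12): 3, (2014, 1): 12}
print(years)   # {2013: 3}
```

The still-open month (February 2014) and year (2014) have no aggregate yet, which is why roll-up queries over recent data need different handling.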

TSD Operations - Querying
- Follow path along the structure
- Roll-up query for an aggregate already created
  - Within d hops (e.g., )
- Roll-up query for recent records
  - Initial query substituted by multiple lower-level queries (e.g., )
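The two roll-up cases can be sketched as follows, again for the Time dimension only. The dictionaries `days` and `months` and the function `query_month` are illustrative assumptions; in the real system each lookup would be a traversal of the distributed structure.

```python
def query_month(days, months, year, month):
    """Return (value, lookups) for a month-level roll-up query.

    If the month aggregate already exists, it is read directly. Otherwise
    the query is substituted by one day-level query per stored day.
    """
    if (year, month) in months:
        return months[(year, month)], 1
    parts = [v for (y, m, _), v in days.items() if (y, m) == (year, month)]
    return sum(parts), len(parts)

# Hypothetical state: January is complete, February is still being filled.
months = {(2014, 1): 12}
days = {(2014, 1, 1): 4, (2014, 1, 2): 8, (2014, 2, 1): 16, (2014, 2, 2): 1}

print(query_month(days, months, 2014, 1))  # (12, 1)
print(query_month(days, months, 2014, 2))  # (17, 2)
```

The second result shows the cost of the substitution: the answer is the same kind of aggregate, but it takes one lookup per lower-level cell instead of one.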

TSD Operations - Updating
- Insertion of a new tuple
  - Longest common prefix with existing structure
  - Underlying nodes recursively updated
- Lack of ALL cell for Time + temporal ordering = fewer existing cells affected
  - Example: 3 TSD nodes vs. 12 Dwarf nodes affected
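The prefix-matching step of an update can be sketched on a plain nested-dictionary trie. This covers only the "longest common prefix" part: the recursive update of underlying nodes and any ALL cells is omitted, and `insert_tuple` is a hypothetical name.

```python
def insert_tuple(root, dims, measure):
    """Insert a tuple into a nested-dict trie.

    Returns the length of the longest common prefix shared with the
    existing structure; only the part below that prefix is newly created.
    """
    node, shared = root, 0
    for key in dims[:-1]:
        if key in node:
            shared += 1  # this level already exists and is reused
        # Once a key is missing, a fresh empty dict is created, so no
        # deeper key can match and `shared` stops growing.
        node = node.setdefault(key, {})
    last = dims[-1]
    if last in node:
        shared += 1
    node[last] = node.get(last, 0) + measure
    return shared

root = {}
print(insert_tuple(root, ("2014", "Jan", "29", "S1"), 10))  # 0
print(insert_tuple(root, ("2014", "Jan", "29", "S2"), 5))   # 3
print(insert_tuple(root, ("2014", "Feb", "01", "S1"), 7))   # 1
```

Because tuples arrive in temporal order and Time comes first, a new tuple tends to share a long prefix with the most recent path and leaves older subtrees untouched.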

Adaptive Materialization
- A daemon process asynchronously
  - creates roll-up views
  - deletes the corresponding drill-down ones
- The period of this process depends on the application
- Tradeoff: cube size vs. response accuracy
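One pass of such a daemon might look like the sketch below. The `keep_days` threshold, the function name `materialize_pass` and the flat dictionaries are assumptions made for illustration; the slides do not specify how the recency boundary is chosen.

```python
from datetime import date, timedelta

def materialize_pass(days, months, today, keep_days=31):
    """Roll day-level cells of old months up into month-level views.

    A day is rolled up only if its whole month lies before the month of
    the cutoff date, so a month is never left half day-level and half
    aggregated. The rolled-up day cells are deleted (the drill-down view).
    """
    cutoff = today - timedelta(days=keep_days)
    old = [d for d in days if (d.year, d.month) < (cutoff.year, cutoff.month)]
    for day in sorted(old):
        key = (day.year, day.month)
        months[key] = months.get(key, 0) + days.pop(day)
    return days, months

days = {date(2013, 11, 30): 1, date(2013, 12, 1): 2,
        date(2013, 12, 31): 4, date(2014, 1, 28): 8}
months = {}
materialize_pass(days, months, today=date(2014, 1, 29), keep_days=20)
print(months)      # {(2013, 11): 1, (2013, 12): 6}
print(len(days))   # 1
```

After the pass, the cube holds fewer cells, but a day-level query for December 2013 can no longer be answered exactly. That is the cube size vs. response accuracy tradeoff named on the slide. A real daemon would call such a pass periodically, at an application-dependent interval.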

Experimental Evaluation
- 25 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - APB-1 Benchmark generator
    - 4-d, 3 levels for Time, various densities
  - DARPA Intrusion Detection audit data
    - 1M tuples, 7-d, 3 levels for Time
- TSD: static mode
- TSD ad: adaptive mode

Cube Construction
- Noticeable reduction of cube size for TSD, impressive for TSD ad (up to 85% for the APB dataset)
  - Lack of the ALL cell in the first dimension
- Acceleration of cube creation up to 89% compared to Dwarf
  - Better use of resources through parallelization (BD)
  - Further reduction due to lack of ALL and selective materialization
Table columns: Dataset | #Tuples | Size (MB): Dwarf, BD, TSD, TSD ad | Time (sec): Dwarf, BD, TSD, TSD ad
Table rows (measured values missing from the transcript): APB-A (1.2M tuples), APB-B (2.5M), APB-C (3.7M), DARPA (1.1M)

Updates
- 10k updates
- TSD up to 3 times faster than Dwarf and 30% faster than BD
  - Ordered updates do not affect already created views
  - No recursive updates for the ALL cell of the first dimension
  - Smaller communication overhead (3-fold reduction)
- TSD ad does not include roll-up view creation (asynchronous)
  - Further acceleration of ~20%
Table columns: Dataset | Time (sec): Dwarf, BD, TSD, TSD ad | Msgs/update: BD, TSD, TSD ad
Table rows (measured values missing from the transcript): APB-A, APB-B, APB-C, DARPA

Queries
- DARPA 10k datasets: 3 kinds of querysets, 50% aggregates
  - Q1: Ideal
  - Q2: Recent records are queried upon in more detail (Zipfian)
  - Q3: Random
- As the queryset approximates a uniform distribution
  - Message cost increases
  - Accuracy decreases
Table columns: Queryset | Time (sec): BD, TSD, TSD ad | Msgs/query: BD, TSD, TSD ad | % Inaccurate queries | % Resp. deviation
Table rows (measured values missing from the transcript): Q1, Q2, Q3

Questions