A Fully Distributed, Fault-Tolerant Data Warehousing System

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
Computing Systems Laboratory, National Technical University of Athens
Motivation
- Large volumes of data
  - Everyday life (Web 2.0)
  - Science (LHC, NASA)
  - Business domain (automation, digitization, globalization)
  - New regulations: log, digitize, store everything
  - Sensors
- Immense production rates
- Distributed by nature

D. Tsoumakos, HDMS 2010
Motivation (contd.)
- Demand for always-on analytics
- Store huge datasets
  - Both structured and semi-structured bulk data
- Detection of real-time changes in trends
- Fast retrieval: point, range, aggregate queries
  - Intrusion or DoS detection, effects of a product's promotion
- Online, near real-time updates
  - From various locations, at high rates
(Up till) now
- Traditional data warehouses
  - Vast amounts of historical data: data cubes
  - Centralized, off-line approaches
  - Querying vs. updating
- Distributed warehousing systems
  - Functionality remains centralized
- Cloud infrastructures
  - Resources as a service
  - Elasticity, commodity hardware
  - Pay-as-you-go pricing model
Our Goal
- A distributed, data-warehousing-like system
  - Store, query, update
  - Multi-dimensional, hierarchical data
  - Scalable, always-on
- Shared-nothing architecture
  - Commodity nodes
- No proprietary tools needed
  - Java libraries, socket APIs
Brown Dwarf in a nutshell
- A complete system for data cubes
  - Distributed storage
  - Online updates
  - Efficient query resolution
    - Point and aggregate queries
    - Various levels of granularity
- Elastic resources according to
  - Workload skew
  - Node churn
Dwarf
- Dwarf computes, stores, indexes and updates materialized data cubes
- Eliminates prefix and suffix redundancies
- Centralized structure with d levels
  - Root contains all distinct values of the first dimension
  - Each cell points to a node of the next level
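The prefix sharing described above can be sketched as a d-level trie: tuples with a common prefix reuse the same nodes, and each cell at level i points to a node at level i+1. This is an illustrative sketch only, with hypothetical names; the real Dwarf additionally coalesces identical suffixes and materializes ALL (aggregate) cells, both omitted here.

```python
# Minimal sketch of a Dwarf-like d-level structure (illustrative; no suffix
# coalescing, no ALL-cells). Leaves hold aggregated measures.

class DwarfNode:
    def __init__(self):
        self.cells = {}  # attribute value -> child DwarfNode (measure at leaves)

def insert(root, dims, measure):
    """Insert one fact-table tuple, creating nodes along the d-level path."""
    node = root
    for attr in dims[:-1]:
        node = node.cells.setdefault(attr, DwarfNode())
    last = dims[-1]
    node.cells[last] = node.cells.get(last, 0) + measure  # aggregate duplicates

def point_query(root, dims):
    """Follow one cell per level; returns None if the path does not exist."""
    node = root
    for attr in dims[:-1]:
        node = node.cells.get(attr)
        if node is None:
            return None
    return node.cells.get(dims[-1])

root = DwarfNode()
insert(root, ("GR", "Athens", "2010"), 5)
insert(root, ("GR", "Athens", "2011"), 3)  # shares the GR/Athens prefix
insert(root, ("GR", "Patras", "2010"), 2)
print(point_query(root, ("GR", "Athens", "2010")))  # -> 5
```

The two Athens tuples share both level-1 and level-2 nodes, which is exactly the prefix redundancy Dwarf eliminates.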
Why distribute it?
- Store larger amounts of data
  - Dwarf may reduce but may also blow up data
  - High-dimensional, sparse cubes: >1,000 times
- Update and query the system online
- Accelerate creation, query and update speed
  - Parallelization
- What about failures, load balancing, communication costs? Performance?
Brown Dwarf (BD) Overview
- Dwarf nodes mapped to overlay nodes
  - UID for each dwarf node
  - Hint tables of the form (currAttr, child)
- Queries resolved and updates applied along a network path
- Mirrors on a per-node basis
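The mapping above can be sketched as follows: each dwarf node gets a UID and is stored on some overlay peer as a hint table, and a lookup follows one hint table per dimension level. The `peers` dictionary and all names here are illustrative stand-ins for the overlay's UID-to-peer lookup, not the actual BD API.

```python
# Sketch of Brown Dwarf routing (illustrative). A hint table maps
# currAttr -> child UID; resolution hops from hint table to hint table.

peers = {}  # UID -> hint table; stands in for the overlay lookup

def store(uid, hint_table):
    peers[uid] = hint_table

def resolve(root_uid, dims):
    """Walk one hint table per level; the last hop yields the measure."""
    uid = root_uid
    for attr in dims[:-1]:
        uid = peers[uid][attr]  # one network hop per level in the real system
    return peers[uid][dims[-1]]

store("n0", {"GR": "n1"})
store("n1", {"Athens": "n2"})
store("n2", {"2010": 5, "2011": 3})
print(resolve("n0", ("GR", "Athens", "2010")))  # -> 5
```

Because each level contributes one hop, a point query over d dimensions touches at most d+1 peers, matching the message bound reported in the evaluation.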
BD Operations: Insert + Query
- One pass over the fact table
- Gradual construction of hint tables
  - Creation of a cell → insertion of currAttr
  - Creation of a dwarf node → registration of the child
- Queries follow a path (d hops) along the structure
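The one-pass construction can be sketched as a single scan of the fact table in which inserting a cell adds a currAttr entry and creating a dwarf node registers a fresh UID as the parent's child. This is a simplified, single-process sketch with hypothetical names; in the real system the hint tables land on different overlay peers.

```python
# One-pass construction sketch (illustrative): scan the fact table once,
# building hint tables gradually.

import itertools

def build(fact_table):
    uids = itertools.count()
    root = f"n{next(uids)}"
    tables = {root: {}}
    for *dims, measure in fact_table:
        uid = root
        for attr in dims[:-1]:
            if attr not in tables[uid]:      # creation of a cell...
                child = f"n{next(uids)}"     # ...and of a new dwarf node
                tables[child] = {}
                tables[uid][attr] = child    # register child in the parent
            uid = tables[uid][attr]
        last = dims[-1]
        tables[uid][last] = tables[uid].get(last, 0) + measure
    return root, tables

root, tables = build([("GR", "Athens", "2010", 5),
                      ("GR", "Athens", "2011", 3)])
# the two tuples share the GR/Athens prefix -> only 3 hint tables in total
print(len(tables))  # -> 3
```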
BD Operations: Update
- Find the longest common prefix with the existing structure
- Underlying nodes recursively updated
  - Existing nodes expanded with new cells
  - New nodes created
- Affected ALL (aggregate) cells refreshed
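The update path above can be sketched as: walk the existing structure along the longest common prefix of the incoming tuple, then expand or create nodes for the remaining suffix. All names (including the dotted UID scheme) are illustrative, and the recursive refresh of affected ALL cells is omitted.

```python
# Update sketch (illustrative): longest-common-prefix walk, then expansion.

def update(tables, root, dims, measure):
    uid = root
    depth = 0
    # follow the longest common prefix that already exists
    while depth < len(dims) - 1 and dims[depth] in tables[uid]:
        uid = tables[uid][dims[depth]]
        depth += 1
    # expand existing nodes / create new ones for the remaining suffix
    for attr in dims[depth:-1]:
        child = f"{uid}.{attr}"       # simplistic UID scheme for the sketch
        tables[child] = {}
        tables[uid][attr] = child
        uid = child
    last = dims[-1]
    tables[uid][last] = tables[uid].get(last, 0) + measure

tables = {"n0": {"GR": "n1"}, "n1": {"Athens": "n2"}, "n2": {"2010": 5}}
update(tables, "n0", ("GR", "Patras", "2010"), 2)  # shares only the GR prefix
print(sorted(tables))  # -> ['n0', 'n1', 'n1.Patras', 'n2']
```

Only the nodes below the divergence point are touched, which is why updates cost far fewer messages than rebuilding the affected subcube.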
Elasticity of Brown Dwarf
- Static and adaptive replication vs:
  - Load (min/max load thresholds)
  - Churn (require ≥ k replicas)
- Local-only interactions
  - Ping/exchange hint tables for consistency
- Query forwarding to balance load
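The adaptive policy can be sketched as a per-node decision rule: expand when load exceeds a maximum threshold, contract when it falls below a minimum, but never drop below the k replicas required to survive churn. The thresholds and names here are assumptions for illustration, not the system's actual parameters.

```python
# Adaptive-replication sketch (illustrative thresholds).

K_MIN = 3       # minimum replicas required for fault tolerance
MAX_LOAD = 100  # requests/sec above which a node requests an extra mirror
MIN_LOAD = 10   # requests/sec below which a surplus mirror retires

def adapt(load, replicas):
    """Return the new replica count for one dwarf node."""
    if load > MAX_LOAD:
        return replicas + 1                   # expand under a hot spot
    if load < MIN_LOAD and replicas > K_MIN:
        return replicas - 1                   # contract, respecting k
    return replicas

print(adapt(150, 3))  # -> 4
print(adapt(5, 3))    # -> 3  (never below k)
```

Because the decision uses only the node's own load and replica count, it matches the local-only interaction model: no global coordinator is consulted.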
Experimental Evaluation
- 16 LAN commodity nodes (dual-core 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - 5d to 25d, various levels of skew (Zipf θ = 0.95)
  - APB-1 benchmark generator
  - Forest and Weather datasets
- Simulation results with 1000s of nodes
Cube Construction
- Acceleration of cube creation by up to 3.5 times compared to Dwarf
  - Better use of resources through parallelization
  - More noticeable effect for high-dimensional, skewed datasets
- Storage overhead
  - Mainly attributed to the mapping between dwarf-node and network IDs
  - Shared among network nodes

            Uniform                      Zipf
 d    Size (MB)    Time (sec)    Size (MB)    Time (sec)
     Dwarf   BD   Dwarf   BD    Dwarf   BD   Dwarf    BD
 5      1     1      4     4       1     1      3      4
10      4     5     31    13       6     7     54     21
15      7     9     63    29      22    27    226     74
20     13    17    122    50      54    69    543    204
25     18    23    198    88     152   195   1206    535
Updates
- 1% updates
- Up to 2.3 times faster than Dwarf for the skewed dataset
- Dimensionality increases the cost

           Uniform                    Zipf
 d    Time (sec)   Msg/upd    Time (sec)   Msg/upd
     Dwarf   BD              Dwarf   BD
 5      7     7      15        8     6       14
10     18    14      51       21    14       50
15     31    22     111       43    31      120
20     48    28     193      104    66      200
25     89    39     301      172   104      306
Queries
- 1K query sets, 50% aggregate queries
- Impressive acceleration of up to 60 times
- Message cost bounded by d+1

           Uniform                    Zipf
 d    Time (sec)   Msg/quer   Time (sec)   Msg/quer
     Dwarf   BD              Dwarf   BD
 5      5     4       6        2     2        6
10     30     3      11       29     1       11
15     65     3      16       55     1       16
20    102     3      21       88     2       20
25    182    13      26      172     9       26
Elasticity
- 10-d, 100k-tuple datasets, 5k query sets
- λ = 10 qu/sec → 100 qu/sec
  - BD adapts according to demand → elasticity
- k = 3, N_fail failing nodes every T_fail sec
  - 5k queries, 10-d uniform dataset
  - No loss for N_fail < k+1
  - Query time increases due to redirections

Dimitrios Tsoumakos, UoI Talk, 23/02/2010
What have we achieved so far?
- BD optimizations, work in progress: replication units (chunks, …), hierarchies for faster updates (MDAC 2010), …
- Brown Dwarf focuses on:
  Pros: efficient answering of aggregate queries; cloud-friendly
  Cons: preprocessing required; costly updates
- HiPPIS project:
  Pros: explicit support for hierarchical data; no preprocessing; ease of insertion and updates
  Cons: processing needed for aggregate queries
Questions?