Statistical structures for Internet-scale data management
Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum
Presented by Fateme Shirazi, Spring 2010
Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Implementation; Results; Conclusion
Peer-to-Peer (P2P): File sharing in overlay networks. Millions of users (peers) provide storage and bandwidth for searching and fetching files.
Motivation: In P2P file sharing, the total number of (unique) documents shared by the users is often needed. Distributed P2P search engines need to evaluate the significance of keywords: the ratio of indexed documents containing each keyword to the total number of indexed documents.
Motivation (cont.): Internet-scale information retrieval systems need a method to deduce the rank/score of data items. Sensor networks need methods to compute aggregates. Traditionally, query optimizers rely on histograms over stored data to estimate the sizes of intermediate results.
Overview: A large number of nodes form the system's infrastructure. Nodes contribute and/or store data items and are involved in operations such as computing synopses and building histograms. In general, queries do not affect all nodes: aggregation functions are computed over data sets selected dynamically by a filter predicate of the query.
Problem Formulation: Relevant data items are stored in unpredictable ways in a subset of all nodes. A large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network. Relevant queries and synopses may be built and used over any of these data sets.
Computational Model: Data stored in the P2P network is structured in relations. Each relation R consists of (k+l) attributes (columns): R(a1,…,ak,b1,…,bl). A tuple's identifier is either one of the attributes of the tuple or is calculated otherwise (e.g., as a combination of its attributes).
Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion
Distributed Hash Tables: A family of structured P2P network overlays exposing a hash-table-like interface (lookup service). Examples of DHTs include Chord, Kademlia, Pastry, CAN, and others. Any node can efficiently retrieve the value stored under a given key.
Chord: Nodes are assigned identifiers from a circular ID space, computed as the hash of the node's IP address. The ID space is partitioned among the nodes so that each node is responsible for a well-defined set (arc) of identifiers. Each item is also assigned a unique identifier from the same ID space and is stored at the node whose ID is closest to the item's ID.
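The Chord-style placement described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the ID space is shrunk to 16 bits, the IP addresses are made up, and "closest node" is taken as the successor on the ring (as in Chord).

```python
import hashlib

ID_BITS = 16  # tiny ID space for illustration; Chord typically uses 160 bits

def chord_id(key: str) -> int:
    """Hash a key (a node's IP address or an item's key) onto the circular ID space."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)

def successor(item_id: int, node_ids: list) -> int:
    """Return the node responsible for item_id: the first node ID at or
    after the item's position on the ring, wrapping around if needed."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= item_id:
            return nid
    return ring[0]  # wrap around the ring

# Hypothetical four-node network and one item
nodes = [chord_id(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]]
item = chord_id("my item key")
print("item", item, "stored at node", successor(item, nodes))
```

Each node thus owns the arc of identifiers between its predecessor's ID and its own ID.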
Hash Sketches: Estimate the number of distinct items in a multiset D of data in a database. Application domains that need distinct-element counting include: approximate query answering in very large databases, data mining on the Internet graph, and stream processing.
Hash Sketches (cont.): A hash sketch consists of a bit vector B[·] of length L. To estimate the number n of distinct elements in D, ρ(h(d)) is applied to every d ∈ D and the results are recorded in the bitmap vector B[0…L−1]. (Figure: items d1…d4 inserted into B, LSB to MSB; partially copied from slides of the author.)
Hash sketches: Insertions. (Figure: data items d1…dn are hashed by h() to L-bit pseudo-random numbers PRN1…PRNn; each item sets the bit at position ρ(h(d)) in the bit vector B[0…L−1], LSB to MSB. Copied from slides of the author.)
Hash Sketches (cont.): Since h() distributes values uniformly over [0, 2^L), P(ρ(h(d)) = k) = 2^(−k−1). Let R be the position of the least-significant 0-bit in B; then 2^R ~ n. (Example: with B = 110000, LSB to MSB, R = 2 and |D| ~ 2^2 = 4. Partially copied from slides of the author.)
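A minimal single-machine hash sketch, following the definitions above: insertion sets bit ρ(h(d)), and the estimate is 2^R for R the least-significant unset bit. The hash function choice (SHA-1 truncated to L bits) is an assumption for illustration; the original Flajolet-Martin estimator additionally divides 2^R by a correction factor φ ≈ 0.77351, omitted here to match the slide.

```python
import hashlib

L = 32  # bit-vector length

def h(item: str) -> int:
    """Uniform hash of an item into [0, 2^L) (illustrative choice)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest(), "big") % (2 ** L)

def rho(x: int) -> int:
    """Position of the least-significant 1-bit, i.e. number of trailing zeros."""
    r = 0
    while x % 2 == 0 and r < L - 1:
        x //= 2
        r += 1
    return r

def insert(bitmap: list, item: str) -> None:
    """Record the item: set bit rho(h(item))."""
    bitmap[rho(h(item))] = 1

def estimate(bitmap: list) -> int:
    """R = position of least-significant 0-bit; estimate n ~ 2^R."""
    R = bitmap.index(0) if 0 in bitmap else L
    return 2 ** R

bitmap = [0] * L
for d in (f"item-{i}" for i in range(1000)):
    insert(bitmap, d)
print(estimate(bitmap))  # rough estimate of 1000 distinct items
```

Duplicates hash to the same bit, which is why the sketch counts distinct items only.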
Distributing Data Synopses: (1) the "conservative" but popular rendezvous-based approach; (2) the decentralized DHS approach, in which no node has any sort of special functionality. (Partially copied from slides of the author.)
Mapping DHS bits to DHT nodes. (Figure: the Chord ring of nodes N1…N56 is partitioned into arcs, one arc per bit position: bit 0, bit 1, bit 2, bit 3, …. Copied from slides of the author.)
DHS: Counting. (Figure: a counting node probes the ring of nodes N1…N56: bits above 3 are not set; bit 2 not set, retrying; bit 2 not set; bit 1 not set, retrying; bit 1 set. Copied from slides of the author.)
Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion
Computing Aggregates: COUNT-DISTINCT: estimate the number of distinct items in a multiset. COUNT: insert tuple IDs into the corresponding synopsis, instead of the values of the column in question. SUM: each node locally computes the sum of the values of the column over the tuples it stores and populates a local hash sketch. AVG: estimate the SUM and COUNT of the column and take their ratio.
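The COUNT and SUM variants can be sketched as follows. COUNT inserts tuple IDs, as the slide says. For SUM, one standard way to fold an integer value into a duplicate-insensitive sketch is to insert `value` distinct sub-items per tuple; the slide does not spell out the exact encoding, so the `tuple_id#j` scheme below is an assumption, not the paper's method.

```python
import hashlib

L = 32  # sketch length

def set_bit(bitmap: list, key: str) -> None:
    """Hash the key into [0, 2^L), then set the bit at the trailing-zero count."""
    x = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % (2 ** L)
    r = 0
    while x % 2 == 0 and r < L - 1:
        x //= 2
        r += 1
    bitmap[r] = 1

def insert_count(bitmap: list, tuple_id: str) -> None:
    """COUNT: insert the tuple ID itself, not the column value."""
    set_bit(bitmap, tuple_id)

def insert_sum(bitmap: list, tuple_id: str, value: int) -> None:
    """SUM: insert `value` distinct sub-items so the union sketch's
    distinct-count estimate approximates the sum (hypothetical encoding)."""
    for j in range(value):
        set_bit(bitmap, f"{tuple_id}#{j}")

bm = [0] * L
insert_count(bm, "tuple-17")
insert_sum(bm, "tuple-17", 5)
```

AVG then needs no sketch of its own: it is the ratio of the SUM and COUNT estimates.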
COUNT-DISTINCT: Both rendezvous-based hash sketches and the DHS are applicable to estimating the number of distinct items in a multiset. Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system.
Counting with the Rendezvous Approach: Nodes first compute a rendezvous ID (e.g., h(attr1) = 47). Each node then computes its synopsis locally and sends it to the node whose ID is closest to that ID (the "rendezvous node"). The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis. Interested nodes can then acquire the global synopsis by querying the rendezvous node.
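The merge step at the rendezvous node is just a bitwise OR, which works because hash sketches are duplicate-insensitive: the OR of the local bitmaps equals the bitmap that would have been built over the union of all local data sets. A minimal sketch of that step:

```python
def combine(sketches: list) -> list:
    """Merge local hash sketches (equal-length bit vectors) by bitwise OR.
    The result is the global synopsis over the union of the local data sets."""
    length = len(sketches[0])
    return [int(any(s[i] for s in sketches)) for i in range(length)]

# Two hypothetical local sketches arriving at the rendezvous node
local_a = [1, 1, 0, 0]
local_b = [1, 0, 1, 0]
print(combine([local_a, local_b]))  # [1, 1, 1, 0]
```

Because OR is idempotent, items counted by several nodes are not double-counted.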
23
Step 1 23
24
Step 2 24
25
Step 3 25
Counting with DHS: In the DHS-based case, nodes storing tuples of R insert them into the DHS by: (1) hashing their tuples and computing ρ(hash) for each tuple; (2) for each tuple, sending a "set-to-1" message to a random ID in the corresponding arc; (3) counting then consists of probing random nodes in the arcs corresponding to increasing bit positions until a 0-bit is found.
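The DHS probe loop above can be simulated in miniature. In this sketch each bit position has an arc of replica nodes, a bit counts as set if any probed replica holds a 1, and the counting node scans bit positions upward until the first unset bit R, returning 2^R. The arc contents and probe count are made-up illustration values, not the paper's parameters.

```python
import random

def bit_is_set(arc: dict, samples: int = 3) -> bool:
    """A counting node probes a few random replicas in the bit's arc;
    the bit counts as set if any probed replica has it set."""
    nodes = list(arc)
    probed = random.sample(nodes, min(samples, len(nodes)))
    return any(arc[n] for n in probed)

def dhs_estimate(arcs: list) -> int:
    """Probe arcs for increasing bit positions until an unset bit R; return 2^R."""
    for R, arc in enumerate(arcs):
        if not bit_is_set(arc):
            return 2 ** R
    return 2 ** len(arcs)

# Hypothetical arc state: arcs[b] maps replica node -> its local value of bit b.
arcs = [
    {"n1": 1, "n2": 1, "n3": 1},  # bit 0: set everywhere
    {"n4": 0, "n5": 1, "n6": 0},  # bit 1: a "set-to-1" landed on one replica
    {"n7": 0, "n8": 0, "n9": 0},  # bit 2: unset
]
print(dhs_estimate(arcs))  # 4, since bit 2 is the first unset position
```

Probing only a sample of each arc is what trades a little accuracy for far fewer hops; with fewer probes than replicas, a set bit can occasionally be missed.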
(Figure slides: steps 1-3 of DHS-based insertion and counting.)
Histograms: The most common statistical-summary technique used by commercial databases: an approximation of the distribution of values in base relations. For a given attribute/column, a histogram is a grouping of attribute values into "buckets". (Figure: example histogram, Salary vs. Age.)
Constructing histograms. Equi-Width histograms: the most basic histogram variant. Partition the attribute value domain into cells (buckets) of equal spread and assign to each bucket the number of tuples whose attribute value falls in it.
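An equi-width histogram is simple enough to state in code. The bucket index is the value's offset from the domain start divided by the fixed bucket width; the salary figures below are made-up example data.

```python
def equi_width_histogram(values: list, lo: float, hi: float, buckets: int) -> list:
    """Split [lo, hi) into `buckets` cells of equal spread and count
    the tuples whose attribute value falls into each cell."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        b = min(int((v - lo) / width), buckets - 1)  # clamp hi into last bucket
        counts[b] += 1
    return counts

salaries = [12, 25, 37, 41, 58, 63, 77, 89, 95, 99]
print(equi_width_histogram(salaries, 0, 100, 4))  # [1, 3, 2, 4]
```

Equal spreads make bucket lookup O(1), but skewed data can leave some buckets nearly empty and others overloaded, which motivates the variants on the next slide.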
Other histogram types. Average-Shifted Equi-Width histograms (ASH): consist of several EWHs with different starting positions in the value space; the frequency of each value in a bucket is computed as the average of the estimates given by the component histograms. Equi-Depth histograms: all buckets have equal frequencies but not (necessarily) equal spreads.
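For contrast with the equi-width case, here is a minimal equi-depth construction: sort the values and place bucket boundaries at equally spaced ranks, so every bucket holds roughly the same number of tuples while its spread follows the data. The boundary-picking rule (value at rank i*n/buckets) is one common convention, assumed here for illustration.

```python
def equi_depth_boundaries(values: list, buckets: int) -> list:
    """Return the interior bucket boundaries of an equi-depth histogram:
    each of the `buckets` buckets covers (roughly) n/buckets tuples."""
    vals = sorted(values)
    n = len(vals)
    return [vals[(i * n) // buckets] for i in range(1, buckets)]

data = [1, 2, 2, 3, 10, 11, 50, 51, 52, 99]
print(equi_depth_boundaries(data, 4))  # [2, 11, 51]
```

Note how the bucket spreads adapt: the dense region around 1-3 gets a narrow bucket, while the sparse tail toward 99 gets a wide one.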
Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Implementation; Results; Conclusion
Implementation: 1. Generate the workload. 2. Populate the network with peers. 3. Randomly assign data tuples from the base data to nodes in the overlay. 4. Insert all nodes into the P2P network. 5. Select random nodes, reconstruct histograms, and compute aggregates.
Measures of Interest: (1) the fairness of the load distribution across nodes in the network; (2) the accuracy of the estimation itself; (3) the number of hops required to do the estimation. Together these show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches.
Fairness: To compute fairness, the load on a given node is measured as the insertion/query/probe "hits" on the node, i.e., the number of times the node is the target of an insertion/query/probe operation. Several metrics are used, specifically: the Gini Coefficient, the Fairness Index, and the maximum and total loads for the DHS- and rendezvous-based approaches.
The Gini Coefficient: the mean of the absolute difference of every possible pair of loads. It takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state (perfect balance), with values approaching 1.0 being the worst. The Gini Coefficient roughly represents the amount of imbalance in the system. (Figure: Lorenz curve with areas A and B; Gini = A/(A+B).)
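The pairwise definition above translates directly to code: the mean absolute difference over all load pairs, normalized by twice the mean load so the result lands in [0, 1). The load vectors below are made-up examples.

```python
def gini(loads: list) -> float:
    """Gini coefficient of a load vector: mean absolute pairwise
    difference divided by twice the mean. 0.0 = perfectly balanced."""
    n = len(loads)
    mean = sum(loads) / n
    mad = sum(abs(x - y) for x in loads for y in loads) / (n * n)
    return mad / (2 * mean)

print(gini([10, 10, 10, 10]))  # 0.0, perfectly fair
print(gini([40, 0, 0, 0]))     # 0.75, one node carries all the load
```

This O(n^2) form is fine for small measurements; the sorted-rank formula computes the same value in O(n log n) if needed.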
Estimation error: The mean error of the estimation is reported, computed as the percentage by which the distributed estimate differs from the aggregate computed in a centralized manner (i.e., as if all data were stored on a single host).
Hop-count Costs: The per-node average hop count for inserting all tuples into the distributed synopsis is measured and shown. The per-node hop-count costs are higher for the DHS-based approach.
Outline: Introduction; Background; Computing aggregates and building histograms; Implementation; Results; Conclusion
Results: The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured. Initially, single-attribute relations are created with integer values in the interval [0, 1000), following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, or 1.2.
Total query load (node hits) over time. (Figure.)
Load distribution: The extra hop-count cost of the DHS-based approach pays off when it comes to load-distribution fairness. The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing.
Gini Coefficient. (Figures: rendezvous approach vs. DHS approach.)
Evolution of the Gini coefficient: In the rendezvous-based approach a single node carries all the query load. The DHS-based approaches reach a GC of approximately 0.5, which equals the GC of the distribution of the distances between consecutive nodes in the ID space, and is thus the best value attainable by any algorithm using randomized assignment of items to nodes.
Evolution of the Gini coefficient. (Figure.)
Error for computing the COUNT aggregate: In both cases the error is due to the use of hash sketches, and both approaches exhibit the same average error. As expected, the higher the number of bitmaps in the synopsis, the better the accuracy. (Figures: rendezvous approach vs. DHS approach.)
Insertion hop count: The insertion hop-count cost for all aggregates. Hop-count costs are higher for the DHS-based approach, by approximately 8x for both the insertion and query cases. (Figures: rendezvous approach vs. DHS approach.)
Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion
Contributions: A framework for distributed statistical synopses for Internet-scale networks such as P2P systems. Extending centralized-setting techniques towards distributed settings. Developing DHT-based higher-level synopses such as Equi-Width, ASH, and Equi-Depth histograms.
Conclusion: A fully distributed cardinality estimator providing scalability, efficiency, and accuracy. It is constructed efficiently and scales well with growing network size, while remaining highly accurate. It provides a trade-off between accuracy and construction/maintenance costs, and the (access and maintenance) load on nodes is fully balanced.
Future research: Examining auto-tuning capabilities for the histogram inference engine. Integrating it with Internet-scale query processing systems. Looking into implementations for other types of synopses, aggregates, and histogram variants. Finally, using these tools for approximate query answering.
Thank you