Published by Aubrey Johns. Modified over 9 years ago.
1 Lightweight Indexing of Observational Data in Log-Structured Storage
Sheng Wang, Beng Chin Ooi (National University of Singapore)
David Maier (Portland State University)
VLDB 2014
2 Outline
Background
Challenges
Contributions
CR-Index
Index Optimization
Experiments
3 Background
Huge amounts of data are generated by sensors every day
Data are growing in both precision and quantity
Storage must sustain high write throughput
Queries must remain efficient
4 Challenges
State-of-the-art storage does not take high write throughput into account (CMOP: RDBMS + netCDF)
Record-level indexes incur significant index maintenance cost
– B+-tree: random I/O due to updates
– LSM-tree: large number of index entries
5 Contributions
A schema for storing observational data in LogBase [1] to facilitate indexing
A novel, lightweight index structure for range queries, called the CR-Index, which takes full advantage of observational-data traits
Experiments on two real-world observational datasets
[1] H. T. Vo, S. Wang, D. Agrawal, G. Chen, B. C. Ooi. Int'l Conference on Very Large Data Bases (VLDB), PVLDB 5(10):1004-1015, 2012.
6 Storage
High write throughput
Traits of observational data
– no updates
– continuous change
– potential discontinuities
Log-structured storage
– appends new data to the end of a file
– avoids random I/O
7 LogBase
An unordered, column-oriented, distributed log store
Each node is responsible for one or more partitions of a table
Version control and transaction semantics
Relational data model
– each record has a primary key and several attributes
– each record is decomposed into a set of cells (KEY, ATTRIBUTE, VALUE, TIMESTAMP)
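The cell decomposition described on this slide can be sketched as follows. This is only an illustration, not LogBase code: the `Cell` type, the `decompose` helper, and the in-memory list standing in for the log are all hypothetical names chosen for the example.

```python
from typing import Any, Dict, List, NamedTuple


class Cell(NamedTuple):
    """One (KEY, ATTRIBUTE, VALUE, TIMESTAMP) cell, as named on the slide."""
    key: str
    attribute: str
    value: Any
    timestamp: int


def decompose(key: str, attributes: Dict[str, Any], timestamp: int) -> List[Cell]:
    """Split one record into one cell per attribute (hypothetical helper)."""
    return [Cell(key, name, value, timestamp) for name, value in attributes.items()]


# A plain list stands in for the append-only log file (assumption for the sketch).
log: List[Cell] = []
log.extend(decompose("sensor-1", {"salinity": 31.2, "temperature": 11.5}, 1000))
log.extend(decompose("sensor-1", {"salinity": 31.4, "temperature": 11.6}, 1001))
```

New cells are only ever appended, which mirrors the no-update trait of observational data.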
8 Architecture
9 Logic View & Physical View
10 Basic query formats
Time range
Value range
11 Observational data locality
In general, an append-only strategy hurts read performance, but a log store provides considerable data locality
– time-ordered property: a time range query becomes a sequential scan
– value-correlated property: due to the continuity trait, once a record is inside a value range, surrounding records will likely lie in the range too
12 Continuous Range Index (CR-Index)
The value-correlated property implies that a seek in the log can potentially yield many results
We need not locate qualifying records individually, as long as we identify the regions that contain results
Group successive records into blocks, each treated as an atomic unit
Each block is summarized by a value range using a boundary pair
A range query can be transformed into an intersection-checking problem
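The block summarization and intersection check above can be sketched as below. It is a minimal illustration under assumptions: `CRRecord`, `build_cr_records`, and `candidate_blocks` are hypothetical names, the boundary pair is taken to be the (min, max) of the block's values, and a linear pass stands in for a real index.

```python
from typing import List, NamedTuple, Sequence


class CRRecord(NamedTuple):
    """Summary of one block: its id and boundary pair (low, high)."""
    block_id: int
    low: float
    high: float


def build_cr_records(values: Sequence[float], block_len: int) -> List[CRRecord]:
    """Group successive values into fixed-length blocks and summarize each one."""
    records = []
    for start in range(0, len(values), block_len):
        block = values[start:start + block_len]
        records.append(CRRecord(start // block_len, min(block), max(block)))
    return records


def intersects(rec: CRRecord, a: float, b: float) -> bool:
    """True if the closed query range [a, b] overlaps the block's boundary pair."""
    return rec.low <= b and a <= rec.high


def candidate_blocks(records: List[CRRecord], a: float, b: float) -> List[int]:
    """Blocks that may hold results; a linear pass replaces the real index here."""
    return [r.block_id for r in records if intersects(r, a, b)]


values = [1, 2, 4, 5, 7, 9, 8, 6, 3, 2, 1, 1]
records = build_cr_records(values, block_len=4)
blocks_for_query = candidate_blocks(records, 3, 5)
```

With these sample values the three blocks are summarized as (1, 5), (6, 9), and (1, 3), so the query [3, 5] only needs to read the first and last block.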
13 CR-Index structure
14 CR-Index structure
15 Index structures
Interval tree
– O(n)
– stabbing query: O(log n)
16 Solution
Partition the result set into two disjoint groups
– Group A: CR-records that have at least one endpoint inside the query range [a,b] (found via the B+-tree)
– Group B: CR-records that completely contain the query range (found via the interval tree)
17 Solution
For each CR-record, two entries are inserted into the B+-tree, one for each endpoint: the endpoint is the key and the CR-record's reference is the value
For each CR-record, its value range is inserted into the interval tree
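The two-group lookup can be sketched as follows. This is a simplified stand-in, not the presented implementation: a sorted endpoint list searched with `bisect` plays the role of the B+-tree, and a linear scan plays the role of the interval tree. The names `EndpointIndex`, `group_a`, `group_b`, and `range_query` are hypothetical.

```python
import bisect
from typing import List, Set, Tuple

Interval = Tuple[int, float, float]  # (record_id, low, high)


class EndpointIndex:
    """Sorted endpoint list standing in for the B+-tree (assumption)."""

    def __init__(self, intervals: List[Interval]) -> None:
        # Two entries per CR-record: one keyed by each endpoint.
        entries = [(low, rid) for rid, low, _ in intervals]
        entries += [(high, rid) for rid, _, high in intervals]
        self._entries = sorted(entries)
        self._keys = [key for key, _ in self._entries]

    def group_a(self, a: float, b: float) -> Set[int]:
        """Records with at least one endpoint inside [a, b]."""
        lo = bisect.bisect_left(self._keys, a)
        hi = bisect.bisect_right(self._keys, b)
        return {rid for _, rid in self._entries[lo:hi]}


def group_b(intervals: List[Interval], a: float, b: float) -> Set[int]:
    """Records whose range strictly contains [a, b]; a linear scan replaces
    the interval tree in this sketch."""
    return {rid for rid, low, high in intervals if low < a and b < high}


def range_query(index: EndpointIndex, intervals: List[Interval],
                a: float, b: float) -> List[int]:
    """Union of the two disjoint groups, as sorted record ids."""
    return sorted(index.group_a(a, b) | group_b(intervals, a, b))


intervals = [(1, 1, 5), (2, 6, 9), (3, 1, 3), (4, 2, 20)]
index = EndpointIndex(intervals)
hits_3_5 = range_query(index, intervals, 3, 5)
hits_16_18 = range_query(index, intervals, 16, 18)
```

For the query [3, 5], records 1 and 3 are found through their endpoints (Group A) and record 4 through containment (Group B). For [16, 18], no endpoint falls inside the range, so only the containment check finds record 4.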
18 Example
19 Example
20 Example
21 Query example
query1: [3,11]
result1: 2,3,4,5,3,2,4 2,3,4,5
query2: [16,18]
result2:
22 Query example
query: [16,18]
result: 6
23 Optimization
Index with delta intervals
– Boundary pairs of consecutive blocks may overlap; if a query intersects a block, it will probably intersect the following blocks as well
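One plausible reading of the delta-interval idea is sketched below; the slide does not spell out the exact definition, so this is an assumption. Here the delta of a block is the part of its boundary pair not covered by the previous block's pair. The index reports a block only when the query touches its delta, and a forward scan of consecutive blocks picks up the rest. All names are hypothetical, and a loop simulates the index lookup.

```python
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # boundary pair (low, high) of one block


def intersects(r: Range, a: float, b: float) -> bool:
    """True if the closed query range [a, b] overlaps the boundary pair."""
    return r[0] <= b and a <= r[1]


def hits_delta(cur: Range, prev: Optional[Range], a: float, b: float) -> bool:
    """True if [a, b] touches a part of cur that prev does not cover."""
    x, y = max(a, cur[0]), min(b, cur[1])  # overlap of the query with cur
    if x > y:
        return False
    if prev is None:
        return True  # the first block's delta is its full range
    return x < prev[0] or y > prev[1]


def query_with_deltas(ranges: List[Range], a: float, b: float) -> Tuple[List[int], int]:
    """Return (qualifying block ids, number of simulated index hits)."""
    found = set()
    index_hits = 0
    for i, cur in enumerate(ranges):
        prev = ranges[i - 1] if i > 0 else None
        if not hits_delta(cur, prev, a, b):
            continue
        index_hits += 1
        # Scan forward in the log while consecutive blocks still intersect.
        j = i
        while j < len(ranges) and intersects(ranges[j], a, b):
            found.add(j)
            j += 1
    return sorted(found), index_hits


ranges = [(1, 5), (2, 6), (4, 9), (8, 12), (3, 7)]
blocks, index_hits = query_with_deltas(ranges, 3, 5)
```

In this example four blocks intersect the query [3, 5], but only two of them are reached through the index; the other two are found by the forward scan. Under this reading, that is where the reduction in index entries would come from.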
24 Example
query: [3,5]
result: 2, 3
25 Length-k delta interval
26 Evaluate range query
The value condition can be used at both the interval-index level and the CR-log level
The time condition can be used at the CR-log level via the checkpoint list
Hole information can be updated
27 Query steps
1. Access the interval index to get CR-record ids: Group A from the B+-tree and Group B from the interval tree
2. Locate each identified record in the CR-log. Scan the log for additional CR-records if using delta intervals
3. Filter CR-records using the checkpoint list and hole information
4. Fetch and scan the data blocks for the remaining CR-records. Extract and return all qualifying results
5. For any detected false-positive blocks, track the holes and update the hole information in the CR-records
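Steps 3 to 5 can be sketched as below, under stated assumptions. A "hole" is taken to be a value sub-range inside a block's boundary pair that a previous query found to contain no records. The index lookup of step 1 is replaced by a linear pass, and the time condition with its checkpoint list is left out. `CRRecord`, `evaluate`, and the `blocks` dictionary are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CRRecord:
    block_id: int
    low: float
    high: float
    # Value sub-ranges known to contain no records (assumed meaning of "hole").
    holes: List[Tuple[float, float]] = field(default_factory=list)


def evaluate(records: List[CRRecord], blocks: Dict[int, List[float]],
             a: float, b: float) -> Tuple[List[float], int]:
    """Return (qualifying values, number of data blocks fetched)."""
    results: List[float] = []
    fetched = 0
    for rec in records:
        x, y = max(a, rec.low), min(b, rec.high)  # query clipped to the block
        if x > y:
            continue  # boundary pair does not intersect the query
        if any(h_lo <= x and y <= h_hi for h_lo, h_hi in rec.holes):
            continue  # step 3: a known hole covers the clipped query
        fetched += 1
        hits = [v for v in blocks[rec.block_id] if a <= v <= b]  # step 4
        if hits:
            results.extend(hits)
        else:
            rec.holes.append((x, y))  # step 5: remember the false positive
    return results, fetched


blocks = {0: [1, 2, 8, 9], 1: [4, 5, 6, 5]}
records = [CRRecord(0, 1, 9), CRRecord(1, 4, 6)]
first = evaluate(records, blocks, 4, 6)
second = evaluate(records, blocks, 5, 6)
```

The first query fetches both blocks and discovers that block 0 is a false positive, since its values jump from 2 to 8. The second query skips block 0 because the recorded hole already covers it.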
28 Experiments
Compared with B+-tree and LSM-tree
Data sets
– CMOP Coastal Margin Data (13 million): salinity, temperature, oxygen
– Real-time Soccer Game Data (25 million): sensor ID, position, speed, velocity, acceleration
29 Experiment
Environment
– A cluster where each machine has a quad-core processor, 8 GB memory, and a 500 GB disk
– Java
– block length: 64
– delta-interval: 1
– range query
30 System load time
Write time when loading data
– CR-Index: 8%
– LSM-tree: 45%-77%
– B+-tree: 78%-124%
31 Index update time
– 15% LSM-tree
– 9% B+-tree
32 Index space consumption
Disk space
– 10%-12% LSM-tree
– 4%-6% B+-tree
33 Query response time
34 Index look-up cost
35 Data access cost
36
36 Over QQQ