1 Lightweight Indexing of Observational Data in Log-Structured Storage. National University of Singapore (Sheng Wang, Beng Chin Ooi), Portland State University (David Maier). VLDB 2014

2 Outline: Background; Challenges; Contributions; CR-Index; Index Optimization; Experiments

3 Background: Sensors generate huge amounts of data every day. Data are growing in both precision and quantity. Requirements: high write throughput and efficient queries

4 Challenges: State-of-the-art storage does not take high write throughput into account (CMOP: RDBMS + netCDF). Record-level indexes incur significant index maintenance cost –B+-tree: random I/O due to updates –LSM-tree: large number of index entries

5 Contributions: A schema for storing observational data in LogBase [1] to facilitate indexing. A novel, lightweight index structure for range queries, called the CR-Index, which takes full advantage of observational-data traits. Experiments on two real-world observational datasets. [1] H. T. Vo, S. Wang, D. Agrawal, G. Chen, B. C. Ooi. Int'l Conference on Very Large Data Bases (VLDB), PVLDB 5(10):1004-1015, 2012.

6 Storage: Requirement: high write throughput. Traits of observational data –no updates –continuous change –potential discontinuities. Log-structured storage: append new data to the end of a file; avoid random I/O

7 LogBase: An unordered, column-oriented, distributed log store. Each node is responsible for one or more partitions of a table. Version control and transaction semantics. Relational data model –each record has a primary key and several attributes –each record is decomposed into a set of cells (KEY, ATTRIBUTE, VALUE, TIMESTAMP)
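
The cell decomposition on this slide can be illustrated with a minimal sketch. It is not LogBase code: the `Cell` type, the `decompose` helper, the station name, and the in-memory list standing in for the log file are all assumptions made for illustration.

```python
from typing import Any, Dict, List, NamedTuple


class Cell(NamedTuple):
    """One (KEY, ATTRIBUTE, VALUE, TIMESTAMP) cell, as named on the slide."""
    key: str
    attribute: str
    value: Any
    timestamp: int


def decompose(key: str, attributes: Dict[str, Any], timestamp: int) -> List[Cell]:
    """Split one record into one cell per attribute (hypothetical helper)."""
    return [Cell(key, name, value, timestamp) for name, value in attributes.items()]


# An in-memory list stands in for the append-only log file (assumption).
log: List[Cell] = []
log.extend(decompose("station-1", {"salinity": 31.2, "temperature": 11.5}, timestamp=1000))
log.extend(decompose("station-1", {"salinity": 31.4, "temperature": 11.6}, timestamp=1001))
```

New cells are only ever appended, which matches the no-update trait of observational data.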

8 Architecture

9 Logic View & Physical View

10 Basic query formats Time range Value range

11 Observational data locality: In general, an append-only strategy hurts read performance, but the log store provides considerable data locality –time-ordered property: a time range query becomes a sequential scan –value-correlated property: due to the continuity trait, once a record is inside a value range, the surrounding records will likely lie in the range too
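
Both locality properties can be seen on a toy series. The sketch below assumes the log is held as two parallel in-memory lists ordered by timestamp; the function names and the sample values are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def time_range_scan(timestamps, values, t_start, t_end):
    """Time-ordered property: find the first position, then read sequentially."""
    lo = bisect_left(timestamps, t_start)
    hi = bisect_right(timestamps, t_end)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


def value_runs(values, v_lo, v_hi):
    """Value-correlated property: qualifying records form contiguous runs.

    Returns (first_index, last_index) pairs of consecutive records whose
    value lies in [v_lo, v_hi].
    """
    runs, start = [], None
    for i, v in enumerate(values):
        inside = v_lo <= v <= v_hi
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs


timestamps = list(range(10))
values = [2, 3, 4, 5, 7, 9, 8, 6, 4, 3]  # slowly changing, like a sensor reading

in_time = time_range_scan(timestamps, values, 2, 4)
runs = value_runs(values, 3, 5)
```

On this series the five records in the value range [3, 5] fall into only two runs, so two seeks would be enough to reach all of them.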

12 Continuous Range Index (CR-Index): The value-correlated property implies that a seek in the log can potentially yield many results. We need not locate qualifying records individually, as long as we can identify the regions that contain results. Group successive records into blocks, each treated as an atomic unit. Each block is summarized by a value range using a boundary pair. A range query can then be transformed into an intersection-checking problem
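
A minimal sketch of the block summarization and the intersection check follows. The `CRRecord` fields, the helper names, and the tiny block length are illustrative assumptions, and the linear scan over all summaries stands in for a real index.

```python
from typing import List, NamedTuple


class CRRecord(NamedTuple):
    """Summary of one block: its id and its boundary pair (min, max)."""
    block_id: int
    low: float
    high: float


def build_cr_records(values: List[float], block_length: int) -> List[CRRecord]:
    """Group successive records into fixed-length blocks and summarize each."""
    records = []
    for start in range(0, len(values), block_length):
        block = values[start:start + block_length]
        records.append(CRRecord(start // block_length, min(block), max(block)))
    return records


def intersects(record: CRRecord, a: float, b: float) -> bool:
    """True if the block's value range [low, high] overlaps the query [a, b]."""
    return record.low <= b and a <= record.high


def candidate_blocks(records: List[CRRecord], a: float, b: float) -> List[int]:
    """Blocks that may contain results; each must still be scanned to confirm."""
    return [r.block_id for r in records if intersects(r, a, b)]


values = [2, 3, 4, 5, 7, 9, 8, 6, 4, 3, 1, 2]
cr_records = build_cr_records(values, block_length=4)
hits = candidate_blocks(cr_records, 3, 5)
```

A block reported here is only a candidate: its boundary pair overlaps the query, but the block may still hold no value inside the range.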

13 CR-Index structure

14 CR-Index structure

15 Index structures: Interval tree –O(n) –stabbing query: O(log n)
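
For readers unfamiliar with stabbing queries, here is a sketch of a textbook centered interval tree. It is a generic illustration, not the structure used in the talk; the class name and the sample intervals are assumptions. A stabbing query on a balanced tree of this kind visits O(log n) nodes plus the intervals it reports.

```python
class IntervalNode:
    """Centered interval tree over closed intervals (low, high, payload)."""

    def __init__(self, intervals):
        # The center is a median endpoint, so at least one interval contains it
        # and both children are strictly smaller than the input.
        endpoints = sorted(p for low, high, _ in intervals for p in (low, high))
        self.center = endpoints[len(endpoints) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        self.by_low = sorted(here, key=lambda iv: iv[0])
        self.by_high = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def stab(self, point):
        """Return the payloads of all intervals that contain `point`."""
        hits = []
        if point <= self.center:
            # Every interval stored here ends at or after the center,
            # so only its lower endpoint needs checking.
            for low, _, payload in self.by_low:
                if low > point:
                    break
                hits.append(payload)
            child = self.left if point < self.center else None
        else:
            for _, high, payload in self.by_high:
                if high < point:
                    break
                hits.append(payload)
            child = self.right
        if child is not None:
            hits.extend(child.stab(point))
        return hits


tree = IntervalNode([(2, 5, "A"), (6, 9, "B"), (1, 4, "C"), (3, 12, "D")])
containing_4 = sorted(tree.stab(4))
```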

16 Solution: Partition the result set into two disjoint groups –Group A: CR-records that have at least one endpoint inside the query range [a,b], found with a B+-tree –Group B: CR-records that completely contain the query range, found with an interval tree

17 Solution: For each CR-record, two entries are inserted into the B+-tree, one for each endpoint; the endpoint is the key and the CR-record's reference is the value. For each CR-record, its value range is inserted into the interval tree
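
The two-group lookup can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a sorted list searched with `bisect` stands in for the B+-tree, a linear filter stands in for the interval tree, and the record ids and ranges are invented.

```python
from bisect import bisect_left, bisect_right


def build_endpoint_index(records):
    """records: dict of record id -> (low, high).

    Returns sorted (endpoint, id) entries, two per record, standing in
    for the B+-tree keyed on endpoints.
    """
    entries = []
    for rec_id, (low, high) in records.items():
        entries.append((low, rec_id))
        entries.append((high, rec_id))
    entries.sort()
    return entries


def group_a(entries, a, b):
    """Records with at least one endpoint inside [a, b] (range scan on keys)."""
    keys = [key for key, _ in entries]  # rebuilt per call only for brevity
    lo = bisect_left(keys, a)
    hi = bisect_right(keys, b)
    return {rec_id for _, rec_id in entries[lo:hi]}


def group_b(records, a, b):
    """Records whose range strictly contains [a, b] (interval-tree stand-in)."""
    return {rec_id for rec_id, (low, high) in records.items() if low < a and b < high}


def range_query(records, entries, a, b):
    """All records whose value range intersects [a, b]."""
    return sorted(group_a(entries, a, b) | group_b(records, a, b))


records = {1: (2, 5), 2: (6, 9), 3: (1, 4), 4: (3, 12)}
entries = build_endpoint_index(records)
```

Any range that intersects [a, b] either has an endpoint inside it or spans it entirely, so the two groups together cover every intersecting record and never overlap.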

18 Example

19 Example

20 Example

21 Query example query1 : [3,11] result1 : 2,3,4,5,3,2,4 2,3,4,5 query2 : [16,18] result2 :

22 Query example query : [16,18] result : 6

23 Optimization: Index with delta intervals –boundary pairs in consecutive blocks may overlap; if a query intersects a block, it will probably intersect the following blocks too
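
The slide does not define delta intervals precisely, so the sketch below is only one plausible reading, limited to length 1: index just the part of each block's range that its predecessor does not cover, then scan forward in the log from each index hit while the following blocks still intersect the query. The function names and sample ranges are assumptions, and the paper's actual definition may differ.

```python
def delta_pieces(prev, cur):
    """Parts of cur = (low, high) not covered by prev = (low, high).

    Pieces are kept as closed intervals, so they may touch prev's boundary;
    that can only cause redundant hits, never a missed block.
    """
    if prev is None:
        return [cur]
    pieces = []
    if cur[0] < prev[0]:
        pieces.append((cur[0], min(cur[1], prev[0])))
    if cur[1] > prev[1]:
        pieces.append((max(cur[0], prev[1]), cur[1]))
    return pieces


def overlaps(interval, a, b):
    return interval[0] <= b and a <= interval[1]


def query_with_deltas(blocks, a, b):
    """blocks: boundary pairs in log order. Returns intersecting block ids."""
    result = set()
    for i, cur in enumerate(blocks):
        prev = blocks[i - 1] if i > 0 else None
        # Index lookup: only delta pieces are indexed (linear stand-in here).
        if not any(overlaps(piece, a, b) for piece in delta_pieces(prev, cur)):
            continue
        # Log scan: follow consecutive blocks while they still intersect.
        j = i
        while j < len(blocks) and overlaps(blocks[j], a, b):
            result.add(j)
            j += 1
    return sorted(result)


blocks = [(2, 5), (3, 6), (4, 9), (1, 3)]
found = query_with_deltas(blocks, 5, 5)
```

Under this reading, a block that intersects the query without a delta hit must share that part of its range with its predecessor, so the forward scan from an earlier hit still reaches it.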

24 Example query : [3,5] result : 2, 3

25 Length-k delta interval

26 Evaluate range query: The value condition can be used at both the interval-index level and the CR-Log level. The time condition can be used at the CR-Log level via the checkpoint list. Hole information can be updated

27 Query steps: 1. Access the interval index to get CR-record ids: Group A from the B+-tree and Group B from the interval tree. 2. Locate each identified record in the CR-Log. Scan the log for additional CR-records if using delta intervals. 3. Filter CR-records using the checkpoint list and hole information. 4. Fetch and scan the data blocks for the remaining CR-records. Extract and return all qualifying results. 5. For any detected false-positive blocks, track the holes and update the hole information in the CR-records.
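
The five steps can be strung together in a simplified sketch. Several things here are assumptions rather than details from the talk: a "hole" is taken to mean a value sub-range known to contain no records in a block, each CR-record carries its own time span in place of the checkpoint list, a linear intersection check stands in for the index lookup of steps 1 and 2, and all names and data are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CRRecord:
    block_id: int
    low: float
    high: float
    t_start: int
    t_end: int
    holes: list = field(default_factory=list)  # value ranges known to be empty


def evaluate(cr_log, blocks, v_range, t_range):
    """Return (results, ids of the blocks that were actually scanned)."""
    a, b = v_range
    t0, t1 = t_range
    results, scanned = [], []
    for rec in cr_log:
        # Steps 1-2 (stand-in): keep records whose boundary pair intersects.
        if not (rec.low <= b and a <= rec.high):
            continue
        # Step 3: filter on the time span and on known holes.
        if rec.t_end < t0 or rec.t_start > t1:
            continue
        if any(h_lo <= a and b <= h_hi for h_lo, h_hi in rec.holes):
            continue
        # Step 4: fetch and scan the data block.
        scanned.append(rec.block_id)
        in_value_range = [(t, v) for t, v in blocks[rec.block_id] if a <= v <= b]
        results.extend((t, v) for t, v in in_value_range if t0 <= t <= t1)
        # Step 5: a block with no value in range was a false positive.
        if not in_value_range:
            rec.holes.append((a, b))
    return results, scanned


blocks = {
    0: [(0, 2.0), (1, 3.0), (2, 8.0), (3, 9.0)],   # jumps from 3.0 to 8.0
    1: [(4, 9.5), (5, 10.0), (6, 11.0), (7, 12.0)],
}
cr_log = [CRRecord(0, 2.0, 9.0, 0, 3), CRRecord(1, 9.5, 12.0, 4, 7)]

first = evaluate(cr_log, blocks, (4, 6), (0, 7))    # scans block 0, finds nothing
second = evaluate(cr_log, blocks, (4, 6), (0, 7))   # skips block 0 via the hole
third = evaluate(cr_log, blocks, (8, 10), (3, 4))
```

The hole is recorded only when no value falls in the range, regardless of the time condition, so a block that was empty merely because of the time filter is not marked.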

28 Experiments: Compared with B+-tree and LSM-tree. Data sets –CMOP Coastal Margin Data (13 million): salinity, temperature, oxygen –Real-time Soccer Game Data (25 million): sensor ID, position, speed, velocity, acceleration

29 Experiment Environment: –A cluster where each machine has a quad-core processor, 8GB, 500GB –Java –block length: 64 –delta-interval: 1 –range query

30 System load time: Write time in load data –CR-Index: 8% –LSM: 45%-77% –B+: 78%-124%

31 Index update time 15% LSM-tree 9% B+-tree

32 Index space consumption disk space –10%-12% LSM-tree –4%-6% B+-tree

33 Query response time

34 Index look-up cost

35 Data access cost

36 Over. Q&A
