1 Lightweight Indexing of Observational Data in Log-Structured Storage National University of Singapore (Sheng Wang, Beng Chin Ooi) Portland State University(David.

Presentation transcript:

1 Lightweight Indexing of Observational Data in Log-Structured Storage. National University of Singapore (Sheng Wang, Beng Chin Ooi), Portland State University (David Maier). VLDB 2014

2 Outline: Background, Challenges, Contributions, CR-Index, Index Optimization, Experiments

3 Background: Sensors generate huge amounts of data every day. Data are expanding in precision and quantity. Storage must support high write throughput and efficient queries.

4 Challenges: State-of-the-art storage does not take high write throughput into account (CMOP: RDBMS + netCDF). Record-level indexes incur significant index maintenance cost –B+-tree: random I/O due to updates –LSM-tree: large number of index entries

5 Contributions: A schema for storing observational data in LogBase [1] to facilitate indexing. A novel, lightweight index structure called CR-Index for range queries, which takes full advantage of observational-data traits. Experiments on two real-world observational datasets. [1] H. T. Vo, S. Wang, D. Agrawal, G. Chen, B. C. Ooi. Int'l Conference on Very Large Data Bases (VLDB), PVLDB 5(10): , 2012.

6 Storage: High write throughput. Traits of observational data –no updates –continuous change –potential discontinuities. Log-structured storage –appends new data to the end of a file –avoids random I/O

7 LogBase: An unordered, column-oriented, distributed log store. Each node is responsible for one or more partitions of a table. Version control and transaction semantics. Relational data model –each record has a primary key and several attributes –each record is decomposed into a set of cells (KEY, ATTRIBUTE, VALUE, TIMESTAMP)
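The cell decomposition on this slide can be pictured with a minimal Python sketch. The `Cell` type, the `decompose` helper, and the sample sensor record are illustrative assumptions, not LogBase's actual API or storage format.

```python
from typing import Any, Dict, List, NamedTuple


class Cell(NamedTuple):
    """One (KEY, ATTRIBUTE, VALUE, TIMESTAMP) cell; field names are assumed."""
    key: str
    attribute: str
    value: Any
    timestamp: int


def decompose(key: str, attributes: Dict[str, Any], timestamp: int) -> List[Cell]:
    """Split one relational record into one cell per attribute."""
    return [Cell(key, name, value, timestamp) for name, value in attributes.items()]


# Hypothetical observation with two attributes.
cells = decompose("sensor-7", {"salinity": 31.2, "temperature": 11.5}, 1400000000)
for cell in cells:
    print(cell)
```

Each cell carries the record key and timestamp, so the cells of one attribute can be appended to a log independently of the others.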

8 Architecture

9 Logic View & Physical View

10 Basic query formats Time range Value range

11 Observational data locality: In general, an append-only strategy hurts read performance, but a log store provides considerable data locality –time-ordered property: a time range query becomes a sequential scan –value-correlated property: due to the continuity trait, once a record is inside a value range, surrounding records will likely lie in the range too

12 Continuous Range Index (CR-Index): The value-correlated property implies that a seek in the log can potentially yield many results. We need not locate qualifying records individually, as long as we identify the regions that contain results. Group successive records into blocks, each treated as an atomic unit. Each block is summarized by a value range using a boundary pair. A range query can then be transformed into an intersection-checking problem.
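The block summarization and intersection check can be sketched as follows. This is a simplified illustration: the names `CRRecord`, `summarize`, and `candidate_blocks`, the fixed block length, and the sample readings are assumptions, and the linear scan over summaries stands in for a real index.

```python
from typing import List, NamedTuple, Sequence


class CRRecord(NamedTuple):
    """Summary of one block: its id and boundary pair (low, high)."""
    block_id: int
    low: float
    high: float


def summarize(values: Sequence[float], block_len: int) -> List[CRRecord]:
    """Group successive values into blocks and keep only min/max per block."""
    records = []
    for start in range(0, len(values), block_len):
        block = values[start:start + block_len]
        records.append(CRRecord(start // block_len, min(block), max(block)))
    return records


def intersects(rec: CRRecord, a: float, b: float) -> bool:
    """Closed-interval overlap test between [rec.low, rec.high] and [a, b]."""
    return rec.low <= b and a <= rec.high


def candidate_blocks(records: List[CRRecord], a: float, b: float) -> List[int]:
    """Ids of blocks whose boundary pair intersects the query range."""
    return [r.block_id for r in records if intersects(r, a, b)]


# Hypothetical, slowly changing sensor readings.
readings = [1, 2, 4, 5, 7, 9, 12, 14, 13, 10, 6, 3]
records = summarize(readings, block_len=4)
print(records)
print(candidate_blocks(records, 3, 6))
```

A candidate block may still contain no qualifying record, so the blocks returned here have to be scanned before results are reported.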

13 CR-Index structure

14 CR-Index structure

15 Index structures: Interval tree –O(n) –stabbing query: O(log n)

16 Solution: Partition the result set into two disjoint groups –Group A: CR-records that have at least one endpoint inside the query range [a,b], found with the B+-tree –Group B: CR-records that completely contain the query range, found with the interval tree

17 Solution: For each CR-record, two entries are inserted into the B+-tree, one for each endpoint; the endpoint is the key and the CR-record's reference is the value. For each CR-record, its value range is inserted into the interval tree.
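The two-group decomposition can be sketched in Python under loud assumptions: a sorted list searched with `bisect` stands in for the B+-tree, and a plain linear filter stands in for the interval tree's stabbing query. The class name `TwoGroupIndex`, its methods, and the sample intervals are hypothetical.

```python
import bisect
from typing import List, Set, Tuple


class TwoGroupIndex:
    """Toy stand-in for the B+-tree plus interval tree pair (integer ids)."""

    def __init__(self) -> None:
        # Sorted (endpoint, record_id) pairs; plays the role of the B+-tree.
        self._endpoints: List[Tuple[float, int]] = []
        # (low, high, record_id) triples; plays the role of the interval tree.
        self._intervals: List[Tuple[float, float, int]] = []

    def insert(self, record_id: int, low: float, high: float) -> None:
        # Two endpoint entries per CR-record, plus one interval entry.
        bisect.insort(self._endpoints, (low, record_id))
        bisect.insort(self._endpoints, (high, record_id))
        self._intervals.append((low, high, record_id))

    def group_a(self, a: float, b: float) -> Set[int]:
        """Records with at least one endpoint inside [a, b] (range scan)."""
        lo = bisect.bisect_left(self._endpoints, (a,))
        hi = bisect.bisect_right(self._endpoints, (b, float("inf")))
        return {rid for _, rid in self._endpoints[lo:hi]}

    def group_b(self, a: float, b: float) -> Set[int]:
        """Records whose range strictly contains [a, b] (linear stand-in)."""
        return {rid for low, high, rid in self._intervals if low < a and b < high}

    def query(self, a: float, b: float) -> Set[int]:
        return self.group_a(a, b) | self.group_b(a, b)


index = TwoGroupIndex()
for rid, (low, high) in enumerate([(1, 5), (7, 14), (3, 13), (0, 20)], start=1):
    index.insert(rid, low, high)
print(index.group_a(6, 8), index.group_b(6, 8), index.query(6, 8))
```

Any range that intersects [a,b] without having an endpoint inside it must start before a and end after b, which is why the two groups together cover every intersecting CR-record without overlap.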

18 Example

19 Example

20 Example

21 Query example query1 : [3,11] result1 : 2,3,4,5,3,2,4 2,3,4,5 query2 : [16,18] result2 :

22 Query example query : [16,18] result : 6

23 Optimization: Index with delta intervals –boundary pairs in consecutive blocks may overlap; if a query intersects a block, it will probably intersect the following blocks

24 Example query : [3,5] result : 2, 3

25 Length-k delta interval

26 Evaluate range query: The value condition can be used at both the interval-index level and the CR-log level. The time condition can be used at the CR-log via the checkpoint list. Hole information can be updated.

27 Query steps: 1. Access the interval index to get CR-record ids: Group A from the B+-tree and Group B from the interval tree. 2. Locate each identified record in the CR-log. Scan the log for additional CR-records if using delta intervals. 3. Filter CR-records using the checkpoint list and hole information. 4. Fetch and scan the data blocks for the remaining CR-records. Extract and return all qualifying results. 5. For any detected false-positive blocks, track the holes and update the hole information in the CR-records.
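A simplified end-to-end sketch of these steps follows, under several assumptions: delta intervals are omitted, a linear pass over CR-records replaces the interval index, per-record start and end timestamps stand in for the checkpoint list, and a hole is modelled as a value range known to contain no record in that block. All names and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CRRecord:
    block_id: int
    low: float
    high: float
    t_start: int
    t_end: int
    holes: List[Tuple[float, float]] = field(default_factory=list)


def build(log: List[Tuple[int, float]], block_len: int) -> List[CRRecord]:
    """Summarize a time-ordered log of (timestamp, value) pairs into blocks."""
    records = []
    for start in range(0, len(log), block_len):
        block = log[start:start + block_len]
        values = [v for _, v in block]
        records.append(CRRecord(start // block_len, min(values), max(values),
                                block[0][0], block[-1][0]))
    return records


def range_query(log, records, block_len, a, b, t0, t1):
    """Return (timestamp, value) pairs with a <= value <= b and t0 <= t <= t1."""
    results = []
    for rec in records:
        # Steps 1-2 (stand-in): keep records whose value range meets [a, b].
        if rec.high < a or rec.low > b:
            continue
        # Step 3: filter on time, then on known holes covering the query.
        if rec.t_end < t0 or rec.t_start > t1:
            continue
        if any(h_lo <= a and b <= h_hi for h_lo, h_hi in rec.holes):
            continue
        # Step 4: fetch and scan the data block.
        start = rec.block_id * block_len
        in_value_range = [(t, v) for t, v in log[start:start + block_len]
                          if a <= v <= b]
        results.extend((t, v) for t, v in in_value_range if t0 <= t <= t1)
        # Step 5: a false-positive block has no value in [a, b]; remember that.
        if not in_value_range:
            rec.holes.append((a, b))
    return results


log = list(enumerate([1, 2, 9, 10, 10, 11, 3, 2]))  # hypothetical readings
records = build(log, block_len=4)
print(range_query(log, records, 4, 4, 7, 0, 7))   # false positives on both blocks
print(range_query(log, records, 4, 9, 10, 0, 3))  # hits only the first block
```

Recording a hole only when no value in the block falls inside [a,b], regardless of the time condition, keeps the hole valid for later queries, since the data are never updated.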

28 Experiments: Compare with B+-tree and LSM-tree. Data sets –CMOP Coastal Margin Data (13 million): salinity, temperature, oxygen –Real-time Soccer Game Data (25 million): sensor ID, position, speed, velocity, acceleration

29 Experiment environment –a cluster where each machine has a quad-core processor, 8GB, 500GB –Java –block length: 64 –delta-interval: 1 –range query

30 System load time: Write time in loading data. CR-Index: 8%; LSM: 45%-77%; B+: 78%-124%

31 Index update time 15% LSM-tree 9% B+-tree

32 Index space consumption disk space –10%-12% LSM-tree –4%-6% B+-tree

33 Query response time

34 Index look-up cost

35 Data access cost

36 Over QQQ