Optimized Rewriter Rules for Efficient Querying of JSON Data

Christina Pavlopoulou, Vasileios Zois

Introduction
  Embedded devices are paired with physical objects
  Quality services necessitate constant information exchange
  Data collection can improve the provided services
  Architectural diversity limits data interoperability
  Exchanging data can be cumbersome
  Data interchange formats were designed to overcome the existing communication limitations

Summary: Modern physical devices are embedded with electronics, sensors, and network connectivity. Improving the quality of a provided service depends on constant information exchange. Collecting data about normal operation can help predict human behavior and provide real-time support for improving the service. Today's embedded devices are architecturally diverse, which makes data interoperability cumbersome. Fortunately, existing data interchange formats are being adapted to this new paradigm in an attempt to overcome these communication barriers.

Collecting Data
  Predicting human behavior to ensure continuous service
  Feedback for improving and retraining decision models
  Responding fast in an emergency

Why is it important to collect data from the real world? Real-time decision support ensures the smooth operation of services (e.g. power grid monitoring). Feedback helps improve and retrain decision models (e.g. Tesla Autopilot improves by learning driver behavior).

Data Processing Challenges
  Large volume
    The number of interconnected devices keeps increasing
  High velocity
    Readings are generated continuously
  Huge variety
    Smart phones, meters, traffic lights, locks, etc.

Summary: Processing the aggregated data generated by various embedded devices is challenging. Looked at closely, it resembles a big data processing problem. The number of interconnected devices is expected to increase exponentially in the near future. Data for most applications are continuous and change throughout the day. Additionally, a huge variety of devices will be transmitting different types of data (e.g. energy consumption, video playback from traffic lights, temperature, wind speed for wind energy). It is therefore imperative to query this aggregated information efficiently in order to support real-time decisions, model verification and training, and operations related to data mining and machine learning.

Popular Data Interchange Formats
  Comma Separated Values (CSV) files
  eXtensible Markup Language (XML)
  JavaScript Object Notation (JSON)
  YAML

Some of the most popular data formats include Comma Separated Values (CSV), XML, JSON, and YAML (needs a figure to visually compare these formats). All of these formats are structured and were designed to be easily interpreted and read by both humans and machines. The CSV format is simple but not very flexible during parsing: data types need to be homogeneous and every record needs the same number of entries, otherwise space is wasted. The XML format is more expressive, at the expense of the space overhead needed to encode the required information inside the tags. JSON offers the same expressiveness with less space overhead for encoding the semantics of the data. YAML was designed to be easily interpretable and to map onto data types common to most high-level languages (e.g. arrays, lists, maps).
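The space overhead trade-off described above can be made concrete with a minimal Python sketch that encodes the same records as CSV, JSON, and XML and compares the encoded lengths. The records, field names, and element names are made up for illustration and are not taken from the presentation; exact sizes depend on the data and on the serializer settings.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sensor readings; names and values are illustrative only.
READINGS = [
    {"station": "S1", "date": "20130101", "type": "TMAX", "value": 250},
    {"station": "S1", "date": "20130102", "type": "TMAX", "value": 244},
    {"station": "S2", "date": "20130101", "type": "TMIN", "value": -12},
]


def to_csv(records):
    """Encode records as CSV: field names appear once, in the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Encode records as JSON: field names repeat once per record."""
    return json.dumps(records)


def to_xml(records):
    """Encode records as XML: field names repeat in opening and closing tags."""
    root = ET.Element("readings")
    for record in records:
        item = ET.SubElement(root, "reading")
        for name, value in record.items():
            ET.SubElement(item, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def encoded_sizes(records):
    """Return the length in characters of each encoding."""
    return {
        "csv": len(to_csv(records)),
        "json": len(to_json(records)),
        "xml": len(to_xml(records)),
    }


if __name__ == "__main__":
    for name, size in encoded_sizes(READINGS).items():
        print(f"{name}: {size} characters")
```

For flat, homogeneous records like these, CSV is the most compact because field names are written only once; the difference between JSON and XML comes from XML repeating each name in a closing tag.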

Large Scale XML and JSON Data Processing
  Serialized query processing on XML
    BaseX
  Parallelized query processing on XML
    PAXQuery using MapReduce
    VXQuery using Hyracks and a directed acyclic graph (DAG) processing model
  Parallelized query processing on JSON
    Our work using VXQuery

Several systems have been implemented to process XML or JSON data. For XML data there exist both serial and parallel implementations. Stratosphere and BaseX implement a serial XQuery processor. PAXQuery and VXQuery are the only two parallel solutions currently available. PAXQuery is based on the MapReduce programming model, while VXQuery uses Hyracks to achieve parallelism. Hyracks uses a directed acyclic graph model to schedule job execution in parallel. We focus on VXQuery because it was recently updated with support for JSON data, and we would like to study the possibilities for query optimization.
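To illustrate what scheduling over a directed acyclic graph means, here is a minimal Python sketch that groups the operators of a query plan into stages whose members have no unmet dependencies and could therefore run in parallel. It uses the standard library `graphlib` module (Python 3.9+), not the Hyracks API; the plan and the operator names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each operator maps to the operators it depends on.
PLAN = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}


def parallel_stages(plan):
    """Group operators into stages; operators within a stage are independent."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all operators whose inputs are done
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(PLAN), start=1):
        print(f"stage {number}: {', '.join(stage)}")
```

In this toy plan the two scans form the first stage because neither depends on the other, while the join must wait for both of them.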

VXQuery Details
  Apache VXQuery
    Translates XQuery to the corresponding Algebricks parallel algebra
  Algebricks
    Enumerates query operators (e.g. join, group-by aggregate, projection)
  Hyracks data parallel platform
    Produces data parallel execution plan

Rewriter Rules on XML
  Example query: doc("books.xml")/bookstore/book
  Path Expression Rules
    Sort operators removal
    Subplan operators removal
  Parallel Rules
    Enable unnesting
    Datascan operator
    Join operator
    Aggregate operator

Rewriter Rules on JSON
  PARALLEL REWRITER RULES
  Enable unnesting. Instead of handing all the results to the unnest operator as one huge tuple, we pipeline one result at a time. The iterate expression is called on the value rather than on the child expression.
    jn:json-doc("books.json")("bookstore")("book")
  Datascan operator. The query is addressed to a collection of files instead of only one. Further improvement: insert the value expression as data source to make tuples even smaller.
    collection("books")("bookstore")("book")

Based on our understanding of the rewriter rules for XML data, we identified those that can be beneficial to JSON data.
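The effect of the two rules above can be approximated in plain Python: a generator walks a collection of JSON documents, applies the path steps ("bookstore", "book"), and yields one item at a time instead of materializing one large result. This is only an analogy, not VXQuery or Algebricks code; the in-memory strings stand in for the files of a collection, and all names and contents are hypothetical.

```python
import json

# Hypothetical collection: each string stands in for one JSON file.
COLLECTION = [
    '{"bookstore": {"book": [{"title": "A"}, {"title": "B"}]}}',
    '{"bookstore": {"book": [{"title": "C"}]}}',
]


def navigate(value, *keys):
    """Apply a sequence of object lookups, like ("bookstore")("book")."""
    for key in keys:
        value = value[key]
    return value


def unnest(documents, *keys):
    """Yield matching items one at a time across all documents."""
    for text in documents:
        value = navigate(json.loads(text), *keys)
        if isinstance(value, list):
            yield from value  # pipeline array members individually
        else:
            yield value


if __name__ == "__main__":
    for book in unnest(COLLECTION, "bookstore", "book"):
        print(book["title"])
```

Because `unnest` is a generator, a consumer can start working on the first book before later documents have been parsed, which is the intuition behind pipelining single results rather than one huge tuple.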

Experimental Setup
  System Configuration
    A cluster of 4 nodes
    Disk-resident data are equally partitioned among nodes
    Hyracks is responsible for coordinating work
  Testing & Evaluation
    Evaluate scalability of rewriter rules for JSON format
    Compare performance of JSON to equivalent XML representation
    Calculate speedup and possible throughput
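The two evaluation metrics named above are not defined on the slide. A minimal Python sketch, assuming the usual definitions (speedup as baseline time divided by parallel time, throughput as bytes processed per second), could look as follows. The function names are hypothetical and no measured results are implied.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup of a parallel run relative to a baseline run of the same query."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / parallel_seconds


def throughput(bytes_processed, seconds):
    """Amount of input data processed per second of execution time."""
    if seconds <= 0:
        raise ValueError("execution time must be positive")
    return bytes_processed / seconds


if __name__ == "__main__":
    # Placeholder inputs for demonstration only, not experimental results.
    print(speedup(100.0, 25.0))
    print(throughput(1_000, 4.0))
```

With 4 nodes and equally partitioned data, a speedup close to 4 would indicate near-linear scaling under these definitions.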

Dataset Analysis
  Weather Data [1]
    GHCN daily dataset
      Fields: date, data type, station id, value, attributes
    Station dataset
      Fields: name, latitude, longitude, and date of first and last reading
  Queries
    3 basic types
      Selection
      Join
      Aggregation

[1] http://www.noaa.gov/
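The three query types can be sketched over toy records shaped like the fields listed above. Everything here is an illustrative assumption: the records are invented, the key names are hypothetical, and the station records are given an "id" field so that a join is possible, which the slide does not list. The actual queries in this work are written in XQuery/JSONiq and run on VXQuery, not in Python.

```python
# Invented toy records; not real GHCN data.
READINGS = [
    {"date": "20130101", "data_type": "TMAX", "station_id": "S1", "value": 250},
    {"date": "20130102", "data_type": "TMAX", "station_id": "S1", "value": 244},
    {"date": "20130101", "data_type": "TMIN", "station_id": "S2", "value": -12},
    {"date": "20130101", "data_type": "TMAX", "station_id": "S2", "value": 180},
]
STATIONS = [
    {"id": "S1", "name": "Station One", "latitude": 10.0, "longitude": 20.0},
    {"id": "S2", "name": "Station Two", "latitude": 30.0, "longitude": 40.0},
]


def select_by_type(readings, data_type):
    """Selection: keep only the readings of one data type."""
    return [r for r in readings if r["data_type"] == data_type]


def join_with_stations(readings, stations):
    """Join: pair each reading with the name of its station."""
    names = {s["id"]: s["name"] for s in stations}
    return [
        (names[r["station_id"]], r["date"], r["value"])
        for r in readings
        if r["station_id"] in names
    ]


def max_value_per_station(readings):
    """Aggregation: largest value reported by each station."""
    result = {}
    for r in readings:
        current = result.get(r["station_id"])
        if current is None or r["value"] > current:
            result[r["station_id"]] = r["value"]
    return result


if __name__ == "__main__":
    tmax = select_by_type(READINGS, "TMAX")
    print(join_with_stations(tmax, STATIONS))
    print(max_value_per_station(tmax))
```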

Thank you! Questions?!

Project Current Progress
  Completed work
    Data gathered and transformed to JSON
    System set up with VXQuery
    Implemented part of the first rule
  Future work
    Complete rules 1 & 2
    Perform experiments
    Compare and evaluate results