Optimized Rewriter Rules for Efficient Querying of JSON Data

Christina Pavlopoulou, Vasileios Zois

Introduction
Embedded devices paired with physical objects
Quality services necessitate constant information exchange
Data collection can improve on the provided services
Architectural diversity limits data interoperability
Exchanging data can be cumbersome
Data interchange formats were designed to overcome the existing communication limitations
Summary: Modern physical devices are embedded with electronics, sensors and network connectivity. Improving the quality of the provided service depends on constant information exchange. Collecting data related to normal operation can help predict human behavior and provide real-time support for improving the provided services. Today's embedded devices are architecturally diverse, and for this reason data interoperability is becoming cumbersome. Fortunately, existing data interchange formats are being adapted to this new paradigm in an attempt to overcome these communication barriers.

Collecting Data
Why is it important to collect data from the real world?
Predicting human behavior to ensure continuous service
Real-time decision support for ensuring smooth operation of services (e.g. power grid monitoring)
Feedback for improving and retraining decision models (e.g. Tesla Autopilot improves by learning driver behavior)
Responding fast in an emergency

Data Processing Challenges
Large volume: Number of interconnected devices keeps increasing
High velocity: Readings are generated continuously
Huge variety: Smart phones, meters, traffic lights, locks, etc.
Summary: Processing the aggregated data generated by various embedded devices is challenging. Looked at closely, it resembles a big data processing problem. The number of interconnected devices is expected to increase exponentially in the near future. Data for most applications are continuous and change throughout the day. Additionally, a huge variety of devices will be transmitting different types of data (e.g. energy consumption, video playback from traffic lights, temperature, wind speed for wind energy). So it is imperative to be able to query this aggregated information efficiently, to support real-time decisions, model verification and training, as well as operations related to data mining and machine learning.

Popular Data Interchange Formats
Comma Separated Values (CSV) Files
eXtensible Markup Language (XML)
JavaScript Object Notation (JSON)
YAML
Summary: Some of the most popular data formats include Comma Separated Values (CSV) files, XML, JSON, and YAML (needs a figure to compare these formats visually). All of these formats are structured and were designed to be easily interpretable and readable by humans and machines. The CSV format is simple but not very flexible during parsing. Data types need to be homogeneous and have the same number of instances, otherwise space is wasted. The XML format is more expressive, at the expense of the space overhead needed to encode the required information inside the tags. JSON exhibits the same expressiveness with less space overhead to encode the semantics of the data. YAML was designed to be easily interpretable and to map onto data types common to most high-level languages (e.g. arrays, lists, maps).
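To make the space-overhead comparison concrete, here is a minimal Python sketch that serializes one record as CSV, XML and JSON and compares the encoded lengths. The record and its field names are hypothetical, chosen only for illustration; a single small record says nothing about real datasets, so treat this as a way to see where the overhead comes from rather than as a measurement.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sensor reading; field names and values are illustrative only.
reading = {"date": "2015-01-01", "datatype": "TMAX", "station": "ST001", "value": "25"}


def to_csv(rec):
    """Encode the record as a header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rec))
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_xml(rec):
    """Encode each field as a child element; names appear in open and close tags."""
    root = ET.Element("reading")
    for key, val in rec.items():
        ET.SubElement(root, key).text = val
    return ET.tostring(root, encoding="unicode")


def to_json(rec):
    """Encode the record as a JSON object; each field name appears once."""
    return json.dumps(rec)


sizes = {
    "csv": len(to_csv(reading)),
    "xml": len(to_xml(reading)),
    "json": len(to_json(reading)),
}
print(sizes)
```

In XML every field name is written twice (opening and closing tag), while JSON writes it once per record and CSV once per file, which is where the difference in overhead comes from.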

Large Scale XML and JSON Data Processing
Serialized Query Processing on XML: BaseX
Parallelized Query Processing on XML: PAXQuery using MapReduce; VXQuery using Hyracks and the Directed Acyclic Graph (DAG) processing model
Parallelized Query Processing on JSON: Our Work using VXQuery
Summary: To process XML or JSON data, there have been several system implementations… For XML data there exist both serial and parallel implementations. Stratosphere and BaseX implement a serial XQuery processor. PAXQuery and VXQuery are the only two parallel solutions currently available. PAXQuery is based on the MapReduce programming model, while VXQuery uses Hyracks to achieve parallelism. Hyracks uses a directed acyclic graph model to schedule job execution in parallel. We will be focusing on VXQuery because it was recently updated with support for JSON data and we would like to study the possibilities for query optimization.
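The idea of scheduling a job as a directed acyclic graph can be sketched in a few lines of Python. This is not Hyracks code and does not reflect its API; it only illustrates the general principle that operators whose inputs are all complete can run in the same parallel "wave". The operator names in the job are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each operator maps to the set of operators it depends on.
job = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}


def parallel_waves(dag):
    """Group operators into waves; all operators in a wave can run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # operators with all inputs finished
        waves.append(ready)
        sorter.done(*ready)  # mark them finished so dependents become ready
    return waves


waves = parallel_waves(job)
print(waves)
```

Here the two scans have no dependencies and land in the first wave, the join waits for both, and the aggregate waits for the join.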

VXQuery Details
Apache VXQuery: Translates XQuery to the corresponding Algebricks parallel algebra
Algebricks: Enumerates query operators (e.g. join, group-by aggregate, projection)
Hyracks data parallel platform: Produces data parallel execution plan

Rewriter Rules on XML
doc("books.xml")/bookstore/book
Rule categories: Path Expression Rules, Parallel Rules
Sort operators removal
Subplan operators removal
Enable unnesting
Datascan operator
Join operator
Aggregate operator

Rewriter Rules on JSON
PARALLEL REWRITER RULES
Enable unnesting. Instead of handing all the results to the unnest operator as one huge tuple, we pipeline one result at a time. The iterate expression is called on the value expression, not on the child expression.
Datascan operator. The query is addressed to a collection of files instead of only one. Further improvement: insert the value expression as data source, making tuples even smaller.
jn:json-doc("books.json")("bookstore")("book")
collection("books")("bookstore")("book")
So based on our understanding of the rewriter rules for XML data, we identified those that can be beneficial to JSON data.
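The difference between one huge tuple and pipelined results can be illustrated outside VXQuery with plain Python. The sketch below is only an analogy, not the rewriter rule itself: `books_materialized` returns the whole `book` array as a single item, `books_pipelined` yields one book at a time, and `collection_books` applies the same navigation across several documents, as the collection query does. The function names and the sample documents are hypothetical.

```python
import json


def books_materialized(doc):
    """Return the whole 'book' array wrapped as one large result."""
    return [doc["bookstore"]["book"]]


def books_pipelined(doc):
    """Yield one book at a time so downstream steps never hold the full array."""
    for book in doc["bookstore"]["book"]:
        yield book


def collection_books(raw_docs):
    """Apply the same navigation to every document of a collection."""
    for raw in raw_docs:
        yield from books_pipelined(json.loads(raw))


# Hypothetical stand-ins for the files of a "books" collection.
raw_collection = [
    '{"bookstore": {"book": [{"title": "A"}, {"title": "B"}]}}',
    '{"bookstore": {"book": [{"title": "C"}]}}',
]

titles = [book["title"] for book in collection_books(raw_collection)]
print(titles)
```

With the generator version, each book flows to the next step as soon as it is produced, which is the behavior the unnesting rule aims for.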

Experimental Setup
System Configuration
A cluster of 4 nodes
Disk-resident data are equally partitioned among nodes
Hyracks responsible for coordinating work
Testing & Evaluation
Evaluate scalability of rewriter rules for JSON format
Compare performance of JSON to equivalent XML representation
Calculate speedup and possible throughput
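The slides do not say how speedup and throughput will be computed, so the following Python sketch uses the common textbook definitions as an assumption: speedup is single-node time divided by multi-node time, efficiency is speedup divided by the number of nodes, and throughput is data volume per unit of time. The timings in the usage lines are made up for illustration and are not experimental results.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-node running time to multi-node running time."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("running times must be positive")
    return t_single / t_parallel


def efficiency(t_single, t_parallel, nodes):
    """Speedup normalized by node count; 1.0 would be perfect linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup(t_single, t_parallel) / nodes


def throughput(bytes_processed, seconds):
    """Bytes processed per second of query time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_processed / seconds


# Made-up timings, for illustration only.
print(speedup(120.0, 40.0))
print(efficiency(120.0, 40.0, nodes=4))
print(throughput(1000, 4))
```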

Dataset Analysis
Weather Data
GHCN daily dataset. Fields: date, data type, station id, value, attributes
Station dataset. Fields: name, latitude, longitude, and date of first and last reading
Queries
3 basic types: Selection, Join, Aggregation
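The three query types can be sketched over in-memory records in Python. The records below are hypothetical and only borrow the field names listed above; the `id` field on stations is an assumption added so the join has a key, and the actual queries used in the project are not shown in the slides.

```python
# Hypothetical daily readings; values are illustrative only.
readings = [
    {"date": "2015-01-01", "datatype": "TMAX", "station": "S1", "value": 25},
    {"date": "2015-01-01", "datatype": "TMIN", "station": "S1", "value": 11},
    {"date": "2015-01-02", "datatype": "TMAX", "station": "S2", "value": 31},
]

# Hypothetical stations; the "id" key is assumed in order to join with readings.
stations = [
    {"id": "S1", "name": "Station One", "latitude": 34.0, "longitude": -117.3},
    {"id": "S2", "name": "Station Two", "latitude": 40.7, "longitude": -74.0},
]


def select_by_type(rows, datatype):
    """Selection: keep only readings of the given data type."""
    return [r for r in rows if r["datatype"] == datatype]


def join_with_stations(rows, station_rows):
    """Join: pair each reading with its station via a hash lookup on station id."""
    by_id = {s["id"]: s for s in station_rows}
    return [(r, by_id[r["station"]]) for r in rows if r["station"] in by_id]


def max_value_per_station(rows):
    """Aggregation: maximum reading value for each station id."""
    best = {}
    for r in rows:
        sid = r["station"]
        best[sid] = max(best.get(sid, r["value"]), r["value"])
    return best


tmax = select_by_type(readings, "TMAX")
joined = join_with_stations(tmax, stations)
print([(r["value"], s["name"]) for r, s in joined])
print(max_value_per_station(readings))
```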

Thank you! Questions?!

Project Current Progress
Completed work:
Data gathered and transformed to JSON
System setup with VXQuery
Implemented part of the first rule
Future work:
Complete rule 1 & 2
Perform experiments
Compare and evaluate results
