Alejandro Llaves Javier D. Fernández Oscar Corcho Ontology Engineering Group Universidad Politécnica de Madrid Madrid, Spain OrdRing.

Slides:



Advertisements
Similar presentations
Workshop on Knowledge Technologies within the 6th Framework MayA. Gómez-Pérez Certifying Knowledge and Knowledge Technologies ( based on Sig3 activities.
Advertisements

DELOS Highlights COSTANTINO THANOS ITALIAN NATIONAL RESEARCH COUNCIL.
CoMPI: Enhancing MPI based applications performance and scalability using run-time compression. Rosa Filgueira, David E.Singh, Alejandro Calderón and Jesús.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
SLA-Oriented Resource Provisioning for Cloud Computing
Building Cloud-ready Video Transcoding System for Content Delivery Networks(CDNs) Zhenyun Zhuang and Chun Guo Speaker: 饒展榕.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
JSI Sensor Middleware. Slide 2 of x Embedded vs. Midleware based Architecture for Sensor Metadata Management Embedded approach assign an IP address to.
Designing a DTC Verification System Jennifer Mahoney NOAA/ESRL 21 Feb 2007.
Data Intensive Techniques to Boost the Real-time Performance of Global Agricultural Data Infrastructures SEMAGROW U SING A POWDER T RIPLE S TORE FOR BOOSTING.
Search Engines and Information Retrieval
A Stratified Approach for Supporting High Throughput Event Processing Applications July 2009 Geetika T. LakshmananYuri G. RabinovichOpher Etzion IBM T.
Zero-programming Sensor Network Deployment 學生:張中禹 指導教授:溫志煜老師 日期: 5/7.
Decentralized resource management for a distributed continuous media server Cyrus Shahabi and Farnoush Banaei-Kashani IEEE Transactions on Parallel and.
Page 1Prepared by Sapient for MITVersion 0.1 – August – September 2004 This document represents a snapshot of an evolving set of documents. For information.
TECHNIQUES FOR OPTIMIZING THE QUERY PERFORMANCE OF DISTRIBUTED XML DATABASE - NAHID NEGAR.
Sensor Networks Storage Sanket Totala Sudarshan Jagannathan.
The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
Construction of efficient PDP scheme for Distributed Cloud Storage. By Manognya Reddy Kondam.
What Can Do for You! Fabian Christ
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
Search Engines and Information Retrieval Chapter 1.
1 Distributed Monitoring of Peer-to-Peer Systems By Serge Abiteboul, Bogdan Marinoiu Docflow meeting, Bordeaux.
IoTCloud Platform – Connecting Sensors to Cloud Services Supun Kamburugamuve, Geoffrey C. Fox {skamburu, School of Informatics and Computing.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
Database Management 9. course. Execution of queries.
An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.
Master Thesis Defense Jan Fiedler 04/17/98
Trisolda Jakub Yaghob Charles University in Prague, Czech Rep.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
An Architecture for Distributed High Performance Video Processing in the Cloud Speaker : 吳靖緯 MA0G IEEE 3rd International Conference.
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
B3AS Joseph Lewthwaite 1 Dec, 2005 ARL Knowledge Fusion COE Program.
Adaptive Web Caching CS411 Dynamic Web-Based Systems Flying Pig Fei Teng/Long Zhao/Pallavi Shinde Computer Science Department.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management Author: Raul Castro Fernandez, Matteo Migliavacca, et al.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
The Forest and the Trees Julia Stoyanovich Candidacy Exam in Database Systems Fall 2005.
Zibin Zheng DR 2 : Dynamic Request Routing for Tolerating Latency Variability in Cloud Applications CLOUD 2013 Jieming Zhu, Zibin.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
O PTIMAL SERVICE TASK PARTITION AND DISTRIBUTION IN GRID SYSTEM WITH STAR TOPOLOGY G REGORY L EVITIN, Y UAN -S HUN D AI Adviser: Frank, Yeong-Sung Lin.
CCGrid, 2012 Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
ICDCS 2014 Madrid, Spain 30 June-3 July 2014
From Theory to Practice: Efficient Join Query Processing in a Parallel Database System Shumo Chu, Magdalena Balazinska and Dan Suciu Database Group, CSE,
Dzmitry Kliazovich University of Luxembourg, Luxembourg
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Author: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos.
FP CumuloNimbo: A Highly Scalable Transactional Cloud PaaS Bettina Kemme, McGill University CumuloNimbo Canada EU Future Internet Workshop, 2011.
Rate-Based Query Optimization for Streaming Information Sources Stratis D. Viglas Jeffrey F. Naughton.
Collection and storage of provenance data Jakub Wach Master of Science Thesis Faculty of Electrical Engineering, Automatics, Computer Science and Electronics.
WP3: Data Provenance and Access Control Irini Fundulaki, FORTH December 11-12, 2012, Luxembourg.
Fault – Tolerant Distributed Multimedia Streaming Web Application By Nirvan Sagar – Srishti Ganjoo – Syed Shahbaaz Safir
PROTECT | OPTIMIZE | TRANSFORM
Introduction to Load Balancing:
International Conference on Data Engineering (ICDE 2016)
Distributed Network Traffic Feature Extraction for a Real-time IDS
Applying Control Theory to Stream Processing Systems
Work plan revisited Activity 3 Impact Activity 4 Management
Sub-millisecond Stateful Stream Querying over
Ministry of Higher Education
Building a Database on S3
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Presentation transcript:

Alejandro Llaves Javier D. Fernández Oscar Corcho Ontology Engineering Group Universidad Politécnica de Madrid Madrid, Spain OrdRing workshop - ISWC 2014 Riva del Garda October 20 th, 2014 Towards efficient processing of RDF data streams

Efficient?? Scalable?

Outline Introduction Background: Storm and Lambda Architecture Efficient processing of queries over RDF streams Architecture overview Storm-based operators for querying RDF streams Adaptive query processing for data streams RDF stream compression Conclusions & future work Open questions

Introduction Origins of Linked Stream Data Extracting information from data streams is complex: heterogeneity, rate of generation, volume, provenance,… Challenges C1. Efficient processing of user queries over RDF streams C2. Continuous transmission of data increases latency C3. Integration of historical and real-time data with background knowledge Source:

Background Storm - Distributed system for real-time processing of streams Why Storm? Simple processing model (parallelization) Open source community backing the project Used by relevant companies, e.g. Twitter. Lambda Architecture (Marz 2013) Batch layer: stores ALL the incoming data in an immutable master dataset and pre-computes batch views on historic data. Serving layer: indexes views on the master dataset. Real-time processing layer: requests data views depending on incoming queries.

Efficient processing of RDF streams Goal: to develop a stream processing engine capable of adapting to variable conditions, such as changing rates of input data, failure of processing nodes, or distribution of workload, while serving complex continuous queries. Methodology State of the art of (RDF) stream processing Evaluate how to parallelize SPARQLStream queries Implementation of RDF query operators Optimize parallelization for common queries Design self-adaptive strategies that allow the engine to react in front of changes

Architecture overview

Storm-based operators for querying RDF streams Triple2Graph operator Time Window operator Simple Join operator Projection operator Simple Join Stream > (join attribute) Stream, Set > Windowing Stream > (window size, emission time) Stream Project Stream > (input, output) Stream > Triple2Graph Stream (graph starter) Stream

Project Simple Join Windowing Windowing Windowing Windowing Storm-based operators for querying RDF streams Storm topology example (4 nodes) SELECT ?obs.value ?sensors.location FROM NAMED STREAM [60 SEC TO NOW] WHERE obs.sensorId = sensors.id ; SPOUT Triple2Graph Output SPOUT Triple2Graph t0 t1 t2 t3

RDF stream compression (1/2) Efficient RDF Interchange (ERI) format Based on Efficient XML Interchange (EXI) Main assumption: RDF streams have regular structure and are redundant Information encoded at 2 levels Structural dictionary Presets (values) Example: SSN observations

Where can we apply RDF stream compression?

Conclusions and future work Conclusions We have addressed challenges C1 (scalability) and C2 (transmission) Catalogue of Storm-based operators to parallelize query processing over RDF streams. New format for RDF stream compression called ERI. Challenge C3 (integration) involves storage of historical data and the deployment of batch and serving layers OR the migration to a more general system, e.g. Apache Spark. Future work Finish the implementation of RDF query operators Test the parallelization of a set of common queries → SRBench Adaptive strategies based on Adaptive Query Processing Evaluation → Benchmarking: comparison to CQELS Cloud Integration of ERI into our engine

Open questions How does the order of tuple arrival affect the parallelization of join processing tasks? Are the spatial (or spatio-temporal) properties of a tuple a dimension to have into account for ordering? In such case, what influence does it have on reasoning tasks? And on parallelization tasks? How does the out-of-order tuples affect the processing of streams? In case of discarding out-of- order tuples, how to communicate this in the results?

The research leading to this results has received funding from the EU's Seventh Framework Programme (FP7/ ) under grant agreement no , PlanetData network of excellence, from Ministerio de Economía y Competitividad (Spain) under the project “4V: Volumen, Velocidad, Variedad y Validez en la Gestión Innovadora de Datos” (TIN C4-2-R), and has been supported by an AWS in Education Research Grant award. Alejandro Llaves Thanks!

Adaptive Query Processing (AQP) for data streams Traditional databases include a query optimizer that designs the execution plan based on the data statistics. AQP (Deshpande 2007) techniques allow adjusting the query execution plan to varying conditions of the data input, the incoming queries, and the system.

RDF stream compression (2/2) Evaluation Datasets: streaming, statistical, and general static. Compression ratio, compression time, and parsing throughput (transmission + decompression) Comparison to other formats, such as N-Triples, Turtle, RDSZ, HDT, with different configurations of ERI w.r.t. transmitted data block (1K – 4K) and the presence of dictionary. Conclusion: ERI produces state-of-the-art compression for RDF streams and excels for regularly-structured static RDF datasets. ERI compression ratios remain competitive in general datasets and the time overheads for ERI processing are relatively low.