Range-Aggregate Query on Distributed Uncertain Database

Slides:

Advertisements

Similar presentations

Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.

Advertisements

LIBRA: Lightweight Data Skew Mitigation in MapReduce

BOAT - Optimistic Decision Tree Construction Gehrke, J. Ganti V., Ramakrishnan R., Loh, W.

University of Minnesota CG_Hadoop: Computational Geometry in MapReduce Ahmed Eldawy* Yuan Li* Mohamed F. Mokbel*$ Ravi Janardan* * Department of Computer.

Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin.

Data Integration Aggregate Query Answering under Uncertain Schema Mappings Avigdor Gal, Maria Vanina Martinez, Gerardo I. Simari, VS Subrahmanian Presented.

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.

Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.

Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.

Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.

Interpreting the data: Parallel analysis with Sawzall LIN Wenbin 25 Mar 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 11 Parallel.

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.

Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.

Scalable and Efficient Data Streaming Algorithms for Detecting Common Content in Internet Traffic Minho Sung Networking & Telecommunications Group College.

Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)

An Introduction to HDInsight June 27 th,

CERN – European Organization for Nuclear Research Administrative Support - Internet Development Services CET and the quest for optimal implementation and.

Supporting Large-scale Social Media Data Analyses with Customizable Indexing Techniques on NoSQL Databases.

Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,

Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.

Efficient Computation of Combinatorial Skyline Queries Author: Yu-Chi Chung, I-Fang Su, and Chiang Lee Source: Information Systems, 38(2013), pp

IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.

Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop.

Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.

Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉教授 : 許毅然作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.

ApproxHadoop Bringing Approximations to MapReduce Frameworks

CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.

HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.

Presented by: Dardan Xhymshiti Fall  Type: Research paper  Authors:  International conference on Very Large Data Bases. Yoonjar Park Seoul National.

System Support for High Performance Scientific Data Mining Gagan Agrawal Ruoming Jin Raghu Machiraju S. Parthasarathy Department of Computer and Information.

Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.

More SQL: Complex Queries, Triggers, Views, and Schema Modification

”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.

MapReduce MapReduce is one of the most popular distributed programming models Model has two phases: Map Phase: Distributed processing based on key, value.

Supporting Ranking and Clustering as Generalized Order-By and Group-By

MapReduce Compiler RHadoop

Optimizing Parallel Algorithms for All Pairs Similarity Search

ANOMALY DETECTION FRAMEWORK FOR BIG DATA

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Parallel Databases.

Hadoop MapReduce Framework

Created By: Dan Robert and Ronald Richardson II

Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.

SpatialHadoop: A MapReduce Framework for Spatial Data

Selectivity Estimation of Big Spatial Data

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

Ministry of Higher Education

Big Data - in Performance Engineering

A Unifying View on Instance Selection

Tools for Processing Big Data Jinan Al Aridhee and Christian Bach

Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &

Parallel Computation Patterns (Reduction)

Distributed Probabilistic Range-Aggregate Query on Uncertain Data

Xu Zhou Kenli Li Yantao Zhou Keqin Li

AGENDA Buzz word. AGENDA Buzz word What is BIG DATA ? Big Data refers to massive, often unstructured data that is beyond the processing capabilities.

Apache Hadoop and Spark

Range Likelihood Tree: A Compact and Effective Representation for Visual Exploration of Uncertain Data Sets Ohio State University (Shen) Problem: Uncertainty.

MapReduce: Simplified Data Processing on Large Clusters

Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.

Presentation transcript:

Range-Aggregate Query on Distributed Uncertain Database Graduate Team #1 Niranjan Rai, nrai@kent.edu Pujit Koirala, pkoirala@kent.edu Sudeep Tuladhar, stuladha@kent.edu

Uncertain/Probabilistic Data Database containing noise or the probability associated with it being accurate. Can be uncertain due to unreliable source of data. As the data is growing in its in Volume, Velocity and Variety, it also has great concern in Veracity(accuracy). Uncertain data can be mainly found in sensor data, trend predictions, social media data, etc.

Range Query Range query retrieves all the list of tuples that lies on the given range. a query range Q2 a query range Q1 Data objects

Aggregate Query Aggregate Query performs some aggregation function[COUNT, AVG, SUM, MIN, MAX ] and returns the summarized informations from the group of tuples. <B , 19> <H , 25> <A , 26> SUM = 231 AVG = 28.875 MIN = 12 MAX = 61 <G , 12> <E , 61> <C , 36> <Name, Age> <F , 32> <D , 20> Data objects

Range-Aggregate Query Given a Range and an aggregate function, Range-Aggregate Query returns the result of aggregate function for the given range . Basically, it is the combination of Range and Aggregate query. a query range Q1 Apply Aggregate function

Range-Aggregate Query on Uncertain Database In uncertain database we need to consider the probability associated with each of the instances contained in the uncertain object. Time and Memory consuming as we have to go through each tuple to check the probability associated within in the given range. There are very few researches that are focused on Range-Aggregate Query on Uncertain Database

Threshold Probability = 0.6 Range c1 a2 0.5 a1 0.8 b2 c2 b3 0.2 0.5 b1

Motivation and Problem Statement RAQ plays vital role in extracting valuable information and has many applications. Range-Aggregate Queries available only in either centralized uncertain data or distributed certain data. Need for distributed approach for faster processing of Uncertain Data Very few researches on Range-Aggregate Query on Uncertain Database.

Sample Dataset

Proposed Solution Perform Range-Aggregate Query on Uncertain Database in Distributed Environment. Use Map Reduce framework, a core component of Apache Hadoop Framework. It processes huge data in parallel and on clusters.

Proposed Solution <Range, Threshold> Partition 1 1 <objectID, Value> Partition 2 2 Aggregate value reducer Partition N N Uncertain Database mapper Figure 1: Data Flow Diagram of Proposed Solution

MapReduce Pseudo-Code func map(Range, threshold): for each tuple in db: if tuple.attribute >Range.min and tuple.attribute < Range.max: if tuple.confidence > thresholdProbability: emit(tupleId, tuple.value) end if end for end func func reduce(keys, values): return Array.average(values) //if aggregate function is average

Expected Output A system which can process Range-Aggregate query on Uncertain Database Query processing is parallel and efficient. A bug free parallel processing system which is much faster than the available centralized system.

Future Enhancement Implement indexing techniques to increase query speed and efficiency. Introduce effective pruning techniques.

Conclusion We proposed a Range-Aggregate query on Uncertain Database for Distributed System. Map-Reduce framework helps to perform queries faster even with less memory.

Questions??