Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. Presented by Qian Wan, HKUST. Based on [1][2]

Introduction: Uncertain Data Management. Modeling uncertain data: the possible worlds model. Uncertain data management: top-k, join, kNN, skyline, indexing, etc. Uncertain data mining: clustering, classification, frequent patterns, outlier detection.

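Introduction: Data Representation. A simple way to represent probabilistic data: each tuple has a confidence (its membership probability). Pr(instance) = ∏ Pr(presence) × ∏ Pr(absence), that is, the product of p over the tuples present in the instance times the product of (1 − p) over the tuples absent from it. Mutual exclusion constraints for each tuple. Scoring function.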
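
As an illustration of the instance-probability formula on this slide, here is a minimal Python sketch. It assumes the tuples are independent (no mutual exclusion rules); the function name `world_probability` and the tuple names are made up for the example.

```python
def world_probability(confidences, present):
    """Probability of one possible world (instance).

    confidences: dict mapping tuple name -> membership probability.
    present: set of tuple names that appear in this world.
    Assumes independent tuples (no mutual exclusion constraints).
    """
    prob = 1.0
    for name, p in confidences.items():
        # Present tuples contribute p, absent tuples contribute (1 - p).
        prob *= p if name in present else (1.0 - p)
    return prob


confidences = {"T1": 0.5, "T2": 0.4, "T3": 1.0}
p_world = world_probability(confidences, {"T1", "T3"})  # 0.5 * (1 - 0.4) * 1.0
```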

Introduction: Other Works. U-Topk: returns k tuples that co-exist in a possible world. U-kRanks and PT-k: return tuples according to the marginal distribution of top-k results.
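
The two families of semantics can be contrasted with a small brute-force sketch. It rests on assumptions not stated on the slide: tuples are independent, U-Topk is read as "the top-k vector with the highest total probability over all possible worlds", and the marginal quantity shown is each tuple's probability of appearing in the top-k (the quantity a PT-k style threshold would be applied to). The function name and data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def topk_semantics(tuples, k):
    """tuples: dict name -> (score, probability); tuples assumed independent."""
    names = list(tuples)
    vector_prob = defaultdict(float)   # top-k vector -> total probability
    in_topk_prob = defaultdict(float)  # tuple name -> Pr(tuple is in the top-k)
    for present in product([False, True], repeat=len(names)):
        prob = 1.0
        alive = []
        for name, here in zip(names, present):
            p = tuples[name][1]
            prob *= p if here else 1.0 - p
            if here:
                alive.append(name)
        if len(alive) < k:
            continue  # this world has no top-k answer
        top = tuple(sorted(alive, key=lambda n: tuples[n][0], reverse=True)[:k])
        vector_prob[top] += prob
        for name in top:
            in_topk_prob[name] += prob
    u_topk = max(vector_prob, key=vector_prob.get)
    return u_topk, dict(in_topk_prob)


data = {"A": (10, 0.4), "B": (8, 0.5), "C": (5, 1.0)}
u_topk, marginals = topk_semantics(data, k=2)
```

In this toy input the most probable top-2 vector does not contain the highest-scoring tuple A, which is the kind of behaviour the following slides discuss.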

Introduction: Other Works (Example)

Introduction: Other Works (drawbacks). The top-k result may be atypical. The distribution of scores is not used.

Introduction: c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.
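
To make "expected distance" concrete, here is a brute-force sketch in Python. It assumes the objective is the expected distance from a random top-k score to its nearest typical score, and that typical scores are picked from the scores that actually occur in the distribution. The distribution below is invented for illustration; it is not the example from the slides.

```python
from itertools import combinations


def expected_distance(distribution, typical):
    """Expected distance from a score to its nearest typical score.

    distribution: dict mapping top-k total score -> probability.
    """
    return sum(p * min(abs(s - t) for t in typical)
               for s, p in distribution.items())


def c_typical_brute_force(distribution, c):
    """Try every size-c subset of the occurring scores and keep the best."""
    best = min(combinations(sorted(distribution), c),
               key=lambda cand: expected_distance(distribution, cand))
    return list(best), expected_distance(distribution, best)


toy = {10: 0.2, 12: 0.3, 20: 0.1, 30: 0.4}
typical, cost = c_typical_brute_force(toy, c=2)
```

Enumerating all subsets is exponential in c, so this only serves to pin down the definition.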

Algorithm: distribution of the top-2 tuples' scores

Algorithm: Naïve approach. INPUT: tuples with membership probabilities. OUTPUT: the top-k score distribution. IDEA: recursively go through all possible worlds to calculate all probabilities, until reaching a threshold.
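
A minimal sketch of the naïve idea, under some assumptions: tuples are independent, worlds with fewer than k tuples are skipped, and the "threshold" cutoff mentioned on the slide is left out because the slide does not say how it is defined. The function name and the input are hypothetical.

```python
from collections import defaultdict
from itertools import product


def naive_topk_score_distribution(tuples, k):
    """tuples: list of (score, probability); tuples assumed independent.

    Returns {total score of the top-k tuples: probability}.
    """
    dist = defaultdict(float)
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        scores = []
        for (score, p), here in zip(tuples, present):
            prob *= p if here else (1.0 - p)
            if here:
                scores.append(score)
        if len(scores) < k or prob == 0.0:
            continue  # no top-k answer, or an impossible world
        total = sum(sorted(scores, reverse=True)[:k])
        dist[total] += prob
    return dict(dist)


naive_dist = naive_topk_score_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], k=2)
```

The loop visits 2^n worlds for n tuples, which is what motivates the DP approach on the next slides.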

Algorithm: a DP approach. D(i,j): the score distribution of the top-j tuples, considering only tuples from Ti onward. The overall goal is D(1,k).

Algorithm: a DP approach. Transformation: D(i,j) = TF[D(i+1,j), D(i+1,j-1)]. From D(i+1,j) (Ti absent): for each (v,p), add (v, p(1-pi)). From D(i+1,j-1) (Ti present): for each (v,p), add (v+si, p*pi). Merge duplicate items. Bottom-up DP. Approximation.
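
The transformation can be sketched as a bottom-up DP in Python. The assumptions here are mine: tuples are independent and processed in descending score order, the boundary conditions are D(n+1, 0) = {(0, 1)} and D(n+1, j) = empty for j > 0, and the approximation step mentioned on the slide is omitted. Names are hypothetical.

```python
from collections import defaultdict


def topk_score_distribution(tuples, k):
    """tuples: list of (score, probability); tuples assumed independent.

    Returns D(1, k) as {total top-k score: probability}.
    """
    ordered = sorted(tuples, key=lambda t: t[0], reverse=True)
    # prev[j] holds D(i+1, j). Assumed boundary: past the last tuple,
    # only the empty selection (j = 0) is possible.
    prev = [{0: 1.0}] + [{} for _ in range(k)]
    for score, p in reversed(ordered):           # i = n, n-1, ..., 1
        cur = [{0: 1.0}]                         # D(i, 0)
        for j in range(1, k + 1):
            merged = defaultdict(float)          # merges duplicate scores
            for v, q in prev[j].items():         # Ti absent
                merged[v] += q * (1.0 - p)
            for v, q in prev[j - 1].items():     # Ti present
                merged[v + score] += q * p
            cur.append(dict(merged))
        prev = cur
    return prev[k]


dp_dist = topk_score_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], k=2)
```

On this small input the result is the same distribution that enumerating all possible worlds would give, but the table has only n × k cells.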

Handling More Realistic Scenarios. Handling mutually exclusive (ME) rules: compress the ME group; refine by lead tuple region. Handling ties: when two tuples have the same score, rank them according to probability.
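
The tie rule can be expressed as a two-part sort key. The slide does not say in which direction probability breaks a tie; the sketch below assumes the more probable tuple is ranked first. The function name and tuples are hypothetical.

```python
def rank_order(tuples):
    """tuples: list of (name, score, probability).

    Sort by score descending; on equal scores, by probability descending
    (the direction of the tie-break is an assumption).
    """
    return sorted(tuples, key=lambda t: (-t[1], -t[2]))


ranked = rank_order([("T1", 90, 0.3), ("T2", 90, 0.8), ("T3", 95, 0.5)])
ranked_names = [name for name, _, _ in ranked]
```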

Algorithm 3-Typical-Top 2 scores

c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.

Computing c-Typical-Top k. Define F^a(j) to be the optimal objective value over {sj, …, sn}, where a is the number of typical scores. G^a(j) is defined analogously.

Computing c-Typical-Top k. Solve the two-function optimization problem using DP. Boundary conditions.
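
One way such a two-function DP could look is sketched below. The slides do not spell out the recurrences or boundary conditions, so everything here is an assumption: G^a(j) is taken to be the same objective as F^a(j) with the extra constraint that sj itself is a typical score, scores are sorted ascending, typical scores come from the occurring scores, and the objective is the expected distance to the nearest typical score. Only the optimal cost is returned.

```python
from functools import lru_cache


def c_typical_dp(distribution, c):
    """distribution: dict score -> probability. Returns the minimal expected distance."""
    scores = sorted(distribution)
    probs = [distribution[s] for s in scores]
    n = len(scores)
    if not 1 <= c <= n:
        raise ValueError("c must be between 1 and the number of distinct scores")

    def gap_cost(j, m):
        # Scores strictly between two adjacent typical scores go to the nearer one.
        return sum(probs[i] * min(scores[i] - scores[j], scores[m] - scores[i])
                   for i in range(j + 1, m))

    @lru_cache(maxsize=None)
    def G(a, j):
        # Assumed meaning: best cost over s_j..s_n with a typical scores,
        # where s_j is the smallest typical score.
        if a == 1:  # assumed boundary: everything to the right maps to s_j
            return sum(probs[i] * (scores[i] - scores[j]) for i in range(j, n))
        return min(gap_cost(j, m) + G(a - 1, m) for m in range(j + 1, n - a + 2))

    def F(a, j):
        # Best cost over s_j..s_n with a typical scores; m is the first one chosen.
        return min(sum(probs[i] * (scores[m] - scores[i]) for i in range(j, m)) + G(a, m)
                   for m in range(j, n - a + 1))

    return F(c, 0)


toy = {10: 0.2, 12: 0.3, 20: 0.1, 30: 0.4}
best_cost = c_typical_dp(toy, c=2)
```

This version recomputes the gap costs on every call for clarity; precomputing them would be the obvious optimization.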

Empirical Study: 3-Typical vs. U-Topk

Empirical Study

Q&A

References
[1] Charu C. Aggarwal, Philip S. Yu. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 2009.
[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD, 2009.