Presentation transcript:

Explore or Exploit? Effective Strategies for Disambiguating Large Databases
Reynold Cheng†, Eric Lo‡, Xuan S. Yang†, Ming-Hay Luk‡, Xiang Li†, and Xike Xie†
†: University of Hong Kong {ckcheng, xyang2, xli, xkxie}@cs.hku.hk
‡: Hong Kong Polytechnic University {ericlo, csmhluk}@comp.polyu.edu.hk

Outline Introduction Solutions Experiments Conclusion & Future Work

Data Ambiguity
  Entity: Val1, Val2, …, Valn
  - Each entity has a set of possible values
  - Only one value out of the set is true (n-1 false values)
  Models:
  - Attribute Uncertainty [N. Dalvi, VLDB’04]
  - Set Valued Attribute [J. Pei, VLDB’07]
  Example (from AddAll.com):
    Item: Effective C++ in AMAZON
    Price: 27.49, 30.68, 30.99, 33.68
  Example questions:
  1 Is the book’s price lower than $30?
  2 Who is the offender?
  3 Where is the person?
  …

Data Cleaning
  Item: Effective C++ in AMAZON
  Price: 27.49, 30.68, 30.99, 33.68
  Cleaning
  - Information Availability
  - Cost
  - One cleaning operation may not be able to remove all false values
  - Cleaning may fail
  Cleaning probabilistic database [R. Cheng, VLDB’08]
  …

Data Cleaning Model
  Entity   # of false values
  T1       5
  T2       3
  T3       6
  T4       4
  T5       1
  Cleaning Operation clean(Ti)
  - Cost: 1
  - Successful Cleaning Probability (sc-prob): 0.1, 0.4, 0.7, 1
  - Incompleteness: # of false values removed per operation: 1
  sc-prob: UNKNOWN; sc-pdf: KNOWN
  Objective: Remove as many false values as possible under a given # of cleaning operations.
  Cleaning the entities in decreasing order of their sc-prob.
  Speaker note: But we may know some information from the previous cleaning procedure, which I will introduce later.

Heuristic-Based Algorithms
  Random Algorithm
  - Randomly choose 1 item to clean
  Greedy Algorithm
  - Use pi’ = successes / trials to estimate pi
  - Choose the entity with the highest pi’
  ε-Greedy Algorithm
  - With probability ε, randomly choose 1 entity; otherwise, same as the Greedy Algorithm
  Speaker notes:
  1 Random: we would prefer to clean the entities with a high sc-probability.
  2 Greedy: the estimate is not precise and cannot be trusted with 100% confidence.
  3 ε-Greedy: a hybrid that places only partial confidence in the estimated sc-probability; ε is a user-set parameter.
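
The three heuristics on this slide can be sketched as one small simulation. This is an illustrative sketch, not the authors' implementation: the function name `simulate`, the optimistic estimate of 1.0 for untried entities, and the sc-prob values used for T3 and T5 are all assumptions introduced here.

```python
import random

def simulate(policy, false_values, sc_probs, budget, epsilon=0.1, seed=0):
    """Run one cleaning session; return the number of false values removed.

    policy is "random", "greedy" or "epsilon-greedy". Each clean(Ti) costs 1,
    succeeds with probability sc_probs[i], and removes one false value.
    """
    rng = random.Random(seed)
    remaining = list(false_values)
    successes = [0] * len(remaining)
    trials = [0] * len(remaining)
    removed = 0
    for _ in range(budget):
        dirty = [i for i, r in enumerate(remaining) if r > 0]
        if not dirty:
            break  # nothing left to clean
        pick_randomly = policy == "random" or (
            policy == "epsilon-greedy" and rng.random() < epsilon)
        if pick_randomly:
            i = rng.choice(dirty)
        else:
            # Greedy: highest estimate pi' = successes / trials.
            # Assumption: an untried entity gets an optimistic estimate of 1.0.
            i = max(dirty,
                    key=lambda j: successes[j] / trials[j] if trials[j] else 1.0)
        trials[i] += 1
        if rng.random() < sc_probs[i]:
            successes[i] += 1
            remaining[i] -= 1
            removed += 1
    return removed

# Slide example. The sc-probs of T3 and T5 are placeholders, not from the slides.
r = [5, 3, 6, 4, 1]
p = [0.1, 0.4, 0.4, 0.7, 1.0]
for policy in ("random", "greedy", "epsilon-greedy"):
    removed = simulate(policy, r, p, budget=10)
    print(policy, "effectiveness R/C =", removed / 10)
```

Dividing the returned count by the budget gives the effectiveness R/C used later in the talk.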

Outline Introduction Solutions Experiments Conclusion & Future Work

Multi Armed Bandit Problem
  - K Slot Machines
  - Hidden Probabilities: p1, p2, …, pk
  - Rewards
  - Cost & Budget
  - Objective

Comparison between Cleaning and MAB
  MAB: p1, p2, …, pk; Infinite # of Coins
  Cleaning:
  Entity   # of false values   sc-prob
  T1       5                   0.1
  T2       3                   0.4
  T3       6
  T4       4                   0.7
  T5       1
  Cost & Budget
  Objective: Remove as many false values as possible under a given # of cleaning operations
  Related problems:
  - Classic MAB Problem [D. Berry, 1985]
  - MAB Problem with limited life time [D. Chakrabarti, NIPS’08]

sc-pdf
  - Don’t know the sc-prob of each individual entity
  - Known sc-pdf: the distribution of sc-prob
  Entity   # of false values   sc-prob
  T1       5                   0.1
  T2       3                   0.4
  T3       6
  T4       4                   0.7
  T5       1
  [Histogram of the sc-pdf. x-axis: sc-prob (0.1, 0.4, 0.7, 1); y-axis: freq (1/5, 2/5)]

Important Notations
  Notation    Meaning
  Ti          Ambiguous entity
  ri          # of false values in Ti
  pi          sc-probability
  clean(Ti)   Cleaning Ti
  C           Total cleaning budget
  R           # of false values removed by an algorithm
  ξ(A)        Effectiveness, R/C
  f           sc-pdf

The EE-Algorithm (t = 3, q = 2/3)
  Entity   # of false values   sc-prob
  T1       5                   0.1
  T2       3                   0.4
  T3       6
  T4       4                   0.7
  T5       1
  Explore T2 for t = 3 trials: m = 1 success.
  Test: m/t = 1/3 >= 2/3? No, so T2 is not exploited.

The EE-Algorithm (t = 3, q = 2/3)
  Entity   # of false values   sc-prob
  T1       5                   0.1
  T2       3                   0.4
  T3       6
  T4       4                   0.7
  T5       1
  Explore T4 for t = 3 trials: m = 2 successes, leaving 2 false values.
  Test: m/t = 2/3 >= 2/3? Yes, so T4 is exploited: cleaning continues on its remaining false values.
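
The explore-then-exploit walk-through on these two slides can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the name `ee_clean`, visiting entities in list order, and exploiting an entity until it is clean or the budget runs out are choices made here for illustration.

```python
import random
from fractions import Fraction

def ee_clean(false_values, sc_probs, budget, t, q, seed=0):
    """Explore each entity for up to t trials; exploit it if m/t >= q.

    Returns (false values removed, cleaning operations used).
    """
    rng = random.Random(seed)
    remaining = list(false_values)
    used = removed = 0

    def clean(i):
        """One cleaning operation on entity i; True on success."""
        nonlocal used, removed
        used += 1
        if rng.random() < sc_probs[i]:
            remaining[i] -= 1
            removed += 1
            return True
        return False

    for i in range(len(remaining)):
        # Explore phase: d trials so far, m of them successful.
        m = d = 0
        while d < t and remaining[i] > 0 and used < budget:
            d += 1
            m += clean(i)
        # Exploit phase: only if all t trials ran and m/t reached q.
        if d == t and Fraction(m, t) >= q:
            while remaining[i] > 0 and used < budget:
                clean(i)
        if used >= budget:
            break
    return removed, used

# Slide parameters; the sc-probs of T3 and T5 are placeholders, not from the slides.
removed, used = ee_clean([5, 3, 6, 4, 1], [0.1, 0.4, 0.4, 0.7, 1.0],
                         budget=15, t=3, q=Fraction(2, 3))
print("removed", removed, "of budget", used)
```

Passing q as a `Fraction` keeps the test m/t >= q exact, so the borderline case 2/3 >= 2/3 from the slide is not affected by floating-point rounding.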

Setting Parameters for EE: Estimation of Cleaning Effectiveness
  - χi: # of cleaning operations used
  - γi: # of false values removed
  - Pne(p): the probability that an entity with sc-probability p is explored but not exploited
  - Et(p): the expected number of false values removed from an entity with sc-probability p after exploration and before exploitation
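
As an illustration of Pne(p), the sketch below computes it under a simplifying assumption that is not stated on the slide: the exploration always runs its full t trials (the entity does not run out of false values first), so the success count m follows a Binomial(t, p) distribution and the entity is not exploited exactly when m/t < q. The function name `p_not_exploited` is hypothetical.

```python
from fractions import Fraction
from math import comb

def p_not_exploited(p, t, q):
    """P(m/t < q) for m ~ Binomial(t, p): explored for t trials, not exploited.

    Pass q as a Fraction so that borderline ratios such as 2/3 compare exactly.
    """
    return sum(comb(t, m) * p**m * (1 - p)**(t - m)
               for m in range(t + 1)
               if Fraction(m, t) < q)

# With the slide parameters t = 3 and q = 2/3, only m = 0 or m = 1 fail the test.
for p in (0.4, 0.7):
    print(p, p_not_exploited(p, 3, Fraction(2, 3)))
```

Under this assumption, an entity with sc-probability 0.4 is passed over with probability about 0.648, and one with sc-probability 0.7 with probability about 0.216.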

Setting Parameters for EE: Finding the Best Parameters
  - Bound the exploration frequency (t) with E[ri]/E[pi]
  - Discretize the region [0, 1] with an interval δ
  - Find the (t, q) pair that maximizes the estimated cleaning effectiveness
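
The search over (t, q) can be sketched as a plain grid search. The slides score each pair with an analytic estimate of cleaning effectiveness, which is not reproduced in this transcript, so the sketch below swaps in a Monte Carlo simulation as the score. The names `simulate_ee`, `best_parameters`, `draw_entity`, `t_max` and `runs` are hypothetical, and `draw_entity` stands in for sampling (ri, pi) from the known sc-pdf.

```python
import random

def simulate_ee(entities, budget, t, q, rng):
    """One simulated EE run over (r_i, p_i) pairs; returns R / C (budget > 0)."""
    removed = used = 0
    for r_i, p_i in entities:
        m = d = 0
        # Explore: up to t trials on this entity.
        while d < t and r_i > 0 and used < budget:
            d += 1
            used += 1
            if rng.random() < p_i:
                m += 1
                r_i -= 1
                removed += 1
        # Exploit if m/t >= q (small tolerance because q is a float here).
        if d == t and m >= q * t - 1e-9:
            while r_i > 0 and used < budget:
                used += 1
                if rng.random() < p_i:
                    r_i -= 1
                    removed += 1
        if used >= budget:
            break
    return removed / budget

def best_parameters(draw_entity, n_entities, budget, t_max, delta, runs=10, seed=0):
    """Grid search: t in 1..t_max, q in {0, delta, 2*delta, ..., 1}."""
    rng = random.Random(seed)
    steps = round(1 / delta)
    best_t, best_q, best_score = None, None, -1.0
    for t in range(1, t_max + 1):
        for k in range(steps + 1):
            q = k / steps
            score = 0.0
            for _ in range(runs):
                entities = [draw_entity(rng) for _ in range(n_entities)]
                score += simulate_ee(entities, budget, t, q, rng)
            score /= runs
            if score > best_score:
                best_t, best_q, best_score = t, q, score
    return best_t, best_q, best_score

# Assumed sc-pdf for illustration: uniform sc-prob, 1 to 10 false values per entity.
t, q, score = best_parameters(lambda rng: (rng.randint(1, 10), rng.random()),
                              n_entities=50, budget=100, t_max=4, delta=0.25)
print("t =", t, "q =", q, "estimated effectiveness =", score)
```

A smaller δ gives a finer grid for q at the cost of more evaluations; `t_max` plays the role of the bound on the exploration frequency.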

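Optimization: Stopping the Exploration Early
  - During the explore procedure, if we find that m/t must end up lower than q, stop exploring.
  - d: # of trials so far in the explore phase; m: # of successes among them
  - Exploration continues only while d - m <= (1-q)*t. Once d - m > (1-q)*t, even if all t - d remaining trials succeed, m/t stays below q.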
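
A small sketch of the early-stopping test, reading the slide's condition as: exploration can be abandoned once the failures so far, d - m, exceed (1-q)*t, because the threshold q is then out of reach. The function names `can_still_reach` and `explore` are hypothetical, and the outcome lists are made-up inputs.

```python
from fractions import Fraction
from itertools import islice

def can_still_reach(d, m, t, q):
    """True if m/t >= q is still achievable after d trials with m successes.

    Equivalent to d - m <= (1 - q) * t: the failures so far do not rule it out.
    """
    return m + (t - d) >= q * t

def explore(outcomes, t, q):
    """Consume up to t boolean trial outcomes, stopping early when q is unreachable.

    Returns (trials used, successes, exploit?).
    """
    d = m = 0
    for ok in islice(outcomes, t):
        d += 1
        m += bool(ok)
        if not can_still_reach(d, m, t, q):
            return d, m, False  # stop early and save the remaining trials
    return d, m, d == t and Fraction(m, t) >= q

q = Fraction(2, 3)
print(explore([False, False, True], 3, q))  # two failures already rule out 2/3
print(explore([True, False, True], 3, q))   # runs all three trials
```

With t = 3 and q = 2/3, one failure still leaves the threshold reachable, while a second failure ends the exploration one trial early.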

Outline Introduction Solutions Experiments Conclusion & Future Work

Experiments
  Datasets: Movie Dataset, Synthetic Dataset
  Statistics:
  Dataset     # of entities   Avg # of false values   sc-pdf    Default Budget
  Movie       4,999           1                       Uniform   5,000
  Synthetic   50,000          9.5                     Normal    10,000

Effectiveness vs. Budget (Synthetic Data)
  - When the sc-pdf is known to the algorithm, effectiveness is the highest.
  - Apart from that case, EE gives the highest effectiveness and is quite stable.
  - ε-Greedy is the third highest, but it is less stable: as the budget grows, its effectiveness drops significantly.
  - The Random and Greedy algorithms perform worst.

Summary of Other Results
  - Different sc-pdf: Uniform; Gaussian(0.5, 0.13), (0.5, 0.1667), (0.5, 0.3)
  - Different average number of false values: 2, 4.5, 7, 9.5
  - Effectiveness of t and q
  - Time Efficiency

Outline Introduction Solutions Experiments Conclusion & Future Work

Conclusions
  - We identify a realistic problem: removing data ambiguity under a tight cleaning budget.
  - We borrow the idea of the Multi-Armed Bandit (MAB) problem and develop the Explore-Exploit (EE) algorithm.
  - Detailed experiments show that EE performs better than simple variants of Greedy heuristics.
  - We are studying the problem in a more complex setting, e.g., where the cost of removing ambiguity varies across entities.

References
  [N. Dalvi, VLDB’04]: N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
  [J. Pei, VLDB’07]: J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In VLDB, 2007.
  [A. Deshpande, VLDB’04]: A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
  [R. Cheng, VLDB’08]: R. Cheng, J. Chen, and X. Xie. Cleaning uncertain data with quality guarantees. In VLDB, 2008.
  [D. Berry, 1985]: D. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.
  [D. Chakrabarti, NIPS’08]: D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal Multi-Armed Bandits. In NIPS, 2008.

Thank you!  Shawn Yang xyang2@cs.hku.hk

Effectiveness vs. Dataset Characteristics

Effect of Parameters

Time Efficiency

Conclusions
  - Build the ambiguity and cleaning model to describe the disambiguation procedure
  - An algorithmic framework of exploring and exploiting, with a proven estimate of cleaning effectiveness
  - A concrete solution based on the framework

Future work
  - Unknown sc-pdf
  - Different cost
  - Multiple removal of the false values
  - Calculation of the parameters (tmax, qmax)