Xin Luna Dong (AT&T Labs → Google Inc.), Barna Saha, Divesh Srivastava (AT&T Labs-Research). VLDB’2013.


* Lots of money

* Lots of machines

* Lots of people

CS books from AbeBooks.com (894 sources and 1265 CS books in total):

* 1096 books from the largest source
* 1213 books from the 2 largest sources
* 1250 books from the 10 largest sources
* 1260 books from the first 35 sources
* All 1265 books from the first 537 sources

CS books from AbeBooks.com (gold standard of 100 books):

* All 100 gold-standard books from the first 548 sources
* 90 (> 80) books with correct authors after 579 sources (Accu)
* 93 (> 80) books with correct authors after 583 sources (Vote)
* 78 books with correct authors for Vote
* 80 books with correct authors for Accu

* Questions
  * Is it best to integrate all data?
  * How can computing resources be spent wisely?
  * How can sources be selected wisely before real integration, to balance the gain and the cost?
* Source selection is a prelude to data integration and lies outside the traditional integration tasks (schema mapping, entity resolution, data fusion)

CS books from AbeBooks.com, under a budget:

* 17 books with correct authors from 300 sources (the budget)
* 14 books (17.6% fewer) with correct authors from the first 200 sources (33% fewer resources)

CS books from AbeBooks.com, under a quality requirement:

* 65 books with correct authors (the quality requirement) from the first 520 sources
* 81 books (25% more) with correct authors from 526 sources (1% more)

[Figure: marginal gain and marginal cost curves under the Law of Diminishing Returns; the largest profit is reached where marginal gain = marginal cost.]

CS books from AbeBooks.com: the marginal point with the largest profit in this ordering is at 548 sources.

* Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points.
* Challenge 2. Sources differ in quality, so different orderings lead to different marginal points; the best solution integrates 26 sources.
* Challenge 3. Gain and cost must be estimated without real integration.
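To make the notion of a marginal point concrete, the following minimal sketch scans one fixed ordering of sources and reports every prefix length at which the cumulative profit peaks. The function name `marginal_points` and the per-source gain and cost numbers are illustrative assumptions, not values from the AbeBooks data.

```python
from itertools import accumulate


def marginal_points(gains, costs):
    """Find profit peaks along one fixed ordering of sources.

    gains[i] and costs[i] are the (hypothetical) marginal gain and marginal
    cost of adding the (i+1)-th source in the ordering, in the same unit.
    Returns (local_maxima, best_prefix), both as numbers of sources.
    """
    # profits[k] is the cumulative profit after integrating the first k sources.
    profits = [0.0] + list(accumulate(g - c for g, c in zip(gains, costs)))
    last = len(profits) - 1
    local_maxima = [
        k
        for k in range(1, len(profits))
        if profits[k] > profits[k - 1]
        and (k == last or profits[k] >= profits[k + 1])
    ]
    # Prefix length with the largest profit (0 means: integrate nothing).
    best_prefix = max(range(len(profits)), key=profits.__getitem__)
    return local_maxima, best_prefix


# Toy ordering in which gains do not diminish monotonically.
toy_gains = [5, 1, 4, 0.5]
toy_costs = [1, 2, 1, 1]
peaks, best = marginal_points(toy_gains, toy_costs)
print(peaks, best)
```

Because the third source in this toy ordering contributes more than the second, the profit curve has two peaks, which is the situation Challenge 1 describes. A different ordering of the same sources would generally give different peaks (Challenge 2).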

* Input
  * $S$: a set of available sources
  * $F$: an integration model
* Output: a subset $\hat{S} \subseteq S$ that maximizes the profit $G_F(\hat{S}) - C_F(\hat{S})$
  * $G_F(\hat{S})$: gain of integrating $\hat{S}$ using model $F$
  * $C_F(\hat{S})$: cost of integrating $\hat{S}$ using model $F$
* Gain and cost need to be in the same unit to be comparable; e.g., $
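The problem statement can be sketched as an exhaustive search over subsets. This is only an illustration for a handful of sources, since the number of subsets grows exponentially. The gain model used here (one dollar per distinct book covered), the per-source prices, and the names `best_subset`, `coverage`, and `price` are all assumptions made for the example, not the paper's models.

```python
from itertools import combinations


def profit(selected, gain, cost):
    """Profit of a set of sources: gain minus cost, in the same unit."""
    return gain(selected) - cost(selected)


def best_subset(sources, gain, cost):
    """Try every subset of `sources` and return (subset, profit) of the best."""
    best = frozenset()
    best_profit = profit(best, gain, cost)
    for size in range(1, len(sources) + 1):
        for combo in combinations(sources, size):
            candidate = frozenset(combo)
            p = profit(candidate, gain, cost)
            if p > best_profit:  # strict: keep the first subset found on ties
                best, best_profit = candidate, p
    return best, best_profit


# Hypothetical toy input: which books each source covers, and what it costs.
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}
price = {"A": 1.0, "B": 1.5, "C": 0.5}


def toy_gain(selected):
    # Assumed gain model: one dollar per distinct book covered.
    return len(set().union(*(coverage[s] for s in selected)))


def toy_cost(selected):
    # Assumed cost model: sum of per-source prices.
    return sum(price[s] for s in selected)


chosen, chosen_profit = best_subset(["A", "B", "C"], toy_gain, toy_cost)
print(sorted(chosen), chosen_profit)
```

In this toy input, integrating all three sources yields a lower profit than integrating source A alone, because the extra coverage does not pay for the extra cost.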

* Theorem I (NP-completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete.
* Theorem II (a greedy solution can obtain arbitrarily bad results). Let $d_{opt}$ be the optimal profit and $d$ the profit of a greedy solution. For any $\theta > 0$, there exists an input set of sources and a gain model such that $d / d_{opt} < \theta$.
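The intuition behind Theorem II can be illustrated with a small hand-built instance. The construction below is an assumption made for illustration and is not the proof from the paper: sources Y and Z are worthless alone but very valuable together, so a greedy procedure that adds one source at a time never reaches them. The names `greedy_select` and `make_profit` are hypothetical.

```python
def greedy_select(sources, profit):
    """Add the source with the best resulting profit until nothing improves."""
    selected = frozenset()
    current = profit(selected)
    while True:
        candidates = [s for s in sources if s not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda s: profit(selected | {s}))
        best_profit = profit(selected | {best})
        if best_profit <= current:  # no single source improves the profit
            break
        selected, current = selected | {best}, best_profit
    return selected, current


def make_profit(big):
    """Assumed gain model: X alone is worth 2; Y and Z are worth `big` only
    when integrated together. Every source costs 1."""
    def profit(selected):
        gain = 0.0
        if "X" in selected:
            gain += 2.0
        if {"Y", "Z"} <= selected:
            gain += big
        return gain - len(selected)
    return profit


toy_profit = make_profit(1000.0)
greedy_set, d = greedy_select(["X", "Y", "Z"], toy_profit)
d_opt = toy_profit(frozenset({"X", "Y", "Z"}))  # best subset in this instance
print(sorted(greedy_set), d, d_opt, d / d_opt)
```

Greedy stops after selecting X, because adding Y or Z alone lowers the profit. Increasing `big` makes the ratio `d / d_opt` as small as desired, which matches the shape of the theorem's statement.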

* Improvement I. Randomly select from the top-k solutions
* Improvement II. Use hill climbing to improve the initial solution
* Improvement III. Repeat r times and choose the best solution
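The three improvements can be combined as in the following minimal sketch of a GRASP-style search. It is written under stated assumptions and is not the authors' exact algorithm: the hill-climbing step here only toggles one source at a time, and the function name `grasp`, the defaults for `k` and `r`, and the toy coverage and price data are all hypothetical.

```python
import random


def grasp(sources, profit, k=3, r=20, seed=0):
    """Randomized greedy construction + hill climbing, repeated r times."""
    rng = random.Random(seed)
    best, best_profit = frozenset(), profit(frozenset())
    for _ in range(r):  # Improvement III: repeat and keep the best
        selected = frozenset()
        current = profit(selected)
        # Improvement I: pick randomly among the top-k improving candidates.
        while True:
            scored = sorted(
                ((profit(selected | {s}), s) for s in sources if s not in selected),
                reverse=True,
            )
            improving = [(p, s) for p, s in scored[:k] if p > current]
            if not improving:
                break
            current, chosen = rng.choice(improving)
            selected = selected | {chosen}
        # Improvement II: hill climbing by adding or removing a single source.
        improved = True
        while improved:
            improved = False
            for s in sources:
                neighbor = selected ^ {s}  # toggle membership of s
                p = profit(neighbor)
                if p > current:
                    selected, current, improved = neighbor, p, True
        if current > best_profit:
            best, best_profit = selected, current
    return best, best_profit


# Hypothetical toy input: one dollar of gain per distinct book covered.
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}
price = {"A": 1.0, "B": 1.5, "C": 0.5}


def toy_profit(selected):
    covered = set().union(*(coverage[s] for s in selected))
    return len(covered) - sum(price[s] for s in selected)


grasp_set, grasp_profit = grasp(["A", "B", "C"], toy_profit, k=2, r=10)
print(sorted(grasp_set), grasp_profit)
```

Fixing `seed` keeps runs reproducible. Larger `k` and `r` explore more of the search space at a higher computation cost.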

* Side contributions on data fusion
  * The PopAccu model: monotonicity (adding a source should never decrease fusion quality)
  * Algorithms to estimate fusion quality: dynamic programming

* Book data set: CS books at AbeBooks.com in 2007
  * 894 sources
  * 1265 books
  * records
* Flight data set: Deep-Web sources for “flight status” in 2011
  * 38 sources
  * 1200 flights
  * records

* 228 sources provide books in the gold standard
* Marginalism selects 165 sources, reaching the highest quality
* PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources

Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time

* The greedy solution often cannot find the optimal solution
* GRASP (top-10, repeating 320 times) obtains nearly optimal results

* Full-fledged source selection for data integration
  * Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources
  * Complex cost and gain models
  * Selecting subsets of data from each source
  * Other components of data integration: schema mapping, entity resolution

The More the Better? OR Less is More?