From Data Integration to Community Information Management
AnHai Doan, University of Illinois
Joint work with Pedro DeRose, Robert McCann, Yoonkyong Lee, Mayssam Sayyadian, Warren Shen, Wensheng Wu, Quoc Le, Hoa Nguyen, Long Vu, Robin Dhamankar, Alex Kramnik, Luis Gravano, Weiyi Meng, Raghu Ramakrishnan, Dan Roth, Arnon Rosenthal, and Clement Yu

2. Data Integration Challenge
A new researcher asks: "Find houses with 4 bedrooms priced under 300K"
–sources: homes.com, realestate.com, homeseekers.com

3. Actually Bought a House in 2004
Buying period
–queried 7-8 data sources over 3 weeks
–some of the sources are local, not "indexed" by national sources
–3 hours / night → 60+ hours
–huge amount of time spent on querying and post-processing
Buyer-remorse period
–repeated the above for another 3 weeks!
We really need to automate data integration...

4. Architecture of Data Integration Systems
[Diagram: the query "Find houses with 4 bedrooms priced under 300K" is posed against a mediated schema, which is mapped to source schemas 1-3 of houses.com, homes.com, and realestate.com; a wrapper sits between the system and each source]
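The mediator/wrapper architecture on this slide can be sketched in a few lines of Python. Everything below (the mapping tables, the toy rows, the attribute names) is an illustrative assumption, not the actual system: a query over the mediated schema is reformulated against each source schema, and the answers are translated back and merged.

```python
# Toy sketch of the mediator/wrapper architecture. All names and data
# below are invented for illustration.

# Mediated-schema attribute -> source-schema attribute, per source
MAPPINGS = {
    "homes.com":      {"beds": "bedrooms", "price": "listed-price"},
    "realestate.com": {"beds": "num-beds", "price": "price"},
}

# Each "wrapper" returns rows in its own source schema
SOURCES = {
    "homes.com":      [{"bedrooms": 4, "listed-price": 280000},
                       {"bedrooms": 3, "listed-price": 150000}],
    "realestate.com": [{"num-beds": 4, "price": 320000}],
}

def answer_query(beds, max_price):
    """Reformulate a mediated-schema query against every source and merge."""
    answers = []
    for source, rows in SOURCES.items():
        attr = MAPPINGS[source]
        for row in rows:
            # Translate the source row into the mediated schema
            house = {"beds": row[attr["beds"]], "price": row[attr["price"]]}
            if house["beds"] == beds and house["price"] < max_price:
                answers.append((source, house))
    return answers

print(answer_query(4, 300000))
```

The hard part in practice is not this loop but obtaining the `MAPPINGS` table, which is exactly the schema-matching problem discussed next.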

5. Current State of Affairs
Vibrant research & industrial landscape
Research since the 70s, accelerated in the past decade
–database, AI, Web, KDD, Semantic Web communities
–14+ workshops in the past 3 years: ISWC-03, IJCAI-03, VLDB-04, SIGMOD-04, DILS-04, IQIS-04, ISWC-04, WebDB-05, ICDE-05, DILS-05, IQIS-05, IIWeb-06, etc.
–main database focuses: modeling, architecture, query processing, schema/tuple matching; building specialized systems (life sciences, Deep Web, etc.)
Industry
–53 startups in 2002 [Wiederhold-02]
–many new ones in 2005
Despite much R&D activity, however...

6. DI Systems Are Still Very Difficult to Build and Maintain
Builder must execute multiple tasks: select data sources, create wrappers, create mediated schemas, match schemas, eliminate duplicate tuples, monitor changes, etc.
Most tasks are extremely labor intensive
Total cost often at 35% of IT budget [Knoblock et al. 02]
–systems often take months or years to develop
High cost severely limits deployment of DI systems

7. Data Integration Research at Illinois
Directions:
–automate tasks to minimize human labor
–leverage users to spread out the cost
–simplify tasks so that they can be done quickly

8. Sample Research on Automating Integration Tasks: Schema Matching
[Diagram: mediated schema (price, agent-name, address) matched against the homes.com schema (listed-price, contact-name, city, state), with sample tuples "320K, Jane Brown, Seattle, WA" and "240K, Mike Smith, Miami, FL"; 1-1 matches such as price = listed-price vs. complex matches such as address combining city and state]
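As a rough illustration of how a 1-1 matcher might exploit the attribute names on this slide, here is a hedged sketch using token-overlap (Jaccard) similarity over hyphen-split names. Real matchers also use data values, types, and structure; the scoring rule here is an assumption of this sketch.

```python
# Hedged sketch of a name-based 1-1 matcher. Attribute names come from the
# slide; the token-overlap scoring is an illustrative simplification.

def tokens(name):
    # Split "listed-price" into {"listed", "price"}
    return set(name.lower().replace("-", " ").split())

def name_similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)   # Jaccard similarity of name tokens

def best_match(mediated_attr, source_attrs):
    """Pick the source attribute whose name is most similar."""
    return max(source_attrs, key=lambda s: name_similarity(mediated_attr, s))

source_schema = ["listed-price", "contact-name", "city", "state"]
print(best_match("price", source_schema))       # name overlap favors listed-price
print(best_match("agent-name", source_schema))  # favors contact-name
```

Note that a name matcher alone cannot find the complex match address = (city, state): no name tokens overlap, which is why multi-matcher systems such as LSD combine several kinds of evidence.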

9. Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
Databases
–data integration, model management
–data translation, collaborative data sharing
–keyword querying, schema/view integration
–data warehousing, peer data management, ...
AI
–knowledge bases, ontology merging, information-gathering agents, ...
Web
–e-commerce, Deep Web, Semantic Web, Google Base, next version of My Web 2.0?
eGovernment, bio-informatics, e-sciences

10. Why Schema Matching is Difficult
Schema & data never fully capture semantics!
–not adequately documented
Must rely on clues in schema & data
–names, structures, types, data values, etc.
Such clues can be unreliable
–same name, different entities: area may mean location or square-feet
–different names, same entity: area and address may both mean location
Intended semantics can be subjective
–house-style = house-description?
Cannot be fully automated; needs user feedback

11. Current State of Affairs
Schema matching is now a key bottleneck!
–largely done by hand, labor intensive & error prone
–data integration at GTE [Li & Clifton, 2000]: 40 databases, thousands of schema elements, estimated time 12 years
Numerous matching techniques have been developed
–Databases: IBM Almaden, Wisconsin, Microsoft Research, Purdue, BYU, George Mason, Leipzig, NCSU, Illinois, Washington, ...
–AI: Stanford, Toronto, Rutgers, Karlsruhe University, NEC, USC, ...
–"everyone and his brother is doing ontology mapping"
Techniques are often synergistic, leading to multi-component matching architectures
–each component employs a particular technique
–final predictions combine those of the components

12. Example: LSD [Doan et al. SIGMOD-01]
[Diagram: source data from homes.com (attributes area, contact-agent, comments, with values such as "Urbana, IL", "Seattle, WA", agent names and phone numbers) is matched against the mediated schema (address, agent-name, agent-phone, description) by a Name Matcher and a Naive Bayes Matcher; a Combiner merges their predictions, e.g. area => (address, 0.7), (description, 0.3); contact-agent => (agent-phone, 0.7), (agent-name, 0.3); comments => (address, 0.6), (desc, 0.4); a Constraint Enforcer applies constraints such as "only one attribute of the source schema matches address"; a Match Selector outputs area = address, contact-agent = agent-phone, ..., comments = desc]
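The combiner-then-selector step of an LSD-style system can be sketched as follows. The per-matcher scores and the equal weights below are invented for illustration; LSD learns its combination weights from training data.

```python
# Sketch of an LSD-style combiner + match selector. Scores and weights
# are illustrative assumptions, not learned values.

def combine(predictions, weights):
    """Weighted sum of the score dictionaries produced by the base matchers."""
    combined = {}
    for matcher, scores in predictions.items():
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weights[matcher] * score
    return combined

# Two base matchers score the source attribute "contact-agent"
predictions = {
    "name_matcher": {"agent-phone": 0.2, "agent-name": 0.8},
    "naive_bayes":  {"agent-phone": 0.9, "agent-name": 0.1},
}
weights = {"name_matcher": 0.5, "naive_bayes": 0.5}

scores = combine(predictions, weights)
best_label = max(scores, key=scores.get)   # the match selector's pick
print(best_label, scores[best_label])
```

A constraint enforcer would run between these two steps, reranking the combined scores so that, e.g., only one source attribute can match address.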

13. Multi-Component Matching Solutions
Such systems are very powerful...
–maximize accuracy; highly customizable
... but place a serious tuning burden on domain users
[Diagram: the component stacks (matchers, combiner, match selector, constraint enforcer) of LSD, COMA, SF, and LSD-SF]
Introduced in [Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al. 02]
Now commonly adopted, with industrial-strength systems
–e.g., Protoplasm [MSR], COMA++ [Univ. of Leipzig]

14. Tuning Schema Matching Systems
[Diagram: a library of matching components is assembled into an execution graph: matchers (q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naive Bayes matcher, SVM matcher), combiners (average, min, max, weighted sum), match selectors (threshold, bipartite graph), and constraint enforcers (A* search, relaxation labeler, ILP); knobs of the decision tree matcher alone include characteristics of attributes, post-prune?, size of validation set, and split measure]
Given a particular matching situation
–how to select the right components?
–how to adjust the multitude of knobs?
Untuned versions produce inferior accuracy

15. But Tuning is Extremely Difficult
Large number of knobs
–e.g., 8-29 in our experiments
Wide variety of techniques
–database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configurations
Long-standing problem since the 80s, getting much worse with multi-component systems
→ Developing efficient tuning techniques is now crucial

16. The eTuner Solution [VLDB-05a]
Given schema S & matching system M
–tune M to maximize average accuracy of matching S with future schemas
–this setting commonly occurs in data integration, warehousing, supply chain
Challenge 1: Evaluation
–score each knob config K of matching system M; return K*, the one with the highest score
–but how to score knob config K?
–if we knew a representative workload W = {(S,T1), ..., (S,Tn)} and the correct matches between S and T1, ..., Tn → we could use W to score K
Challenge 2: Huge or infinite search space
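The evaluation idea (score each knob configuration by its average accuracy on a workload whose correct matches are known) can be sketched as follows. The toy matcher, its single "threshold" knob, and the tiny workload are all assumptions for illustration; eTuner searches a far larger configuration space.

```python
# Sketch of scoring knob configurations against a workload with known
# correct matches. The matcher and workload are invented toys.

def toy_matcher(source_attrs, target_attrs, threshold):
    """Predict s = t when the fraction of s's name tokens found in t's name
    meets the threshold (the single knob of this toy matcher)."""
    preds = {}
    for s in source_attrs:
        s_tokens = set(s.split("-"))
        for t in target_attrs:
            overlap = len(s_tokens & set(t.split("-")))
            if overlap / len(s_tokens) >= threshold:
                preds[s] = t
    return preds

def accuracy(preds, gold):
    return sum(preds.get(s) == t for s, t in gold.items()) / len(gold)

# Workload: (source attrs, target attrs, correct matches)
workload = [
    (["listed-price"], ["price", "agent-phone"], {"listed-price": "price"}),
    (["contact-name"], ["name", "city"],         {"contact-name": "name"}),
]

def score(threshold):
    """Average accuracy of this knob setting over the whole workload."""
    return sum(accuracy(toy_matcher(S, T, threshold), gold)
               for S, T, gold in workload) / len(workload)

best_threshold = max([0.25, 0.5, 0.75, 1.0], key=score)
print(best_threshold, score(best_threshold))
```

With real systems the knob space is huge (Challenge 2), so eTuner uses staged search rather than the exhaustive loop shown here.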

17. Solving Challenge 1: Generate Synthetic Input/Output
Need workload W = {(S,T1), (S,T2), ..., (S,Tn)}
To generate W
–start with S
–perturb S to generate T1
–perturb S to generate T2
–etc.
Know the perturbation → know the matches between S & Ti

18. Generate Synthetic Input/Output
[Diagram: schema S is the table Employees(id, first, last, salary ($)) with tuples (1, Bill, Laup, 40,000$) and (2, Mike, Brown, 60,000$); perturbing the number of columns drops "first"; perturbing table and column names yields Emps(emp-last, id, wage); perturbing data tuples changes the rows; the matches are known by construction: id = id, first = NONE, last = emp-last, salary = wage]
Make sure tables do not share tuples
Rules are applied probabilistically
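A minimal sketch of the perturbation idea, assuming an invented rename table and drop probability: because we apply the perturbation ourselves, the gold matches between S and each synthetic schema come for free.

```python
# Sketch of workload generation by schema perturbation. The rename rules
# and drop probability are illustrative assumptions.
import random

RENAMES = {"last": "emp-last", "salary": "wage"}   # column-name perturbation

def perturb(schema, drop_prob=0.5, seed=0):
    """Return (perturbed schema, gold matches) for one synthetic scenario."""
    rng = random.Random(seed)
    perturbed, gold = [], {}
    for col in schema:
        if col != "id" and rng.random() < drop_prob:
            gold[col] = None            # perturb # of columns: drop this one
            continue
        new_name = RENAMES.get(col, col)  # rename if a rule applies
        perturbed.append(new_name)
        gold[col] = new_name            # the match is known by construction
    return perturbed, gold

schema = ["id", "first", "last", "salary"]
target, matches = perturb(schema)
print(target, matches)
```

Different seeds yield different T_i; repeating this n times produces the workload W = {(S,T1), ..., (S,Tn)} together with its correct matches.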

19. The eTuner Architecture
[Diagram: schema S feeds a Workload Generator (using Perturbation Rules) that produces a synthetic workload {(S, Ω1, T1), (S, Ω2, T2), ..., (S, Ωn, Tn)}; a Staged Searcher (using Tuning Procedures) tunes Matching Tool M against the workload, producing the tuned matching tool M_S]
More details / experiments in
–Sayyadian et al., VLDB-05

20. eTuner: Current Status
Only the first step
–but now we have a line of attack on a long-standing problem
Current directions
–find optimal synthetic workload
–develop faster search methods
–extend to other matching scenarios
–adapt the ideas to scenarios beyond schema matching: wrapper maintenance [VLDB-05b], domain-specific search engines?

21. Automate Integration Tasks: Summary
Schema matching
–architecture: WebDB-00, SIGMOD-01, WWW-02
–long-standing problems: SIGMOD-04a, eTuner [VLDB-05a]
–learning/other techniques: CIDR-03, VLDBJ-03, MLJ-03, WebDB-03, SIGMOD-04b, ICDE-05a, ICDE-05b
–novel problem: debugging schemas for interoperability [ongoing]
–industry transfer: involving 2 startups
–promoting the research area: workshop at ISWC-03, special issues in SIGMOD Record-04 & AI Magazine-05, survey
Query reformulation: ICDE-02
Mediated schema construction: SIGMOD-04b, ICDM-05, ICDE-06
Duplicate tuple removal: AAAI-05, Tech Reports 06a, 06b
Wrapper maintenance: VLDB-05b

22. Research Directions
Automate integration tasks
–to minimize human labor
Leverage users
–to spread the cost
Simplify integration tasks
–so that they can be done quickly

23. The MOBS Project
MOBS = Mass Collaboration to Build Systems
Learn from a multitude of users to improve tool accuracy, thus significantly reducing builder workload
[Diagram: the system poses questions to users and learns from their answers]

24. Mass Collaboration
Build software artifacts
–Linux, Apache server, other open-source software
Knowledge bases, encyclopedias
–wikipedia.com
Review & technical support websites
–amazon.com, epinions.com, quiq.com
Detect software bugs
–[Liblit et al. PLDI 03 & 05]
Label images/pages on the Web
–ESP game, flickr, del.icio.us, My Web 2.0
Improve search engines, recommender systems
Why not data integration systems?

25. Example: Duplicate Data Matching
Hard for machines, but easy for humans
–"Mouse for Dell laptop 200 series..."
–"Dell X200; mouse at reduced price..."
–"Dell laptop X200 with mouse..."
Serious problem in many settings (e.g., e-commerce)
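A hedged sketch of why this is hard for machines: a simple token-based similarity (Jaccard over lowercased, punctuation-stripped tokens; both the tokenizer and the 0.3 threshold are assumptions of this sketch) catches some of the duplicate pairs above but misses others at any fixed threshold.

```python
# Token-based duplicate detection sketch for the slide's three listings.

def tokens(s):
    # Lowercase and strip punctuation so "X200;" and "x200" agree
    cleaned = "".join(c if c.isalnum() else " " for c in s.lower())
    return set(cleaned.split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

listings = [
    "Mouse for Dell laptop 200 series",
    "Dell X200; mouse at reduced price",
    "Dell laptop X200 with mouse",
]

THRESHOLD = 0.3
duplicates = [(i, j)
              for i in range(len(listings))
              for j in range(i + 1, len(listings))
              if jaccard(listings[i], listings[j]) >= THRESHOLD]
print(duplicates)
```

At this threshold the pair (0, 1) is missed even though a human sees all three as the same product, which is exactly the gap that asking users is meant to close.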

26. Key Challenges
How to modify tools to learn from users?
How to combine noisy user answers?
–multiple noisy oracles
–build user models, learn them via interaction with users
–a novel form of active learning
How to obtain user participation?
–data experts are often willing to help (e.g., Illinois Fire Service)
–employees may be asked to help (e.g., in e-commerce settings)
–volunteers (e.g., online communities), "payment" schemes

27. Current Status
Developed first-cut solutions
–built a prototype, experimented with users, for source discovery and schema matching
–improved accuracy by 9-60%, reduced workload by 29-88%
Built two simple DI systems on the Web
–almost exclusively with users
Building a real-world application
–DBlife (more later)
See [McCann et al., WebDB-03, ICDE-05, AAAI Spring Symposium-05, Tech Report-06]

28. Research Directions
Automate integration tasks
–to minimize human labor
Leverage users
–to spread the cost
Simplify integration tasks
–so that they can be done quickly

29. Simplify Mediated Schema → Keyword Search over Multiple Databases
Novel problem
Very useful for urgent / one-time DI needs
–also when users are SQL-illiterate (e.g., Electronic Medical Records)
–also on the Web (e.g., when data is tagged with some structure)
Solution [Kite, Tech Report 06a]
–combines IR, schema matching, data matching, and AI planning

30. Simplify Wrappers → Structured Queries over Text/Web Data
Novel problem
–attracting attention from database / AI / Web researchers at Columbia, IBM TJ Watson/Almaden, UCLA, IIT-Bombay
–[SQOUT, Tech Report 06b], [SLIC, Tech Report 06c]
[Diagram: SQL queries (SELECT... FROM... WHERE...) posed over emails, text, Web data, news, etc.]

31. Research Directions
Automate integration tasks
–to minimize human labor
Leverage users
–to spread the cost
Simplify integration tasks
–so that they can be done quickly
Integration is difficult → do best-effort integration, integrate with text, leverage humans
Build on this to promote Community Information Management

32. Community Information Management
Numerous communities on the Web
–database researchers, movie fans, legal professionals, bioinformatics, etc.
–enterprise intranets, tech support groups
Each community = many disparate data sources + people
Members often want to query, monitor, and discover information
–any interesting connection between researchers X and Y?
–list all courses that cite this paper
–find all citations of this paper in the past week on the Web
–what is new in the past 24 hours in the database community?
–which faculty candidates are interviewing this year, and where?
Current integration solutions fall short of addressing such needs

33. The Cimple Project [Illinois/Wisconsin]
A software platform that can be rapidly deployed and customized to manage data-rich online communities
[Diagram: raw data (researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP, Web pages, text documents) is extracted into an entity-relationship graph, e.g. "Jim Gray give-talk SIGMOD-04"; services on top include keyword search, SQL querying, question answering, browsing, mining, alert/monitor, news summary, importing & personalizing data, tagging entities/relationships, creating new content, sharing/aggregation, and context-dependent services]

34. Prototype System: DBlife
–1164 data sources, crawled daily
–pages / day → 160+ MB
–people mentions → persons

35. Structure-Related Challenges
Extraction
–better blackboxes, composing blackboxes, exploiting domain knowledge
Maintenance
–critical, but very little has been done
Exploitation
–keyword search over extracted structure? SQL queries?
–detecting interesting events?

36. User-Related Challenges
Users should be able to
–import whatever they want
–correct/add to the imported data
–extend the ER schema
–create new content to share/exchange
–ask for context-dependent services
Examples
–a user imports a paper; the system provides a bib item
–a user imports a movie, adds a description, tags it for exchange
Challenges
–providing incentives, payment
–handling malicious/spam users
–sharing / aggregating user activities, actions, and content

37. Comparison to the Current My Web 2.0
Cimple focuses on domain-specific communities
–not the entire Web
Goes beyond the page level
–also considers finer granularities of entities / relations / attributes
–leverages automatic "best-effort" data integration techniques
Leverages user feedback to further improve accuracy
–thus combines automatic techniques and human effort
Considers the entire range of search + structured queries
–and how to seamlessly move between them
Allows personalization and sharing
–considers context-dependent services beyond keyword search (e.g., selling, exchange)

38. Applying Cimple to My Web 2.0: An Example
Going beyond just sharing Web pages
Leveraging My Web 2.0 for other actions
–e.g., selling, exchanging goods (turning it into a classified-ads platform?)
E.g., I want to sell my house
–create a page describing the house
–save it to my account on My Web 2.0
–tag it with "sell:house, sell, house, champaign, IL"
–took me less than 5 minutes (not including creating the page)
–now if someone searches for any of these keywords ...

39. [Screenshot]

40. [Screenshot]

41. Here a button can be added to facilitate the "sell" action → provide context-dependent services

42. The Big Picture [Speculative Mode]
[Diagram: three worlds: structured data (relational, XML; Database: SQL), unstructured data (text, Web, etc.; IR/Web/AI/Mining: keyword search, QA; Semantic Web), and a multitude of users (Industry/Real World)]
Many apps will involve all three
Exact integration will be difficult
–best-effort is promising
–should leverage humans
Apps will want a broad range of services
–keyword search, SQL queries
–buy, sell, exchange, etc.

43. Summary
Data integration: a crucial problem
–at the intersection of database, AI, Web, IR
Directions in my group at Illinois:
–automate tasks to minimize human labor
–leverage users to spread out the cost
–simplify tasks so that they can be done quickly
Best-effort integration; should leverage humans
The Cimple Project [Illinois/Wisconsin]
–builds on current work to study Community Information Management
A step toward managing structured data + text + users synergistically!
See "anhai" on Yahoo for more details