1 Evolving and Self-Managing Data Integration Systems AnHai Doan, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Spring 2004
2 Data Integration at UIUC Two main players –Kevin Chang and AnHai Doan 10 students, 30001 cups of coffee, 3 SIGMOD-04 papers Four supporting players –Chengxiang Zhai: IR, bioinformatics, text/data integration –Dan Roth: AI, question answering, text/data integration –Jiawei Han: data mining –Marianne Winslett: security/privacy issues in data sharing Many supporting departments and local organizations –NCSA, Information Science, Genome Institute, Fire Service Institute
3 Data Integration Challenge A new faculty member wants to find houses with 3 bedrooms priced under 300K, with the relevant listings spread across homes.com, realestate.com, and homeseekers.com.
4 Architecture of Data Integration System A user query ("Find houses with 3 bedrooms priced under 300K") is posed over a mediated schema and translated to the source schemas of homes.com, realestate.com, and homeseekers.com, each accessed through a wrapper; the schemas use different attribute names, e.g., list-price, bdrms, address vs. price, num-beds, location, agent-name. Think "comparison shopping systems on steroids"...
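To make this architecture concrete, here is a minimal Python sketch of how a query over the mediated schema could be reformulated for each source using semantic mappings. The mapping dictionaries and attribute names are simplified from the example above; this is an illustration, not the actual system.

```python
# Minimal sketch of mediated-schema query reformulation (hypothetical names,
# not the actual AIDA/UIUC implementation).

# Semantic mappings: mediated-schema attribute -> source-schema attribute.
MAPPINGS = {
    "homes.com":      {"price": "list-price", "bedrooms": "bdrms",    "address": "address"},
    "realestate.com": {"price": "price",      "bedrooms": "num-beds", "address": "location"},
}

def reformulate(mediated_query, source, mappings=MAPPINGS):
    """Translate a query over the mediated schema into a query over one source.

    mediated_query: dict of mediated attribute -> (operator, value),
    e.g. {"bedrooms": ("=", 3), "price": ("<", 300_000)}.
    """
    source_map = mappings[source]
    return {source_map[attr]: cond for attr, cond in mediated_query.items()
            if attr in source_map}

query = {"bedrooms": ("=", 3), "price": ("<", 300_000)}
for src in MAPPINGS:
    print(src, reformulate(query, src))
# homes.com      {'bdrms': ('=', 3), 'list-price': ('<', 300000)}
# realestate.com {'num-beds': ('=', 3), 'price': ('<', 300000)}
```

The hard part, of course, is obtaining the MAPPINGS dictionaries in the first place; that is the schema-matching problem discussed in the following slides.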
5 The Need for Data Integration is Ubiquitous! In virtually all domains –data are distributed & stored in heterogeneous formats WWW –hundreds/thousands of sources in bioinformatics, real estate, books, etc. Enterprises –avg. organization has 49 databases [Ives-01] –organizations frequently merge, exchange data Government: e.g., digital government initiatives Military, cultural & international exchange, Semantic Web, information agents, etc. Long-standing challenge in the database community –recent explosion of distributed data adds urgency
6 Current State of Affairs Vibrant research & industrial landscape Research –dates back to the 70s-80s, accelerated in the 90s –Stanford, UPenn, AT&T Labs, Maryland, UWashington, Wisconsin, IBM Almaden, ISI, Arizona State U, Ireland, CMU, etc. –many workshops in AI and DB communities: e.g., SIGMOD/VLDB-04 –focused on conceptual & algorithmic aspects and on building systems in specific domains (bio, geo-spatial, rapid emergency response, virtual organizations, etc.) Industry –more than 50 startups in 2001, new startups in 2004 Despite much R&D activity, however...
7 Current State of Affairs (cont.)... Most DI systems are still built & maintained manually Manual deployment is extremely labor-intensive... –construct mediated & source schemas –find semantic mappings between schemas –constantly monitor & adjust to changes at hundreds or thousands of data sources... ... and has become a key bottleneck Emerging technologies –XML, Web services, Semantic Web,... will further fuel DI applications & exacerbate the problem Slashing the astronomical cost of ownership is now crucial!
8 The AIDA Project Recently started at Univ of Illinois –AIDA = Automatic Integration of Data Goal: evolving and self-managing data integration systems Easy to start –takes hours instead of weeks or months –perhaps with just a few sources Learn to continuously improve –expand to cover new sources –add novel query capabilities, better query performance Adjust automatically to changes –detect and fix broken wrappers, semantic matches, etc. Require minimal effort from system admin –some effort at the start –far less as the system learns more and more
9 The AIDA Project (cont.) In line with trends in the broader computing landscape –autonomic systems (IBM initiative) –recovery-oriented computing (Berkeley) –cognitive computer systems (DARPA) –from cycles to RASS (Stanford) –self-tuning databases (MSR, IBM Almaden, Oracle) Key differences –applied to distributed data management systems –must attack difficult semantics/meta-data issues –heavy involvement of humans –must handle large scale Need techniques from multiple fields –databases, machine learning, AI, IR, data mining
10 Project Overview Thrust 1: automate current labor-intensive tasks –schema matching –mediated schema construction –entity matching Thrust 2: develop new capabilities –entity integration Thrust 3: monitor & adjust to changes Thrust 4: reduce cost of system admin –by leveraging the mass of users Thrust 5: design sources for interoperability
11 Schema Matching Mediated schema: price, agent-name, address. Source schema (homes.com): listed-price, contact-name, city, state, with tuples such as (320K, Jane Brown, Seattle, WA) and (240K, Mike Smith, Miami, FL). 1-1 matches: price = listed-price, agent-name = contact-name; complex match: address = concat(city, state).
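A toy sketch of what an automatic matcher for this example might do, combining attribute-name similarity with overlap of data values. It is far simpler than the LSD/iMAP systems discussed later; the weights and the tiny schemas are invented for illustration.

```python
# Illustrative sketch only: a toy 1-1 matcher based on name and value clues,
# far simpler than LSD/iMAP; schemas and data are the slide's example.
from difflib import SequenceMatcher

mediated = {"price": ["320K", "240K"], "agent-name": ["Jane Brown", "Mike Smith"]}
source   = {"listed-price": ["320K", "240K"], "contact-name": ["Jane Brown", "Mike Smith"]}

def name_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def value_sim(xs, ys):
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs | ys)          # Jaccard overlap of data values

def best_match(m_attr, m_vals, src):
    scored = [(0.5 * name_sim(m_attr, s_attr) + 0.5 * value_sim(m_vals, s_vals), s_attr)
              for s_attr, s_vals in src.items()]
    return max(scored)

for attr, vals in mediated.items():
    score, match = best_match(attr, vals, source)
    print(f"{attr} -> {match}  (score {score:.2f})")

# A complex match cannot be expressed attribute-to-attribute; it needs a formula,
# e.g. address = concat(city, ', ', state) in the example above.
```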
12 Why Schema Matching is Difficult Schema & data never fully capture semantics! –not adequately documented Must rely on clues in schema & data –using names, structures, types, data values, etc. Such clues can be unreliable –same name => different entities: area => location or square-feet –different names => same entity: area & address can both mean location Intended semantics can be subjective –house-style = house-description? –military apps require committees! Cannot be fully automated, needs work from system admin!
13 Current State of Affairs Largely done by hand –labor intensive & error prone –data integration at GTE [Li&Clifton, 2000] –40 databases, 27000 elements, estimated time: 12 years Need semi-automatic approaches to scale up! Numerous prior & current research projects –Databases: SemInt (Northwestern), DELTA (MITRE), IBM Almaden, Microsoft Research, Wisconsin, Toronto, UC-Irvine, BYU, George Mason, U of Leipzig,... –AI: Stanford, Karlsruhe University, NEC Japan, ISI,... Many startups
14 Our Prior & Ongoing Work [2000-date] Joint work with –Robin Dhamankar, Yoonkyong Lee, Wensheng Wu, Rob McCann, Warren Shen, Alex Kramnik, Olu Sobulo, Vanitha Varadarajan (Illinois), Pedro Domingos, Alon Halevy (U Washington) Learning 1-1 matches for relational & XML schemas –LSD (Learning Source Description) system [WebDB-00, SIGMOD-01, Machine Learning Journal-03] Learning 1-1 & complex matches for ontologies –GLUE [WWW-02, VLDB Journal-03, Ontology Handbook-03] Learning 1-1 matches by mass collaboration –MOBS [WebDB-03, IJCAI-03 Workshop] Learning complex matches for relational schemas: iMAP [SIGMOD-04] Large-scale matching via clustering: IceQ [SIGMOD-04] Corpus-based schema matching [submitted] Further resources –brief survey talk at http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt –"Learning to Match Structured Representations of Data" [book by Springer-Verlag, to appear]
15 Mediated Schema Construction Joint work with –Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng (SUNY Binghamton) IceQ project –given a set of source query interfaces –construct a mediated schema Step 1: find matches among source query interfaces –use clustering [SIGMOD-04] Step 2: use the found matches to construct the mediated schema (ongoing work) Future work –given a lot of text in the domain, construct a mediated schema
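A rough sketch of the clustering step: group similar field labels across source query interfaces and keep one representative per cluster as a mediated-schema attribute. The interface fields below are made up, and this greedy procedure is only a drastic simplification of the SIGMOD-04 approach.

```python
# Toy sketch of mediated-schema construction by clustering the fields of
# source query interfaces (made-up labels; a drastic simplification of the
# clustering approach in the SIGMOD-04 paper).

interfaces = [
    ["title", "author", "price"],
    ["book title", "author name", "price range"],
    ["title", "author", "list price"],
]

def sim(a, b):
    """Word-overlap (Jaccard) similarity between two field labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Greedy clustering: attach a field to the first cluster whose representative
# (its first member) is similar enough, otherwise start a new cluster.
THRESHOLD = 0.5
clusters = []
for form in interfaces:
    for field in form:
        for cluster in clusters:
            if sim(field, cluster[0]) >= THRESHOLD:
                cluster.append(field)
                break
        else:
            clusters.append([field])

# One mediated-schema attribute per cluster, named by the shortest label.
mediated_schema = [min(c, key=len) for c in clusters]
print(clusters)
print("mediated schema:", mediated_schema)   # -> ['title', 'author', 'price']
```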
16 Project Overview Thrust 1: automate current labor-intensive tasks –schema matching –mediated schema construction –entity matching Thrust 2: develop new capabilities –entity integration Thrust 3: monitor & adjust to changes Thrust 4: reduce cost of system admin –by leveraging the mass of users Thrust 5: design sources for interoperability
17 Entity Matching Tuples from different sources, each with attributes (PRICE, LOCATION, PHONE, NAME): (400K, Queen Ann – Seattle, 206-616-1842, Mike Brown)... (250K, Decatur, 317-727-2459, P. Robertson) (400K, Seattle, 616-1842, Mike Brown)... (400K, Queen Ann – Seattle, 206-616-1842, M. Brown) (320K, S. W. Champaign, 217-727-1999, Jane Smith)... Which of these refer to the same real-world entity?
18 Prior Work Very active area of research –databases: [Hernandez&Stolfo, SIGMOD-95], [Cohen, SIGMOD-98], [Elfeky&Verykios&Elmagarmid, ICDE-02],... –AI: [Cohen&Richman, KDD-02], [Bilenko&Mooney, 02], Dan Roth group, [Tejada et al., 01], [Tejada et al., KDD-02], [Michalowski et al., 03],... Much progress –very effective techniques for many applications –covered a broad range of scenarios Key commonality –assume entities from disparate sources have the same set of attributes –e.g., (price, location, phone, name) vs. (price, location, phone, name) –match entities based on similarity of corresponding attributes
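To illustrate this common setup, here is a toy pairwise matcher over tuples that share the schema (price, location, phone, name). The similarity measure and the threshold are made up and are not those of any of the cited systems.

```python
# Toy baseline entity matcher (illustrative only): both tuples share the schema
# (price, location, phone, name); match on the average attribute similarity.
from difflib import SequenceMatcher

def str_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_sim(t1, t2):
    # Average the per-attribute string similarities (a real system would use
    # attribute-specific similarity functions and learned weights).
    return sum(str_sim(a, b) for a, b in zip(t1, t2)) / len(t1)

t1 = ("400K", "Queen Ann - Seattle", "206-616-1842", "Mike Brown")
t2 = ("400K", "Seattle", "616-1842", "M. Brown")
t3 = ("320K", "S. W. Champaign", "217-727-1999", "Jane Smith")

THRESHOLD = 0.6   # made-up cutoff
for cand in (t2, t3):
    s = tuple_sim(t1, cand)
    print(cand, round(s, 2), "MATCH" if s >= THRESHOLD else "no match")
```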
19 Our PROM Approach Key observation 1: Entities often have disjoint attributes –source V1: (age, name) –source V2: (name, salary) –source S1: (location,description,phone,name) –source S2: (description,phone,name, price,sq-feet) Key observation 2: Correlations among disjoint attributes can be exploited to maximize matching accuracy! –e.g., (9, “Mike Brown”) vs. (“M. Brown”, $200K) a 9-year-old is unlikely to make $200K!
20 A Profile-Based Solution Consider again matching persons –source V1: (age, name) –source V2: (name, salary) –(9, “Mike Brown”) vs. (“M. Brown”, $200K) Step 1: build a person profile –what does a “typical” person “look” like? –build from data & user input Step 2: match person names –“Mike Brown” vs. “M. Brown” => 0.7 –discard if confidence score is low, otherwise... Step 3: feed both tuples into profile –(9, “Mike Brown”, “M. Brown”, $200K) => 0.3
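The three steps above can be sketched in a few lines of Python. The profile rule and the numeric scores are hypothetical, chosen to mirror the 9-year-old example; this is not the actual PROM code.

```python
# Minimal sketch of the profile-based matching steps (hypothetical numbers;
# not the actual PROM implementation).
from difflib import SequenceMatcher

def name_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def person_profile(age, salary):
    """Step 1: a 'typical person' profile -- here a single hand-coded rule
    giving the plausibility that one person has this (age, salary) pair."""
    if age is not None and salary is not None and age < 16 and salary > 100_000:
        return 0.05          # a child earning a six-figure salary is implausible
    return 0.9

def match(v1_tuple, v2_tuple, name_threshold=0.5):
    age, name1 = v1_tuple          # source V1: (age, name)
    name2, salary = v2_tuple       # source V2: (name, salary)
    s = name_sim(name1, name2)     # Step 2: match person names
    if s < name_threshold:
        return 0.0                 # discard low-confidence name matches
    return s * person_profile(age, salary)   # Step 3: feed merged tuple into profile

print(match((9, "Mike Brown"), ("M. Brown", 200_000)))   # low: fails the profile
print(match((35, "Mike Brown"), ("M. Brown", 200_000)))  # higher: plausible
```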
21 Advantages of Profile-Based Solution Can exploit disjoint attributes to improve accuracy Profiles capture task-independent knowledge –created from task data, domain experts, external data –created once, used anywhere –an example of "knowledge construction and reuse" Yields an extensible, modular architecture –plug and play with new profiles [Architecture diagram: tuples t1 and t2 from tables T1 and T2 are scored by a Similarity Estimator and combined with hard profilers (user-specified constraints) via a Hard Profile Combiner and with soft profilers (built from training data, expert knowledge, domain data, and previous matching tasks) via a Soft Profile Combiner, yielding the matching pairs.]
22 Profilers Profilers encode information about domain concepts and can be constructed in many ways –Manual Profiler: manually encoded rules, specified by a domain expert; e.g., debut-year > birth-year –Association Rule Profiler: employs association rule mining techniques; encodes interesting association rules having high confidence; e.g., (birth-year < 1900) implies (#ODI-matches = 0) –Classifier Profiler: learns from training data; encodes high-confidence rules relating disjoint attributes; e.g., a decision tree –Instance Profiler: characteristics of a few frequent entities; e.g., profilers for the 10 most productive directors –Histogram Profiler: all possible value combinations for a set of attributes; e.g., (studio, movie-genre) –Completeness Profiler: categorical rules learned from external data that is complete in some aspect; e.g., color US movies are produced only after 1917
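To make two of these profiler kinds concrete, here is a hedged sketch of a manual (hard) profiler and an association-rule-style (soft) profiler, using the slide's cricket examples; the scores and the sample entity are invented.

```python
# Sketch of two profiler kinds from the taxonomy above (illustrative only).

def manual_profiler(entity):
    """Hard profiler: a manually encoded constraint, e.g. debut-year > birth-year.
    Returns False to veto a merged entity outright."""
    debut, birth = entity.get("debut-year"), entity.get("birth-year")
    if debut is not None and birth is not None and debut <= birth:
        return False
    return True

def association_rule_profiler(entity):
    """Soft profiler: a high-confidence mined rule, e.g.
    (birth-year < 1900) implies (#ODI-matches = 0); returns a plausibility score."""
    birth, odi = entity.get("birth-year"), entity.get("odi-matches")
    if birth is not None and odi is not None and birth < 1900 and odi > 0:
        return 0.01          # violates the rule: ODI cricket did not exist yet
    return 1.0

merged = {"birth-year": 1890, "debut-year": 1975, "odi-matches": 12}
print(manual_profiler(merged), association_rule_profiler(merged))
# -> True 0.01  (hard profiler passes, soft profiler flags the entity as implausible)
```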
23 Entity Matching: Empirical Evaluation –Improves accuracy significantly across six real-world domains –More profilers result in better performance
24 Entity Integration Problem: find all tuples related to a real-world entity –given a seed paper Chris C. Zhai, A. Kramnik, Hui Fang, "Query Processing", SIGMOD, 1998, find all papers by Chris C. Zhai from DBLP-Lite DBLP-Lite data source: (1) Christopher Zhai, A. Kramnik, "Data Warehousing", SIGMOD, 1998 (2) C. C. Zhai, H. Fang, "Data Mining", VLDB, 1999 (3) C. Zhai, D. Salesin, "Motion Capturing", SIGGRAPH, 1998 (4) C. Zhai, "Search Optimization", SIGIR, 1999 (5) Cheng Zhai, Bruce Croft, Jiawei Han, "Text Clustering", SIGIR, 1999 (6) Cheng Zhai, Bruce Croft, "Language Models", SIGIR, 2001 (7) A. Doan, H. Fang, "Semantic Integration", SIGMOD, 2000 Desired result: papers (1)-(2)
25 Baseline Solutions: Pairwise Matching Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, "Query Processing", SIGMOD, 1998 (papers (1)-(7) as on the previous slide) –If papers are matched based only on author names => retrieve (1)-(6) –If co-authors and conferences are also considered => retrieve (1)-(2), (4)-(6)
26 Better Solution: Apply Profilers to Pairwise Matching Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, "Query Processing", SIGMOD, 1998 (papers (1)-(7) as above) –Matching on author names alone retrieves (1)-(6); also considering co-authors and conferences retrieves (1)-(2), (4)-(6) –Aggregate property of that result: very active in both DB and IR, with 3 SIGMOD/VLDB papers and 3 SIGIR papers in 3 years –Doesn't fit the profile of a typical researcher!
27 Even Better Solution: Global Matching Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, "Query Processing", SIGMOD, 1998 (papers (1)-(7) as above) [Diagram: papers (5) and (6) cluster together as a "Cheng Zhai in IR" entity, and paper (4), C. Zhai, "Search Optimization", SIGIR, 1999, is assigned to that cluster rather than to the seed paper.]
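A rough sketch of the global-matching idea on this example: cluster the candidate papers by shared co-authors or venues, then assign whole clusters to the seed entity. The features and the greedy clustering below are illustrative stand-ins, not the method actually used in the talk.

```python
# Rough sketch of global matching on the DBLP-Lite example (illustrative only;
# not the actual algorithm). Each paper: (author set, venue, year); author names
# are assumed already normalized by the earlier name-matching step.
seed = ({"c. zhai", "a. kramnik", "h. fang"}, "sigmod", 1998)

papers = {
    1: ({"c. zhai", "a. kramnik"}, "sigmod", 1998),
    2: ({"c. zhai", "h. fang"}, "vldb", 1999),
    3: ({"c. zhai", "d. salesin"}, "siggraph", 1998),
    4: ({"c. zhai"}, "sigir", 1999),
    5: ({"c. zhai", "b. croft", "j. han"}, "sigir", 1999),
    6: ({"c. zhai", "b. croft"}, "sigir", 2001),
}

def related(p, q):
    """Candidates are linked if they share a co-author (besides Zhai) or a venue."""
    (a1, v1, _), (a2, v2, _) = p, q
    return len((a1 & a2) - {"c. zhai"}) > 0 or v1 == v2

# Greedy transitive clustering of the candidate papers.
clusters = []
for pid, paper in papers.items():
    for cluster in clusters:
        if any(related(paper, papers[q]) for q in cluster):
            cluster.add(pid)
            break
    else:
        clusters.append({pid})

# A whole cluster is assigned to the seed entity only if some member shares a
# co-author with the seed paper.
seed_authors = seed[0] - {"c. zhai"}
result = sorted(pid for c in clusters
                if any(papers[q][0] & seed_authors for q in c)
                for pid in c)
print(clusters)        # papers (4), (5), (6) end up together via the SIGIR venue
print("papers by the seed author:", result)   # -> [1, 2]
```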
28 Empirical Evaluation –Clustering improves performance over pairwise matching –Adding profilers improves performance over both clustering and pairwise matching
29 More Information on Entity Matching and Integration Context-based entity matching and integration –Tech. Report UIUC-03-2004 Profile-based object matching for information integration –A. Doan, Y. Lu, Y. Lee, and J. Han –IEEE Intelligent Systems, special issue on information integration, 2003 Object matching for data integration: a profile-based approach –A. Doan, Y. Lu, Y. Lee, and J. Han –Proc. of the IJCAI-03 workshop on information integration on the Web, 2003
30 Project Overview Thrust 1: automate current labor-intensive tasks –schema matching –mediated schema construction –entity matching Thrust 2: develop new capabilities –entity integration Thrust 3: monitor & adjust to changes Thrust 4: reduce cost of system admin –by leveraging the mass of users Thrust 5: design sources for interoperability
31 The Problem Numerous automatic tools have been developed for –schema matching, wrapper construction, source discovery, etc. No matter how good these tools are, the system admin still needs to –verify predictions of tools –correct wrong ones These tasks are still extremely labor-intensive –even worse when considering system maintenance System complexity overwhelms the capacity of the human admin Reducing the labor cost of system admin is critical! –perhaps the most important issue, to enable practical systems!
32 Solution: Shift Some Labor to Users Take some task or part of some task –e.g., schema matching, wrapper construction, source discovery Convert it into a series of very simple questions –such that knowing the answers = solving the task Ask users to answer questions –such that each user has to do very little work Spread the task labor thinly over a mass of users!
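A minimal sketch of these three steps for schema matching: each candidate attribute correspondence becomes a yes/no question, answers are collected from many users, and a plain majority vote combines them. The user answers below are hypothetical; MOBS itself combines answers more carefully, as described on the later slides.

```python
# Minimal sketch of converting schema matching into simple user questions and
# combining the answers by majority vote (illustrative only).
from collections import Counter

# Step 1: the task, decomposed into yes/no questions.
questions = [
    ("Does 'Amount' mean the same as 'Price'?", ("Amount", "Price")),
    ("Does 'Writer' mean the same as 'Author'?", ("Writer", "Author")),
    ("Does 'Cost' mean the same as 'Author'?",   ("Cost", "Author")),
]

# Step 2: answers gathered from many users, each doing very little work
# (hypothetical responses).
answers = {
    ("Amount", "Price"):  ["yes", "yes", "no", "yes"],
    ("Writer", "Author"): ["yes", "yes", "yes"],
    ("Cost", "Author"):   ["no", "no", "yes", "no"],
}

# Step 3: combine the answers -- here a plain majority vote.
for text, pair in questions:
    votes = Counter(answers[pair])
    verdict = votes.most_common(1)[0][0]
    print(f"{text}  ->  {verdict} ({dict(votes)})")
```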
33 Example: Mass Collaboration for Schema Matching [Diagram: three book tables with schemas (Price, Author), (Amount, Writer), and (Cost, Author); sample rows include ($21.99, Joseph Heller), ($14.99, John Steinbeck) and ($6.99, Upton Sinclair), ($12.99, Aldous Huxley), ($7.99, George Orwell). Users are asked which columns correspond, e.g., Price ? Amount, Price ? Cost.]
34 Mass Collaboration is not New Successfully applied to –open source software construction –knowledge base construction –collaborative software bug detection –collaborative filtering –annotating online pictures [CMU] Leverages both implicit and explicit feedback from users But has not been applied to data integration settings Can use both implicit and explicit feedback –focus here on explicit feedback
35 MOBS Project: Mass Collaboration to Build DI Systems Joint work with –Rob McCann, Alex Kramnik, Warren Shen, Vanitha Varadarajan, Olu Sobulo If it succeeds –can dramatically reduce cost & time –launch numerous DI systems on the Web & in enterprises Key challenges –how to break a task into a series of questions –how to entice users to answer questions –how to combine user answers (e.g., what to do with malicious users?) Illustrate baseline solutions via schema matching
36 Example: Book Domain Mediated Schema: title | author | year | price | category. Source schemas: S1 = a1 | a2 | a3 | a4 | a5, S2 = b1 | b2 | b3 | b4 | b5, S3 = c1 | c2 | c3 | c4 | c5, S4 = d1 | d2 | d3 | d4 | d5.
37 Build Partially Correct System [Same book-domain diagram: the mediated schema and source schemas S1, S2, S3, S4, with only some matches established so far.]
38 Solicit User Answers [Diagram: users answer simple questions about the candidate matches in the partially built system.]
39 Detect & Remove Bad Users
40 Combine User Answers [Same book-domain diagram: mediated schema title | author | year | price | category and source schemas S1, S2, S3, S4, with the matches resulting from the combined user answers.]
41 Combine User Answers
42 MOBS Challenges Revisited How to decompose a task into a series of questions? –task dependent; currently works for source discovery and 1-1 matching –if we can't solve the whole task, solving part of it (e.g., the wrapper) is still ok How to entice users to answer questions? –incentive models: monopoly or better-service applications –use helper applications –use volunteers How to evaluate users and combine their answers? –use machine learning: build a dynamic Bayesian network model –solicit user answers to questions with known answers –use these as training data to learn the network parameters More detail in [McCann et al. Tech Report 04, WebDB-03]
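A hedged sketch of the evaluation-and-combination idea: estimate each user's reliability from "planted" questions with known answers, then weight their votes on the real questions. The weighted vote below is only a stand-in for the dynamic Bayesian network actually used in MOBS, and all users and answers are hypothetical.

```python
# Sketch of evaluating users on known-answer questions and using the learned
# reliability to weight their votes (a stand-in for MOBS's Bayesian model).
from collections import defaultdict

# Each user's answers to 'planted' questions whose correct answer is known.
gold = {"q1": "yes", "q2": "no", "q3": "yes"}
user_gold_answers = {
    "alice": {"q1": "yes", "q2": "no", "q3": "yes"},   # always right
    "bob":   {"q1": "yes", "q2": "yes", "q3": "no"},   # often wrong
    "carol": {"q1": "yes", "q2": "no", "q3": "no"},
}

def reliability(answers, gold):
    """Laplace-smoothed accuracy on the known-answer questions."""
    correct = sum(answers[q] == a for q, a in gold.items())
    return (correct + 1) / (len(gold) + 2)

weights = {u: reliability(a, gold) for u, a in user_gold_answers.items()}

# Votes on a real (unknown-answer) question, weighted by reliability.
votes = {"alice": "yes", "bob": "no", "carol": "yes"}
score = defaultdict(float)
for user, answer in votes.items():
    score[answer] += weights[user]

print(weights)
print("combined answer:", max(score, key=score.get))
```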
43 MOBS Applicability Applied MOBS in many settings... –scale: from small-community intranets to high-traffic websites –users: from unpredictable novice users to cooperative experts... and to several DI tasks –Deep Web: form recognition, query interface matching –Surface Web: hub discovery, data extraction, mini-Citeseer
44 Simulation and Real-World Results
Real-world deployments (Name | Helper Application | Duration & Status | Progress | Precision | Recall | Avg User Workload):
–Form Recognition | DB course website, 132 undergrad students | 5 days, completed | completed 24/24 interfaces, found 17 bookstore interfaces | 1.0 (0.7 ML) | 0.89 (0.89 ML) | 7.4 answers
–Interface Matching | DB course website, 132 undergrad students | 7 days, stopped | completed 10/17 interfaces, matched 65 total attributes | 0.97 (0.63 ML) | 0.97 (0.63 ML) | 12.5 answers
–Hub Discovery | IR course website, 28 undergrad students | 21 days, stopped | completed 15/30 sites, found 15 hubs | 0.87 (0.27 ML) | 0.87 (0.27 ML) | 16.1 answers
–Mini-Citeseer | Google search engine, 21 researchers, friends, family | 19 days, completed | completed 17/17 pages (94 lists), found 19 pubs | 1.0 | 0.86 | 8.7 answers
Simulation settings: P1 – Uniform [0,1]; P2 – Uniform [0.3,0.7]; P3 – Uniform [0.5,0.9]; P4 – Bell [0,1]; P5 – Bell [0.3,0.7]; P6 – Bell [0.5,0.9]; P7 – Bimodal {0.2,0.8}; P8 – 90% Uniform [0,0.4], 10% {0.8}; P9 – 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]; P10 – 10% {0.3}, 90% Uniform [0.7,1]
45 Project Overview Thrust 1: automate current labor-intensive tasks –schema matching –mediated schema construction –entity matching Thrust 2: develop new capabilities –entity integration Thrust 3: monitor & adjust to changes Thrust 4: reduce cost of system admin –by leveraging the mass of users Thrust 5: design sources for interoperability
46 Summary The need for data integration is pervasive Manual data integration is a key bottleneck Our solution: AIDA project on autonomic DI systems Discussed problems –schema matching [SIGMOD-04] –mediated schema construction [SIGMOD-04] –entity matching & integration [Tech report 04] –mass collaboration [Tech report 04] Machine learning is the underlying technique Many implications beyond data integration context More information: “anhai” on Google