Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.

Slides:



Advertisements
Similar presentations
Università di Modena e Reggio Emilia ;-)WINK Maurizio Vincini UniMORE Researcher Università di Modena e Reggio Emilia WINK System: Intelligent Integration.
Advertisements

Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
A Prototype Implementation of a Framework for Organising Virtual Exhibitions over the Web Ali Elbekai, Nick Rossiter School of Computing, Engineering and.
CSE 636 Data Integration Data Integration Approaches.
Information Integration Using Logical Views Jeffrey D. Ullman.
1. The Digital Library Challenge The Hybrid Library Today’s information resources collections are “hybrid” Combinations of - paper and digital format.
Prentice Hall, Database Systems Week 1 Introduction By Zekrullah Popal.
0 General information Rate of acceptance 37% Papers from 15 Countries and 5 Geographical Areas –North America 5 –South America 2 –Europe 20 –Asia 2 –Australia.
1 Introduction to XML. XML eXtensible implies that users define tag content Markup implies it is a coded document Language implies it is a metalanguage.
Information Retrieval in Practice
Search Engines and Information Retrieval
BYU 2003BYU Data Extraction Group Combining the Best of Global-as-View and Local-as-View for Data Integration Li Xu Brigham Young University Funded by.
Information Retrieval in Practice
1 Introduction The Database Environment. 2 Web Links Google General Database Search Database News Access Forums Google Database Books O’Reilly Books Oracle.
2005Integration-intro1 Data Integration Systems overview The architecture of a data integration system:  Components and their interaction  Tasks  Concepts.
ReQuest (Validating Semantic Searches) Norman Piedade de Noronha 16 th July, 2004.
Bieber et al., NJIT © Slide 1 Digital Library Integration Masters Project and Masters Thesis Summer and Fall 2005 CIS 786 / CIS Fall.
CSE 636 Data Integration Overview. 2 Data Warehouse Architecture Data Source Data Source Relational Database (Warehouse) Data Source Users   Applications.
Microsoft ® Official Course Interacting with the Search Service Microsoft SharePoint 2013 SharePoint Practice.
Automatic Data Ramon Lawrence University of Manitoba
University of Kansas Data Discovery on the Information Highway Susan Gauch University of Kansas.
Overview of Search Engines
Quete: Ontology-Based Query System for Distributed Sources Haridimos Kondylakis, Anastasia Analyti, Dimitris Plexousakis Kondylak, analyti,
Search Engine Optimization
Databases & Data Warehouses Chapter 3 Database Processing.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
Integration of Biological Sources: Current Systems and Challenges Ahead ( Sigmod Record, Vol. 33. No. 3, September 2004 ) Thomas Hernandez & Sybbarao Kambhampati.
1 Overview of Database Federation and IBM Garlic Project Presented by Xiaofen He.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Search Engines and Information Retrieval Chapter 1.
Research Topics in Computing Data Modelling for Data Schema Integration 1 March 2005 David George.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
Building Search Portals With SP2013 Search. 2 SharePoint 2013 Search  Introduction  Changes in the Architecture  Result Sources  Query Rules/Result.
Peer-to-Peer Data Integration Using Distributed Bridges Neal Arthorne B. Eng. Computer Systems (2002) Supervisor: Babak Esfandiari April 12, 2005 Candidate.
1 Data Integration. 2 Motivating Examples An organization has on average 49 databases –can talk about the same topic, but use different vocabularies,
CSE 636 Data Integration Overview Fall What is Data Integration? The problem of providing uniform (sources transparent to user) access to (query,
Navigational Plans For Data Integration Marc Friedman Alon Levy Todd Millistein Presented By Avinash Ponnala Avinash Ponnala.
XML & Mediators Thitima Sirikangwalkul Wai Sum Mong April 10, 2003.
Audio and Video Chris McConnell Department of Radio-TV-Film November 30, 2006.
1 Lessons from the TSIMMIS Project Yannis Papakonstantinou Department of Computer Science & Engineering University of California, San Diego.
Aquenergy Portal Elisabetta Zuanelli, University of Rome “Tor Vergata”, Italy E-Age 2014 Muscat december.
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Scaling Heterogeneous Databases and Design of DISCO Anthony Tomasic Louiqa Raschid Patrick Valduriez Presented by: Nazia Khatir Texas A&M University.
Search Engines By: Faruq Hasan.
Information Integration BIRN supports integration across complex data sources – Can process wide variety of structured & semi-structured sources (DBMS,
Introduction to the Semantic Web and Linked Data
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
Automatic Metadata Discovery from Non-cooperative Digital Libraries By Ron Shi, Kurt Maly, Mohammad Zubair IADIS International Conference May 2003.
Metadata By N.Gopinath AP/CSE Metadata and it’s role in the lifecycle. The collection, maintenance, and deployment of metadata Metadata and tool integration.
JISC/NSF PI Meeting, June Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
Data Integration Approaches
1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet.
Integrated Departmental Information Service IDIS provides integration in three aspects Integrate relational querying and text retrieval Integrate search.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Search can be Your Best Friend You just Need to Know How to Talk to it IW 306 Ágnes Molnár.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
Information Retrieval in Practice
Search Engine Architecture
Architetture della Informazione Anno accademico Carlo Batini Methodologies for planning the evolution of data architectures 1.
Prepared by Rao Umar Anwar For Detail information Visit my blog:
Submitted By: Usha MIT-876-2K11 M.Tech(3rd Sem) Information Technology
Introduction to Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009

2 Overview Motivation Problem Definition Data Integration Approaches –Virtual integration –Data warehouse Issues Discussion

3 Why Data Integration?

4 Example: Apartment Search Find an apartment near Siebel Center RamshawJSMBankier

5 mediated schema Ramshaw JSM source schema 2 Bankier source schema 3source schema 1 wrapper Find an apartment near Siebel Center

6 mediated schema Ramshaw JSM source schema 2 Bankier source schema 3source schema 1 wrapper Apartment Search Find an apartment near Siebel Center

7 More Examples People Search –Build a yellowpage application on db people Many people doing database stuff in the US How can we find information about a database person, such as classes taught, publications, collaborators, etc? –Homepages –

8 Example Systems Apartment Search DB People Search Etc…

9 Data Integration Arises in numerous contexts –on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e- commerce, etc. Currently very hot –in both database research and industry Current state of affairs –Mostly ad-hoc solution create a special solution for every case; pay consultants a lot of money.

10 Overview Motivation Problem Definition Data Integration Approaches –Virtual integration –Data warehouse Issues Discussion

11 What is Data Integration? The process of 1.Combining data from different data sources Data sources: –Databases, websites, documents, blogs, discussion forums, s, etc 2.Presenting a unified view of these data Source … Unified View

12 What is Data Integration? (2) The process of 1.Combining data from different data sources 2.Presenting a unified view of these data Source … Results: User Query Search as if the data are from the SAME source

13 Problem Definition How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?

14 Data Integration is Hard  Data sources are heterogeneous, distributed, and autonomous –Sources Type Relational database, text, xml, etc –Query-Language SQL queries, keyword queries, XQuery –Schema Databases have different schemas –Data type & value The same data are represented differently in different sources –Type (e.g. time represented as varchar or timestamp) –Value (e.g. 8pm represented as 8:00pm or 20:00:00) –Semantic Words have different meanings at different sources (e.g. title) –Communication Some sources are accessed via HTTP, others FTP

15 Overview Motivation Problem Definition Data Integration Approaches –Virtual integration –Data warehouse Issues Discussion

16 Event Search Provide a comprehensive search on Champaign-Urbana events in one place –Search events by its title, description, location proximity, dates, venues, and/or data sources

17 Virtual Integration Approach Leave the data in the sources When a query comes in: –Determine the relevant sources to the query –Break down the query into sub-queries for the sources –Get the answers from the sources, and combine them appropriately Data is fresh

18 Virtual Integration Architecture wrapper Query Mediated schema Data source catalog Optimizer Execution engine Source wrapper Mediator: Reformulation engine meta-information about the sources (ie. source contents, what queries are supported, etc) Determine relevant sources Break down the query into sub-queries for the sources

19 Virtual Integration Architecture wrapper Query Mediated schema Data source catalog Reformulation engine Optimizer Execution engine Source wrapper Mediator: Communicate with the data source and do format translations Get the answers and format appropriately

20 Virtual Integration Example UIUC calendareventful.comzvents.com Find upcoming events in Champaign-Urbana schema 1 schema 2 schema 3 Events(title, location, description, cost, start, end) Events(title, where, description, start, end) Mediator: Events(title, location, cost, start, end)

21 Virtual Integration Example UIUC calendareventful.comzvents.com wrapper Find upcoming events in Champaign-Urbana schema 1 schema 2 schema 3 Events(title, location, description, cost, start, end) Events(title, where, description, start, end) Mediator: titlewheredescriptionstartend MusicCanopyMusic9pm11pm

22 Virtual Integration Example UIUC calendareventful.com wrapper zvents.com wrapper Find upcoming events in Champaign-Urbana schema 1 schema 2 schema 3 Events(title, location, description, cost, start, end) Events(title, location, cost, start, end) Mediator: titlelocationcoststartend MusicCanopy Club$58pm11pm

23 Challenges titlewheredescriptionstartend MusicCanopyMusic9pm11pm titlelocationCoststartend MusicCanopy Club$58pm11pm Events(title, location, description, cost, start, end) Data Integration: the process of combining data from different data sources and presenting a unified view of these data titlelocationdescriptioncoststartend Schema Matching

24 Challenges titlewheredescriptionstartend MusicCanopyMusic9pm11pm titlelocationCoststartend MusicCanopy Club$58pm11pm Events(title, location, description, cost, start, end) Data Integration: the process of combining data from different data sources and presenting a unified view of these data titlewheredescriptioncoststartend

25 Virtual Integration Architecture wrapper Query Mediated schema Data source catalog Optimizer Execution engine Source wrapper Mediator: Reformulation engine Break down the query into sub-queries for the sources

26 Virtual Integration UIUC calendareventful.com wrapper zvents.com wrapper Query: Find upcoming events in Champaign-Urbana schema 1 schema 2 schema 3 Events(title, location, description, cost, start, end) Mediator: subQuery 1 subQuery 2subQuery 3

27 Mediators Global-as-view –express the mediated schema relations as a set of views over the data source relations Local-as-view –express the source relations as views over the mediated schema Mediated schema Query schema 1 schema 2 schema 3 wrapper subQuery 1 subQuery 2 subQuery 3

28 Global-as-View GAV Express the mediated schema relations as a set of views over the data source relations –The mediated schema is modeled as a set of views over the source schemas Design the mediated schema around the source schemas Mediated schema: Events(title, location, description, cost, start, end) Source schema: –S1: Events(title, where, description, start, end) –S2: Events(title, location, description, cost, start, end, performer) –S3: Events(title, location, cost, start, end) GAV: Create View Events AS select title, where AS location, description, NULL, start, end from S1 UNION select title, location, description, cost, start, end from S2 UNION select title, location, NULL, cost, start, end from S3

29 Global-as-View GAV (2) Adding sources is hard –The core work is on how to retrieve elements from the source databases –Need to consider all other sources that are available

30 Global-as-View GAV (3) Mediated schema: Events(title, location, description, cost, start, end) Venues(location, city, state) Source schema: –S4: Events(title, description, city, state) GAV: Create View Events AS select title, NULL, description, NULL, NULL, NULL from S4 Create View Venues AS select NULL, city, state from S4 What if we want to find events that are in Champaign?

31 Local-as-View LAV Express the source relations as views over the mediated schema The mediated schema is already designed –Create views on the source schemas Mediated schema: Events(title, location, description, cost, start, end) Source schema: –S1: Events(title, where, description, start, end) –S2: Events(title, location, description, cost, start, end, performer) –S3: Events(title, location, cost, start, end) LAV: Create View S1 select title, location AS where, description, start, end from Events Create View S2 select title, location, description, start, end, NULL from Events Create View S3 select title, location, cost, start, end from Events

32 Local-as-View LAV (2) Mediated schema: Events(title, location, description, cost, start, end) Venues(location, city, state) Source schema: –S4: Events(title, description, city, state) What if we want to find events that are in Champaign? LAV: Create View S4 select title, description, city, state from Events e, Venues v where e.location=v.location AND city=“champaign”

33 Local-as-View LAV (3) Very flexible. –You have the power of the entire query language to define the contents of the source. –Hence, can easily distinguish between contents of closely related sources Adding sources is easy –They’re independent of each other

34 Virtual Integration Example UIUC calendareventful.com wrapper zvents.com wrapper Find upcoming events in Champaign-Urbana schema 1 schema 2 schema 3 Events(title, location, description, cost, start, end) Mediator:

35 What if Schema is Unknown? UIUC calendareventful.com wrapper zvents.com wrapper Find upcoming events in Champaign-Urbana ? ? ? Events(title, location, description, cost, start, end) Mediator:

36 Query Interface UIUC calendareventful.com wrapper zvents.com wrapper Find upcoming events in Champaign-Urbana query interface 1 query interface 2 query interface 3 Mediator: global query interface

37 Query Interface

38 Query Interface

39 Wrappers global query interface UIUC calendareventful.com wrapper zvents.com wrapper Find upcoming events in Champaign-Urbana query interface 1 query interface 2query interface 3 Communicate with the data source and do format translations Get the answers and format appropriately

40 Wrappers

41 Wrappers (2) Once the query is submitted via the query interface, results are returned –Formatting is specific to each source

42 Wrappers (2)

43 Information Extraction What information to extract?

44 Information Extraction (2) Complications…

45 Wrappers Hard to build and maintain (very little science) Major approaches –Machine Learning –Data-intensive, completely-automatic Roadrunner Data sources are accessed via query interfaces –Query interface to each data source is different Scalability –One wrapper per source vs one wrapper per domain

46 Overview Motivation Problem Definition Data Integration Approaches –Virtual integration –Data warehouse Issues Discussion

47 Data Warehousing Load all the data periodically into a central database (warehouse) –Performance is good –Data may not be fresh –Need to clean, scrub you data

48 Data Warehousing Architecture Repository Query Data extraction programs Data cleaning/ scrubbing Source Determine relevant sources Collect and clean data

49 Data Warehousing Example … Data Collection Information ExtractionIndex Repository Query ParserSearchSummarize eventful.com zvents.com

50 Data Collection RSS feed Crawlers –Programs that browse the Web in a methodical, automated manner Link following Extract Hyperlinks URL List

51 Crawlers (2)

52 Data Warehousing Example Source … Data Collection Information ExtractionIndex Repository What data to store? How to store? What data to index? What kind of indexes? How to index? How to search? How to rank? How to map a user query into one the system understands? How to display results? What’s the view (ranked list, map, calendar… etc) Query ParserSearchDisplay What data to extract?

53 Data Integration Approaches Virtual integration –No data are collected offline –On a search, data are collected and processed from various sources at runtime Data warehouse –Data are collected offline and stored in a central repository –Search is performed on the repository When should we take the virtual integration approach? When should we take the warehousing approach?

54 Overview Motivation Problem Definition Data Integration Approaches –Virtual integration –Data warehouse Issues Discussion

55 Data Integration Issues Data collection –Wrappers, crawlers, RSS –Duplications, spams –Freshness, completeness, etc Information extraction –What information to extract? Schema matching Query optimization Query reformulation Scalability –When there are many sources out there, does the solution still work?

56 Discussion

57 References Some slides taken from Professor Anhai Doan, from FALL2005 CS511, UIUC