Information Integration Across Heterogeneous Sources: Where Do We Stand and How to Proceed? Aditya Telang Sharma Chakravarthy, Yan Huang.

Presentation transcript:

Information Integration Across Heterogeneous Sources: Where Do We Stand and How to Proceed? Aditya Telang Sharma Chakravarthy, Yan Huang

Motivation
- “Retrieve castles near London that are reachable by train in less than 2 hours”
- “Find 3-bedroom houses in Houston within 2 miles of a school and within 5 miles of a highway and priced under $250,000”
- “Retrieve French restaurants within 1 mile of IMAX Theater in Dallas, Texas”
- …

Motivation
- Search engines
- Meta-search engines
- Faceted search engines
- Domain-specific portals

Current Scenario
“Retrieve castles near London that are reachable by train in less than 2 hours”
[Figure labels: London Train schedules; Trains from London; Castles Near London]
- Decision-making process: manually combine results to arrive at a decision

Ideal Scenario
- Intent: “Retrieve castles near London that are reachable by train in less than 2 hours”
[Figure labels: Information Integration System; Actual Results for the intent]

Focus of the Paper
- Identify the salient challenges that must be addressed to solve this problem
- Survey existing work to identify the challenges for which acceptable solutions are available
- Propose a framework that could provide potential solutions to the problem

Broader Challenges
- Intent specification and formulation
- Query processing and optimization
- Discovery of sources, their schemas and characteristics
- Data extraction, integration and ranking
- Result visualization
- Issues with inconsistency, security, privacy, …

Intent Specification
“Retrieve castles near London that are reachable by train in less than 2 hours”
  – Keyword-based (e.g., search engine query)?
  – Structured (e.g., SQL)?
  – Unstructured (e.g., natural language)?
  – Template/Form/Menu-based (deep Web query)?
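To make the "structured" option concrete, here is a minimal sketch of how the castles intent might be held as data rather than as keywords or SQL. The class names (`Condition`, `Intent`) and the attribute names are hypothetical illustrations, not part of the presented system.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Condition:
    concept: str     # domain concept the condition refers to, e.g. "train"
    attribute: str   # attribute of that concept (hypothetical names)
    operator: str    # comparison or spatial operator, e.g. "<", "near"
    value: Any


@dataclass
class Intent:
    target: str                                   # concept the user wants back
    conditions: List[Condition] = field(default_factory=list)

    def concepts(self) -> List[str]:
        """Concepts involved in the intent, target first, without duplicates."""
        seen = [self.target]
        for c in self.conditions:
            if c.concept not in seen:
                seen.append(c.concept)
        return seen


# "Retrieve castles near London that are reachable by train in less than 2 hours"
castles = Intent(
    target="castle",
    conditions=[
        Condition("castle", "location", "near", "London"),
        Condition("train", "origin", "=", "London"),
        Condition("train", "travel_time_minutes", "<", 120),
    ],
)
print(castles.concepts())
```

A representation like this names the concepts involved (castle, train) without committing to a schema or to specific sources, which is what distinguishes it from a SQL query.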

Query Processing
- The number of sources to be integrated is much larger than in a normal database environment.
- Heterogeneous sources (RDBMS, websites, web services, etc.) do not provide the same processing capabilities found in a typical database system (such as the ability to perform joins).
- Unlike relational databases, there might be restrictions on how a source can be accessed.

Query Processing
- In contrast to query optimization in a DBMS, the query optimizer in information integration has little information about the data, since the data resides in remote autonomous sources.
- Web data sources are not necessarily database systems and may have different processing capabilities.
- Hence, the query optimizer must consider the possibility of exploiting a data source’s query-processing capabilities.
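The two constraints above (access restrictions, and sources that cannot join) can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Source` class, the `required_bindings` idea, the `mediator_join` helper, and the toy rows are not taken from the presentation.

```python
class Source:
    """A remote source that only answers queries with certain attributes bound."""

    def __init__(self, name, rows, required_bindings):
        self.name = name
        self._rows = rows                      # stands in for remote data
        self.required_bindings = set(required_bindings)

    def query(self, **bindings):
        # Access restriction: refuse queries that omit a required binding.
        missing = self.required_bindings - set(bindings)
        if missing:
            raise ValueError(f"{self.name} requires bindings: {sorted(missing)}")
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in bindings.items())]


def mediator_join(left_rows, right_rows, key):
    """Hash join done by the mediator, since the sources cannot join."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


# Toy, made-up data.
castle_source = Source("castles", [
    {"castle": "Castle A", "city": "London", "station": "Station X"},
    {"castle": "Castle B", "city": "London", "station": "Station Y"},
], required_bindings={"city"})

train_source = Source("trains", [
    {"origin": "London", "station": "Station X", "minutes": 55},
    {"origin": "London", "station": "Station Y", "minutes": 150},
], required_bindings={"origin"})

castles = castle_source.query(city="London")
trains = train_source.query(origin="London")
reachable = [row["castle"]
             for row in mediator_join(castles, trains, "station")
             if row["minutes"] < 120]
print(reachable)
```

The selection on `city` and `origin` is pushed to the sources because they support it, while the join and the travel-time filter run in the mediator.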

Discovery
- Source discovery
  – Given the domain of travel, determine all possible sources providing airfare information
  – Not a simple crawling process, since categorization is necessary after crawling [Gal:VLDB’06]
  – Use of search engines? Web directories?

Discovery
- Discovery of source schema and characteristics
  – Understanding source schema
  – Understanding query mechanism (for deep Web sources)
  – Understanding characteristics of sources

Data Extraction
- How to extract data for individual sub-queries?
  – APIs, Web services for deep Web?
  – Data extractors (e.g., Lixto, Florid) for surface Web?
- Temporary storage of extracted data (becomes a critical issue when the data can be large, such as spatial data)

Data Integration
- Schema integration is a complex challenge across domains [Gal:VLDB’06]
- Additional challenges while integrating data
  – Inefficient execution of recursive integration plans
  – No support for dynamic service composition
  – Lack of operators to support geospatial data types
  – No support for record linkage and object consolidation in the mediator can incorporate the source into a new or existing workflow

Ranking
- In the context of integration, ranking has not been addressed as a significant challenge [Telang:ICDE’07]
- When to rank?
  – Before integrating sub-query results?
  – After integrating sub-query results?
- Is source-independent ranking possible?

Other Challenges
- Visualization of results
- Handling inconsistencies
- Ensuring no breach of privacy and security
- …

The Current Big Players
- Industry-level
  – Google (Google Base) [Madhavan:CIDR’07]
  – IBM (WebSphere)
  – Yahoo (Trip Planner)
- Academia-level
  – Havasu [Kambhampati:ICDE’05]
  – MetaQuerier [Chang et al.: VLDB’05, CIDR’07]
  – Ariadne [Knoblock: VLDB’02,03]
  – …

The InfoMosaic Approach

Knowledge-Base
- Identify the different types of information needed about the domains and sources to answer a query.
- Domain Knowledge: the information required to elaborate and refine the query, based on the domains and keywords provided by the user.
- Source Semantics: an information store for modeling and maintaining all the necessary information about each source within a given domain.
[Figure: Knowledge Base, divided into Domain Knowledge and Source Semantics; component labels: Metadata, Ontology, Vocabulary, Operators, Attributes, Statistics, Schemas]

User Intent Specification
- Specify an intent that is more precise than a “search” but less rigid than an “SQL query”
- Ability to resolve concepts and their attributes elegantly with minimal user interaction
- Effectiveness depends on user feedback and past query statistics [Telang:COMAD’08]
[Figure: Feedback-centric Query Specification; labels: User Intent, Feedback, Knowledge-Base, Refined Query]

Multi-level Query Planning
- Evaluation is made at each stage to prune plans using relevant cost metrics.
- Some of the additional cost metrics:
  – volume of data retrieved from each source
  – number of calls made to and amount of data sent by each source
  – quantity of data processed
  – number of integration queries executed
[Figure: Query Planner & Optimizer; labels: Refined Query, Domain-Level, Domain Level Plan, Source-Level, SP-1, SP-2, …, Knowledge-Base]
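As a rough sketch of cost-based pruning over the metrics listed for this slide: the weights, the linear cost formula, and the estimates for the source-level plans are all hypothetical. The slide names the metrics but does not say how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePlan:
    name: str
    kb_retrieved: int          # volume of data retrieved from the sources
    source_calls: int          # number of calls made to the sources
    integration_queries: int   # number of integration queries executed


# Hypothetical weights; a real optimizer would calibrate these.
WEIGHTS = {"kb_retrieved": 1, "source_calls": 5, "integration_queries": 10}


def cost(plan: SourcePlan) -> int:
    """Weighted sum of the plan's estimated metrics (an assumed cost model)."""
    return sum(weight * getattr(plan, metric) for metric, weight in WEIGHTS.items())


def prune(plans, keep):
    """Keep only the `keep` cheapest plans at this planning stage."""
    return sorted(plans, key=cost)[:keep]


candidates = [
    SourcePlan("SP-1", kb_retrieved=200, source_calls=4, integration_queries=2),
    SourcePlan("SP-2", kb_retrieved=50, source_calls=30, integration_queries=1),
    SourcePlan("SP-3", kb_retrieved=500, source_calls=2, integration_queries=3),
]
survivors = prune(candidates, keep=2)
print([(p.name, cost(p)) for p in survivors])
```

Applying `prune` once per level (domain level, then source level) keeps the search space small without enumerating every complete plan.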

Query Execution & Data Extraction
- Checking availability of sources, identifying the attributes to be extracted (using the source semantics), and extracting data
- Determining the output in XML and spatial data formats for storage and further querying
- Reuse of previously retrieved results is an integral part of this task
[Figure: Query Executor & Data Extractor; labels: Query Plan, Knowledge-Base, Internet, Results, Query, Extracted Results, Data Store (XML Data Repository, Spatial Data Repository)]
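Reuse of previously retrieved results could be sketched as a cache in front of the extractor. This is an assumption-laden illustration: the `ResultCache` class, its key scheme, and the `fake_fetch` stand-in for a remote call are hypothetical, and a real system would also need expiry for stale data.

```python
class ResultCache:
    """Serve repeated sub-queries from the local data store instead of the source."""

    def __init__(self, fetch):
        self._fetch = fetch     # function (source_name, bindings) -> list of rows
        self._store = {}        # stands in for the XML/spatial data repositories

    def get(self, source_name, **bindings):
        # Sort the bindings so that argument order does not change the key.
        key = (source_name, tuple(sorted(bindings.items())))
        if key not in self._store:
            self._store[key] = self._fetch(source_name, bindings)
        return self._store[key]


remote_calls = []


def fake_fetch(source_name, bindings):
    """Stand-in for a remote extraction; records each call that reaches the source."""
    remote_calls.append((source_name, dict(bindings)))
    return [{"source": source_name, **bindings}]


cache = ResultCache(fake_fetch)
first = cache.get("trains", origin="London", max_minutes=120)
second = cache.get("trains", max_minutes=120, origin="London")  # same query, reordered
print(len(remote_calls))
```

Only the first request reaches the source; the second is answered from the store even though its arguments arrive in a different order.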

Integration of Results
- Generation of XQueries for combining extracted data
- Develop external functions for XQueries to access spatial data
- The result of the query will be transformed into a homogeneous schema for understanding and analyzing the results.
[Figure: Integrator; labels: Data Store (XML Data Repository, Spatial Data Repository), Domain Level Plan, Knowledge-Base, Results, Query, Result Set]
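The kind of spatial predicate such an external function would provide can be sketched as follows. This uses Python rather than XQuery, and the great-circle (haversine) distance is a swapped-in technique chosen for illustration; the coordinates and names are made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within(rows, anchor, max_miles):
    """Rows whose location lies within max_miles of the anchor location."""
    return [r for r in rows
            if haversine_miles(r["lat"], r["lon"], anchor["lat"], anchor["lon"]) <= max_miles]


# Toy, made-up coordinates for "French restaurants within 1 mile of a theater".
theater = {"name": "Theater T", "lat": 32.780, "lon": -96.800}
restaurants = [
    {"name": "Restaurant R1", "cuisine": "French", "lat": 32.785, "lon": -96.800},
    {"name": "Restaurant R2", "cuisine": "French", "lat": 32.830, "lon": -96.800},
    {"name": "Restaurant R3", "cuisine": "Thai", "lat": 32.781, "lon": -96.800},
]
french = [r for r in restaurants if r["cuisine"] == "French"]
nearby = [r["name"] for r in within(french, theater, 1.0)]
print(nearby)
```

The cuisine filter is an ordinary attribute selection, while the distance test is the part that needs spatial support in the integrator.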

Ranking
- Two approaches to ranking [Telang:DBRank’07]:
  – Rank Before Integration: applicable when user-specified metrics can be decomposed and applied to individual sub-queries
  – Rank After Integration: applicable when user-specified metrics cannot be decomposed and applied to individual sub-queries
[Figure labels: Ranking, Query Executor & Data Extractor, Integrator]
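The distinction between the two approaches can be sketched with the house-and-school example. The function names, the scoring functions, and the toy rows are hypothetical; the point is only where the score gets computed.

```python
def rank_before_integration(left, right, left_score, right_score, key):
    """Decomposable metric: score each sub-query result alone, add scores at the join."""
    scored_left = [(left_score(l), l) for l in left]
    scored_right = [(right_score(r), r) for r in right]
    pairs = [(sl + sr, {**l, **r})
             for sl, l in scored_left
             for sr, r in scored_right
             if l[key] == r[key]]
    return [row for _, row in sorted(pairs, key=lambda t: -t[0])]


def rank_after_integration(left, right, pair_score, key):
    """Non-decomposable metric: the score needs attributes from both sides."""
    pairs = [{**l, **r} for l in left for r in right if l[key] == r[key]]
    return sorted(pairs, key=pair_score, reverse=True)


# Toy, made-up data: positions hx and sx lie on a single axis for simplicity.
houses = [
    {"zone": "A", "house": "H1", "price_k": 200, "hx": 0},
    {"zone": "A", "house": "H2", "price_k": 240, "hx": 5},
]
schools = [
    {"zone": "A", "school": "S1", "rating": 7, "sx": 5},
    {"zone": "A", "school": "S2", "rating": 9, "sx": 1},
]

# Decomposable: cheap house plus highly rated school.
by_value = rank_before_integration(
    houses, schools,
    left_score=lambda h: -h["price_k"],
    right_score=lambda s: 10 * s["rating"],
    key="zone",
)

# Not decomposable: closeness of the house to the school.
by_closeness = rank_after_integration(
    houses, schools,
    pair_score=lambda p: -abs(p["hx"] - p["sx"]),
    key="zone",
)
print(by_value[0]["house"], by_value[0]["school"])
print(by_closeness[0]["house"], by_closeness[0]["school"])
```

With a decomposable metric, each source's results can be scored (and potentially trimmed) before any join; the closeness metric has no meaning until a house and a school are paired.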

To Conclude
- Ideally, an information integration system should allow users to specify what information is needed without having to provide detailed instructions on how or from where to obtain the information.
- A number of challenges need to be addressed by different research communities (AI, DB, IR, NLP, Semantic Web, …)
- Existing work suggests we are on the right track
- Our proposed framework (InfoMosaic) could be a further step in this direction

Thank You !