
Integrating Structured & Unstructured Data

Goals
• Identify some applications that have a crucial requirement for integrating unstructured and structured data
• Identify key technical issues in integrating unstructured and structured data
• Identify potential approaches

Definitions (simplified)
• Structured object: <{<attribute, value>}>
• Unstructured object: <{word}>
• Semi-structured object: <{<attribute, value>}, {word}>
  – <attribute, value> pairs may be:
    Given (e.g. author, title, etc.)
    Extracted (e.g. date, zipcode, etc.)
    Inferred (e.g. topic)
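The semi-structured case can be made concrete with a small sketch. It assumes a semi-structured object is a set of attribute-value pairs plus a bag of words, and that each pair records whether it was given, extracted, or inferred. The class name `SemiStructuredObject`, the `origin` tag, and the zipcode regular expression are illustrative choices, not part of the slides.

```python
import re
from dataclasses import dataclass, field

ORIGINS = ("given", "extracted", "inferred")
ZIP_RE = re.compile(r"\b\d{5}\b")  # hypothetical pattern: a five-digit US zipcode


@dataclass
class SemiStructuredObject:
    """Attribute-value pairs plus free text (hypothetical model)."""
    pairs: dict = field(default_factory=dict)  # attribute -> (value, origin)
    words: list = field(default_factory=list)  # the unstructured part

    def add_pair(self, attribute, value, origin):
        if origin not in ORIGINS:
            raise ValueError(f"unknown origin: {origin}")
        self.pairs[attribute] = (value, origin)


def extract_zipcode(obj):
    """Turn the first zipcode found in the text into an 'extracted' pair."""
    match = ZIP_RE.search(" ".join(obj.words))
    if match:
        obj.add_pair("zipcode", match.group(), "extracted")
    return obj
```

A purely structured object is then the special case with no words, and a purely unstructured object the special case with no pairs.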

Representative Applications
• BPI: messages (unstructured)
• Web applications: unstructured pages
• Corporate portals
• DSS involving a combination of simulation with a database system
• News syndication: author etc. + story
• Call centers: customer interaction + structured component of complaint
• Mail systems/document systems
• Tourist information system
• Product catalogs/engineering spec sheets
• Patents/chemistry documents
• Matching legal documents (with cross citations) with building codes (representative)

Key Technical Issues
• Query language & data model
  – Sharp vs fuzzy / complete vs best-effort
  – Boolean vs similarity queries (relationship to "value")
• Integration strategies
  – Loose vs. tight coupling
    Architectures (many possibilities)
  – Search engine into DBMS or DBMS into search engine
  – Late & early binding (warehousing vs virtual)
  – Integration vs articulation (union vs intersection)
• Feature extraction from unstructured data
• Role of metadata & integrity constraints
• Inconsistency of data sources
  – Priority rules for mediation
• Management & data organization issues
  – Version management, freshness, security
• Continuous queries over streams
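The Boolean vs similarity distinction can be illustrated with a short sketch. A sharp query returns exactly the matching values, while a fuzzy query scores every value and returns a ranking. The use of `difflib.SequenceMatcher` as the similarity measure and the function names are assumptions made for illustration; the slides do not prescribe a measure.

```python
from difflib import SequenceMatcher

# Name spellings that appear later in the slides.
NAMES = ["Aggarwal", "Agarwal", "Agrawal"]


def boolean_query(names, target):
    """Sharp query: a name either equals the target or it does not."""
    return [n for n in names if n == target]


def similarity_query(names, target, threshold=0.0):
    """Fuzzy query: score every name, return (name, score) pairs, best first."""
    scored = [
        (n, SequenceMatcher(None, n.lower(), target.lower()).ratio())
        for n in names
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With the target "Agarwal", the Boolean query returns only the exact spelling, whereas the similarity query also returns the near-miss spellings with lower scores, which is the best-effort behaviour an IR engine would typically provide.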

• Structured: People(firstname, lastname, company, location)
• Semi-structured: Papers(title, {authors}, text)
• Unstructured: Reviews

Q1: Reviews of papers by Almaden authors on II
  • Search reviews using Join(People., Papers.authors).keywords
Q2: Folks in Almaden and Watson working on the same topic
  • Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
  • Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
  • Infer the sentiment of a review, plus interesting joins
Q5: Current research topics in Almaden
  • Join People and Papers, followed by clustering
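One possible reading of Q1 is sketched below: join People with Papers on the author names, then use the titles of the joined papers as keywords for searching the reviews. All records, the substring-based topic filter, and the function name `reviews_of_papers` are made up for illustration; the slides do not say which attributes of the join result serve as keywords.

```python
# Hypothetical sample data following the three schemas on the slide.
people = [
    {"firstname": "Ann", "lastname": "Doe", "company": "IBM", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Roe", "company": "IBM", "location": "Watson"},
]
papers = [
    {"title": "Mediators for Text", "authors": ["Doe"],
     "text": "information integration over text"},
    {"title": "Fast Joins", "authors": ["Roe"], "text": "join algorithms"},
]
reviews = [
    "Mediators for Text is a solid contribution.",
    "Fast Joins lacks experiments.",
]


def reviews_of_papers(location, topic):
    """Join People and Papers on author name, then keyword-search the reviews."""
    # Structured selection: people at the requested location.
    names = {p["lastname"] for p in people if p["location"] == location}
    # Join with Papers on authors, keeping papers whose text mentions the topic.
    titles = [
        paper["title"] for paper in papers
        if names & set(paper["authors"]) and topic in paper["text"]
    ]
    # Unstructured search: reviews that mention one of the joined titles.
    return [r for r in reviews if any(t.lower() in r.lower() for t in titles)]
```

The structured part of the query narrows the candidate papers first, so the keyword search over reviews only has to consider titles that survive the join.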

Combining Scores
Example query: Papers on privacy & data mining by Agarwal in Watson
• DB:
  – Aggarwal, Watson, s1
  – Agarwal, Almaden, s2
  – Agrawal, Almaden, s3
• IR:
  – Sigmod 00 paper, r2
  – PODS 01 papers, r1
  – KDD00 paper, r3
[Diagram components: Query, Chopper, DB, IR, Combiner, Result]
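A combiner for this example could look roughly like the sketch below. It assumes the DB side returns a score per person, the IR side returns a score per paper, and an authorship mapping links the two; the final score is a weighted sum. The weighting scheme, the `alpha` parameter, and the authorship mapping are assumptions, since the slides only name the scores s1..s3 and r1..r3 without saying how they are merged.

```python
def combine_scores(db_hits, ir_hits, authorship, alpha=0.5):
    """Merge DB and IR scores into one ranking of (paper, person, score).

    db_hits:    {person: score} from the structured side
    ir_hits:    {paper: score} from the text side
    authorship: {paper: person}, a hypothetical link between the two sides
    alpha:      weight of the DB score in the weighted sum (assumed)
    """
    combined = []
    for paper, ir_score in ir_hits.items():
        person = authorship.get(paper)
        if person in db_hits:
            score = alpha * db_hits[person] + (1 - alpha) * ir_score
            combined.append((paper, person, score))
    # Highest combined score first.
    return sorted(combined, key=lambda item: item[2], reverse=True)
```

A weighted sum is only one option; a combiner could equally merge rank positions rather than raw scores when the two engines score on incomparable scales.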

Query Processing
[Two diagrams, each with the components: Query, Chopper & Router, DB, IR, Result]
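The chopper-and-router step might be sketched as follows: fielded terms that name a known database attribute go to the DB, and everything else is treated as keywords for the IR engine. The `attr:value` query syntax and the `db_search` / `ir_search` callables are hypothetical; the attribute set is taken from the People schema used in the example queries.

```python
# Attributes of People(firstname, lastname, company, location).
DB_ATTRIBUTES = {"firstname", "lastname", "company", "location"}


def chop(query_terms):
    """Split query terms into a structured part and a keyword part.

    Terms written as 'attr:value' (hypothetical syntax) with a known
    attribute become DB predicates; all other terms become IR keywords.
    """
    db_part, ir_part = {}, []
    for term in query_terms:
        attr, sep, value = term.partition(":")
        if sep and attr in DB_ATTRIBUTES:
            db_part[attr] = value
        else:
            ir_part.append(term)
    return db_part, ir_part


def route(query_terms, db_search, ir_search):
    """Send each part to its engine; the engines are caller-supplied callables."""
    db_part, ir_part = chop(query_terms)
    return db_search(db_part), ir_search(ir_part)
```

For the running example, `chop(["lastname:Agarwal", "location:Watson", "privacy", "data", "mining"])` separates the two predicates on People from the three keywords meant for the paper text.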

Approaches (1)
• Query languages
  – XML-based extensions for queries
    W3C working group on XQuery considering an extension for full text
    XXL (Weikum), XIRQL (Fuhr)
  – Specialized languages for highly structured data (e.g. chemical molecules)?
  – Graph-based models & languages (RDF, Protégé – Stanford)
  – Extended relational (e.g. SQL/MM)
  – Inverse queries on business events
  – Reasoning systems
  – Statistical approaches (approximate / data mining)

Approaches (2)
• Pluses of tight coupling
  – Enforcement of ontologies, schemas
  – Security, management, query optimization, integrity constraints
• Negatives of tight coupling
  – Does not address federation issues/autonomy
• Pluses of loose coupling
  – Flexibility
• Negatives of loose coupling
And the dinner bell rings …

Concluding Remarks
• We need further discussion on issues and approaches during the rest of the workshop