Published by Linette Johnston. Modified over 9 years ago.
1
Integrating Structured & Unstructured Data
2
Goals
Identify some applications that have a crucial requirement for integrating unstructured and structured data
Identify key technical issues in integrating unstructured and structured data
Identify potential approaches
3
Definitions (simplified)
Structured object:
  – <{attribute, value}>
Unstructured object:
  – <{word}>
Semi-structured object:
  – <{attribute, value}, {word}>
  – <attribute, value> pairs may be:
      Given (e.g. author, title, etc.)
      Extracted (e.g. date, zipcode, etc.)
      Inferred (e.g. topic)
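One possible reading of these definitions, sketched in Python. It assumes the structured part of an object is a set of attribute-value pairs and the unstructured part is a bag of words. The class names, the function name, and the five-digit zipcode pattern are illustrative assumptions, not something taken from the slides.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Provenance(Enum):
    GIVEN = "given"          # supplied with the object, e.g. author, title
    EXTRACTED = "extracted"  # pulled out of the text, e.g. date, zipcode
    INFERRED = "inferred"    # derived by analysis, e.g. topic


@dataclass
class Attribute:
    name: str
    value: str
    provenance: Provenance = Provenance.GIVEN


@dataclass
class SemiStructuredObject:
    """Attribute-value pairs plus a bag of words (either part may be empty)."""
    attributes: list = field(default_factory=list)
    words: list = field(default_factory=list)

    def is_structured(self) -> bool:
        return bool(self.attributes) and not self.words

    def is_unstructured(self) -> bool:
        return bool(self.words) and not self.attributes


def extract_zipcodes(obj: SemiStructuredObject) -> SemiStructuredObject:
    """Add an EXTRACTED attribute for every five-digit token in the text."""
    for word in obj.words:
        if re.fullmatch(r"\d{5}", word):
            obj.attributes.append(Attribute("zipcode", word, Provenance.EXTRACTED))
    return obj


# Hypothetical news story: one given attribute plus free text.
story = SemiStructuredObject(
    attributes=[Attribute("author", "Smith")],
    words="the office moved to 95120 last year".split(),
)
extract_zipcodes(story)
```

After the call, `story` carries one given attribute (`author`) and one extracted attribute (`zipcode`), while the words themselves are left untouched.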
4
Representative Applications
BPI: messages (unstructured)
Web applications: unstructured pages
Corporate portals: DSS involving a combination of simulation with a database system
News syndication: author etc. + story
Call centers: customer interaction + structured component of complaint
Mail systems/document systems
Tourist information systems
Product catalogs/engineering spec sheets
Patents/chemistry documents
Matching legal documents (with cross citations) with building codes (representative)
5
Key Technical Issues
Query language & data model
  – Sharp vs. fuzzy / complete vs. best-effort
  – Boolean vs. similarity queries (relationship to “value”)
Integration strategies
  – Loose vs. tight coupling
Architectures (many possibilities)
  – Search engine into DBMS or DBMS into search engine
  – Late & early binding (warehousing vs. virtual)
  – Integration vs. articulation (union vs. intersection)
Feature extraction from unstructured data
Role of metadata & integrity constraints
Inconsistency of data sources
  – Priority rules for mediation
Management & data organization issues
  – Version management, freshness, security
Continuous queries over streams
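The contrast between Boolean and similarity queries can be made concrete with a small sketch. It assumes whitespace tokenization and Jaccard overlap as the similarity measure. Both are simplifications chosen for illustration, and the three documents are made-up toy data.

```python
def boolean_query(docs, required_terms):
    """Sharp, complete answer: ids of documents containing every term."""
    required = set(required_terms)
    return [doc_id for doc_id, text in docs.items()
            if required <= set(text.lower().split())]


def similarity_query(docs, query_terms, k=3):
    """Fuzzy, best-effort answer: top-k documents by Jaccard overlap."""
    query = set(query_terms)
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        union = query | words
        score = len(query & words) / len(union) if union else 0.0
        scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:k]]


# Toy collection (hypothetical).
docs = {
    "d1": "privacy preserving data mining",
    "d2": "data mining for streams",
    "d3": "query optimization",
}

exact = boolean_query(docs, ["privacy", "mining"])
ranked = similarity_query(docs, ["privacy", "data", "mining"])
```

The Boolean query returns an unordered set of exact matches, whereas the similarity query returns every document with a score, so partially matching documents such as `d2` still appear, ranked lower.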
6
Structured: People(firstname, lastname, company, location)
Semi-structured: Papers(title, {authors}, text)
Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II
  – Search reviews using Join(People., Papers.authors).keywords
Q2: Folks in Almaden and Watson working on the same topic
  – Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
  – Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
  – Infer sentiment of a review and interesting joins
Q5: Current research topics in Almaden
  – Join People and Papers, followed by clustering
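A rough sketch of how Q2 might be evaluated over the People and Papers collections. It assumes authors are matched to people by last name and that "same topic" is approximated by shared words in the paper text. The rows, the `min_overlap` threshold, and the function names are all hypothetical.

```python
from itertools import combinations


def terms_by_person(people, papers, stopwords=frozenset()):
    """Join People with Papers on last name; collect each person's text terms."""
    terms = {}
    for person in people:
        for paper in papers:
            if person["lastname"] in paper["authors"]:
                words = set(paper["text"].lower().split()) - stopwords
                terms.setdefault(person["lastname"], set()).update(words)
    return terms


def shared_topic_pairs(people, papers, site_a, site_b, min_overlap=2):
    """Pairs of people, one per site, whose papers share enough terms."""
    terms = terms_by_person(people, papers)
    location = {p["lastname"]: p["location"] for p in people}
    pairs = []
    for a, b in combinations(sorted(terms), 2):
        if {location[a], location[b]} == {site_a, site_b}:
            common = terms[a] & terms[b]
            if len(common) >= min_overlap:
                pairs.append((a, b, sorted(common)))
    return pairs


# Made-up rows following the People and Papers layouts above.
people = [
    {"firstname": "A.", "lastname": "Lee", "company": "IBM", "location": "Almaden"},
    {"firstname": "B.", "lastname": "Kim", "company": "IBM", "location": "Watson"},
    {"firstname": "C.", "lastname": "Roy", "company": "IBM", "location": "Almaden"},
]
papers = [
    {"title": "P1", "authors": ["Lee"], "text": "privacy preserving data mining"},
    {"title": "P2", "authors": ["Kim"], "text": "data mining with privacy guarantees"},
    {"title": "P3", "authors": ["Roy"], "text": "stream query processing"},
]

pairs = shared_topic_pairs(people, papers, "Almaden", "Watson")
```

A realistic version would replace the word-overlap test with a text similarity join or clustering step, which is where the structured join and the unstructured comparison have to cooperate.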
7
Combining Scores
Query: Papers on privacy & data mining by Agarwal in Watson
DB:
  – Aggarwal, Watson, s1
  – Agarwal, Almaden, s2
  – Agrawal, Almaden, s3
IR:
  – Sigmod 00 paper, r2
  – PODS 01 papers, r1
  – KDD00 paper, r3
[Diagram labels: Query, Chopper, DB, IR, Combiner, Result]
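One simple way a combiner could merge the two result lists is a weighted sum of the DB match score and the IR relevance score. The slide does not say which combination function is intended, so the weighted sum, the numeric scores standing in for s1..s3 and r1..r3, and the author-to-paper links below are all assumptions made for illustration.

```python
def combine_scores(db_scores, ir_scores, wrote, w_db=0.5):
    """Rank (person, paper) pairs by w_db * DB score + (1 - w_db) * IR score."""
    results = []
    for person, paper in wrote:
        if person in db_scores and paper in ir_scores:
            total = w_db * db_scores[person] + (1 - w_db) * ir_scores[paper]
            results.append((round(total, 3), person, paper))
    # Best combined score first; ties broken by person, then paper.
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results


# Hypothetical numbers standing in for s1, s2, s3 (DB side).
db_scores = {
    ("Aggarwal", "Watson"): 0.9,
    ("Agarwal", "Almaden"): 0.6,
    ("Agrawal", "Almaden"): 0.5,
}
# Hypothetical numbers standing in for r1, r2, r3 (IR side), with r1 highest.
ir_scores = {
    "PODS 01 papers": 0.8,
    "Sigmod 00 paper": 0.7,
    "KDD00 paper": 0.4,
}
# Hypothetical authorship links joining the two sides.
wrote = [
    (("Aggarwal", "Watson"), "KDD00 paper"),
    (("Agarwal", "Almaden"), "PODS 01 papers"),
    (("Agrawal", "Almaden"), "Sigmod 00 paper"),
]

balanced = combine_scores(db_scores, ir_scores, wrote)
db_heavy = combine_scores(db_scores, ir_scores, wrote, w_db=0.9)
```

With equal weights the best text match wins, while a DB-heavy weighting promotes the row whose name and location fit the query best, which shows how sensitive the final ranking is to the choice of combination function.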
8
Query Processing
[Two diagrams, each labeled: Query, Chopper & Router, DB, IR, Result]
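A minimal sketch of what a chopper and router might do: split the query into structured predicates for the DB and free keywords for the IR engine, then send each part to its backend. The `field:value` query syntax, the function names, and the stub backends are assumptions for illustration only.

```python
def chop(query, db_fields):
    """Split a query into DB predicates (field:value) and IR keywords."""
    predicates, keywords = {}, []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name in db_fields:
            predicates[name] = value
        else:
            keywords.append(token)
    return predicates, keywords


def route(query, db_fields, db_search, ir_search):
    """Send each part of the chopped query to the matching backend."""
    predicates, keywords = chop(query, db_fields)
    return {
        "db": db_search(predicates) if predicates else [],
        "ir": ir_search(keywords) if keywords else [],
    }


# Stub backends that simply echo what they were asked (hypothetical).
def fake_db(predicates):
    return sorted(predicates.items())


def fake_ir(keywords):
    return list(keywords)


parts = chop("lastname:Agarwal location:Watson privacy data mining",
             {"lastname", "location"})
routed = route("location:Almaden clustering", {"lastname", "location"},
               fake_db, fake_ir)
```

The two partial results would then be merged in a later step, so the router itself stays independent of how scores are combined.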
9
Approaches (1)
Query languages
  – XML-based extensions for queries
      W3C working group on XQuery considering extension for full text
      XXL (Weikum), XIRQL (Fuhr)
  – Specialized languages for highly structured data (e.g. chemical molecules)?
  – Graph-based models & languages (RDF, Protégé (Stanford))
  – Extended relational (e.g. SQL/MM)
  – Inverse queries on business events
  – Reasoning systems
  – Statistical approaches (approximate / data mining)
10
Approaches (2)
Pluses of tight coupling
  – Enforcement of ontologies, schemas
  – Security, management, query optimization, integrity constraints
Negatives of tight coupling
  – Does not address federation issues/autonomy
Pluses of loose coupling
  – Flexibility
Negatives of loose coupling
And the dinner bell rings …
11
Concluding Remarks
We need further discussion of issues and approaches during the rest of the workshop.