Integrating Structured & Unstructured Data
Goals
Identify some applications that have a crucial requirement for integration of unstructured and structured data
Identify key technical issues in integrating unstructured and structured data
Identify potential approaches
Definitions (simplified)
Structured object:
  – <{<attribute, value>}>
Unstructured object:
  – <{word}>
Semi-structured object
  – <{<attribute, value>}, {word}>
  – <attribute, value> pairs may be
      Given (e.g. author, title, etc.)
      Extracted (e.g. Date, Zipcode, etc.)
      Inferred (e.g. Topic)
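The three object kinds above can be sketched as one small data structure. This is a minimal illustration, assuming the pairs are attribute-value pairs; the class names, the `origin` tag, and the sample record are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    value: str
    origin: str = "given"  # hypothetical tag: "given", "extracted", or "inferred"


@dataclass
class DataObject:
    # Attribute-value pairs (the structured part) plus a bag of words (the text part).
    attributes: Dict[str, Attribute] = field(default_factory=dict)
    words: List[str] = field(default_factory=list)

    def is_structured(self) -> bool:
        return bool(self.attributes) and not self.words

    def is_unstructured(self) -> bool:
        return bool(self.words) and not self.attributes

    def is_semi_structured(self) -> bool:
        return bool(self.attributes) and bool(self.words)


# Made-up example record: a news story with given, extracted, and inferred attributes.
story = DataObject(
    attributes={
        "author": Attribute("A. Writer", "given"),
        "zipcode": Attribute("00000", "extracted"),
        "topic": Attribute("data integration", "inferred"),
    },
    words="structured and unstructured data meet here".split(),
)
print(story.is_semi_structured())
```

Keeping the origin of each attribute makes it possible to trust given values more than extracted or inferred ones when sources disagree.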
Representative Applications
BPI: messages (unstructured)
Web applications: unstructured pages
Corporate portals: DSS involving combination of simulation with database system
News syndication: author etc. + story
Call centers: customer interaction + structured component of complaint
Mail systems/document systems
Tourist information system
Product catalogs/engineering spec sheets
Patents/chemistry documents
Matching legal documents (with cross citations) with building codes --- representative
Key Technical Issues
Query language & data model
  – Sharp vs fuzzy / complete vs best-effort
  – Boolean vs similarity queries (relationship to “value”)
Integration strategies
  – Loose vs. tight coupling
Architectures (many possibilities)
  – Search engine into DBMS or DBMS into search engine
  – Late & early binding (warehousing vs virtual)
  – Integration vs articulation (union vs intersection)
Feature extraction from unstructured data
Role of meta data & integrity constraints
Inconsistency of data sources
  – Priority rules for mediation
Management & data organization issues
  – Version management, freshness, security
Continuous queries over streams
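The Boolean vs similarity distinction listed above can be made concrete with a small sketch. It assumes whitespace tokenization and uses Jaccard overlap as a stand-in similarity measure; the function names and sample documents are hypothetical, and the slides do not prescribe any particular measure.

```python
def boolean_match(docs, terms):
    """Sharp, complete answer: every document that contains all terms."""
    return [d for d in docs if all(t in d.lower().split() for t in terms)]


def similarity_rank(docs, terms, k=2):
    """Fuzzy, best-effort answer: the top-k documents by Jaccard overlap."""
    def jaccard(doc):
        words = set(doc.lower().split())
        return len(words & set(terms)) / len(words | set(terms))

    return sorted(docs, key=jaccard, reverse=True)[:k]


# Made-up documents for illustration only.
docs = ["privacy in data mining", "data mining at scale", "query optimization"]
terms = ["privacy", "mining"]

print(boolean_match(docs, terms))
print(similarity_rank(docs, terms))
```

The Boolean query either returns a document or not, while the similarity query always returns up to `k` answers, including partial matches.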
Structured: People(firstname, lastname, company, location)
Semi-structured: Papers(title, {authors}, text)
Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II
  Search reviews using Join(People., Papers.authors).keywords
Q2: Folks in Almaden and Watson working on same topic
  Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
  Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
  Infer sentiment of a review and interesting joins
Q5: Current research topics in Almaden
  Join People and Papers followed by clustering
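A query in the style of Q1 can be sketched over in-memory records. This is only an illustration under stated assumptions: the join is on People.lastname against Papers.authors, the topic is tested by substring match on Papers.text, and paper titles serve as the keywords for searching reviews. All records and function names are made up.

```python
# All records below are made-up illustrations.
people = [
    {"firstname": "Ann", "lastname": "Lee", "company": "ACME", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Ray", "company": "ACME", "location": "Watson"},
]
papers = [
    {"title": "Schema Mapping", "authors": ["Lee"], "text": "information integration of sources"},
    {"title": "Stream Joins", "authors": ["Ray"], "text": "continuous queries over streams"},
]
reviews = [
    "Schema Mapping is a clear and useful paper.",
    "Stream Joins lacks an evaluation section.",
]


def papers_by_location(location):
    """Structured join: People.lastname against Papers.authors (an assumption)."""
    names = {p["lastname"] for p in people if p["location"] == location}
    return [paper for paper in papers if names & set(paper["authors"])]


def reviews_for(location, topic):
    """Use the titles from the join result as keywords to search the reviews."""
    hits = []
    for paper in papers_by_location(location):
        if topic.lower() not in paper["text"].lower():
            continue  # paper is not on the requested topic
        title = paper["title"].lower()
        hits.extend(r for r in reviews if title in r.lower())
    return hits


print(reviews_for("Almaden", "information integration"))
```

The structured join narrows the candidates first, so the text search only runs for papers that already satisfy the People predicate.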
Combining Scores
DB:
  – Aggarwal, Watson, s1
  – Agarwal, Almaden, s2
  – Agrawal, Almaden, s3
IR
  – SIGMOD 00 paper, r2
  – PODS 01 papers, r1
  – KDD 00 paper, r3
[Figure: Query, Chopper, DB and IR, Combiner, Result]
Query: Papers on privacy & data mining by Agarwal in Watson
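One possible combiner is a weighted sum of the DB match score and the IR relevance score. The sketch below assumes both scores are already normalized to [0, 1]; the numeric values standing in for s1..s3 and r1..r3, the placeholder identifiers, the authorship links, and the `weight` parameter are all hypothetical. The slides do not say which combination function is intended.

```python
def combine(db_hits, ir_hits, authored, weight=0.5):
    """Rank (person, paper) pairs by a weighted sum of DB and IR scores."""
    results = []
    for person, paper in authored:
        if person in db_hits and paper in ir_hits:
            score = weight * db_hits[person] + (1 - weight) * ir_hits[paper]
            results.append((person, paper, round(score, 3)))
    # Highest combined score first.
    return sorted(results, key=lambda r: r[2], reverse=True)


# Made-up scores: how well each person row matches the structured predicates ...
db_hits = {"p1": 0.9, "p2": 0.5, "p3": 0.4}
# ... and how relevant each paper is to the keyword part of the query.
ir_hits = {"d1": 0.8, "d2": 0.5, "d3": 0.7}
# Made-up authorship links connecting the two result sets.
authored = [("p1", "d2"), ("p2", "d1"), ("p3", "d3")]

for person, paper, score in combine(db_hits, ir_hits, authored):
    print(person, paper, score)
```

Changing `weight` shifts the ranking toward the structured match (values near 1) or toward text relevance (values near 0).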
Query Processing
[Figure, two diagrams: Query, Chopper & Router, DB, IR, Result]
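A chopper and router can be sketched as a function that splits a query into a structured part for the DB and a keyword part for IR. This is a minimal sketch, assuming the query arrives as a dictionary and that the People schema defines the structured fields; `chop`, `route`, and the callback parameters are hypothetical names.

```python
# Structured attributes, taken from the People schema.
STRUCTURED_FIELDS = {"firstname", "lastname", "company", "location"}


def chop(query):
    """Split a query into structured predicates (DB) and keywords (IR)."""
    db_part = {k: v for k, v in query.items() if k in STRUCTURED_FIELDS}
    ir_part = list(query.get("keywords", []))
    return db_part, ir_part


def route(query, db_search, ir_search):
    """Send each part to its engine; skip an engine that has nothing to do."""
    db_part, ir_part = chop(query)
    db_result = db_search(db_part) if db_part else None
    ir_result = ir_search(ir_part) if ir_part else None
    return db_result, ir_result


# "Papers on privacy & data mining by Agarwal in Watson" as a dictionary.
q3 = {"lastname": "Agarwal", "location": "Watson", "keywords": ["privacy", "data mining"]}
print(chop(q3))
```

The two results returned by `route` would then go to a combiner to produce the final ranked answer.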
Approaches (1)
Query Languages
  – XML-based extensions for queries
      W3C working group on XQuery considering extension for full text
      XXL (Weikum), XIRQL (Fuhr)
  – Specialized languages for highly structured data (e.g. chemical molecules)?
  – Graph-based models & languages (RDF, Protégé – Stanford)
  – Extended relational (e.g. SQL/MM)
  – Inverse queries on business events
  – Reasoning systems
  – Statistical approaches (approximate/data mining)
Approaches (2)
Pluses of tight coupling
  – Enforcement of ontologies, schemas
  – Security, management, query optimization, integrity constraints
Negatives of tight coupling
  – Does not address federation issues/autonomy
Pluses of loose coupling
  – Flexibility
Negatives of loose coupling
And the dinner bell rings …
Concluding Remarks
We need further discussion on issues and approaches during the rest of the workshop