Welcome to CPSC 534B: Information Integration
Laks V.S. Lakshmanan
Rm. 315
Course Objectives
Most applications of information technology require effective and efficient management of information.
Information may reside anywhere – not just in DBs.
Information can be heterogeneous.
Information of interest may not all be in one place.
Hence the need for Information Integration (II).
II is an enabler for a whole class of new applications.
Course Objectives (contd.)
Key technologies:
– RDBMS
– Heterogeneous database systems
– View integration and management
– Semistructured data and XML (data on the web)
Main goal: learn about key concepts, techniques, algorithms, languages, and abstractions that make II possible.
And have some fun.
Tentative Schedule
Basic Tools (GOFDB)
Week of Jan. 5: Overview/review of FOL.
Jan. 12: Review of relational algebra, calculus, Datalog, SQL, integrity constraints.
Jan. 19: Query containment and equivalence.
– Conjunctive queries
– Negation & aggregation
Tentative Schedule
Integration Take 1 – Global Info. Systems
Jan. 26: Integration models
– Global As View and Local As View
– Query answering using views (an application)
II Take 2 – Dealing with heterogeneity
Feb. 2: SchemaLog and SchemaSQL.
Feb. 9: Schema Integration & Matching.
Feb. 16: Break!
Tentative Schedule (contd.)
II Take 3 – Dropping (rigid) structure
Feb. 23:
– Intro to semistructured data and XML (data model)
– XPath & Tree Pattern Queries
Mar. 1:
– XPath (contd.)
– XQuery
Mar. 8:
– XQuery (contd.)
– TAX algebra / structural join algorithms
Mar. 15: XML Storage
– Native
– Relational
Mar. 22: XML + Information Retrieval
Tentative Schedule (contd.)
II Take 4 – Semantic Web (The final frontier?)
Mar. 29: Semantic Web and II
Project talks and demos: April 5 onward.
Marking Scheme
Assignments: 45%
Project: 55%
– Reading papers
– Critiquing them
– Innovating
– Implementing
– Reporting and presenting
Projects can involve teams of 2-3 people (subject to approval).
Each team is to include 1 MCS student.
Suggested Project Themes
Ideas/suggestions will be offered throughout the course, so be attentive!
Data cleaning: a key step required in data integration.
Mining DTD/schema for XML docs: what you do when you must deal with XML data with no accompanying DTD/schema.
XML schema integration: different XML data sources may follow different DTDs/schemas. How do you provide a unified, integrated view to the user?
Project Themes (contd.)
XML query containment/equivalence: given queries (in XQuery or XPath), can we rewrite them into more efficient ones, possibly using DTDs or integrity constraints?
XML query operator evaluation algorithms: develop cost models and cost-based physical optimization strategies.
XML and data security: how do you ensure queries are evaluated securely, without divulging anything you are not supposed to?
Project Themes (contd.)
XML and Information Retrieval: effective ways of querying documents marked up using XML (e.g., Shakespeare's plays); how do you combine IR and database-style XML querying?
Data integration issues for biology: scientific data tends to be heterogeneous. How do we meet the data integration challenges there?
Query Answering using Views for XML: extend the QAV technology developed for RDBMS to XML querying.
Project Themes (contd.)
Detecting similarity between XML documents: develop notions of similarity between XML docs and implement algorithm(s) for detecting similarity.
Ranking answers to keyword search queries over XML data: develop and implement algorithms for ranking answers, based on “quality” of match.
XML interop: leverage semantic web and ontologies for matching schemas (XML or relational) and develop/implement algorithms for answering cross-queries.
Project Themes (contd.)
Explore higher-order logics for tree (XML) querying: example candidates are HiLog and (extensions of) SchemaLog. [Can be purely conceptual, or part conceptual and part implementation.]