Welcome to CPSC 534B: Web Data Integration & Management
Laks V.S. Lakshmanan
Rm. CICSR Main Mall
Course Objectives – The Story 1/5
Most applications of information technology require effective and efficient management of data/information.
– Data may reside anywhere, not just in DBs, and can be heterogeneous.
– Data of interest may not all be in one place. Data Integration.
– Native apps may manipulate data w/o providing data management services.
– Enabler for a whole class of new applications.
Course Objectives – The Story 2/5
Data of interest may be buried in a spaghetti:
– Software artifacts
– Text archives
– Web pages (HTML)
– XML data
– Relational DBs (when you are lucky)
– Spreadsheets
– LDAP directories
Need techniques/tools for extracting data of interest from such a mess!
Course Objectives – The Story 3/5
You may need to enter into contracts or cooperation models for obtaining data of interest to you.
– Peer-to-peer (P2P) database systems.
Data security is extremely important these days.
– What do you know about data security?
– How is it different from app security?
Course Objectives – The Story 4/5
Who said the data you are getting (or somehow have access to) is 100% correct or reliable?
– How do you model reliability?
– How can you answer questions if the info. you have is < 100% reliable?
The world doesn’t necessarily speak SQL (or your favorite query language, for that matter!)
– What’s the world’s biggest “database”?
– How do people “query” that DB?
Course Objectives – The Story 5/5
You and I don’t model our app. data the same way:
– There is unbounded flexibility in designing models and structures for storing data, even in well-structured data models.
– Matters only get worse with semistructured data.
– E.g.: ‘google’ may be modeled as a value or as an attribute or as something else.
– Price may be in diff. currencies. May/may not include taxes. May be in diff. scales.
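To make the last two points concrete, here is a minimal sketch, not a prescribed method. It assumes two hypothetical sources: one stores the company name as a value, the other as an attribute name, and their prices differ in currency, tax treatment, and scale. Every name, rate, and figure below (SOURCE_A, SOURCE_B, USD_PER_CAD, TAX_RATE) is made up for illustration.

```python
# Hypothetical illustration: all names, rates, and figures below are made up.

# Source A: the company name is a *value* in the "company" column.
# Prices are in USD, before tax.
SOURCE_A = [
    {"company": "google", "price": 100.0},
    {"company": "acme", "price": 40.0},
]

# Source B: each company name is an *attribute* (column) of a single row.
# Prices are in cents of CAD (different scale and currency), tax included.
SOURCE_B = {"google": 15000, "acme": 6000}

USD_PER_CAD = 0.8   # assumed exchange rate, for illustration only
TAX_RATE = 0.25     # assumed tax rate, for illustration only


def from_source_a(rows):
    """Map source A rows to the common form: (company, pre-tax USD price)."""
    return {(r["company"], round(r["price"], 2)) for r in rows}


def from_source_b(row):
    """Unpivot source B: attribute names become values, then fix units."""
    out = set()
    for company, cents_cad_with_tax in row.items():
        cad_with_tax = cents_cad_with_tax / 100          # scale: cents -> dollars
        cad = cad_with_tax / (1 + TAX_RATE)              # strip the tax
        out.add((company, round(cad * USD_PER_CAD, 2)))  # currency: CAD -> USD
    return out


def integrate(rows_a, row_b):
    """Union of both sources in the common (company, price_usd) form."""
    return sorted(from_source_a(rows_a) | from_source_b(row_b))


if __name__ == "__main__":
    for company, price_usd in integrate(SOURCE_A, SOURCE_B):
        print(company, price_usd)
```

Note that the mapping for source B must turn schema (attribute names) into data. That kind of restructuring is what makes this form of heterogeneity harder than a simple unit conversion.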
Course Objectives (contd.)
Key technologies:
– Relational DB technology
– Heterogeneous database systems
– View integration and management
– Semistructured data and XML (data on the web)
Then branch out to tackle various challenges.
Main goal: learn about key concepts, techniques, algorithms, languages, and abstractions that make management and integration of data possible (no matter where it is). And have some fun.
Tentative Schedule
Basic Tools (GOFDB)
– Week of Sept. 12: Overview/review of relational stuff: query languages and integrity constraints.
– Sept. 19: Query containment and equivalence. Conjunctive queries; negation & aggregation.
Tentative Schedule
Integration Take 1 – Global Info. Systems
– Sept. 26: Integration models: Global As View and Local As View; query answering using views (an application).
II Take 2 – Dealing with heterogeneity
– Oct. 3: SchemaLog and SchemaSQL.
– Oct. 10: Schema Integration & Matching.
– Oct. 17: Intro to semistructured data and XML (data model); XPath & Tree Pattern Queries.
Tentative Schedule (contd.)
II Take 3 – Dropping (rigid) structure
– Oct. 24: XPath (contd.); XQuery.
– Oct. 31: XQuery (contd.); TAX algebra.
– Nov. 7: Set stage for projects, critiques, presentations, and talks.
  – DB + IR.
  – Security/privacy.
  – P2P DBMS.
  – Uncertain and unclean data management.
– Nov. 14: Paper presentations/project demos start.
Marking Scheme
Class Participation 5%
Assignments 40%
Project 55%
– Reading papers
– Critiquing them
– Innovating
– Implementing
– Reporting and presenting
Projects can involve teams of 2-3 people (subject to approval). Each team to include 1 MCS/CS-PhD student.
Suggested Project Themes
Ideas/suggestions offered throughout the course, so be attentive! Here is a slightly old list. Stay tuned for a newer list in class. Your own project ideas are welcome. Projects need my prior approval.
– Data cleaning: key step required in data integration.
– Mining DTD/schema for XML docs: what you do when you must deal with XML data with no accompanying DTD/schema.
– XML schema integration: different XML data sources may follow different DTDs/schemas. How do you provide a unified integrated view to the user?
Project Themes (contd.)
– XML query containment/equivalence: given queries (in XQuery or XPath), rewrite them into more efficient ones, possibly using DTDs or integrity constraints.
– XML query operator evaluation algorithms: develop cost models and cost-based physical optimization strategies.
– XML and data security: how do you ensure queries are evaluated securely? Do not divulge anything you are not supposed to.
Project Themes (contd.)
– XML and Information Retrieval: effective ways of querying documents marked up using XML (e.g., Shakespeare’s plays); how do you combine IR and database-style XML querying?
– Data integration issues for biology: scientific data tends to be heterogeneous. How to meet the data integration challenges there?
– Query Answering using Views for XML: extend the QAV technology developed for RDBMS to XML querying.
Project Themes (contd.)
– Detecting similarity between XML documents: develop notions of similarity between XML docs and implement algorithm(s) for detecting similarity.
– Ranking answers to keyword search queries over XML data: develop and implement algorithms for ranking answers, based on “quality” of match.
– XML interop: leverage semantic web and ontologies for matching schemas (XML or relational) and develop/implement algorithms for answering cross-queries.
Project Themes (contd.)
– Explore higher-order logics for tree (XML) querying: example candidates are HiLog and (extensions of) SchemaLog. [Can be purely conceptual or part conceptual and part implementation.]