Published by Anastasia Boyd. Modified over 9 years ago.
1
Welcome to CPSC 534B: Web Data Integration & Management
Laks V.S. Lakshmanan
laks@cs.ubc.ca
Rm. CICSR 315, 2366 Main Mall
2
Course Objectives – The Story 1/5 Most applications of information technology require effective and efficient management of data/information. Data may reside anywhere – not just in DBs – and can be heterogeneous. Data of interest may not all be in one place: Data Integration. Native apps may manipulate data w/o providing data management services. Enabler for a whole class of new applications.
3
Course Objectives – The Story 2/5 Data of interest may be buried in spaghetti: – Software artifacts – Text archives – Web pages (HTML) – XML data – Relational DBs (when you are lucky) – Spreadsheets – LDAP directories Need techniques/tools for extracting data of interest from such a mess!
4
Course Objectives – The Story 3/5 You may need to enter into contracts or cooperation models for obtaining data of interest to you – Peer-to-peer (P2P) database systems. Data security is extremely important these days. – What do you know about data security? – How is it different from app security?
5
Course Objectives – The Story 4/5 Who said the data you are getting (or somehow have access to) is 100% correct or reliable? – How do you model reliability? – How can you answer questions if the info. you have is < 100% reliable? The world doesn’t necessarily speak SQL (or your favorite query language, for that matter!) – What’s the world’s biggest “database”? – How do people “query” that DB?
6
Course Objectives – The Story 5/5 You and I don’t model our app. data the same way: – There is unbounded flexibility in designing models and structures for storing data, even in well-structured data models. – Matters only get worse with semistructured data. – E.g.: ‘google’ may be modeled as a value or as an attribute or as something else. – Price may be in diff. currencies, may or may not include taxes, and may be in diff. scales.
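The value-versus-attribute mismatch on this slide can be made concrete with a minimal Python sketch. The two sources, the field names (`day`, `company`, `price`), the helper `unpivot`, and all numbers are invented for illustration; none of them come from the course material.

```python
# Source A (hypothetical): the company name is stored as a VALUE.
source_a = [
    {"day": "d1", "company": "google", "price": 10.0},
    {"day": "d1", "company": "yahoo", "price": 20.0},
]

# Source B (hypothetical): each company name is an ATTRIBUTE (column).
source_b = [
    {"day": "d2", "google": 11.0, "yahoo": 21.0},
]


def unpivot(rows, key="day"):
    """Rewrite attribute-style rows into value-style rows.

    Every column other than `key` is treated as a company name whose
    cell holds that company's price.
    """
    out = []
    for row in rows:
        for attr, value in row.items():
            if attr != key:
                out.append({key: row[key], "company": attr, "price": value})
    return out


# After unpivoting, both sources share one schema and can be combined.
unified = source_a + unpivot(source_b)
```

Note that the restructuring moves schema information (column names) into data. Plain SQL cannot express this without naming each column explicitly, which is one motivation for languages that can query schema and data together.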
7
Course Objectives (contd.) Key technologies: – Relational DB technology – Heterogeneous database systems – View integration and management – Semistructured data and XML (data on the web) Then branch out to tackle various challenges. Main goal: learn about key concepts, techniques, algorithms, languages, and abstractions that make management and integration of data possible (no matter where it is). And have some fun.
8
Tentative Schedule
Basic Tools (GOFDB)
Week of Sept. 12: Overview/review of relational stuff: query languages and integrity constraints.
Sept. 19: Query containment and equivalence. Conjunctive queries; negation & aggregation.
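As a preview of the containment topic, here is a minimal sketch of the classical homomorphism test for conjunctive queries: Q1 is contained in Q2 exactly when some mapping of Q2's variables into Q1's variables sends Q2's head to Q1's head and every body atom of Q2 to a body atom of Q1. The query encoding (a head tuple plus a list of `(predicate, args)` atoms, variables only, no constants) and the function name `contained_in` are assumptions made for this example. The search is brute force, which is exponential in the number of variables of Q2.

```python
from itertools import product


def contained_in(q1, q2):
    """Return True if q1 is contained in q2.

    A query is (head_vars, body), where body is a list of atoms
    (predicate, args) and every argument is a variable name.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    # Canonical database of q1: its body atoms with variables frozen.
    facts = set(body1)
    terms = sorted({t for _, args in body1 for t in args})
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    # Try every assignment of q2's variables to q1's frozen variables.
    for image in product(terms, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue
        if all((p, tuple(h[a] for a in args)) in facts for p, args in body2):
            return True
    return False


# Q1(x) :- E(x, y), E(y, x)      (x lies on a 2-cycle)
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "x"))])
# Q2(x) :- E(x, y)               (x has an outgoing edge)
q2 = (("x",), [("E", ("x", "y"))])

q1_in_q2 = contained_in(q1, q2)
q2_in_q1 = contained_in(q2, q1)
```

Here `q1_in_q2` holds because every node on a 2-cycle has an outgoing edge, while `q2_in_q1` does not. Two queries are equivalent when containment holds in both directions.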
9
Tentative Schedule
Integration Take 1 – Global Info. Systems
Sept. 26: Integration models – Global As View and Local As View; query answering using views (an application).
II Take 2 – Dealing with heterogeneity
Oct. 3: SchemaLog and SchemaSQL.
Oct. 10: Schema Integration & Matching.
Oct. 17: Intro to semistructured data and XML (data model); XPath & Tree Pattern Queries.
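For readers new to XPath, the following sketch shows a few path queries over a tiny XML tree. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The document and its element names (`library`, `book`, `article`, `title`, `author`) are made up for this example.

```python
import xml.etree.ElementTree as ET

# A small, invented XML document: a tree of labeled nodes.
doc = ET.fromstring("""
<library>
  <book year="2001"><title>A</title><author>X</author></book>
  <book year="2005"><title>B</title></book>
  <article year="2005"><title>C</title></article>
</library>
""")

# Descendant axis plus an attribute predicate:
# titles of books whose year attribute equals 2005.
titles_2005 = [t.text for t in doc.findall(".//book[@year='2005']/title")]

# Child-existence predicate: books that have an author child.
books_with_author = [b.find("title").text for b in doc.findall("./book[author]")]

# Every title anywhere in the tree, in document order.
all_titles = [t.text for t in doc.findall(".//title")]
```

Each path can be read as a small tree pattern: the steps name the nodes to match, and a predicate in square brackets adds a branch that must exist without being returned.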
10
Tentative Schedule (contd.)
II Take 3 – Dropping (rigid) structure
Oct. 24: XPath (contd.); XQuery.
Oct. 31: XQuery (contd.); TAX algebra.
Nov. 7: Set stage for projects, critiques, presentations, and talks.
– DB + IR.
– Security/privacy.
– P2P DBMS.
– Uncertain and unclean data management.
Nov. 14: Paper presentations/project demos start.
11
Marking Scheme
Class Participation 5%
Assignments 40%
Project 55%
– Reading papers
– Critiquing them
– Innovating
– Implementing
– Reporting and presenting
Projects can involve teams of 2-3 people (subject to approval). Each team to include 1 MCS/CS-PhD student.
12
Suggested Project Themes Ideas/suggestions offered throughout the course, so be attentive! Here is a slightly old list. Stay tuned for a newer list in class. Your own project ideas are welcome. Projects need my prior approval. Data cleaning: a key step required in data integration. Mining DTD/schema for XML docs: what you do when you must deal with XML data with no accompanying DTD/schema. XML schema integration: different XML data sources may follow different DTDs/schemas. How do you provide a unified integrated view to the user?
13
Project Themes (contd.) XML query containment/equivalence: given queries (in XQuery or XPath), rewrite them into more efficient ones, possibly using DTDs or integrity constraints. XML query operator evaluation algorithms: develop cost models and cost-based physical optimization strategies. XML and data security: how do you ensure queries are evaluated securely? Do not divulge anything you are not supposed to.
14
Project Themes (contd.) XML and Information Retrieval: effective ways of querying documents marked up using XML (e.g., Shakespeare’s plays); how do you combine IR and database-style XML querying? Data integration issues for biology: scientific data tends to be heterogeneous. How to meet the data integration challenges there? Query Answering using Views for XML: extend the QAV technology developed for RDBMS to XML querying.
15
Project Themes (contd.) Detecting similarity between XML documents: develop notions of similarity between XML docs and implement algorithm(s) for detecting similarity. Ranking answers to keyword search queries over XML data: develop and implement algorithms for ranking answers, based on “quality” of match. XML interop: leverage semantic web and ontologies for matching schemas (XML or relational) and develop/implement algorithms for answering cross-queries.
16
Project Themes (contd.) Explore higher-order logics for tree (XML) querying: example candidates are HiLog and (extensions of) SchemaLog. [can be purely conceptual or part conceptual and part implementation.]