Welcome to CPSC 534B: Web Data Integration & Management
Laks V.S. Lakshmanan
Rm. CICSR 315, 2366 Main Mall



Course Objectives – The Story 1/5
Most applications of information technology require effective and efficient management of data/information.
Data may reside anywhere, not just in DBs, and can be heterogeneous.
Data of interest may not all be in one place. → Data Integration.
Native apps may manipulate data w/o providing data management services. → An enabler for a whole class of new applications.

Course Objectives – The Story 2/5
Data of interest may be buried in spaghetti:
  – Software artifacts
  – Text archives
  – Web pages (HTML)
  – XML data
  – Relational DBs (when you are lucky)
  – Spreadsheets
  – LDAP directories
Need techniques/tools for extracting data of interest from such a mess!

Course Objectives – The Story 3/5
You may need to enter into contracts or cooperation models for obtaining data of interest to you.
  – Peer-to-peer (P2P) database systems.
Data security is extremely important these days.
  – What do you know about data security?
  – How is it different from app security?

Course Objectives – The Story 4/5
Who said the data you are getting (or somehow have access to) is 100% correct or reliable?
  – How do you model reliability?
  – How can you answer questions if the info. you have is < 100% reliable?
The world doesn't necessarily speak SQL (or your favorite query language, for that matter!)
  – What's the world's biggest "database"?
  – How do people "query" that DB?

Course Objectives – The Story 5/5
You and I don't model our app. data the same way:
  – There is unbounded flexibility in designing models and structures for storing data, even in well-structured data models.
  – Matters only get worse with semistructured data.
  – E.g.: 'google' may be modeled as a value or as an attribute or as something else.
  – Price may be in diff. currencies. May/may not include taxes. May be in diff. scales.
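
To make the 'google' example concrete, here is a minimal sketch in Python. It is not from the course material: the relation layouts, the company names other than 'google', the dates, and the prices are all made-up assumptions, and `unpivot` and `normalize` are hypothetical helper names. It shows the same facts stored once with the company as a value and once with the company as an attribute, plus the restructuring step needed before the two sources can be compared.

```python
# Hypothetical data: the same (made-up) stock prices under two schemas.

# Schema A: the company name is a *value* in the "company" column.
rows_a = [
    {"company": "google", "date": "2005-09-12", "price": 309.0},
    {"company": "yahoo", "date": "2005-09-12", "price": 33.0},
]

# Schema B: each company name is an *attribute* (column) of the relation.
rows_b = [
    {"date": "2005-09-12", "google": 309.0, "yahoo": 33.0},
]


def unpivot(rows, key="date"):
    """Turn Schema B rows into Schema A rows: attribute names become values."""
    out = []
    for row in rows:
        for attr, value in row.items():
            if attr != key:
                out.append({"company": attr, key: row[key], "price": value})
    return out


def normalize(rows):
    """Order-independent form so the two representations can be compared."""
    return sorted((r["company"], r["date"], r["price"]) for r in rows)


# After restructuring, both sources describe exactly the same facts.
same = normalize(unpivot(rows_b)) == normalize(rows_a)
print(same)
```

The restructuring here depends on the data itself (the set of column names in Schema B), which is why such mappings are awkward to express in a query language that treats attribute names as fixed. Differences in currency, taxes, or scale would need a further value-level conversion that this sketch leaves out.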

Course Objectives (contd.)
Key technologies:
  – Relational DB technology
  – Heterogeneous database systems
  – View integration and management
  – Semistructured data and XML (data on the web)
Then branch out to tackle various challenges.
Main goal: learn about key concepts, techniques, algorithms, languages, and abstractions that make management and integration of data possible (no matter where it is).
And have some fun.

Tentative Schedule
Basic Tools (GOFDB)
  Week of Sept. 12: Overview/review of relational stuff: query languages and integrity constraints.
  Sept. 19: Query containment and equivalence.
    Conjunctive Negation & aggregation

Tentative Schedule
Integration Take 1 – Global Info. Systems
  Sept. 26: Integration models (Global As View and Local As View); query answering using views (an application).
II Take 2 – Dealing with heterogeneity
  Oct. 3: SchemaLog and SchemaSQL.
  Oct. 10: Schema Integration & Matching.
  Oct. 17: Intro to Semistructured data and XML (data model); XPath & Tree Pattern Queries.

Tentative Schedule (contd.)
II Take 3 – Dropping (rigid) structure
  Oct. 24: XPath (contd.); XQuery.
  Oct. 31: XQuery (contd.); TAX algebra.
  Nov. 7: Set stage for projects, critiques, presentations, and talks.
    – DB + IR.
    – Security/privacy.
    – P2P DBMS.
    – Uncertain and unclean data management.
  Nov. 14: Paper presentations/project demos start.

Marking Scheme
Class Participation 5%
Assignments 40%
Project 55%
  – Reading papers
  – Critiquing them
  – Innovating
  – Implementing
  – Reporting and presenting
Projects can involve teams of 2-3 people (subject to approval). Each team to include ≥ 1 MCS/CS-PhD student.

Suggested Project Themes
Ideas/suggestions offered throughout the course, so be attentive! Here is a slightly old list. Stay tuned for a newer list in class. Your own project ideas are welcome. Projects need my prior approval.
Data cleaning: key step required in data integration.
Mining DTD/schema for XML docs: what you do when you must deal with XML data with no accompanying DTD/schema.
XML schema integration: different XML data sources may follow different DTDs/schemas. How do you provide a unified integrated view to the user?

Project Themes (contd.)
XML query containment/equivalence: given queries (in XQuery or XPath), rewrite them into more efficient ones, possibly using DTDs or integrity constraints.
XML query operator evaluation algorithms: develop cost models and cost-based physical optimization strategies.
XML and data security: how do you ensure queries are evaluated securely? Do not divulge anything you are not supposed to.

Project Themes (contd.)
XML and Information Retrieval: effective ways of querying documents marked up using XML (e.g., Shakespeare's plays); how do you combine IR and database-style XML querying?
Data integration issues for biology: scientific data tends to be heterogeneous. How to meet the data integration challenges there?
Query Answering using Views for XML: extend the QAV technology developed for RDBMSs to XML querying.

Project Themes (contd.)
Detecting similarity between XML documents: develop notions of similarity between XML docs and implement algorithm(s) for detecting similarity.
Ranking answers to keyword search queries over XML data: develop and implement algorithms for ranking answers, based on "quality" of match.
XML interop: leverage semantic web and ontologies for matching schemas (XML or relational) and develop/implement algorithms for answering cross-queries.

Project Themes (contd.)
Explore higher-order logics for tree (XML) querying: example candidates are HiLog and (extensions of) SchemaLog. [Can be purely conceptual, or part conceptual and part implementation.]