Data Warehousing/Mining Comp 150 DW Semistructured Data


Data Warehousing/Mining Comp 150 DW Semistructured Data Instructor: Dan Hebert

Semistructured Data
- Everything that has no rigid schema:
  - the schema is contained within the data (self-describing), OR
  - there is no separate schema, OR
  - a schema exists but places only loose constraints on the data
- Emerged as an important topic for a variety of reasons:
  - Many data sources, like the WWW, which we would like to treat as databases but cannot for lack of a schema
  - Desirable to have an extremely flexible format for data exchange between disparate databases
  - May want to view structured data as semistructured data for the purpose of browsing

Motivation
- Some data really is unstructured/semistructured:
  - World Wide Web
  - Data exchange formats
  - Some exotic database management systems, e.g., ACeDB, popular with biologists
- Data integration
- Browsing

Motivation - World Wide Web
- Why do we want to treat the Web as a database?
  - To maintain integrity
  - To query based on structure (as opposed to content)
  - To introduce some “organization”
- But the Web has no structure. The best we can say is that it is an enormous graph.

Motivation - Data Formats
- Much (probably most) of the world’s data is in data formats
- These are formats defined for the interchange and archiving of data
- Data formats vary in generality: ASN.1 and XDR are quite general, while scientific data formats tend to be “fixed schemas”
- The textual representation given by a data format is sometimes not immediately translatable into a standard relational/object-oriented representation

Motivation - Data Integration
- Goal is to integrate all types of information, including unstructured information
  - Irregular or missing information, structure not fully known, dynamic schema evolution, etc.
- Traditional data models and languages are not well suited
  - Cannot accommodate heterogeneous data sets (different types and structures), etc.
  - Difficult to build software that will easily convert between two disparate models
- OEM (Object Exchange Model)
  - Semistructured data model from the TSIMMIS project at Stanford
  - Internal data structure for exchange of data between DBMSs
- Used by other systems: e.g., Windows 95 registry, Lotus Notes

Motivation - Browsing
- To query a database one needs to understand the schema. However, schemas have opaque terminology, and the user may want to start by querying the data with little or no knowledge of the schema:
  - Where in the database is the string “Casablanca” to be found?
  - Are there integers in the database greater than 2^16?
  - What objects in the database have an attribute name that starts with “act”?
- While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
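The three browsing questions on this slide can be answered without any schema by visiting every value in the data. The following is a minimal sketch, assuming the database is held as nested Python dicts and lists; the sample data, the `walk` helper, and the reading of the integer threshold as 2^16 are all assumptions made for illustration.

```python
from typing import Any, Iterator, Tuple

# Hypothetical sample database: nested dicts and lists, no declared schema.
db = {
    "Entry": [
        {"Movie": {"Title": "Casablanca", "Year": 1942, "Views": 100000}},
        {"TVShow": {"Title": "Play it again", "actors": ["Allen"], "Episodes": 12}},
    ]
}

def walk(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (label path, atomic value) for every leaf in the structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node

# Where in the database is the string "Casablanca" to be found?
casablanca_paths = [p for p, v in walk(db) if v == "Casablanca"]

# Are there integers in the database greater than 2**16 (assumed threshold)?
big_ints = [v for _, v in walk(db) if isinstance(v, int) and v > 2**16]

# Which attribute names start with "act"?
act_labels = sorted({label for p, _ in walk(db) for label in p if label.startswith("act")})
```

Every query here scans the whole structure, which is the price of knowing nothing about the schema in advance.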

The Model
- Represent data as some kind of graph-like or tree-like model
  - Cycles are allowed, but the structures are usually still referred to as trees
- Several different approaches with minor differences (easy to convert between them)
  - Data on node labels or on edges; nodes may or may not carry information
- Straightforward to encode relational and object-oriented databases
- Issue: object identity
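As an illustration of how a relational table might be encoded in such a model, here is a minimal sketch of an edge-labeled graph. The `Graph` class, the `encode_relation` helper, the "tuple" edge label, and the sample rows are all hypothetical; integer node ids stand in for object identity.

```python
from collections import defaultdict
from itertools import count

class Graph:
    """Edge-labeled graph: labels sit on edges, atomic values on leaf nodes."""
    def __init__(self):
        self.edges = defaultdict(list)   # node id -> [(label, child id)]
        self.values = {}                 # leaf node id -> atomic value
        self._ids = count()

    def new_node(self, value=None):
        node = next(self._ids)
        if value is not None:
            self.values[node] = value
        return node

    def add_edge(self, parent, label, child):
        self.edges[parent].append((label, child))

def encode_relation(graph, root, name, rows):
    """Encode a table as root -name-> table -tuple-> row -attribute-> value."""
    table = graph.new_node()
    graph.add_edge(root, name, table)
    for row in rows:
        tup = graph.new_node()
        graph.add_edge(table, "tuple", tup)
        for attr, value in row.items():
            graph.add_edge(tup, attr, graph.new_node(value))
    return table

g = Graph()
root = g.new_node()
movies = encode_relation(g, root, "movies", [
    {"title": "Casablanca", "year": 1942},
    {"title": "Sleeper", "year": 1973},
])
```

Two rows with identical attribute values still get distinct node ids here, which is one way the object identity issue shows up in practice.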

Querying Semistructured Data
There are (at least) three approaches to this problem:
- Add arbitrary features to SQL or to your favorite query language
- Find some principled approach to programs that are based on the type of the data
- Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure
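To make the third approach concrete, the sketch below stores the graph as facts of a single `edge` predicate and evaluates two datalog-style rules bottom-up in plain Python. The facts, the rule names `reach` and `label_of`, and the naive fixpoint loop are illustrative assumptions, not a particular datalog system.

```python
# Facts of the predicate edge(source, label, target).
# Node 3 links back to node 1, so the graph contains a cycle.
edge = {
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", "Casablanca"),
    (2, "Sequel", 3), (3, "Prequel", 1), (9, "Orphan", 10),
}

def reach(root):
    """Naive bottom-up evaluation of the rules
         reach(X) :- X = root.
         reach(Y) :- reach(X), edge(X, _, Y).
    Iterating to a fixpoint guarantees termination even with cycles."""
    known = {root}
    while True:
        derived = {t for (s, _, t) in edge if s in known} - known
        if not derived:
            return known
        known |= derived

def label_of(root, label):
    """answer(Y) :- reach(X), edge(X, label, Y)."""
    nodes = reach(root)
    return {t for (s, l, t) in edge if s in nodes and l == label}
```

For example, `label_of(0, "Title")` collects every `Title` target reachable from node 0 at any depth, without the query having to spell out the path.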

The “Extend SQL” Approach
- In fact an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
- The approach taken in the design of UnQL and also of Lorel
- Looks very similar to OQL (path expressions)

Example select Entry.Movie.Title from DB where Entry.Movie.Director...

Syntax Issues
- Need (path) variables to tie paths and edges together
- Paths of arbitrary length
  - “Find all strings in db”
  - “Find whether ‘Allen’ acted in ‘Casablanca’”
- Need regular expressions to constrain paths
- Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
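A rough sketch of what regular expressions over paths and path variables buy us: the code below matches a Python regex against the dotted label path of every simple path in a small graph. The graph, the helper names `paths` and `select`, and the use of Python regex syntax in place of a real path-expression language are assumptions for illustration; labels are assumed not to contain dots.

```python
import re

# Hypothetical edge-labeled graph: node -> [(label, target)].
# Targets that are not keys of the dict are atomic string values.
graph = {
    "db": [("Entry", "e1"), ("Entry", "e2")],
    "e1": [("Movie", "m1")],
    "m1": [("Title", "Casablanca"), ("Cast", "c1"), ("Director", "Curtiz")],
    "c1": [("Actor", "Bogart"), ("Actor", "Bergman")],
    "e2": [("Movie", "m2")],
    "m2": [("Title", "Sleeper"), ("Cast", "c2"), ("Director", "Allen")],
    "c2": [("Actor", "Allen"), ("Actor", "Keaton")],
}

def paths(node, prefix=(), seen=frozenset()):
    """Yield (label path, target) for every simple path starting at node."""
    seen = seen | {node}          # guard against cycles
    for label, target in graph.get(node, []):
        path = prefix + (label,)
        yield path, target
        if target in graph and target not in seen:
            yield from paths(target, path, seen)

def select(root, pattern):
    """Return targets whose dotted label path from root fully matches the regex."""
    regex = re.compile(pattern)
    return [t for p, t in paths(root) if regex.fullmatch(".".join(p))]

# "Find all strings in db": any path, keep only atomic targets.
all_strings = {t for _, t in paths("db") if t not in graph}

# Path of arbitrary length ending in Title.
titles = select("db", r".*\.Title")

# "Find whether 'Allen' acted in 'Casablanca'": the variable m ties two paths together.
allen_in_casablanca = any(
    "Casablanca" in select(m, r"Title") and "Allen" in select(m, r"Cast\.Actor")
    for m in select("db", r"Entry\.Movie")
)
```

The last query shows why path variables are needed: both conditions must hold for the same movie node, not merely somewhere in the database.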

Underlying Computational Strategy
- Model the graph as a relational database and use a relational query language
  - The database is one large relation (node-id, label, node-id)
  - Used by the Stanford group in LORE/LOREL
- Complications
  - Labels come from a heterogeneous set of types, so more than one relation is needed
  - Additional relations are needed if information is to be stored in nodes
  - Various navigation issues
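The edge-relation idea can be tried out with SQLite from the Python standard library. This is a sketch under assumptions: the table names `edge` and `leaf`, the sample rows, and the queries are invented for illustration and are not taken from Lore. The separate `leaf` table is one example of the "additional relations" needed when nodes carry values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
conn.execute("CREATE TABLE leaf (node INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Director", 4),
    (0, "Entry", 5), (5, "Movie", 6), (6, "Title", 7), (6, "Director", 8),
])
conn.executemany("INSERT INTO leaf VALUES (?, ?)", [
    (3, "Casablanca"), (4, "Curtiz"), (7, "Sleeper"), (8, "Allen"),
])

# A fixed-length path Entry.Movie.Title costs one self-join per edge.
titles = [row[0] for row in conn.execute("""
    SELECT leaf.value
    FROM edge AS e1
    JOIN edge AS e2 ON e2.src = e1.dst
    JOIN edge AS e3 ON e3.src = e2.dst
    JOIN leaf ON leaf.node = e3.dst
    WHERE e1.src = 0 AND e1.label = 'Entry'
      AND e2.label = 'Movie' AND e3.label = 'Title'
    ORDER BY leaf.value
""")]

# Paths of arbitrary length ("find all strings in db") need recursion.
# UNION (rather than UNION ALL) drops duplicates, so cycles terminate.
all_strings = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 0
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT leaf.value FROM leaf JOIN reach ON leaf.node = reach.node
    ORDER BY leaf.value
""")]
```

The join count growing with path length is one of the navigation issues the slide alludes to.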

Semistructured Data - Case Study Object Exchange Model

OEM Features
- Common model for heterogeneous information exchange; self-describing, schema-less object model
- Each object: OID, Label, Type, Value
  - OID = unique identifier or NULL
  - Label = character string descriptor
  - Type = atomic data type or set
  - Value = atomic value or set of object references
- OEM objects come in two types:
  - atomic
  - complex --> nested graph
- “Help pages” for labels
- Query language OEM-QL

Representing Semistructured Data Using OEM
    <collection, {b1, a1, ...}>
      b1: <book, {t, a}>
        t: <title, “Database and ...”>
        a: <author, {n, p}>
          n: <name, “Jeff Ullman”>
          p: <picture, “/gifs/ullman.gif”>
      a1: <article, {v, w, x}>
        v: <author, “Gio Wiederhold”>
        w: <title, “Mediators in the …”>
        x: <journal, “IEEE Computer”>
(Diagram callouts: Label, Set Value, Memory Addresses, Atomic Value)
Notes:
- Example of a complex object representing a collection of publications
- Starting from the top (root): a collection (complex) with a set of subobjects…
- What I want you to get from this is how natural it is to represent semistructured data in OEM: point out the heterogeneity of data in one object ...
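The publication collection from this slide can be written down as plain Python objects. This is only a sketch: the `OEMObject` dataclass, the `atom` and `complex_obj` helpers, and the choice of a list (rather than a set) for subobjects are assumptions; labels, OIDs, and values are copied from the slide, including its elided titles.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class OEMObject:
    """One OEM object: label, type, value, and an optional OID (None for NULL)."""
    label: str
    type: str                                   # "string" or "complex" in this sketch
    value: Union[str, List["OEMObject"]]        # atomic value or subobjects
    oid: Optional[str] = None

    def is_complex(self) -> bool:
        return isinstance(self.value, list)

def atom(label: str, value: str, oid: Optional[str] = None) -> OEMObject:
    return OEMObject(label, "string", value, oid)

def complex_obj(label: str, children: List[OEMObject], oid: Optional[str] = None) -> OEMObject:
    # A list stands in for OEM's set of object references.
    return OEMObject(label, "complex", list(children), oid)

collection = complex_obj("collection", [
    complex_obj("book", [
        atom("title", "Database and ...", "t"),
        complex_obj("author", [
            atom("name", "Jeff Ullman", "n"),
            atom("picture", "/gifs/ullman.gif", "p"),
        ], "a"),
    ], "b1"),
    complex_obj("article", [
        atom("author", "Gio Wiederhold", "v"),
        atom("title", "Mediators in the …", "w"),
        atom("journal", "IEEE Computer", "x"),
    ], "a1"),
])

book, article = collection.value
book_author = next(o for o in book.value if o.label == "author")
article_author = next(o for o in article.value if o.label == "author")
```

The heterogeneity the notes mention is visible directly: `book_author` is a complex object with a name and a picture, while `article_author` is a plain string, and nothing in the model objects to that.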

An OEM Query Language: OEM-QL
- Logic-based language for OEM
- Match object patterns, generate variable bindings, construct new OEM objects from existing ones
- Get articles published in “IEEE Computer”:
    P :- P:<articles {<journal “IEEE Computer”>}>
- Get titles of books by “Jeff Ullman”:
    <answer_title T> :- <book {<author “Jeff Ullman”> <title T>}>
Notes:
- OEM also has at least one query language (explain the acronym)
- A query has a head (a constructor, like a select clause) and a tail (used for matching)
- Basically, the query body describes patterns of the OEM objects we are looking for; when a match is found, it is bound to an object variable, and the bindings are used to construct new (result) objects from existing ones; the structure of the answer objects is defined in the query head
- MSL with functional notation is as powerful as datalog with functions but less powerful than COA
- MSL without function symbols is in PTIME
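The match, bind, and construct cycle can be imitated in a few lines of Python. This is a sketch of the idea only, not an OEM-QL implementation: objects are simplified to `(label, value)` pairs, the sample data uses a flat author string and the label `article`, and `Var`, `match`, `match_all`, `matches`, and `query` are hypothetical names.

```python
# Objects as (label, value); value is an atom or a list of subobjects.
collection = ("collection", [
    ("book", [("title", "Database and ..."), ("author", "Jeff Ullman")]),
    ("article", [("author", "Gio Wiederhold"), ("title", "Mediators in the …"),
                 ("journal", "IEEE Computer")]),
])

class Var:
    """A query variable such as T in <title T>."""
    def __init__(self, name):
        self.name = name

def match(pattern, obj, bindings):
    """Yield every extension of bindings under which pattern matches obj."""
    (plabel, pvalue), (olabel, ovalue) = pattern, obj
    if plabel != olabel:
        return
    if isinstance(pvalue, Var):
        if pvalue.name not in bindings:
            yield {**bindings, pvalue.name: ovalue}
        elif bindings[pvalue.name] == ovalue:
            yield bindings
    elif isinstance(pvalue, list):
        if isinstance(ovalue, list):
            yield from match_all(pvalue, ovalue, bindings)
    elif pvalue == ovalue:
        yield bindings

def match_all(subpatterns, subobjects, bindings):
    """Each subpattern must match some subobject; extra subobjects are ignored."""
    if not subpatterns:
        yield bindings
        return
    first, rest = subpatterns[0], subpatterns[1:]
    for sub in subobjects:
        for extended in match(first, sub, bindings):
            yield from match_all(rest, subobjects, extended)

def matches(pattern, obj):
    """True if the pattern matches at least once (an empty binding still counts)."""
    return next(match(pattern, obj, {}), None) is not None

def query(head, body, root):
    """Match body against each top-level object; build one head object per binding."""
    label, var = head
    return [(label, b[var.name]) for obj in root[1] for b in match(body, obj, {})]

# Get articles published in "IEEE Computer": the answer is the matched object itself.
ieee_articles = [obj for obj in collection[1]
                 if matches(("article", [("journal", "IEEE Computer")]), obj)]

# Get titles of books by "Jeff Ullman": the head constructs new answer objects.
titles = query(("answer_title", Var("T")),
               ("book", [("author", "Jeff Ullman"), ("title", Var("T"))]),
               collection)
```

The body pattern only lists the subobjects it cares about, so objects with additional or differently ordered fields still match, which suits data without a fixed schema.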

Semistructured Data - Case Study WWW Extraction

Problem
- Lots of valuable information on the Web
  - irregular structure
  - highly dynamic
- Embedded in HTML
  - cannot be queried directly (e.g., by a TSIMMIS wrapper)
- Limited query facilities
  - Search engines provide limited query facilities

Data Extraction Tool
- Flexible, easy to use
- Accommodates virtually any HTML source
- Interfaces with an existing system, e.g., data warehouse, user interface for querying
  - queriable by TSIMMIS wrappers
  - returns data as an OEM (Object Exchange Model) object
(Diagram labels: World Wide Web, Extractor, Specification, WH Integrator, Data Warehouse, Query)

Approach
- Extract Web data into OEM format
  - Query using OEM-QL
- Python-based, configurable parser
- Declarative description of the HTML source
  - location of data on the page
  - how to package data into OEM
- “Regular expression”-like syntax
- Human intelligence rather than A.I.

Extractor Specification
Consists of commands of the form: [ “variable(s)”, “source”, “pattern” ]
- variable list: data structure that holds the extracted result; its name and contents are used in the creation of the OEM object
- source: input to the parser
- pattern: describes how to find the text of interest in the source

HTML Source File
    <HTML>
    <HEAD>
    .
    <TABLE>
    <TR>
    <TH><I> header 1 </I></TH>
    <TH><I> header 2 </I></TH>
    <TH><I> header 3 </I></TH>
    </TR>
    <TD> text 1 </TD>
    <TD><A HREF=http://www.stuff/> text 2 </A></TD>
    <TD> text 3 </TD>
    </TABLE>
    </BODY>
    </HTML>

Specification File
    [ [“root”, “get('http://www.example.test/')”, “#” ],
      [“__tempvar1”, “root”, “*<table>#</table>*” ],
      [“__tempvar2”, “split (__tempvar1,'</tr>')”, “#” ],
      [“rows”, “__tempvar2[1:-1]”, “#” ],
      [“header1,header2_url,header2,header3”, “rows”,
       “*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”] ]

Result OEM Object
    <root complex {
      <rows complex {
        <header1 string “text 1”>
        <header2_url string “http://www.stuff”>
        <header2 string “text 2”>
        <header3 string “text 3”>
      }>
      ...

Basic Syntax: Variable
- variable(l:p:t): optional parameters for the specification of the corresponding OEM object
  - l: label name
  - p: parent object
  - t: type
- _variable: temporary data structure, does not appear as an OEM object

Basic Syntax: Source
- split(variable,token): creates a list with multiple elements, using token as the element separator
- get(URL): obtains the contents of the HTML file at address URL

Basic Syntax: Patterns
- token1 # token2: match and store the current input (between the tokens)
- token1 * token2: match, but don’t store, the current input (between the tokens)
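One way to picture these patterns is as a thin layer over ordinary regular expressions. The sketch below translates `#` into a capturing group and `*` into a non-capturing skip. The function names, the non-greedy and case-insensitive matching, the requirement that the pattern cover the whole source, and the whitespace stripping are all assumptions, since the slides do not spell out these details; the sample row is adapted from the HTML source slide.

```python
import re

def compile_pattern(pattern):
    """Translate an extractor pattern into a regex:
    '#' captures, '*' skips, everything else is matched literally."""
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")       # match and store
        elif piece == "*":
            parts.append(".*?")         # match, do not store
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def extract(pattern, source):
    """Return the stored fields (whitespace-stripped), or None if there is no match."""
    m = compile_pattern(pattern).fullmatch(source)
    return [g.strip() for g in m.groups()] if m else None

# One table row, adapted from the HTML source slide.
row = ("<TR> <TD> text 1 </TD> "
       "<TD><A HREF=http://www.stuff/> text 2 </A></TD> "
       "<TD> text 3 </TD>")

fields = extract("*<td>#</td>*<a*href=#>#</a>*<td>#</td>*", row)
```

Under these assumptions `fields` holds four strings, one per `#` in the pattern, in the order the variable list `header1,header2_url,header2,header3` expects them.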

Syntactic Sugar
Functions for extracting commonly used HTML constructs:
- extract_table(variable),pattern
- split_table_row(variable)
- split_table_column(variable)
- extract_list(variable),pattern
- split_list(variables)

Advanced Features
- Customization of output structure, label names, data type, ...
- Extraction across multiple HTML pages
- Graceful recovery from parse errors
  - resume parsing using the next input from the source
- Multiple patterns in a single command
  - follow a different parse tree depending on the structure in the source

Sample Extraction Scenario . . .

Extracted OEM Data
OEM-QL query:
    <city C {<high H> <low L>}> :-
      <temperature {<city_temp {<country “Germany”> <city C> <high_today H> <low_today L>}>}>

Evaluation
- Better than:
  - writing programs
  - YACC, PERL, etc.
  - A.I.
- Can do better:
  - GUI tool to simplify the generation of the extractor specification
  - Machine learning or data mining techniques to automatically infer structure...