Web Warehousing : Design and Issues

Sanjay Kumar Madria Department of Computer Science Purdue University West Lafayette, IN 47907

World Wide Web
The Web is growing fast.
More business organizations are putting information on the Web.
Business on the highway.
A myriad of raw data has to be processed into information.

The WWW is a collection of multimedia documents in the form of web pages connected via hyperlinks.

As the WWW grows, it becomes more chaotic
The Web is a fast-growing, distributed, non-administered global information resource.
The WWW allows access to text, image, video, sound and graphic data.
More business organizations are creating web servers.
The environment for locating information of interest becomes more chaotic.
"Lost in hyperspace" syndrome.

Characteristics of WWW
The WWW is a set of directed graphs.
Data in the WWW is heterogeneous in nature.
Unstructured versus structured information.
No central authority manages the information.
Dynamic versus static information.
Web information discovery: search engines.

Web is Growing!
In 1994, the WWW grew by 1758%!
June 1993 - 130; Dec ,576; April ,768; July ,000+

‘COM’ domains are increasing!
As of July 1995, there were 6.64 million host computers on the Internet:
1.74 million are ‘com’ domains
1.41 million are ‘edu’ domains
0.30 million are ‘net’
0.27 million are ‘gov’
0.22 million are ‘mil’
0.20 million are ‘org’

Top web countries
1. Canada (1) 80%
2. US (4) 140%
3. Ireland (3) 110%
4. Iceland (2) 68%
5. UK (14) 336%
6. Malta (5) 155%
7. Australia (6) 133%
8. Singapore (11) 207%
9. New Zealand (7) 101
10. Sweden (9) 101%
11. Israel (12) 112%
12. Cyprus (8) 72%
13. Hong Kong (15) 148%
14. Norway (10) 64%
15. Switzerland (13) 75%
16. Denmark (16) 105%

How users find web sites
Indexes and search engines
UseNet newsgroups
Cool lists
New lists
Listservers
Print ads
Word-of-mouth
Linked web advertisement

Limitations of Search Engines
Do not exploit hyperlinks.
Search is limited to string matching.
Queries are evaluated on archived data rather than up-to-date data; there is no indexing on current data.
Low accuracy.
Replicated results.
No further manipulation possible.

Limitations of Search Engines
ERROR 404!
No efficient document management.
Query results cannot be further manipulated.
No efficient means for knowledge discovery.

Key Objectives
Design a suitable data model to represent web information.
Develop a web algebra and query language.
Maintain web data.
Develop knowledge discovery and web mining tools.
Web warehouse.

Current Research Projects
Web Query Systems: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog
Semistructured Data: LOREL, UnQL, WebOQL
Website Management System: STRUDEL

Main Tasks
Modeling and Querying the Web
View the web as a directed graph.
Content- and link-based queries. Example: find the pages that contain the word “clinton” and that have a link from a page containing the word “monica”.
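The content- and link-based query in the example above can be sketched on a toy directed graph. This is only an illustration: the URLs, page texts and the function name `linked_pages` are hypothetical, and a real system would evaluate such a query against crawled pages rather than an in-memory dictionary.

```python
# Toy web graph: URL -> (page text, list of outgoing link targets).
# All URLs and page contents are hypothetical.
PAGES = {
    "http://a.example/": ("monica lewinsky news", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("president clinton speech", []),
    "http://c.example/": ("weather report", ["http://b.example/"]),
    "http://d.example/": ("clinton biography", []),
}


def linked_pages(pages, source_word, target_word):
    """Return URLs of pages containing target_word that are linked
    from at least one page containing source_word."""
    hits = set()
    for url, (text, links) in pages.items():
        if source_word not in text.split():
            continue  # content condition on the source page fails
        for target in links:
            target_text = pages.get(target, ("", []))[0]
            if target_word in target_text.split():
                hits.add(target)  # both content and link conditions hold
    return sorted(hits)


print(linked_pages(PAGES, "monica", "clinton"))
```

Note that `http://d.example/` contains the word but is not returned, because no page containing “monica” links to it; a plain keyword search could not express that link condition.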

Information Extraction and Integration
Wrapper: a program that extracts a structured representation of the data (a set of tuples) from HTML pages.
Mediator: integration of data.
Web Site Construction and Restructuring
Creating sites.
Modeling the structure of web sites.
Restructuring data.
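As a rough illustration of the wrapper idea (a program that turns an HTML page into a set of tuples), the sketch below uses Python's standard `html.parser` module to extract (href, anchor text) tuples. The class name `AnchorWrapper`, the helper `wrap` and the sample HTML are assumptions made for this example; real wrappers are usually tailored to one site's page layout.

```python
from html.parser import HTMLParser


class AnchorWrapper(HTMLParser):
    """Collect (href, anchor text) tuples from the <a> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.tuples = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.tuples.append((self._href, "".join(self._text).strip()))
            self._href = None


def wrap(html):
    """Run the wrapper over one HTML string and return its tuples."""
    parser = AnchorWrapper()
    parser.feed(html)
    parser.close()
    return parser.tuples


# Hypothetical page fragment.
HTML = '<ul><li><a href="/aids.html">AIDS</a></li><li><a href="/cancer.html">Cancer</a></li></ul>'
print(wrap(HTML))
```

A mediator would then combine the tuples produced by several such wrappers into one integrated view.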

What to Model
Structure of web sites.
Internal structure of web pages.
Contents of web sites at finer granularities.

Data Representation of Web Data
Graph Data Models Semistructured Data Models (also graph based)

Graph Data Model
A labeled graph data model in which nodes represent web pages and arcs represent links between pages. The labels on arcs can be viewed as attribute names. Queries are written as regular path expressions.
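A minimal sketch of a regular path expression query over a labeled graph follows. It assumes a simplified encoding that is not taken from any of the systems named in this talk: the labels along a path are joined with "/" and matched against an ordinary Python regular expression, and the search is bounded by `max_len`. The graph, its labels and the function name `path_query` are hypothetical.

```python
import re

# Hypothetical labeled graph: (source page, arc label, target page).
EDGES = [
    ("home", "diseases", "list"),
    ("list", "disease", "aids"),
    ("aids", "treatment", "aids_treatment"),
    ("aids", "symptoms", "aids_symptoms"),
    ("home", "about", "about"),
]


def path_query(edges, start, pattern, max_len=4):
    """Return the nodes reachable from start along a simple path whose
    arc labels, joined by '/', fully match the regular expression."""
    regex = re.compile(pattern)
    results = set()
    stack = [(start, [], {start})]  # (current node, labels so far, visited nodes)
    while stack:
        node, labels, seen = stack.pop()
        if labels and regex.fullmatch("/".join(labels)):
            results.add(node)
        if len(labels) == max_len:
            continue  # bound the search depth
        for src, label, dst in edges:
            if src == node and dst not in seen:
                stack.append((dst, labels + [label], seen | {dst}))
    return sorted(results)


print(path_query(EDGES, "home", r"diseases/disease/(treatment|symptoms)"))
print(path_query(EDGES, "home", r"(\w+/)*treatment"))
```

The second query uses a Kleene star to say "any number of arcs, then an arc labeled treatment", which is the kind of condition that fixed-length link queries cannot state.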

Semistructured Data Models
Irregular data structure: no fixed schema is known, and the schema may be implicit in the data. The schema may be large and may change frequently. The schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated.

Data is not strongly typed; for different objects, the values of the same attribute may be of differing types (heterogeneous sources). There is no restriction on the set of arcs that emanate from a given node in a graph, or on the types of the values of attributes. Ability to query the schema: arc variables get bound to labels on arcs, rather than to nodes in the graph.

Graph based Query Languages
Use graphs to model databases.
Support regular path expressions and graph construction in queries.
Examples: GraphLog for hypertext queries; graph query language for OO.

Query Languages for Semi-Structured data
Use labeled graphs.
Query the schema of the data.
Ability to accommodate irregularities in the data, such as missing links.
Examples: Lorel (Stanford), UnQL (AT&T), StruQL (AT&T)

Comparison of Query Systems

WebSQL (University of Toronto)
Models the web as a relational database using two relations, Document and Anchor. The Document relation has one tuple for each document in the web, and the Anchor relation has one tuple for each anchor in each document.

WebSQL
An SQL-like query language for extracting information from the web. It is capable of systematically processing all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both. It provides transparent access to index servers.
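The two-relation view described above can be imitated with an ordinary relational database. The sketch below uses Python's built-in `sqlite3` module and plain SQL, not WebSQL syntax; the column names, the sample rows and the function name `linked_documents` are assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema: one Document row per page, one Anchor row per link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (url TEXT PRIMARY KEY, title TEXT, text TEXT);
    CREATE TABLE Anchor (base TEXT, href TEXT, label TEXT);
""")
conn.executemany(
    "INSERT INTO Document VALUES (?, ?, ?)",
    [
        ("http://site.example/", "Home", "welcome"),
        ("http://site.example/a", "Page A", "about diseases"),
        ("http://site.example/b", "Page B", "about drugs"),
    ],
)
conn.executemany(
    "INSERT INTO Anchor VALUES (?, ?, ?)",
    [
        ("http://site.example/", "http://site.example/a", "A"),
        ("http://site.example/", "http://site.example/b", "B"),
        ("http://site.example/a", "http://site.example/b", "B"),
    ],
)


def linked_documents(connection, base_url):
    """Return (url, title) of every document linked from base_url."""
    return connection.execute(
        """
        SELECT d.url, d.title
        FROM Anchor AS a JOIN Document AS d ON d.url = a.href
        WHERE a.base = ?
        ORDER BY d.url
        """,
        (base_url,),
    ).fetchall()


rows = linked_documents(conn, "http://site.example/")
print(rows)
```

In this toy version both relations are fully materialized, which is only workable for a small, already-crawled set of pages.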

WebOQL (University of Toronto)
Provides a framework that supports a large class of data restructuring operations.
Simple semistructured data model for documents and record-based data.
OQL-like syntax and regular expressions.
Serves as a two-way bridge between databases and the Web.

WebDB
Views the WWW as multimedia documents in the form of web pages.
WQL supports selection, aggregation, sorting, summary, grouping, and projection on title, URL, keywords, tables, forms, images, etc.

Presentation Overview
WHOWEDA - warehouse of web data
Research objectives
Current research
Web Data Model (WICM)
Web Algebra
Future work

“If you build it, they will come”
More chaotic.
Increasingly difficult to locate information.
Related data are scattered in a piecemeal fashion.
Data, data everywhere... but how to find it?

How does it affect the corporate world?
Lack of credibility of data: different sites with different data; same site, different data.
Historical information is not available: previous versions of web data; how does web data change with time; summarization over time.
Data to information.
Reduction in productivity: analysis is manual.

Web Warehouse: Its Business Value
Local data warehouse is inadequate.
The Web is “hot” as a commercial medium: current size, future growth prospects, exceedingly attractive demographics.

WHOWEDA Research Objectives
Build a web warehouse: web information access, web information manipulation, efficient visualization of web information, maintenance of web data, web data mining.
Overcome existing limitations.
Provide effective mechanisms to manipulate web information.

WHOWEDA - What?
WareHouse Of Web Data
Subject-oriented
Integrated
Temporal
Granularity: lower, higher
Some summary
Not updatable
Alternative information sources

WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda
A WareHouse Of WEb DAta
Web Information Coupling Model (WICM)
Web Objects
Web Schema
Web Information Coupling Algebra
Web Information Maintenance
Web Mining and Knowledge Discovery

Figure (architecture diagram). Labels: WWW, User, Web Querying & Analysis Component, Web Information Coupling System, Web Information Mining System, Web Information Maintenance System, Web Warehouse, Web Marts, Concept Mart.

Figure (component diagram). Labels: User, WWW, Web Query & Display, Pre-processing, Warehouse Concept Mart, Web Warehouse, Data Visualization, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Web Select, Web Project, Web Join, Web Union, Web Intersection, Schema Tightness, Schema Search, Schema Match.

Figure. Labels: WWW, Global Web Coupling, Warehouse Concept Mart, Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr).

Figure. Labels: Web Information; Lower-level Granularity: Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr); Web Information Manipulation Operators; Higher-level Granularity: Summarized data.

Web Objects
Node: url, title, format, size, date, text
Link: source-url, target-url, label, link-type
Web tuple
Web table
Web schema
Web database
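The web objects listed above map naturally onto small record types. The sketch below is one possible encoding, not WHOWEDA's actual implementation: the attribute names follow the slide, while the default values, the `dangling_links` helper and the sample data are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """A web document, with the node attributes named on the slide."""
    url: str
    title: str = ""
    format: str = "html"
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink, with the link attributes named on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = "global"  # assumed value; the slide does not list link types


@dataclass
class WebTuple:
    """A set of nodes and the links that connect them."""
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

    def dangling_links(self):
        """Links whose source or target is not a node of this tuple."""
        urls = {n.url for n in self.nodes}
        return [l for l in self.links
                if l.source_url not in urls or l.target_url not in urls]


@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# Hypothetical sample data.
home = Node(url="http://health.example/", title="List of Diseases")
aids = Node(url="http://health.example/aids", title="AIDS")
link = Link(source_url=home.url, target_url=aids.url, label="AIDS")
diseases = WebTable("Diseases", [WebTuple([home, aids], [link])])
print(len(diseases.tuples), diseases.tuples[0].dangling_links())
```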

Web Schema
Metadata in the warehouse.
Structural ‘summary’ of a web table.
Coupling of related information begins with a query graph.
Query graph -> Web schema

Web Information Coupling System
A database system to couple related web information.
Web data model
Web objects
Web schema
Web algebra

Meta-data in WHOWEDA
Web schema
Schema tree
Information extracted from each web document or node: URL, size, keywords, links, title, language, multimedia details, version history

Web Algebra
Formal foundation of data representation and manipulation in a web warehouse.
Web operators:
Information access operator
Information manipulation operators
Web schema operators
Data visualization operators

Directly querying the WWW is an expensive and repetitive affair
Information is already materialized in different web tables in the web database. We need a means to gather this similar information through additional manipulation of the materialized web tables.

Global Coupling - Information Access
To integrate data from the Web.
To create historical data.
To couple related information from the WWW satisfying a web schema.
Operator to create web tables.
From the web with no schema to a web table with a web schema.

Global Coupling
Matches portions of the web that satisfy the web schema.
Input is a query graph.
Output is a web table.
Example

Information Manipulation
Used for analysis of web data in the warehouse.
Web select
Web project
Local web coupling
Web join
Web cartesian product
Web union
Web intersect

Join Processing in Web Databases

Web Join
Concatenates tuples based on identical nodes or documents.
Inputs are two web tables and their schemas.
Output is a joined table.
Types: pi-web join, sigma-web join, outer joins, web composition, semi web join

Web Join
Analyse web tables storing temporal data.
Used for combining related data from various web tables.
Mechanism to provide summarized information.
Mechanism to find an alternative web document in case of a “Document Not Found” error.
Example

Example 1 Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at Web table Diseases

Figure (schema of web table Diseases). Variables: x, y, z, w, p, q, e, f, g. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation.

Figure (a web tuple of web table Diseases). Instances: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels: List of Diseases, Issues, AIDS, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Elisa Test.

Example 2 Produce a list of drugs, and their uses and side effects starting from the web site at Web table Drugs

Figure (schema of web table Drugs). Variables: a, b, c, d, r, s, k. Labels: List of Diseases, Issues, Drug list, Use, Uses, Side effects.

Figure (a web tuple of web table Drugs). Instances: a0, b1, c1, d1, r1, s1, k1. Labels: List of Diseases, Issues, AIDS, Drug list, Indavir, Use, Uses of Indavir, Side effects, Side effects of Indavir.

Web Join Operator
Information manipulation operator.
Manipulates information residing in a web database to derive additional information.
Harnesses useful, composite information from two web tables.
Capitalizes on the reuse of data already retrieved from the WWW in order to reduce the execution time of queries.

Joinable Nodes
Node variables participating in the web join process.
Expressed as a pair.
The two nodes in a pair must have identical URLs.

Web Join
Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever joinable nodes exist.
Joinable nodes are identified from the schemas of the two web tables.
The URLs of the joinable nodes are identical.
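The join described above can be sketched with web tuples reduced to a mapping from node variable to URL. This is a simplification made for illustration: real web tuples also carry links and node contents, the URLs are hypothetical, and the function name `web_join` and its `joinable` parameter are assumptions. The variable names loosely echo the Diseases and Drugs examples.

```python
def web_join(table_a, table_b, joinable):
    """Concatenate a tuple of table_a with a tuple of table_b whenever
    every joinable node pair (var_in_a, var_in_b) has identical URLs."""
    joined = []
    for ta in table_a:
        for tb in table_b:
            if all(ta[va] == tb[vb] for va, vb in joinable):
                merged = dict(ta)
                merged.update(tb)  # variable names of the two schemas are disjoint
                joined.append(merged)
    return joined


# Hypothetical web tables: node variable -> URL.
DISEASES = [
    {"x": "http://health.example/", "y": "http://health.example/issues",
     "z": "http://health.example/aids"},
    {"x": "http://health.example/", "y": "http://health.example/issues",
     "z": "http://health.example/cancer"},
]
DRUGS = [
    {"a": "http://health.example/", "b": "http://health.example/issues",
     "c": "http://health.example/aids", "d": "http://health.example/indavir"},
]

result = web_join(DISEASES, DRUGS, [("z", "c")])
print(len(result), result[0]["d"])
```

Joining on the pair (z, c) keeps only the AIDS tuple, whereas joining on (x, a) alone would keep both Diseases tuples because they share the same start page.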

Flavors of Web Join
Natural Web Join
Theta Web Join (web join followed by web select)
Examples
Further flavors: single-node join, multi-node join

Figure (joined schema of Diseases and Drugs). Variables: x, y, z, w, p, q, e, f, g, b, c, d, r, s, k. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Drug list, Use, Uses, Side effects.

Figure (a joined web tuple). Instances: x0, y1, z1, w1, p2, q1, e1, f1, g1, b1, c1, d1, r1, s1, k1. Labels: AIDS, AIDS treatment, Symptoms of AIDS, AIDS Evaluation, Elisa Test, Indavir, Side effects of Indavir, Uses of Indavir.

Join Existence
Given two web tables, we determine whether they are joinable.
Inspect the schemas of the web tables.
Joinability conditions are based on: node predicates; link predicates; node and link predicates; locus of a node relative to a joinable node.

Join Construction
To construct a joined schema, we construct: node set, link set, connectivity set, predicate set.
Construction of the joined table: concatenate the web tuples of the two input tables over the joinable nodes.

Web Select
Extracts web tuples from a web table that satisfy certain conditions on node and link variables and on connectivities.
Input is a select schema.
Output is a web table satisfying the select schema.

Web Couple: Coupling web information

Why web coupling?
Related information on the web is supplied by different information providers. Web documents containing similar information can reside in different web tables in the web database.

Why web coupling?
The web couple operator gives us the capability to manipulate these web tables to harness useful related information.

Web Couple Operator
The web couple operator is a composite operator: a web cartesian product followed by a web select. Because a web cartesian product followed by a web select is a frequently used operation, it motivates a separate composite operator to handle it.
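The composition described above (web cartesian product followed by web select) can be sketched as follows. Web tuples are again reduced to a mapping from node variable to URL, which is an assumption for illustration; the function names, predicate and sample URLs are hypothetical.

```python
from itertools import product


def web_cartesian_product(table_a, table_b):
    """Pair every tuple of table_a with every tuple of table_b (n * m tuples)."""
    return [{**ta, **tb} for ta, tb in product(table_a, table_b)]


def web_select(table, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in table if predicate(t)]


def web_couple(table_a, table_b, predicate):
    """Composite operator: cartesian product followed by select."""
    return web_select(web_cartesian_product(table_a, table_b), predicate)


# Hypothetical web tables: node variable -> URL.
A = [{"z": "http://h.example/aids"}, {"z": "http://h.example/cancer"}]
B = [{"c": "http://h.example/aids"}, {"c": "http://h.example/flu"},
     {"c": "http://h.example/cancer"}]

everything = web_cartesian_product(A, B)
coupled = web_couple(A, B, lambda t: t["z"] == t["c"])
print(len(everything), len(coupled))
```

With n = 2 and m = 3 the product holds n*m = 6 tuples, while a coupling that filters on tuple contents returns at most that many (here 2).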

Val(p) is the operand of op(p).

Definitions
Coupling nodes: node variables participating in the web coupling. We express the coupling nodes of two web schemas as a pair, for example (c, z), since they cannot exist as a single node variable.

Definitions
One coupling node variable can be in more than one pair; that is, the pairs in a set of coupling-node pairs are not necessarily disjoint. The attribute of the coupling node, as defined in the predicate of the node, is called the coupling attribute. The predicate is called the coupling predicate.

Web Coupling

Types of web coupling
Single-node coupling: web coupling in which only one node variable in each schema is involved. Multi-node coupling: web coupling in which more than one node variable in each schema participates.

Types of web coupling
System-driven web coupling: the system decides which node variables are to be coupled (coupling nodes). If not even one pair of coupling nodes can be identified, the web tables cannot be coupled.

Types of web coupling
User-driven web coupling: the user decides which node variables are to be coupled (coupling nodes). Coupling is performed only on those user-specified node variables.

Types of web coupling
Attribute-driven web coupling: the user specifies the coupling attributes. Coupling is performed only on those user-specified coupling attributes.

Types of web coupling
Value-driven web coupling: the user specifies the values of the node attributes on which coupling should be performed. Coupling is performed only on those user-specified attribute values.

Levels of web coupling
Schema-level web coupling.
Tuple-level web coupling.

Schema level web coupling
We inspect the schemas to decide whether the two web tables can be coupled. If coupling conditions cannot be identified, the two web tables cannot be coupled. We do not inspect the web tuples in the web tables.

Schema level web coupling
Let n and m be the number of web tuples of the two input web tables. Then the coupled web table based on schema level web coupling will always have n*m web tuples.

Tuple level web coupling
We inspect the web tuples of the two input web tables to identify nodes with similar information. The number of web tuples in the coupled web table is <= n*m.

Why two levels?
A schema does not capture all the information of the web documents in a web table, so it is not always possible to identify a coupling condition by inspecting the schemas. It is possible to find coupling nodes that are not defined in the schemas.

Why two levels?
Tuple-level coupling gives us a means to correlate web documents containing similar information from the web tables (which cannot be identified from their schemas), at the expense of additional processing.

Conditions for web coupling
URLs with the same directory name, such as “/computer/”, may contain similar information. Paths with “/cgi-bin/” are not considered. All conditions for web join are included.
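One way to read the directory-name condition above is sketched below using the standard `urllib.parse` module. The interpretation is an assumption: two URLs are treated as coupling candidates when their paths share at least one directory name, and any URL whose path passes through `cgi-bin` is ignored. The function names and sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def directory_names(url):
    """Return the set of directory names in the URL path.
    URLs whose path contains a cgi-bin directory yield an empty set."""
    path = urlparse(url).path
    # Drop the last path component (the file name, or '' after a trailing slash).
    parts = [p for p in path.split("/")[:-1] if p]
    if "cgi-bin" in parts:
        return set()
    return set(parts)


def may_couple(url_a, url_b):
    """True when the two URLs share at least one directory name."""
    return bool(directory_names(url_a) & directory_names(url_b))


print(may_couple("http://a.example/computer/hw.html",
                 "http://b.example/shop/computer/index.html"))
print(may_couple("http://a.example/cgi-bin/computer/q",
                 "http://b.example/computer/"))
```

This check is only a heuristic filter; the conditions for web join (identical URLs of joinable nodes) still apply on top of it.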

Construction of coupled schema (schema level)
When at least one pair of coupling nodes is identical (same URL). When none of the pairs is identical.

Web Bags
Existence of identical web tuples.
Created by the web project operation.
Structure-based mining.
Used for discovering: visible nodes, luminous nodes, luminous paths.

Definitions
Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D.
Luminosity: the reverse of visibility; the number of other distinct documents that are linked from D.
Luminous paths: a set of inter-linked nodes that occurs a number of times in a web table.
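The two counting definitions above can be sketched directly. The sketch assumes the links of a web table have already been collected into a list of (source, target) pairs; the identifiers and function names are hypothetical, and repeated links are counted once because the definitions speak of distinct documents.

```python
def visibility(links, d):
    """Number of distinct documents, other than d, that link to d."""
    return len({src for src, dst in links if dst == d and src != d})


def luminosity(links, d):
    """Number of distinct documents, other than d, that d links to."""
    return len({dst for src, dst in links if src == d and dst != d})


# Hypothetical links gathered from the tuples of one web table.
LINKS = [
    ("x0", "y0"),
    ("y0", "z1"),
    ("y0", "z2"),
    ("x0", "z1"),
    ("y0", "z1"),  # the same link appearing in another tuple
]

print(visibility(LINKS, "z1"), luminosity(LINKS, "y0"))
```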

Web Schema
Figure (web schema). Variables: x, y, z, e, f. Labels: Diseases, Cancer.

Web Table
Figure (web table with five web tuples; URL shown: http://www.panacea.org/; labels: Diseases, Cancer). Each tuple has instances x0, y0, e0, f0; the z instances are z1, z1, z2, z1, z4.

Projected schema
Figure (projected schema). Variable: z. Label: Cancer.

Web Table after eliminating x and y
Figure. Cancer nodes: z1, z2, z4.

Projected schema
Figure (projected schema). Variables: x, y, z, e. Labels: Diseases, Cancer.

Web Bag
Figure (web bag with five web tuples; URL shown: http://www.panacea.org/; labels: Diseases, Cancer): (x0, y0, z1), (x0, y0, z1), (x0, y0, z2), (x0, y0, z1), (x0, y0, z4).

After removal of identical tuples
Figure. x0, y0 with z1, z2, z4 (labels: Diseases, Cancer).

Figure. Five tuples labeled Cancer: z1, z1, z2, z1, z4.

Figure. Cancer: z1, z2, z4.
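The web bag idea from the slides above (identical web tuples created by a web project operation, then removed) can be sketched with a multiset. Tuples are reduced to a mapping from node variable to an instance identifier; the identifiers echo the figure labels but the data, the function name `web_project` and the use of `collections.Counter` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical web table: node variable -> node instance.
WEB_TABLE = [
    {"x": "x0", "y": "y0", "z": "z1"},
    {"x": "x0", "y": "y0", "z": "z1"},
    {"x": "x0", "y": "y0", "z": "z2"},
    {"x": "x0", "y": "y0", "z": "z1"},
    {"x": "x0", "y": "y0", "z": "z4"},
]


def web_project(table, variables):
    """Keep only the given node variables. Duplicates are retained,
    so the result is a bag rather than a set."""
    return [tuple(t[v] for v in variables) for t in table]


bag = web_project(WEB_TABLE, ["z"])   # five tuples, some identical
counts = Counter(bag)                 # multiplicity of each projected tuple
distinct = sorted(counts)             # the bag after removing identical tuples
print(len(bag), counts[("z1",)], distinct)
```

Keeping the multiplicities rather than discarding duplicates immediately is what makes the bag useful for structure-based mining, since the counts indicate how often a node occurs.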

Visible Nodes
Figure. Tuples labeled Cancer: z1, z2, z1, z4. URL shown: http://www.cancer.org/desc.html

Luminous Paths

More Operators . . .
Web schema operators: schema tightness operator, schema match operator, schema search operator.
Data visualization operators: ranking operators (global and local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort.

Partitioning of web tables
Partitioned web tables can be easily restructured, indexed, monitored and reorganized.
Partition by: time, schema tree structure, keywords.

Warehouse Concept Mart (WCMart)
Subject-oriented.
Concept generation: manual -> autonomous.
Used for: ranking tuples, global web coupling, content-based mining.

Data Mining in Web Warehouse
Scalability of data
Text mining
Mining information from multiple web tables
Interactive web mining
Discover rules
Web Bag
Warehouse Concept Mart

Summary
Current status:
Mechanism for accessing and manipulating web information in WHOWEDA.
Implementing various web operators.
Future research:
What types of information can be summarized?
What types of knowledge can be mined?
Refine the web warehouse architecture.
