1
Web Warehousing : Design and Issues
Sanjay Kumar Madria, Department of Computer Science, Purdue University, West Lafayette, IN 47907
3
World Wide Web
- The Web is growing fast
- More business organizations are putting information on the Web
- Business on the highway
- A myriad of raw data to be processed for information
4
WWW: a collection of multimedia documents in the form of web pages connected via hyperlinks.
5
As the WWW grows, the more chaotic it becomes
- The Web is a fast-growing, distributed, non-administered global information resource
- The WWW allows access to text, image, video, sound and graphic data
- More business organizations are creating web servers
- A more chaotic environment in which to locate information of interest
- "Lost in hyperspace" syndrome
6
Characteristics of WWW
- The WWW is a set of directed graphs
- Data in the WWW is heterogeneous in nature
- Unstructured versus structured information
- No central authority to manage information
- Dynamic versus static information
- Web information discovery: search engines
7
Web is Growing!
- In 1994, the WWW grew by 1758%
- June 1993 - 130
- Dec ,576
- April ,768
- July ,000+
8
‘COM’ domains are increasing!
As of July 1995, there were 6.64 million host computers on the Internet:
- 1.74 million are ‘com’ domains
- 1.41 million are ‘edu’ domains
- 0.30 million are ‘net’
- 0.27 million are ‘gov’
- 0.22 million are ‘mil’
- 0.20 million are ‘org’
9
Top web countries
1. Canada (1) 80%
2. US (4) 140%
3. Ireland (3) 110%
4. Iceland (2) 68%
5. UK (14) 336%
6. Malta (5) 155%
7. Australia (6) 133%
8. Singapore (11) 207%
9. New Zealand (7) 101
10. Sweden (9) 101%
11. Israel (12) 112%
12. Cyprus (8) 72%
13. Hong Kong (15) 148%
14. Norway (10) 64%
15. Switzerland (13) 75%
16. Denmark (16) 105%
10
How users find web sites
- Indexes and search engines
- UseNet newsgroups
- Cool lists
- New lists
- Listservers
- Print ads
- Word-of-mouth and 17
- Linked web advertisement
11
Limitations of Search Engines
- Do not exploit hyperlinks
- Search is limited to string matching
- Queries are evaluated on archived data rather than up-to-date data; no indexing on current data
- Low accuracy
- Replicated results
- No further manipulation possible
12
Limitations of Search Engines
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery
13
Key Objectives
- Design a suitable data model to represent web information
- Development of a web algebra and query language
- Maintenance of web data
- Development of knowledge discovery and web mining tools
- Web warehouse
14
Current Research Projects
- Web query systems: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog
- Semistructured data: LOREL, UnQL, WebOQL
- Website management system: STRUDEL
15
Main Tasks
Modeling and Querying the Web
- View the web as a directed graph
- Content and link based queries
- Example: find the pages that contain the word “clinton” and have a link from a page containing the word “monica”.
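The example query combines a content condition with a link condition. A minimal sketch in Python, assuming a hypothetical in-memory web in which `pages` maps a URL to its text and `links` lists (source, target) hyperlinks; the URLs and page texts are made up for illustration and are not from the presentation:

```python
# Hypothetical toy web: URL -> page text, plus directed hyperlinks.
pages = {
    "http://a.example/": "news about monica",
    "http://b.example/": "statement by clinton",
    "http://c.example/": "clinton biography",
    "http://d.example/": "weather report",
}
links = [
    ("http://a.example/", "http://b.example/"),
    ("http://d.example/", "http://c.example/"),
]

def contains(url, word):
    """Content predicate: does the page text contain the word?"""
    return word in pages[url].lower().split()

def linked_pages(source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    return sorted({
        dst for src, dst in links
        if contains(src, source_word) and contains(dst, target_word)
    })

result = linked_pages("monica", "clinton")
```

Only `http://b.example/` qualifies here: the other page mentioning “clinton” is linked from a page that does not mention “monica”.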
16
Information Extraction and integration
- Wrapper: a program that extracts a structured representation of the data (a set of tuples) from HTML pages
- Mediator: integration of data
Web Site Construction and Restructuring
- Creating sites
- Modeling the structure of web sites
- Restructuring data
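A wrapper in the sense used here can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the wrapper of any system named in the slides: the class name `TableWrapper` and the sample HTML are hypothetical, and the wrapper only handles simple, non-nested tables.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collects the cell text of each <tr> row as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []         # extracted tuples
        self._row = None       # cells of the row currently being parsed
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Made-up sample page.
html = """<table>
<tr><td>AIDS</td><td>Elisa Test</td></tr>
<tr><td>Cancer</td><td>Biopsy</td></tr>
</table>"""

wrapper = TableWrapper()
wrapper.feed(html)
tuples = wrapper.rows
```

A mediator would then integrate the tuples produced by several such wrappers, one per source.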
17
What to Model
- Structure of web sites
- Internal structure of web pages
- Contents of web sites at finer granularities
18
Data Representation of Web Data
- Graph data models
- Semistructured data models (also graph based)
19
Graph Data Model
- Labeled graph data model in which nodes represent web pages and arcs represent links between pages
- The labels on arcs can be viewed as attribute names
- Regular path expression queries
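A regular path expression query over a labeled graph can be sketched as follows. The graph contents, the function name `path_query`, and the convention of joining arc labels with `.` before matching are assumptions made for this illustration; the traversal is bounded by `max_depth` so that cycles cannot cause an endless search.

```python
import re

# Hypothetical labeled graph: node -> list of (arc label, target node).
graph = {
    "home": [("diseases", "list")],
    "list": [("issues", "aids")],
    "aids": [("symptoms", "aids_symptoms"), ("treatment", "aids_treatment")],
    "aids_symptoms": [],
    "aids_treatment": [],
}

def path_query(start, pattern, max_depth=5):
    """Return nodes reachable from start along a path whose arc labels,
    joined by '.', fully match the regular expression `pattern`."""
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) < max_depth:
            for label, target in graph[node]:
                stack.append((target, labels + [label]))
    return sorted(found)

hits = path_query("home", r"diseases\.issues\.(symptoms|treatment)")
any_depth = path_query("home", r"(\w+\.)*treatment")
```

The second query uses a starred prefix, so it finds the treatment page without spelling out the intermediate labels.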
20
Semistructured Data Models
- Irregular data structure
- No fixed schema is known in advance; the schema may be implicit in the data
- The schema may be large and may change frequently
- The schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated
21
- Data is not strongly typed: for different objects, the values of the same attribute may be of differing types (heterogeneous sources)
- No restriction on the set of arcs that emanate from a given node in a graph, or on the types of the values of attributes
- Ability to query the schema: arc variables are bound to labels on arcs rather than to nodes in the graph
22
Graph based Query Languages
- Use graphs to model databases
- Support regular path expressions and graph construction in queries
- Examples: GraphLog for hypertext queries; graph query language for OO
23
Query Languages for Semi-Structured data
- Use labeled graphs
- Query the schema of the data
- Ability to accommodate irregularities in the data, such as missing links
- Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)
24
Comparison of Query Systems
25
WebSQL-University of Toronto
- Models the web as a relational database
- Uses two relations, Document and Anchor
- The Document relation has one tuple for each document in the web, and the Anchor relation has one tuple for each anchor in each document
26
WebSQL
- SQL-like query language for extracting information from the web
- Capable of systematic processing of either all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both
- Provides transparent access to index servers
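The two-relation view can be illustrated with ordinary SQL over an in-memory SQLite database. This is not WebSQL syntax: the column names (`url`, `title`, `text`, `base`, `href`, `label`) and the sample rows are assumptions for the sketch, and a real system would populate the relations on demand from the web rather than from literals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Document (url TEXT PRIMARY KEY, title TEXT, text TEXT);
CREATE TABLE Anchor (base TEXT, href TEXT, label TEXT);
""")

# One tuple per document (made-up data).
conn.executemany("INSERT INTO Document VALUES (?, ?, ?)", [
    ("http://a.example/", "Home", "list of diseases"),
    ("http://a.example/aids", "AIDS", "symptoms and treatment"),
    ("http://b.example/", "Other", "unrelated"),
])
# One tuple per anchor in each document (made-up data).
conn.executemany("INSERT INTO Anchor VALUES (?, ?, ?)", [
    ("http://a.example/", "http://a.example/aids", "AIDS"),
    ("http://a.example/aids", "http://b.example/", "more"),
])

# All documents linked directly from a given page.
rows = conn.execute("""
    SELECT d.url, d.title
    FROM Anchor a JOIN Document d ON d.url = a.href
    WHERE a.base = ?
    ORDER BY d.url
""", ("http://a.example/",)).fetchall()
```

Joining Anchor with Document once follows one link; following paths of arbitrary length is what the path-pattern feature of the query language adds on top of this relational view.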
27
WebOQL (University of Toronto)
- Provides a framework that supports a large class of data restructuring operations
- Simple semistructured data model for documents and record-based data
- OQL-like syntax and regular expressions
- Serves as a two-way bridge between databases and the Web
28
WebDB
- Views the WWW as multimedia documents in the form of web pages
- WQL supports selection, aggregation, sorting, summary, grouping, and projection on title, URL, keywords, tables, forms, images, etc.
29
Presentation Overview
- WHOWEDA: warehouse of web data
- Research objectives
- Current research
- Web Data Model (WICM)
- Web Algebra
- Future work
30
“If you build it, they will come”
- More chaotic
- Increasingly difficult to locate information
- Related data are scattered in a piecemeal fashion
- Data, data everywhere... but how to find it?
31
How does it affect the corporate world?
- Lack of credibility of data
  - Different sites with different data
  - Same site, different data
- Historical information is not available
  - Previous versions of web data
  - How does web data change with time?
  - Summarization over time
- Data to information
- Reduction in productivity
  - Analysis is manual
32
Web Warehouse: Its Business Value
- A local data warehouse is inadequate
- The Web is “hot” as a commercial medium
  - Current size
  - Future growth prospects
  - Exceedingly attractive demographics
33
WHOWEDA Research Objectives
- Build a web warehouse:
  - Web information access
  - Web information manipulation
  - Efficient visualization of web information
  - Maintenance of web data
  - Web data mining
- Overcome existing limitations
- Provide effective mechanisms to manipulate web information
34
WHOWEDA - What?
- WareHouse Of Web Data
- Subject-oriented
- Integrated
- Temporal
- Granularity: lower, higher
- Some summary
- Not updatable
- Alternative information sources
35
WHOWEDA!
www.cais.ntu.edu.sg:8000/~whoweda
A WareHouse Of WEb DAta
- Web Information Coupling Model (WICM)
- Web Objects
- Web Schema
- Web Information Coupling Algebra
- Web Information Maintenance
- Web Mining and Knowledge Discovery
36
[Figure: WHOWEDA architecture. Elements shown: WWW, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Web Querying & Analysis Component, Concept Mart, Web Warehouse with Web Marts, User.]
37
[Figure: web manipulation in the warehouse. Elements shown: User, WWW, Web Query & Display, Warehouse Concept Mart, Pre-processing, Web Warehouse, Data Visualization, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Schema Tightness, Schema Search, Schema Match, Web Select, Web Project, Web Join, Web Union, Web Intersection.]
38
[Figure: Global Web Coupling. Elements shown: WWW, Warehouse Concept Mart, Global Web Coupling, Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr).]
39
[Figure: granularity of web information. Elements shown: Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr) at lower-level granularity; Web Information Manipulation Operators; summarized data at higher-level granularity.]
40
Web Objects
- Node: url, title, format, size, date, text
- Link: source-url, target-url, label, link-type
- Web tuple
- Web table
- Web schema
- Web database
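The web objects listed here map naturally onto a few record types. The sketch below uses Python dataclasses; the attribute names follow the slide, while the class layout, the default values, and the sample URLs are assumptions made for illustration rather than the WHOWEDA implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Node:
    """A web document with the node attributes listed on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""

@dataclass(frozen=True)
class Link:
    """A hyperlink with the link attributes listed on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""

@dataclass
class WebTuple:
    """A set of inter-linked documents, stored as nodes plus links."""
    nodes: List[Node]
    links: List[Link]

@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)

# Made-up example data.
home = Node(url="http://a.example/", title="List of Diseases")
aids = Node(url="http://a.example/aids", title="AIDS")
t = WebTuple(nodes=[home, aids],
             links=[Link(home.url, aids.url, label="AIDS", link_type="local")])
diseases = WebTable("Diseases", [t])
```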
41
Web Schema
- Metadata in the warehouse
- Structural ‘summary’ of a web table
- Coupling of related information begins with a query graph
- Query graph -> Web schema
42
Web Information Coupling System
A database system to couple related web information
- Web data model
- Web objects
- Web schema
- Web algebra
43
Meta-data in WHOWEDA
- Web schema
- Schema tree
- Information extracted from each web document or node: URL, size, keywords, links, title, language, multimedia details, version history
44
Web Algebra
- Formal foundation of data representation and manipulation in a web warehouse
- Web operators:
  - Information access operator
  - Information manipulation operators
  - Web schema operators
  - Data visualization operators
45
Directly querying the WWW is an expensive and repetitive affair
- Information is already materialized in different web tables in the web database
- We need a means to gather this similar information by additional manipulation of the materialized web tables
4/10/2019
46
Global Coupling - Information Access
- To integrate data from the Web
- To create historical data
- To couple related information from the WWW satisfying a web schema
- Operator to create web tables
- From a web with no schema to a web table with a web schema
47
Global Coupling
- Matches portions of the web that satisfy the web schema
- Input is a query graph
- Output is a web table
- Example
48
Information Manipulation
Used for analysis of web data in the warehouse
- Web select
- Web project
- Local web coupling
- Web join
- Web cartesian product
- Web union
- Web intersect
49
Join Processing in Web Databases
50
Web Join
- Concatenates tuples based on identical nodes or documents
- Inputs are two web tables and their schemas
- Output is a joined table
- Types: pi-web join, sigma-web join, outer joins, web composition, semi web join
51
Web Join
- Analyse web tables storing temporal data
- Used for combining related data from various web tables
- Mechanism to provide summarized information
- Mechanism to find an alternative web document in case of a “Document Not Found” error
- Example
52
Example 1
Produce a list of diseases with their symptoms, evaluation procedures and treatment, starting from the web site at
Web table: Diseases
53
[Figure: query graph for web table Diseases. Variables: x, y, z, w, e, f, g, p, q. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation.]
54
[Figure: a web tuple of web table Diseases. Instances: x0, y1, z1, w1, e1, f1, g1, p2, q1. Labels: List of Diseases, Issues, AIDS, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Elisa Test.]
55
Example 2
Produce a list of drugs, and their uses and side effects, starting from the web site at
Web table: Drugs
56
[Figure: query graph for web table Drugs. Variables: a, b, c, d, r, s, k. Labels: List of Diseases, Drug list, Issues, Uses, Use, Side effects.]
57
[Figure: a web tuple of web table Drugs. Instances: a0, b1, c1, d1, r1, s1, k1. Labels: List of Diseases, Drug list, Issues, AIDS, Indavir, Use, Uses of Indavir, Side effects, Side effects of Indavir.]
58
Web Join Operator
- Information manipulation operator
- Manipulates information residing in a web database to derive additional information
- Harnesses useful, composite information from two web tables
- Capitalizes on the reuse of data already retrieved from the WWW in order to reduce the execution time of queries
59
Joinable Nodes
- Node variables participating in the web join process
- Expressed as a pair
- The two nodes in a pair should have identical URLs
60
Web Join
- Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever joinable nodes exist
- Joinable nodes are identified from the schemas of the two web tables
- The URLs of the joinable nodes are identical
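The join described here can be sketched with web tuples reduced to dictionaries that map a node variable to the URL bound to it. That representation, the function name `web_join`, the choice of (z, c) as the joinable pair, and all URLs are assumptions for illustration; links, schemas, and predicates are left out.

```python
def web_join(table_a, table_b, joinable):
    """Concatenate a tuple of table_a with a tuple of table_b whenever every
    joinable pair (var_a, var_b) is bound to the same URL in both tuples."""
    joined = []
    for ta in table_a:
        for tb in table_b:
            if all(ta[va] == tb[vb] for va, vb in joinable):
                joined.append({**ta, **tb})
    return joined

# Made-up tuples; node variables follow the Diseases and Drugs examples.
diseases = [
    {"x": "http://p.example/", "z": "http://p.example/aids",
     "w": "http://p.example/aids/elisa"},
    {"x": "http://p.example/", "z": "http://p.example/cancer",
     "w": "http://p.example/cancer/biopsy"},
]
drugs = [
    {"a": "http://p.example/", "c": "http://p.example/aids",
     "d": "http://p.example/indavir"},
]

joined = web_join(diseases, drugs, [("z", "c")])
```

Only the AIDS tuple of Diseases shares a URL with the Drugs tuple, so the joined table has a single tuple carrying the node variables of both inputs.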
61
Flavors of Web Join
- Natural web join
- Theta web join (web join followed by web select)
- Examples
Further flavors:
- Single-node join
- Multi-node join
62
[Figure: joined schema of web tables Diseases and Drugs. Variables: x, y, z, w, e, f, g, p, q, b, c, d, r, s, k. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Drug list, Side effects, Use, Uses.]
63
[Figure: a joined web tuple. Instances: x0, y1, z1, w1, e1, f1, g1, p2, q1, b1, c1, d1, r1, s1, k1. Labels: AIDS, AIDS treatment, Symptoms of AIDS, AIDS Evaluation, Elisa Test, Indavir, Side effects of Indavir, Uses of Indavir.]
64
Join Existence
- Given two web tables, determine whether they are joinable
- Inspect the schemas of the web tables
- Joinability conditions are based on:
  - node predicates
  - link predicates
  - node and link predicates
  - locus of a node relative to a joinable node
65
Join Construction
- To construct a joined schema, we construct its node set, link set, connectivity set and predicate set
- Construction of the joined table: concatenate the web tuples of the two input tables over the joinable nodes
66
Web Select
- Extracts web tuples from a web table that satisfy certain conditions on node and link variables and on connectivities
- Input is a select schema
- Output is a web table satisfying the select schema
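As a rough illustration of web select, the sketch below filters web tuples by predicates on node variables only; conditions on link variables and connectivities are omitted. The dictionary representation, the name `web_select`, and the sample data are hypothetical.

```python
def web_select(table, node_conditions):
    """Keep the web tuples in which every listed node variable satisfies its predicate."""
    return [
        t for t in table
        if all(pred(t[var]) for var, pred in node_conditions.items())
    ]

# Made-up web table: each tuple binds node variable z to a document.
diseases = [
    {"z": {"url": "http://p.example/aids", "title": "AIDS"}},
    {"z": {"url": "http://p.example/cancer", "title": "Cancer"}},
]

selected = web_select(
    diseases, {"z": lambda node: "cancer" in node["title"].lower()}
)
```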
67
Web Couple: Coupling web information
68
Why web coupling?
- Related information on the web is supplied by different information providers
- Web documents containing similar information can reside in different web tables in the web database
69
Why web coupling?
- The web couple operator gives us the capability to manipulate these web tables to harness useful related information
70
Web Couple Operator
- The web couple operator is a composite operator: a web cartesian product followed by a web select
- A web cartesian product followed by a web select is a frequently used operation, which motivates a separate composite operator to handle it
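The composition can be sketched directly: build the cartesian product, then filter it. Web tuples are again reduced to dictionaries keyed by node variable, and the function names, the equality condition, and the sample values are assumptions for illustration.

```python
from itertools import product

def web_cartesian_product(table_a, table_b):
    """Pair every web tuple of table_a with every web tuple of table_b."""
    return [{**ta, **tb} for ta, tb in product(table_a, table_b)]

def web_couple(table_a, table_b, condition):
    """Composite operator: web cartesian product followed by web select."""
    return [t for t in web_cartesian_product(table_a, table_b) if condition(t)]

# Made-up tables with one node variable each.
diseases = [{"z": "AIDS"}, {"z": "Cancer"}]
drugs = [{"c": "AIDS"}, {"c": "Malaria"}]

all_pairs = web_cartesian_product(diseases, drugs)
coupled = web_couple(diseases, drugs, lambda t: t["z"] == t["c"])
```

With two tuples in each input, the product has four tuples, and the select step keeps the one in which the coupling nodes agree.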
71
Val(p) is the operand of the op(p).
72
Definitions
- Coupling nodes: node variables participating in the web coupling. The coupling nodes of two web schemas are expressed as a pair, e.g. (c, z), since they cannot exist as a single node variable.
73
Definitions
- One coupling node variable can be in more than one pair; that is, the pairs in a set of coupling node pairs are not necessarily disjoint
- The attribute of the coupling node, as defined in the predicate of the node, is called the coupling attribute
- The predicate is called the coupling predicate
74
Web Coupling
75
Types of web coupling
- Single-node coupling: only one node variable in each schema is involved in the web coupling
- Multi-node coupling: more than one node variable in each schema participates in the web coupling
76
Types of web coupling
- System-driven web coupling: the system decides which node variables are to be coupled (the coupling nodes). If not even one pair of coupling nodes can be identified, the web tables cannot be coupled.
77
Types of web coupling
- User-driven web coupling: the user decides which node variables are to be coupled (the coupling nodes). Coupling is performed only on the user-specified node variable(s).
78
Types of web coupling
- Attribute-driven web coupling: the user specifies the coupling attributes. Coupling is performed only on the user-specified coupling attribute(s).
79
Types of web coupling
- Value-driven web coupling: the user specifies the values of the node attributes on which coupling should be performed. Coupling is performed only on the user-specified attribute values.
80
Levels of web coupling
- Schema level web coupling
- Tuple level web coupling
81
Schema level web coupling
- We inspect the schemas to decide whether the two web tables can be coupled
- If coupling conditions cannot be identified, the two web tables cannot be coupled
- We do not inspect the web tuples in the web tables
82
Schema level web coupling
- Let n and m be the numbers of web tuples in the two input web tables
- The coupled web table produced by schema level web coupling will then always have n*m web tuples
83
Tuple level web coupling
- We inspect the web tuples of the two input web tables to identify nodes with similar information
- The number of web tuples in the coupled web table is at most n*m
84
Why two levels?
- A schema does not capture all the information in the web documents of a web table, so it is not always possible to identify a coupling condition by inspecting the schemas
- It is possible to find coupling nodes that are not defined in the schemas
85
Why two levels?
- Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (documents that cannot be identified from their schemas), at the expense of additional processing
86
Conditions for web coupling
- URLs with the same directory name, such as “/computer/”, may contain similar information
- Paths with “/cgi-bin/” are not considered
- Include all conditions for web join
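The first two conditions can be read as a simple URL heuristic. The sketch below is one possible interpretation and not the WHOWEDA rule itself: it assumes that two URLs are candidates for coupling when they share at least one directory name, that a final path segment containing a dot is a file name, and that any path containing “/cgi-bin/” is excluded. The function names and URLs are hypothetical.

```python
from urllib.parse import urlparse

def directory_names(url):
    """Directory names on the URL path, ignoring a trailing file name."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: a last segment with a dot is a file, not a directory.
    if parts and "." in parts[-1]:
        parts = parts[:-1]
    return set(parts)

def may_couple(url_a, url_b):
    """Heuristic: same directory name somewhere on both paths, no cgi-bin paths."""
    if "/cgi-bin/" in urlparse(url_a).path or "/cgi-bin/" in urlparse(url_b).path:
        return False
    return bool(directory_names(url_a) & directory_names(url_b))

same_dir = may_couple("http://a.example/computer/index.html",
                      "http://b.example/shop/computer/list.html")
cgi = may_couple("http://a.example/cgi-bin/computer/q",
                 "http://b.example/computer/")
different = may_couple("http://a.example/computer/",
                       "http://b.example/music/")
```

Since the slide says such URLs "may" contain similar information, a match here only nominates a pair of nodes; it does not establish that their contents are related.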
87
Construction of coupled schema (schema level)
Two cases:
- At least one pair of coupling nodes is identical (same URL)
- None of the pairs is identical
88
Web Bags
- Existence of identical web tuples
- Created by the web project operation
- Structure-based mining
- Used for discovering:
  - Visible nodes
  - Luminous nodes
  - Luminous paths
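A web bag arises when a projection drops the node variables that made tuples distinct. A minimal sketch, assuming web tuples are dictionaries from node variable to node identifier; the name `web_project` and the sample identifiers are illustrative only.

```python
from collections import Counter

def web_project(table, keep):
    """Keep only the node variables in `keep`; duplicate tuples are retained."""
    return [tuple(t[v] for v in keep) for t in table]

# Made-up web table: the tuples are distinct before projection.
table = [
    {"x": "x0", "y": "y0", "z": "z1"},
    {"x": "x0", "y": "y1", "z": "z1"},
    {"x": "x0", "y": "y0", "z": "z2"},
    {"x": "x0", "y": "y2", "z": "z4"},
]

bag = web_project(table, ["z"])   # web bag: z1 appears twice
counts = Counter(bag)             # multiplicity of each projected tuple
distinct = sorted(set(bag))       # web table after removing identical tuples
```

The multiplicities in `counts` are the information that is lost when identical tuples are removed, which is why the bag is kept for mining.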
89
Definitions
- Visibility of a web document or node D in a web table W: the number of different web documents in W that have links to D
- Luminosity: the reverse of visibility; the number of other distinct documents that are linked from D
- Luminous path: a set of inter-linked nodes that occurs a number of times in a web table
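The two counting definitions translate into short set computations. In this sketch a web table is a list of web tuples and each tuple is a list of (source, target) links; that representation and the node identifiers are assumptions, and luminous paths are not covered.

```python
def visibility(table, doc):
    """Number of different documents in the table that link to doc."""
    return len({src for links in table for src, dst in links
                if dst == doc and src != doc})

def luminosity(table, doc):
    """Number of other distinct documents that doc links to."""
    return len({dst for links in table for src, dst in links
                if src == doc and dst != doc})

# Made-up web table: x0 links to y0, and y0 links to z1, z2 and z4.
table = [
    [("x0", "y0"), ("y0", "z1")],
    [("x0", "y0"), ("y0", "z2")],
    [("x0", "y0"), ("y0", "z4")],
]

vis_y0 = visibility(table, "y0")
lum_y0 = luminosity(table, "y0")
```

Because sets are used, a link that is repeated across tuples is counted once, matching the words "different" and "distinct" in the definitions.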
90
Web Schema
[Figure: web schema. Variables: x, y, z, e, f. Labels: Diseases, Cancer.]
91
Web Table
[Figure: web table of web tuples starting from http://www.panacea.org/. Instances: x0, y0, e0, f0, with z1, z2 and z4. Labels: Diseases, Cancer.]
92
Projected schema
[Figure: projected schema with node variable z and label Cancer.]
93
Web Table after eliminating x and y
[Figure: web table with nodes z1, z2 and z4, labeled Cancer.]
94
Projected schema
[Figure: projected schema. Variables: x, y, z, e. Labels: Diseases, Cancer.]
95
Web Bag
[Figure: web bag of web tuples starting from http://www.panacea.org/. Instances: x0, y0, with z1 (repeated), z2 and z4. Labels: Diseases, Cancer.]
96
After removal of identical tuples
[Figure: web table with instances x0, y0, and z1, z2, z4. Labels: Diseases, Cancer.]
97
[Figure: web bag with tuples z1, z1, z2, z1, z4, each labeled Cancer.]
98
[Figure: web table with tuples z1, z2, z4, labeled Cancer.]
99
Visible Nodes
[Figure: visible nodes example for http://www.cancer.org/desc.html. Instances: z1, z2, z4, labeled Cancer.]
100
Luminous Paths
101
More Operators . . .
- Web schema operators: schema tightness operator, schema match operator, schema search operator
- Data visualization operators: ranking operators (global and local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort
102
Partitioning of web tables
- Partitioned web tables can be:
  - restructured easily
  - indexed easily
  - monitored easily
  - reorganized easily
- Partitioning by: time; schema; tree structure; keywords
103
Warehouse Concept Mart (WCMart)
- Subject oriented
- Concept generation: manual -> autonomous
- Used for:
  - Ranking tuples
  - Global web coupling
  - Content-based mining
104
Data Mining in Web Warehouse
- Scalability of data
- Text mining
- Mining information from multiple web tables
- Interactive web mining
- Discover rules
- Web Bag
- Warehouse Concept Mart
105
Summary
Current status:
- Mechanism for accessing and manipulating web information in WHOWEDA
- Implementing various web operators
Future research:
- What types of information can be summarized?
- What types of knowledge can be mined?
- Refine the web warehouse architecture