1
Detecting and Representing Relevant Page-Level Web Deltas
Sanjay Kumar Madria
Department of Computer Science, Purdue University
West Lafayette, IN 47907
skm@cs.purdue.edu
2
Current Situation of W3
- The Web allows information to change at any time and in any way
- Two forms of change
  - Existence
  - Structure and content modification
- A change leaves no trace of the previous document: the new version replaces its antecedents
3
Problems of Change Management
- Problem: detecting, representing and querying these changes
- The problem is challenging
  - Typical database approaches that detect changes through triggering mechanisms are not usable
  - Information sources typically do not keep historical information in a format that is accessible to outside users
4
Motivating Example
- Assume that there is a web site at www.panacea.gov
  - It provides information related to drugs used for various diseases
5
Motivating Example
- Suppose that, on 15th January, a user wishes to find out periodically (every 30 days)
  - information related to the side effects and uses of drugs used for various diseases, and
  - changes to this information at the page level compared to its previous version
6
Structure of www.panacea.gov
- The web page at www.panacea.gov contains a list of diseases
- The link for each disease points to a web page containing a list of drugs used for prevention and cure of that disease
- The hyperlinks associated with each drug point to documents containing a list of issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.)
- From the hyperlinks associated with each issue, one can retrieve the details of that issue for a particular drug
7
A Snapshot as of 15th Jan
[Figure: link structure of www.panacea.gov. Disease nodes: AIDS, Cancer, Heart disease, Diabetes, Impotence, Alzheimer’s Disease. Drug nodes: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Ibuprofen. Leaf nodes: Side effects, Uses.]
8
Some Changes
- 25th January
  - Links related to Diabetes are removed
  - A new link containing information related to Parkinson’s Disease is added
  - Information related to issues, side effects and uses of various drugs for Cancer is also modified
9
A Partial Snapshot as of 25th Jan
[Figure: partial link structure of www.panacea.gov. Nodes shown: Parkinson’s Disease, Cancer, Diabetes, Tolcapone, Side effects, Uses.]
10
Some Changes
- 30th January
  - Links related to Impotence are modified: the information was previously provided by www.pfizer.com and is now provided by www.panacea.gov
  - The inter-linked structure of the web pages related to Caverject is also modified
  - Information about Viagra, a new drug for Impotence, is added
11
A Partial Snapshot as of 30th Jan
[Figure: partial link structure of www.panacea.gov. Nodes shown: Impotence, Vasomax, Caverject, Viagra, Side effects, Uses.]
12
Some Changes
- 8th February
  - The link structure of Heart Disease is modified
    - The label Heart Disease is changed to Heart Disorder
    - The content of the pages dealing with side effects and uses of Hirudin is updated
    - The inter-linked document structure of Niacin is modified
  - Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed
13
On 8th February
[Figure: partial link structure of www.panacea.gov. Nodes shown: Heart disorder, Alzheimer’s Disease, Niacin, Hirudin, Side effects, Uses.]
14
A Snapshot as of 15th Feb
[Figure: link structure of www.panacea.gov. Disease nodes: AIDS, Cancer, Heart disease, Impotence, Alzheimer’s Disease, Parkinson’s Disease. Drug nodes: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Viagra. Leaf nodes: Side effects, Uses.]
15
Objectives
- Web deltas: changes to web information
- Detect and represent relevant page-level web deltas
  - Changes that are relevant to the user’s query, not arbitrary web deltas
  - Restricted to the page level
- Detect those documents which are
  - added to the site
  - deleted from the site
  - modified in content or structure
- Determine how these delta documents are related to one another and to other documents relevant to the user’s query
16
The WHOWEDA Project
- WHOWEDA: A WareHouse of WEb DAta
- Goal: to design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web
- Data model: WHOM (WareHouse Object Model)
17
Overview of WHOM
- Our web warehouse can be conceived of as a collection of web tables
- A web table is represented by a set of web tuples and a set of web schemas
- A web tuple is a directed graph of nodes and links that satisfies a web schema
- Nodes and links contain the content, metadata and structural information associated with Web documents and hyperlinks
  - Tree representation
- A web algebra provides web operators to manipulate web tables
  - Global Coupling, Web Select, Web Join, etc.
18
Overview of our approach
- Step 1: Two snapshots of the relevant data, old and new, are coupled from the Web using the global web coupling operation and materialized in two web tables.
- Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables
  - The results are the joined, left outer joined and right outer joined web tables
- Step 3: Delta web tables containing the different types of web deltas are generated from these resultant web tables.
- The following slides elaborate on these steps.
19
Step 1: Retrieving snapshots of Web data using Global Web Coupling
20
Web Query Specification
- Features:
  - Draw a web query as a directed, connected, acyclic graph (also called a coupling query)
  - The query can also be specified in text form
  - Specify search conditions on the nodes and edges of the graph
- Performed by the global web coupling operator
21
Coupling Query
- Set of node variables Xn
  - Each variable represents a set of Web documents
- Set of link variables Xl
  - Each variable represents a set of hyperlinks
- Set of connectivities C, in DNF, defined over node and link variables
  - Specifies the hyperlink structure of the documents
- Set of predicates P defined over some of the node and link variables
  - Specify metadata, content or structural conditions
- Set of coupling query predicates Q
  - Conditions on the execution of the query
22
Example
- Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov
  - information related to the side effects and uses of drugs used for various diseases
- The result of the query is stored in the form of a web table
23
Coupling Query
- Xn = {a, b, d, k}
- Xl = { - }
- P = {p1, p2, p3, p4}
  - p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”
  - p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list”
  - p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses”
  - p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”
24
Coupling Query
- C = k1 AND k2 AND k3
  - k1 = a b
  - k2 = b d
  - k3 = b k
- Q = {q1}
  - q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
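The coupling query above can be written down as plain data. The following is a minimal sketch only: the class and field names (`Predicate`, `CouplingQuery`, `undeclared_variables`) are hypothetical and not part of WHOM, and because the link expressions inside k1, k2 and k3 did not survive in the slide text, each connectivity is assumed here to be a plain directed edge between two node variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One condition on a node variable (hypothetical representation)."""
    variable: str   # node variable the predicate constrains
    kind: str       # METADATA, CONTENT or COUPLING_QUERY
    path: str       # attribute or element path, e.g. "url"
    operator: str   # EQUALS or NON-ATTR-CONT
    value: str


@dataclass(frozen=True)
class CouplingQuery:
    node_vars: frozenset        # Xn
    link_vars: frozenset        # Xl
    connectivities: tuple       # C, assumed to be a conjunction of (source, target) edges
    predicates: tuple           # P
    query_predicates: tuple     # Q

    def undeclared_variables(self):
        """Return variables used in C, P or Q that are missing from Xn."""
        used = {v for edge in self.connectivities for v in edge}
        used |= {p.variable for p in self.predicates}
        used |= {q.variable for q in self.query_predicates}
        return used - self.node_vars


query = CouplingQuery(
    node_vars=frozenset({"a", "b", "d", "k"}),
    link_vars=frozenset({"-"}),
    # Assumed edge form of k1, k2, k3 (the original link expressions are lost).
    connectivities=(("a", "b"), ("b", "d"), ("b", "k")),
    predicates=(
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTR-CONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTR-CONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTR-CONT", "side effects"),
    ),
    query_predicates=(
        Predicate("b", "COUPLING_QUERY", "polling_frequency", "EQUALS", "30 days"),
    ),
)

print(query.undeclared_variables())
```

Keeping the five components separate mirrors the slide's definition, and the small consistency check catches a predicate or connectivity that refers to a variable missing from Xn.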
25
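Pictorial Representation
[Figure: query graph over the node variables a, b, k and d. Node a is www.panacea.gov; b, d and k carry the keywords “drug list”, “side effects” and “uses”. Edge labels shown: {1, 3} and {1, 6}.]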
26
Web Table Drugs (15th Jan)
- b0 a0 u0 k0 d0: AIDS, Indavir
- b0 a0 u1 k1 d1: AIDS, Ritonavir
- b1 a0 k2 d2: Cancer, Beta Carotene
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
27
Web Table Drugs (15th Jan)
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
- b4 a0 u7 k7 d6 u8: Impotence, Caverject
- b2 a0 u2 k3 d3: Heart Disease, Hirudin
28
Web Table New Drugs (15th Feb)
- b0 a0 u0 k0 d0: AIDS, Indavir
- b0 a0 u1 k1 d1: AIDS, Ritonavir
- b1 a0 k2 d2: Cancer, Beta Carotene
- b2 a0 u2 k3 d3: Heart Disorder, Hirudin
29
Web Table New Drugs (15th Feb)
- b2 a0 u3 k7 d7: Heart Disorder, Niacin
- b4 a0 u7 k7 d6: Impotence, Caverject
- b4 a0 u9 k8 d8: Impotence, Vasomax
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
30
Web Table New Drugs (15th Feb)
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
- b4 a0 u12 k9 d9: Impotence, Viagra
31
Step 2: Performing Web Join, Left and Right Outer Web Join
32
Web Join
- An information composition operator
- Combines two web tables into a single web table under certain conditions
- The tables are combined by concatenating a web tuple of one web table with a web tuple of the other whenever there exist joinable nodes
- Two nodes are joinable if they are identical
- Two nodes are identical if their URLs and last modification dates are the same
- The joined web tuple is stored in a different web table
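The joinability rule above can be illustrated with a short sketch. This is not the WHOWEDA implementation: a web tuple is simplified to a list of nodes (links are ignored), and the `Node` class, the function names, the page URLs below the site root and the dates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    url: str
    last_modified: str  # hypothetical ISO date string


def identical(x, y):
    """Two nodes are identical if URL and last modification date match."""
    return x.url == y.url and x.last_modified == y.last_modified


def joinable_nodes(old_tuple, new_tuple):
    """Nodes of the old tuple that have an identical node in the new tuple."""
    return [o for o in old_tuple if any(identical(o, n) for n in new_tuple)]


def web_join(old_table, new_table):
    """Concatenate every pair of web tuples that shares a joinable node."""
    joined = []
    for old_tuple in old_table:
        for new_tuple in new_table:
            shared = joinable_nodes(old_tuple, new_tuple)
            if shared:
                joined.append((old_tuple, new_tuple, shared))
    return joined


# Hypothetical data: b0 is unchanged, every other page was modified.
b0 = Node("b0", "www.panacea.gov/aids", "2000-01-10")
k0_old = Node("k0", "www.panacea.gov/aids/uses", "2000-01-11")
k0_new = Node("k0", "www.panacea.gov/aids/uses", "2000-02-03")
b1_old = Node("b1", "www.panacea.gov/cancer", "2000-01-05")
b1_new = Node("b1", "www.panacea.gov/cancer", "2000-01-25")

old_table = [[b0, k0_old], [b1_old]]
new_table = [[b0, k0_new], [b1_new]]

result = web_join(old_table, new_table)
print(len(result), [n.node_id for n in result[0][2]])
```

Only the first pair of tuples is joined, through the unchanged node b0; the tuples about b1 share no identical node and are left out of the joined table.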
33
Web Join
- Join the web tables Drugs and New Drugs
- Nodes which have not undergone any changes are the joinable nodes in these two web tables
- Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
34
Joined web table
- (1) b0 a0 u0 k0 d0 (AIDS, Indavir) joined with a0 (AIDS)
- (2) b0 a0 u1 k1 d1 (AIDS, Ritonavir) joined with a0 (AIDS)
- (3) b0 a0 u0 k0 d0 (AIDS, Indavir) joined with a0 u1 k1 d1 (AIDS, Ritonavir)
35
Joined Web Table
- (4) b2 a0 u3 k4 d7 (Heart Disorder, Niacin) joined with a0 u2 k3 d3 (Heart Disease, Hirudin)
- (5) b4 a0 u7 (Impotence, Caverject) joined with b4 a0 u7 k7 d6 u8 (Impotence, Caverject)
36
Joined Table
- (6) b2 a0 u2 k3 d3 (Heart Disease, Hirudin) joined with a0 u2 k3 d3 (Heart Disorder, Hirudin)
37
Types of web tuples
- Web tuples in which all the nodes are joinable
  - These result from joining two versions of a web tuple that has remained unchanged during the transition
- Web tuples in which
  - some of the nodes are joinable nodes
  - the remaining nodes are the result of insertion, deletion or modification operations
[Figure: joined web tuple (5): b4 a0 u7 (Impotence, Caverject) joined with b4 a0 u7 k7 d6 u8 (Impotence, Caverject)]
38
Types of web tuples
- Tuples in which
  - some of the nodes are joinable nodes,
  - some of the remaining nodes are the result of insertion, deletion or modification, and
  - the rest have remained unchanged during the transition
[Figure: joined web tuple (3): b0 a0 u0 k0 d0 (AIDS, Indavir) joined with a0 u1 k1 d1 (AIDS, Ritonavir)]
39
Outer Web Join
- Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
- Outer web join enables us to identify them
  - Left outer web join
  - Right outer web join
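The dangling tuples described above can be picked out with a simple filter. The sketch below is illustrative only and makes several assumptions: a node is reduced to a hypothetical (URL, last modification date) pair so that identical nodes compare equal, a web tuple is a tuple of such pairs, and the function names and sample data are invented for the example.

```python
def is_joinable(old_tuple, new_tuple):
    """Two web tuples are joinable if they share at least one identical node."""
    return bool(set(old_tuple) & set(new_tuple))


def left_outer_web_join(old_table, new_table):
    """Old tuples that join with no tuple of the new table (dangling on the left)."""
    return [t for t in old_table
            if not any(is_joinable(t, n) for n in new_table)]


def right_outer_web_join(old_table, new_table):
    """New tuples that join with no tuple of the old table (dangling on the right)."""
    return [t for t in new_table
            if not any(is_joinable(o, t) for o in old_table)]


# Hypothetical nodes as (url, last_modified) pairs.
aids = ("www.panacea.gov/aids", "2000-01-10")                 # unchanged
aids_uses_old = ("www.panacea.gov/aids/uses", "2000-01-11")
aids_uses_new = ("www.panacea.gov/aids/uses", "2000-02-03")   # modified
diabetes = ("www.panacea.gov/diabetes", "2000-01-02")         # removed later
parkinsons = ("www.panacea.gov/parkinsons", "2000-01-25")     # added later

old_table = [(aids, aids_uses_old), (diabetes,)]
new_table = [(aids, aids_uses_new), (parkinsons,)]

left = left_outer_web_join(old_table, new_table)
right = right_outer_web_join(old_table, new_table)
print(left, right)
```

In this data the left outer web join keeps only the old tuple about the removed page and the right outer web join keeps only the new tuple about the added page; the two tuples sharing the unchanged node take part in the ordinary web join instead.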
40
Web Table New Drugs (15th Feb)
- b0 a0 u0 k0 d0: AIDS, Indavir
- b0 a0 u1 k1 d1: AIDS, Ritonavir
- b1 a0 k2 d2: Cancer, Beta Carotene
- b2 a0 u2 k3 d3: Heart Disorder, Hirudin
41
Web Table New Drugs (15th Feb)
- b2 a0 u3 k7 d7: Heart Disorder, Niacin
- b4 a0 u7 k7 d6: Impotence, Caverject
- b4 a0 u9 k8 d8: Impotence, Vasomax
42
Web Table New Drugs (15th Feb)
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
- b4 a0 u12 k9 d9: Impotence, Viagra
43
Right Outer Web Join
- b1 a0 k2 d2: Cancer, Beta Carotene
- b4 a0 u9 k8 d8: Impotence, Vasomax
- b4 a0 u12 k9 d9: Impotence, Viagra
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
44
Types of web tuples
- New web tuples which are added during the transition
  - These tuples contain some new nodes; the content of the remaining nodes has changed
- Tuples in which all the nodes have undergone content modification
- Tuples which existed before, in which some of the nodes are new and the content of the remaining ones has changed
45
Web Table Drugs (15th Jan)
- b0 a0 u0 k0 d0: AIDS, Indavir
- b0 a0 u1 k1 d1: AIDS, Ritonavir
- b1 a0 k2 d2: Cancer, Beta Carotene
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
46
Web Table Drugs (15th Jan)
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
- b4 a0 u7 k7 d6 u8: Impotence, Caverject
- b2 a0 u2 k3 d3: Heart Disease, Hirudin
47
Left Outer Web Join
- b1 a0 k2 d2: Cancer, Beta Carotene
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
48
Types of web tuples
- Web tuples which are deleted during the transition
  - These tuples do not occur in the new web table
- Tuples in which all the nodes have undergone content modification
- Tuples in which some of the nodes are deleted and the content of the remaining ones has changed
49
Step 3: Generating Delta Web Tables
50
Overview
- Input
  - Joined, left outer joined and right outer joined web tables
- Output
  - Set of delta web tables
51
Delta Web Tables
- Delta web tables are used to represent web deltas
- They encapsulate the relevant changes that have occurred in the Web with respect to a user’s query
- Three types
  - Delta+ web table: a set of web tuples containing new nodes inserted during the transition
  - Delta- web table: a set of web tuples containing nodes removed during the transition
  - Delta-M web table: a set of web tuples representing the previous and current versions of the modified nodes
52
Steps for Generation
- Phase 1: Delta Nodes Identification Phase
  - Nodes which are added, deleted or modified during the transition are identified
  - Input: old and new versions of the web tables and the set of joinable nodes from the joined web table
  - Output: sets of nodes which are added, deleted or modified during the transition
    - Nodes which exist in the new web table but not in the old one are the new nodes
    - Nodes which exist in the old web table but not in the new one are the deleted nodes
    - Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
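The three rules of Phase 1 reduce to set operations. The sketch below is a simplified illustration rather than the authors' algorithm: it assumes that a node "exists in" a web table when its URL does, and the function name and the page URLs below the site root are hypothetical.

```python
def identify_delta_nodes(old_urls, new_urls, joinable_urls):
    """Split node URLs into added, deleted and modified sets.

    old_urls / new_urls: URLs of the nodes in the old and new web tables.
    joinable_urls: URLs of the joinable (unchanged) nodes from the joined table.
    """
    old_urls, new_urls = set(old_urls), set(new_urls)
    added = new_urls - old_urls                      # only in the new table
    deleted = old_urls - new_urls                    # only in the old table
    modified = (old_urls & new_urls) - set(joinable_urls)  # in both, not joinable
    return added, deleted, modified


# Hypothetical node URLs for the two snapshots.
old_urls = {
    "www.panacea.gov/aids",
    "www.panacea.gov/cancer",
    "www.panacea.gov/diabetes",
}
new_urls = {
    "www.panacea.gov/aids",
    "www.panacea.gov/cancer",
    "www.panacea.gov/parkinsons",
}
joinable_urls = {"www.panacea.gov/aids"}

added, deleted, modified = identify_delta_nodes(old_urls, new_urls, joinable_urls)
print(added, deleted, modified)
```

With this input the Parkinson's page is reported as added, the Diabetes page as deleted, and the Cancer page as modified, because it is present in both snapshots but is not among the joinable nodes.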
53
Steps for Generation
- Phase 2: Delta Tuples Identification Phase
  - Determines how the delta nodes are related to one another and how they are associated with the nodes which have remained unchanged
  - We identify those tuples which contain nodes that are added, deleted or modified during the transition
  - Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes
  - Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables
54
Phase 2 (Delta+ Web Table)
- Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
- New nodes can occur only in these two tables:
  - in the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
  - in the joined web table, if some of the nodes in the tuple containing the new nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta+ Web Table
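The Delta+ scan above amounts to filtering two tables by the set of new nodes. The following is a rough sketch under stated assumptions: web tuples are dictionaries with hypothetical `label` and `nodes` fields, the node identifiers are copied from the example slides (link identifiers omitted), and the set of new nodes is an assumption read off the example rather than something the slides list explicitly.

```python
def delta_plus(joined_table, right_outer_table, new_nodes):
    """Collect tuples from both tables that contain at least one new node."""
    selected = []
    for web_tuple in list(joined_table) + list(right_outer_table):
        if new_nodes & set(web_tuple["nodes"]):
            selected.append(web_tuple)
    return selected


joined_table = [
    {"label": "Niacin", "nodes": ["b2", "a0", "k7", "d7", "k3", "d3"]},
]
right_outer_table = [
    {"label": "Beta Carotene", "nodes": ["b1", "a0", "k2", "d2"]},
    {"label": "Vasomax", "nodes": ["b4", "a0", "k8", "d8"]},
    {"label": "Viagra", "nodes": ["b4", "a0", "k9", "d9"]},
    {"label": "Tolcapone", "nodes": ["b6", "a0", "k10", "d10"]},
]

# Assumed output of Phase 1 for this example (not given on the slides).
new_nodes = {"d7", "k8", "d8", "k9", "d9", "b6", "k10", "d10"}

delta_plus_table = delta_plus(joined_table, right_outer_table, new_nodes)
print([t["label"] for t in delta_plus_table])
```

Under these assumptions the Beta Carotene tuple is skipped because it holds only modified nodes, while the Niacin, Vasomax, Viagra and Tolcapone tuples are kept. The Delta- scan is symmetric: it filters the joined and left outer joined tables by the set of deleted nodes.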
55
Example (Right Outer Web Join)
- b1 a0 k2 d2: Cancer, Beta Carotene
- b4 a0 u9 k8 d8: Impotence, Vasomax
- b4 a0 u12 k9 d9: Impotence, Viagra
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
56
Example (Joined Web Table)
- (4) b2 a0 u3 k7 d7 (Heart Disorder, Niacin) joined with a0 u2 k3 d3 (Heart Disease, Hirudin)
57
Delta+ Web Table
- b4 a0 u9 k8 d8: Impotence, Vasomax
- b4 a0 u12 k9 d9: Impotence, Viagra
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
- b2 a0 u3 k7 d7: Heart Disorder, Niacin
58
Phase 2 (Delta- Web Table)
- Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
- Deleted nodes can occur only in these two tables:
  - in the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
  - in the joined web table, if some of the nodes in the tuple containing the deleted nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta- Web Table
59
Example (Left Outer Web Join)
- b1 a0 k2 d2: Cancer, Beta Carotene
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
60
Example (Joined Web Table)
- (5) b4 a0 u7 (Impotence, Caverject) joined with b4 a0 u7 k7 d6 u8 (Impotence, Caverject)
61
Delta- Web Table
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
- b4 a0 u7 k7 d6 u8: Impotence, Caverject
62
Phase 2 (Delta-M Web Table)
- Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
  - Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
    - These tuples do not occur in the joined web table because all their nodes are modified
  - Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
    - These modified nodes may not appear in the joined web table if no joinable web tuple contains them
63
Example (Right Outer Web Join)
- b1 a0 k2 d2: Cancer, Beta Carotene
- b4 a0 u9 k8 d8: Impotence, Vasomax
- b4 a0 u12 k9 d9: Impotence, Viagra
- b6 a0 u10 k10 d10 b6: Parkinson’s Disease, Tolcapone
64
Example (Left Outer Web Join)
- b1 a0 k2 d2: Cancer, Beta Carotene
- b5 a0 k12 d12: Alzheimer’s Disease, Ibuprofen
- b3 a0 d4 k5: Diabetes, Albuterol
- b4 a0 u4 k6 d5 u6 u5: Impotence, Vasomax
65
Phase 2 (Delta-M Web Table, continued)
- Tuples in the joined web table in which some of the nodes represent the old and new versions of the modified nodes
- These web tuples are stored in the Delta-M Web Table
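The Delta-M selection inspects all three tables for tuples that hold a modified node. The sketch below is one possible reading of that description, not the authors' algorithm: tuples are dictionaries with hypothetical `label` and `nodes` fields, old and new versions of a modified node are assumed to share a node identifier, and the set of modified nodes is an assumed Phase 1 result for a small excerpt of the example.

```python
def delta_m(joined_table, left_outer_table, right_outer_table, modified_nodes):
    """Collect tuples from all three tables that contain a modified node."""
    selected = []
    for table in (joined_table, left_outer_table, right_outer_table):
        for web_tuple in table:
            if modified_nodes & set(web_tuple["nodes"]):
                selected.append(web_tuple)
    return selected


joined_table = [
    {"label": "Hirudin (old + new)", "nodes": ["b2", "a0", "k3", "d3"]},
]
left_outer_table = [
    {"label": "Beta Carotene (old)", "nodes": ["b1", "a0", "k2", "d2"]},
    {"label": "Albuterol (old)", "nodes": ["b3", "a0", "d4", "k5"]},
]
right_outer_table = [
    {"label": "Beta Carotene (new)", "nodes": ["b1", "a0", "k2", "d2"]},
    {"label": "Tolcapone (new)", "nodes": ["b6", "a0", "k10", "d10"]},
]

# Assumed Phase 1 output: identifiers of nodes whose content changed.
modified_nodes = {"b1", "k2", "d2", "b2"}

delta_m_table = delta_m(joined_table, left_outer_table,
                        right_outer_table, modified_nodes)
print([t["label"] for t in delta_m_table])
```

Here the Hirudin tuple is selected from the joined table and both versions of the Beta Carotene tuple are selected from the outer joined tables, while the Albuterol and Tolcapone tuples, which hold only deleted or new nodes, are left to the Delta- and Delta+ tables.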
66
Example (Joined web table)
- (1) b0 a0 u0 k0 d0 (AIDS, Indavir) joined with a0 (AIDS)
- (2) b0 a0 u1 k1 d1 (AIDS, Ritonavir) joined with a0 (AIDS)
67
Delta-M Web Table
- (1) b0 a0 u0 k0 d0 (AIDS, Indavir) joined with a0 (AIDS)
- (2) b0 a0 u1 k1 d1 (AIDS, Ritonavir) joined with a0 (AIDS)
- (3) b4 a0 u7 (Impotence, Caverject) joined with b4 a0 u7 k7 d6 u8 (Impotence, Caverject)
68
Delta-M Web Table
- (4) b2 a0 u2 k3 d3 (Heart Disease, Hirudin) joined with a0 u2 k3 d3 (Heart Disorder, Hirudin)
- (5) b1 a0 k2 d2 (Cancer, Beta Carotene), old version, with b1 a0 k2 d2 (Cancer, Beta Carotene), new version
69
Applications
- Provides the framework for
  - Trend analysis
  - E-commerce
    - Consumer behaviour
    - Product comparisons
    - Competitive intelligence
    - Notification services
    - A useful database for buyers’ and sellers’ agents
70
Future Work
- Analytical and empirical studies of the algorithms for generating delta web tables
- A mechanism to distinguish between modified, new and deleted nodes
  - Annotation on delta nodes
- Extension to the sub-page level
- Query languages for querying the changes
- Change notification service