Detecting and Representing Relevant Web Deltas in WHOWEDA Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla madrias@umr.edu Based on IEEE ICDCS’00 and IEEE TKDE (under minor revision) 2/28/2019
Current Situation of W3
- The Web allows information to change at any time and in any way
- Two forms of changes
  - Existence
  - Structure and content modification
- A change leaves no trace of the previous document: it replaces its antecedents, leaving no trace!
Problems of Change Management
- Detecting, representing and querying these changes
- The problem is challenging
  - Typical database approaches that detect changes through triggering mechanisms are not usable: outside users have no access rights, and the sources offer no support for triggers
  - Information sources typically do not keep historical information in a format that is accessible to outside users
Applications
Provides the framework for
- Web site administrators
- Trend analysis and mining
- E-commerce
- Customers of e-commerce web sites
- Competitive intelligence: product and price comparisons
- Notification services (with PDA)
Objectives
- Web deltas: changes to web data
- Detecting and representing relevant page-level web deltas
  - Changes that are relevant to the user’s query, not arbitrary changes or web deltas
  - Restricted to page level
- Detect those documents which are
  - added to the site
  - deleted from the site
  - modified in content or structure
- Determine how these delta documents are related to one another and to other documents relevant to the user’s query
Related Work
- Lore (Stanford): change management (SIGMOD’97 and ICDE’98)
  - Contrast: OEM based, not applied on the Web
- WebCQ (Georgia Tech)
  - Needs a set of URLs; no inter-document changes
- Htmldiff (AT&T)
  - Input: two versions
  - Output: a marked-up copy highlighting the changes
  - Difficult to browse in the case of large files
- Our approach is driven by a user query; it does not report arbitrary changes
Change Mgmt in DBMS
- Two approaches
  - Snapshot collection at times t1, t2, ...
  - Snapshot deltas, D and Ds at times t1, t2, ...
- Contrast: we use the snapshot delta approach, but with semi-structured data
Motivating Example
- Assume that there is a web site at www.panacea.gov
- It provides information related to drugs used for various diseases
- Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) information related to the side effects and uses of drugs used for various diseases, together with the page-level changes to this information compared to its previous version
Structure of www.panacea.gov
- www.panacea.gov contains a list of diseases
- The link for each disease points to a web page containing a list of drugs used for prevention and cure of that disease
- The hyperlink associated with each drug points to a document listing various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.)
- From the hyperlinks associated with each issue, one can retrieve the details of that issue for a particular drug
A Snapshot as on 15th Jan
[Figure: link structure of www.panacea.gov. Disease pages: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence. Drug pages: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject. Link labels: Uses, Side effects.]
A Partial Snapshot as on 25th Jan
[Figure: changes to www.panacea.gov. Labels shown: Parkinson’s Disease, Tolcapone, Cancer, Diabetes, Uses, Side effects; annotations "update" and "New Link".]
A Partial Snapshot as on 30th Jan
[Figure: changes to www.panacea.gov. Labels shown: Impotence, Caverject, Vasomax, Viagra, Uses, Side effects.]
On 8th February
[Figure: changes to www.panacea.gov. Labels shown: Heart disorder, Alzheimer’s Disease, Hirudin, Niacin, Uses, Side effects.]
A Snapshot as on 15th Feb
[Figure: link structure of www.panacea.gov. Disease pages: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Parkinson’s Disease, Impotence. Drug pages: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject. Link labels: Uses, Side effects.]
Types of Changes
- Insert node
- Delete node
- Update node (update contents)
- Insert link: same as either insert node or update node
- Delete link: same as either delete node or update node
- Update link: same as update node
WHOWEDA* Project
Key objectives
- Design a suitable data model to store web data, called WHOM (Warehouse of Object Model)
- Development of a web algebra and query language to extract and manipulate web data
- Change management of web data
- Development of knowledge discovery and web mining tools
*Joint project with NTU, Singapore
Overview of WHOM
- Collection of web tables
- A web table is represented by a set of web tuples and a set of web schemas
- Web tuple: a directed graph containing nodes and links that satisfies a web schema
- Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks
- Tree representation (can handle XML)
- Web algebra containing web operators to manipulate web tables: Global Coupling, Web Select, Web Join, etc.
Step 1: Retrieving Snapshots of Web Data Using Coupling Query Graph
- Example: suppose that, on 15th January, a user wishes to find out periodically (every 30 days), from the web site at www.panacea.gov, information related to the side effects and uses of drugs used for various diseases
- The result of the query is stored in the form of a web table
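Pictorial Representation
[Figure: coupling query graph. Node a (www.panacea.gov) links to node b (“drug list”); b is connected to d (“side effects”) by an edge labelled {1, 6} and to k (“uses”) by an edge labelled {1, 3}.]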
Coupling Query
- Set of node variables Xn, Xn = {a, b, d, k}
  - Each variable represents a set of Web documents
- Set of link variables Xl, Xl = { - }
  - Each variable represents a set of hyperlinks
- Set of predicates P defined over some of the node and link variables, P = {p1, p2, p3, p4}
  - p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”
  - p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list”
  - p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses”
  - p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”
Coupling Query
- Set of connectivities C defined over node and link variables
  - To specify the hyperlink structure of the documents
  - Specify metadata, content or structural conditions
  - C = k1 AND k2 AND k3
  - k1 = a < - > b
  - k2 = b < -{1, 6} > d
  - k3 = b < -{1, 3} > k
- Set of coupling query predicates Q
  - Conditions on execution of the query
  - Q = {q1}
  - q1(G) = COUPLING_QUERY:: G:polling_frequency EQUALS “30 days”
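The coupling query above can be written down as plain data. The following is a minimal sketch, not WHOWEDA's actual representation: the class names, the field names and the helper `undeclared_variables` are all hypothetical, and reading `-{1, 6}` as "a path of 1 to 6 links" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    variable: str   # node variable the predicate constrains
    kind: str       # "METADATA" or "CONTENT"
    path: str       # attribute path, e.g. "url" or "html.body.title"
    operator: str   # "EQUALS" or "NON-ATTR-CONT"
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int = 1  # assumption: a bare "-" is a single hyperlink
    max_links: int = 1  # assumption: "-{1, 6}" is a path of 1 to 6 links


@dataclass(frozen=True)
class CouplingQuery:
    node_vars: frozenset
    predicates: tuple
    connectivities: tuple
    polling_frequency_days: int


panacea_query = CouplingQuery(
    node_vars=frozenset({"a", "b", "d", "k"}),
    predicates=(
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTR-CONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTR-CONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTR-CONT", "side effects"),
    ),
    connectivities=(
        Connectivity("a", "b"),          # k1 = a < - > b
        Connectivity("b", "d", 1, 6),    # k2 = b < -{1, 6} > d
        Connectivity("b", "k", 1, 3),    # k3 = b < -{1, 3} > k
    ),
    polling_frequency_days=30,           # q1: polling_frequency EQUALS "30 days"
)


def undeclared_variables(query):
    """Return variables used in predicates or connectivities but not declared."""
    used = {p.variable for p in query.predicates}
    for c in query.connectivities:
        used.update((c.source, c.target))
    return used - query.node_vars
```

A consistency check such as `undeclared_variables(panacea_query)` returning an empty set is one cheap way to catch a typo in a variable name before the query is polled.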
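Web Table Drugs (15th Jan)
[Figure: web tuples of table Drugs. Labels shown: Indavir (AIDS), Ritonavir (AIDS), Beta Carotene (Cancer), Ibuprofen (Alzheimer’s Disease). Identifiers shown: a0, b0, b1, b5, u1, d1, d2, d12, k0, k1, k2, k12.]
Web Table Drugs (15th Jan)
[Figure: further web tuples of table Drugs. Labels shown: Albuterol (Diabetes), Vasomax (Impotence), Caverject (Impotence), Hirudin (Heart Disease). Identifiers shown: a0, b2, b4, u2, u4, u5, u6, u7, u8, d3, d5, d6, k3, k5, k6, k7.]
Web Table New Drugs (15th Feb)
[Figure: web tuples of table New Drugs. Labels shown: Indavir (AIDS), Ritonavir, Beta Carotene (Cancer), Hirudin (Heart Disorder). Identifiers shown: a0, b1, b2, u1, u2, d0, d1, d2, d3, k0, k1, k2, k3.]
Web Table New Drugs (15th Feb)
[Figure: further web tuples of table New Drugs. Labels shown: Niacin (Heart Disorder), Vasomax (Impotence), Caverject (Impotence). Identifiers shown: a0, b4, u7, u9, d6, d7, d8, k7, k8.]
Web Table New Drugs (15th Feb)
[Figure: further web tuples of table New Drugs. Labels shown: Tolcapone (Parkinson’s Disease), Viagra (Impotence). Identifiers shown: a0, b4, u12, d9, d10, k9, k10.]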
Storage of Web Objects
- Warehouse node pool: distinct nodes; each node has a node-id and version-ids
- Warehouse document pool: the actual documents
- Web table pool
  - Table node pool: type identifier name that the node and link represent in the schema, link-id, version-ids, URL of the node, target node-id, label, and link type of the link
  - Web tuple pool: ids of all the nodes and links belonging to a web tuple
  - Web schema pool: stores the web schema and the coupling query
Step 2: Performing Web Join, Left and Right Outer Web Join
- Web join combines two web tables by concatenating two web tuples whenever they contain joinable nodes
- Two nodes are joinable if they are identical
- Two nodes are identical if their URLs and last modification dates are the same
- The joined web tuple is stored in a different web table
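The joinability rule can be illustrated with a short sketch. It assumes a web tuple is simply a Python tuple of nodes and ignores links and schemas; the `Node` type, the function names, the URL paths below www.panacea.gov and the dates are all hypothetical, and this is not the web join algorithm of the talk.

```python
from typing import NamedTuple


class Node(NamedTuple):
    url: str
    last_modified: str  # any comparable date representation


def joinable_nodes(old_table, new_table):
    """Nodes that occur, identical, in both web tables.

    Two Node values compare equal exactly when URL and last
    modification date match, which is the identity rule above.
    """
    old_nodes = {node for web_tuple in old_table for node in web_tuple}
    new_nodes = {node for web_tuple in new_table for node in web_tuple}
    return old_nodes & new_nodes


def web_join(old_table, new_table):
    """Concatenate every pair of web tuples that share a joinable node."""
    joined = []
    for old_tuple in old_table:
        for new_tuple in new_table:
            if set(old_tuple) & set(new_tuple):
                joined.append(old_tuple + new_tuple)
    return joined


# Hypothetical data: the root page is unchanged, the disease page was edited.
root = Node("www.panacea.gov", "2000-01-01")
old_page = Node("www.panacea.gov/heart", "2000-01-10")
new_page = Node("www.panacea.gov/heart", "2000-02-08")
drugs = [(root, old_page)]
new_drugs = [(root, new_page)]
```

Here only `root` is joinable, so the single old tuple and the single new tuple are concatenated into one joined tuple that carries both versions of the edited page.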
Web Join
- Join web tables Drugs and New Drugs
- Nodes which have not undergone any changes are the joinable nodes in these two web tables
- Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
Joined web table
[Figure: joined web tuples (1) to (3), built from the AIDS tuples for Indavir and Ritonavir. Identifiers shown: a0, b0, u0, u1, d0, d1, k0, k1.]
Joined Web Table
[Figure: joined web tuples (4) and (5). Tuple (4) combines Niacin (Heart Disorder) with Hirudin (Heart Disease); tuple (5) involves Caverject (Impotence). Identifiers shown: a0, b2, b4, u2, u3, u7, u8, d3, d6, d7, k3, k4, k7.]
Joined Table
[Figure: joined web tuple (6), combining Hirudin under Heart Disease with Hirudin under Heart Disorder. Identifiers shown: a0, b2, u2, d3, k3.]
Types of web tuples
1. Web tuples in which all the nodes are joinable. These result from joining two versions of a web tuple that has remained unchanged during the transition
2. Web tuples in which some of the nodes are joinable and the remaining nodes are the result of insertion, deletion or modification operations
[Figure: joined web tuple (5), Caverject (Impotence).]
Types of web tuples
3. Web tuples in which some of the nodes are joinable. Of the remaining nodes, some are the result of insertion, deletion or modification, and the others remained unchanged during the transition but may be joinable with others
[Figure: joined web tuple (3), Indavir and Ritonavir (AIDS).]
Algorithm for Computing joinable nodes
Algorithm of web join
Algorithm of web join (continued)
Outer Web Join
- Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
- Outer web join enables us to identify them
  - Left outer web join
  - Right outer web join
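As a rough illustration of dangling tuples, the sketch below collects the old tuples that join with nothing (left outer) and the new tuples that join with nothing (right outer). It assumes a node is a `(url, last_modified)` pair and a web tuple is a Python tuple of such pairs; the function names, URL paths and dates are hypothetical, and the outer web join algorithm of the talk may differ.

```python
def shares_node(tuple_a, tuple_b):
    """True if the two web tuples have at least one identical node."""
    return bool(set(tuple_a) & set(tuple_b))


def left_outer_dangling(old_table, new_table):
    """Old web tuples that join with no tuple of the new table."""
    return [old for old in old_table
            if not any(shares_node(old, new) for new in new_table)]


def right_outer_dangling(old_table, new_table):
    """New web tuples that join with no tuple of the old table."""
    return [new for new in new_table
            if not any(shares_node(old, new) for old in old_table)]


# Hypothetical nodes: one unchanged page, one only in the old snapshot,
# one only in the new snapshot.
unchanged = ("www.panacea.gov/aids", "2000-01-05")
old_only = ("www.panacea.gov/alzheimers", "2000-01-07")
new_only = ("www.panacea.gov/parkinsons", "2000-01-25")

old_table = [(unchanged,), (old_only,)]
new_table = [(unchanged,), (new_only,)]
```

With this data the tuple holding `old_only` dangles on the left and the tuple holding `new_only` dangles on the right, while the `unchanged` tuples would appear in the joined web table instead.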
Types of web tuples (Right Outer)
- New web tuples which are added during the transition. These tuples contain some new nodes, and the content of the remaining nodes has changed
- Tuples in which all the nodes have undergone content modification
- Tuples which existed before, in which some of the nodes are new and the content of the remaining nodes has changed
Web Table New Drugs (15th Feb)
[Figure: the three slides of web table New Drugs (15th Feb) shown again for reference; they repeat the figures from Step 1.]
Types of web tuples (Left Outer)
- Web tuples which are deleted during the transition. These tuples do not occur in the new web table
- Tuples in which all the nodes have undergone content modification
- Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed
Web Table Drugs (15th Jan)
[Figure: the two slides of web table Drugs (15th Jan) shown again for reference; they repeat the figures from Step 1.]
Algorithm of outer web join
Algorithm of outer web join (continued)
Step 3: Generating Delta Web Tables
- Input: joined, left outer joined and right outer joined web tables
- Output: set of delta web tables
Delta Web Tables
- Encapsulate the relevant changes that have occurred in the Web with respect to a user’s query
- Three types
  - Delta+ web table: set of web tuples containing new nodes inserted during the transition
  - Delta- web table: set of web tuples containing nodes removed during the transition
  - Delta-M web table: set of web tuples representing the previous and current sets of modified nodes
Steps for Generation
Phase 1: Delta Nodes Identification Phase
- Nodes which are added, deleted or modified during the transition are identified
- Input: old and new versions of the web tables and a set of joinable nodes from the joined web table
- Output
  - Nodes which exist in the new web table but not in the old web table are the new nodes
  - Nodes which exist in the old web table but not in the new one are the deleted nodes
  - Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
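Phase 1 reduces to set operations once each table is summarized as a mapping from URL to last modification date. The sketch below makes that simplification: it recomputes "not joinable" by comparing dates instead of taking the joinable node set as input, and the function name, URL paths and dates are hypothetical.

```python
def identify_delta_nodes(old_nodes, new_nodes):
    """Classify node URLs as added, deleted or modified.

    old_nodes and new_nodes map a node URL to its last modification
    date in the old and new web table respectively.
    """
    old_urls, new_urls = set(old_nodes), set(new_nodes)
    added = new_urls - old_urls          # in the new table only
    deleted = old_urls - new_urls        # in the old table only
    # Present in both tables but with different dates, hence not joinable.
    modified = {url for url in old_urls & new_urls
                if old_nodes[url] != new_nodes[url]}
    return added, deleted, modified


# Hypothetical snapshots of a few pages.
old_snapshot = {
    "www.panacea.gov": "2000-01-01",
    "www.panacea.gov/heart": "2000-01-10",
    "www.panacea.gov/alzheimers": "2000-01-07",
}
new_snapshot = {
    "www.panacea.gov": "2000-01-01",
    "www.panacea.gov/heart": "2000-02-08",
    "www.panacea.gov/parkinsons": "2000-01-25",
}

added, deleted, modified = identify_delta_nodes(old_snapshot, new_snapshot)
```

The three resulting sets are disjoint by construction, and any URL in none of them is an unchanged (joinable) node.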
Steps for Generation
Phase 2: Delta Tuples Identification Phase
- Determines how the delta nodes are related to one another and how they are associated with the nodes that have remained unchanged
- Identifies the tuples that contain nodes which are added, deleted or modified during the transition
- Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes
- Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables
Phase 2 (Delta+ Web Table)
- Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
- New nodes can occur
  - in the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
  - in the joined web table, if some of the nodes in the tuple containing the new nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta+ web table
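The Delta+ scan can be sketched as a filter over the two tables. The assumptions are: a node is a `(url, last_modified)` pair, a web tuple is a Python tuple of nodes, a joined tuple is the concatenation of its old and new halves, and the whole matching tuple is kept. Whether the warehouse stores the whole joined tuple or only its new half is not stated here, and the function name, URL paths and dates are hypothetical.

```python
def delta_plus(joined_table, right_outer_table, new_urls):
    """Collect web tuples that contain at least one newly inserted node."""
    def has_new_node(web_tuple):
        return any(url in new_urls for url, _modified in web_tuple)

    result = []
    for web_tuple in list(joined_table) + list(right_outer_table):
        # Keep first occurrences only, preserving scan order.
        if has_new_node(web_tuple) and web_tuple not in result:
            result.append(web_tuple)
    return result


# Hypothetical nodes.
root = ("www.panacea.gov", "2000-01-01")
hirudin = ("www.panacea.gov/hirudin", "2000-01-03")
niacin = ("www.panacea.gov/niacin", "2000-02-08")        # new page
impotence = ("www.panacea.gov/impotence", "2000-01-30")  # modified page
viagra = ("www.panacea.gov/viagra", "2000-01-30")        # new page

joined = [
    (root, hirudin, root, niacin),   # new node next to unchanged nodes
    (root, hirudin, root, hirudin),  # nothing new
]
right_outer = [(impotence, viagra)]  # new node next to a modified node
new_urls = {niacin[0], viagra[0]}

delta_plus_table = delta_plus(joined, right_outer, new_urls)
```

The Delta- web table could be sketched the same way, scanning the joined and left outer joined tables for deleted URLs instead.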
Example (Right Outer Web Join)
[Figure: right outer joined web tuples. Labels shown: Beta Carotene (Cancer), Vasomax (Impotence), Viagra, Tolcapone (Parkinson’s Disease).]
Example (Joined Web Table)
[Figure: joined web tuple (4), combining Niacin (Heart Disorder) with Hirudin (Heart Disease).]
Delta+ Web Table
[Figure: web tuples of the Delta+ web table. Labels shown: Niacin (Heart Disorder), Vasomax (Impotence), Viagra, Tolcapone (Parkinson’s Disease).]
Phase 2 (Delta- Web Table)
- Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
- Deleted nodes can occur only
  - in the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
  - in the joined web table, if some of the nodes in the tuple containing the deleted nodes have remained unchanged and hence are joinable
- These web tuples are stored in the Delta- web table
Example (Left Outer Web Join)
[Figure: left outer joined web tuples. Labels shown: Beta Carotene (Cancer), Ibuprofen (Alzheimer’s Disease), Albuterol (Diabetes), Vasomax (Impotence).]
Example (Joined Web Table)
[Figure: joined web tuple (5), Caverject (Impotence).]
Delta- Web Table
[Figure: web tuples of the Delta- web table. Labels shown: Caverject (Impotence), Ibuprofen (Alzheimer’s Disease), Albuterol (Diabetes), Vasomax (Impotence).]
Phase 2 (Delta-M Web Table)
- Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
- Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
  - These tuples do not occur in the joined web table, as all their nodes are modified
- Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
  - These modified nodes may not appear in the joined web table if no other joinable web tuples contain them
- Tuples in the joined web table where some of the nodes represent the old and new versions of these modified nodes
- These web tuples are stored in the Delta-M web table
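A simplified reading of the Delta-M cases is "keep every tuple that contains a modified node". The sketch below collapses the cases in that way and should be taken as an approximation only. It assumes a node is a `(url, last_modified)` pair, a joined tuple is kept as an `(old_half, new_half)` pair, and a joined tuple qualifies when both halves contain a modified URL; the function names, URL paths and dates are hypothetical.

```python
def contains_modified(web_tuple, modified_urls):
    """True if any node of the tuple has a URL in modified_urls."""
    return any(url in modified_urls for url, _modified in web_tuple)


def delta_m(joined_pairs, left_outer, right_outer, modified_urls):
    """Group tuples carrying old and new versions of modified nodes."""
    return {
        # Old versions come from the left outer joined table.
        "old_versions": [t for t in left_outer
                         if contains_modified(t, modified_urls)],
        # New versions come from the right outer joined table.
        "new_versions": [t for t in right_outer
                         if contains_modified(t, modified_urls)],
        # Joined tuples holding both versions side by side.
        "joined": [(old, new) for old, new in joined_pairs
                   if contains_modified(old, modified_urls)
                   and contains_modified(new, modified_urls)],
    }


# Hypothetical nodes: the heart page and the cancer pages were edited.
root = ("www.panacea.gov", "2000-01-01")
hirudin = ("www.panacea.gov/hirudin", "2000-01-03")
heart_old = ("www.panacea.gov/heart", "2000-01-02")
heart_new = ("www.panacea.gov/heart", "2000-02-08")
cancer_old = ("www.panacea.gov/cancer", "2000-01-04")
cancer_new = ("www.panacea.gov/cancer", "2000-02-01")

joined_pairs = [((root, heart_old, hirudin), (root, heart_new, hirudin))]
left_outer = [(cancer_old,)]
right_outer = [(cancer_new,)]
modified_urls = {heart_old[0], cancer_old[0]}

delta_m_table = delta_m(joined_pairs, left_outer, right_outer, modified_urls)
```

Keeping the old and new versions in separate groups mirrors the statement that the Delta-M web table represents the previous and current sets of modified nodes.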
Example (Right Outer Web Join)
[Figure: right outer joined web tuples. Labels shown: Beta Carotene (Cancer), Vasomax (Impotence), Viagra (Impotence), Tolcapone (Parkinson’s Disease).]
Example (Left Outer Web Join)
[Figure: left outer joined web tuples. Labels shown: Beta Carotene (Cancer), Ibuprofen (Alzheimer’s Disease), Albuterol (Diabetes), Vasomax (Impotence).]
Example (Joined web table)
[Figure: joined web tuples (1) and (2), Indavir (AIDS) and Ritonavir (AIDS).]
Delta-M Web Table
[Figure: Delta-M web tuples (1) to (3). Labels shown: Indavir (AIDS), Ritonavir (AIDS), Caverject (Impotence).]
Delta-M Web Table
[Figure: Delta-M web tuples (4) and (5). Labels shown: Hirudin (Heart Disease and Heart Disorder), Beta Carotene (Cancer).]
Algorithm Delta
Algorithm Delta (continued)
Algorithm Delta (continued)
Algorithm of GenerateResult Tables
Algorithm of GenerateResult Tables (continued)
Algorithm for DeltasFromRightOuter
Algorithm for DeltasFromLeftOuter
Algorithm of DeltasFromJoin
Algorithm of DeltasFromJoin (continued)
Algorithm of CreateDeltaPlus
Future Work
- Analytical and empirical studies of the algorithms for generating delta web tables
- Mechanism to represent changes: modified, new or deleted nodes
- Annotation on delta nodes
- Extend to sub-page level
- Query languages for querying the changes
- Change notification service