Detecting and Representing Relevant Web Deltas in WHOWEDA


1 Detecting and Representing Relevant Web Deltas in WHOWEDA
Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla. Based on IEEE ICDCS’00 and IEEE TKDE (under minor revision). 2/28/2019

2 Current Situation of W3
The Web allows information to change at any time and in any way.
Two forms of change:
  Existence
  Structure and content modification
A changed document replaces its antecedents, leaving no trace of the previous version.

3 Problems of Change Management
Detecting, representing and querying these changes.
The problem is challenging:
  Typical database approaches that detect changes through triggering mechanisms are not usable: there are no access rights and no support for triggers.
  Information sources typically do not keep historical information in a format that is accessible to outside users.

4 Applications
Provides the framework for:
  Web site administrators
  Trend analysis and mining
  E-commerce: customers of e-commerce web sites
  Competitive intelligence: product and price comparisons
  Notification services (with PDA)

5 Objectives
Web deltas: changes to web data.
Detecting and representing relevant page-level web deltas:
  Changes that are relevant to the user’s query, not arbitrary changes.
  Restricted to page level.
Detect those documents which are:
  added to the site
  deleted from the site
  modified in content or structure
Determine how these delta documents are related to one another and to other documents relevant to the user’s query.

6 Related Work
Lore (Stanford): change management (SIGMOD’97 and ICDE’98). Contrast: OEM based, not applied on the Web.
WebCQ (Georgia Tech): needs a set of URLs; no interdocument changes.
Htmldiff (AT&T): input is two versions; output is a marked-up copy that highlights changes. Difficult to browse for large files.
Ours is based on a query, not on any arbitrary change.

7 Change Mgmt in DBMS
Two approaches:
  Snapshot collection at times t1, t2, ...
  Snapshot deltas, D and Ds, at times t1, t2, ...
Contrast: we use the snapshot delta approach, but with semi-structured data.

8 Motivating Example
Assume that there is a web site at www.panacea.gov that provides information related to drugs used for various diseases.
Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) information related to the side effects and uses of drugs used for various diseases, and the page-level changes to this information compared to its previous version.

9 Structure of www.panacea.gov
The site contains a list of diseases.
Each disease link points to a web page containing a list of drugs used for prevention and cure of that disease.
The hyperlink associated with each drug points to a document containing a list of issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.).
From the hyperlink associated with each issue, one can retrieve the details of that issue for the drug.

10 A Snapshot as on 15th Jan
[Figure: site graph. Diseases: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence. Drugs: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject. Drug pages link to “Side effects” and “Uses” pages.]

11 A Partial Snapshot as on 25th Jan
[Figure: partial site graph. Labels: Parkinson’s Disease, Tolcapone, Cancer, Diabetes, “Side effects”, “Uses”; annotations “update” and “New Link”.]

12 A Partial Snapshot as on 30th Jan
[Figure: partial site graph. Impotence with drugs Caverject, Vasomax and Viagra; “Side effects” and “Uses” pages.]

13 On 8th February
[Figure: partial graph of www.panacea.gov. Labels: Heart disorder, Alzheimer’s Disease, Hirudin, Niacin, “Side effects”, “Uses”.]

14 A Snapshot as on 15th Feb
[Figure: site graph. Diseases: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Parkinson’s Disease, Impotence. Drugs: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject. “Side effects” and “Uses” pages.]

15 Types of Changes
Insert node
Delete node
Update node (update contents)
Insert link: same as either insert node or update node
Delete link: same as either delete node or update node
Update link: same as update node

16 WHOWEDA* Project
Key objectives:
  Design a suitable data model to store web data, called WHOM (Warehouse of Object Model)
  Develop a web algebra and query language to extract and manipulate web data
  Change management of web data
  Develop knowledge discovery and web mining tools
*Joint project with NTU, Singapore

17 Overview of WHOM
A collection of web tables.
A web table is represented by a set of web tuples and a set of web schemas.
A web tuple is a directed graph containing nodes and links, and it satisfies a web schema.
Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks.
Tree representation (can handle XML).
A web algebra contains web operators to manipulate web tables: Global Coupling, Web Select, Web Join, etc.
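The data model described on this slide can be sketched roughly as follows. This is a minimal illustration, not the WHOM implementation: the class names, field names, example URLs and date strings are all assumptions made for the sketch, and web schemas are omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A Web document, reduced to a few metadata fields (assumed names)."""
    node_id: str          # ids such as "a0", "b0" follow the slides' labelling
    url: str
    last_modified: str    # metadata that can be compared across versions
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two nodes of the same web tuple."""
    link_id: str
    source: str           # node_id of the source document
    target: str           # node_id of the target document


@dataclass
class WebTuple:
    """A directed graph of nodes and links (schema check omitted)."""
    nodes: dict[str, Node] = field(default_factory=dict)
    links: list[Link] = field(default_factory=list)

    def add_link(self, link: Link) -> None:
        # Both endpoints must already belong to this tuple.
        if link.source not in self.nodes or link.target not in self.nodes:
            raise ValueError("link endpoints must be nodes of this tuple")
        self.links.append(link)


@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: list[WebTuple] = field(default_factory=list)


# Hypothetical tuple: home page -> disease page -> "side effects" page.
t = WebTuple()
for n in (Node("a0", "http://example.org/", "jan-10"),
          Node("b0", "http://example.org/aids", "jan-05", "drug list"),
          Node("d0", "http://example.org/aids/se", "jan-05", "side effects")):
    t.nodes[n.node_id] = n
t.add_link(Link("u0", "a0", "b0"))
t.add_link(Link("u1", "b0", "d0"))
drugs = WebTable("Drugs", [t])
```

Keeping `last_modified` on each node is a deliberate choice here, since the later web join step compares URL and last modification date.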

18 Step 1: Retrieving Snapshots of Web Data Using Coupling Query
Using a coupling query graph.
Example: suppose that, on 15th January, a user wishes to find out periodically (every 30 days) from the web site information related to the side effects and uses of drugs used for various diseases.
The result of the query is stored in the form of a web table.

19 Pictorial Representation
[Figure: coupling query graph. Node a links to node b (“drug list”); b links to d (“side effects”) with annotation {1, 6} and to k (“uses”) with annotation {1, 3}.]

20 Coupling Query
Set of node variables Xn: Xn = {a, b, d, k}. Each variable represents a set of Web documents.
Set of link variables Xl: Xl = { - }. Each variable represents a set of hyperlinks.
Set of predicates P defined over some of the node and link variables: P = {p1, p2, p3, p4}
  p1(a) = METADATA:: a[url] EQUALS “
  p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list”
  p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses”
  p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”

21 Coupling Query
Set of connectivities C defined over node and link variables, to specify the hyperlink structure of the documents.
Specify metadata, content or structural conditions.
C = k1 AND k2 AND k3
  k1 = a < - > b
  k2 = b < -{1, 6} > d
  k3 = b < -{1, 3} > k
Set of coupling query predicates Q, giving conditions on the execution of the query: Q = {q1}
  q1(G) = COUPLING_QUERY:: G:polling_frequency EQUALS “30 days”
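As a rough illustration, the four parts of this coupling query (node variables, predicates, connectivities, query predicates) could be held in plain data structures. The class and field names are hypothetical, the start URL is left as a caller-supplied parameter because it is cut off in the transcript, and reading the {1, 6} and {1, 3} annotations as minimum and maximum link counts is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    variable: str   # node variable the predicate constrains
    kind: str       # "METADATA" or "CONTENT"
    path: str       # attribute path inside the document
    operator: str   # "EQUALS" or "NON-ATTR-CONT"
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int = 1  # assumed reading of the {min, max} annotation
    max_links: int = 1


def build_drug_query(start_url: str) -> dict:
    """Assemble the slides' example query for a caller-supplied start URL."""
    title = "html.body.title"
    query = {
        "node_variables": {"a", "b", "d", "k"},
        "predicates": [
            Predicate("a", "METADATA", "url", "EQUALS", start_url),
            Predicate("b", "CONTENT", title, "NON-ATTR-CONT", "drug list"),
            Predicate("k", "CONTENT", title, "NON-ATTR-CONT", "uses"),
            Predicate("d", "CONTENT", title, "NON-ATTR-CONT", "side effects"),
        ],
        "connectivities": [
            Connectivity("a", "b"),
            Connectivity("b", "d", 1, 6),
            Connectivity("b", "k", 1, 3),
        ],
        "polling_frequency": "30 days",
    }
    # Every variable used by a predicate or connectivity must be declared.
    used = {p.variable for p in query["predicates"]}
    for c in query["connectivities"]:
        used.update((c.source, c.target))
    if not used <= query["node_variables"]:
        raise ValueError("undeclared node variable in query")
    return query


query = build_drug_query("http://example.org/")  # hypothetical URL
```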

22 Web Table Drugs (15th Jan)
[Figure: web tuples over nodes a, b, d, k and links u. Tuples shown: AIDS / Indavir (a0, b0, k0); AIDS / Ritonavir (u1, d1, k1); Cancer / Beta Carotene (a0, b1, d2, k2); Alzheimer’s Disease / Ibuprofen (a0, b5, d12, k12).]

23 Web Table Drugs (15th Jan)
[Figure: web tuples, continued. Tuples shown: Diabetes / Albuterol (k5); Impotence / Vasomax (b4, d5, k6); Impotence / Caverject (a0, b4, u7, d6, k7, u8); Heart Disease / Hirudin (a0, b2, u2, d3, k3).]

24 Web Table New Drugs (15th Feb)
[Figure: web tuples. Tuples shown: AIDS / Indavir (d0, k0); Ritonavir (u1, d1, k1); Cancer / Beta Carotene (a0, b1, d2, k2); Heart Disorder / Hirudin (a0, b2, u2, d3, k3).]

25 Web Table New Drugs (15th Feb)
[Figure: web tuples, continued. Tuples shown: Heart Disorder / Niacin (d7, k7); Impotence / Vasomax (a0, b4, u9, d8, k8); Impotence / Caverject (a0, b4, u7, d6, k7).]

26 Web Table New Drugs (15th Feb)
[Figure: web tuples, continued. Tuples shown: Parkinson’s Disease / Tolcapone (d10, k10); Impotence / Viagra (a0, b4, u12, d9, k9).]

27 Storage of Web Objects
Warehouse node pool: distinct nodes; each node has a node-id and version-ids.
Warehouse document pool: the actual documents.
Web table pool:
  Table node pool: type identifier, name that the node and link represent in the schema, link-id, version-ids, URL of the node, target node-id, label, and link type of the link.
  Web tuple pool: ids of all the nodes and links belonging to a web tuple.
  Web schema pool: stores the web schema and the coupling query.

28 Step 2: Performing Web Join, Left and Right Outer Web Join
Web join combines two web tables by concatenating two web tuples whenever they contain joinable nodes.
Two nodes are joinable if they are identical.
Two nodes are identical if their URL and last modification date are the same.
The joined web tuple is stored in a different web table.
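The identity test stated here (same URL and same last modification date) can be sketched as a lookup over the nodes of the two table versions. The `Node` type, function name, URLs and date strings below are hypothetical; this is not the presentation's own algorithm for computing joinable nodes, whose pseudocode is not reproduced in the transcript.

```python
from typing import NamedTuple


class Node(NamedTuple):
    node_id: str
    url: str
    last_modified: str


def joinable_nodes(old_nodes, new_nodes):
    """Return (old_id, new_id) pairs whose URL and last-modified date match."""
    new_by_key = {(n.url, n.last_modified): n for n in new_nodes}
    pairs = []
    for o in old_nodes:
        match = new_by_key.get((o.url, o.last_modified))
        if match is not None:
            pairs.append((o.node_id, match.node_id))
    return pairs


old = [Node("a0", "http://example.org/", "jan-10"),
       Node("b0", "http://example.org/aids", "jan-05")]
new = [Node("a0", "http://example.org/", "feb-08"),   # home page was modified
       Node("b0", "http://example.org/aids", "jan-05")]
pairs = joinable_nodes(old, new)  # only the unchanged AIDS page is joinable
```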

29 Web Join
Join the web tables Drugs and New Drugs.
Nodes which have not undergone any changes are the joinable nodes in these two web tables.
Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes.
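A minimal sketch of the join itself, assuming each web tuple is reduced to a tuple of (URL, last-modified) pairs so that identical nodes compare equal. The function name, the dictionary layout of a joined tuple and the sample data are assumptions for illustration only.

```python
def web_join(old_table, new_table):
    """Pair every old/new web tuple that shares at least one identical node."""
    joined = []
    for old in old_table:
        for new in new_table:
            shared = set(old) & set(new)  # identical = same URL and same date
            if shared:
                joined.append({"old": old, "new": new,
                               "joinable": sorted(shared)})
    return joined


# Hypothetical data: each node is a (url, last_modified) pair.
old_table = [
    (("/", "jan"), ("/aids", "jan"), ("/aids/indavir/uses", "jan")),
    (("/", "jan"), ("/diabetes", "jan")),
]
new_table = [
    (("/", "feb"), ("/aids", "jan"), ("/aids/indavir/uses", "feb")),
    (("/", "feb"), ("/parkinsons", "feb")),
]
joined = web_join(old_table, new_table)
```

In this example only the AIDS tuples join, through the unchanged `/aids` page; the modified home page and the modified "uses" page are not joinable.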

30 Joined Web Table
[Figure: joined web tuples (1), (2) and (3), pairing old and new versions of the AIDS tuples for Indavir (a0, b0, u0, d0, k0) and Ritonavir (u1, d1, k1).]

31 Joined Web Table
[Figure: joined web tuples (4) and (5). Tuple (4) pairs Heart Disorder / Niacin (a0, b2, u3, d7) with Heart Disease / Hirudin (u2, d3, k3); tuple (5) is Impotence / Caverject (a0, b4, u7, d6, k7, u8).]

32 Joined Table
[Figure: joined web tuple (6), pairing Heart Disease / Hirudin with Heart Disorder / Hirudin (a0, b2, u2, d3, k3).]

33 Types of web tuples
(1) Web tuples in which all the nodes are joinable: the result of joining two versions of a web tuple that has remained unchanged during the transition.
(2) Web tuples in which some of the nodes are joinable and the remaining nodes are the result of insertion, deletion or modification operations.
[Figure: joined web tuple (5), Impotence / Caverject (a0, b4, u7, d6, k7, u8).]

34 Types of web tuples
(3) Tuples in which some of the nodes are joinable; of the remaining nodes, some are the result of insertion, deletion or modification, and the rest remained unchanged during the transition but may be joinable with others.
[Figure: joined web tuple (3), AIDS with Indavir (a0, b0, u0, d0, k0) and Ritonavir (u1, d1, k1).]

35 Algorithm for Computing joinable nodes

36 Algorithm of web join

37 Algorithm of web join (continued)

38 Outer Web Join
Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table.
Outer web join enables us to identify them:
  Left outer web join
  Right outer web join
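One way to picture the dangling tuples is as the tuples on each side that share no identical node with any tuple on the other side. The sketch below assumes, as the later slides suggest, that the left outer result holds dangling tuples of the old table and the right outer result holds those of the new table; the function name and sample data are hypothetical.

```python
def outer_web_joins(old_table, new_table):
    """Return (left_outer, right_outer): the dangling tuples of each table."""
    def has_partner(t, others):
        # A tuple participates in the join if it shares an identical node.
        return any(set(t) & set(o) for o in others)

    left_outer = [t for t in old_table if not has_partner(t, new_table)]
    right_outer = [t for t in new_table if not has_partner(t, old_table)]
    return left_outer, right_outer


# Hypothetical data: each node is a (url, last_modified) pair.
old_table = [
    (("/", "jan"), ("/aids", "jan")),
    (("/", "jan"), ("/diabetes", "jan")),
]
new_table = [
    (("/", "feb"), ("/aids", "jan")),
    (("/", "feb"), ("/parkinsons", "feb")),
]
left_outer, right_outer = outer_web_joins(old_table, new_table)
```

Here the Diabetes tuple dangles on the left (it has no counterpart in the new table) and the Parkinson's tuple dangles on the right.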

39 Types of web tuples (Right Outer)
New web tuples which are added during the transition. These tuples contain some new nodes, and the content of the remaining nodes has changed.
Tuples in which all the nodes have undergone content modification.
Tuples which existed before, in which some of the nodes are new and the content of the remaining nodes has changed.

40 Web Table New Drugs (15th Feb)
[Figure: web tuples, as on slide 24. Tuples shown: AIDS / Indavir (d0, k0); Ritonavir (u1, d1, k1); Cancer / Beta Carotene (a0, b1, d2, k2); Heart Disorder / Hirudin (a0, b2, u2, d3, k3).]

41 Web Table New Drugs (15th Feb)
[Figure: web tuples, as on slide 25. Tuples shown: Heart Disorder / Niacin (d7, k7); Impotence / Vasomax (a0, b4, u9, d8, k8); Impotence / Caverject (a0, b4, u7, d6, k7).]

42 Web Table New Drugs (15th Feb)
[Figure: web tuples, as on slide 26. Tuples shown: Parkinson’s Disease / Tolcapone (d10, k10); Impotence / Viagra (a0, b4, u12, d9, k9).]

43 Types of web tuples (Left Outer)
Web tuples which are deleted during the transition. These tuples do not occur in the new web table.
Tuples in which all the nodes have undergone content modification.
Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed.

44 Web Table Drugs (15th Jan)
[Figure: web tuples, as on slide 22. Tuples shown: AIDS / Indavir (a0, b0, k0); AIDS / Ritonavir (u1, d1, k1); Cancer / Beta Carotene (a0, b1, d2, k2); Alzheimer’s Disease / Ibuprofen (a0, b5, d12, k12).]

45 Web Table Drugs (15th Jan)
[Figure: web tuples, as on slide 23. Tuples shown: Diabetes / Albuterol (k5); Impotence / Vasomax (b4, d5, k6); Impotence / Caverject (a0, b4, u7, d6, k7, u8); Heart Disease / Hirudin (a0, b2, u2, d3, k3).]

46 Algorithm of outer web join

47 Algorithm of outer web join (continued)

48 Step 3: Generating Delta Web Tables
Input: joined, left outer joined and right outer joined web tables.
Output: set of delta web tables.

49 Delta Web Tables
Encapsulate the relevant changes that have occurred in the Web with respect to a user’s query.
Three types:
  Delta+ web table: a set of web tuples containing new nodes inserted during the transition.
  Delta- web table: a set of web tuples containing nodes removed during the transition.
  Delta-M web table: a set of web tuples representing the previous and current sets of modified nodes.

50 Steps for Generation
Phase 1: Delta Nodes Identification Phase
Nodes which are added, deleted or modified during the transition are identified.
Input: old and new versions of the web tables and a set of joinable nodes from the joined web table.
Output:
  Nodes which exist in the new web table but not in the old web table are the new nodes.
  Nodes which exist in the old web table but not in the new one are the deleted nodes.
  Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification.
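The three output rules of Phase 1 map directly onto set operations. The sketch below assumes that a node is identified across versions by its URL, which is an assumption of this example rather than something the slide states; the function name and sample URLs are hypothetical.

```python
def identify_delta_nodes(old_urls, new_urls, joinable_urls):
    """Split nodes into inserted, deleted and modified sets."""
    old_urls, new_urls = set(old_urls), set(new_urls)
    inserted = new_urls - old_urls            # in the new table only
    deleted = old_urls - new_urls             # in the old table only
    # Present in both versions but not joinable, so the content changed.
    modified = (old_urls & new_urls) - set(joinable_urls)
    return inserted, deleted, modified


old = {"/", "/aids", "/diabetes", "/heart"}
new = {"/", "/aids", "/heart", "/parkinsons"}
joinable = {"/aids"}                          # unchanged between versions
inserted, deleted, modified = identify_delta_nodes(old, new, joinable)
```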

51 Steps for Generation
Phase 2: Delta Tuples Identification Phase
Determines how the delta nodes are related to one another and how they are associated with the nodes which have remained unchanged.
Identifies the tuples which contain nodes that were added, deleted or modified during the transition.
Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes.
Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables.

52 Phase 2 (Delta+ Web Table)
Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition.
New nodes can occur in these tables:
  In the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable).
  In the joined web table, if some of the nodes in the tuple containing the new nodes have remained unchanged and hence are joinable.
These web tuples are stored in the Delta+ web table.
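The scan described above might look like the following sketch. It assumes web tuples are tuples of node URLs, that a joined tuple is stored as a dictionary with `"old"` and `"new"` halves, and that only the new-version half of a joined tuple goes into Delta+; all of these are assumptions of the example, as are the names and sample data.

```python
def delta_plus(joined, right_outer, inserted):
    """Collect new-version tuples that contain at least one inserted node."""
    inserted = set(inserted)
    # Inserted nodes can only sit in the new half of a joined tuple
    # or in a dangling tuple of the new table.
    candidates = [pair["new"] for pair in joined] + list(right_outer)
    hits = [t for t in candidates if inserted & set(t)]
    # A new tuple may join with several old tuples; keep it once.
    return list(dict.fromkeys(hits))


joined = [{"old": ("/heart", "/heart/hirudin"),
           "new": ("/heart", "/heart/niacin")}]
right_outer = [("/parkinsons", "/parkinsons/tolcapone"),
               ("/cancer", "/cancer/beta-carotene")]   # modified only
inserted = {"/heart/niacin", "/parkinsons", "/parkinsons/tolcapone"}
plus = delta_plus(joined, right_outer, inserted)
```

The Cancer tuple is skipped because none of its nodes is new. The Delta- web table could be built symmetrically from the old halves of joined tuples and the left outer joined table.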

53 Example (Right Outer Web Join)
[Figure: right outer joined web tuples. Tuples shown: Cancer / Beta Carotene (d2, k2); Impotence / Vasomax (a0, b4, u9, d8, k8); Viagra (u12, d9, k9); Parkinson’s Disease / Tolcapone (b6, u10, d10, k10).]

54 Example (Joined Web Table)
[Figure: joined web tuple (4), pairing Heart Disorder / Niacin (a0, b2, u3, d7, k7) with Heart Disease / Hirudin (a0, u2, d3, k3).]

55 Delta+ Web Table
[Figure: Delta+ web tuples. Tuples shown: Heart Disorder / Niacin (a0, b2, u3, d7, k7); Impotence / Vasomax (a0, b4, u9, d8, k8); Viagra (u12, d9, k9); Parkinson’s Disease / Tolcapone (b6, u10, d10, k10).]

56 Phase 2 (Delta- Web Table)
Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition.
Deleted nodes can occur in these tables:
  In the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable).
  In the joined web table, if some of the nodes in the tuple containing the deleted nodes have remained unchanged and hence are joinable.
These web tuples are stored in the Delta- web table.

57 Example (Left Outer Web Join)
[Figure: left outer joined web tuples. Tuples shown: Cancer / Beta Carotene (d2, k2); Alzheimer’s Disease / Ibuprofen (a0, b5, d12, k12); Diabetes / Albuterol (a0, b3, d4, k5); Impotence / Vasomax (b4, u4, d5, k6).]

58 Example (Joined Web Table)
[Figure: joined web tuple (5), old and new versions of Impotence / Caverject (a0, b4, u7, d6, k7, u8).]

59 Delta- Web Table
[Figure: Delta- web tuples. Tuples shown: Impotence / Caverject (a0, b4, u7, d6, k7, u8); Alzheimer’s Disease / Ibuprofen (a0, b5, d12, k12); Diabetes / Albuterol (a0, b3, d4, k5); Impotence / Vasomax (b4, u4, d5, k6).]

60 Phase 2 (Delta-M Web Table)
Finally, nodes which are modified during the transition can be identified by inspecting all three web tables:
  Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively. These tuples do not occur in the joined web table, as all their nodes are modified.
  Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes. These modified nodes may not appear in the joined web table if no other joinable web tuples contain them.

61 Phase 2
  Tuples in the joined web table where some of the nodes represent the old and new versions of these modified nodes.
These web tuples are stored in the Delta-M web table.
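The three cases for Delta-M can be approximated by collecting every tuple, in any of the three tables, that contains a modified node. This is a simplified sketch under assumptions: tuples are tuples of node URLs, joined tuples are dictionaries with `"old"` and `"new"` halves, and the selection may be broader than the presentation's own algorithm, whose pseudocode is not in the transcript.

```python
def delta_m(joined, left_outer, right_outer, modified):
    """Gather old and new versions of tuples that contain modified nodes."""
    modified = set(modified)

    def touches(t):
        return bool(modified & set(t))

    return {
        # Dangling old tuples carry the previous version of modified nodes.
        "old_versions": [t for t in left_outer if touches(t)],
        # Dangling new tuples carry the current version.
        "new_versions": [t for t in right_outer if touches(t)],
        # Joined tuples carry both versions side by side.
        "joined": [p for p in joined if touches(p["old"]) or touches(p["new"])],
    }


joined = [{"old": ("/heart", "/heart/hirudin"),
           "new": ("/heart", "/heart/hirudin")}]
left_outer = [("/cancer", "/cancer/beta-carotene"),
              ("/diabetes", "/diabetes/albuterol")]        # deleted
right_outer = [("/cancer", "/cancer/beta-carotene"),
               ("/parkinsons", "/parkinsons/tolcapone")]   # inserted
modified = {"/cancer", "/cancer/beta-carotene", "/heart"}
result = delta_m(joined, left_outer, right_outer, modified)
```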

62 Example (Right Outer Web Join)
[Figure: right outer joined web tuples. Tuples shown: Cancer / Beta Carotene (d2, k2); Impotence / Vasomax (a0, b4, u9, d8, k8); Impotence / Viagra (a0, b4, u12, d9, k9); Parkinson’s Disease / Tolcapone (a0, b6, u10, d10, k10).]

63 Example (Left Outer Web Join)
[Figure: left outer joined web tuples. Tuples shown: Cancer / Beta Carotene (d2, k2); Alzheimer’s Disease / Ibuprofen (a0, b5, d12, k12); Diabetes / Albuterol (a0, b3, d4, k5); Impotence / Vasomax (b4, u4, d5, k6).]

64 Example (Joined web table)
[Figure: joined web tuples (1) and (2), pairing old and new versions of AIDS / Indavir (a0, u0, d0, k0) and AIDS / Ritonavir (a0, b0, u1, d1, k1).]

65 Delta-M Web Table
[Figure: Delta-M web tuples (1), (2) and (3). Tuples (1) and (2) are the AIDS tuples for Indavir and Ritonavir; tuple (3) pairs old and new versions of Impotence / Caverject (a0, b4, u7, d6, k7, u8).]

66 Delta-M Web Table
[Figure: Delta-M web tuples (4) and (5). Tuple (4) pairs Heart Disease / Hirudin with Heart Disorder / Hirudin (a0, b2, u2, d3, k3); tuple (5) pairs old and new versions of Cancer / Beta Carotene (a0, b1, d2, k2).]

67 Algorithm Delta

68 Algorithm Delta (continued)

69 Algorithm Delta (continued)

70 Algorithm of GenerateResult Tables

71 Algorithm of GenerateResult Tables (continued)

72 Algorithm for DeltasFromRightOuter

73 Algorithm for DeltasFromLeftOuter

74 Algorithm of DeltasFromJoin

75 Algorithm of DeltasFromJoin (continued)

76 Algorithm of CreateDeltaPlus

77 Future Work
Analytical and empirical studies of the algorithms for generating delta web tables
Mechanism to represent changes: modified, new or deleted nodes
Annotation on delta nodes
Extend to sub-page level
Query languages for querying the changes
Change notification service

