WHOWEDA : Warehouse of Web Data

1 WHOWEDA : Warehouse of Web Data
Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 11/18/2018 madria

2 WHOWEDA -Key Objectives
Design a suitable data model to represent web information. Develop a web algebra and query language. Maintain web data. Develop knowledge discovery and web mining tools. Build a web warehouse.

3 copy-right@sanjay madria
WHOWEDA - What? WareHouse Of Web Data Subject - oriented Integrated Temporal Granularity - Lower, higher Some summary Not updatable Alternative information sources 11/18/2018 madria

4 copy-right@sanjay madria
Web Warehouse? Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses 11/18/2018 madria

5 WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda
A WareHouse Of WEb DAta Web Information Coupling Model (WICM) Web Objects Web Schema Web Information Coupling Algebra Web Information Maintenance Web Mining and Knowledge discovery 11/18/2018 madria

6 Figure: system components
Components shown: WWW, User, Web Querying & Analysis Component, Concept Mart, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Web Warehouse, Web Marts.

7 Figure: web manipulation modules
Modules shown: User, WWW, Web Query & Display, Warehouse Concept Mart, Pre-processing, Web Warehouse, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Data Visualization, Schema Tightness, Schema Search, Schema Match, Web Select, Web Project, Web Join, Web Union, Web Intersection.

8 copy-right@sanjay madria
Web Objects
Node: url, title, format, size, date, text
Link: source-url, target-url, label, link-type
Web tuple, web table, web schema, web database
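The web objects listed on this slide can be sketched as plain data structures. This is a minimal illustration, not the WHOWEDA implementation: the class names, field types, and the example.org URLs are assumptions; only the attribute names come from the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A web document with the node attributes named on the slide."""
    url: str
    title: str
    format: str
    size: int      # assumed to be a byte count
    date: str      # assumed ISO date string
    text: str


@dataclass(frozen=True)
class Link:
    """A hyperlink with the link attributes named on the slide."""
    source_url: str
    target_url: str
    label: str
    link_type: str


@dataclass(frozen=True)
class WebTuple:
    """A set of inter-linked nodes and the links between them."""
    nodes: Tuple[Node, ...]
    links: Tuple[Link, ...]


# Hypothetical instances; the URLs are placeholders, not from the slides.
home = Node("http://example.org/", "Home", "html", 1024, "1998-05-01",
            "List of Diseases")
page = Node("http://example.org/aids.html", "AIDS", "html", 2048,
            "1998-05-02", "Symptoms of AIDS")
link = Link(home.url, page.url, "AIDS", "local")

web_tuple = WebTuple(nodes=(home, page), links=(link,))
web_table = [web_tuple]   # a web table is a collection of web tuples
```

In this sketch a web table is just a list of web tuples; the web schema that describes the table is kept separately.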

9 copy-right@sanjay madria
Web Schema
Metadata in the warehouse; a structural 'summary' of a web table.
Information coupling uses a query graph (query graph -> web schema).
A web schema is a directed graph represented by an ordered 4-tuple: set of node variables, set of link variables, connectivities, predicates.

11 copy-right@sanjay madria
Information Square's homepage Headline article 1 Headline article n News specials Airport info (List of video files) List of links to local news world news Local news 1 Local news k World news 1 World news t 11/18/2018 madria

12 copy-right@sanjay madria
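Figure (query graph): node variables x, y, z, w; link variables e, f, g, h. Labels shown: e.target_url CONTAINS "article"; y.url CONTAINS "headlines"; f.target_url CONTAINS "newshub/specials"; g.label CONTAINS "Local News"; z.url CONTAINS "local"; h.label CONTAINS "World News"; w.url CONTAINS "world".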

13 copy-right@sanjay madria
Information Square's homepage Headline article 1 News specials List of links to local news world news Local news 1 World news 1 11/18/2018 madria

14 copy-right@sanjay madria
Schema - example
Node variables: Xn = { x, y, z, w }
Link variables: Xl = { e, f, g }
Connectivities: C = { x<e>y and x<fg->z and x<fh->w }
The symbol # represents an unbound node variable or link variable, i.e. a variable not restricted by any predicate.
"-" represents one unbound link; "-+" represents more than one unbound link.

15 copy-right@sanjay madria
Predicates
P = { x.url=” y.url CONTAINS "headlines", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world" }
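A schema of this shape can be sketched as a small data structure whose predicates are checked against candidate bindings. This is an illustrative sketch only: the class names, the `matches` helper, and the example.org URLs are assumptions, and only two of the slide's predicates (on y and e) are used.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Predicate:
    variable: str    # node or link variable, e.g. "y"
    attribute: str   # e.g. "url", "label", "target_url"
    operator: str    # "CONTAINS" or "EQUALS"
    value: str

    def holds(self, bindings: Dict[str, Dict[str, str]]) -> bool:
        actual = bindings[self.variable][self.attribute]
        if self.operator == "CONTAINS":
            return self.value in actual
        if self.operator == "EQUALS":
            return actual == self.value
        raise ValueError(f"unknown operator: {self.operator}")


@dataclass(frozen=True)
class WebSchema:
    """Ordered 4-tuple <Xn, Xl, C, P> as described on the slides."""
    node_variables: FrozenSet[str]
    link_variables: FrozenSet[str]
    connectivities: Tuple[str, ...]
    predicates: Tuple[Predicate, ...]

    def matches(self, bindings: Dict[str, Dict[str, str]]) -> bool:
        # Hypothetical helper: true when every predicate holds.
        return all(p.holds(bindings) for p in self.predicates)


schema = WebSchema(
    node_variables=frozenset({"x", "y"}),
    link_variables=frozenset({"e"}),
    connectivities=("x<e>y",),
    predicates=(
        Predicate("y", "url", "CONTAINS", "headlines"),
        Predicate("e", "target_url", "CONTAINS", "article"),
    ),
)

# Placeholder bindings; these URLs are not from the slides.
good = {
    "x": {"url": "http://example.org/"},
    "y": {"url": "http://example.org/headlines/article1.html"},
    "e": {"target_url": "http://example.org/headlines/article1.html"},
}
bad = {
    "x": {"url": "http://example.org/"},
    "y": {"url": "http://example.org/sports/index.html"},
    "e": {"target_url": "http://example.org/sports/index.html"},
}
```

Connectivities are kept as opaque strings here; a fuller sketch would also check that the bound links actually connect the bound nodes.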

16 copy-right@sanjay madria
Query Graph - Example 1 A query graph is the same as a schema except that it has one more parameter to control the results returned. Informally, it is a directed, connected graph consisting of nodes, links, and the keywords imposed on them. Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at Web table Diseases

17 Treatment list q g Treatment Issues Symptoms list f y x z Symptoms List of Diseases e Evaluation Evaluation w p

18 q1 Treatment list g1 Treatment Issues f1 x0 y1 z1 Symptoms list AIDS Symptoms List of Diseases e1 Evaluation Evaluation w1 p2 Elisa Test

19 copy-right@sanjay madria
Example 2 Produce a list of drugs, and their uses and side effects starting from the web site at Web table Drugs 11/18/2018 madria

20 List of Diseases Drug list Issues Uses Use Side effects a b c d r s k Side effects

21 Side effects of Indavir Drug list Issues AIDS r1 a0 b1 c1 d1 Indavir Side effects List of Diseases Use s1 k1 Uses of Indavir

22 copy-right@sanjay madria
Query Language Starting from the CS dept. home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and have in their text “database”. 11/18/2018 madria

23 copy-right@sanjay madria
COUPLE WEBTABLE W FROM WWW SUCH THAT NODE I, J IN WWW and LINK e,f,g IN WWW AND I<e|f,g>J WHERE I.url EQUALS “ AND J.text CONTAINS “database” AND f.link-type EQUALS local AND g.link-type EQUALS local; 11/18/2018 madria

24 copy-right@sanjay madria
Web Algebra Formal foundation of data representation and manipulation in a web warehouse Web operators: Information access operator Information manipulation operators Web schema operators Data visualization operators 11/18/2018 madria

25 Information access operator
Global Web Coupling 11/18/2018 madria

26 Information Manipulation
Web select, web project, web join, web Cartesian product, web union, web intersect, local web coupling

27 copy-right@sanjay madria
Web Select
Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities.
Input: a select schema. Output: a web table satisfying the select schema.

28 copy-right@sanjay madria
Select the W1 tuples that contain world news about Indonesia since May: σ_Ms(W1), where Ms = < Xsn, Xsl, Cs, Ps >, Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
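The selection above can be sketched as a filter over an in-memory web table. This is a minimal illustration under stated assumptions: web tuples are modelled as dictionaries keyed by node variable, the predicates Ps are plain Python callables, and the sample tuples in `w1` are invented for the example.

```python
from datetime import date


def web_select(web_table, predicates):
    """Keep the web tuples for which every select predicate holds."""
    return [t for t in web_table if all(p(t) for p in predicates)]


# Ps from the slide: x.date > "1May1998", w.text CONTAINS "Indonesia"
ps = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

# Hypothetical contents of web table W1.
w1 = [
    {"x": {"date": date(1998, 5, 20)},
     "w": {"text": "Unrest in Indonesia continues"}},
    {"x": {"date": date(1998, 4, 2)},
     "w": {"text": "Indonesia talks resume"}},
    {"x": {"date": date(1998, 6, 1)},
     "w": {"text": "World cup preview"}},
]

selected = web_select(w1, ps)
```

Only the first tuple satisfies both predicates; the second is too old and the third does not mention Indonesia.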

29 copy-right@sanjay madria
Xn’ = { x, y, z, w },Xl’ = { e, f, g } C’ = { x<e>y and x<fg->z and x<fh->w } P’={x.url=” x.date > "1May1998", e.target_url CONTAINS "article", f.target.url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", w.text CONTAINS “Indonesia” } 11/18/2018 madria

30 Web Information Coupling System
A database system to couple related web information Global web Coupling and Local Web Coupling 11/18/2018 madria

31 Global Coupling - Information Access
To integrate data from the Web To create historical data To couple related information from the WWW satisfying a query graph Operator to create web tables From web with no schema to web table with web schema 11/18/2018 madria

32 copy-right@sanjay madria
Why local web coupling? Directly querying the WWW to gather this information is expensive and repetitive. Web documents containing similar information can reside in different web tables in a web warehouse. Local web coupling is a mechanism to gather this similar information by additional manipulation of the materialized web tables.

33 Local Web Couple operator
Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.

34 Local Web Couple operator
The web couple operator is basically a web cartesian product followed by web select: We denote web couple by the symbol: 11/18/2018 madria
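The statement that web couple is a web Cartesian product followed by a web select can be sketched as follows. This is a simplified illustration: web tuples are modelled as dictionaries keyed by node variable, the two tables are assumed to use disjoint variable names, and the coupling condition (equal text in nodes y and b) is a stand-in for the coupling conditions described later in the deck.

```python
from itertools import product


def web_cartesian_product(table1, table2):
    """Concatenate every tuple of table1 with every tuple of table2."""
    return [{**t1, **t2} for t1, t2 in product(table1, table2)]


def web_select(table, condition):
    """Keep the tuples that satisfy the condition."""
    return [t for t in table if condition(t)]


def web_couple(table1, table2, condition):
    """Web couple = web Cartesian product followed by web select."""
    return web_select(web_cartesian_product(table1, table2), condition)


# Node texts taken from the Diseases and Drugs examples in the slides.
diseases = [
    {"y": {"text": "AIDS"}},
    {"y": {"text": "Cancer"}},
    {"y": {"text": "Diabetes"}},
]
drugs = [
    {"b": {"text": "AIDS"}},
    {"b": {"text": "Heart Disorder"}},
]


def same_issue(t):
    # Assumed coupling condition, for illustration only.
    return t["y"]["text"] == t["b"]["text"]


all_pairs = web_cartesian_product(diseases, drugs)
coupled = web_couple(diseases, drugs, same_issue)
```

With n = 3 and m = 2 input tuples the product holds n*m = 6 tuples, while the coupled table holds at most that many, here 1.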

35 copy-right@sanjay madria
Web Coupling 11/18/2018 madria

36 copy-right@sanjay madria
Example 1 Produce a list of diseases and their symptoms starting from the web site at Web table Diseases 11/18/2018 madria

37 Web Schema or Query Graph of ``Diseases”
Issues symptoms e z x y symptoms List of Diseases Web Schema or Query Graph of ``Diseases”

38 Web table ``Diseases” List of Diseases http://www.panacea.org/ Issues
Symptoms of AIDS x0 y0 z0 e0 symptoms AIDS List of Diseases Issues Symptoms of Cancer x0 y1 z1 e1 symptoms Cancer List of Diseases Issues Symptoms of Diabetes x0 y2 z2 e2 symptoms Diabetes List of Diseases Issues Symptoms of Lung Diseases x0 y3 z3 e3 symptoms Lung Disease Web table ``Diseases”

39 copy-right@sanjay madria
Example 2 Produce a list of drugs, and their side effects starting from the web site at Web table Drugs 11/18/2018 madria

40 Web Schema or Query Graph of ``Drugs”
Drug list Side effects Issues r c a b d Side effects List of Diseases Web Schema or Query Graph of ``Drugs”

41 Web table ``Drugs” List of Diseases http://www.panacea.org/ Drug list
Issues Side effects a0 b1 c2 d2 r2 AIDS Ritonavir of Ritonavir b2 c3 d3 r3 Cancer Letrozole of letrozole c1 d1 r1 Indavir of Indavir b4 c4 d4 r4 Heart Disorder Beta Carotene of Beta Carotene Web table ``Drugs”

42 Symptoms & Side effects
Issues Symptoms of AIDS AIDS e0 z0 x0 y0 symptoms List of Diseases Side effects of Ritonavir Drug list Issues AIDS r2 a0 b1 c2 d2 Ritonavir Side effects Issues Symptoms of Cancer Cancer e1 z1 x0 y1 symptoms List of Diseases Side effects of betacarotene Issues Heart Disorder r4 a0 b4 c4 d4 Side effects Beta Carotene Symptoms & Side effects

43 copy-right@sanjay madria
M2 = < Xn”, Xl”, C”,P” > for W2 Xn” = { s, t, u}, Xl” = { k, l, m, n }, C” = { s<kl>t and s<mn>u }, P”{s.url= “ k.label = “REGION”, l.target_url= “ m.label = “WORLD”, n.target_url=“ 11/18/2018 madria

44 copy-right@sanjay madria
W1 qq W2 where q = (x.date=s.date) & (w.text CONTAINS “Indonesia”) & (t.text CONTAINS “Indonesia”) Schema of the coupled table is: 11/18/2018 madria

45 copy-right@sanjay madria
Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, k, l, m, n }, C*= { x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u } P* = { x.url=” e.target_url CONTAINS "article", f.target.url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", s.url = “ 11/18/2018 madria

46 copy-right@sanjay madria
k.label = “REGION”, l.target_url = “ m.label = “WORLD”, n.target_url = “ x.date = s.date, w.text CONTAINS “Indonesia”, t.text CONTAINS “Indonesia"} 11/18/2018 madria

47 copy-right@sanjay madria
Local Web Coupling Initiated explicitly by the user User provides the pair of node variables and the keyword set based on which coupling is to be performed Coupling nodes in each pair of web tuples in the input web tables must satisfy one of the coupling conditions 11/18/2018 madria

48 copy-right@sanjay madria
Types of web coupling System driven web coupling: the system decides the coupling nodes. If at least one pair of coupling nodes cannot be identified, the web tables cannot be coupled. User driven web coupling: the user decides the coupling nodes. Coupling is performed only on those user-specified node variable(s).

49 Attribute driven web coupling
Attribute driven web coupling: user specifies the coupling attributes and coupling is performed only on those user specified coupling attribute(s). COUPLE TABLE3 FROM TABLE1 AND TABLE 2 ON ATTRIBUTE “TEXT” AT SCHEMA/TUPLE(optional) 11/18/2018 madria

50 Value Driven web coupling
Value driven web coupling: user specifies the values of the attributes of the nodes on which coupling should be performed. COUPLE TABLE3 FROM TABLE1 AND TABLE 2 ON VALUE “Software Agents” AT SCHEMA/TUPLE(optional) 11/18/2018 madria

51 Schema level web coupling
We inspect the schemas to decide whether the two web tables can be coupled. If coupling conditions cannot be identified then the two web tables cannot be coupled. We do not inspect the web tuples in the web table. Number of web tuples coupled will be n*m. 11/18/2018 madria

52 Tuple level web coupling
We inspect the web tuples of the two input web tables to identify nodes with similar information. The number of web tuples in the coupled web table <=n*m 11/18/2018 madria

53 copy-right@sanjay madria
Why two levels? A schema does not capture all the information of the web documents in a web table, so it is not always possible to identify a coupling condition by inspecting the schemas. It is possible to find coupling nodes which are not defined in the schemas.

54 copy-right@sanjay madria
Why two levels? Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (that cannot be identified from their schemas) at the expense of additional processing.

55 Conditions for web coupling
The coupling nodes are and 11/18/2018 madria

61 Conditions for web coupling
The coupling nodes are and For example: computer.html 11/18/2018 madria

63 Conditions for web coupling
URLs with the same directory name, such as "/computer/", may contain similar information. Paths with "/cgi-bin/" are not considered. All conditions for web join are included as well.
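The directory-name condition can be sketched with the standard URL parser. This is one possible reading of the slide, not the system's actual rule: the function names are hypothetical, a trailing path segment without a slash is assumed to be a file name, and the example URLs are placeholders.

```python
from urllib.parse import urlparse


def directory_names(url):
    """Return the set of directory names that appear in the URL path."""
    path = urlparse(url).path
    parts = [p for p in path.split("/") if p]
    # Assumption: without a trailing slash the last segment is a file name.
    if parts and not path.endswith("/"):
        parts = parts[:-1]
    return set(parts)


def may_couple_by_directory(url_a, url_b):
    """True when both URLs share a directory name and neither is a CGI path."""
    for url in (url_a, url_b):
        if "/cgi-bin/" in urlparse(url).path:
            return False   # "/cgi-bin/" paths are not considered
    return bool(directory_names(url_a) & directory_names(url_b))


# Placeholder URLs for illustration.
same_dir = may_couple_by_directory(
    "http://a.example/computer/index.html",
    "http://b.example/dept/computer/",
)
cgi = may_couple_by_directory(
    "http://a.example/cgi-bin/computer/query",
    "http://b.example/computer/",
)
different = may_couple_by_directory(
    "http://a.example/music/a.html",
    "http://b.example/computer/b.html",
)
```

A shared directory name only suggests similar content, so a real system would presumably combine this check with the other coupling conditions.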

64 Construction of coupled schema (schema level)
Case 1: at least one pair of coupling nodes is identical (same URL). Case 2: none of the pairs is identical.

65 copy-right@sanjay madria
Case 1 If there exists at least one pair of coupling nodes which are identical to one another, the coupled schema is constructed as discussed in the web join paper (DEXA'98).

66 copy-right@sanjay madria
Case 2 11/18/2018 madria

67 Join Processing in Web Databases

68 copy-right@sanjay madria
Web Join
Concatenates tuples based on identical nodes or documents.
Input: two web tables and their schemas. Output: a joined table.
Types: Pi-web join, theta-web join, outer joins, web composition, semi web join.

69 copy-right@sanjay madria
Web Join Used for combining related data from various web tables Mechanism to detect changes Mechanism to find alternative web document in case of “Document Not Found” error 11/18/2018 madria

70 copy-right@sanjay madria
Web Join Operator Information manipulation operator Manipulate information residing in a web database to derive additional information Harness useful, composite information from two web tables Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries 11/18/2018 madria

71 copy-right@sanjay madria
Joinable Nodes Node variables participating in the web join process Expressed as a pair Each node in the pair should have identical URLs 11/18/2018 madria

72 copy-right@sanjay madria
Web Join
Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other whenever joinable nodes exist.
Joinable nodes are identified from the schemas of the two web tables.
The URLs of the joinable nodes are identical.
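The join rule just described can be sketched over in-memory tables. This is a minimal sketch under assumptions: web tuples are dictionaries keyed by node variable, the pair of joinable node variables is supplied by the caller rather than derived from the schemas, and the example.org URLs are placeholders.

```python
def web_join(table1, table2, joinable_pairs):
    """Concatenate tuples whose joinable nodes have identical URLs.

    joinable_pairs is a list of (variable_in_table1, variable_in_table2).
    """
    joined = []
    for t1 in table1:
        for t2 in table2:
            if any(t1[a]["url"] == t2[b]["url"] for a, b in joinable_pairs):
                joined.append({**t1, **t2})
    return joined


# Hypothetical tuples modelled on the Diseases and Drugs tables.
diseases = [
    {"x": {"url": "http://example.org/"},
     "y": {"url": "http://example.org/aids"}},
    {"x": {"url": "http://example.org/"},
     "y": {"url": "http://example.org/cancer"}},
]
drugs = [
    {"a": {"url": "http://example.org/"},
     "b": {"url": "http://example.org/aids"}},
    {"a": {"url": "http://example.org/"},
     "b": {"url": "http://example.org/heart"}},
]

joined = web_join(diseases, drugs, joinable_pairs=[("y", "b")])
```

Only the AIDS tuples share a URL in the joinable pair (y, b), so the joined table has one tuple carrying the nodes of both inputs.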

73 Treatment list q g Treatment Issues Symptoms list List of Diseases f y x z Symptoms e Evaluation Evaluation Drug list w p Issues r Side effects b c d Side effects s Use k Uses

74 AIDS treatment q1 g1 Symptoms of AIDS f1 y1 x0 z1 AIDS e1 AIDS Evaluation Elisa Test w1 p2 r1 Side effects of Indavir b1 c1 d1 Indavir s1 Uses of Indavir k1

75 copy-right@sanjay madria
Pi-Web Join 11/18/2018 madria

76 copy-right@sanjay madria
Example 1 Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at Web table Diseases 11/18/2018 madria

77 Query Graph (Web Schema) for Example 1
z p Disease List evaluation symptoms treatment q z Query Graph (Web Schema) for Example 1

78 A web tuple in ``Diseases”
q1 Treatment list x0 z1 Symptoms list AIDS List of Diseases Evaluation p2 Elisa Test A web tuple in ``Diseases”

79 copy-right@sanjay madria
Example 2 Produce a list of drugs, and their uses and side effects starting from the web site at Web table Drugs 11/18/2018 madria

80 Query Graph (Web Schema) of ``Drugs”
List of Diseases Drug list Uses a b d k Side effects Query Graph (Web Schema) of ``Drugs”

81 A web tuple in ``Drugs” List of Diseases http://www.panacea.org/ Drug
Uses of Indavir Use Side effects a0 b1 d1 k1 AIDS of Indavir A web tuple in ``Drugs”

82 copy-right@sanjay madria
Web Project
Eliminates irrelevant nodes from web tuples, based on project conditions:
Set of node variables
Start-node variable and end-node variable
Node variable and depth of links
Used to isolate data of interest in a web table, allowing subsequent web queries to run over a smaller, more structured web table.

83 A web project on ``Diseases”
x0 z1 Symptoms list AIDS List of Diseases Evaluation p2 A web project on ``Diseases”

84 Joined schema q http://www.panacea.org/ z x p Drug list Side effects b
treatment z x symptoms Disease List p evaluation Drug list Side effects b d Joined schema k Uses

85 Joined Tuple q1 Treatment list http://www.panacea.org/ x0 z1 Symptoms
AIDS List of Diseases AIDS Evaluation p2 Side effects of Indavir Drug list Elisa Test b1 d1 Indavir Side effects Use k1 Uses of Indavir Joined Tuple

86 Motivation of Pi-web Join
Quite often a web join operation couples irrelevant nodes. In a complex web query with several web join operations, the resulting web table can become very large, with many "contaminated" nodes. Pi-web join resolves this limitation by eliminating "contaminated" nodes, which reduces the size of the joined web table.

87 copy-right@sanjay madria
Pi-web Join
Web join followed by web project.
The projection conditions are specified by the user and are similar to those of web project.
The joinable nodes are not eliminated; retaining them preserves the correlation between the information captured from the two web tables.
Pi-web join may result in a web bag.
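Pi-web join as join followed by project can be sketched as below, including how it can produce a web bag. This is an illustration under assumptions: web tuples are dictionaries keyed by node variable, projection is by a set of node variables only, and the example.org URLs are placeholders.

```python
def web_join(table1, table2, joinable_pairs):
    """Concatenate tuples whose joinable nodes have identical URLs."""
    return [
        {**t1, **t2}
        for t1 in table1
        for t2 in table2
        if any(t1[a]["url"] == t2[b]["url"] for a, b in joinable_pairs)
    ]


def web_project(table, keep):
    """Keep only the node variables named in keep."""
    return [{v: n for v, n in t.items() if v in keep} for t in table]


def pi_web_join(table1, table2, joinable_pairs, keep):
    """Web join followed by web project; joinable nodes are always kept."""
    joinable = {v for pair in joinable_pairs for v in pair}
    return web_project(web_join(table1, table2, joinable_pairs),
                       set(keep) | joinable)


# Hypothetical tuples: one disease tuple, two drug tuples for the same issue.
diseases = [
    {"y": {"url": "http://example.org/aids"},
     "z": {"url": "http://example.org/aids/symptoms"}},
]
drugs = [
    {"b": {"url": "http://example.org/aids"},
     "d": {"url": "http://example.org/indavir/side-effects"}},
    {"b": {"url": "http://example.org/aids"},
     "d": {"url": "http://example.org/ritonavir/side-effects"}},
]

result = pi_web_join(diseases, drugs, [("y", "b")], keep={"z"})
```

Projecting away d makes the two joined tuples identical, so `result` is a web bag containing the same tuple twice.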

88 copy-right@sanjay madria
Example 3 Produce a list of diseases with their symptoms and side-effects starting from the web site at 11/18/2018 madria

89 copy-right@sanjay madria
Procedure Perform web join on “Diseases” and “Drugs” Project node variables b, k, q, p, node variables between a and q, node variables between b and k, node variables between b and d 11/18/2018 madria

90 Pi-joined schema http://www.panacea.org/ z x Side effects d
Disease List x symptoms Side effects d Pi-joined schema

91 Pi-joined Tuple http://www.panacea.org/ x0 z1 Symptoms list AIDS
List of Diseases Side effects of Indavir d1 Pi-joined Tuple

92 Benefits of Pi-web Join
Minimize the amount of data transmitted over the network in distributed web join processing Reduction in storage cost associated with a joined web table Reduces cognitive overhead associated with locating relevant nodes Improve completeness of schema by removing unbound nodes and links 11/18/2018 madria

93 copy-right@sanjay madria
Web Bags
A web bag is a web table in which identical web tuples exist; it is created by the web project operation.
Used in structure-based mining for discovering visible nodes, luminous nodes, and luminous paths.

94 copy-right@sanjay madria
Definitions
Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D.
Luminosity is the reverse of visibility: the number of other distinct documents that are linked from D.
Luminous path: a set of inter-linked nodes which occurs a number of times in a web table.
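The two counting definitions can be sketched over a list of (source, target) links. This sketch computes raw counts only; the normalised formulas of the cited paper (FODO-98) are not reproduced in this transcript, so the functions and the sample link list below are assumptions for illustration.

```python
def visibility(links, d):
    """Number of distinct other documents that link to d."""
    return len({src for src, dst in links if dst == d and src != d})


def luminosity(links, d):
    """Number of distinct other documents that d links to."""
    return len({dst for src, dst in links if src == d and dst != d})


# Hypothetical links of a web table, as (source, target) document ids.
links = [
    ("a", "d"),
    ("b", "d"),
    ("a", "d"),   # repeated link from the same document counts once
    ("d", "e"),
    ("d", "f"),
    ("d", "g"),
]

vis_d = visibility(links, "d")
lum_d = luminosity(links, "d")
```

Document d is linked from two distinct documents and links to three, so its raw visibility is 2 and its raw luminosity is 3.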

95 copy-right@sanjay madria
Inter-site Support
Quantifies the inter-site connectivity of a node in a web table.
Let x be a node and hx denote the host name of node x. Let H be a bag of host names of all nodes in W that have a direct link to/from x. Let Ch be the number of times hx appears in H. Then the inter-site support is I = 1 - Ch/|H|.
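The formula I = 1 - Ch/|H| can be worked through in a few lines. This is a sketch under assumptions: host names are extracted with the standard URL parser, the function name is hypothetical, and the example URLs are placeholders.

```python
from urllib.parse import urlparse


def inter_site_support(x_url, neighbour_urls):
    """Compute I = 1 - Ch/|H| for node x.

    neighbour_urls are the URLs of the nodes directly linked to or from x;
    H is the bag (list) of their host names, Ch the count of x's own host.
    """
    hx = urlparse(x_url).hostname
    hosts = [urlparse(u).hostname for u in neighbour_urls]
    if not hosts:
        raise ValueError("x has no directly linked nodes, so |H| = 0")
    ch = hosts.count(hx)
    return 1 - ch / len(hosts)


# Placeholder URLs: one neighbour on x's own host, three on other hosts.
support = inter_site_support(
    "http://a.example/page.html",
    [
        "http://a.example/other.html",
        "http://b.example/index.html",
        "http://c.example/index.html",
        "http://b.example/news.html",
    ],
)
```

Here Ch = 1 and |H| = 4, so I = 1 - 1/4 = 0.75; a value near 1 means most of the node's links cross site boundaries.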

96 Steps to find visible nodes
Input: web table W, node variable x, visibility threshold v
Output: set of visible nodes and the inter-site support for each node
Create a web table from W where each web tuple contains distinct instances of node x and the preceding node which is linked to x (use web project, and create distinct tuples if node x has more than one incoming edge).
Eliminate the nodes linked to x in each tuple of the web table using web project.

97 copy-right@sanjay madria
Check whether the collection of web tuples of node x thus created is a web bag by comparing their URLs.
Create multiplets for each collection of identical nodes.
For each multiplet, calculate the node visibility (using the mathematical formula defined in FODO-98).
Determine the multiplets with node visibility greater than the threshold.
Create the visible node set and calculate the inter-site support.
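The grouping of a web bag into multiplets and the threshold test can be sketched as follows. The visibility formula from FODO-98 is not given in this transcript, so this sketch uses the multiplet size (the number of identical projected instances) as a stand-in; that substitution and the function names are assumptions. The node ids come from the projected "Cancer" example in the later slides.

```python
from collections import Counter


def multiplets(projected_nodes):
    """Group identical projected nodes; maps each node to its multiplicity."""
    return Counter(projected_nodes)


def visible_nodes(projected_nodes, threshold):
    """Nodes whose multiplet size is greater than the threshold.

    Stand-in measure: multiplet size is used instead of the FODO-98 formula.
    """
    return {node for node, count in multiplets(projected_nodes).items()
            if count > threshold}


# Projected instances of node z after eliminating x and y.
bag = ["z1", "z1", "z2", "z1", "z4"]

groups = multiplets(bag)
high = visible_nodes(bag, threshold=2)
low = visible_nodes(bag, threshold=0)
```

With a threshold of 2 only z1 qualifies; lowering the threshold to 0 admits z2 and z4 as well.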

98 Steps to find luminous nodes
Input: Web table W, node variable x, luminosity threshold l Output: Set of luminous nodes with inter-site support Steps are similar to that of visible node discovery We consider the nodes linked from x in place of nodes linked to x 11/18/2018 madria

99 Steps to find luminous paths
Input - web table W, nodes x and y Output - threshold value for luminous path Project nodes between x and y and check for web bag else go to next slide Create the collection of multiplets Compute path luminosity for each multiplet using the formula If the path luminosity value of a multiplet is greater than or equal to threshold then a path in the multiplet is a luminous path 11/18/2018 madria

100 Steps to find luminous paths
Otherwise, we create a collection of linear web tuples from the above collection of web tuples. This is to identify whether there exists a subset of inter-linked nodes between x and y that are luminous paths. We repeat the procedure to compute path luminosity for these sets of inter-linked nodes.

101 Web Schema Cancer e f x y z Diseases Cancer

102 Web Table Cancer http://www.panacea.org/ Diseases f0 x0 y0 z1 e0
Cancer Diseases f0 x0 z1 y0 e0 Cancer Cancer Diseases f0 x0 z2 y0 e0 Cancer Cancer Diseases f0 x0 y0 z1 e0 Cancer Cancer Diseases f0 x0 z4 y0 e0 Cancer Web Table

103 Projected schema z Cancer

104 Web Table after eliminating x and y
Cancer z1 Cancer z2 z4 Web Table after eliminating x and y

105 Projected schema Cancer e z x y Diseases

106 Web Bag http://www.panacea.org/ Cancer x0 y0 z1 Diseases
Cancer x0 y0 z1 Diseases Cancer x0 y0 z2 Diseases Cancer x0 y0 z1 Diseases Diseases x0 y0 z4 Cancer Web Bag

107 After removal of identical tuples
Cancer x0 y0 z1 Diseases z2 z4

108 Cancer z1 Cancer z1 Cancer z2 Cancer z1 Cancer z4

109 Cancer z1 z2 z4

110 Visible Nodes Cancer http://www.cancer.org/desc.html z1 Cancer z2
Cancer z1 Cancer z4

111 Luminous Paths

112 copy-right@sanjay madria
Change Management
Detect web deltas w.r.t. a user query.
Changes in inter-linked web documents: insert path, delete path, update path.
Representing changes with web algebraic operators: web join, web outer join.
Querying changes.

113 Mining in Web Warehouse
web structure mining : Web structure mining involves mining the web document’s structures and links. web content mining : Web content mining describes the automatic search of information resources available on-line. web usage mining : Web usage mining includes the data from server access logs, user registration or profiles, user sessions or transactions etc. 11/18/2018 madria

115 copy-right@sanjay madria
From the results returned, find the most visible pages. Assume Z1 is the most visible page with the given threshold. This gives estimates about different restaurants selling pizzas. A lower threshold gives the set (Z1, Z2) as visible pages, which sell both pizza and pasta. Generalize rules such as: out of the 66% of restaurants which offer pizza to their customers, 33% also offer pasta.

116 Application - Luminosity
Association rules such as: of the X% of all companies which make a product "A", Y% also make a set of products "B and C". Example: certain companies (33%), if they make a product A, also make products B and C. The company C makes only the product A. That is, of the 66% of companies which make a product "A", 33% also make products B and C.

118 copy-right@sanjay madria
More Operators . . . Web schema operators: Schema tightness operator, Schema match operator, Schema search operator Data visualization operators: Ranking operators (Global & Local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort 11/18/2018 madria

119 Partitioning of web tables
Partitioning web tables restructured easily indexed easily monitored easily reorganized easily By time schema tree structure keywords 11/18/2018 madria

120 Warehouse Concept Mart (WCMart)
Subject oriented Concept generation. Manually -> Autonomous. Used for: Ranking tuples Global web coupling Content based mining 11/18/2018 madria

121 copy-right@sanjay madria
Web Data Refinement Improve web schema - schema tightness operator Partition web tables based on content and structure 11/18/2018 madria

123 WWW Global Web Coupling Webtable (Jan) Webtable (Feb) Webtable (Mar)
Warehouse Concept Mart Global Web Coupling Webtable (Jan) Webtable (Feb) Webtable (Mar) Webtable (Apr)

124 Lower-level Granularity Higher level Granularity Web Information
Webtable (Jan) Webtable (Feb) Webtable (Mar) Webtable (Apr) Lower-level Granularity Web Information Manipulation Operators Higher level Granularity Summarized data

125 What type of information can be summarized?
Structural Content-based time-variant analysis snapshot analysis compare one period with another trend analysis 11/18/2018 madria

126 Structural Summarization
Most volatile documents Sites which change frequently Rate of change over time a pointer to directly access documents which change rapidly Most visible nodes, luminous nodes, luminous paths Change with time Decrease or increase - Analyze the reason 11/18/2018 madria

127 Content Summarization
What can be aggregated in a web page? Number of links with identical labels. Number of keywords. Changes in content with time. Comparing the changes. Open question. XML will improve the ability to analyze web data.

128 copy-right@sanjay madria
Summary Current status: Mechanism for accessing and manipulating web information in WHOWEDA Implementing various web operators and query language Future research What types of information can be summarized? What types of knowledge can be mined? Refine web warehouse architecture 11/18/2018 madria

