WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 skm@cs.purdue.edu 11/18/2018 copy-right@sanjay madria
WHOWEDA - Key Objectives: design a suitable data model to represent web information; develop a web algebra and query language; maintain web data; develop knowledge discovery and web mining tools; web warehouse.
WHOWEDA - What? WareHouse Of WEb DAta: subject-oriented; integrated; temporal; granularity (lower, higher); some summary; not updatable; alternative information sources.
Web Warehouse? A subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis in support of decision making. Web warehousing is a process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses.
WHOWEDA! (www.cais.ntu.edu.sg:8000/~whoweda) A WareHouse Of WEb DAta: Web Information Coupling Model (WICM); web objects; web schema; Web Information Coupling Algebra; web information maintenance; web mining and knowledge discovery.
[Figure: WHOWEDA architecture. Components shown: WWW, User, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Web Querying & Analysis Component, Concept Mart, Web Warehouse, Web Marts.]
[Figure: global and local web manipulation. Components shown: WWW, User, Web Query & Display, Warehouse Concept Mart, Pre-processing, Web Warehouse, Data Visualization; Global Web Manipulation (Global Web Coupling, Global Ranking); Local Web Manipulation (Local Web Coupling, Local Ranking, Web Select, Web Project, Web Join, Web Union, Web Intersection); Schema Tightness, Schema Search, Schema Match.]
Web Objects: Node (url, title, format, size, date, text); Link (source-url, target-url, label, link-type); web tuple; web table; web schema; web database.
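The web objects listed above can be pictured as plain records. The following is a minimal sketch, not the WHOWEDA implementation: the class names, the default values, the "local" link type value and the diseases.html URL are assumptions made for illustration; only the attribute names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """A web document with the node attributes listed on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink with the link attributes listed on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # assumed values such as "local" or "global"


@dataclass
class WebTuple:
    """Binds node and link variables (e.g. "x", "e") to concrete objects."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class WebTable:
    """A named collection of web tuples; the web schema is kept separately."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


home = Node(url="http://www.panacea.org/")
diseases = Node(url="http://www.panacea.org/diseases.html")  # hypothetical URL
link = Link(home.url, diseases.url, label="List of Diseases", link_type="local")
table = WebTable("Diseases", [WebTuple({"x": home, "y": diseases}, {"e": link})])
```

Freezing Node and Link makes them hashable, which is convenient later when identical nodes have to be grouped or compared.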
Web Schema: metadata in the warehouse; a structural 'summary' of a web table. Information coupling uses a query graph, and the query graph yields the web schema. A web schema is a directed graph represented by an ordered 4-tuple: a set of node variables, a set of link variables, connectivities, and predicates.
[Figure: structure of the Information Square web site. Nodes shown: Information Square's homepage; headline articles 1 to n; News@TCS; News specials; Airport info (list of video files); list of links to local news and world news; local news 1 to k; world news 1 to t.]
[Figure: query graph with node variables x, y, z, w and link variables e, f, g, h, annotated with the predicates y.url CONTAINS "headlines", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world".]
[Figure: a web tuple instance. Nodes shown: Information Square's homepage; headline article 1; News specials; list of links to local news and world news; local news 1; world news 1.]
Schema - example
Node variables: Xn = { x, y, z, w }
Link variables: Xl = { e, f, g, h }
Connectivities: C = { x<e>y and x<fg->z and x<fh->w }
The symbol # represents an unbound node variable or link variable, that is, a variable not restricted by any predicate. "-" represents one unbound link; "-+" represents more than one unbound link.
Predicates: P = { x.url = "http://www.mediacity.com.sg/i-square", y.url CONTAINS "headlines", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world" }
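To make the ordered 4-tuple concrete, here is a small sketch that stores the example schema and checks its predicates against one web tuple. The dictionary layout, the helper names holds and satisfies, and every URL other than the i-square homepage are assumptions for illustration. The link variable h is included because the connectivities and predicates use it.

```python
# Predicates are stored as (variable, attribute, operator, value).
schema = {
    "nodes": {"x", "y", "z", "w"},
    "links": {"e", "f", "g", "h"},
    "connectivities": ["x<e>y", "x<fg->z", "x<fh->w"],
    "predicates": [
        ("x", "url", "EQUALS", "http://www.mediacity.com.sg/i-square"),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
        ("f", "target_url", "CONTAINS", "newshub/specials"),
        ("g", "label", "CONTAINS", "Local News"),
        ("z", "url", "CONTAINS", "local"),
        ("h", "label", "CONTAINS", "World News"),
        ("w", "url", "CONTAINS", "world"),
    ],
}


def holds(predicate, bindings):
    """Evaluate one predicate against the objects bound in a web tuple."""
    var, attr, op, value = predicate
    actual = bindings.get(var, {}).get(attr, "")
    if op == "EQUALS":
        return actual == value
    if op == "CONTAINS":
        return value in actual
    raise ValueError(f"unknown operator: {op}")


def satisfies(schema, bindings):
    """A web tuple satisfies the schema if every predicate holds."""
    return all(holds(p, bindings) for p in schema["predicates"])


base = "http://www.mediacity.com.sg/i-square"
web_tuple = {  # all URLs below the homepage are hypothetical
    "x": {"url": base},
    "y": {"url": base + "/headlines/index.html"},
    "z": {"url": base + "/local/1.html"},
    "w": {"url": base + "/world/1.html"},
    "e": {"target_url": base + "/headlines/article1.html"},
    "f": {"target_url": base + "/newshub/specials/index.html"},
    "g": {"label": "Local News"},
    "h": {"label": "World News"},
}
ok = satisfies(schema, web_tuple)
```

The connectivities are kept as strings here; checking them would need the link structure of the tuple, which this sketch leaves out.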
Query Graph - Example 1. A query graph is the same as a schema except that it has one more parameter to control the results returned. Informally, it is a directed connected graph consisting of nodes and links with keywords imposed on them. Query: produce a list of diseases with their symptoms, evaluation procedures and treatment, starting from the web site at http://www.panacea.org/. The result is the web table Diseases.
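[Figure: query graph of "Diseases", starting at http://www.panacea.org/. Node variables x, y, z, w, p, q; link variables e, f, g. Labels shown: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment.]
[Figure: a web tuple of "Diseases". Instances x0, y1, z1, w1, p2, q1 with links e1, f1, g1. Labels shown: Issues, List of Diseases, AIDS, Symptoms list, Symptoms, Evaluation, Elisa Test, Treatment list, Treatment.]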
Example 2. Query: produce a list of drugs with their uses and side effects, starting from the web site at http://www.panacea.org/. The result is the web table Drugs.
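[Figure: query graph of "Drugs", starting at http://www.panacea.org/. Node variables a, b, c, d, k; link variables r, s. Labels shown: Issues, List of Diseases, Drug list, Uses, Use, Side effects.]
[Figure: a web tuple of "Drugs". Instances a0, b1, c1, d1, k1 with links r1, s1. Labels shown: Issues, List of Diseases, AIDS, Drug list, Indavir, Side effects of Indavir, Uses of Indavir.]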
Query Language. Example: starting from the CS dept. home page at NTU, find all documents that are linked through paths of length at most two containing only local links, and that have "database" in their text.
COUPLE WEBTABLE W FROM WWW
SUCH THAT NODE I, J IN WWW AND LINK e, f, g IN WWW AND I<e|f,g>J
WHERE I.url EQUALS "http://www.ntu.edu.sg"
AND J.text CONTAINS "database"
AND f.link-type EQUALS local
AND g.link-type EQUALS local;
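As a rough illustration of what this query asks for, the sketch below evaluates it over a tiny in-memory link graph instead of the WWW. It follows the query as written: J is reached from I either through a single link e, or through two links f and g that are both local. The page dictionary layout, the function name couple and all URLs other than http://www.ntu.edu.sg are assumptions.

```python
def couple(pages, start, keyword):
    """Return URLs J with I<e>J or I<f,g>J (f, g local) whose text has keyword."""
    results = set()
    # Paths of length one: I<e>J.
    for target, _link_type in pages[start]["links"]:
        if keyword in pages.get(target, {}).get("text", ""):
            results.add(target)
    # Paths of length two: I<f,g>J with both links local.
    for mid, type_f in pages[start]["links"]:
        if type_f != "local":
            continue
        for target, type_g in pages.get(mid, {}).get("links", []):
            if type_g == "local" and keyword in pages.get(target, {}).get("text", ""):
                results.add(target)
    return results


start = "http://www.ntu.edu.sg"
pages = {  # hypothetical pages and links
    start: {"text": "home", "links": [
        (start + "/a", "local"),
        ("http://example.org/b", "global"),
    ]},
    start + "/a": {"text": "courses", "links": [(start + "/a/db", "local")]},
    start + "/a/db": {"text": "database systems", "links": []},
    "http://example.org/b": {"text": "a database mirror", "links": [
        ("http://example.org/c", "local"),
    ]},
    "http://example.org/c": {"text": "database", "links": []},
}
found = couple(pages, start, "database")
```

In this toy graph http://example.org/c is not returned, because the first link on its path is not local.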
Web Algebra: the formal foundation of data representation and manipulation in a web warehouse. Web operators: information access operator; information manipulation operators; web schema operators; data visualization operators.
Information access operator: global web coupling.
Information manipulation operators: web select; web project; web join; web Cartesian product; web union; web intersect; local web coupling.
Web Select: extracts the web tuples of a web table that satisfy certain conditions on node and link variables and on connectivities. The input is a select schema; the output is a web table satisfying the select schema.
Example: select the tuples of W1 that contain world news about Indonesia since 1 May 1998: σ_Ms(W1), where Ms = < Xsn, Xsl, Cs, Ps >, Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
Schema of the resulting web table:
Xn' = { x, y, z, w }, Xl' = { e, f, g, h }
C' = { x<e>y and x<fg->z and x<fh->w }
P' = { x.url = "http://www.mediacity.com.sg/i-square", x.date > "1May1998", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", w.text CONTAINS "Indonesia" }
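The selection above can be sketched as a filter over web tuples. This is only an illustration of the idea: web tuples are modelled as dictionaries from variable names to attribute dictionaries, the function name web_select is an assumption, and the sample dates and texts are invented.

```python
from datetime import date


def web_select(table, predicates):
    """Keep the web tuples for which every select predicate holds."""
    return [t for t in table if all(p(t) for p in predicates)]


# Hypothetical content for W1; only variables x and w matter for this select.
W1 = [
    {"x": {"date": date(1998, 5, 10)}, "w": {"text": "Indonesia: market update"}},
    {"x": {"date": date(1998, 4, 20)}, "w": {"text": "Indonesia: earlier report"}},
    {"x": {"date": date(1998, 5, 12)}, "w": {"text": "Other world news"}},
]

# Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
Ps = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]
selected = web_select(W1, Ps)
```

The schema of the output is the input schema with the select predicates added, which is what the primed sets above express.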
Web Information Coupling System: a database system to couple related web information, through global web coupling and local web coupling.
Global Coupling - Information Access: integrates data from the Web; creates historical data; couples related information from the WWW that satisfies a query graph; is the operator that creates web tables; goes from the Web, which has no schema, to a web table with a web schema.
Why local web coupling? Directly querying the WWW to gather this information is expensive and repetitive. Web documents containing similar information can reside in different web tables in a web warehouse. Local web coupling is a mechanism to gather such similar information by additional manipulation of the materialized web tables.
Local Web Couple operator: two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.
Local Web Couple operator: the web couple operator is essentially a web Cartesian product followed by a web select.
Web Coupling
Example 1. Query: produce a list of diseases and their symptoms, starting from the web site at http://www.panacea.org/. The result is the web table Diseases.
[Figure: web schema (query graph) of "Diseases", starting at http://www.panacea.org/. Node variables x, y, z; link variable e. Labels shown: Issues, List of Diseases, symptoms.]
Web table "Diseases" (four web tuples, each starting at http://www.panacea.org/; labels shown: Issues, List of Diseases, symptoms):
(x0, y0, z0; e0): AIDS, Symptoms of AIDS
(x0, y1, z1; e1): Cancer, Symptoms of Cancer
(x0, y2, z2; e2): Diabetes, Symptoms of Diabetes
(x0, y3, z3; e3): Lung Disease, Symptoms of Lung Diseases
Example 2. Query: produce a list of drugs and their side effects, starting from the web site at http://www.panacea.org/. The result is the web table Drugs.
[Figure: web schema (query graph) of "Drugs", starting at http://www.panacea.org/. Node variables a, b, c, d; link variable r. Labels shown: Issues, List of Diseases, Drug list, Side effects.]
Web table "Drugs" (four web tuples, each starting at http://www.panacea.org/; labels shown: Issues, List of Diseases, Drug list, Side effects):
(a0, b1, c2, d2; r2): AIDS, Ritonavir, Side effects of Ritonavir
(a0, b2, c3, d3; r3): Cancer, Letrozole, Side effects of Letrozole
(a0, b1, c1, d1; r1): AIDS, Indavir, Side effects of Indavir
(a0, b4, c4, d4; r4): Heart Disorder, Beta Carotene, Side effects of Beta Carotene
Coupled web table "Symptoms & Side effects" (two coupled web tuples):
(x0, y0, z0; e0): AIDS, Symptoms of AIDS, coupled with (a0, b1, c2, d2; r2): AIDS, Ritonavir, Side effects of Ritonavir
(x0, y1, z1; e1): Cancer, Symptoms of Cancer, coupled with (a0, b4, c4, d4; r4): Heart Disorder, Beta Carotene, Side effects of Beta Carotene
Schema of W2: M2 = < Xn", Xl", C", P" >
Xn" = { s, t, u }, Xl" = { k, l, m, n }
C" = { s<kl>t and s<mn>u }
P" = { s.url = "http://www.asia1.com.sg/straitstimes/", k.label = "REGION", l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea*.html", m.label = "WORLD", n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld*.html" }
Web couple of W1 and W2 under the condition θ, where θ = (x.date = s.date) & (w.text CONTAINS "Indonesia") & (t.text CONTAINS "Indonesia"). The schema of the coupled table is:
Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, h, k, l, m, n }
C* = { x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u }
P* = { x.url = "http://www.mediacity.com.sg/i-square", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", s.url = "http://www.asia1.com.sg/straitstimes/", k.label = "REGION", l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea*.html", m.label = "WORLD", n.target_url = "http://www.asia1.com.sg/straitstimes/pages/world*.html", x.date = s.date, w.text CONTAINS "Indonesia", t.text CONTAINS "Indonesia" }
Local Web Coupling: initiated explicitly by the user. The user provides the pair of node variables and the keyword set on which coupling is to be performed. The coupling nodes in each pair of web tuples from the input web tables must satisfy one of the coupling conditions.
Types of web coupling. System driven web coupling: the system decides the coupling nodes. If not even one pair of coupling nodes can be identified, the web tables cannot be coupled. User driven web coupling: the user decides the coupling nodes. Coupling is performed only on the user-specified node variable(s).
Attribute driven web coupling: the user specifies the coupling attributes, and coupling is performed only on those attribute(s). Syntax: COUPLE TABLE3 FROM TABLE1 AND TABLE2 ON ATTRIBUTE "TEXT" AT SCHEMA/TUPLE (optional)
Value driven web coupling: the user specifies the values of the node attributes on which coupling should be performed. Syntax: COUPLE TABLE3 FROM TABLE1 AND TABLE2 ON VALUE "Software Agents" AT SCHEMA/TUPLE (optional)
Schema level web coupling: we inspect the schemas to decide whether the two web tables can be coupled. If coupling conditions cannot be identified, the two web tables cannot be coupled. We do not inspect the web tuples in the web tables. The number of web tuples coupled will be n*m, where n and m are the numbers of web tuples in the two input web tables.
Tuple level web coupling: we inspect the web tuples of the two input web tables to identify nodes with similar information. The number of web tuples in the coupled web table is <= n*m.
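The difference in result size between the two levels can be seen in a small sketch. It treats web couple as a Cartesian product followed by a select, and approximates "similar information" by a shared keyword in the text of the two coupling nodes; that similarity test, the function names and the sample texts are all assumptions.

```python
from itertools import product


def couple_schema_level(table1, table2):
    """Schema level: every pair of web tuples is coupled, giving n*m tuples."""
    return [{**a, **b} for a, b in product(table1, table2)]


def couple_tuple_level(table1, table2, var1, var2, keywords):
    """Tuple level: keep a pair only if the coupling nodes share a keyword."""
    coupled = []
    for a, b in product(table1, table2):
        text1, text2 = a[var1]["text"], b[var2]["text"]
        if any(k in text1 and k in text2 for k in keywords):
            coupled.append({**a, **b})
    return coupled


# Hypothetical node texts loosely based on the Diseases and Drugs examples.
diseases = [
    {"z": {"text": "Symptoms of AIDS"}},
    {"z": {"text": "Symptoms of Cancer"}},
]
drugs = [
    {"d": {"text": "Side effects of Ritonavir, used against AIDS"}},
    {"d": {"text": "Side effects of Letrozole, used against Cancer"}},
    {"d": {"text": "Side effects of Beta Carotene"}},
]

at_schema_level = couple_schema_level(diseases, drugs)
at_tuple_level = couple_tuple_level(diseases, drugs, "z", "d", {"AIDS", "Cancer"})
```

Merging the two dictionaries works here because the two tables use disjoint variable names, as the schemas in the examples do.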
Why two levels? A schema does not capture all the information in the web documents of a web table, so it is not always possible to identify a coupling condition by inspecting the schemas. It is possible to find coupling nodes that are not defined in the schemas.
Why two levels? Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (which cannot be identified from their schemas), at the expense of additional processing.
Conditions for web coupling: the coupling nodes are ... and ...
Conditions for web coupling: the coupling nodes are ... and ... For example: computer.html
Conditions for web coupling: URLs with the same directory name, such as "/computer/", may contain similar information. Paths with "/cgi-bin/" are not considered. All conditions for web join are included.
Construction of coupled schema (schema level): there are two cases. Case 1: at least one pair of coupling nodes is identical (same URL). Case 2: no pair of coupling nodes is identical.
Case 1: if there exists at least one pair of coupling nodes that are identical to one another, we construct the coupled schema as discussed in the web join paper (DEXA'98).
Case 2
Join Processing in Web Databases
Web Join: concatenates tuples based on identical nodes or documents. The inputs are two web tables and their schemas; the output is a joined table. Types: pi-web join, theta-web join, outer joins, web composition, semi web join.
Web Join is used for combining related data from various web tables, as a mechanism to detect changes, and as a mechanism to find an alternative web document in case of a "Document Not Found" error.
Web Join Operator: an information manipulation operator. It manipulates information residing in a web database to derive additional information, harnesses useful, composite information from two web tables, and capitalizes on the reuse of data already retrieved from the WWW in order to reduce the execution time of queries.
Joinable Nodes: the node variables participating in the web join process, expressed as a pair. The two nodes in a pair should have identical URLs.
Web Join combines two web tables by concatenating a web tuple of one web table with a web tuple of the other whenever there exist joinable nodes. Joinable nodes are identified from the schemas of the two web tables; the URLs of the joinable nodes are identical.
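A minimal sketch of this concatenation rule follows. Web tuples are again modelled as dictionaries from variable names to attribute dictionaries; the function name web_join, the way joinable pairs are passed in, and every URL except http://www.panacea.org/ are assumptions for illustration.

```python
def web_join(table1, table2, joinable):
    """Concatenate tuples whose joinable node pair has identical URLs.

    joinable is a list of (variable in table1, variable in table2) pairs,
    assumed to have been identified from the two schemas beforehand.
    """
    joined = []
    for a in table1:
        for b in table2:
            if any(a[v1]["url"] == b[v2]["url"] for v1, v2 in joinable):
                joined.append({**a, **b})
    return joined


root = "http://www.panacea.org/"
diseases = [  # hypothetical URLs below the root
    {"x": {"url": root}, "z": {"url": root + "aids/symptoms.html"}},
]
drugs = [
    {"a": {"url": root}, "d": {"url": root + "indavir/side-effects.html"}},
    {"a": {"url": "http://www.example.org/"}, "d": {"url": "http://www.example.org/d"}},
]
joined = web_join(diseases, drugs, [("x", "a")])
```

Only the first Drugs tuple is joined, since its node a has the same URL as node x of the Diseases tuple.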
[Figure: joined schema of "Diseases" and "Drugs", starting at http://www.panacea.org/. Node variables x, y, z, w, p, q, b, c, d, k; link variables e, f, g, r, s. Labels shown: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment, Drug list, Side effects, Use, Uses.]
[Figure: a joined web tuple. Instances x0, y1, z1, w1, p2, q1, b1, c1, d1, k1 with links e1, f1, g1, r1, s1. Labels shown: AIDS, Symptoms of AIDS, AIDS Evaluation, Elisa Test, AIDS treatment, Indavir, Side effects of Indavir, Uses of Indavir.]
Pi-Web Join
Example 1. Query: produce a list of diseases with their symptoms, evaluation procedures and treatment, starting from the web site at http://www.panacea.org/. The result is the web table Diseases.
[Figure: query graph (web schema) for Example 1, starting at http://www.panacea.org/. Node variables shown: z, p, q. Labels shown: Disease List, symptoms, evaluation, treatment.]
[Figure: a web tuple in "Diseases". Instances shown: x0, z1, p2, q1. Labels shown: List of Diseases, AIDS, Symptoms list, Evaluation, Elisa Test, Treatment list.]
Example 2. Query: produce a list of drugs with their uses and side effects, starting from the web site at http://www.panacea.org/. The result is the web table Drugs.
[Figure: query graph (web schema) of "Drugs", starting at http://www.panacea.org/. Node variables shown: a, b, d, k. Labels shown: List of Diseases, Drug list, Uses, Side effects.]
[Figure: a web tuple in "Drugs". Instances shown: a0, b1, d1, k1. Labels shown: List of Diseases, AIDS, Drug, Use, Uses of Indavir, Side effects of Indavir.]
Web Project: eliminates irrelevant nodes from web tuples, based on project conditions. A project condition can be a set of node variables, a start-node variable and an end-node variable, or a node variable and a depth of links. Web project is used to isolate the data of interest in a web table, allowing subsequent web queries to run over a smaller, more structured web table.
[Figure: a web project on "Diseases". Instances shown: x0, z1, p2. Labels shown: http://www.panacea.org/, List of Diseases, AIDS, Symptoms list, Evaluation.]
[Figure: joined schema. Node variables shown: x, z, p, q, b, d, k. Labels shown: http://www.panacea.org/, Disease List, symptoms, evaluation, treatment, Drug list, Side effects, Uses.]
[Figure: joined tuple. Instances shown: x0, z1, p2, q1, b1, d1, k1. Labels shown: http://www.panacea.org/, List of Diseases, AIDS, Symptoms, AIDS Evaluation, Elisa Test, Treatment list, Drug list, Indavir, Side effects of Indavir, Use, Uses of Indavir.]
Motivation of Pi-web Join: quite often a web join operation couples irrelevant nodes. In a complex web query with several web join operations, the resulting web table can become very large, with many "contaminated" nodes. Pi-web join addresses this limitation by eliminating the "contaminated" nodes, which reduces the size of the joined web table.
Pi-web Join: a web join followed by a web project. The projection conditions are specified by the user and are similar to those of web project. We do not eliminate the joinable nodes; by retaining them we preserve the correlation between the information captured from the two web tables. A pi-web join may result in a web bag.
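The following sketch shows how projecting a joined table while retaining the joinable nodes can leave identical web tuples behind, that is, a web bag. The helper names, the choice to identify a node by its URL, and all URLs other than http://www.panacea.org/ are assumptions.

```python
from collections import Counter


def web_project(table, keep):
    """Drop every node variable that is not in keep."""
    return [{v: n for v, n in t.items() if v in keep} for t in table]


def pi_web_join(joined, keep, joinable_vars):
    """Project an already joined table, always retaining the joinable nodes."""
    return web_project(joined, set(keep) | set(joinable_vars))


def multiplicities(table):
    """Count identical web tuples, comparing nodes by URL."""
    return Counter(
        tuple(sorted((v, n["url"]) for v, n in t.items())) for t in table
    )


root = "http://www.panacea.org/"
joined = [  # two joined tuples that differ only in node q (hypothetical URLs)
    {"x": {"url": root}, "z": {"url": root + "aids/symptoms.html"},
     "q": {"url": root + "aids/treatment1.html"},
     "d": {"url": root + "indavir/side-effects.html"}},
    {"x": {"url": root}, "z": {"url": root + "aids/symptoms.html"},
     "q": {"url": root + "aids/treatment2.html"},
     "d": {"url": root + "indavir/side-effects.html"}},
]
projected = pi_web_join(joined, keep={"z", "d"}, joinable_vars={"x"})
counts = multiplicities(projected)
```

After q is projected away the two tuples are identical, so the result is a bag with one tuple of multiplicity two rather than a set.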
Example 3. Query: produce a list of diseases with their symptoms and side effects, starting from the web site at http://www.panacea.org/.
Procedure: perform a web join on "Diseases" and "Drugs". Project node variables b, k, q, p, the node variables between a and q, the node variables between b and k, and the node variables between b and d.
[Figure: pi-joined schema. Node variables shown: x, z, d. Labels shown: http://www.panacea.org/, Disease List, symptoms, Side effects.]
[Figure: pi-joined tuple. Instances shown: x0, z1, d1. Labels shown: http://www.panacea.org/, List of Diseases, AIDS, Symptoms list, Side effects of Indavir.]
Benefits of Pi-web Join: minimizes the amount of data transmitted over the network in distributed web join processing; reduces the storage cost associated with a joined web table; reduces the cognitive overhead associated with locating relevant nodes; improves the completeness of the schema by removing unbound nodes and links.
Web Bags: arise from the existence of identical web tuples; created by the web project operation; support structure-based mining; used for discovering visible nodes, luminous nodes and luminous paths.
Definitions. Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D. Luminosity is the reverse of visibility: the number of other distinct documents that are linked from D. A luminous path is a set of inter-linked nodes which occurs a number of times in a web table.
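Read literally, the first two definitions are counts over the links of a web table. The sketch below computes them from a list of (source URL, target URL) pairs; that representation, the function names and the sample URLs are assumptions, and the normalized formula the deck refers to elsewhere (FODO-98) is not reproduced here.

```python
def visibility(links, d):
    """Number of distinct other documents that link to d."""
    return len({src for src, dst in links if dst == d and src != d})


def luminosity(links, d):
    """Number of distinct other documents that d links to."""
    return len({dst for src, dst in links if src == d and dst != d})


# Hypothetical links collected from the web tuples of one web table.
links = [
    ("http://a.example/1", "http://d.example/"),
    ("http://b.example/2", "http://d.example/"),
    ("http://a.example/1", "http://d.example/"),  # repeated link, counted once
    ("http://d.example/", "http://c.example/3"),
]
v = visibility(links, "http://d.example/")
l = luminosity(links, "http://d.example/")
```

Using sets means that the same linking document is counted once even if it appears in several web tuples.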
Inter-site Support: quantifies the inter-site connectivity of a node in a web table. Let x be a node and let hx denote the host name of x. Let H be the bag of host names of all nodes in W that have a direct link to or from x, and let Ch be the number of times hx appears in H. The inter-site support is then I = 1 - Ch / |H|.
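The formula I = 1 - Ch / |H| can be computed directly from URLs. In this sketch the host name is taken with the standard library's urlparse; the function name, the value returned for an empty H and the /index.html neighbour URL are assumptions.

```python
from urllib.parse import urlparse


def inter_site_support(x_url, neighbour_urls):
    """I = 1 - Ch / |H| for node x and the nodes directly linked to or from it."""
    hx = urlparse(x_url).hostname
    H = [urlparse(u).hostname for u in neighbour_urls]  # a bag: duplicates kept
    if not H:
        return 0.0  # assumption: no neighbours means no inter-site support
    Ch = H.count(hx)
    return 1 - Ch / len(H)


support = inter_site_support(
    "http://www.cancer.org/desc.html",
    [
        "http://www.panacea.org/",
        "http://www.cancer.org/index.html",  # hypothetical same-host neighbour
        "http://www.jhu.edu/medical/research/cancer.htm",
    ],
)
```

Here one of the three neighbours shares the host of x, so Ch = 1, |H| = 3 and I = 2/3. A value near 1 means most links of x cross site boundaries.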
Steps to find visible nodes
Input: web table W, node variable x, visibility threshold v
Output: set of visible nodes and the inter-site support for each node
1. Create a web table from W in which each web tuple contains a distinct instance of node x and the preceding node that links to x (use web project, and create distinct tuples if node x has more than one incoming edge).
2. Eliminate the nodes linked to x in each tuple of this web table using web project.
3. Check whether the resulting collection of web tuples of node x is a web bag by comparing their URLs.
4. Create a multiplet for each collection of identical nodes.
5. For each multiplet, calculate the node visibility (using the mathematical formula defined in FODO-98).
6. Determine the multiplets whose node visibility is greater than the threshold.
7. Create the visible node set and calculate the inter-site support.
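The grouping and thresholding steps can be sketched as follows, starting from the bag of node instances that remains after the projections. The node visibility formula from FODO-98 is not given in these slides, so the sketch uses each multiplet's share of the bag as a stand-in score; that score, the function names and the threshold value are assumptions. The URLs are those of the Cancer example in this deck.

```python
from collections import Counter


def multiplets(urls):
    """Group identical node instances (same URL) and count each group."""
    return Counter(urls)


def visible_nodes(urls, threshold):
    """Return URLs whose stand-in visibility score exceeds the threshold."""
    counts = multiplets(urls)
    total = len(urls)
    # Stand-in score: multiplet size divided by bag size (not the FODO-98 formula).
    return {url for url, n in counts.items() if n / total > threshold}


z1 = "http://www.cancer.org/desc.html"
z2 = "http://www.disease.com/cancer/skin.htm"
z4 = "http://www.jhu.edu/medical/research/cancer.htm"
bag = [z1, z1, z2, z1, z4]  # instances of node z after eliminating x and y

groups = multiplets(bag)
visible = visible_nodes(bag, threshold=0.5)
```

With this stand-in score, lowering the threshold below 0.2 would make z2 and z4 visible as well; the inter-site support of each visible node would then be computed separately.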
Steps to find luminous nodes. Input: web table W, node variable x, luminosity threshold l. Output: set of luminous nodes with inter-site support. The steps are similar to those of visible node discovery, except that we consider the nodes linked from x in place of the nodes linked to x.
Steps to find luminous paths
Input: web table W, nodes x and y, threshold value for luminous paths
Output: luminous paths
1. Project the nodes between x and y and check whether the result is a web bag.
2. If it is, create the collection of multiplets and compute the path luminosity for each multiplet using the formula. If the path luminosity of a multiplet is greater than or equal to the threshold, then a path in the multiplet is a luminous path.
3. Otherwise, create a collection of linear web tuples from the above collection of web tuples, to identify whether there exists a subset of inter-linked nodes between x and y that form luminous paths, and repeat the procedure to compute path luminosity for these sets of inter-linked nodes.
[Figure: web schema "Cancer", starting at http://www.panacea.org/. Node variables x, y, z; link variables e, f. Labels shown: Diseases, Cancer.]
[Figure: web table with five web tuples over (x0, y0, z) and links e0, f0; the z instances are z1, z1, z2, z1, z4, where z1 = http://www.cancer.org/desc.html, z2 = http://www.disease.com/cancer/skin.htm and z4 = http://www.jhu.edu/medical/research/cancer.htm.]
[Figure: projected schema containing only z (Cancer), and the web table after eliminating x and y, with instances z1, z2, z4.]
[Figure: projected schema with x, y, z and link e, and the resulting web bag of five tuples: (x0, y0, z1), (x0, y0, z1), (x0, y0, z2), (x0, y0, z1), (x0, y0, z4). After removal of identical tuples: (x0, y0, z1), (x0, y0, z2), (x0, y0, z4).]
[Figure: bag of z instances z1, z1, z2, z1, z4 and the distinct instances z1, z2, z4.]
[Figure: visible nodes. Instances shown: z1, z2, z1, z4.]
[Figure: luminous paths.]
Change Management: detect web deltas with respect to a user query; changes in inter-linked web documents (insert path, delete path, update path); represent changes using web algebraic operators (web join, web outer join); query the changes.
Mining in Web Warehouse. Web structure mining: mining the structures and links of web documents. Web content mining: the automatic search of information resources available on-line. Web usage mining: mining data from server access logs, user registrations or profiles, user sessions or transactions, etc.
Application - Visibility: from the results returned, find the most visible pages. Assume Z1 is the most visible page at the given threshold. This gives an estimate of the different restaurants selling pizza. A lower threshold gives the set (Z1, Z2) as visible pages, which sell both pizza and pasta. From this, generalize rules such as: of the 66% of restaurants that offer pizza to their customers, 33% also offer pasta.
Application - Luminosity: association rules such as: of the X% of all companies that make a product "A", Y% also make the set of products "B and C". Example: certain companies (33%) that make product A also make products B and C, while company C makes only product A. That is, of the 66% of companies that make product "A", 33% also make products B and C.
More Operators. Web schema operators: schema tightness operator, schema match operator, schema search operator. Data visualization operators: ranking operators (global and local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort.
Partitioning of web tables: partitioned web tables can be restructured, indexed, monitored and reorganized easily. Partition by: time; schema; tree structure; keywords.
Warehouse Concept Mart (WCMart): subject-oriented concept generation, moving from manual to autonomous. Used for: ranking tuples; global web coupling; content-based mining.
Web Data Refinement: improve the web schema with the schema tightness operator; partition web tables based on content and structure.
[Figure: global web coupling over time. The WWW and the Warehouse Concept Mart feed Global Web Coupling, which produces Webtable (Jan), Webtable (Feb), Webtable (Mar) and Webtable (Apr).]
[Figure: granularity. Web information at lower-level granularity, Webtable (Jan) to Webtable (Apr), is turned by web information manipulation operators into summarized data at higher-level granularity.]
What type of information can be summarized? Structural and content-based. Analyses: time-variant analysis; snapshot analysis; comparing one period with another; trend analysis.
Structural Summarization: most volatile documents; sites which change frequently; rate of change over time; a pointer to directly access documents which change rapidly; most visible nodes, luminous nodes and luminous paths, and how they change with time (decrease or increase, and the reason why).
Content Summarization: what can be aggregated in a web page? The number of links with identical labels; the number of keywords; changes in content over time, and comparison of those changes. This is an open question. XML will improve the ability to analyze web data.
Summary. Current status: a mechanism for accessing and manipulating web information in WHOWEDA; implementation of various web operators and the query language. Future research: what types of information can be summarized? What types of knowledge can be mined? Refinement of the web warehouse architecture. www.cais.ntu.edu.sg:8000/~whoweda