1 WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907

Presentation transcript:

1 WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN

3 WWW A collection of multimedia documents in the form of web pages connected via hyperlinks.

4 Characteristics of WWW WWW is a set of directed graphs data in the WWW has a heterogeneous nature unstructured versus structured information no central authority to manage information Dynamic versus static information Web information discoveries - search engines

5 As the WWW grows, the more chaotic it becomes Web is a fast growing, distributed, non-administered global information resource WWW allows access to text, image, video, sound and graphic data more business organizations creating web servers more chaotic environment to locate information of interest lost in hyperspace syndrome

6 Does it affect the corporate world? Lack of credibility of data –Different sites with different data –Same site different data Historical information is not available –Previous versions of web data –How does web data change with time –Summarization over time Data to information Reduction in productivity –Analysis is manual

7 How users find web sites: Indexes and search engines 75; UseNet newsgroups 44; Cool lists 27; New lists 24; Listservers 23; Print ads 21; Word-of-mouth and 17; Linked web advertisement 4

8 Limitations of Search Engines Do not exploit hyperlinks search is limited to string matching Queries are evaluated on archived data rather than up-to-date data; no indexing on current data low accuracy replicated results no further manipulation possible

9 Limitations of Search Engines ERROR 404! No efficient document management Query results cannot be further manipulated No efficient means for knowledge discovery

10 Current Research Projects Web Query System –W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog Semistructured Data –LOREL, UnQL, WebOQL Website Management System –STRUDEL Web Warehouse - WHOWEDA

11 WHOWEDA -Key Objectives Design a suitable data model to represent web information development of web algebra and query language Maintenance of Web data Development of knowledge discovery and web mining tools Web warehouse

12 WHOWEDA - What? WareHouse Of Web Data –Subject - oriented –Integrated –Temporal –Granularity - Lower, higher –Some summary –Not updatable –Alternative information sources

13 What is a Web Warehouse? Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses

14 WHOWEDA! A WareHouse Of WEb DAta Web Information Coupling Model (WICM) –Web Objects –Web Schema Web Information Coupling Algebra Web Information Maintenance Web Mining and Knowledge discovery

15 [Figure: architecture diagram. Labels: WWW, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Warehouse Concept Mart, Web Warehouse, Web Mart (several), Web Querying & Analysis Component, User]

16 [Figure: architecture diagram. Labels: WWW, Warehouse Concept Mart, Web Warehouse, Web Query & Display, User, Pre-processing, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Web Select, Web Project, Local Web Coupling, Local Ranking, Web Join, Web Union, Web Intersection, Schema Tightness, Schema Search, Schema Match, Data Visualization]

17 Web Objects Node - url, title, format, size, date, text Link - source-url, target-url, label, link-type Web tuple Web table Web schema Web database
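The web objects listed on this slide can be sketched as plain data classes. This is only an illustrative model, not the WHOWEDA implementation: the attribute names follow the slide, while the Python types, the default values, the example URLs and the choice to model a web table as a list of web tuples are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    # Attributes named on the slide: url, title, format, size, date, text.
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    # Attributes named on the slide: source-url, target-url, label, link-type.
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # e.g. "local"; the set of allowed values is an assumption


@dataclass
class WebTuple:
    # A web tuple groups the nodes and links of one matched subgraph.
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)


# Assumption: a web table is modelled simply as a list of web tuples.
WebTable = List[WebTuple]

# Hypothetical example data (example.org is a placeholder site).
home = Node(url="http://example.org/", title="Home")
news = Node(url="http://example.org/news", title="News")
table: WebTable = [
    WebTuple(
        nodes=[home, news],
        links=[Link(home.url, news.url, label="News", link_type="local")],
    )
]
```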

18 Web Schema Metadata in the warehouse Structural ‘summary’ of web table Information Coupling using a Query graph Query graph -> Web schema Directed graph represented by an ordered 4-tuple: –Set of node variables –Set of link variables –Connectivities –Predicates

20 [Figure: site structure. Labels: Information Square's homepage; Headline article 1 to Headline article n; TCS; News specials; Airport info (List of video files); List of links to local news; List of links to world news; Local news 1 to Local news k; World news 1 to World news t]

21 [Figure: query graph over node variables x, y, z, w and link variables e, f, g, h. Annotations: label CONTAINS "Local News"; target_url CONTAINS "newshub/specials"; url CONTAINS "local"; label CONTAINS "World News"; url CONTAINS "world"; target_url CONTAINS "article"; url CONTAINS "headlines"]

22 [Figure: site structure. Labels: Information Square's homepage; Headline article 1; News specials; List of links to local news; List of links to world news; Local news 1; World news 1]

23 Schema - example Node variables: Xn = { x, y, z, w } Link variables: Xl = { e, f, g } Connectivities: C = { x y and x z and x w } – The symbol represents an anonymous node variable, a node variable not restricted by any predicate.

24 Predicates P = { x.url = " square", y.url CONTAINS "headlines", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world" }
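As a rough illustration of how a predicate set like this could be checked, the sketch below evaluates a few of the CONTAINS predicates against one candidate binding of variables to objects. The helper names `contains` and `satisfies`, the dictionary representation and the example URLs are all assumptions made for illustration; only the predicate keywords come from the slide.

```python
def contains(var, attr, needle):
    """Build a predicate meaning: binding[var][attr] CONTAINS needle."""
    def check(binding):
        return needle in binding.get(var, {}).get(attr, "")
    return check


# A subset of the predicates on the slide (hypothetical encoding).
predicates = [
    contains("y", "url", "headlines"),
    contains("e", "target_url", "article"),
    contains("z", "url", "local"),
    contains("w", "url", "world"),
]


def satisfies(binding, predicates):
    """A binding satisfies the schema predicates if every predicate holds."""
    return all(p(binding) for p in predicates)


# Hypothetical binding of node and link variables to concrete objects.
binding = {
    "y": {"url": "http://example.org/headlines"},
    "e": {"target_url": "http://example.org/article/1"},
    "z": {"url": "http://example.org/local/index.html"},
    "w": {"url": "http://example.org/world/index.html"},
}
ok = satisfies(binding, predicates)
```

Here `ok` is true because every bound object contains the required keyword; changing any one URL so that the keyword is missing makes the whole binding fail.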

25 Query Graph - Example 1 Query graph - same as schema except that it has one more parameter to control the results returned. Informally, it is a directed connected graph consisting of nodes, links and keywords imposed on them. Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at Web table Diseases

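26 [Figure: query graph for web table Diseases. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation. Variables: x, y, z, q, w, e, f, g, p]

27 [Figure: a web tuple of web table Diseases. Labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, AIDS, Elisa Test. Instances: x0, y1, z1, q1, w1, e1, f1, g1, p2]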

28 Example 2 Produce a list of drugs, and their uses and side effects starting from the web site at Web table Drugs

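29 [Figure: query graph for web table Drugs. Labels: List of Diseases, Issues, Drug list, Uses, Use, Side effects. Variables: a, b, c, d, r, s, k]

30 [Figure: a web tuple of web table Drugs. Labels: List of Diseases, Issues, Drug list, Use, Side effects, AIDS, Indavir, Uses of Indavir, Side effects of Indavir. Instances: a0, b1, c1, d1, r1, s1, k1]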

31 Query Language Starting from the CS department home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and have in their text “database”.

32 COUPLE WEBTABLE W FROM WWW SUCH THAT NODE I, j IN WWW and LINK e,f,g IN WWW AND I j WHERE I.url EQUALS “ AND j.text CONTAINS “database” AND f.link-type EQUALS local AND g.link-type EQUALS local;
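The intent of this query (follow only local links from a start page, up to a bounded path length, and keep documents whose text contains a keyword) can be sketched as a bounded breadth-first search. This is not the WHOWEDA query processor: the in-memory `graph` and `texts` dictionaries stand in for fetching pages, and the function name, page names and `max_len` parameter are hypothetical.

```python
from collections import deque


def reachable_docs(graph, texts, start, max_len, keyword):
    """Return documents within max_len local links of start whose text has keyword.

    graph: url -> list of (target_url, link_type) pairs.
    texts: url -> document text.
    """
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        url, depth = queue.popleft()
        # The start page itself is not reported, only linked documents.
        if depth > 0 and keyword in texts.get(url, ""):
            hits.append(url)
        if depth == max_len:
            continue  # do not follow links past the path-length bound
        for target, link_type in graph.get(url, []):
            if link_type == "local" and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return hits


# Hypothetical site: "b" is reached through a non-local link and is skipped.
graph = {
    "home": [("a", "local"), ("b", "global")],
    "a": [("c", "local")],
    "c": [("d", "local")],
}
texts = {"a": "database group", "b": "database", "c": "database systems", "d": "database"}
within_one = reachable_docs(graph, texts, "home", 1, "database")
within_two = reachable_docs(graph, texts, "home", 2, "database")
```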

33 Web Algebra Formal foundation of data representation and manipulation in a web warehouse Web operators: –Information access operator –Information manipulation operators –Web schema operators –Data visualization operators

34 Information access operator Global Web Coupling

35 Information Manipulation –Web select –Web project –Local web coupling –Web join –Web cartesian product –Web union –Web intersect

36 Web Select Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities Input is select Schema Output is a web table satisfying the select schema

37 Select the W1 tuples that contain world news about Indonesia since 1 May 1998: apply web select to W1 with select schema Ms, where Xsn = { x, w }, Xsl = { }, Cs = { }, Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }

38 Xn’ = { x, y, z, w }, Xl’ = { e, f, g } C’ = { x y and x z and x w } P’ = { x.url = " square", x.date > "1May1998", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", w.text CONTAINS "Indonesia" }
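The web select example can be approximated as a filter over web tuples. In this sketch a web tuple is a dictionary from node variable to attribute dictionary, dates are `datetime.date` values rather than strings such as "1May1998", and the sample tuples are invented; all of these are assumptions for illustration only.

```python
from datetime import date


def web_select(table, conditions):
    """Keep the web tuples for which every selection condition holds."""
    return [t for t in table if all(cond(t) for cond in conditions)]


# Hypothetical contents of web table W1 (only the attributes used below).
w1 = [
    {"x": {"date": date(1998, 5, 20)}, "w": {"text": "Indonesia unrest continues"}},
    {"x": {"date": date(1998, 4, 2)}, "w": {"text": "Indonesia talks"}},
    {"x": {"date": date(1998, 6, 1)}, "w": {"text": "Weather report"}},
]

# Selection predicates Ps from the slide, encoded as functions.
conditions = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),   # x.date > "1May1998"
    lambda t: "Indonesia" in t["w"]["text"],       # w.text CONTAINS "Indonesia"
]

selected = web_select(w1, conditions)
```

Only the first sample tuple survives: the second is too old and the third does not mention Indonesia.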

39 Web Information Coupling System A database system to couple related web information Global web Coupling and Local Web Coupling

40 Global Coupling - Information Access To integrate data from the Web To create historical data To couple related information from the WWW satisfying a query graph Operator to create web tables From web with no schema to web table with web schema

41 Why local web coupling? Directly querying the WWW to gather this information is an expensive and repetitive affair Web documents containing similar information can reside in different web tables in a web warehouse A mechanism to gather this similar information by additional manipulation of the materialized web tables

42 Local Web Couple operator Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, that contain similar information.

43 Local Web Couple operator The web couple operator is basically a web cartesian product followed by web select: We denote web couple by the symbol:

44 Web Coupling

45 M2 = for W2 Xn” ={ s, t, u}, Xl” = { k, l, m, n }, C” ={ s t and s u }, P”{s.url= “ k.label = “REGION”, l.target_url= “ s/sea*.html”, m.label = “WORLD”, n.target_url=“ itstimes/pages/wrld*.html”}

46 W1  q W2 where q = (x.date=s.date) & (w.text CONTAINS “Indonesia”) & (t.text CONTAINS “Indonesia”)

47 Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, k, l, m, n }, C* = { x y and x z and x w and s t and s u } P* = { x.url = " square", e.target_url CONTAINS "article", f.target_url CONTAINS "newshub/specials", g.label CONTAINS "Local News", z.url CONTAINS "local", h.label CONTAINS "World News", w.url CONTAINS "world", s.url = "

48 k.label = "REGION", l.target_url = " s/sea*.html", m.label = "WORLD", n.target_url = " s/wrld*.html", x.date = s.date, w.text CONTAINS "Indonesia", t.text CONTAINS "Indonesia" }

49 Local Web Coupling Initiated explicitly by the user User provides the pair of node variables and the keyword set based on which coupling is to be performed Coupling nodes in each pair of web tuples in the input web tables must satisfy one of the coupling conditions

50 Construction of coupled table First perform a web cartesian product on the two web tables For each web tuple in the resultant web table – the specified instances of node variables are inspected to determine whether the web tuple satisfies the coupling compatibility condition(s)

51 Construction of coupled table –If a pair of nodes satisfies none of the conditions, the corresponding web tuple is rejected –Otherwise, the web tuple is stored in a separate web table
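The construction described on these two slides (a web cartesian product followed by a check of the coupling nodes) might be sketched as follows. The coupling condition used here, that both coupling nodes contain at least one shared keyword, is an assumed stand-in for the actual coupling compatibility conditions, and the function name and data are hypothetical. The two tables are assumed to use disjoint variable names so that tuples can be merged.

```python
from itertools import product


def web_couple(table1, table2, node_pair, keywords):
    """Cartesian product of two web tables, keeping tuples whose coupling nodes match."""
    a, b = node_pair
    coupled = []
    for t1, t2 in product(table1, table2):
        text1, text2 = t1[a]["text"], t2[b]["text"]
        # Assumed coupling condition: some keyword occurs in both coupling nodes.
        if any(k in text1 and k in text2 for k in keywords):
            coupled.append({**t1, **t2})  # concatenate the two web tuples
        # Otherwise the combined tuple is rejected.
    return coupled


# Hypothetical web tables W1 (coupling node w) and W2 (coupling node t).
w1 = [{"w": {"text": "Indonesia economy"}}, {"w": {"text": "Sports"}}]
w2 = [{"t": {"text": "News from Indonesia"}}, {"t": {"text": "Regional weather"}}]

coupled = web_couple(w1, w2, ("w", "t"), ["Indonesia"])
```

Of the four tuples in the cartesian product, only the one whose `w` and `t` nodes both mention the keyword is kept, so the result never exceeds n*m tuples.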

52 Types of web coupling System driven web coupling: In this case the system decides which node variables are to be coupled (coupling nodes). If at least one pair of coupling nodes cannot be identified, then the web tables cannot be coupled.

53 Types of web coupling User driven web coupling: In this case the user decides which are the node variables to be coupled (coupling nodes). Coupling is performed only on those user specified node variable(s).

54 Types of web coupling Attribute driven web coupling: In this case the user specifies the coupling attributes. Coupling is performed only on those user specified coupling attribute(s).

55 Attribute driven web coupling COUPLE TABLE3 FROM TABLE1 AND TABLE2 ON ATTRIBUTE “TEXT” AT SCHEMA/TUPLE (optional)

56 Types of web coupling Value driven web coupling: In this case the user specifies the values of the attributes of the nodes on which coupling should be performed. Coupling is performed only on those user specified attribute values.

57 Value driven web coupling COUPLE TABLE3 FROM TABLE1 AND TABLE2 ON VALUE “Software Agents” AT SCHEMA/TUPLE (optional)

58 Schema level web coupling We inspect the schemas to decide whether the two web tables can be coupled. If coupling conditions cannot be identified then the two web tables cannot be coupled. We do not inspect the web tuples in the web table. Number of web tuples coupled will be n*m.

59 Tuple level web coupling We inspect the web tuples of the two input web tables to identify nodes with similar information. The number of web tuples in the coupled web table <=n*m

60 Why two levels? A schema does not capture all the information of the web documents in a web table; it is not always possible to identify a coupling condition by inspecting the schemas. Coupling nodes that are not defined in the schemas may still exist.

61 Why two levels? Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (that cannot be identified from their schemas) at the expense of additional processing.

62 Join Processing in Web Databases

63 Web Join Concatenate tuples based on identical nodes or documents Inputs are two web tables and their schemas Output is a joined table Types –Pi-web join, theta-web join, outer joins, web composition, semi web join

64 Web Join Used for combining related data from various web tables Mechanism to detect changes Mechanism to find alternative web document in case of “Document Not Found” error

65 Web Join Operator Information manipulation operator Manipulate information residing in a web database to derive additional information Harness useful, composite information from two web tables Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries

66 Joinable Nodes Node variables participating in the web join process Expressed as a pair Each node in the pair should have identical URLs

67 Web Join Combine two web tables by concatenating a web tuple of one web table with a web tuple of other web table whenever there exist joinable nodes Joinable nodes are identified from the schemas of the two web tables URLs of the joinable nodes are identical
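A minimal sketch of this join, assuming web tuples are dictionaries from node variable to attribute dictionary and that the pair of joinable node variables has already been identified from the schemas. The function name, the variable names `y` and `b`, and the URLs are hypothetical; the only rule taken from the slides is that joinable nodes must have identical URLs.

```python
def web_join(table1, table2, joinable):
    """Concatenate tuples of two web tables whose joinable nodes have identical URLs."""
    a, b = joinable
    # Index the second table by the URL of its joinable node.
    by_url = {}
    for t2 in table2:
        by_url.setdefault(t2[b]["url"], []).append(t2)
    joined = []
    for t1 in table1:
        for t2 in by_url.get(t1[a]["url"], []):
            joined.append({**t1, **t2})  # variable names assumed disjoint
    return joined


# Hypothetical tuples: y (in Diseases) and b (in Drugs) are the joinable nodes.
diseases = [
    {"x": {"url": "http://example.org/diseases"},
     "y": {"url": "http://example.org/aids/issues"}},
]
drugs = [
    {"b": {"url": "http://example.org/aids/issues"},
     "c": {"url": "http://example.org/aids/drugs"}},
    {"b": {"url": "http://example.org/cancer/issues"},
     "c": {"url": "http://example.org/cancer/drugs"}},
]

joined = web_join(diseases, drugs, ("y", "b"))
```

Indexing one table by URL avoids comparing every pair of tuples, which is one way to reuse already retrieved data cheaply; the slides do not say how the join is actually evaluated.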

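68 [Figure: schemas of the two input web tables. First schema labels: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation; variables x, y, z, q, w, e, f, g, p. Second schema labels: Issues, Drug list, Uses, Use, Side effects; variables b, c, d, r, s, k]

69 [Figure: web tuple built from both tables. Instances: x0, y1, z1, q1, w1, e1, f1, g1, p2, b1, c1, d1, r1, s1, k1. Content labels: AIDS, AIDS treatment, Symptoms of AIDS, Evaluation, Elisa Test, Indavir, Uses of Indavir, Side effects of Indavir]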

70 Join Existence Given two web tables, we determine if these two web tables are joinable Inspect the schemas of the web tables Satisfy joinability conditions based on: –node predicates –link predicates –node and link predicates –locus of a node relative to a joinable node

71 Join Construction To construct a joined schema, we construct: –node set –link set –connectivity set –predicate set Construction of joined table –Concatenating the web tuples of the two input tables over the joinable nodes

72 Web Bags Existence of identical web tuples. Created due to web project operation. Structure based mining Used for discovering –Visible nodes –Luminous nodes –Luminous paths

73 Definitions Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D Luminosity - Reverse of visibility, the number of other distinct documents that are linked from D Luminous paths - a set of inter-linked nodes which occurs a number of times in a web table

74 Steps to find visible nodes Input: Web table W, node variable x, visibility threshold v Output: Set of visible nodes Create a web table from W where each web tuple contains distinct instances of node x and the preceding node which is linked to x Eliminate the nodes linked to x in each tuple of the web table using web project

76 Steps to find visible nodes Check if the collection of web tuples of node x thus created is a web bag by comparing their URLs Create multiplets for each collection of identical nodes For each multiplet calculate the node visibility Determine the multiplets with node visibility greater than the threshold Create the visible node set
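The visible-node steps can be sketched with a counter standing in for the web bag. This is a simplified reading of the slides, not the authors' algorithm: `pred` names the node variable that links to `x` (an assumed parameter), web tuples are dictionaries, and the URLs are placeholders.

```python
from collections import Counter


def visible_nodes(table, x, pred, threshold):
    """Return URLs of instances of x linked from more than `threshold` distinct documents."""
    # Keep distinct (preceding node, x) pairs, dropping identical web tuples.
    distinct_pairs = {(t[pred]["url"], t[x]["url"]) for t in table}
    # Web project: eliminate the preceding node, leaving a bag of x instances.
    bag = [x_url for _, x_url in distinct_pairs]
    # Multiplets: identical nodes grouped by URL; the multiplicity is the visibility.
    multiplets = Counter(bag)
    return {url for url, visibility in multiplets.items() if visibility > threshold}


# Hypothetical web table: z1 is linked from three distinct documents, z2 from one.
table = [
    {"y": {"url": "http://example.org/a"}, "z": {"url": "http://example.org/z1"}},
    {"y": {"url": "http://example.org/b"}, "z": {"url": "http://example.org/z1"}},
    {"y": {"url": "http://example.org/a"}, "z": {"url": "http://example.org/z1"}},
    {"y": {"url": "http://example.org/a"}, "z": {"url": "http://example.org/z2"}},
    {"y": {"url": "http://example.org/c"}, "z": {"url": "http://example.org/z1"}},
]

visible = visible_nodes(table, "z", "y", 2)
```

Swapping the roles of the two variables, so that the nodes linked from `x` are counted instead of the nodes linked to it, gives the corresponding luminosity computation.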

77 Steps to find luminous nodes Input: Web table W, node variable x, luminosity threshold l Output: Set of luminous nodes Steps are similar to that of visible node discovery We consider the nodes linked from x in place of nodes linked to x

79 Steps to find luminous paths Create the collection of multiplets Compute path luminosity for each multiplet If the path luminosity value of a multiplet is greater than or equal to threshold then a path in the multiplet is a luminous path Otherwise, we create a collection of linear web tuples from the above collection of web tuples

80 Steps to find luminous paths This is to identify whether there exists a subset of inter-linked nodes between x and y that are luminous paths We repeat the procedure to compute path luminosity for this set of inter-linked nodes

81 [Figure: web schema with node variables x, y, z and link variables e, f; labels: Cancer, Diseases]

82 [Figure: web table of five web tuples, each with x0, y0, e0, f0; z instances: z1, z1, z2, z4, z1]

83 [Figure: projected schema with node variable z; label: Cancer]

84 [Figure: web table after eliminating x and y; z instances: z1, z1, z2, z4, z1]

85 [Figure: projected schema with node variables x, y, z and link variable e; labels: Cancer, Diseases]

86 [Figure: web bag of five web tuples over x0, y0 and z; z instances: z1, z1, z1, z2, z4]

87 [Figure: after removal of identical tuples; remaining tuples have z instances z1, z2, z4]

88 [Figure: Cancer nodes z1, z, z4]

89 [Figure: Cancer nodes z2, z1, z4]

90 [Figure: visible nodes; Cancer nodes z1, z2, z1, z4]

91 Luminous Paths

92 More Operators... Web schema operators: –Schema tightness operator, Schema match operator, Schema search operator Data visualization operators: –Ranking operators (Global & Local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort

93 Partitioning of web tables Partitioning web tables –restructured easily –indexed easily –monitored easily –reorganized easily By –time schema tree structure keywords

94 Warehouse Concept Mart (WCMart) Subject oriented Concept generation. Manually -> Autonomous. Used for: –Ranking tuples –Global web coupling –Content based mining

95 Mining in Web Warehouse Web Structure Mining Web Content Mining Web usage Mining

96 Web Data Refinement Improve web schema - schema tightness operator Partition web tables based on content and structure

98 [Figure: diagram. Labels: Warehouse Concept Mart, WWW]

99 [Figure: diagram. Labels: Web Information Manipulation Operators, Lower-level Granularity, Higher-level Granularity]

100 [Figure: architecture diagram. Labels: WWW, Web Information Coupling System, Web Information Mining System, Warehouse Concept Mart, Web Warehouse, Web Querying & Analysis Component, User]

101 What type of information can be summarized? Structural Content-based –time-variant analysis –snapshot analysis –compare one period with another –trend analysis

102 Structural Summarization Most volatile documents –Sites which change frequently –Rate of change over time –a pointer to directly access documents which change rapidly Most visible nodes, luminous nodes, luminous paths –Change with time –Decrease or increase - Analyze the reason

103 Content Summarization What can be aggregated in a web page? –Number of links with identical labels –Number of keywords Changes in content with time –Comparing the changes Open question XML will improve the ability to analyze web data

104 Summary Current status: –Mechanism for accessing and manipulating web information in WHOWEDA –Implementing various web operators and query language Future research –What types of information can be summarized? –What types of knowledge can be mined? –Refine web warehouse architecture