1
WEDAGEN: A Synthetic Web Database Generator
2
Presentation Outline
• Existing WWW search mechanisms
• WHOWEDA: A Warehouse of Web Data
• Modular structure of WEDAGEN
• Configuration parameters
• Performance evaluation
• Summary and future work
5
Existing W3 Search Mechanisms
• Time delay in manual navigation of the web
• Overwhelming results and unwanted information
• No tool for organizing and storing harnessed information for further manipulation
• Search engines and browsers are not always the best ways to systematically harness information from the web
• The WHOWEDA approach @ NTU
6
Overview of WHOWEDA
• A web warehousing system for storing and manipulating web information
• Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate them
• To extract information from W3, the user defines a ‘query graph’
• The result of extraction is a set of web tuples; each tuple instantiates the query graph
• More information:
  ◆ http://www.cais.ntu.edu.sg:8000/~whoweda
7
Example: Query graph (web schema)
Nodes: N1, N2, N3, N4, N5
Predicates:
• N1.URL EQUALS “http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html”
• L2.LABEL EQUALS “faculty”
• L3.LABEL EQUALS “research projects”
• L4.LABEL CONTAINS “publications”
• L5.LABEL CONTAINS “publications”
• N5.TEXT CONTAINS “Internet computing”
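The predicates of the query graph above can be written down as plain data. The Python sketch below is only an illustration: the names `Predicate`, `holds` and `QUERY_GRAPH` are hypothetical and are not part of WHOWEDA; only the predicate contents are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One condition on a node (N*) or link (L*) variable of a query graph."""
    variable: str   # e.g. "N1" or "L2"
    attribute: str  # e.g. "URL", "LABEL", "TEXT"
    operator: str   # "EQUALS" or "CONTAINS"
    value: str


def holds(pred: Predicate, actual: str) -> bool:
    """Return True if the actual attribute value satisfies the predicate."""
    if pred.operator == "EQUALS":
        return actual == pred.value
    if pred.operator == "CONTAINS":
        return pred.value in actual
    raise ValueError(f"unknown operator: {pred.operator}")


# Predicates transcribed from the slide (URL joined from its wrapped form).
QUERY_GRAPH = [
    Predicate("N1", "URL", "EQUALS",
              "http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html"),
    Predicate("L2", "LABEL", "EQUALS", "faculty"),
    Predicate("L3", "LABEL", "EQUALS", "research projects"),
    Predicate("L4", "LABEL", "CONTAINS", "publications"),
    Predicate("L5", "LABEL", "CONTAINS", "publications"),
    Predicate("N5", "TEXT", "CONTAINS", "Internet computing"),
]
```

A link whose label is "Recent publications" would satisfy the L4 predicate under CONTAINS but would not satisfy an EQUALS predicate with the same value.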
8
Example: Query results
Id   Name   Age
A1   John   23
C2   Wendy  35
B4   Jane   25
A2   Wendy  35
C9   Pete   42
B3   Kim    38
F8   Tom    22
G7   Cindy  47
10
Objectives
• Need to perform systematic evaluation of web operators during WHOWEDA development
• Limitations of testing using real web data
• To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
• To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
• WEDAGEN: A Web Database Generator
11
System Architecture of WEDAGEN
12
Configuration Input Parameters
[Diagram: WEDAGEN parameters]
Categories shown: Default, Specific, Selectivity, Instance Related, Control
Parameter names shown:
• NumTuples
• NumSourceNodeInstances
• FanOut
• NumKeyWordsPerNodeInstance
• NumWordsPerNodeInstance
• NumWordsPerLinkLabel
• NumWordsPerHostName
• NumWordsPerTitle
• LocalGlobalLink
• NodeSelectivity
• TableSelectivity
Other labels: Web Schema, Fan-In
(The names from NumSourceNodeInstances to LocalGlobalLink appear twice in the diagram.)
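One way to hold such parameters in code is a small record type. The sketch below is an assumption-laden illustration: the field names mirror the slide, but the class name `GeneratorConfig`, the `validate` helper and every numeric value are hypothetical placeholders, not WEDAGEN defaults. LocalGlobalLink and the selectivity parameters are left out because the slide does not show their types.

```python
from dataclasses import dataclass, asdict


@dataclass
class GeneratorConfig:
    # Field names follow the slide; all values are made-up placeholders.
    NumTuples: int = 100
    NumSourceNodeInstances: int = 5
    FanOut: int = 3
    NumKeyWordsPerNodeInstance: int = 4
    NumWordsPerNodeInstance: int = 50
    NumWordsPerLinkLabel: int = 2
    NumWordsPerHostName: int = 2
    NumWordsPerTitle: int = 5


def validate(config: GeneratorConfig) -> list[str]:
    """Return the names of parameters that are not positive."""
    return [name for name, value in asdict(config).items() if value <= 0]
```

Calling `validate(GeneratorConfig(FanOut=0))` flags only the `FanOut` field, which is the kind of check a parameter-suggestion step might run before generation starts.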
13
Parameter Values Suggestion
[Flowchart] Boxes shown:
• Start
• Generate specific parameter values
• User changes specific parameters
• Calculate max. no. of tuples to be generated
• Decision: is the calculated value > NumTuples?
• Calculate NumSourceNodeInstances to generate the specified number of tuples
• Store suggested values in file
• User changes specific parameters
• End
• Invoke instance generation module
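The suggestion step can be illustrated with a small calculation. The sketch below rests on assumptions that the slide does not state: a linear schema with `num_links` links, every node instance having exactly `fan_out` outgoing links, and therefore `num_source * fan_out ** num_links` tuples at most. The function names are hypothetical, and because the flowchart's branch directions are not recoverable, the sketch simply recomputes NumSourceNodeInstances whenever the calculated maximum differs from NumTuples.

```python
def max_tuples(num_source: int, fan_out: int, num_links: int) -> int:
    """Maximum tuple count under the assumed uniform fan-out, linear schema."""
    return num_source * fan_out ** num_links


def suggest(num_tuples: int, num_source: int, fan_out: int, num_links: int) -> dict:
    """Suggest NumSourceNodeInstances so that at least num_tuples can be generated."""
    per_source = fan_out ** num_links
    if max_tuples(num_source, fan_out, num_links) != num_tuples:
        # Ceiling division in integer arithmetic: smallest count that suffices.
        num_source = -(-num_tuples // per_source)
    return {
        "NumSourceNodeInstances": num_source,
        "MaxTuples": max_tuples(num_source, fan_out, num_links),
    }
```

With 100 requested tuples, a fan-out of 3 and two links, each source node yields 9 tuples, so the sketch suggests 12 source node instances (108 tuples at most).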
14
Instance Generation Module (IGM)
Components:
1. No. of node instance generator
2. URL generator
3. Node instance attribute generator
4. Link set generator
5. Web page generator
Inputs and outputs shown in the diagram: Num Source Node Instances; Fan out; No. of Node Instances per node; Num words per URL; URLs of all node instances; Link set of each instance; Node attributes (e.g. title, text, date); Num words per node instance; Num words per title; Images; Web page; Node Pool; Web pages; Web tables
Downstream module: Tuple Extraction Module
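To make the role of NumSourceNodeInstances and FanOut concrete, here is a minimal Python sketch of level-by-level instance generation. It is not the IGM implementation: the function name, the URL pattern, the placeholder vocabulary and the assumption of a linear schema with a uniform fan-out are all hypothetical.

```python
import random


def generate_instances(num_source: int, fan_out: int, depth: int, seed: int = 0):
    """Return (urls, links): a URL per node id and a list of (src_id, dst_id) links."""
    rng = random.Random(seed)
    words = ["alpha", "beta", "gamma", "delta"]  # placeholder vocabulary
    urls: dict[int, str] = {}
    links: list[tuple[int, int]] = []

    def new_node() -> int:
        node_id = len(urls)
        host = rng.choice(words)
        urls[node_id] = f"http://www.{host}{node_id}.example/page{node_id}.html"
        return node_id

    # Level 0: the source node instances.
    frontier = [new_node() for _ in range(num_source)]
    # Each later level: fan_out children per node of the previous level.
    for _ in range(depth):
        next_frontier = []
        for src in frontier:
            for _ in range(fan_out):
                dst = new_node()
                links.append((src, dst))
                next_frontier.append(dst)
        frontier = next_frontier
    return urls, links
```

With 2 source nodes, a fan-out of 3 and a depth of 2, this produces 2 + 6 + 18 = 26 node instances and 24 links, which shows how quickly the instance count grows with the fan-out.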
15
Directed Graph Output from IGM
16
Tuple Extraction Module (TEM)
• IGM generates all node and link instances, interconnected as directed graph(s)
• TEM extracts and constructs individual web tuples from the directed graph(s)
• Node and link instances are assigned IDs
• Web tuples are stored in a web table file
• The resulting web table is complete with node, link and tuple information
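Extracting tuples from a directed graph can be pictured as path enumeration. The sketch below assumes an acyclic graph and treats every path from a source node to a node without outgoing links as one tuple; that reading, and the function name `extract_tuples`, are assumptions made for illustration rather than a description of how TEM works.

```python
def extract_tuples(links, sources):
    """Enumerate source-to-leaf paths of an acyclic directed graph as tuples of node ids."""
    children = {}
    for src, dst in links:
        children.setdefault(src, []).append(dst)

    tuples = []

    def walk(path):
        node = path[-1]
        if node not in children:      # no outgoing links: the path is complete
            tuples.append(tuple(path))
            return
        for child in children[node]:
            walk(path + [child])

    for source in sources:
        walk([source])
    return tuples


# Example graph: node 0 links to 1 and 2; 1 links to 3; 2 links to 4 and 5.
example_links = [(0, 1), (0, 2), (1, 3), (2, 4), (2, 5)]
example_tuples = extract_tuples(example_links, [0])
```

For the example graph, the three paths 0-1-3, 0-2-4 and 0-2-5 are returned, each standing in for one web tuple.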
17
Extracted Web Tuples
18
Preliminary Evaluation
• Elapsed time is used to measure the overhead of web table generation
• A set of sample test configurations was identified, consisting of typical combinations of 4 web schemas and input parameters
• Performance is measured with respect to:
  ◆ Complexity of schema
  ◆ Total number of node instances and total number of tuples
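Elapsed-time measurement of this kind can be sketched with a small timing helper. The code below is illustrative only: `build_table` is a stand-in workload, not WEDAGEN's generator, and the table sizes are arbitrary.

```python
import time


def elapsed_seconds(generate, *args):
    """Run generate(*args) and return (elapsed wall-clock seconds, result)."""
    start = time.perf_counter()
    result = generate(*args)
    return time.perf_counter() - start, result


def build_table(num_tuples: int):
    """Stand-in workload: build a list of dummy two-node tuples."""
    return [(i, i + 1) for i in range(num_tuples)]


# Time the stand-in generator for a few arbitrary table sizes.
timings = {}
for size in (1_000, 10_000, 100_000):
    seconds, table = elapsed_seconds(build_table, size)
    timings[size] = seconds
```

Repeating each measurement several times and averaging would reduce noise; the slides do not say whether the reported times were averaged.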
19
Four Test Schemas
20
Three Table Sizes
21
Elapsed Time Vs No. of Tuples
22
Experimental Findings
• Time elapsed in generating a web table increases with the size of the table
• Rate of growth differs between schemas; i.e., schema complexity affects elapsed time
  ◆ Generating a table of the tree schema (schema 2) takes longer than that of the linear schema (schema 1)
  ◆ Generating a table of schema 2 takes longer than that of schema 4
23
Summary
• Parameters for creating web data of different sizes and complexities were successfully identified
• WEDAGEN was designed, implemented and successfully integrated into the WHOWEDA system
• WEDAGEN scales well with increasing web schema complexity and web table size
• Time and effort required to evaluate web database system performance can be reduced with WEDAGEN
24
Future Work
• Inclusion of more parameters:
  ◆ Minimum and maximum depth of a tuple
  ◆ Average ratio of bound and unbound nodes in a tuple
• Apply WEDAGEN to other database systems similar to WHOWEDA
• Develop WHOWEDA into a full-fledged benchmark toolkit