WEDAGEN: A Synthetic Web Database Generator

Presentation Outline
- Existing WWW search mechanisms
- WHOWEDA: A Warehouse of Web Data
- Modular structure of WEDAGEN
- Configuration parameters
- Performance evaluation
- Summary and future work

Existing W3 Search Mechanisms
- Time delay in manual navigation of the web
- Overwhelming results and unwanted information
- No tool for organizing and storing harnessed information for further manipulation
- Search engines and browsers are not always the best ways to systematically harness information from the web
- The WHOWEDA NTU

Overview of WHOWEDA
- A web warehousing system to store and manipulate web information
- Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate web tables
- To extract information from W3, the user defines a ‘query graph’
- The result of extraction is a set of web tuples; each tuple instantiates the query graph
- More information:

Example: Query graph (web schema)
Nodes: N1 N2 N3 N4 N5
Predicates:
- N1.URL EQUALS “ bySubject/Computing/UniSciDepts.html”
- L2.LABEL EQUALS “faculty”
- L3.LABEL EQUALS “research projects”
- L4.LABEL CONTAINS “publications”
- L5.LABEL CONTAINS “publications”
- N5.TEXT CONTAINS “Internet computing”
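The predicates on this slide can be read as small attribute tests on node and link variables. The following is a minimal sketch of that reading, not WHOWEDA's actual query representation: the `Predicate` class, the `tuple_satisfies` helper and the candidate tuple are all hypothetical, and the N1.URL predicate is left out because its URL is truncated in the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The two predicate operators that appear on the slide.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "EQUALS": lambda value, operand: value == operand,
    "CONTAINS": lambda value, operand: operand in value,
}


@dataclass(frozen=True)
class Predicate:
    variable: str   # node or link variable, e.g. "N5" or "L2"
    attribute: str  # e.g. "LABEL" or "TEXT"
    operator: str   # "EQUALS" or "CONTAINS"
    operand: str

    def holds(self, instance: Dict[str, str]) -> bool:
        """Check this predicate against one node or link instance."""
        value = instance.get(self.attribute, "")
        return OPERATORS[self.operator](value, self.operand)


# Predicates copied from the slide (N1.URL omitted: its URL is truncated).
PREDICATES = [
    Predicate("L2", "LABEL", "EQUALS", "faculty"),
    Predicate("L3", "LABEL", "EQUALS", "research projects"),
    Predicate("L4", "LABEL", "CONTAINS", "publications"),
    Predicate("L5", "LABEL", "CONTAINS", "publications"),
    Predicate("N5", "TEXT", "CONTAINS", "Internet computing"),
]


def tuple_satisfies(web_tuple: Dict[str, Dict[str, str]]) -> bool:
    """A candidate tuple maps each variable name to an attribute dictionary."""
    return all(p.holds(web_tuple.get(p.variable, {})) for p in PREDICATES)


# Hypothetical candidate tuple, invented purely for illustration.
candidate = {
    "L2": {"LABEL": "faculty"},
    "L3": {"LABEL": "research projects"},
    "L4": {"LABEL": "selected publications"},
    "L5": {"LABEL": "publications"},
    "N5": {"TEXT": "A survey of Internet computing techniques"},
}
```

Calling `tuple_satisfies(candidate)` checks every predicate against the instance bound to its variable; a tuple with a missing variable fails because the empty attribute dictionary matches nothing.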

Example: Query results

Id  Name   Age
A1  John   23
C2  Wendy  35
B4  Jane   25
A2  Wendy  35
C9  Pete   42
B3  Kim    38
F8  Tom    22
G7  Cindy  47

Objectives
- Need to perform systematic evaluation of web operators during WHOWEDA development
- Limitations of testing using real web data
- To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
- To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
- WEDAGEN: A Web Database Generator

System Architecture of WEDAGEN

Configuration Input Parameters
[Figure: tree of WEDAGEN parameters with group labels Default, Specific, Selectivity, Instance Related and Control]
- Parameters shown: NumTuples, Web Schema, Fan-In
- Instance parameters (shown twice in the figure): NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
- Selectivity parameters: NodeSelectivity, TableSelectivity
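One way to picture these parameters is as a configuration object. The sketch below is an assumption-heavy illustration, not WEDAGEN's configuration format: the field names mirror the slide, but every default value is a placeholder, and the reading that "default" values apply to all schema nodes while "specific" values override them per node is a guess from the figure labels.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class InstanceParams:
    # Field names mirror the parameter names on the slide.
    # All values are placeholders, not WEDAGEN's real defaults.
    num_source_node_instances: int = 10
    fan_out: int = 3
    num_key_words_per_node_instance: int = 5
    num_words_per_node_instance: int = 100
    num_words_per_link_label: int = 2
    num_words_per_title: int = 4
    num_words_per_host_name: int = 2
    local_global_link: float = 0.5  # assumed meaning: share of local links


@dataclass
class GeneratorConfig:
    num_tuples: int
    # Assumed semantics: defaults apply to every schema node ...
    default: InstanceParams = field(default_factory=InstanceParams)
    # ... unless a node has an entry here (hypothetical per-node override).
    specific: Dict[str, InstanceParams] = field(default_factory=dict)

    def params_for(self, schema_node: str) -> InstanceParams:
        """Return the specific parameters for a node, else the defaults."""
        return self.specific.get(schema_node, self.default)


config = GeneratorConfig(num_tuples=1000)
# Override only the fan-out for schema node N2; other fields keep defaults.
config.specific["N2"] = replace(config.default, fan_out=5)
```

Here `config.params_for("N2")` returns the overridden fan-out, while any other node falls back to the default block.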

Parameter Values Suggestion
[Figure: flowchart with the following boxes]
- Start
- Generate specific parameter values
- User changes specific parameters
- Calculate max. no. of tuples to be generated
- Decision: is the calculated value > NumTuples?
- Calculate NumSourceNodeInstances to generate the specified number of tuples
- Store suggested values in file
- User changes specific parameters
- End
- Invoke instance generation module
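The two calculations named in the flowchart can be sketched under a simple model. The slide does not give the formulas, so the following assumes a linear schema in which every node instance links to exactly `fan_out` instances of the next schema node; the function names and the `schema_depth` argument are hypothetical.

```python
def max_tuples(num_source_node_instances: int, fan_out: int, schema_depth: int) -> int:
    """Upper bound on tuples for a linear schema of `schema_depth` nodes.

    Assumed model: each source instance roots fan_out ** (schema_depth - 1)
    distinct root-to-leaf paths, and each path is one tuple.
    """
    return num_source_node_instances * fan_out ** (schema_depth - 1)


def suggest_source_instances(num_tuples: int, fan_out: int, schema_depth: int) -> int:
    """Smallest NumSourceNodeInstances whose upper bound reaches num_tuples."""
    tuples_per_source = fan_out ** (schema_depth - 1)
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_tuples // tuples_per_source)


# With FanOut = 3 and a 4-node linear schema, each source yields 27 tuples.
suggested = suggest_source_instances(num_tuples=1000, fan_out=3, schema_depth=4)
```

Under this model, the suggestion is the smallest source count whose bound is at least NumTuples; a tree-shaped schema would need a different bound.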

Instance Generation Module (IGM)
[Figure: IGM data flow]
Components:
1. No. of node instance generator
2. URL generator
3. Node instance attribute generator
4. Link set generator
5. Web page generator
Inputs and intermediate data labelled in the figure: Num Source Node Instances, Fan out, No. of Node Instances per node, Num words per URL, URLs of all node instances, Link set of each instance, Node attributes (e.g. title, text, date), Num words per node instance, Num words per title, Images, Node Pool
Outputs and downstream labels: Web pages, Web tables, Tuple Extraction Module
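A much-simplified sketch of the first four components follows. It assumes a linear schema and a fixed fan-out, uses an invented vocabulary and URL pattern, and skips web page generation entirely, so it illustrates the data flow rather than WEDAGEN's implementation; `generate_instances` and `WORDS` are hypothetical names.

```python
import random
from typing import Dict, List, Tuple

WORDS = ["alpha", "beta", "gamma", "delta", "omega"]  # hypothetical vocabulary


def generate_instances(
    schema_nodes: List[str], num_source: int, fan_out: int, seed: int = 0
) -> Tuple[Dict[str, dict], List[Tuple[str, str]]]:
    """Generate node instances and links for a linear schema (assumed model)."""
    rng = random.Random(seed)
    nodes: Dict[str, dict] = {}
    links: List[Tuple[str, str]] = []

    def new_node(schema_node: str) -> str:
        node_id = f"n{len(nodes) + 1}"
        host = ".".join(rng.choice(WORDS) for _ in range(2))
        nodes[node_id] = {
            "schema_node": schema_node,
            # Invented URL pattern; the node ID keeps every URL unique.
            "url": f"http://www.{host}.example/{node_id}.html",
            "title": " ".join(rng.choice(WORDS) for _ in range(3)),
        }
        return node_id

    # Source node instances come first, then each level fans out.
    frontier = [new_node(schema_nodes[0]) for _ in range(num_source)]
    for schema_node in schema_nodes[1:]:
        next_frontier = []
        for parent in frontier:
            for _ in range(fan_out):
                child = new_node(schema_node)
                links.append((parent, child))  # link set entry
                next_frontier.append(child)
        frontier = next_frontier
    return nodes, links


nodes, links = generate_instances(["N1", "N2", "N3"], num_source=2, fan_out=3)
```

With 2 source instances and a fan-out of 3 over three schema nodes, this produces 2 + 6 + 18 node instances connected by 24 links.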

Directed Graph Output from IGM

Tuple Extraction Module (TEM)
- IGM generates all node and link instances, interconnected as directed graph(s)
- TEM extracts and constructs individual web tuples from the directed graph(s)
- Node and link instances are assigned IDs
- Web tuples are stored in a web table file
- The resulting web table is complete with node, link and tuple information
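The extraction step can be illustrated with a depth-first walk over the directed graph. This sketch assumes an acyclic graph and treats each root-to-leaf path as one tuple, which fits a linear schema only; for tree-shaped schemas a tuple would be a subtree instead. The function name and the edge-list input format are hypothetical.

```python
from typing import Dict, List, Tuple


def extract_tuples(links: List[Tuple[str, str]]) -> List[List[str]]:
    """Enumerate root-to-leaf paths of an acyclic directed graph."""
    children: Dict[str, List[str]] = {}
    has_parent = set()
    for source, target in links:
        children.setdefault(source, []).append(target)
        has_parent.add(target)

    # Source instances have outgoing links but no incoming ones.
    roots = [node for node in children if node not in has_parent]
    tuples: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        if node not in children:  # leaf reached: one complete tuple
            tuples.append(path)
            return
        for child in children[node]:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return tuples


# Hypothetical graph: a -> b -> d and a -> c.
example_links = [("a", "b"), ("a", "c"), ("b", "d")]
example_tuples = extract_tuples(example_links)
```

For the small graph above, the walk yields the two paths `["a", "b", "d"]` and `["a", "c"]`.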

Extracted Web Tuples

Preliminary Evaluation
- Elapsed time is used to measure the overhead of web table generation
- A set of sample test configurations was identified, consisting of typical combinations of 4 web schemas and input parameters
- Performance is measured with respect to:
  - Complexity of schema
  - Total number of node instances and total number of tuples
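An elapsed-time measurement of this kind can be sketched with a small harness. The slides do not describe how timing was done, so this is only one plausible setup: `time_generation` is a hypothetical helper, `fake_generate` is a stand-in for a real generator, and taking the best of several runs is a design choice made here to reduce noise.

```python
import time
from typing import Any, Callable, Tuple


def time_generation(generate: Callable[..., Any], *args: Any, repeats: int = 3) -> Tuple[float, Any]:
    """Run `generate` several times and return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = generate(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def fake_generate(num_tuples: int) -> list:
    """Stand-in for a web table generator; builds a list of dummy tuples."""
    return [(i, i + 1) for i in range(num_tuples)]


elapsed, table = time_generation(fake_generate, 1000)
```

Repeating the call for each schema and table size would give the data points for an elapsed-time versus number-of-tuples comparison.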

Four Test Schemas

Three Table Sizes

Elapsed Time Vs No. of Tuples

Experimental Findings
- The time elapsed in generating a web table increases with the size of the table
- The rate of growth differs between schemas; i.e., schema complexity affects elapsed time
  - Generating a table for the tree schema (schema 2) takes longer than for the linear schema (schema 1)
  - Generating a table for schema 2 takes longer than for schema 4

Summary
- Successfully identified parameters for creating web data of different sizes and complexities
- Designed and implemented WEDAGEN, which has been successfully integrated into the WHOWEDA system
- WEDAGEN scales well with increasing web schema complexity and web table size
- The time and effort required to evaluate web database system performance can be reduced with WEDAGEN

Future Work
- Inclusion of more parameters:
  - Minimum and maximum depth of a tuple
  - Average ratio of bound and unbound nodes in a tuple
- Apply WEDAGEN to other database systems similar to WHOWEDA
- Develop WHOWEDA into a full-fledged benchmark toolkit