Web Data Management COSC 4806

Introduction
- The ‘world wide web’: a vast, widely distributed collection of semi-structured multimedia documents
- A heterogeneous collection of documents
- Documents take the form of web pages
- Documents are connected via hyperlinks

World Wide Web
- The web is growing rapidly
- Business organizations are increasingly presenting information on the Web
- ‘Business on the highway’
- A myriad of raw data must be processed to obtain information

World Wide Web
- The web is a fast-growing, distributed and non-administered global information resource
- The WWW allows access to text, images, video, sound and graphical data
- An ever-increasing number of businesses are building web servers
- A chaotic environment in which to locate information of interest
- ‘Lost in hyperspace’ syndrome

World Wide Web
Characteristics of the WWW:
- it is a set of directed graphs
- data is heterogeneous, self-describing and schema-less
- unstructured, deeply nested information
- no central authority for information management
- dynamic information vs. static information
- web information discovery: search engines

World Wide Web
Rapid growth of the web:
- In 1994, the WWW grew by 1758%
- June 1993: 130
- June 1994: 1,265
- Dec. 1994: 11,576
- April 1995: 15,768
- July 1995: 23,000+
- January 2005: 11.5 billion publicly indexed web pages

World Wide Web .com domains on the rise, as of July 2006:  76,683,115 hosts for ‘com’ domains  10,232,188 hosts for ‘edu’ domains  185,919,955 hosts for ‘net’ domains  727,773 hosts for ‘gov’ domains  1,933,551 hosts for ‘mil’ domains  1,660,470 hosts for ‘org’ domains

World Wide Web
The exponential growth of the Internet is reflected in the number of hosts on the net:
- 1,000 in 1984
- 10,000 in 1987
- 100,000 in 1989
- 1,000,000 in 1992
- 10,000,000 in 1996
- 100,000,000 in 2000
- 171,638,297 in 2003
- 489,774,269 in July 2007
- Net Timeline (http://www.pbs.org/internet/timeline/)
- Internet Domain Survey (http://www.isc.org/ds/)
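As a rough illustration of what "exponential" means for these host counts, the sketch below estimates a doubling time from two of the data points listed above. It assumes a constant growth rate between the two years, which is a simplification; the function name `doubling_time` is made up for this example.

```python
import math

def doubling_time(count_start, year_start, count_end, year_end):
    """Years needed to double, assuming a constant exponential growth rate."""
    rate = math.log(count_end / count_start) / (year_end - year_start)
    return math.log(2) / rate

# Host counts taken from the slide: 1,000 in 1984 and 100,000,000 in 2000.
years = doubling_time(1_000, 1984, 100_000_000, 2000)
print(f"Hosts doubled roughly every {years:.2f} years")
```

Under that assumption, the number of hosts doubled a little less than once a year over the period.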

World Wide Web
Distribution of hosts (worldwide):
- US: 195,138,696
- European Union: 22,000,414
- Japan: 21,304,292
- Germany: 7,657,162
- Netherlands: 6,781,729
- South Korea: 5,433,591
- Australia: 5,351,622
- UK: 4,688,307
- Brazil: 4,392,693
- Taiwan: 3,838,383

World Wide Web
Popular online activities:
- Email: 77%
- Search engine: 63%
- Get news: 46%
- Job-related search: 29%
- Instant messaging: 18%
- Online banking: 18%
- Chat room: 8%
- Travel reservation: 5%
- Read blogs: 3%
- Online auction: 3%

World Wide Web
Key limitations of search engines:
- do not exploit hyperlinks
- search is limited to string matching
- queries are evaluated on archived data rather than up-to-date data; current data is not indexed
- low accuracy; replicated results
- no further manipulation of results is possible

World Wide Web
Key limitations of search engines (contd.):
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery

World Wide Web
More issues:
- specifying and understanding what information is wanted
- the high degree of variability of accessible information
- the variability in the conceptual vocabulary or “ontology” used to describe information
- the complexity of querying unstructured data

World Wide Web
More issues (contd.):
- the complexity of querying structured data
- the uncontrolled nature of web-based information content
- determining which information sources to search or query

World Wide Web
Search engine capabilities:
- Selection of language
- Keywords with disjunction, adjacency, presence, absence, ...
- Word stemming (HotBot)
- Similarity search (Excite)
- Natural language (LycosPro)
- Restrict by modification date (HotBot) or range of dates (AltaVista)
- Restrict result types, e.g., must include images (HotBot)
- Restrict by geographical source (content or domain) (HotBot)
- Restrict to various structured regions of a document: titles or URLs (LycosPro); summary, first heading, title, URL (Opentext)

World Wide Web
Search & Retrieval:
- Using several search engines is better than using only one
Search engine     % web covered
HotBot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3

World Wide Web
Schemes to locate information:
- Supervised links between sites (library analogue: ask at the reference desk)
  - Gopher (Univ. of Minnesota): menu format with links both to sites and content
- Classification of documents (library analogue: search in the catalog)
  - Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites
- Automated searching (library analogue: wander around the library)
  - Use META tags to gather metadata
  - Spiders (robots, web crawlers)

World Wide Web
Popular search engines:
- Year 2000: AltaVista, Yahoo, HotBot
- Year 2001: Google, NorthernLight, AltaVista

World Wide Web
- Boolean search in AltaVista

World Wide Web
- Specifying field content in HotBot

World Wide Web
- Natural language interface in AskJeeves

World Wide Web
Examples of search strategies:
- Rank web pages based on popularity
- Rank web pages based on word frequency
- Match the query to an expert database
- The major search engines use a mixed strategy

World Wide Web
Frequency-based ranking:
- Library analogue: keyword search
- Basic factors in HotBot ranking of pages:
  - words in the title
  - keyword meta tags
  - word frequency in the document
  - document length
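The four factors listed for frequency-based ranking can be combined into a single score. The following is a minimal sketch under stated assumptions, not HotBot's actual formula: the weights, the function names and the sample pages are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def frequency_score(query, title, meta_keywords, body,
                    title_weight=3.0, meta_weight=2.0):
    """Score one page for a query; the weights are illustrative assumptions."""
    title_words = set(tokenize(title))
    meta_words = set(tokenize(meta_keywords))
    body_words = tokenize(body)
    counts = Counter(body_words)
    length = max(len(body_words), 1)  # avoid division by zero for empty pages
    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += title_weight          # factor 1: words in the title
        if term in meta_words:
            score += meta_weight           # factor 2: keyword meta tags
        score += counts[term] / length     # factors 3 and 4: frequency / length
    return score

# Made-up pages: (title, meta keywords, body text).
pages = {
    "a.html": ("French wine guide", "wine, france",
               "wine regions of france and wine tasting"),
    "b.html": ("Linux news", "linux",
               "a short note that mentions wine once"),
}
ranked = sorted(pages, key=lambda p: frequency_score("wine", *pages[p]),
                reverse=True)
print(ranked)
```

Dividing the term count by the document length keeps long pages from winning simply because they contain more words.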

World Wide Web
Alternative word frequency measures:
- Excite uses a thesaurus to search for what you want, rather than what you ask for
- AltaVista allows you to look for words that occur within a set distance of each other
- NorthernLight weighs results by search term sequence, from left to right
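The "within a set distance" idea can be illustrated with a small proximity check. This is only a sketch of the general technique, not AltaVista's implementation; `within_distance` and its parameters are hypothetical names.

```python
import re

def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap words apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    # Compare every occurrence of one word with every occurrence of the other.
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

text = "Linux is popular in Iceland"
print(within_distance(text, "linux", "iceland", 4))  # the words are 4 positions apart
print(within_distance(text, "linux", "iceland", 3))
```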

World Wide Web
Popularity-based ranking:
- Library analogue: citation index
- The Google strategy for ranking pages:
  - Rank is based on the number of links to a page
  - Pages with a high rank have a lot of other web pages that link to them
  - The formula is on the Google help page
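Taking the slide's description literally (rank is based on the number of links to a page), popularity ranking reduces to counting in-links. The sketch below does exactly that on a made-up link list. It is a deliberate simplification: Google's published PageRank formula also weights each link by the rank of the linking page, which is not modeled here.

```python
from collections import Counter

def inlink_rank(links):
    """Rank pages by the number of distinct pages linking to them.

    links: iterable of (source, target) pairs, one per hyperlink.
    """
    # Ignore duplicate links and links from a page to itself.
    unique = {(s, t) for s, t in links if s != t}
    counts = Counter(t for _, t in unique)
    # Highest in-link count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical link structure.
links = [("a", "c"), ("b", "c"), ("c", "a"), ("b", "c"), ("d", "d")]
ranking = inlink_rank(links)
print(ranking)
```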

World Wide Web
More on popularity ranking:
- The Google philosophy is also applied by others, such as NorthernLight
- HotBot measures the popularity of a page by how frequently users have clicked on it in past search results

World Wide Web
Expert databases, Yahoo:
- An expert database contains predefined responses to common queries
- A simple approach is a subject directory, e.g. in Yahoo!, which contains a selection of links for each topic
- The selection is small, but can be useful
- Library analogue: trustworthy references

World Wide Web
Expert databases, AskJeeves:
- AskJeeves has predefined responses to various types of common queries
- These prepared answers are augmented by a meta-search, which searches other search engines
- Library analogue: reference desk
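A meta-search sends one query to several engines and merges what comes back. The sketch below shows one plausible way to merge and de-duplicate result lists; it is not AskJeeves' method. The engines are stand-in functions returning canned, made-up results, and the ordering rule is an assumption.

```python
def meta_search(query, engines):
    """Send the query to every engine and merge the result lists.

    engines: mapping from engine name to a function query -> list of URLs.
    """
    merged = {}
    for name, search in engines.items():
        for position, url in enumerate(search(query)):
            entry = merged.setdefault(url, {"engines": [], "best": position})
            entry["engines"].append(name)
            entry["best"] = min(entry["best"], position)
    # URLs returned by more engines come first; ties go to the better position.
    return sorted(
        merged,
        key=lambda u: (-len(merged[u]["engines"]), merged[u]["best"], u),
    )

# Stand-in engines with hypothetical results.
engines = {
    "engine_a": lambda q: ["x.com", "y.com"],
    "engine_b": lambda q: ["y.com", "z.com"],
}
results = meta_search("best wines in France", engines)
print(results)
```

Each URL appears once in the merged list, even when several engines return it.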

World Wide Web
- Example: best wines in France; AskJeeves

World Wide Web
- Best wines in France; HotBot

World Wide Web
- Best wines in France; Google

World Wide Web
- Linux in Iceland; Google

World Wide Web
- Linux in Iceland; HotBot

World Wide Web
- Linux in Iceland; AskJeeves

Web Data Management
Key objectives:
- Design of a suitable data model to represent web information
- Development of a web algebra and query language, and query optimization
- Maintenance of Web data: view maintenance
- Development of knowledge discovery and web mining tools
- Web warehouse
- Data integration, secondary storage, indexes

Web Data Management
Limitations of the web:
- Applications cannot consume HTML
- HTML wrapper technology is brittle
- Companies merge and need interoperability

Web Data Management
Paradigm shift:
- New Web standards: XML
- XML is generated by applications and consumed by applications
- Data exchange
  - Across platforms: enterprise interoperability
  - Across enterprises
- The Web: from documents to data

Web Data Management
Database challenges:
- Query optimization and processing
- Views and transformations
- Data warehousing and data integration
- Mediators and query rewriting
- Secondary storage
- Indexes

Web Data Management
DBMS needs a paradigm shift too:
- Web data differs from database data
  - self-describing, schema-less
  - structure changes without notice
  - heterogeneous, deeply nested
  - irregular; documents and data are mixed
  - designed by document experts, not DB experts
- Hence the need for Web Data Management

Web Data Management
Web data representation:
- HTML (Hypertext Markup Language)
  - fixed grammar, no regular expressions
  - simple representation of data
  - good for simple data and intended for human consumption
  - difficult to extract information from
- SGML (Standard Generalized Markup Language)
  - good for publishing deeply structured documents
- XML (Extensible Markup Language)
  - a subset of SGML

Web Data Management
Terminology:
- HTML: Hypertext Markup Language
- HTTP: Hypertext Transfer Protocol
- URL: Uniform Resource Locator
  - example: <URL> := <protocol>://<host>/<path>/<filename>[<#location>] where
    - <protocol> is http, ftp, gopher
    - host is internet address …
    - #location is a textual label in the file
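The parts of a URL described above (protocol, host, path and file name, and the optional #location label) can be pulled apart with Python's standard `urllib.parse` module. The URL below is a hypothetical example chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
parts = urlsplit("http://www.example.org/courses/cosc4806/intro.html#objectives")

protocol = parts.scheme    # the access protocol, e.g. http, ftp, gopher
host = parts.netloc        # the internet address of the server
path = parts.path          # directory path plus file name
location = parts.fragment  # the textual label after '#'

print(protocol, host, path, location)
```

Note that `urlsplit` calls the protocol the "scheme" and the #location label the "fragment".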

Web Data Management
Prevalent, persistent and informative:
- HTML documents (now XML) are created by humans or applications
- Accessed day in and day out by humans and applications
- HTML documents are persistent
- Can database technology help?

Web Data Management
Some recent research projects:
- Web query systems
  - W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus
- Semi-structured data management
  - LOREL, UnQL, WebOQL, Florid
- Website management systems
  - STRUDEL, Araneus
- Web warehouse
  - WHOWEDA

Web Data Management
Main tasks:
- Modeling and querying the Web
  - view the web as a directed graph
  - content- and link-based queries
  - example: find the pages that contain the word “Clinton” and have a link from a page containing the word “Monica”
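The example query combines a content condition (a page contains a word) with a link condition (one page links to another). A minimal sketch over a toy in-memory web follows. The pages, their text and the helper names are all made up, and a real system would evaluate this with a query language such as those discussed later rather than a Python loop.

```python
# Toy web: page id -> (text, list of outgoing links). All data is made up.
web = {
    "p1": ("Monica gave an interview", ["p2", "p3"]),
    "p2": ("Clinton spoke today", []),
    "p3": ("Weather report", ["p2"]),
    "p4": ("Clinton biography", []),
}

def contains(page, word):
    """Content condition: does the page text contain the word?"""
    return word.lower() in web[page][0].lower().split()

def linked_pages(source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    return sorted({
        target
        for source, (_, links) in web.items() if contains(source, source_word)
        for target in links if target in web and contains(target, target_word)
    })

answer = linked_pages("Monica", "Clinton")
print(answer)
```

Page p4 also mentions "Clinton" but is left out, because no page containing "Monica" links to it.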

Web Data Management
Main tasks (contd.):
- Information extraction and integration
  - wrapper: a program that extracts a structured representation of the data, i.e. a set of tuples, from HTML pages
  - mediator: software that integrates data by accessing multiple sources through a uniform interface
- Web site construction and restructuring
  - creating sites
  - modeling the structure of web sites
  - restructuring data
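A wrapper in the sense above turns an HTML page into a set of tuples. The sketch below uses Python's standard `html.parser` to read the rows of an HTML table. The class name `TableWrapper` and the sample page are hypothetical, and the code assumes a simple, well-formed table.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collect the cells of each <tr> in an HTML table as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Made-up page fragment.
html = """<table>
<tr><td>Casablanca</td><td>1942</td></tr>
<tr><td>Vertigo</td><td>1958</td></tr>
</table>"""

wrapper = TableWrapper()
wrapper.feed(html)
wrapper.close()
tuples = wrapper.rows
print(tuples)
```

The wrapper depends on the exact tag layout of the page, so a change in the page's markup can silently break the extraction.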

Web Data Management
What to model?
- Structure of Web sites
- Internal structure of web pages
- Contents of web sites at finer granularities

Web Data Management
Data representation of Web data:
- Graph data models
- Semi-structured data models (also graph based)

Web Data Management
Graph data model:
- Labeled graph data model where nodes represent web pages and arcs represent links between pages
- Labels on arcs can be viewed as attribute names
- Regular path expression queries
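A regular path expression query asks for the nodes reachable along a path whose sequence of arc labels matches a regular expression. The sketch below is one simple way to evaluate such a query on a small labeled graph, assuming a hypothetical site structure. It enumerates bounded-length paths, which is adequate for illustration but not how a query engine would do it.

```python
import re

# Labeled graph: node -> list of (arc label, target node). Hypothetical site.
graph = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("staff", "alice"), ("course", "cosc4806")],
    "math": [("staff", "bob")],
    "alice": [], "bob": [], "cosc4806": [],
}

def path_query(start, pattern, max_depth=4):
    """Nodes reachable from start along a path whose labels match pattern.

    The labels of a path are joined with "." and matched against a regular
    expression; max_depth bounds the search so cycles cannot loop forever.
    """
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) < max_depth:
            for label, target in graph.get(node, []):
                stack.append((target, labels + [label]))
    return sorted(found)

staff = path_query("home", r"dept\.staff")
under_cs = path_query("cs", r"(staff|course)")
print(staff, under_cs)
```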

Web Data Management
Semi-structured data models:
- Irregular data structure; no fixed schema is known, and the schema may be implicit in the data
- The schema may be large and may change frequently
- The schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated

Web Data Management
Semi-structured data models (contd.):
- Data is not strongly typed; for different objects, the values of the same attribute may be of differing types (heterogeneous sources)
- No restriction on the set of arcs that emanate from a given node in a graph, or on the types of the values of attributes
- Ability to query the schema: arc variables are bound to labels on arcs, rather than to nodes in the graph
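These properties can be seen in a small example. In the sketch below, two made-up movie objects carry the same attribute with different types and different sets of attributes, and a helper treats the label itself as a query variable. The data and the name `labels_with_value` are assumptions for illustration only.

```python
# Two "movie" objects from different sources; "year" and "cast" hold values of
# different types, and the second object has an attribute the first one lacks.
movies = [
    {"title": "Vertigo", "year": 1958, "cast": ["James Stewart", "Kim Novak"]},
    {"title": "Casablanca", "year": "1942", "cast": "Humphrey Bogart",
     "review": {"stars": 5}},
]

def labels_with_value(obj, predicate, prefix=""):
    """Yield the label paths whose leaf value satisfies predicate.

    The label acts as a query variable here: it is bound to arc labels found
    in the data instead of being fixed in advance by a schema.
    """
    for label, value in obj.items():
        path = prefix + label
        if isinstance(value, dict):
            yield from labels_with_value(value, predicate, path + ".")
        elif predicate(value):
            yield path

# Which arcs lead to an integer value in each object?
int_labels = [
    sorted(labels_with_value(m, lambda v: isinstance(v, int))) for m in movies
]
print(int_labels)
```

The same question gets a different answer for each object, which is what the absence of a fixed schema and of strong typing looks like in practice.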

Web Data Management
Graph-based query languages:
- Use graphs to model databases
- Support regular path expressions and graph construction in queries
- Examples:
  - GraphLog for hypertext queries
  - a graph query language for OO

Web Data Management
Query languages for semi-structured data:
- Use labeled graphs
- Query the schema of the data
- Ability to accommodate irregularities in the data, such as missing links
- Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

Web Data Management
- Comparing query systems

Web Data Management
Types of query languages:
- First generation
- Second generation

Web Data Management
First-generation query languages:
- Combine the content-based queries of search engines with structure-based queries
- Combine conditions on text patterns in documents with graph patterns describing link structures
- Examples: W3QL (Technion, Israel), WebSQL (Toronto), WebLOG (Concordia)

Web Data Management
Second-generation query languages:
- Called web data manipulation languages
- Treat web pages as atomic objects with two kinds of properties: they contain or do not contain certain text patterns, and they point to other objects
- Useful for data wrapping, transformation, and restructuring
- Useful for web site transformation and restructuring

Web Data Management
How do they differ?
- Provide access to the structure of the web objects they manipulate, and return structure
- Model the internal structure of web documents as well as the external links that connect them
- Support references to model hyperlinks, and some support ordered collections of records for more natural data representation
- Ability to create new complex structures as the result of a query

Web Data Management
Examples:
- WebOQL
- STRUQL
- Florid

Web Data Management
Information integration:
- Answer queries that may require extracting and combining data from multiple web sources
- Example: a movie database with data about movies, their star casts, directors, schedules, etc.
- Sample query: give me the playing times and reviews of movies starring Frank Sinatra that are playing tonight in Paris

Web Data Management
Approaches:
- Web warehouse: data from multiple web sources is loaded into a warehouse, and all queries are applied to the warehouse data
  - Disadvantage: the warehouse needs to be updated when the data sources change
  - Advantage: improved performance
- Virtual warehouse: data remains in the web sources, and queries are decomposed at run time into queries to the sources
  - Data is not replicated and is fresh
  - Because web sources are autonomous, their query optimization and execution methods may differ, and performance may be affected
  - Good when the number of sources is large, the data changes frequently, and there is little control over the web sources

Web Data Management
Virtual approach vs. DBMS:
- In the virtual approach, the system does not communicate directly with a storage manager; instead, it communicates with wrappers
- The user does not pose queries directly in the schema in which the data is stored, so the user is freed from knowing its structure
- The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application
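The virtual approach can be sketched in a few lines: a query on a mediated relation is split into calls to per-source wrappers, and the results are joined at run time. Everything below is hypothetical, including the wrapper functions, the field names and the mediated relation Movie(title, time, review). A real mediator would also handle query reformulation and optimization.

```python
# Each wrapper exposes one source; the data and field names are made up.
def schedule_wrapper():
    return [
        {"film": "Vertigo", "city": "Paris", "time": "20:30"},
        {"film": "Casablanca", "city": "Lyon", "time": "19:00"},
    ]

def review_wrapper():
    return [{"movie_title": "Vertigo", "text": "A classic."}]

def mediated_query(city):
    """Answer a query on the virtual relation Movie(title, time, review).

    The relation is not stored anywhere: the query is split into one
    sub-query per source and the results are joined on the movie title.
    """
    reviews = {r["movie_title"]: r["text"] for r in review_wrapper()}
    return [
        {"title": s["film"], "time": s["time"], "review": reviews.get(s["film"])}
        for s in schedule_wrapper()
        if s["city"] == city
    ]

tonight = mediated_query("Paris")
print(tonight)
```

The caller only sees the mediated attribute names (title, time, review) and never the source-specific ones (film, movie_title), which is the point of the mediated schema.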

Web Data Management
Data integration steps:
- Specification of mediated schema and reformulation
  - The mediated schema is the set of collection and attribute names needed to formulate queries
  - The data integration system translates a query on the mediated schema into queries to the data sources
- Completeness of data in web sources
- Differing query processing capabilities
- Query optimization: selecting a minimal set of sources and minimal queries
- Wrapper construction
- Matching objects across sources
