
1 Web Data Management COSC 4806

2 Introduction  The ‘world wide web’:  a vast, widely distributed collection of semi-structured multimedia documents  heterogeneous collection of documents  documents in the form of web pages  documents connected via hyperlinks

3 World Wide Web  The web is growing rapidly  Business organizations increasingly presenting information on the Web  ‘Business on the highway’  Myriad of raw data to be processed for information

4 World Wide Web  The web is a fast growing, distributed & non-administered global information resource  WWW allows access to text, images, video, sound and graphical data  Ever-increasing number of businesses building web servers  A chaotic environment to locate information of interest  Lost in hyperspace syndrome

5 World Wide Web  Characteristics of the WWW:  it’s a set of directed graphs  data is heterogeneous, self-describing & schemaless  unstructured, deeply nested information  no central authority for information management  dynamic information vs. static information  web information discovery – search engines

6 World Wide Web  Rapid growth of the web (in 1994, the WWW grew by 1758%):
- June 1993: 130
- June 1994: 1,265
- Dec. 1994: 11,576
- April 1995: 15,768
- July 1995: 23,000+
- January 2005: 11.5 billion publicly indexed web pages

7 World Wide Web  .com domains on the rise; hosts per top-level domain as of July 2006:
- com: 76,683,115
- edu: 10,232,188
- net: 185,919,955
- gov: 727,773
- mil: 1,933,551
- org: 1,660,470

8 World Wide Web  The exponential growth of the Internet is reflected in the number of hosts on the net:
- 1,000 in 1984
- 10,000 in 1987
- 100,000 in 1989
- 1,000,000 in 1992
- 10,000,000 in 1996
- 100,000,000 in 2000
- 171,638,297 in 2003
- 489,774,269 in July 2007
Sources: Net Timeline (http://www.pbs.org/internet/timeline/), Internet Domain Survey (http://www.isc.org/ds/)
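
As a rough worked example of what "exponential growth" means for these figures, the sketch below estimates the doubling time of the host count between 1984 and 2000. It assumes a constant growth rate over the whole period, which is a simplification introduced here and not a claim from the slides.

```python
import math

# Host counts taken from the slide above.
hosts_1984 = 1_000
hosts_2000 = 100_000_000

years = 2000 - 1984
growth_factor = hosts_2000 / hosts_1984  # 100,000-fold increase

# Assumption: constant exponential growth, N(t) = N0 * 2 ** (t / doubling_time).
doubling_time = years * math.log(2) / math.log(growth_factor)
print(f"Estimated doubling time: {doubling_time:.2f} years")
```

Under that assumption the host count doubled roughly once a year over the period.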

9 World Wide Web  Distribution of hosts (worldwide):
- US: 195,138,696
- European Union: 22,000,414
- Japan: 21,304,292
- Germany: 7,657,162
- Netherlands: 6,781,729
- South Korea: 5,433,591
- Australia: 5,351,622
- UK: 4,688,307
- Brazil: 4,392,693
- Taiwan: 3,838,383

10 World Wide Web  Popular search methods:
- Email: 77%
- Search engine: 63%
- Get news: 46%
- Job-related search: 29%
- Instant messaging: 18%
- Online banking: 18%
- Chat room: 8%
- Travel reservation: 5%
- Read blogs: 3%
- Online auction: 3%

11 World Wide Web  Key limitations of search engines:  do not exploit hyperlinks  search limited to string matching  queries evaluated on archived data rather than up-to-date data; no indexing on current data  low accuracy; replicated results  no further manipulation possible

12 World Wide Web  Key limitations of search engines (contd.):  ERROR 404!  No efficient document management  Query results cannot be further manipulated  No efficient means for knowledge discovery

13 World Wide Web  more issues..  specifying/understanding what information is wanted  the high degree of variability of accessible information  the variability in conceptual vocabulary or “ontology” used to describe information  complexity of querying unstructured data

14 World Wide Web  contd.  complexity of querying structured data  uncontrolled nature of web-based information content  determining which information sources to search/query

15 World Wide Web  Search engine capabilities:
- Selection of language
- Keywords with disjunction, adjacency, presence, absence, ...
- Word stemming (Hotbot)
- Similarity search (Excite)
- Natural language (LycosPro)
- Restrict by modification date (Hotbot) or range of dates (AltaVista)
- Restrict result types, e.g., must include images (Hotbot)
- Restrict by geographical source, content or domain (Hotbot)
- Restrict within various structured regions of a document: titles or URLs (LycosPro); summary, first heading, title, URL (Opentext)

16 World Wide Web  Search & Retrieval: using several search engines is better than using only one

Search engine     % of web covered
Hotbot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3

17 World Wide Web  Schemes to locate information:
- Supervised links between sites (library analogue: ask at the reference desk)
  - Gopher (Univ. of Minnesota): menu format with links both to sites and content
- Classification of documents (library analogue: search in the catalog)
  - Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites
- Automated searching (library analogue: wander around the library)
  - Use META tags to gather metadata
  - Spiders (robots, web crawlers)
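
To make the META tag scheme concrete, here is a minimal sketch of how a spider might gather metadata from a page once it has been fetched. It uses Python's standard `html.parser`; the class name `MetaTagCollector` and the sample page are made up for illustration, and no real crawler is claimed to work exactly this way.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page content; a real spider would download this over HTTP.
page = """<html><head>
<title>Wine guide</title>
<meta name="keywords" content="wine, France, Bordeaux">
<meta name="description" content="A short guide to French wine.">
</head><body>...</body></html>"""

collector = MetaTagCollector()
collector.feed(page)
print(collector.meta)
```

The collected keywords and description could then be stored in an index alongside the page address.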

18 World Wide Web  Popular search engines:  Year 2000: AltaVista, Yahoo, HotBot  Year 2001: Google, NorthernLight, AltaVista

19 World Wide Web  Boolean search in AltaVista..

20 World Wide Web  Specifying field content in HotBot..

21 World Wide Web  Natural language interface in AskJeeves

22 World Wide Web  Examples of search strategies:  Rank web pages based on popularity  Rank web pages based on word frequency  Match query to an expert database  The major search engines use a mixed strategy

23 World Wide Web  Frequency based ranking:  Library analogue: Keyword search  Basic factors in HotBot ranking of pages: - words in the title - keyword meta tags - word frequency in the document - document length
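
The four factors listed above can be combined into a single score. The sketch below is only an illustration: the weights and the function name `frequency_score` are assumptions made here, since the slides do not give HotBot's actual formula.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequency_score(query, title, meta_keywords, body,
                    w_title=3.0, w_meta=2.0, w_body=1.0):
    """Score a page for a query; the weights are illustrative assumptions."""
    title_words = set(tokenize(title))
    meta_words = set(tokenize(meta_keywords))
    body_words = tokenize(body)
    length = max(len(body_words), 1)  # document length normalizes frequency
    score = 0.0
    for term in tokenize(query):
        score += w_title * (term in title_words)          # words in the title
        score += w_meta * (term in meta_words)            # keyword meta tags
        score += w_body * body_words.count(term) / length  # relative frequency
    return score


score_a = frequency_score("french wine", "French wine guide", "wine, france",
                          "Wine from France: red wine and white wine.")
score_b = frequency_score("french wine", "Linux in Iceland", "linux",
                          "No wine here.")
print(score_a > score_b)
```

Dividing by document length keeps a long page from winning simply because it repeats a word more often in absolute terms.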

24 World Wide Web  Alternative word frequency measures:  Excite uses a thesaurus to search for what you want, rather than what you ask for  AltaVista allows you to look for words that occur within a set distance of each other  NorthernLight weights results by search term sequence, from left to right
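
The proximity measure attributed to AltaVista above can be sketched as a simple position check. The function `within_distance` and its word-splitting rule are assumptions for illustration, not AltaVista's implementation.

```python
def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap words apart."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap
               for i in positions_a for j in positions_b)


text = "the best wines are produced in the south of france"
near_10 = within_distance(text, "wines", "France", 10)
near_3 = within_distance(text, "wines", "France", 3)
print(near_10, near_3)
```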

25 World Wide Web  Popularity based ranking:  Library analogue: citation index  The Google strategy for ranking pages: - Rank is based on the number of links to a page - Pages with a high rank have a lot of other web pages that link to them - The formula is on the Google help page
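
A minimal sketch of the idea stated above, ranking pages by how many other pages link to them. This counts in-links only; it is not Google's formula, which the slide says is documented elsewhere. The link graph and the name `inlink_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "b"],
}


def inlink_counts(link_graph):
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


# Highest in-link count first; ties broken alphabetically.
ranking = sorted(inlink_counts(links).items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking)
```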

26 World Wide Web  More on popularity ranking:  The Google philosophy is also applied by others, such as NorthernLight  HotBot measures popularity of a page by how frequently users have clicked on it in past search results

27 World Wide Web  Expert Databases, Yahoo  An expert database contains predefined responses to common queries  A simple approach is subject directory, e.g. in Yahoo!, which contains a selection of links for each topic  The selection is small, but can be useful  Library analogue: Trustworthy references

28 World Wide Web  Expert Databases, AskJeeves  AskJeeves has predefined responses to various types of common queries  These prepared answers are augmented by a meta-search, which searches other search engines  Library analogue: Reference desk

29 World Wide Web  Example, best wines in France; AskJeeves

30 World Wide Web  Best wines in France; HotBot

31 World Wide Web  Best wines in France; Google

32 World Wide Web  Linux in Iceland; Google

33 World Wide Web  Linux in Iceland; HotBot

34 World Wide Web  Linux in Iceland; AskJeeves

35 Web Data Management  Web Data Management; key objectives  Design a suitable data model to represent web information  Development of web algebra and query language, query optimization  Maintenance of Web data - view maintenance  Development of knowledge discovery and web mining tools  Web warehouse  Data integration, secondary storage, indexes

36 Web Data Management  Limitations of the web..  Applications cannot consume HTML  HTML wrapper technology is brittle  Companies merge, need interoperability

37 Web Data Management  Paradigm Shift  New Web standards – XML  XML generated by applications and consumed by applications  Data exchange - Across platforms: enterprise interoperability - Across enterprises  The Web: from documents to data
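
As a small illustration of "XML generated by applications and consumed by applications", the sketch below has one piece of code produce an XML record and another parse it back into native values, using Python's standard `xml.etree.ElementTree`. The element names and values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Producer side: an application serializes a record as XML.
movie = ET.Element("movie", attrib={"id": "m1"})
ET.SubElement(movie, "title").text = "Example Title"
ET.SubElement(movie, "year").text = "1999"
document = ET.tostring(movie, encoding="unicode")
print(document)

# Consumer side: another application parses the XML back into data.
parsed = ET.fromstring(document)
record = {
    "id": parsed.get("id"),
    "title": parsed.findtext("title"),
    "year": int(parsed.findtext("year")),
}
print(record)
```

Unlike an HTML page, the exchanged document names its own fields, so the consumer needs no screen-scraping logic.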

38 Web Data Management  Database challenges:  Query optimization and processing  Views and transformations  Data warehousing and data integration  Mediators and query rewriting  Secondary storage  Indexes

39 Web Data Management  DBMS needs paradigm shift too  Web data differs from database data - self-describing, schemaless - structure changes without notice - heterogeneous, deeply nested - irregular; documents and data mixed - designed by document experts, not DB experts - hence the need for Web Data Management

40 Web Data Management  Web data representation  HTML - Hypertext Markup Language - fixed grammar, no regular expressions - Simple representation of data - good for simple data and intended for human consumption - difficult to extract information  SGML - Standard Generalized Markup Language - good for publishing deeply structured documents  XML - Extensible Markup Language - a subset of SGML

41 Web Data Management  Terminology  HTML - Hypertext Markup Language  HTTP - Hypertext Transfer Protocol  URL - Uniform Resource Locator  example - <URL> := <protocol>://<host>/<path>/filename[#location] where - <protocol> is http, ftp, gopher - <host> is an internet address … - #location is a textual label in the file
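
The URL parts named above can be inspected with Python's standard `urllib.parse`. The address below is a made-up example, and the attribute names (`scheme`, `netloc`, `path`, `fragment`) are the library's terms for what the slide calls protocol, host, path/filename and #location.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only to show the components.
url = "http://www.example.org/courses/cosc4806/intro.html#objectives"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.netloc)    # host
print(parts.path)      # path and filename
print(parts.fragment)  # textual label inside the file
```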

42 Web Data Management  Prevalent, persistent and informative  HTML documents (now XML) created by humans or applications  Accessed day in and day out by Humans and Applications  Persistent HTML documents  Can database technology help?

43 Web Data Management  Some recent research projects  Web Query System - W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus  Semi structured Data Management - LOREL, UnQL, WebOQL, Florid  Website Management System - STRUDEL, Araneus  Web Warehouse - WHOWEDA

44 Web Data Management  Main tasks..  Modeling and Querying the Web - view the web as a directed graph - content and link based queries - example - find the pages that contain the word “Clinton” and are linked from a page containing the word “Monica”
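
The example query combines a content condition with a link condition. A minimal sketch over an in-memory directed graph follows; the page identifiers, texts and the function name `pages_linked_from` are hypothetical, and a real system would evaluate this against crawled pages.

```python
# Hypothetical web fragment: page id -> text and outgoing links.
pages = {
    "p1": {"text": "Monica gave an interview", "links": ["p2", "p3"]},
    "p2": {"text": "Clinton spoke today", "links": []},
    "p3": {"text": "Weather report", "links": ["p2"]},
    "p4": {"text": "Clinton archive", "links": ["p1"]},
}


def pages_linked_from(pages, source_word, target_word):
    """Pages containing target_word that are linked from a page
    containing source_word."""
    results = set()
    for page in pages.values():
        if source_word.lower() not in page["text"].lower():
            continue  # content condition on the source page
        for target in page["links"]:  # link condition
            target_page = pages.get(target)
            if target_page and target_word.lower() in target_page["text"].lower():
                results.add(target)
    return sorted(results)


answer = pages_linked_from(pages, "Monica", "Clinton")
print(answer)
```

Page `p4` also mentions “Clinton” but is excluded because no page mentioning “Monica” links to it, which a plain keyword search could not express.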

45 Web Data Management  Main tasks contd.  Information Extraction and integration - wrapper: a program that extracts a structured representation of the data (a set of tuples) from HTML pages - mediator: software that integrates data by accessing multiple sources through a uniform interface  Web Site Construction and Restructuring - creating sites - modeling the structure of web sites - restructuring data
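
A wrapper in the sense described above can be sketched with the standard `html.parser`: it reads an HTML table and returns one tuple per data row. The class name `TableWrapper` and the sample table are assumptions for illustration; real wrappers must cope with far messier markup.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Extracts each <tr> of <td> cells as a tuple of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows that contain no <td> cells
                self.rows.append(tuple(self._row))
            self._row = None


html = ("<table><tr><th>Title</th><th>Year</th></tr>"
        "<tr><td>Movie A</td><td>1955</td></tr>"
        "<tr><td>Movie B</td><td>1960</td></tr></table>")

wrapper = TableWrapper()
wrapper.feed(html)
print(wrapper.rows)
```

Because the extraction depends on the page layout, a change to the markup can silently break it.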

46 Web Data Management  What to model?  Structure of Web sites  Internal structure of web pages  Contents of web sites in finer granularities

47 Web Data Management  Data representation of Web data  Graph Data Models  Semi structured Data Models (also graph based)

48 Web Data Management  Graph data model  Labeled graph data model where nodes represent web pages & arcs represent links between pages  Labels on arcs can be viewed as attribute names  Regular path expression queries
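
A regular path expression query over such a labeled graph can be sketched as follows. The query is a list of steps, each an arc label that is either followed once or followed zero or more times (a Kleene star). This step format, the function `evaluate_path` and the sample graph are assumptions made for the example, not the syntax of any particular query language.

```python
# Hypothetical labeled graph: node -> list of (arc label, target node).
graph = {
    "home": [("dept", "cs")],
    "cs": [("course", "c1"), ("subdept", "ai")],
    "ai": [("course", "c2"), ("subdept", "ml")],
    "ml": [("course", "c3")],
}


def follow(graph, nodes, label):
    """Nodes reachable from `nodes` over exactly one arc with this label."""
    return {target for node in nodes
            for arc_label, target in graph.get(node, [])
            if arc_label == label}


def evaluate_path(graph, start, steps):
    """Evaluate a path expression given as [(label, starred), ...]."""
    current = {start}
    for label, starred in steps:
        if not starred:
            current = follow(graph, current, label)
            continue
        # Starred step: zero or more arcs with this label (closure).
        reached, frontier = set(current), set(current)
        while frontier:
            frontier = follow(graph, frontier, label) - reached
            reached |= frontier
        current = reached
    return current


# Roughly: dept . subdept* . course
courses = evaluate_path(graph, "home",
                        [("dept", False), ("subdept", True), ("course", False)])
print(sorted(courses))
```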

49 Web Data Management  Semi structured data models  Irregular data structure; no fixed schema is known, and it may be implicit in the data  Schema may be large and may change frequently  Schema is descriptive rather than prescriptive; it describes the current state of the data, but violations of the schema are still tolerated

50 Web Data Management  Semi structured data models  Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogeneous sources)  No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes  Ability to query the schemas; arc variables which get bound to labels on arcs, rather than nodes in the graph
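
The properties above can be illustrated with nested Python dictionaries standing in for a labeled graph. In the hypothetical data below, the same attribute has different types in different objects, and the query binds to arc labels instead of fixed attribute names. The function name `labels_with_value` is an assumption for this sketch.

```python
# Hypothetical semi-structured data: "name" and "phone" have different
# types in the two objects, and only one object has an "address".
people = [
    {"name": "Ann", "phone": 5551234, "address": {"city": "Paris"}},
    {"name": {"first": "Bob", "last": "Lee"}, "phone": "555-9876"},
]


def labels_with_value(node, predicate, path=()):
    """Yield the label paths whose leaf value satisfies the predicate.
    The labels are results of the query, like an arc variable."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from labels_with_value(child, predicate, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from labels_with_value(child, predicate, path)
    elif predicate(node):
        yield path


# "Under which labels does the value 'Bob' occur?"
bob_paths = list(labels_with_value(people, lambda v: v == "Bob"))
# "Which labels lead to a value containing 555?" (works for int and str)
phone_paths = set(labels_with_value(people, lambda v: "555" in str(v)))
print(bob_paths, phone_paths)
```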

51 Web Data Management  Graph based Query Languages  Use graphs to model databases  Support regular path expressions and graph construction in queries  Examples - GraphLog for hypertext queries - graph query languages for object-oriented databases

52 Web Data Management  Query languages for semi structured data:  Use labeled graphs  Query the schema of data  Ability to accommodate irregularities in the data, such as missing links etc.  Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

53 Web Data Management  Comparing Query Systems

54 Web Data Management  Types of Query Languages  First Generation  Second Generation

55 Web Data Management  First Generation Query languages  Combine the content-based queries of search engines with structure-based queries  Combine conditions on text pattern in documents with graph pattern describing link structures  Examples – - W3QL (TECHNION, Israel), WebSQL (Toronto), WebLOG (Concordia)

56 Web Data Management  Second Generation Query languages  Called web data manipulation languages  Web pages as atomic objects with properties that they contain or do not contain certain text patterns and they point to other objects  Useful for data wrapping, transformation, and restructuring  Useful for web site transformation and restructuring

57 Web Data Management  How do they differ?  Provide access to the structure of the web objects they manipulate - return structure  Model internal structures of web documents as well as the external links that connect them  Support references to model hyperlinks, and some support ordered collections of records for more natural data representation  Ability to create new complex structures as a result of a query

58 Web Data Management  Examples..  WebOQL  STRUQL  Florid

59 Web Data Management  Information Integration  To answer queries that may require extracting and combining data from multiple web sources  Example - Movie database; data about movies, their star casts, directors, schedules etc.  Give me the playing times and reviews of movies starring Frank Sinatra that are playing tonight in Paris

60 Web Data Management  Approaches
- Web warehouse: data from multiple web sources is loaded into a warehouse, and all queries are applied to the warehouse data
  - Advantage: performance improvement
  - Disadvantage: the warehouse needs to be updated when data sources change
- Virtual warehouse: data remains in the web sources, and queries are decomposed at run time into queries to the sources
  - Data is not replicated and is fresh
  - Because web sources are autonomous, query optimization and execution methodology may differ, and performance may be affected
  - Good when the number of sources is large, data changes frequently, and there is little control over the web sources
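
The trade-off between the two approaches can be shown with a toy sketch. The classes `Source`, `Warehouse` and `VirtualWarehouse` are hypothetical stand-ins: the warehouse copies the data and answers from its copy, while the virtual warehouse forwards each query to the sources.

```python
class Source:
    """A hypothetical autonomous web source holding rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


class Warehouse:
    """Materialized approach: data is copied and queried locally."""

    def __init__(self, sources):
        self.sources = sources
        self.refresh()

    def refresh(self):
        self.data = [row for source in self.sources for row in source.rows]

    def query(self, predicate):
        return [row for row in self.data if predicate(row)]


class VirtualWarehouse:
    """Virtual approach: each query is sent to the sources at run time."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for source in self.sources for row in source.query(predicate)]


source_a = Source([{"title": "Movie A", "city": "Paris"}])
source_b = Source([{"title": "Movie B", "city": "Lyon"}])
warehouse = Warehouse([source_a, source_b])
virtual = VirtualWarehouse([source_a, source_b])

# A source changes after the warehouse was loaded.
source_a.rows.append({"title": "Movie C", "city": "Paris"})


def in_paris(row):
    return row["city"] == "Paris"


stale = warehouse.query(in_paris)      # misses the new row
fresh = virtual.query(in_paris)        # sees the new row
warehouse.refresh()
refreshed = warehouse.query(in_paris)  # up to date again
print(len(stale), len(fresh), len(refreshed))
```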

61 Web Data Management  Virtual approach vs. DBMS  In the virtual approach, the system does not communicate directly with a storage manager; instead, it communicates with wrappers  The user does not pose queries directly in the schema in which the data is stored, and so is free from knowing its structure  The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application

62 Web Data Management  Data Integration Steps  Specification of mediated schema and reformulation – Mediated schema is the set of collection and attribute names needed to formulate queries - Data integration system translates the query on the mediated schema into queries on the data sources  Completeness of data in web sources  Differing query processing capabilities  Query Optimization – selecting a set of minimal sources and minimal queries  Wrapper construction  Matching objects across sources
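
Several of these steps (a mediated schema, reformulation into source queries, wrappers, and matching objects across sources) can be sketched together on a movie example. Everything named here is hypothetical: the two sources, their attribute names, the mappings, and the simple title normalization used for matching.

```python
# Two hypothetical sources, each with its own attribute names.
SOURCES = {
    "showtimes": [
        {"film": "Movie A", "city": "Paris", "time": "20:00"},
        {"film": "Movie B", "city": "Lyon", "time": "21:00"},
    ],
    "reviews": [
        {"movie_title": "movie  a", "review": "Worth seeing"},
    ],
}

# Mediated schema attribute -> source attribute, one mapping per source.
MAPPINGS = {
    "showtimes": {"title": "film", "city": "city", "time": "time"},
    "reviews": {"title": "movie_title", "review": "review"},
}


def wrap(source_name):
    """Wrapper: present a source's rows in mediated-schema terms."""
    mapping = MAPPINGS[source_name]
    return [{mediated: row[native] for mediated, native in mapping.items()}
            for row in SOURCES[source_name]]


def match_key(title):
    """Object matching: normalize case and spacing before comparing."""
    return " ".join(title.lower().split())


def movies_playing_in(city):
    """A query on the mediated schema, answered from both sources."""
    reviews = {match_key(r["title"]): r["review"] for r in wrap("reviews")}
    return [
        {"title": s["title"], "time": s["time"],
         "review": reviews.get(match_key(s["title"]))}
        for s in wrap("showtimes") if s["city"] == city
    ]


result = movies_playing_in("Paris")
print(result)
```

The caller only ever uses the mediated attribute names (`title`, `city`, `time`, `review`) and never sees how each source stores its data.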

