Web Data Management COSC 4806

Introduction  The ‘World Wide Web’: a vast, widely distributed collection of semi-structured multimedia documents  heterogeneous collection of documents  documents in the form of web pages  documents connected via hyperlinks

World Wide Web  The web is growing rapidly  Business organizations increasingly presenting information on the Web  ‘Business on the highway’  Myriad of raw data to be processed for information

World Wide Web  The web is a fast growing, distributed & non-administered global information resource  WWW allows access to text, images, video, sound and graphical data  Ever-increasing number of businesses building web servers  A chaotic environment to locate information of interest  Lost in hyperspace syndrome

World Wide Web  Characteristics of the WWW :  it’s a set of directed graphs  data is heterogeneous, self-describing & schema less  unstructured, deeply nested information  no central authority for information management  dynamic information vs. static information  web information discovery – search engines

World Wide Web  Rapid growth of web:  In 1994, WWW grew by 1758%  June  June  Dec ,576  April ,768  July ,000+  January 2005: 11.5 billion publicly indexed web pages

World Wide Web .com domains on the rise, as of July 2006:  76,683,115 hosts for ‘com’ domains  10,232,188 hosts for ‘edu’ domains  185,919,955 hosts for ‘net’ domains  727,773 hosts for ‘gov’ domains  1,933,551 hosts for ‘mil’ domains  1,660,470 hosts for ‘org’ domains

World Wide Web  The exponential growth of the Internet is reflected in the number of hosts on the net  in 1984  in 1987  in 1989  in 1992  in 1996  in 2000  171,638,297 in 2003  489,774,269 in July 2007  Sources: Net Timeline, Internet Domain Survey

World Wide Web  Distribution of hosts (worldwide)  US 195,138,696  European Union 22,000,414  Japan 21,304,292  Germany 7,657,162  Netherlands 6,781,729  South Korea 5,433,591  Australia 5,351,622  UK 4,688,307  Brazil 4,392,693  Taiwan 3,838,383

World Wide Web  Popular search methods  77%  Search engine 63%  Get news 46%  Job related search 29%  Instant messaging 18%  Online banking 18%  Chat room 8%  Travel reservation 5%  Read blogs 3%  Online auction 3%

World Wide Web  Key limitations of search engines:  do not exploit hyperlinks  search limited to string matching  queries evaluated on archived data rather than up-to-date data; no indexing on current data  low accuracy; replicated results  no further manipulation possible

World Wide Web  Key limitations of search engines (contd.):  ERROR 404!  No efficient document management  Query results cannot be further manipulated  No efficient means for knowledge discovery

World Wide Web  more issues..  specifying/understanding what information is wanted  the high degree of variability of accessible information  the variability in conceptual vocabulary or “ontology” used to describe information  complexity of querying unstructured data

World Wide Web  contd.  complexity of querying structured data  uncontrolled nature of web-based information content  determining which information sources to search/query

World Wide Web  Search Engines capabilities:  Selection of language  Keywords with disjunction, adjacency, presence, absence,...  Word stemming (Hotbot)  Similarity search (Excite)  Natural language (LycosPro)  Restrict by modification date (Hotbot) or range of dates (AltaVista)  Restrict result types (e.g., must include images) (Hotbot)  Restrict by geographical source (content or domain) (Hotbot)  Restrict within various structured regions of a document (titles or URLs) (LycosPro); (summary, first heading, title, URL) (Opentext)

World Wide Web  Search & Retrieval..  Using several search engines is better than using only one  Search engine (% of web covered): Hotbot 34, AltaVista 28, Northern Light 20, Excite 14, Infoseek 10, Lycos 3

World Wide Web  Schemes to locate information:  Supervised links between sites  ask at the reference desk  Gopher (Univ. of Minnesota): menu format with links both to sites and content  Classification of documents  search in the catalog  Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites  Automated searching  wander around the library  Use META tags to gather metadata  Spiders (robots, web-crawlers)

World Wide Web  Popular search engines..  Year 2000: AltaVista, Yahoo, HotBot  Year 2001: Google, NorthernLight, AltaVista

World Wide Web  Boolean search in Alta vista..

World Wide Web  Specifying field content in HotBot..

World Wide Web  Natural language interface in AskJeeves

World Wide Web  Examples of search strategies:  Rank web pages based on popularity  Rank web pages based on word frequency  Match query to an expert database  The major search engines use a mixed strategy

World Wide Web  Frequency based ranking:  Library analogue: Keyword search  Basic factors in HotBot ranking of pages: - words in the title - keyword meta tags - word frequency in the document - document length
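
The four factors listed above could be combined into a toy score, sketched below. The `score_page` helper, its weights and the sample pages are assumptions made for illustration; the slides do not give HotBot's actual formula.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_page(query, title, meta_keywords, body,
               w_title=3.0, w_meta=2.0, w_body=1.0):
    """Score one page for a query using title, meta keywords, frequency and length.

    The weights are illustrative assumptions, not any engine's real values.
    """
    title_words = set(tokenize(title))
    meta_words = set(tokenize(meta_keywords))
    body_words = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += w_title
        if term in meta_words:
            score += w_meta
        # Divide by document length so long pages do not win on size alone.
        if body_words:
            score += w_body * body_words.count(term) / len(body_words)
    return score

# Hypothetical pages: (title, meta keywords, body text).
pages = {
    "a.html": ("French wine guide", "wine, france", "wine regions of france and wine tasting"),
    "b.html": ("Travel notes", "travel", "we tasted one wine in france"),
}
ranked = sorted(pages, key=lambda url: score_page("wine france", *pages[url]), reverse=True)
print(ranked)
```

Here `a.html` ranks first because the query terms appear in its title and meta keywords as well as in its body.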

World Wide Web  Alternative word frequency measures:  Excite uses a thesaurus to search for what you want, rather than what you ask for  AltaVista allows you to look for words that occur within a set distance of each other  NorthernLight weighs results by search term sequence, from left to right
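
A proximity condition of the kind attributed to AltaVista above can be sketched as a check on word positions. The function name `within_distance` and the sample sentence are assumptions; this is not AltaVista's implementation.

```python
import re

def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap word positions apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

sentence = "Linux users in Iceland meet monthly"
print(within_distance(sentence, "linux", "iceland", 3))
print(within_distance(sentence, "linux", "iceland", 2))
```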

World Wide Web  Popularity based ranking:  Library analogue: citation index  The Google strategy for ranking pages: - Rank is based on the number of links to a page - Pages with a high rank have a lot of other web pages that link to them - The formula is on the Google help page
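
As a minimal sketch of link-based popularity as described on this slide, the code below ranks pages purely by how many other pages link to them. The link data is hypothetical. Google's published PageRank method goes further by also weighting each link by the rank of the linking page, which this sketch does not attempt.

```python
from collections import Counter

def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps a page to the list of pages it points to (hypothetical data).
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # a repeated link from one page counts once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}
counts = inlink_counts(links)
ranking = sorted(counts, key=lambda page: (-counts[page], page))
print(ranking)
```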

World Wide Web  More on popularity ranking:  The Google philosophy is also applied by others, such as NorthernLight  HotBot measures popularity of a page by how frequently users have clicked on it in past search results

World Wide Web  Expert Databases, Yahoo  An expert database contains predefined responses to common queries  A simple approach is subject directory, e.g. in Yahoo!, which contains a selection of links for each topic  The selection is small, but can be useful  Library analogue: Trustworthy references

World Wide Web  Expert Databases, AskJeeves  AskJeeves has predefined responses to various types of common queries  These prepared answers are augmented by a meta-search, which searches other SEs  Library analogue: Reference desk
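
A meta-search has to merge the ranked lists returned by several engines. The sketch below uses a simple reciprocal-rank sum, which is an assumed merging rule chosen for illustration; the slides do not say how AskJeeves combined results, and the URLs are placeholders.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one de-duplicated ranking.

    Each URL scores 1/rank for every engine that returned it (an assumed rule).
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    return sorted(scores, key=lambda url: (-scores[url], url))

engine_a = ["x.com", "y.com", "z.com"]
engine_b = ["y.com", "w.com"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

A URL returned by both engines (`y.com`) rises above one returned by a single engine, and duplicates appear only once in the merged list.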

World Wide Web  Example, best wines in France; AskJeeves

World Wide Web  Best wines in France; HotBot

World Wide Web  Best wines in France; Google

World Wide Web  Linux in Iceland; Google

World Wide Web  Linux in Iceland; HotBot

World Wide Web  Linux in Iceland; AskJeeves

Web Data Management  Web Data Management; key objectives  Design a suitable data model to represent web information  Development of web algebra and query language, query optimization  Maintenance of Web data - view maintenance  Development of knowledge discovery and web mining tools  Web warehouse  Data integration, secondary storages, indexes

Web Data Management  Limitations of the web..  Applications cannot consume HTML  HTML wrapper technology is brittle  Companies merge, need interoperability

Web Data Management  Paradigm Shift  New Web standards – XML  XML generated by applications and consumed by applications  Data exchange - Across platforms: enterprise interoperability - Across enterprises Web : from documents to data
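
The idea of XML being generated by one application and consumed by another can be sketched with Python's standard `xml.etree.ElementTree` module. The element names and the movie record are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Producer side: an application serialises a record as XML.
movie = ET.Element("movie", attrib={"year": "1955"})
ET.SubElement(movie, "title").text = "Guys and Dolls"
ET.SubElement(movie, "star").text = "Frank Sinatra"
payload = ET.tostring(movie, encoding="unicode")

# Consumer side: another application parses the XML back into typed fields.
parsed = ET.fromstring(payload)
record = {
    "title": parsed.findtext("title"),
    "star": parsed.findtext("star"),
    "year": int(parsed.get("year")),
}
print(record)
```

Because the tags name the fields, the consumer needs no page-specific scraping logic, in contrast to extracting the same data from HTML.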

Web Data Management  Database challenges:  Query optimization and processing  Views and transformations  Data warehousing and data integration  Mediators and query rewriting  Secondary storages  Indexes

Web Data Management  DBMS needs paradigm shift too  Web data differs from database data - self describing, schema less, - structure changes without notice, - heterogeneous, deeply nested, - irregular documents and data mixed - designed by document expert, but not DB expert - need Web Data Management

Web Data Management  Web data representation  HTML - Hypertext Markup Language - fixed grammar, no regular expressions - Simple representation of data - good for simple data and intended for human consumption - difficult to extract information  SGML - Standard Generalized Markup Language - good for publishing deeply structured documents  XML - Extensible Markup Language - a subset of SGML

Web Data Management  Terminology  HTML - Hypertext Markup Language  HTTP - Hypertext Transfer Protocol  URL - Uniform Resource Locator  example - URL := protocol://host/path/filename[#location] where - protocol is http, ftp, gopher - host is internet address … - #location is a textual label in the file
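
The parts of a URL named on this slide can be pulled apart with `urlsplit` from Python's standard `urllib.parse` module. The URL below is a made-up example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.org/docs/intro.html#section2")

# scheme = protocol, netloc = host, path = path and filename, fragment = #location
print(parts.scheme, parts.netloc, parts.path, parts.fragment)
```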

Web Data Management  Prevalent, persistent and informative  HTML documents (now XML) created by humans or applications  Accessed day in and day out by Humans and Applications  Persistent HTML documents  Can database technology help?

Web Data Management  Some recent research projects  Web Query System - W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus  Semi structured Data Management - LOREL, UnQL, WebOQL, Florid  Website Management System - STRUDEL, Araneus  Web Warehouse - WHOWEDA

Web Data Management  Main tasks..  Modeling and Querying the Web - view web as directed graph - content and link based queries - example - find the pages that contain the word “Clinton” and have a link from a page containing the word “Monica”
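
A query that combines a content condition with a link condition, as in the example on this slide, can be sketched over a small in-memory graph. The `pages` dictionary, the helper name and the search words are hypothetical; a real system would evaluate such a query against crawled pages.

```python
def pages_linked_from(pages, target_word, source_word):
    """Find pages containing target_word that are linked from a page containing source_word.

    `pages` maps a URL to a dict with "text" and "links" (a hypothetical in-memory web).
    """
    def contains(url, word):
        return word.lower() in pages[url]["text"].lower().split()

    hits = set()
    for source, info in pages.items():
        if not contains(source, source_word):
            continue
        for target in info["links"]:
            if target in pages and contains(target, target_word):
                hits.add(target)
    return sorted(hits)

pages = {
    "p1": {"text": "Iceland travel notes", "links": ["p2", "p3"]},
    "p2": {"text": "Linux user group", "links": []},
    "p3": {"text": "Fishing reports", "links": ["p2"]},
    "p4": {"text": "Linux kernel news", "links": []},
}
result = pages_linked_from(pages, "linux", "iceland")
print(result)
```

Page `p4` also mentions the target word but is not returned, because no page containing the source word links to it. A keyword-only search could not express that distinction.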

Web Data Management  Main tasks contd.  Information Extraction and integration -wrapper - program to extract a structured representation of the data; a set of tuples from HTML pages. - mediator: integration of data - software that accesses multiple sources from a uniform interface  Web Site Construction and Restructuring - creating sites - modeling the structure of web sites - restructuring data
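
A wrapper in the sense used above, a program that turns an HTML page into a set of tuples, can be sketched with the standard `html.parser` module. The class name `TableWrapper` and the sample page are assumptions, and the sketch handles only plain `<tr>`/`<td>` tables.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

html = """
<table>
  <tr><td>Guys and Dolls</td><td>20:30</td></tr>
  <tr><td>High Society</td><td>22:00</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(html)
print(wrapper.rows)
```

The extraction depends on the exact tag layout of the page, so a site redesign would silently break it. That dependence is the sense in which wrappers are often called brittle.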

Web Data Management  What to model?  Structure of Web sites  Internal structure of web pages  Contents of web sites in finer granularities

Web Data Management  Data representation of Web data  Graph Data Models  Semi structured Data Models (also graph based)

Web Data Management  Graph data model  Labeled graph data model where nodes represent web pages & arcs represent links between pages  Labels on arcs can be viewed as attribute names  Regular path expression queries
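
A regular path expression query over a labeled graph can be approximated as follows: enumerate the simple paths from a start node, join the arc labels with ".", and test the result against a regular expression. This is a simplified stand-in under stated assumptions (bounded depth, simple paths only, Python regex syntax), not the syntax or semantics of any particular web query language. The graph data is hypothetical.

```python
import re

def match_paths(graph, start, pattern, max_depth=4):
    """Return nodes reachable from `start` by a label path matching `pattern`.

    `graph` maps node -> list of (label, target) arcs.
    """
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [], {start})]
    while stack:
        node, labels, seen = stack.pop()
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) == max_depth:
            continue
        for label, target in graph.get(node, []):
            if target not in seen:  # simple paths only, so cycles terminate
                stack.append((target, labels + [label], seen | {target}))
    return sorted(found)

graph = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("course", "db"), ("staff", "alice")],
    "math": [("course", "algebra")],
    "db": [("ref", "home")],
}
courses = match_paths(graph, "home", r"dept\.course")
people_or_courses = match_paths(graph, "home", r"dept\.(course|staff)")
print(courses)
print(people_or_courses)
```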

Web Data Management  Semi structured data models  Irregular data structure; no fixed schema is known in advance, and the schema may be implicit in the data  Schema may be large and may change frequently  Schema is descriptive rather than prescriptive; describes current state of data, but violations of schema still tolerated

Web Data Management  Semi structured data models  Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogeneous sources)  No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes  Ability to query the schemas; arc variables which get bound to labels on arcs, rather than nodes in the graph
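
Two of the properties above, weak typing and the ability to query the schema itself, can be illustrated with nested Python dictionaries standing in for a labeled graph. The sketch walks the data and reports every label path together with the value types found there. The `label_paths` helper and the sample records are assumptions for illustration.

```python
def label_paths(node, prefix=()):
    """Yield (label path, value type name) for every leaf in a nested structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from label_paths(child, prefix + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from label_paths(child, prefix)
    else:
        yield prefix, type(node).__name__

# Two records with different attributes, and "year" stored with different types.
movies = [
    {"title": "Guys and Dolls", "year": 1955, "cast": ["Frank Sinatra", "Marlon Brando"]},
    {"title": "High Society", "year": "1956", "director": {"name": "Charles Walters"}},
]
schema = {}
for path, type_name in label_paths(movies):
    schema.setdefault(".".join(path), set()).add(type_name)
print(schema)
```

The discovered schema is descriptive: it records that `year` holds both integers and strings and that only some records have a `director`, without rejecting either record.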

Web Data Management  Graph based Query Languages  Use graph to model databases  Support regular path expressions and graph construction in queries.  Examples - Graph Log for hypertext queries - graph query language for OO

Web Data Management  Query languages for semi structured data:  Use labeled graphs  Query the schema of data  Ability to accommodate irregularities in the data, such as missing links etc.  Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

Web Data Management  Comparing Query Systems

Web Data Management  Types of Query Languages  First Generation  Second Generation

Web Data Management  First Generation Query languages  Combine the content-based queries of search engines with structure-based queries  Combine conditions on text patterns in documents with graph patterns describing link structures  Examples - W3QL (TECHNION, Israel), WebSQL (Toronto), WebLOG (Concordia)

Web Data Management  Second Generation Query languages  Called web data manipulation languages  Web pages as atomic objects with properties that they contain or do not contain certain text patterns and they point to other objects  Useful for data wrapping, transformation, and restructuring  Useful for web site transformation and restructuring

Web Data Management  How do they differ?  Provide access to the structure of web objects they manipulate - return structure  Model internal structures of web documents as well as the external links that connect them  Support references to model hyperlinks and some support for ordered collections of records for more natural data representation  Ability to create new complex structures as a result of a query

Web Data Management  Examples..  WebOQL  STRUQL  Florid

Web Data Management  Information Integration  To answer queries that may require extracting and combining data from multiple web sources  Example - Movie database; data about movies, their star casts, directors, schedule etc.  Give me the playing times and reviews of movies starring Frank Sinatra that are playing tonight in Paris

Web Data Management  Approaches  Web warehouse: Data from multiple web sources is loaded into a warehouse, all queries are applied to warehouse data - Disadvantage - Warehouse needs to be updated when data sources change - Advantage - Performance Improvement  Virtual warehouse: Data remain in the web sources, queries are decomposed at run time into queries to sources - Data is not replicated and is fresh - Due to autonomy of web sources query optimization and execution methodology may differ and performance may be affected - Good when the number of sources is large, data changes frequently, little control over web sources

Web Data Management  Virtual approach vs. DBMS  In the virtual approach, the system does not communicate directly with a storage manager; instead it communicates with wrappers  Second, the user does not pose queries directly in the schema in which data is stored, so the user is free from knowing the structure  The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application

Web Data Management  Data Integration Steps  Specification of mediated schema and reformulation: Mediated schema is the set of collection and attribute names needed to formulate queries - Data integration system translates the query on the mediated schema into queries on the data sources  Completeness of data in web sources  Differing query processing capabilities  Query Optimization: selecting a set of minimal sources and minimal queries  Wrapper construction  Matching objects across sources
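
The reformulation step can be sketched as a small mediator. The user queries mediated attribute names (`title`, `city`, `showtime`, `review`), each source keeps its own attribute names, and the mediator translates the query per source and joins the answers. The sources, mappings and rows are all hypothetical, and the join is a plain match on title with no query optimization.

```python
# Hypothetical sources: each exposes rows under its own attribute names,
# plus a mapping from mediated attribute names to local ones.
SOURCES = {
    "listings": {
        "rows": [{"film": "High Society", "city": "Paris", "time": "20:30"},
                 {"film": "Vertigo", "city": "Paris", "time": "21:00"}],
        "mapping": {"title": "film", "city": "city", "showtime": "time"},
    },
    "reviews": {
        "rows": [{"name": "High Society", "text": "A charming musical."}],
        "mapping": {"title": "name", "review": "text"},
    },
}

def query_source(source_name, wanted, conditions):
    """Translate mediated attribute names to one source's names and run the query."""
    source = SOURCES[source_name]
    mapping = source["mapping"]
    results = []
    for row in source["rows"]:
        # Apply only the conditions this source is able to evaluate.
        if all(row[mapping[a]] == v for a, v in conditions.items() if a in mapping):
            results.append({a: row[mapping[a]] for a in wanted if a in mapping})
    return results

def mediated_query(wanted, conditions):
    """Answer a query on the mediated schema by joining source answers on title."""
    attrs = set(wanted) | {"title"}
    listings = query_source("listings", attrs, conditions)
    reviews = {r["title"]: r for r in query_source("reviews", attrs, conditions)}
    return [dict(row, **reviews[row["title"]])
            for row in listings if row["title"] in reviews]

answer = mediated_query({"title", "showtime", "review"}, {"city": "Paris"})
print(answer)
```

The user never sees `film`, `name` or `text`; those names stay inside the mappings. Matching objects across sources is reduced here to exact title equality, which real sources rarely allow.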