1 Class Number – CS 412 Web Data MGMT and XML Instructor – Sanjay Madria Lesson Title - Introduction.

Presentation transcript:

1 Class Number – CS 412 Web Data MGMT and XML Instructor – Sanjay Madria Lesson Title - Introduction

2 The link for the Real Player live stream for the is: The link to view the archived Real Player lecture at 28 and 56 kbs is: c082803kbs2856.rm (The lecture date section will change for each produced class) The link to view the Real Player archived lecture at 200 kbs is: c082803kbs200.rm For example, to watch the lecture using real player for say 15 th Sept, you modify the date as “ CS412Lec091503kbs200.rm”

3 Web Data Management and XML Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla

4 WWW Huge, widely distributed, heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.

5 World Wide Web Web is fast growing More business organizations putting information on the Web Business on the highway Myriad of raw data to be processed for information

6 As WWW grows, the more chaotic it becomes Web is a fast growing, distributed, non-administered global information resource WWW allows access to text, image, video, sound and graphic data More business organizations creating web servers More chaotic environment in which to locate information of interest Lost in hyperspace syndrome

7 Characteristics of WWW WWW is a set of directed graphs Data in the WWW is heterogeneous, self-describing and schemaless Unstructured information, deeply nested No central authority to manage information Dynamic versus static information Web information discovery - search engines

8 Web is Growing! In 1994, WWW grew by 1758 % !! June June Dec ,576 April ,768 July , !!!!!

9 ‘COM’ domains are increasing! As of July 1995, 6.64 million host computers on the Internet: –1.74 million are ‘com’ domains –1.41 million are ‘edu’ domains –0.30 million are ‘net’ –0.27 million are ‘gov’ –0.22 million are ‘mil’ –0.20 million are ‘org’

10 The number of Internet hosts exceeded in in in in in in 2000

11 Top web countries 1. Canada (1) 80%; 2. US (4) 140%; 3. Ireland (3) 110%; 4. Iceland (2) 68%; 5. UK (14) 336%; 6. Malta (5) 155%; 7. Australia (6) 133%; 8. Singapore (11) 207%; 9. New Zealand (7); 10. Sweden (9) 101%; 11. Israel (12) 112%; 12. Cyprus (8) 72%; 13. Hong Kong (15) 148%; 14. Norway (10) 64%; 15. Switzerland (13) 75%; 16. Denmark (16) 105%

12 How users find web sites Indexes and search engines 75 UseNet newsgroups 44 Cool lists 27 New lists 24 Listservers 23 Print ads 21 Word-of-mouth and 17 Linked web advertisement 4

13 Limitations of Search Engines Do not exploit hyperlinks Search is limited to string matching Queries are evaluated on archived data rather than up-to-date data; no indexing on current data Low accuracy Replicated results No further manipulation possible

14 Limitations of Search Engines ERROR 404! No efficient document management Query results cannot be further manipulated No efficient means for knowledge discovery

15 More PROBLEMS Specifying/understanding what information is wanted High degree of variability of accessible information Variability in conceptual vocabulary or “ontology” used to describe information Complexity of querying unstructured data

16 Complexity of querying structured data Uncontrolled nature of web-based information content Determining which information sources to search/query

17 Search Engine Capabilities –Selection of language –Keywords with disjunction, adjacency, presence, absence,... –Word stemming (Hotbot) –Similarity search (Excite) –Natural language (LycosPro) –Restrict by modification date (Hotbot) or range of dates (Alta Vista) –Restrict result types (e.g., must include images) (Hotbot) –Restrict by geographical source (content or domain) (Hotbot) –Restrict within various structured regions of a document (titles or URLs) (Lycos Pro); (summary, first heading, title, URL) (Opentext)

18 SEARCH & RETRIEVAL Search Engines Search engine: % of web covered. Hotbot: 34; AltaVista: 28; Northern Light: 20; Excite: 14; Infoseek: 10; Lycos: 3. Using several search engines is better than using only one. Source: Lawrence, S., and Giles, C.L., “Searching the World Wide Web,” Science 280, 1998.

19 Schemes to locate information Supervised links between sites –ask at the reference desk Classification of documents –search in the catalog Automated searching –wander around the library

20 The most popular search engines Year 2000 AltaVista Yahoo HotBot Year 2001 Google NorthernLight AltaVista

21 Boolean search in AltaVista

22 Specifying field content in HotBot

23 Natural language interface in AskJeeves

24 Three examples of search strategies Rank web pages based on popularity Rank web pages based on word frequency Match query to an expert database All the major search engines use a mixed strategy in ranking web pages and responding to queries

25 Rank based on word frequency Library analogue: Keyword search Basic factors in HotBot ranking of pages: –words in the title –keyword meta tags –word frequency in the document –document length
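As a rough illustration of the ranking factors listed on this slide, the sketch below scores a page from title matches, keyword meta tags, and word frequency normalized by document length. The function `score_page`, the weights, and the sample pages are hypothetical; the slides do not give HotBot's actual formula.

```python
from collections import Counter

def score_page(query_terms, title, meta_keywords, body):
    """Toy relevance score; all weights are illustrative assumptions."""
    terms = [t.lower() for t in query_terms]
    title_words = set(title.lower().split())
    meta = {k.lower() for k in meta_keywords}
    body_words = body.lower().split()
    counts = Counter(body_words)
    length = max(len(body_words), 1)

    score = 0.0
    for t in terms:
        if t in title_words:
            score += 3.0                # word appears in the title
        if t in meta:
            score += 2.0                # word appears in the keyword meta tags
        score += counts[t] / length     # word frequency, normalized by document length
    return score

# Hypothetical pages: (title, meta keywords, body text)
pages = {
    "a.html": ("French wines", ["wine", "france"], "wine regions of france and wine tasting"),
    "b.html": ("Linux news", ["linux"], "a short note that mentions wine once"),
}
ranked = sorted(pages, key=lambda p: score_page(["wine"], *pages[p]), reverse=True)
```

Dividing by the document length keeps a long page from outranking a short one merely because it repeats the word more often.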

26 Alternative word frequency measures Excite uses a thesaurus to search for what you want, rather than what you ask for AltaVista allows you to look for words that occur within a set distance of each other NorthernLight weights results by search term sequence, from left to right

27 Rank based on popularity Library analogue: citation index The Google strategy for ranking pages: –Rank is based on the number of links to a page –Pages with a high rank have a lot of other web pages that link to them –The formula is on the Google help page
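The idea on this slide can be sketched as a plain in-link count. This is a deliberately simplified stand-in, not Google's formula: the published PageRank method also weights each link by the rank of the page it comes from. The `links` data and the function name `inlink_rank` are hypothetical.

```python
from collections import Counter

def inlink_rank(links):
    """Rank pages by how many distinct pages link to them."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links
                counts[target] += 1
    # Most linked-to pages first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical link structure: page -> pages it links to
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html", "b.html"],
}
ranking = inlink_rank(links)
```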

28 More on popularity ranking The Google philosophy is also applied by others, such as NorthernLight HotBot measures the popularity of a page by how frequently users have clicked on it in past search results

29 Expert databases: Yahoo! An expert database contains predefined responses to common queries A simple approach is subject directory, e.g. in Yahoo!, which contains a selection of links for each topic The selection is small, but can be useful Library analogue: Trustworthy references

30 Expert databases: AskJeeves AskJeeves has predefined responses to various types of common queries These prepared answers are augmented by a meta-search, which searches other SEs Library analogue: Reference desk

31 Best wines in France: AskJeeves

32 Best wines in France: HotBot

33 Best wines in France: Google

34 Linux in Iceland: Google

35 Linux in Iceland: HotBot

36 Linux in Iceland: AskJeeves

37 Web Data Management is the Key

38 Key Objectives Design a suitable data model to represent web information Development of web algebra and query language, query optimization Maintenance of Web data - View Maintenance Development of knowledge discovery and web mining tools Web warehouse Web data integration, secondary storage, indexes

39 Limitations of the Web Today Applications cannot consume HTML HTML wrapper technology is brittle Companies merge, need interoperability fast

40 Paradigm Shift New Web standards – XML XML generated by applications and consumed by applications Data exchange –Across platforms: enterprise interoperability –Across enterprises Web : from documents to data

41 Database challenges Query optimization and processing Views and transformations Data warehousing and data integration Mediators and query rewriting Secondary storage, indexes

42 DBMS needs paradigm shift too Web data differs from database data: self-describing, schemaless; structure changes without notice; heterogeneous, deeply nested, irregular; documents and data mixed; designed by document experts, not database experts. Need web data mgmt

43 Web Data Representation HTML - Hypertext Markup Language –fixed grammar, no regular expressions –Simple representation of data –good for simple data and intended for human consumption –difficult to extract information SGML - Standard Generalized Markup Language - good for publishing deeply structured documents XML - Extensible Markup Language - a subset of SGML
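To make the contrast with HTML concrete, here is a minimal sketch of an application consuming XML directly with Python's standard `xml.etree.ElementTree` module. The element names (`movies`, `movie`, `title`, `star`) and the sample data are hypothetical. Because the tags describe the data, no layout-dependent scraping is needed, and the second record shows that missing fields are tolerated.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing document; the second movie is irregular
# (no <star> element and no year attribute).
doc = """
<movies>
  <movie year="1955">
    <title>Movie A</title>
    <star>Frank Sinatra</star>
  </movie>
  <movie>
    <title>Movie B</title>
  </movie>
</movies>
"""

root = ET.fromstring(doc)
records = [
    (m.findtext("title"), m.findtext("star"), m.get("year"))
    for m in root.findall("movie")
]
```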

44 Terminology HTML - Hypertext Mark-up Language HTTP - Hypertext Transfer Protocol URL - Uniform Resource Locator example - <url> := <protocol>://<host>/<path>/<filename>[<#location>] where –protocol is http, ftp, gopher –host is internet address … –#location is a textual label in the file.
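The parts of a URL named on this slide can be pulled apart with `urllib.parse.urlsplit` from the Python standard library. The URL below is a made-up example, and the variable names simply mirror the slide's terminology.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "http://www.example.edu/courses/cs412/intro.html#syllabus"

parts = urlsplit(url)
protocol = parts.scheme                                  # http, ftp, gopher, ...
host = parts.netloc                                      # internet address of the server
directory, _, filename = parts.path.rpartition("/")      # path and file name
location = parts.fragment                                # textual label inside the file
```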

45 Links are specified as <A HREF="destination URL">Anchor Text</A>, where “destination URL” is the URL of the destination document and Anchor Text is the text that appears as an anchor when displayed. Example: <A HREF="…">Nanyang Technological University</A> Absolute and relative URLs: <A HREF="…">New York</A> is relative; <A HREF="…">NCSA's Beginner's Guide to HTML</A> is an absolute address
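A small sketch of how a program might collect anchors from a page and resolve relative URLs against the page's own address, using only the standard library (`html.parser.HTMLParser` and `urllib.parse.urljoin`). The class name `AnchorCollector`, the sample markup, and all URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            absolute = urljoin(self.base_url, self._href)   # resolves relative URLs
            self.links.append((absolute, "".join(self._text).strip()))
            self._href = None

page = ('<A HREF="NewYork.html">New York</A> and '
        '<A HREF="http://www.example.org/guide.html">Guide</A>')
collector = AnchorCollector("http://www.example.edu/travel/index.html")
collector.feed(page)
```

The relative link is interpreted with respect to the directory of the containing page, while the absolute link is left unchanged.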

46 World Wide Web Prevalent, persistent and informative HTML documents (soon, XML) created by humans or applications. Can database technology help? Persistent HTML documents!!! Accessed day in and day out by humans and applications.

47 Current Research Projects Web Query System –W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus Semistructured Data Management –LOREL, UnQL, WebOQL, Florid Website Management System –STRUDEL, Araneus Web Warehouse –WHOWEDA, Xyleme

48 Main Tasks Modeling and Querying the Web –view web as directed graph –content and link based queries –example - find the pages that contain the word “clinton” and have a link from a page containing the word “monica”.
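The example query on this slide combines a content condition with a link condition. A minimal sketch over a tiny, made-up web graph follows; the page identifiers, texts, and the function `linked_pages` are hypothetical.

```python
# Hypothetical web graph: page text plus outgoing links.
pages = {
    "p1": "news about monica",
    "p2": "a page mentioning clinton",
    "p3": "another clinton page",
}
links = {"p1": ["p2"], "p2": ["p3"], "p3": []}

def linked_pages(pages, links, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for source, targets in links.items():
        if source_word in pages[source].split():          # content condition on the source
            for target in targets:                        # link condition
                if target_word in pages[target].split():  # content condition on the target
                    result.add(target)
    return sorted(result)

answer = linked_pages(pages, links, "monica", "clinton")
```

Page `p3` also contains the word but is not returned, because the page linking to it does not satisfy the content condition.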

49 Information Extraction and integration –Wrapper - program that extracts a structured representation of the data (a set of tuples) from HTML pages –Mediator - data-integration software that accesses multiple sources through a uniform interface Web Site Construction and Restructuring –creating sites –modeling the structure of web sites –restructuring data
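A wrapper in the sense used on this slide can be sketched as a small parser that turns table rows of an HTML page into tuples. The class `TableWrapper` and the sample page are hypothetical, and a real wrapper would have to cope with much messier markup, which is why the slides call the technology brittle.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Turns each <tr> of an HTML table into a tuple of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # cells of the row being read, or None outside a row
        self._cell = None    # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical movie-schedule page.
page_html = """
<table>
  <tr><td>Movie A</td><td>20:30</td></tr>
  <tr><td>Movie B</td><td>22:00</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(page_html)
tuples = wrapper.rows
```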

50 What to Model Structure of Web sites Internal structure of web pages Contents of web sites in finer granularities

51 Data Representation of Web Data Graph Data Models Semistructured Data Models (also graph based)

52 Graph Data Model Labeled graph data model where nodes represent web pages and arcs represent links between pages. Labels on arcs can be viewed as attribute names. Regular path expression queries
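A regular path expression query over a labeled graph can be sketched as follows. The path syntax here is a small assumption made for illustration: a list of arc labels, where a trailing `*` on a label means "zero or more arcs with that label". The graph and node names are made up, and real languages support richer expressions.

```python
# Hypothetical labeled graph: (source node, arc label, target node).
edges = [
    ("home", "dept", "cs"),
    ("cs", "course", "cs412"),
    ("cs412", "next", "lecture1"),
    ("lecture1", "next", "lecture2"),
    ("cs", "staff", "people"),
]

def step(nodes, label):
    """Nodes reachable from `nodes` over one arc carrying `label`."""
    return {t for (s, lab, t) in edges if s in nodes and lab == label}

def evaluate(start, path):
    """Evaluate a path such as ['dept', 'course', 'next*'] from `start`."""
    current = {start}
    for item in path:
        if item.endswith("*"):
            label, frontier = item[:-1], set(current)
            while frontier:                       # closure over arcs labeled `label`
                frontier = step(frontier, label) - current
                current |= frontier
        else:
            current = step(current, item)
    return current

result = evaluate("home", ["dept", "course", "next*"])
```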

53 Semistructured Data Models Irregular data structure; no fixed schema is known, and the schema may be implicit in the data Schema may be large and may change frequently Schema is descriptive rather than prescriptive; it describes the current state of the data, but violations of the schema are still tolerated

54 Data is not strongly typed; for different objects the values of the same attribute may be of differing types (heterogeneous sources) No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes Ability to query the schema; arc variables get bound to labels on arcs, rather than to nodes in the graph
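The properties above can be illustrated with nested Python dictionaries standing in for a labeled graph. In this hypothetical data the same attribute has different types in different objects, and the loop at the end plays the role of an arc variable: it ranges over labels rather than over nodes. All names and values are made up.

```python
# Two "person" objects from heterogeneous sources: the same attribute has
# different types, and one object lacks an attribute the other has.
db = {
    "person": [
        {"name": "A. Smith", "phone": 5551234,
         "address": {"city": "Rolla", "state": "MO"}},
        {"name": {"first": "B.", "last": "Jones"},
         "phone": ["555-0000", "555-1111"]},
    ]
}

def arcs(obj, path=()):
    """Yield (path, label, value type) for every arc in the nested structure."""
    if isinstance(obj, dict):
        for label, child in obj.items():
            yield path, label, type(child).__name__
            yield from arcs(child, path + (label,))
    elif isinstance(obj, list):
        for child in obj:
            yield from arcs(child, path)

# Query the schema itself: which labels occur under "person", and with which types?
person_schema = {}
for path, label, type_name in arcs(db):
    if path == ("person",):
        person_schema.setdefault(label, set()).add(type_name)
```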

55 Graph based Query Languages Use graph to model databases Support regular path expressions and graph construction in queries. Examples: GraphLog for hypertext queries; graph query language for OO

56 Query Languages for Semi-Structured data Use labeled graphs Query the schema of data Ability to accommodate irregularities in the data, such as missing links etc. Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

57 Comparison of Query Systems

58 Types of Query Languages First Generation Second generation

59 First Generation Query Languages Combine the content-based queries of search engines with structure-based queries Combine conditions on text pattern in documents with graph pattern describing link structures Examples - W3QL (TECHNION, Israel) WebSQL (Toronto), WebLOG (Concordia)

60 Second generation languages Called web data manipulation languages First-generation languages treat web pages as atomic objects with two properties: they contain or do not contain certain text patterns, and they point to other objects Second-generation languages are useful for data wrapping, transformation, and restructuring Useful for web site transformation and restructuring They give access to the internal structure of web pages; for example, extracting a set of tuples from the web pages of a movie database requires parsing the pages and selectively accessing certain subtrees in the parse tree

61 How they Differ? Provide access to the structure of the web objects they manipulate - return structure Model internal structures of web documents as well as the external links that connect them Support references to model hyperlinks, and some support ordered collections of records for more natural data representation Ability to create new complex structures as a result of a query

62 Examples Web OQL STRUQL Florid

63 Information Integration To answer queries that may require extracting and combining data from multiple web sources Example - Movie database; data about movies, their star casts, directors, schedule etc. Give me the playing times and reviews of movies starring Frank Sinatra, playing tonight in Paris

64 Approaches Web warehouse – Data from multiple web sources is loaded into a warehouse, all queries are applied to warehouse data –Advantage - Performance improvement –Disadvantage - Warehouse needs to be updated when data sources change Virtual warehouse – Data remain in the web sources, queries are decomposed at run time into queries to sources –Data is not replicated and is fresh –Due to autonomy of web sources, query optimization and execution methodology may differ and performance may be affected –Good when the number of sources is large, data changes frequently, and there is little control over web sources

65 Virtual approach versus DBMS In the virtual approach, the query processor does not communicate directly with a storage manager; instead it communicates with wrappers Second, the user does not pose queries directly in the schema in which data is stored, so the user is free from knowing the structure The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application

66 Steps in data integration Specification of mediated schema and reformulation – Mediated schema is the set of collection and attribute names needed to formulate queries –Data integration system translates the query on the mediated schema into a query to data source Completeness of data in web sources Differing query processing capabilities Query Optimization – selecting a set of minimal sources and minimal queries Wrapper construction Matching objects across sources
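The integration steps can be sketched for the movie example used earlier in the lecture. Everything below is an assumption made for illustration: the three wrapper functions stand in for real web sources, the mediated schema is (title, time, review), and object matching is reduced to case-insensitive title comparison.

```python
# Hypothetical wrappers: each returns tuples extracted from one web source.
def showtimes_source():
    return [{"title": "Movie A", "city": "Paris", "time": "20:30"},
            {"title": "Movie B", "city": "Paris", "time": "22:00"}]

def cast_source():
    return [{"film": "movie a", "star": "Frank Sinatra"},
            {"film": "movie b", "star": "Someone Else"}]

def reviews_source():
    return [{"film": "MOVIE A", "review": "A classic."}]

def normalize(title):
    """Object matching across sources: compare titles case-insensitively."""
    return title.strip().lower()

def mediated_query(star, city):
    """Answer a query on the mediated schema (title, time, review) by
    sending sub-queries to the sources and joining the results on title."""
    starring = {normalize(c["film"]) for c in cast_source() if c["star"] == star}
    reviews = {normalize(r["film"]): r["review"] for r in reviews_source()}
    return [
        (s["title"], s["time"], reviews.get(normalize(s["title"])))
        for s in showtimes_source()
        if s["city"] == city and normalize(s["title"]) in starring
    ]

answer = mediated_query("Frank Sinatra", "Paris")
```

The user only sees the mediated schema; which source holds the cast, the schedule, or the reviews, and how each one names its attributes, stays hidden behind the wrappers.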