The WWW as a Database: WWW Query Languages Curtis Dyreson James Cook University ( Townsville, Australia ) Aalborg University.

Outline
- searching the WWW
  - search engines
  - WWW query languages
- WebSQL
  - WWW graph
  - cost
- Jumping Spider
  - hybrid

Searching the WWW
- search engines
  - Altavista, Infoseek, 2100 others!
- static architecture
  - robot: periodic, slow, non-uniform coverage
  - index: keywords to URLs, fast, ranking algorithm
- example query
  - Lecture notes on trees in a data structures course.

A Search Engine Index
[Figure: index entries for the keywords "lecture notes", "trees", and "data structures"]
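
The keyword-to-URL index described above can be sketched in a few lines. This is only an illustration, not how any particular search engine is built: the page URLs and text in PAGES are made up, and the functions build_index and search are hypothetical names.

```python
from collections import defaultdict

# Hypothetical pages (assumption): URL -> page text.
PAGES = {
    "http://a.example/cs/ds/": "data structures course home",
    "http://a.example/cs/ds/notes.html": "lecture notes for data structures",
    "http://a.example/cs/ds/trees.html": "lecture notes on trees",
}

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs of single pages that contain every query keyword."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(PAGES)
```

With this toy data, search(INDEX, "lecture notes") returns the two notes pages, while search(INDEX, "lecture notes trees data structures") returns nothing, because no single page holds all five keywords. The concept is spread over several linked pages, which is the gap a WWW query language tries to fill.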

WWW Query Languages
- search engines index single pages
- multi-page concepts
- hunting strategy
  - search engine to nearby page
  - manual search
- WWW query languages
  - WebSQL, W3QS, WebLog

WWW Graph Structure
- large (650K servers, 350M pages)
- dynamic, cyclic
- link = edge
- page = node

WebSQL
- SQL-like
- search engine to find pages
- path expression (regular expression of links)
- text manipulation predicates
    SELECT FROM WHERE ;

WebSQL From Clause
- from clause collects a set of documents
- unstructured - primitive schema
    Document[URL, text, link to URL, modify date]
- MENTIONS - retrieve from search engine
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y->* z
    WHERE y CONTAINS ‘lecture notes’
      AND z CONTAINS ‘trees’;

WebSQL From Clause
- path expression finds related documents
- URL
    DOCUMENT x SUCH THAT “
- local link: ->
    DOCUMENT y SUCH THAT x -> y
- global link: =>
    DOCUMENT y SUCH THAT x => y

WebSQL From Clause
- at most one link: ?
    DOCUMENT y SUCH THAT x ->(->)? y
- any number of links: *
    DOCUMENT y SUCH THAT x ->* y
- alternation: |
    DOCUMENT y SUCH THAT x (=> | ->*) y
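
Because a path expression is a regular expression over links, one way to experiment with the operators above is to translate it into an ordinary regex. The sketch below is an assumption made for illustration, not the WebSQL implementation: it encodes a local link as the letter l and a global link as the letter g, and the function names to_regex and matches are hypothetical.

```python
import re

def to_regex(path_expr):
    """Translate a WebSQL-style path expression into a Python regex.

    Assumed encoding: "->" (local link) becomes "l", "=>" (global link)
    becomes "g". The operators ?, *, | and parentheses carry over unchanged.
    """
    compact = "".join(path_expr.split())  # drop whitespace
    return compact.replace("->", "l").replace("=>", "g")

def matches(path_expr, links):
    """Check whether a concrete path, given as a list of "->" and "=>"
    links, is accepted by the path expression."""
    encoded = "".join("l" if link == "->" else "g" for link in links)
    return re.fullmatch(to_regex(path_expr), encoded) is not None
```

For example, matches("->(->)?", ["->", "->"]) is true, while a path of three local links is rejected by the same expression.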

WebSQL From Clause: Example
    FROM Document X SUCH THAT X MENTIONS ‘Java’,
         Document Y SUCH THAT X -> | ->-> Y
[Figure: graph animation labeled "Java"]

WebSQL From Clause
- path expression limits search space
- local link, search limited to local machine
- global link, can go anywhere
- =>* would search all of WWW
- pre-analysis, filtering
- even three to four local links infeasible

WebSQL Where Clause
- like SQL
- CONTAINS, text search of retrieved document
- can push CONTAINS into navigation
    WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;

WebSQL Query
Find lecture notes on trees in a data structures course.
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y->* z
    WHERE y CONTAINS ‘lecture notes’
      AND z CONTAINS ‘trees’;

[Figure sequence: evaluating the query on the WWW graph]
- data structures -> lecture notes
- lecture notes ->* trees
- Result: data structures, lecture notes, trees
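
The evaluation steps in the figures can be imitated on a toy graph. The following is a minimal sketch under stated assumptions: the SITE dictionary is invented data, every link is treated as a local link, and a plain scan over all pages stands in for the MENTIONS call to a search engine. It is not the WebSQL engine itself.

```python
from collections import deque

# Hypothetical site (assumption): URL -> (page text, outgoing local links).
SITE = {
    "/ds/": ("data structures course", ["/ds/notes/", "/ds/staff.html"]),
    "/ds/staff.html": ("teaching staff", ["/ds/"]),
    "/ds/notes/": ("lecture notes", ["/ds/notes/lists.html", "/ds/notes/trees.html"]),
    "/ds/notes/lists.html": ("linked lists", []),
    "/ds/notes/trees.html": ("binary trees and heaps", ["/ds/notes/"]),
}

def contains(url, phrase):
    """CONTAINS: text search of a retrieved document."""
    return phrase in SITE[url][0]

def reachable(start):
    """Pages reachable from start over zero or more local links (->*)."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for nxt in SITE[url][1]:
            if nxt not in seen:  # the graph is cyclic, so track visits
                seen.add(nxt)
                queue.append(nxt)
    return seen

def run_query():
    """x MENTIONS 'data structures', x -> y, y ->* z, with the WHERE filters."""
    results = set()
    for x in SITE:  # stand-in for the search engine lookup
        if not contains(x, "data structures"):
            continue
        for y in SITE[x][1]:  # x -> y
            if not contains(y, "lecture notes"):
                continue
            for z in reachable(y):  # y ->* z
                if contains(z, "trees"):
                    results.add(z)
    return results
```

Note that the filter on y is applied before expanding y ->* z, which mirrors the idea of pushing CONTAINS into navigation to prune the search early.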

WebSQL Example

WebSQL Architecture Java implementation

WWW Query Language - Drawbacks
- dynamic architecture
  - O(k**p)
  - p is length of path expression
  - k is branching factor
- a priori knowledge of topology
- back links are a problem
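
A little arithmetic shows why dynamic evaluation is expensive. As a rough sketch, assume every page has exactly k outgoing links that match the path expression and that no page is reached twice; following paths of up to p links then means fetching up to k + k**2 + ... + k**p pages. The function name pages_fetched is hypothetical.

```python
def pages_fetched(k, p):
    """Upper bound on pages fetched when following paths of up to p links,
    assuming each page has exactly k matching outgoing links."""
    return sum(k ** depth for depth in range(1, p + 1))
```

With k = 10, three links already give pages_fetched(10, 3) == 1110 fetches and four links give 11110, each one a separate network request made at query time.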

Jumping Spider - a Hybrid
- like a search engine
  - static architecture
  - keyword searches
- like a WWW query language
  - uses modified WWW graph
  - one kind of path expression

Kinds of Links
- content refinement queries are common
- heuristic: information in subdirectories is refined
- different kinds of links
  - back - subdirectory to parent
  - down - parent directory to subdirectory
  - side - unrelated directories
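
The three kinds of links can be told apart by comparing the directories of the source and target URLs. The sketch below is one possible reading, not the Jumping Spider code: the function names are hypothetical, links between different hosts are assumed to be side links, and a link within the same directory is assumed to count as a down link.

```python
import posixpath
from urllib.parse import urlsplit

def directory(url):
    """Return (host, directory path ending in "/") for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    folder = path if path.endswith("/") else posixpath.dirname(path)
    if not folder.endswith("/"):
        folder += "/"
    return parts.netloc, folder

def classify(src, dst):
    """Classify the link src -> dst as "down", "back", or "side"."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"  # assumption: different hosts are unrelated
    if dst_dir.startswith(src_dir):
        return "down"  # parent directory to subdirectory (or same directory)
    if src_dir.startswith(dst_dir):
        return "back"  # subdirectory to parent
    return "side"      # unrelated directories
```

For instance, a link from http://a.example/cs/index.html to http://a.example/cs/ds/notes.html is classified as down, and the reverse link as back.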

Re-using the WWW Graph

Directory Trees

Down Links

Back Links

Eliminate Back Links

Transitive Closure of Down Links
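
Once back links are eliminated, the down links form a tree-like structure, and their transitive closure can be computed by a simple traversal. This is a minimal sketch assuming the down links are acyclic; the input format and the name transitive_closure are hypothetical.

```python
def transitive_closure(down_links):
    """Given node -> list of directly linked children (down links only,
    assumed acyclic), return node -> set of all descendants."""
    closure = {}

    def descendants(node):
        if node not in closure:
            found = set()
            for child in down_links.get(node, []):
                found.add(child)
                found |= descendants(child)
            closure[node] = found
        return closure[node]

    for node in down_links:
        descendants(node)
    return closure
```

In the closed graph every descendant is a single link away, so one kind of path expression (a single ->) is enough to express "somewhere below this page".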

Plus a Side Link

[Figure sequence: evaluating the query on the Jumping Spider graph]
- data structures -> lecture notes
- lecture notes -> trees

Analysis
- search engine index
  - adds a pertinent index
- pertinent index
  - O(nlogn) to O(n**2) space
  - all URLs that can reach this URL
  - tree-like, so should be close to O(nlogn)
- more intersections
- implemented in Perl 5
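
One way to picture how a pertinent index turns a path query into set intersections is sketched below. It is an illustration under assumptions, not the Perl implementation: both indexes hold invented data, and chain_query is a hypothetical name. KEYWORD_INDEX plays the role of the ordinary search engine index, and REACHED_FROM lists, for each URL, all URLs that can reach it.

```python
# Hypothetical keyword index (assumption): keyword -> URLs containing it.
KEYWORD_INDEX = {
    "data structures": {"/cs/ds/"},
    "lecture notes": {"/cs/ds/notes/", "/math/notes/"},
    "trees": {"/cs/ds/notes/trees.html", "/bio/trees.html"},
}

# Hypothetical pertinent index (assumption): URL -> URLs that can reach it.
REACHED_FROM = {
    "/cs/ds/": {"/", "/cs/"},
    "/cs/ds/notes/": {"/", "/cs/", "/cs/ds/"},
    "/math/notes/": {"/", "/math/"},
    "/cs/ds/notes/trees.html": {"/", "/cs/", "/cs/ds/", "/cs/ds/notes/"},
    "/bio/trees.html": {"/", "/bio/"},
}

def chain_query(keywords):
    """Evaluate keyword1 -> keyword2 -> ... using only index lookups."""
    if not keywords:
        return set()
    current = set(KEYWORD_INDEX.get(keywords[0], set()))
    for keyword in keywords[1:]:
        # Keep a page only if some page from the previous step can reach it.
        current = {
            url
            for url in KEYWORD_INDEX.get(keyword, set())
            if REACHED_FROM.get(url, set()) & current
        }
    return current
```

Calling chain_query(["data structures", "lecture notes", "trees"]) on this data keeps only /cs/ds/notes/trees.html. No page is fetched at query time; the price is the extra space for the pertinent index and one more intersection per step.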

Related Work
- WWW query languages
  - WebSQL (Arocena et al. - WWW6 ’97)
  - W3QS (Konopnicki and Shmueli - VLDB ’95)
  - WebLog (Lakshmanan et al. - RIDE ’96)
  - AKIRA (Lacroix et al. - ER ’97)
- Indexes that already use directories
  - Infoseek
  - WebGlimpse (Manber et al. - Usenix ’97)
- Semi-structured data models - many

Future Work
- scale to size of WWW
- extended query language (negation)
- easier installation