The WWW as a Database: WWW Query Languages Curtis Dyreson James Cook University (Townsville, Australia) Aalborg University
Outline searching the WWW –search engines –WWW query languages WebSQL –WWW graph –cost Jumping Spider –hybrid
Searching the WWW search engines –Altavista, Infoseek, 2100 others! static architecture –robot: periodic, slow, non-uniform coverage –index: keywords to URLs, fast, ranking algorithm example query Lecture notes on trees in a data structures course.
A Search Engine Index [figure: index entries for the keywords lecture notes, trees, and data structures]
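A keyword-to-URL index of the kind pictured above can be sketched in a few lines. This is a minimal illustration only: the page URLs and texts are hypothetical, and a real engine would add ranking and phrase handling. It also shows why the example query is hard for a page-level index: its keywords are spread over several pages.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages: no single page holds all of the query keywords.
PAGES = {
    "http://example.edu/cs210/": "data structures course home",
    "http://example.edu/cs210/notes/": "lecture notes",
    "http://example.edu/cs210/notes/trees.html": "trees binary trees",
}
INDEX = build_index(PAGES)
```

Here `search(INDEX, "data structures")` finds the course page, while a query that combines all of the keywords matches nothing, because the index only knows about single pages.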
WWW Query Languages search engines index single pages multi-page concepts hunting strategy –search engine to nearby page –manual search WWW query languages WebSQL, W3QS, WebLog
WWW Graph Structure large (650K servers, 350M pages) dynamic, cyclic link = edge page = node
WebSQL SQL-like search engine to find pages path expression (regular expression of links) text manipulation predicates SELECT FROM WHERE ;
WebSQL From Clause from clause collects a set of documents unstructured - primitive schema MENTIONS - retrieve from search engine DOCUMENT x SUCH THAT x MENTIONS ‘data structures’
WebSQL From Clause from clause collects a set of documents unstructured - primitive schema Document[URL, text, link to URL, modify date] MENTIONS - retrieve from search engine SELECT z.URL FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’, DOCUMENT y SUCH THAT x -> y, DOCUMENT z SUCH THAT y->* z WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
WebSQL From Clause path expression finds related documents URL local link: -> global link: => DOCUMENT x SUCH THAT “ DOCUMENT y SUCH THAT x -> y DOCUMENT y SUCH THAT x => y
WebSQL From Clause at most one link: ? any number of links: * alternation: | DOCUMENT y SUCH THAT x ->(->)? y DOCUMENT y SUCH THAT x (=> | ->*) y DOCUMENT y SUCH THAT x ->* y
WebSQL From Clause: Example FROM Document X SUCH THAT X MENTIONS ‘Java’, Document Y SUCH THAT X -> | ->-> Y
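The path expression in this example can be evaluated over a small link graph as sketched below. This is not WebSQL's implementation: the graph is hypothetical, a link is treated as local when source and target share a host, and the path expression is written out by hand as a list of alternatives rather than parsed.

```python
from urllib.parse import urlparse

# Hypothetical link graph: page URL -> URLs it links to.
LINKS = {
    "http://a.example/java.html": ["http://a.example/intro.html",
                                   "http://b.example/jdk.html"],
    "http://a.example/intro.html": ["http://a.example/applets.html"],
    "http://a.example/applets.html": [],
    "http://b.example/jdk.html": [],
}

def link_kind(src, dst):
    """'->' for a local link (same host, an assumption), '=>' for a global link."""
    return "->" if urlparse(src).netloc == urlparse(dst).netloc else "=>"

def follow(starts, path):
    """Follow one fixed sequence of link kinds from every start page."""
    frontier = set(starts)
    for kind in path:
        frontier = {dst for src in frontier for dst in LINKS.get(src, [])
                    if link_kind(src, dst) == kind}
    return frontier

def evaluate(starts, alternatives):
    """Union of the pages reached by each alternative of the path expression."""
    reached = set()
    for path in alternatives:
        reached |= follow(starts, path)
    return reached

# X stands for the pages returned by MENTIONS 'Java'.
X = {"http://a.example/java.html"}
Y = evaluate(X, [["->"], ["->", "->"]])   # X -> | ->-> Y
```

`Y` holds the pages one or two local links away from `X`; the page on `b.example` is excluded because reaching it needs a global link.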
WebSQL From Clause path expression limits search space local link, search limited to local machine global link, can go anywhere =>* would search all of WWW pre-analysis, filtering even three to four local links infeasible
WebSQL Where Clause like SQL CONTAINS, text search of retrieved document can push CONTAINS into navigation WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;
WebSQL Query Find lecture notes on trees in a data structures course. SELECT z. FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’, DOCUMENT y SUCH THAT x -> y, DOCUMENT z SUCH THAT y->* z WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
[figure sequence: the query evaluated over the WWW graph. First data structures -> lecture notes is matched, then lecture notes ->* trees. Result: data structures, lecture notes, trees]
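The evaluation steps of this query can be sketched over a toy graph as follows. The pages and links are hypothetical, MENTIONS is stood in for by a plain text scan rather than a search engine call, and a link is assumed to be local when both URLs share a host.

```python
from urllib.parse import urlparse

# Hypothetical pages: URL -> (text, outgoing links).
WEB = {
    "http://cs.example/ds/": ("data structures course",
                              ["http://cs.example/ds/notes/"]),
    "http://cs.example/ds/notes/": ("lecture notes",
                                    ["http://cs.example/ds/notes/week3.html"]),
    "http://cs.example/ds/notes/week3.html": ("week 3",
                                              ["http://cs.example/ds/notes/trees.html"]),
    "http://cs.example/ds/notes/trees.html": ("binary trees", []),
}

def local_links(url):
    """Targets of the local links (same host, an assumption) leaving url."""
    host = urlparse(url).netloc
    return [d for d in WEB[url][1] if d in WEB and urlparse(d).netloc == host]

def local_closure(url):
    """Pages reachable from url over zero or more local links (->*)."""
    seen, stack = {url}, [url]
    while stack:
        for nxt in local_links(stack.pop()):
            if nxt not in seen:       # the seen set guards against cycles
                seen.add(nxt)
                stack.append(nxt)
    return seen

def contains(url, phrase):
    return phrase in WEB[url][0]

def run_query():
    results = set()
    for x in WEB:                       # stand-in for x MENTIONS 'data structures'
        if not contains(x, "data structures"):
            continue
        for y in local_links(x):        # x -> y
            if not contains(y, "lecture notes"):
                continue
            for z in local_closure(y):  # y ->* z
                if contains(z, "trees"):
                    results.add(z)
    return results
```

On this graph `run_query()` returns only the trees page, which sits two local links below the lecture notes page.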
WebSQL Example
WebSQL Architecture Java implementation
WWW Query Language - Drawbacks dynamic architecture O(k**p) - p is length of path expression - k is branching factor a priori knowledge of topology back links are a problem
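The growth behind the cost drawback can be seen with a rough count. The sketch assumes every page has the same number of outgoing links and that no page is visited twice, so it gives an upper bound; the fan-out of 10 in the usage note is an arbitrary illustration.

```python
def pages_visited(branching, length):
    """Upper bound on pages fetched when following `length` links,
    assuming each page has `branching` outgoing links."""
    return sum(branching ** depth for depth in range(1, length + 1))
```

With a fan-out of 10, `pages_visited(10, 3)` is 1110 and `pages_visited(10, 4)` is 11110: each extra link in the path multiplies the work by roughly the fan-out.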
Jumping Spider - a Hybrid like a search engine - static architecture - keyword searches like a WWW query language - uses modified WWW graph - one kind of path expression
Kinds of Links content refinement queries are common heuristic information in subdirectories is refined different kinds of links back - subdirectory to parent down - parent directory to subdirectory side - unrelated directories
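The three kinds of links can be told apart from the URLs alone, as sketched below. The slide does not spell out the rules, so the details are assumptions: a link within the same directory is counted as down, a link to another host is counted as side, and a path without a trailing slash is read as a file.

```python
import posixpath
from urllib.parse import urlparse

def directory(url):
    """(host, directory path) of a URL, with a trailing slash on the path."""
    parts = urlparse(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)    # drop the file name
        if not path.endswith("/"):
            path += "/"
    return parts.netloc, path

def classify(src, dst):
    """Classify the link src -> dst as 'down', 'back', or 'side'."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"                     # assumption: other hosts are unrelated
    if dst_dir.startswith(src_dir):
        return "down"                     # assumption: includes the same directory
    if src_dir.startswith(dst_dir):
        return "back"
    return "side"
```

For example, a link from `http://h.example/ds/` to `http://h.example/ds/notes/trees.html` is classified as down, and the reverse link as back.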
Re-using the WWW Graph
Directory Trees
Down Links
Back Links
Eliminate Back Links
Transitive Closure of Down Links
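The two graph steps so far, eliminating back links and closing over down links, can be sketched as below. The labeled links are hypothetical, and side links are simply ignored here; how they are added back is not modeled.

```python
# Hypothetical labeled links: (source, target, kind).
LINKS = [
    ("ds", "ds/notes", "down"),
    ("ds/notes", "ds/notes/trees", "down"),
    ("ds/notes", "ds", "back"),
    ("ds/notes/trees", "ds", "back"),
]

def down_closure(links):
    """Drop every link that is not a down link, then return
    {page: set of pages reachable over one or more down links}."""
    graph = {}
    for src, dst, kind in links:
        graph.setdefault(src, set())
        graph.setdefault(dst, set())
        if kind == "down":
            graph[src].add(dst)
    closure = {}
    for start in graph:
        seen, stack = set(), [start]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[start] = seen
    return closure

CLOSURE = down_closure(LINKS)
```

After the closure, `ds` reaches `ds/notes/trees` in a single step, so a query no longer needs a `->*` style expression to get from the course page to the trees page.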
Plus a Side Link
[figure sequence: the same query on the modified graph. First data structures -> lecture notes is matched, then lecture notes -> trees]
Analysis search engine index - adds a pertinent index pertinent index - O(nlogn) to O(n**2) space - all URLs that can reach this URL - tree-like, so should be close to O(nlogn) more intersections implemented in Perl 5
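One way to picture how the two indexes combine at query time is sketched below. Both tables are hypothetical, and `REACHED_FROM` is only a stand-in for the pertinent index described above (for each URL, the URLs that can reach it); the actual Perl implementation is not shown in the slides.

```python
# Hypothetical keyword index: keyword -> URLs containing it.
KEYWORDS = {
    "data structures": {"ds", "os/refs"},
    "lecture notes": {"ds/notes", "os/notes"},
    "trees": {"ds/notes/trees", "bio/trees"},
}

# Hypothetical reachability index: URL -> URLs that can reach it.
REACHED_FROM = {
    "ds/notes": {"ds"},
    "ds/notes/trees": {"ds", "ds/notes"},
    "os/notes": {"os"},
    "bio/trees": {"bio"},
}

def jump(keywords):
    """Pages matching the last keyword that are reachable, step by step,
    from pages matching each earlier keyword."""
    current = set(KEYWORDS.get(keywords[0], set()))
    for word in keywords[1:]:
        # One set intersection per candidate page and per step.
        current = {page for page in KEYWORDS.get(word, set())
                   if REACHED_FROM.get(page, set()) & current}
    return current
```

`jump(["data structures", "lecture notes", "trees"])` keeps only `ds/notes/trees`; the other pages that mention trees or lecture notes are dropped because nothing matching the earlier keyword reaches them.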
Related Work WWW query languages WebSQL (Arocena et al. - WWW6 ’97) W3QS (Konopnicki and Shmueli - VLDB ’95) WebLog (Lakshmanan et al. - RIDE ’96) AKIRA (Lacroix et al. - ER ’97) Indexes that already use directories Infoseek WebGlimpse (Manber et al. - Usenix ’97) Semi-structured data models - many
Future Work scale to size of WWW extended query language (negation) easier installation