Published by Tiffany Debra Long. Modified over 9 years ago.
1
The WWW as a Database: WWW Query Languages
Curtis Dyreson
James Cook University (Townsville, Australia)
Aalborg University
2
Outline
searching the WWW
– search engines
– WWW query languages
WebSQL
– WWW graph
– cost
Jumping Spider
– hybrid
3
Searching the WWW
search engines
– Altavista, Infoseek, 2100 others!
static architecture
– robot: periodic, slow, non-uniform coverage
– index: keywords to URLs, fast, ranking algorithm
example query
– Lecture notes on trees in a data structures course.
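The "keywords to URLs" index can be pictured with a small sketch. The pages, URLs, and function names below are made up for illustration, and the lookup is a plain AND over keywords with no ranking, so this is only an assumption about how such an index behaves, not a description of any real search engine.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Return the URLs that contain every word of the query (no ranking)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical three-page course site; URLs and text are invented.
PAGES = {
    "http://example.edu/ds/": "data structures course home",
    "http://example.edu/ds/notes/": "lecture notes index",
    "http://example.edu/ds/notes/trees.html": "binary trees and balanced trees",
}

INDEX = build_index(PAGES)
print(lookup(INDEX, "data structures"))
print(lookup(INDEX, "lecture notes trees data structures"))
```

On this toy site the first lookup finds the course home page, while the second finds nothing, because the words of the example query are spread over three different pages.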
4-9
A Search Engine Index
[Figure sequence over slides 4-9: the keywords data structures, lecture notes, and trees are added to the picture one at a time.]
10
WWW Query Languages
search engines index single pages
multi-page concepts
hunting strategy
– search engine to nearby page
– manual search
WWW query languages
– WebSQL, W3QS, WebLog
11
WWW Graph Structure
large (650K servers, 350M pages)
dynamic, cyclic
link = edge
page = node
12
WebSQL
SQL-like
search engine to find pages
path expression (regular expression of links)
text manipulation predicates
SELECT FROM WHERE ;
13
WebSQL From Clause
from clause collects a set of documents
unstructured - primitive schema
MENTIONS - retrieve from search engine
DOCUMENT x SUCH THAT x MENTIONS ‘data structures’
14
WebSQL From Clause
from clause collects a set of documents
unstructured - primitive schema
Document[URL, text, link to URL, modify date]
MENTIONS - retrieve from search engine
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;
15
WebSQL From Clause
path expression finds related documents
URL: DOCUMENT x SUCH THAT “http://www.cs.auc.dk”
local link (->): DOCUMENT y SUCH THAT x -> y
global link (=>): DOCUMENT y SUCH THAT x => y
16
WebSQL From Clause
at most one link (?): DOCUMENT y SUCH THAT x ->(->)? y
any number of links (*): DOCUMENT y SUCH THAT x ->* y
alternation (|): DOCUMENT y SUCH THAT x (=> | ->*) y
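The path operators above can be read as functions from a set of start pages to a set of reached pages. The sketch below is one such reading in Python, not WebSQL's implementation: the toy graph, the helper names, and the rule "local means same host" are all assumptions made for illustration.

```python
from urllib.parse import urlparse

def same_host(a, b):
    return urlparse(a).netloc == urlparse(b).netloc

def local(graph, nodes):
    """-> : follow one link that stays on the same host (assumed meaning of local)."""
    return {t for n in nodes for t in graph.get(n, ()) if same_host(n, t)}

def global_(graph, nodes):
    """=> : follow one link that leaves the host."""
    return {t for n in nodes for t in graph.get(n, ()) if not same_host(n, t)}

def seq(*exprs):
    """Concatenation: apply each sub-expression in turn."""
    def run(graph, nodes):
        for expr in exprs:
            nodes = expr(graph, nodes)
        return nodes
    return run

def alt(*exprs):
    """| : union of the pages reached by each alternative."""
    def run(graph, nodes):
        return set().union(*(expr(graph, nodes) for expr in exprs))
    return run

def opt(expr):
    """? : zero or one application."""
    def run(graph, nodes):
        return set(nodes) | expr(graph, nodes)
    return run

def star(expr):
    """* : zero or more applications; the seen set makes it stop on cyclic graphs."""
    def run(graph, nodes):
        seen = set(nodes)
        frontier = set(nodes)
        while frontier:
            frontier = expr(graph, frontier) - seen
            seen |= frontier
        return seen
    return run

# Hypothetical five-page web spread over two hosts.
HOME = "http://a.example/"
X = "http://a.example/x.html"
Y = "http://a.example/y.html"
B = "http://b.example/"
Z = "http://b.example/z.html"
GRAPH = {HOME: {X, B}, X: {Y}, Y: {HOME}, B: {Z}}

print(seq(local, opt(local))(GRAPH, {HOME}))    # x ->(->)? y
print(star(local)(GRAPH, {HOME}))               # x ->* y
print(alt(global_, star(local))(GRAPH, {HOME})) # x (=> | ->*) y
```

Starting from the home page, `->(->)?` reaches the two other pages on the same host, `->*` additionally includes the start page itself (zero links), and the alternation adds the one page on the other host.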
17-20
WebSQL From Clause: Example
FROM Document X SUCH THAT X MENTIONS ‘Java’,
     Document Y SUCH THAT X -> | ->-> Y
[Figure sequence over slides 18-20; the only surviving label is ‘Java’.]
21
WebSQL From Clause
path expression limits search space
local link, search limited to local machine
global link, can go anywhere
=>* would search all of WWW
pre-analysis, filtering
even three to four local links infeasible
22
WebSQL Where Clause
like SQL
CONTAINS, text search of retrieved document
can push CONTAINS into navigation
WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;
23
WebSQL Query
Find lecture notes on trees in a data structures course.
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;
24-31
[Figure sequence over slides 24-31: step-by-step evaluation of the query]
data structures -> lecture notes
lecture notes ->* trees
Result: data structures, lecture notes, trees
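The evaluation steps of the query (MENTIONS, then one local link, then any number of local links, with CONTAINS filters) can be mimicked on a toy in-memory web. Everything in this sketch is hypothetical: the pages, the helper names, and the substring matching that stands in for both the search engine and the text predicates. All links are treated as local because the toy site lives on one host.

```python
# Toy web: URL -> (page text, outgoing local links). All data is invented.
WEB = {
    "/ds/": ("data structures course", ["/ds/notes/", "/ds/staff.html"]),
    "/ds/notes/": ("lecture notes", ["/ds/notes/lists.html", "/ds/notes/trees.html"]),
    "/ds/staff.html": ("teaching staff", ["/ds/"]),
    "/ds/notes/lists.html": ("linked lists", ["/ds/notes/"]),
    "/ds/notes/trees.html": ("binary trees", ["/ds/notes/"]),
}

def mentions(phrase):
    """Stand-in for the search engine lookup behind MENTIONS."""
    return {url for url, (text, _) in WEB.items() if phrase in text}

def contains(url, phrase):
    """CONTAINS: text search of an already retrieved document."""
    return phrase in WEB[url][0]

def one_link(url):
    """x -> y : documents exactly one local link away."""
    return set(WEB[url][1])

def any_links(url):
    """y ->* z : documents reachable over zero or more local links."""
    seen, stack = {url}, [url]
    while stack:
        for nxt in WEB[stack.pop()][1]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def query():
    """Nested-loop reading of the SELECT ... FROM ... WHERE query."""
    return {
        z
        for x in mentions("data structures")
        for y in one_link(x) if contains(y, "lecture notes")
        for z in any_links(y) if contains(z, "trees")
    }

print(query())
```

Note that every `one_link` and `any_links` call corresponds to fetching pages at query time, which is the dynamic part of the architecture.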
32
WebSQL Example
33
WebSQL Architecture
Java implementation
34
WWW Query Language - Drawbacks
dynamic architecture
O(k**p)
- p is length of path expression
- k is branching factor
a priori knowledge of topology
back links are a problem
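To get a feel for how quickly dynamic evaluation grows, the sketch below counts the pages fetched when a path expression follows up to p links. It assumes every page has exactly k outgoing links and that no page is reached twice; both are simplifications, and the value k = 10 is made up.

```python
def pages_fetched(k, p):
    """Pages visited when following up to p links with k links per page."""
    return sum(k ** depth for depth in range(p + 1))

for p in range(1, 5):
    print(p, pages_fetched(10, p))
```

With ten links per page, four links of navigation already touch 11111 pages, each of which has to be retrieved while the query runs.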
35
Jumping Spider - a Hybrid
like a search engine
- static architecture
- keyword searches
like a WWW query language
- uses modified WWW graph
- one kind of path expression
36
Kinds of Links
content refinement queries are common
heuristic: information in subdirectories is refined
different kinds of links
- back - subdirectory to parent
- down - parent directory to subdirectory
- side - unrelated directories
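One way to tell the three kinds of links apart is to compare the directory parts of the two URLs. The sketch below is an assumption about how that could work, not the Jumping Spider rules: in particular, treating a link within the same directory as a down link and any cross-host link as a side link are choices made here, and the URLs are invented.

```python
import posixpath
from urllib.parse import urlparse

def directory(url):
    """Return (host, directory path ending in '/') for a URL."""
    parts = urlparse(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return parts.netloc, path

def classify(src, dst):
    """Label a link from src to dst as 'down', 'back', or 'side'."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"   # assumption: other hosts count as unrelated directories
    if dst_dir.startswith(src_dir):
        return "down"   # same directory or a subdirectory (assumption)
    if src_dir.startswith(dst_dir):
        return "back"   # target is a parent directory
    return "side"

print(classify("http://example.edu/ds/", "http://example.edu/ds/notes/trees.html"))
print(classify("http://example.edu/ds/notes/trees.html", "http://example.edu/ds/"))
print(classify("http://example.edu/ds/notes/", "http://example.edu/ai/"))
```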
37
Re-using the WWW Graph
38
Directory Trees
39
Down Links
40
Back Links
41
Eliminate Back Links
42
Transitive Closure of Down Links
43
Plus a Side Link
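The slide titles above suggest a construction: drop back links, take the transitive closure of down links, and allow a side link. A minimal sketch of that idea follows. It reads "plus a side link" as "at most one side link anywhere along a path", which is an interpretation rather than something the slides state, and the page names and edge labels are made up.

```python
from collections import defaultdict

def down_closure(start, down):
    """Pages reachable from start over one or more down links."""
    seen, stack = set(), [start]
    while stack:
        for nxt in down.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def jumps(edges):
    """Map each page to the pages reachable by down* (side down*)?, ignoring back links."""
    down, side = defaultdict(set), defaultdict(set)
    for src, dst, kind in edges:
        if kind == "down":
            down[src].add(dst)
        elif kind == "side":
            side[src].add(dst)
        # back links are dropped on purpose
    nodes = {n for src, dst, _ in edges for n in (src, dst)}
    result = {}
    for node in nodes:
        below = down_closure(node, down)
        reach = set(below)
        for mid in below | {node}:
            for target in side.get(mid, ()):
                reach.add(target)
                reach |= down_closure(target, down)
        reach.discard(node)
        result[node] = reach
    return result

# Hypothetical labelled links between five pages.
EDGES = [
    ("ds", "notes", "down"),
    ("notes", "trees", "down"),
    ("trees", "ds", "back"),
    ("notes", "book", "side"),
    ("book", "ch1", "down"),
]
REACH = jumps(EDGES)
print(REACH["ds"])
```

Because back links are removed before the closure is taken, the cycle from the trees page back to the course page does not make every page reachable from every other page.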
44-48
[Figure sequence over slides 44-48: step-by-step evaluation of the query]
data structures -> lecture notes
lecture notes -> trees
49
Analysis
search engine index
- adds a pertinent index
pertinent index
- O(nlogn) to O(n**2) space
- all URLs that can reach this URL
- tree-like, so should be close to O(nlogn)
more intersections
implemented in Perl 5
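The pertinent index ("all URLs that can reach this URL") lets a chained query be answered with index lookups and set intersections instead of page fetches. The sketch below illustrates that idea in Python rather than the Perl 5 of the actual system; the reachability map, the keyword index, and the function names are all hypothetical.

```python
def pertinent_index(reach):
    """Invert a reachability map: URL -> set of URLs that can reach it."""
    index = {url: set() for url in reach}
    for src, targets in reach.items():
        for dst in targets:
            index.setdefault(dst, set()).add(src)
    return index

def evaluate(keywords, keyword_index, pertinent):
    """Evaluate k1 -> k2 -> ... -> kn with lookups and intersections only."""
    current = set(keyword_index.get(keywords[0], ()))
    for word in keywords[1:]:
        current = {
            url
            for url in keyword_index.get(word, ())
            if pertinent.get(url, set()) & current
        }
    return current

# Hypothetical reachability in the modified graph and a hypothetical keyword index.
REACH = {
    "ds": {"notes", "lists", "trees"},
    "notes": {"lists", "trees"},
    "lists": set(),
    "trees": set(),
    "ai": {"search"},
    "search": set(),
}
KEYWORDS = {
    "data structures": {"ds"},
    "lecture notes": {"notes"},
    "trees": {"trees", "search"},
}
PERTINENT = pertinent_index(REACH)
print(evaluate(["data structures", "lecture notes", "trees"], KEYWORDS, PERTINENT))
```

The page "search" also matches the keyword trees, but no lecture notes page can reach it, so the intersection with its pertinent set is empty and it is filtered out without visiting any page.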
50
Related Work
WWW query languages
- WebSQL (Arocena et al. - WWW6 ’97)
- W3QS (Konopnicki and Shmueli - VLDB ’95)
- WebLog (Lakshmanan et al. - RIDE ’96)
- AKIRA (Lacroix et al. - ER ’97)
Indexes that already use directories
- Infoseek
- WebGlimpse (Manber et al. - Usenix ’97)
Semi-structured data models - many
51
Future Work
scale to size of WWW
extended query language (negation)
easier installation