1 The WWW as a Database: WWW Query Languages Curtis Dyreson James Cook University ( Townsville, Australia ) Aalborg University

2 Outline searching the WWW –search engines –WWW query languages WebSQL –WWW graph –cost Jumping Spider –hybrid

3 Searching the WWW search engines –Altavista, Infoseek, 2100 others! static architecture –robot: periodic, slow, non-uniform coverage –index: keywords to URLs, fast, ranking algorithm example query Lecture notes on trees in a data structures course.

4 A Search Engine Index

5 data structures

6 A Search Engine Index lecture notes data structures

9 A Search Engine Index lecture notes trees data structures
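
The index described on slide 3 (keywords to URLs) can be sketched as an inverted index. This is a minimal illustration, not any real engine's implementation; the URLs and page texts are hypothetical. It also shows why a per-page index struggles with the example query: only a page containing every keyword is returned.

```python
from collections import defaultdict

# Hypothetical toy pages: URL -> page text (not taken from the slides).
PAGES = {
    "http://example.edu/cs/ds/": "data structures course home",
    "http://example.edu/cs/ds/notes/": "lecture notes index",
    "http://example.edu/cs/ds/notes/trees.html": "lecture notes on trees",
    "http://example.edu/cs/ds/all.html": "data structures lecture notes on trees and graphs",
}

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs that contain every keyword in the query (set intersection)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(PAGES)
hits = search(INDEX, "lecture notes trees data structures")
```

Here `hits` contains only `all.html`; the page `notes/trees.html`, which sits under the data structures course but never repeats the words "data structures", is missed.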

10 WWW Query Languages: search engines index single pages; multi-page concepts; hunting strategy - search engine to nearby page - manual search; WWW query languages: WebSQL, W3QS, WebLog

11 WWW Graph Structure large (650K servers, 350M pages) dynamic, cyclic link = edge page = node

12 WebSQL SQL-like search engine to find pages path expression (regular expression of links) text manipulation predicates SELECT FROM WHERE ;

13 WebSQL From Clause from clause collects a set of documents unstructured - primitive schema MENTIONS - retrieve from search engine DOCUMENT x SUCH THAT x MENTIONS ‘data structures’

14 WebSQL From Clause from clause collects a set of documents unstructured - primitive schema Document[URL, text, link to URL, modify date] MENTIONS - retrieve from search engine
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y ->* z
    WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;

15 WebSQL From Clause path expression finds related documents URL local link: -> global link: => DOCUMENT x SUCH THAT “http://www.cs.auc.dk” DOCUMENT y SUCH THAT x -> y DOCUMENT y SUCH THAT x => y

16 WebSQL From Clause
    at most one link: ?        DOCUMENT y SUCH THAT x ->(->)? y
    any number of links: *     DOCUMENT y SUCH THAT x ->* y
    alternation: |             DOCUMENT y SUCH THAT x (=> | ->*) y
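
The path operators on slides 15 and 16 can be read as functions from a set of pages to a set of pages. The sketch below is one possible reading in Python, not the WebSQL implementation: the link graph is hypothetical, and treating "local" as "same host" is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical link graph: page URL -> URLs it links to (illustrative only).
LINKS = {
    "http://a.example/": ["http://a.example/x.html", "http://b.example/"],
    "http://a.example/x.html": ["http://a.example/y.html", "http://a.example/"],
    "http://a.example/y.html": [],
    "http://b.example/": ["http://b.example/z.html"],
    "http://b.example/z.html": [],
}

def is_local(src, dst):
    # Assumption: a link is local when both ends are on the same host.
    return urlparse(src).netloc == urlparse(dst).netloc

def local(pages):
    """'->' : follow one link that stays on the same server."""
    return {d for s in pages for d in LINKS.get(s, []) if is_local(s, d)}

def global_(pages):
    """'=>' : follow one link that leaves the server."""
    return {d for s in pages for d in LINKS.get(s, []) if not is_local(s, d)}

def optional(step):
    """'?' : zero or one application of step."""
    return lambda pages: set(pages) | step(pages)

def star(step):
    """'*' : any number of applications; the seen set copes with cycles."""
    def run(pages):
        seen = set(pages)
        frontier = set(pages)
        while frontier:
            frontier = step(frontier) - seen
            seen |= frontier
        return seen
    return run

def alt(a, b):
    """'|' : union of two path expressions."""
    return lambda pages: a(pages) | b(pages)

def seq(a, b):
    """Concatenation: apply a, then b."""
    return lambda pages: b(a(pages))

start = {"http://a.example/"}
one_or_two = seq(local, optional(local))(start)      # x ->(->)? y
any_local = star(local)(start)                       # x ->* y
global_or_local = alt(global_, star(local))(start)   # x (=> | ->*) y
```

Because the graph is cyclic (`x.html` links back to the start page), `star` has to track visited pages or it would never terminate.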

17 WebSQL From Clause: Example FROM Document X SUCH THAT X MENTIONS ‘Java’, Document Y SUCH THAT X -> | ->-> Y

21 WebSQL From Clause: path expression limits search space; local link, search limited to local machine; global link, can go anywhere; =>* would search all of WWW; pre-analysis, filtering; even three to four local links infeasible

22 WebSQL Where Clause: like SQL; CONTAINS, text search of retrieved document; can push CONTAINS into navigation
    WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;

23 WebSQL Query: Find lecture notes on trees in a data structures course.
    SELECT z.URL
    FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
         DOCUMENT y SUCH THAT x -> y,
         DOCUMENT z SUCH THAT y ->* z
    WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;

24 data structures -> lecture notes

25 data structures

26 data structures -> lecture notes data structures

27 data structures -> lecture notes data structures lecture notes

28 lecture notes ->* trees data structures lecture notes

30 lecture notes ->* trees data structures lecture notes trees

31 Result data structures lecture notes trees
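
The walk-through on slides 24 to 31 can be mimicked on a toy site. This is a hedged sketch of how the slide 23 query might be evaluated, not how the WebSQL engine actually does it; the site contents, the `SITE` table, and the helper names are all hypothetical. The `continue` in `query` illustrates pushing CONTAINS into navigation: a page `y` without 'lecture notes' is discarded before `y ->* z` is expanded.

```python
# Hypothetical site: URL -> (page text, local links). Illustrative only.
SITE = {
    "/ds/": ("data structures course", ["/ds/notes/", "/ds/staff.html"]),
    "/ds/notes/": ("lecture notes", ["/ds/notes/lists.html", "/ds/notes/trees.html"]),
    "/ds/notes/lists.html": ("linked lists", ["/ds/notes/"]),
    "/ds/notes/trees.html": ("binary trees", []),
    "/ds/staff.html": ("staff who like trees", []),
}

def mentions(phrase):
    """Stand-in for MENTIONS: ask the 'search engine' for matching pages."""
    return {url for url, (text, _) in SITE.items() if phrase in text}

def contains(url, phrase):
    """Stand-in for CONTAINS: text search of a retrieved page."""
    return phrase in SITE[url][0]

def step(url):
    """'->' : pages one local link away."""
    return SITE[url][1]

def closure(url):
    """'->*' : url itself plus everything reachable over local links."""
    seen, stack = {url}, [url]
    while stack:
        for nxt in step(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def query():
    results = set()
    for x in mentions("data structures"):
        for y in step(x):
            if not contains(y, "lecture notes"):
                continue  # prune early: do not expand y ->* z for this y
            for z in closure(y):
                if contains(z, "trees"):
                    results.add(z)
    return results

answer = query()
```

On this toy site `answer` is `{"/ds/notes/trees.html"}`. The staff page also contains 'trees', but it is not reached through a 'lecture notes' page, so it is not returned.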

32 WebSQL Example

33 WebSQL Architecture Java implementation

34 WWW Query Language - Drawbacks: dynamic architecture; O(k**p) - k is branching factor - p is length of path expression; a priori knowledge of topology; back links are a problem
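
To get a feel for the cost, read k as the number of links per page and p as the number of links in the path expression. The function below is a back-of-the-envelope sketch that assumes every page has exactly k links and no page is reached twice; real sites will differ.

```python
def pages_visited(k, p):
    """Pages touched when following up to p links, with k links per page.

    Assumes a tree-shaped neighbourhood: 1 start page, k pages at
    distance 1, k**2 at distance 2, and so on up to k**p.
    """
    return sum(k ** i for i in range(p + 1))

three_links = pages_visited(10, 3)
four_links = pages_visited(10, 4)
```

With ten links per page, three links already mean 1111 page fetches and four links mean 11111, each over the network at query time, which is in line with the remark on slide 21 that three to four local links are infeasible.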

35 Jumping Spider - a Hybrid like a search engine - static architecture - keyword searches like a WWW query language - uses modified WWW graph - one kind of path expression

36 Kinds of Links: content refinement queries are common; heuristic: information in subdirectories is refined; different kinds of links: back - subdirectory to parent; down - parent directory to subdirectory; side - unrelated directories
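
The three kinds of links can be told apart from the URLs alone. The sketch below is one plausible classifier, not the Jumping Spider code (which the slides say is written in Perl 5). Two choices here are assumptions: a link within the same directory is counted as a down link, and a link to another host is counted as a side link.

```python
from urllib.parse import urlparse

def directory(url):
    """Return (host, directory path ending in '/') for a URL."""
    parts = urlparse(url)
    # Drop the file name, keep the directory: '/a/b.html' -> '/a/'.
    return parts.netloc, parts.path.rsplit("/", 1)[0] + "/"

def classify(src, dst):
    """Label a link from src to dst as 'down', 'back', or 'side'."""
    src_host, src_dir = directory(src)
    dst_host, dst_dir = directory(dst)
    if src_host != dst_host:
        return "side"  # assumption: other hosts are unrelated directories
    if dst_dir.startswith(src_dir):
        return "down"  # same directory or a subdirectory (assumption for 'same')
    if src_dir.startswith(dst_dir):
        return "back"  # target is an ancestor directory
    return "side"

down = classify("http://h.example/cs/ds/index.html",
                "http://h.example/cs/ds/notes/trees.html")
back = classify("http://h.example/cs/ds/notes/trees.html",
                "http://h.example/cs/ds/index.html")
side = classify("http://h.example/cs/ds/index.html",
                "http://h.example/math/index.html")
```

Directory paths always end in `/`, so the prefix test compares whole path segments: `/ab/` is not treated as a subdirectory of `/a/`.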

37 Re-using the WWW Graph

38 Directory Trees

39 Down Links

40 Back Links

41 Eliminate Back Links

42 Transitive Closure of Down Links

43 Plus a Side Link
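
Slides 41 to 43 name three steps: drop back links, take the transitive closure of down links, then add a side link. The sketch below is an interpretation of those slide titles, not the author's algorithm. In particular, reading "plus a side link" as "down links, at most one side link, then more down links" is an assumption, and the labelled edges are hypothetical.

```python
# Hypothetical labelled links: (source, target, kind). Illustrative only.
EDGES = [
    ("/ds/", "/ds/notes/", "down"),
    ("/ds/notes/", "/ds/notes/trees.html", "down"),
    ("/ds/notes/trees.html", "/ds/", "back"),
    ("/ds/notes/", "/algo/", "side"),
    ("/algo/", "/algo/sorting.html", "down"),
]

def down_closure(edges):
    """Transitive closure over down links only; back links are ignored."""
    down = {}
    for src, dst, kind in edges:
        if kind == "down":
            down.setdefault(src, set()).add(dst)
    closure = {}
    for start in down:
        seen, stack = set(), [start]
        while stack:
            for nxt in down.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[start] = seen
    return closure

def with_side_link(edges):
    """Reachability via down links with at most one side link (assumed reading)."""
    closure = down_closure(edges)
    reach = {src: set(dsts) for src, dsts in closure.items()}
    for src, dst, kind in edges:
        if kind != "side":
            continue
        beyond = {dst} | closure.get(dst, set())
        # The side link is usable from src and from any page that reaches src.
        for page in list(reach) + [src]:
            if page == src or src in closure.get(page, set()):
                reach.setdefault(page, set()).update(beyond)
    return reach

CLOSURE = down_closure(EDGES)
REACH = with_side_link(EDGES)
```

Dropping the back link from `trees.html` to `/ds/` removes the cycle, so the closure stays tree-like rather than making every page reach every other page.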

44 data structures -> lecture notes data structures

46 data structures -> lecture notes data structures lecture notes

47 lecture notes -> trees data structures lecture notes

48 lecture notes -> trees data structures lecture notes trees

49 Analysis: search engine index - adds a pertinent index; pertinent index - O(n log n) to O(n**2) space - all URLs that can reach this URL - tree-like, so should be close to O(n log n); more intersections; implemented in Perl 5
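
The pertinent index ("all URLs that can reach this URL") and the extra intersections can be sketched as follows. This is an illustration in Python under made-up data, not the Perl 5 implementation; `KEYWORD_INDEX`, `REACHES`, and `jump_query` are hypothetical names.

```python
# Hypothetical keyword index and precomputed reachability. Illustrative only.
KEYWORD_INDEX = {
    "data structures": {"/ds/"},
    "lecture notes": {"/ds/notes/", "/os/notes/"},
    "trees": {"/ds/notes/trees.html", "/bio/trees.html"},
}
REACHES = {  # source page -> pages it can reach in the modified graph
    "/ds/": {"/ds/notes/", "/ds/notes/trees.html"},
    "/ds/notes/": {"/ds/notes/trees.html"},
    "/os/": {"/os/notes/"},
    "/bio/": {"/bio/trees.html"},
}

def pertinent_index(reaches):
    """Invert reachability: page -> set of pages that can reach it."""
    index = {}
    for src, targets in reaches.items():
        for target in targets:
            index.setdefault(target, set()).add(src)
    return index

def jump_query(keywords, keyword_index, pertinent):
    """Evaluate k1 -> k2 -> ... using only index lookups and intersections."""
    current = set(keyword_index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        current = {
            page
            for page in keyword_index.get(keyword, set())
            if pertinent.get(page, set()) & current
        }
    return current

PERTINENT = pertinent_index(REACHES)
result = jump_query(["data structures", "lecture notes", "trees"],
                    KEYWORD_INDEX, PERTINENT)
```

No page is fetched at query time: each step is a keyword lookup followed by an intersection with the pertinent sets, which is the static-architecture trade of extra index space for query speed. Here `result` is `{"/ds/notes/trees.html"}`.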

50 Related Work: WWW query languages: WebSQL (Arocena et al. - WWW6 ’97), W3QS (Konopnicki and Shmueli - VLDB ’95), WebLog (Lakshmanan et al. - RIDE ’96), AKIRA (Lacroix et al. - ER ’97); Indexes that already use directories: Infoseek, WebGlimpse (Manber et al. - Usenix ’97); Semi-structured data models - many

51 Future Work scale to size of WWW extended query language (negation) easier installation

