Complex Queries over Web Repositories Sriram Raghavan and Hector Garcia-Molina Computer Science Department Stanford University Gülfem IŞIKLAR M.Mirac KOCATÜRK.


1 Complex Queries over Web Repositories Sriram Raghavan and Hector Garcia-Molina Computer Science Department Stanford University Gülfem IŞIKLAR M.Mirac KOCATÜRK

2 19.11.2003 Complex Queries over Web Repositories Outline: Introduction; Challenges and Solution Approach; Model of a Web Repository; Query Operators; Examples of Complex Queries; Optimizing and Executing Complex Web Queries; Conclusion

3 Introduction: Web repositories manage large heterogeneous collections of Web pages and associated indexes. For effective analysis and mining, these repositories must provide a declarative query interface that supports complex, expressive Web queries. In this paper, we model a Web repository in terms of “Web relations” and describe an algebra for expressing complex Web queries.

5 Example 1: Let S be a weighted set consisting of all the pages in the stanford.edu domain that contain the phrase ’Mobile networking’. Compute R, the set of all the “.edu” domains (except stanford.edu) that pages in S point to (we say a page p points to domain D if it points to any page in D). List the top-10 domains in R in descending order of their weights.

6 Example 2: Each comic strip C is associated with a website C_S and a set C_W containing the name of the strip and the names of the characters featured in it. For example, Dilbert_W = {Dilbert, Dogbert, The Boss} and Dilbert_S = dilbert.com. Extract a set S of at most 10,000 pages from the stanford.edu domain, preferring pages whose URLs include either the “~” character or the path fragment “/people/”. For each comic strip C, compute f_1(C), the number of pages in S that contain the words in C_W, and f_2(C), the number of pages in C_S that pages in S point to. The sum f_1(C) + f_2(C) is a measure of popularity for comic strip C.

7 Challenges and Solution Approach: Query models used in relational or text retrieval systems provide some, but not all, of the features required to support Web queries. Treating a Web repository as an application of a text retrieval system supports the “document collection” view, but queries involving navigation or relational operators become extremely hard to formulate and execute. The relational model, on the other hand, provides a rich and well-tested suite of operators for expressing complex predicates over Web page attributes. However, ranks and orders are not intrinsic to the basic relational model.

8 Model of a Web Repository. Page: We use the term “page” to refer to any Web resource that is referenced by a URL, crawled, and stored in the repository. Link: We use the term “link” to refer to any hypertext link that is embedded in the pages in the repository. Each link is associated with a source page (the page in which the hypertext link occurs), a destination page (the page that the link refers to), and a unique identifier linkID.

9 Ordered relation: Given a relation R and a strict partial ordering >_R on the tuples of R, we refer to the pair [R; >_R] as an ordered relation on R. For instance, we define an ordered relation [R; >_R] = [R; {a >_R d; a >_R e; b >_R d; b >_R e; c >_R d; c >_R e}], where each tuple whose domain attribute is stanford.edu precedes, under >_R, every tuple outside the stanford.edu domain.
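The ordered relation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple names a to e come from the slide, but the concrete domain values and the helper name `order_by_preference` are hypothetical.

```python
# Hypothetical toy relation: tuple name -> value of its domain attribute.
R = {
    "a": "stanford.edu", "b": "stanford.edu", "c": "stanford.edu",
    "d": "berkeley.edu", "e": "berkeley.edu",
}

def order_by_preference(relation, preferred):
    """Return a strict partial order as a set of (higher, lower) pairs.

    A tuple x precedes a tuple y when x satisfies the preference
    predicate and y does not; all other pairs stay unrelated.
    """
    return {
        (x, y)
        for x in relation
        for y in relation
        if preferred(relation[x]) and not preferred(relation[y])
    }

gt_R = order_by_preference(R, lambda domain: domain == "stanford.edu")
print(sorted(gt_R))
```

Representing the order as a set of pairs keeps it a partial order: tuples inside the same group (for example a and b) remain incomparable.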

10 Ranked relation: Given a relation R and a function w that assigns weights (normalized to the range [0, 1]) to the tuples of R, we can define a new relation [R; w] that is simply R with an additional implicit real-valued attribute w. We refer to [R; w] as a ranked relation on R and to R as the “base relation” of [R; w].

11 We model a Web repository as a 6-tuple W = (I_P; I_L; W_R; P; L; F). I_P (resp. I_L) is an identifier space from which the pageID (resp. linkID) for every page (resp. link) is chosen. W_R is a set of plain, ranked, or ordered relations called Web relations. A relation R is said to be a Web relation if it contains at least one attribute whose domain is I_P, I_L, 2^{I_P}, or 2^{I_L}. P ∈ W_R is a universal page relation. P contains one tuple for each page in the repository and one column for each page attribute: P = (pageID, ...).

12 L ∈ W_R is a universal link relation. L contains one tuple for each hyperlink in the repository and one column for every available link attribute: L = (linkID, srcID, destID, ...). F is a set of predefined page and link ranking functions that have been registered in the repository.

13 Query Operators

14 Unary Relational Operators. Select (σ): For an ordered relation, we define σ([R; >_R]) = [S; >_S], where S = σ(R) and >_S is defined as a >_S b iff a >_R b and a, b ∈ S. For instance, referring to relation R, given the ordered relation [R; >_R] = [R; {a > d; a > e; b > d; b > e; c > d; c > e}], representing “preference to stanford.edu pages over berkeley.edu pages”, σ_{pInDegree≥4}([R; >_R]) will yield [S; >_S], where S = σ_{pInDegree≥4}(R) = {b; d; e} and >_S includes the two ordering conditions b >_S d and b >_S e.
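A minimal sketch of selection on an ordered relation, assuming the order is stored as a set of (higher, lower) pairs. The pInDegree values below are hypothetical, chosen only so that b, d, and e pass the filter as in the slide; `select_ordered` is an illustrative name, not part of the paper.

```python
def select_ordered(tuples, order, predicate):
    """Filter the tuples, then keep only order pairs whose ends both survive."""
    kept = {name for name, row in tuples.items() if predicate(row)}
    restricted = {(x, y) for (x, y) in order if x in kept and y in kept}
    return kept, restricted

# Hypothetical attribute values for the tuples a..e.
R = {
    "a": {"pInDegree": 1}, "b": {"pInDegree": 5}, "c": {"pInDegree": 2},
    "d": {"pInDegree": 4}, "e": {"pInDegree": 7},
}
gt_R = {("a", "d"), ("a", "e"), ("b", "d"), ("b", "e"), ("c", "d"), ("c", "e")}

S, gt_S = select_ordered(R, gt_R, lambda row: row["pInDegree"] >= 4)
print(sorted(S), sorted(gt_S))
```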

15 Projection (Π): The semantics of projection carry over unchanged from traditional relational algebra, except for the following changes. The result of the projection must include at least one attribute whose domain is I_P, I_L, 2^{I_P}, or 2^{I_L}, to ensure that the result is a Web relation. Projection on a ranked relation [R; w] will retain the ranking attribute R.w even if it is not listed in the projection list.

16 Group-by (γ) on a plain Web relation: The γ operator is used to group the incoming links to WS based on the language of the page in which the links occur.

17 Group-by (γ) on a ranked Web relation: The value of t.w′ for a tuple t ∈ [S; w′] is computed by averaging the ranks of all the tuples of R belonging to the corresponding group.
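The averaging rule can be illustrated with a short sketch. The rows, the grouping attribute pLanguage, and the function name `group_by_ranked` are assumptions made for this example only.

```python
from collections import defaultdict

def group_by_ranked(rows, key):
    """Group (attributes, weight) rows by one attribute.

    Returns {group value: average weight of the tuples in that group},
    which plays the role of w' in the grouped relation [S; w'].
    """
    groups = defaultdict(list)
    for attrs, weight in rows:
        groups[attrs[key]].append(weight)
    return {value: sum(ws) / len(ws) for value, ws in groups.items()}

# Hypothetical ranked relation: each row carries a weight in [0, 1].
rows = [
    ({"pLanguage": "English"}, 0.5),
    ({"pLanguage": "English"}, 1.0),
    ({"pLanguage": "German"}, 0.25),
]
ranked_groups = group_by_ranked(rows, "pLanguage")
print(ranked_groups)
```

Because every input weight lies in [0, 1], the average does too, so the result is again a valid ranked relation.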

18 Group-by (γ) on an ordered Web relation: The partial ordering >_R is used to express the following preference: “prefer pages with depths ≤ 3”. Thus, b >_R c, g >_R e, etc., as shown in the diagram.

19 Binary Relational Operators. Union (∪): We define [X; >_X] ∪ [Y; >_Y] = [Z; >_Z], where Z = X ∪ Y. If a >_X b and a >_Y b, then a >_Z b. If either a >_X b and b ∉ Y, or a >_Y b and b ∉ X, then a >_Z b.

20 Set-difference (−): [X; >_X] − [Y; >_Y] = [Z; >_Z], where Z = X − Y and a >_Z b iff a >_X b and a, b ∈ Z. Note that the result has a partial order which is simply >_X restricted to the elements present in the result.

21 Intersection (∩): [X; >_X] ∩ [Y; >_Y] = [Z; >_Z], where Z = X ∩ Y and a >_Z b iff a, b ∈ Z, a >_X b, and a >_Y b.
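The three set operators on ordered relations can be sketched as follows, assuming relations are Python sets and orders are sets of (higher, lower) pairs. The function names and the sample relations X and Y are hypothetical; the rules are transcribed from the slides.

```python
def union(X, gt_X, Y, gt_Y):
    """Keep a pair if both orders agree, or if one order states it and
    the lower element is absent from the other relation."""
    gt_Z = set()
    for a, b in gt_X | gt_Y:
        in_x, in_y = (a, b) in gt_X, (a, b) in gt_Y
        if (in_x and in_y) or (in_x and b not in Y) or (in_y and b not in X):
            gt_Z.add((a, b))
    return X | Y, gt_Z

def difference(X, gt_X, Y):
    """The order of the result is >_X restricted to the surviving tuples."""
    Z = X - Y
    return Z, {(a, b) for a, b in gt_X if a in Z and b in Z}

def intersection(X, gt_X, Y, gt_Y):
    """A pair survives only if both input orders contain it."""
    Z = X & Y
    return Z, {(a, b) for a, b in gt_X & gt_Y if a in Z and b in Z}

# Hypothetical operands.
X, gt_X = {"a", "b", "c"}, {("a", "b"), ("a", "c")}
Y, gt_Y = {"a", "b", "d"}, {("a", "b"), ("d", "b")}

u = union(X, gt_X, Y, gt_Y)
d = difference(X, gt_X, Y)
i = intersection(X, gt_X, Y, gt_Y)
print(u, d, i)
```

In this example the pair (d, b) is dropped from the union because b also belongs to X, where nothing is said about d and b.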

22 Cross-product (×): Cross-product operations can involve any pair of plain, ranked, or ordered relations. The challenge is to define the ordering or ranking of the result for each possible combination of operands.

23 Ranking and ordering operators. Rank (Ψ): Operator Ψ simply formalizes the act of applying a ranking function to a base relation. Thus, given a relation R and a ranking function f : R × {R} → [0, 1], we define Ψ(f; R) = [R; f]. Compose (Θ_{h,op}): The compose operator Θ is used to merge two ranked relations to produce another ranked relation.

24 Order (Φ): The operator Φ constructs an ordered relation, given either a ranked relation or a plain base relation. When applied to a ranked relation, Φ([R; f]) returns the corresponding ordered relation [R; >_f].
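A minimal sketch of Ψ followed by Φ. Two points are assumptions rather than statements from the slides: the ranking function (in-degree normalized by the maximum in the relation) is hypothetical, and >_f is taken to mean "has a strictly larger weight".

```python
def rank(f, R):
    """Psi(f; R): attach the weight f(t, R) to every tuple t of R."""
    return {t: f(t, R) for t in R}

def order(ranked):
    """Phi([R; f]): assumed here to mean a >_f b iff weight(a) > weight(b)."""
    return {(a, b) for a in ranked for b in ranked if ranked[a] > ranked[b]}

# Hypothetical in-degrees and a ranking function that normalizes them to [0, 1].
in_degree = {"a": 2, "b": 4, "c": 4}

def f(t, R):
    return in_degree[t] / max(in_degree[u] for u in R)

ranked = rank(f, set(in_degree))
gt_f = order(ranked)
print(ranked, sorted(gt_f))
```

Tuples with equal weights (b and c here) stay incomparable, so the result is a strict partial order rather than a total one.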

25 Prune (Ω_k): The prune operator provides a mechanism for retrieving a fixed-size subset of tuples from a relation. In particular, given a relation R, Ω_k(R) selects a subset of size min(k, |R|). For example, consider the ordered relation [R; {a > b; a > c; a > e; f > b; f > c; f > e}] shown in the previous figure, corresponding to the preference for “.com” domains over “.org” domains. Ω_4 on this relation can yield any set of four tuples as long as a and f are both part of the result (thus, 6 possible results). One possible result of applying Ω_4 is [{a; f; e; d}; {a > e; f > e}].
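The count of 6 possible results can be checked by brute force. This sketch assumes that R = {a, b, c, d, e, f} and that a result is admissible when no excluded tuple is preferred over an included one; both are readings of the slide, not definitions quoted from the paper, and `valid_prunes` is an illustrative name.

```python
from itertools import combinations

def valid_prunes(tuples, prefers, k):
    """Enumerate every admissible result of pruning to min(k, |R|) tuples.

    prefers is a set of (higher, lower) pairs. A subset is rejected if it
    keeps some `lower` tuple while leaving out a `higher` tuple that
    dominates it.
    """
    k = min(k, len(tuples))
    results = []
    for subset in combinations(sorted(tuples), k):
        kept = set(subset)
        if not any(hi not in kept and lo in kept for hi, lo in prefers):
            results.append(kept)
    return results

R = {"a", "b", "c", "d", "e", "f"}  # assumed contents of the relation
gt_R = {("a", "b"), ("a", "c"), ("a", "e"), ("f", "b"), ("f", "c"), ("f", "e")}

results = valid_prunes(R, gt_R, 4)
for r in results:
    print(sorted(r))
```

Every admissible subset contains a and f, and the remaining two slots are filled from b, c, d, e, which gives the six combinations.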

26 QUERY OPERATORS: NAVIGATION OPERATORS. Λ with a right arrow (→Λ) denotes forward navigation; Λ with a left arrow (←Λ) denotes backward navigation. These operators are expressed in terms of cross-product and group-by operations.
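A rough sketch of forward and backward navigation over a link relation with the columns (linkID, srcID, destID). The loop below folds the cross-product, the join condition, and the group-by into one pass; the function names and sample links are hypothetical, and the handling of ranks and orders is left out.

```python
def forward(pages, links):
    """Follow links out of `pages`; group the link IDs by destination page."""
    result = {}
    for link_id, src, dest in links:
        if src in pages:
            result.setdefault(dest, set()).add(link_id)
    return result

def backward(pages, links):
    """Follow links into `pages`; group the link IDs by source page."""
    result = {}
    for link_id, src, dest in links:
        if dest in pages:
            result.setdefault(src, set()).add(link_id)
    return result

# Hypothetical link relation L = (linkID, srcID, destID).
L = [(1, "p1", "p2"), (2, "p1", "p3"), (3, "p2", "p3"), (4, "p4", "p1")]

fwd = forward({"p1", "p2"}, L)
bwd = backward({"p3"}, L)
print(fwd, bwd)
```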

27 QUERY OPERATORS: Navigation operators come in two forms: 1. Binary navigation operators 2. Unary navigation operators

28 QUERY OPERATORS: Binary Navigation Operator. 1. Navigation with ranking 2. Navigation with ordering: (a) ordering only on pages; (b) ordering on both pages and links

29 QUERY OPERATORS (navigation with ordering): [R; >_R] = Φ_{pLanguage=English > pLanguage≠English}(R) and [S; >_S] = Φ_{IntraDomain=yes > IntraDomain=no}(S)

30 QUERY OPERATORS (navigation with ranking): In this example the operands are the ranked relations [R; f] and [S; g].

31 QUERY OPERATORS: UNARY NAVIGATION OPERATORS. Instead of choosing from a given set of tuples, these operators permit navigation using all available links in the repository. So if R is ordered and ranked, then each neighbour will also be correspondingly ordered and ranked.

32 EXAMPLES OF COMPLEX QUERIES: Example 1

33 EXAMPLES OF COMPLEX QUERIES: Example 2

34 EXAMPLES OF COMPLEX QUERIES: Example 3

35 EXAMPLES OF COMPLEX QUERIES: Example 4

36 OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES: An optimizer and execution engine were developed to execute complex queries efficiently. The challenges for the system stem from: 1. Certain unique features of Web data sets 2. The storage structures used in Web repositories 3. Characteristics of complex Web queries

37 OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES: As with join operations in relational queries, optimization of navigation operations is crucial for Web queries. There are two techniques to optimize navigation operations: 1. Exploit query locality 2. Exploit prune

38 OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES: PAGE CLUSTERS. To identify and exploit locality during query execution, we partition the entire set of pages in the repository into page clusters. We attempt to group together “related” pages so that all the pages relevant to a complex query are distributed among a relatively small number of clusters.

39 OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES: S-NODE REPRESENTATION. The supernode graph resides in memory. Graph chunks are loaded from disk on demand.

40 EXPERIMENTAL RESULTS: 35-million-page data set (approximately 600 million links and 300 GB of HTML)

41 EXPERIMENTAL RESULTS: 30 Web queries over 5 different 20-million-page data sets

42 RELATED WORK: Drawing inspiration from graph and hypertext query systems, a number of Web query languages have been developed in the past, such as WebSQL, W3QL, and StruQL. These models do not incorporate the notions of ordering and ranking. At the implementation level, these systems are intended for “online” queries and Web-site management, as opposed to our “warehouse” model.

43 CONCLUSION: We addressed the problem of formulating and executing complex queries over Web repositories. We showed that the key characteristic of Web queries is the combination of navigation, text search, and relational operators that can manipulate ordering and ranking. Finally, we discussed some of the optimization techniques used to execute such queries more efficiently.

44 THANK YOU

