Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University.


1 Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University

2 Users Have Many Available Information Sources [Figure: a user sends the query “Houses near Palo Alto for around $300K.” to Source 1, Source 2, ...; Source 1 returns the results h11, h12, h13, ..., and one of the sources returns “Nothing!”.]

3 Challenges: Sources are too numerous. Sources are heterogeneous (query language, model, results). Users want a single query result.

4 Metasearcher: Selects the good sources for a query. Extracts and combines the query results from the sources.

5 Text Sources Rank Query Results [Figure: the query “Distributed Databases” is sent to a text source, which returns ranked results: Doc 1: 0.8, Doc 2: 0.6, ...]

6 Structured Sources on the Internet also Rank Results: A real-estate agent receives queries on Location and Price. Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

7 The Agent Ranks its Houses Based on its Own Scoring Function. Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

8 A Metasearcher then Faces Two Problems: Extracting the top objects from the underlying sources. Merging the results from the various sources.

9 Merging Query Results is Easy with Enough Information: Given a record that carries its Source score together with its Location and Price, the metasearcher ignores the Source score and computes its Target score from the Location and Price.
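A minimal sketch of this merging step, assuming every source returns records that expose Location and Price. The `House` record, the two attribute-scoring helpers, and the choice of Min as the Target function are hypothetical illustrations and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class House:
    source: str          # which source returned the record
    location: str
    price: float
    source_score: float  # score assigned by the source; ignored when merging


def location_score(h, preferred="Palo Alto"):
    # Hypothetical scoring: 1.0 for an exact match, 0.0 otherwise.
    return 1.0 if h.location == preferred else 0.0


def price_score(h, preferred=300_000.0):
    # Hypothetical scoring: decays linearly with relative distance from the preferred price.
    return max(0.0, 1.0 - abs(h.price - preferred) / preferred)


def target_score(h):
    # Assumed Target function: Min over the attribute scores.
    return min(location_score(h), price_score(h))


def merge(results_per_source):
    # Pool the results of all sources and re-rank them by the metasearcher's Target score.
    pooled = [h for results in results_per_source for h in results]
    return sorted(pooled, key=target_score, reverse=True)


source_a = [House("A", "Palo Alto", 330_000.0, 0.9),
            House("A", "Menlo Park", 300_000.0, 0.8)]
source_b = [House("B", "Palo Alto", 300_000.0, 0.2)]
ranked = merge([source_a, source_b])
```

In this toy data the record from source B ranks first even though its own source gave it a low score, because only the recomputed Target score is used.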

10 Extracting the Top Objects from a Source is Hard: The metasearcher’s scoring function might be different from the source’s!

11 We Want to Avoid Extracting All the Source’s Contents: Assume a house h with Source(Q, h) = 0 (worst for source) and Target(Q, h) = 1 (best for metasearcher). Problem!

12 The Example Query is Not Manageable at the Agent: A query Q is manageable at a source if $\exists\, \epsilon < 1$ such that $\mathrm{Source}(Q, h) \ge \mathrm{Target}(Q, h) - \epsilon$ for every object h. [Figure: Source and Target axes from (0,0) to (1,1), with $\epsilon$ marked.]
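The condition can be probed on a sample of objects. The sketch below assumes the condition reads Source(Q, h) ≥ Target(Q, h) − ε and that both scores lie in [0, 1]; the function names are hypothetical. A sample can only estimate ε or show that no ε < 1 works; it cannot prove manageability for the whole source.

```python
def smallest_epsilon(pairs):
    """pairs: iterable of (source_score, target_score) for one query.

    Returns the smallest eps >= 0 such that source >= target - eps for every pair.
    """
    return max((max(0.0, target - source) for source, target in pairs), default=0.0)


def is_manageable(pairs):
    # Manageable on this sample if some eps < 1 satisfies the condition.
    return smallest_epsilon(pairs) < 1.0


sample = [(0.8, 0.9), (0.5, 0.4), (0.3, 0.6)]
eps = smallest_epsilon(sample)

# The house from the earlier slide: worst for the source, best for the metasearcher.
problem_case = [(0.0, 1.0)]
```

For `problem_case` the smallest ε is 1, so the query is not manageable, which matches the example on the previous slide.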

13 Single-Attribute Queries Are More Likely to be Manageable: Single-attribute queries for Q: Q_1: Location = Palo Alto. Q_2: Price = $300K.

14 The Example Becomes Tractable! … if the top Target objects for Q are among the top Source objects for Q_1 and Q_2

15 A Cover Bounds the Target Scores for Q: Single-attribute queries $Q_1, \ldots, Q_m$ form a cover for Q if $\exists\, g_1, \ldots, g_m, G$ such that: $\mathrm{Target}(Q_i, h) \le g_i$ for all $i$ implies $\mathrm{Target}(Q, h) \le G$.
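As an illustration, assume Target(Q, h) is obtained by applying a monotone function such as Min or Max to the single-attribute scores Target(Q_i, h), and that the cover condition uses ≤. Under those assumptions a bound G follows directly from the thresholds g_i. The helper name `cover_bound` is hypothetical.

```python
import itertools


def cover_bound(g, combine):
    """Bound G on Target(Q, h) given per-attribute bounds g.

    If Target(Q_i, h) <= g[i] for every i and `combine` is non-decreasing
    in each argument, then Target(Q, h) = combine(scores) <= combine(g).
    """
    return combine(g)


g = [0.7, 0.4]
G_min = cover_bound(g, min)
G_max = cover_bound(g, max)

# Spot check on a grid: every score vector under the thresholds respects the bound.
grid = [i / 10 for i in range(11)]
violations = [
    (s1, s2)
    for s1, s2 in itertools.product(grid, grid)
    if s1 <= g[0] and s2 <= g[1] and (min(s1, s2) > G_min or max(s1, s2) > G_max)
]
```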

16 Having a Manageable Cover for a Query is Sufficient... Manageable cover for query Q at source S implies “efficient” executions are possible at S.

17 Having a Manageable Cover for a Query is Sufficient...
(1) Pick a manageable cover $C = \{Q_1, \ldots, Q_m\}$ for Q at S
(2) For i = 1 to m: Find $\epsilon_i$ for $Q_i$
(3) Pick $0 \le g_1, \ldots, g_m, G < 1$ for cover C
(4) For i = 1 to m
(5) Retrieve all objects t with $\mathrm{Source}(Q_i, t) \ge G_i = g_i - \epsilon_i$
(6) Compute Target(Q, t) for all objects t retrieved
(7) If $\exists\, i$ such that $G_i \le 0$ Then Go to Step (11)
(8) If for all t retrieved, $\mathrm{Target}(Q, t) \le G$ Then
(9) Find new, lower $0 \le g_1, \ldots, g_m, G < 1$ for C
(10) Go to Step (4)
(11) Output those objects retrieved with the highest Target score
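The steps above can be sketched in Python under several assumptions that are not in the slides: the source is simulated by an in-memory list that is only accessed through threshold retrieval, Target(Q, t) is a monotone combination (here Min) of the single-attribute scores so that G can be taken as `combine(g)`, all thresholds are lowered by a fixed `step`, and the loop stops as soon as one retrieved object scores above G. All names are hypothetical.

```python
def extract_top(objects, source_score, target_attr, combine, eps, step=0.1):
    """Return (top objects, number of objects retrieved).

    source_score(i, t) -- Source(Q_i, t) in [0, 1]
    target_attr(i, t)  -- Target(Q_i, t) in [0, 1]
    combine            -- monotone function mapping attribute scores to Target(Q, t)
    eps                -- eps[i] with Source(Q_i, t) >= Target(Q_i, t) - eps[i]
    """
    m = len(eps)

    def target(t):
        return combine([target_attr(i, t) for i in range(m)])

    g = [1.0 - step] * m                               # step (3): initial thresholds
    while True:
        G = combine(g)                                 # bound implied by the cover
        cutoffs = [g[i] - eps[i] for i in range(m)]    # step (5): G_i = g_i - eps_i
        retrieved = [t for t in objects                # steps (4)-(5): union over Q_i
                     if any(source_score(i, t) >= cutoffs[i] for i in range(m))]
        exhausted = any(c <= 0 for c in cutoffs)       # step (7): source fully read
        found = any(target(t) > G for t in retrieved)  # step (8), negated
        if exhausted or found:
            break
        g = [max(0.0, gi - step) for gi in g]          # step (9): lower the thresholds
    best = max((target(t) for t in retrieved), default=None)
    top = [t for t in retrieved if target(t) == best]  # step (11)
    return top, len(retrieved)


# Toy source: each object is (location score, price score); Source equals Target, so eps = 0.
houses = [(0.95, 0.92), (0.99, 0.2), (0.3, 0.97), (0.5, 0.5), (0.1, 0.1)]
top, n_retrieved = extract_top(
    houses,
    source_score=lambda i, t: t[i],
    target_attr=lambda i, t: t[i],
    combine=min,
    eps=[0.0, 0.0],
)
```

On this toy data the loop stops after the first round: three of the five objects are retrieved, and one of them already scores above G = 0.9, so no unretrieved object can beat it.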

18 Algorithm to Extract Top Target Objects [Figure: axes $Q_1$ and $Q_2$ from 0 to 1, with thresholds $g_1$ and $g_2$ and the region labeled $\mathrm{Target}(Q, h) \le G$.]

19 Algorithm to Extract Top Target Objects [Figure: the same axes with lowered thresholds $g_1'$ and $g_2'$ and the region labeled $\mathrm{Target}(Q, h) \le G'$; an object $h'$ is marked, with $\mathrm{Target}(Q, h')$ compared against $G'$.]

20 Preliminary Performance Results for our Algorithm (10,000 objects, 4 query attributes, $\epsilon = 0$): Target = Min: 14% objects retrieved. Target = Max: 4% objects retrieved.

21 Preliminary Performance Results for our Algorithm (10,000 objects, 4 query attributes, $\epsilon = 0.10$): Target = Min: 25% objects retrieved. Target = Max: 44% objects retrieved.

22 Having a Manageable Cover for a Query is Also Necessary... No manageable cover for query Q at source S implies efficient executions are impossible at S.

23 A Manageable Cover is Necessary: Proof. Consider $Q_1, Q_2, Q_3$, a minimal cover for Q with $Q_1, Q_2$ manageable and $Q_3$ not manageable. For any “efficient” execution, build h such that: h is not retrieved, and $\mathrm{Target}(Q, h) > G = \max\{\mathrm{Target}(Q, o) \mid o \text{ retrieved}\}$.

24 A Manageable Cover is Necessary: Proof [Figure: axes $Q_1$, $Q_2$, and $Q_3$ from 0 to 1, with thresholds $g_1$, $g_2$, and $g_3$.]

25 [Figure: an object $h'$ is marked, with $\mathrm{Target}(Q, h') > G$.]

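26 [Figure: objects $h$ and $h'$ are marked, with $\mathrm{Target}(Q, h') > G$ and $\mathrm{Target}(Q, h) > G$; a further label relates $\mathrm{Target}(Q_3, h)$ to $\mathrm{Target}(Q, h')$.]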

27 We Studied Two Metasearching Problems: Extracting the top objects from the underlying sources. Merging the results from the various sources.

28 Related Work: Collection Fusion: Voorhees et al.; Callan/Lu/Croft; Gauch/Wang.

