Download presentation
Presentation is loading. Please wait.
1
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis & Luis Gravano
2
Outline Introduction Background Focused Probing for Content Summary Construction Exploiting Topic Hierarchies for Database Selection Experiments: Data and Metrics Experimental Results Conclusion and Future Work
3
Introduction Search engines create their indexes by spidering or crawling Web pages Hidden Web sources store their content in searchable databases
4
An Example… Searching in the medical database CANCERLIT – www.cancer.gov, the query [ lung and cancer ] returns 68,430 matches.www.cancer.gov Searching Google with the query [“lung and cancer” site: www.cancer.gov ] returns 23 matcheswww.cancer.gov
5
Meta - searchers Tools for searching Hidden data sources Relies on statistical or content summaries Performs three main tasks: Database Selection Query Translation Result Merging
6
This Paper Presents An algorithm to derive content summaries from “uncooperative” databases A Database Selection Algorithm that exploits: Extracted content summaries Hierarchical classification of the databases
7
Background: Existing Database Selection Algorithms
8
Contd. : Database Selection Algorithms Assumption: Query words are independently distributed over database documents. The answer to a query is the set of all the documents that satisfy the Boolean expression. Deficiency: Are the content summaries accurate and up to date?
9
Uniform Probing for Content Summary Construction Extracts a document sample from a given database, D and computes the frequency of each observed word w in the sample, SampleDF(w)
10
The Algorithm: Start with an empty content summary where SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary. Pick a word and send it as a query to database D. Retrieve the top-k documents returned. If the number of retrieved documents exceeds a pre-specified threshold, stop. Else continue the sampling process by returning to Step 2.
11
2 Versions of Algorithm RS-Ord : RandomSampling-OtherResource RS-Lrd : RandomSampling-LearnedResource
12
Deficiencies: ActualDF(w) for each word w is not revealed RS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches
13
Database Classification Rationale : Queries closely associated with topical categories retrieve mainly documents about that category Place the database in a classification scheme based on the number of matches
14
Automation and Hierarchical Classification Automates classification by queries derived automatically from a rule-based document classifier. A rule-based classifier is a set of logical rules defining classification decisions. jordan AND bulls-->Sports, hepatitis-->Health Apply this principle recursively to create a hierarchical classifier.
15
Focused Probing Sends query probes, and extracts number of matches without retrieving any documents. Calculates two metrics, the Coverage(Ci) and Specificity(Ci) for the subcategory Ci If the values of Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into a category Ci
16
Author’s Algorithm Exploit Topic Hierarchy Produce a document sample that Is topically representative of the contents Gives accurate and efficient content summary
17
Content-Summary Construction Steps of the algorithm: Query the database using focused probing to: Retrieve a document sample. Generate a preliminary content summary Categorize the database. Estimate the absolute frequencies of the words retrieved from the database.
18
Building Content Summaries from Extracted Documents ActualDF(w): The actual number of documents in the database that contain word w. The algorithm knows this number only if [w] is a single word query probe that was issued to the database SampleDF(w): The number of documents in the extracted sample that contain word w.
19
Focused Probing for Content Summary Construction
22
Estimating Absolute Document Frequencies Use Mandelbrot’s equation P(r+p) -B for distribution of words for estimating unknown ActualDF (¢) frequencies. Sort words in descending order of their SampleDF(¢) frequencies Focus on words with known ActualDF (¢) frequencies. Find the P, B, and p parameter values that best fit the data. Estimate ActualDF (wi) for all words wi with unknown ActualDF (wi) as P(ri+p) -B
23
Example
24
Creating Content Summaries for Topic Categories Example: “metastasis” did not appear in any of the documents sampled from CANCERLIT during probing Cancer-BACUP classified under “Cancer”, has a high ActualDF est (metastasis) = 3, 569 Convey this information by associating a content summary with category “Cancer” that is obtained by merging the summaries of all databases under this category In merged summary, ActualDF est (w) is sum of the document frequency of w for databases under this category
25
Creating Content Summaries for Topic Categories
26
Selecting Databases Hierarchically: Algorithm Inputs : a query Q, target databases K, top category C Steps : HierSelect(Query Q, Category C, int K) 1: Use a flat database selection algorithm to assign a score for Q to each subcategory of C 2: if there is a subcategory C with a non-zero score 3: Pick the subcategory Cj with the highest score 4: if NumDBs(Cj) >= K //Cj has enough databases 5: return HierSelect(Q,Cj,K) 6: else // Cj does not have enough databases 7: return DBs(Cj) FlatSelect(Q,C-Cj,K-NumDBs(Cj)) 8: else // no subcategory C has non-zero score 9: return FlatSelect(Q,C,K)
27
Example: Topic hierarchy for database selection (babe AND ruth,k=3)
28
Experiments :Data and Metrics Evaluate two main sets of techniques: 1. Content-summary construction techniques 2. Database selection techniques Evaluate the algorithms, using two data sets Controlled Database Set Web Database Set
29
Data Sets Controlled Database Set 500,000 newsgroup articles from 54 newsgroup 81,000 articles to train documents classifiers over the 72 – node topic hierarchy 419,000 articles to build the set of Controlled Databases Contained 500 databases ranging in size from 25 to 25,000 documents.
30
Data Sets Web Database Set 50 real web accessible databases with no control over it. Databases picked randomly from two directories of hidden-web databases, namely InvisibleWeb and Complete Planet
31
Content-summary construction Test variations of Focused Probing technique against RS-Ord and RS-Lrd. Focused Probing: Evaluated configurations with different underlying document classifiers for query-probe creation. Different values for the thresholds Ts and Tc Varied the specificity threshold Ts from 0 to 1 Fixed coverage threshold to Tc = 10.
32
Database Selection Effectiveness Underlying Database selection algorithm: Hierarchical algorithm Relies on a “flat” database selection algorithm. Chose algorithms: CORI, bGlOSS Adapted both algorithms to work with category content summary.
33
Database Selection Effectiveness Content Summary Construction Evaluated how the hierarchical database selection algorithm behaved over content summaries generated by different techniques Also studied QPilot Strategy Exploits HTML links to characterize text databases.
34
Content Summary Quality Metric : content summaries coverage of the actual database vocabulary ctf = Σ wєTr ActualDF(w) / Σ wєTd ActualDF(w) Tr = set of terms in content summary, Td = complete set of words in vocabulary Results: Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd. The coverage of the Focused Probing summaries increases for lower thresholds of Ts
35
Content Summary Quality Correlation of word rankings: Used Spearman Rank Correlation Coefficient (SRCC ) – to measure how well a content summary orders words by frequencies with respect to the actual word frequency order in the database. Result :The Focused Probing method have higher SRCC values than the RS-Ord and RS- Lrd.
36
Content Summary Quality
37
Content Summary Quality - Efficiency Focused Probing techniques on average retrieve one document per query sent RS-Lrd retrieves about one document per two queries. RS-Ord unnecessarily issues many queries that produce no document matches.
38
Content Summary Quality Produce significantly better-quality summaries than RS-Ord and RS-Lrd do in terms of vocabulary coverage and word ranking preservation.
40
Database Selection Effectiveness Methodology: Web set of real web-accessible databases 50 queries from the Web Track of TREC Each database selection algorithm picked 3 databases for the query Retrieved the top 5 documents for the query. Human evaluators to judge the relevance of each retrieved document for the query
41
Database Selection Effectiveness Measured the precision of a technique for each query q as : Average precision of different database selection algorithms.
42
Database Selection Effectiveness Analysis: All the flat selection techniques suffer from incomplete coverage of the underlying probing-generated summaries. QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.
43
Hierarchical vs. flat database selection The hierarchical algorithm using CORI as flat database selection has 50% better precision than CORI for flat selection with the same content summaries. For bGlOSS, the improvement is 92%. Reason: Topic hierarchy compensates for incomplete content summaries.
44
Hierarchical vs. flat database selection Measured fraction of times that hierarchical database selection algorithm picked a database for a query That produced matches for the query And was given a zero score by the flat database selection algorithm of choice.
45
Conclusion Presented a novel and efficient method for the construction of content summaries of web accessible text databases Presented a hierarchical database selection algorithm that exploits the database content summaries Algorithm generated classification to produce accurate results even for imperfect content summaries.
46
Future Work Alternative hierarchy traversing techniques. For example, “route” queries to multiple categories if appropriate. Examine the effect of absolute frequency estimation on database selection. Alternative methods for creating content summaries.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.