Panagiotis G. Ipeirotis Luis Gravano

1 Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Panagiotis G. Ipeirotis, Luis Gravano. Computer Science Department, Columbia University

2 Distributed Search? Why? “Surface” Web vs. “Hidden” Web
“Surface” Web: link structure; crawlable; documents indexed by search engines.
“Hidden” Web: no link structure; documents “hidden” in databases; documents not indexed by search engines; need to query each collection individually.
2/24/2019 Columbia University

3 Hidden Web: Examples
PubMed search: [diabetes] 178,975 matches
PubMed is at
Google search: [diabetes site:  119 matches

Database | Query | Matches | Google
PubMed | diabetes | 178,975 | 119
U.S. Patents | wireless network | 16,741 |
Library of Congress | visa regulations | >10,000 |

4 Distributed Search: Challenges
Hidden Web metasearcher over PubMed, Library of Congress, ESPN.
Content summaries of databases (vocabulary, word frequencies), e.g. [kidneys 220,000; stones 40,000; ...], [kidneys 5; stones 40], [kidneys 20; stones 950].
Select good databases for query.
Evaluate query at these databases.
Merge results from databases.

5 Database Selection Problems
How to extract content summaries? Example: Web Database → [basketball 4; cancer 4,532; cpu 23].
How to use the extracted content summaries? Example: the query [cancer] reaches the Metasearcher, which holds these summaries:
Web Database 1: basketball 4; cancer 4,532; cpu 23
Web Database 2: basketball 4; cancer ,298; cpu 0
Web Database 3: basketball 6,340; cancer 2; cpu 0

6 Extracting Content Summaries from Web Databases
No direct access to remote documents other than by querying.
Resort to query-based document sampling: send queries to the database; retrieve a document sample; use the sample to create an approximate content summary.

7 “Random” Query-Based Sampling
Pick a word and send it as a query to the database.
Retrieve the top-k documents returned (e.g., k=4).
Repeat until “enough” (e.g., 300) documents are retrieved.
Use word frequencies in the sample to create the content summary.
Callan et al., SIGMOD’99, TOIS 2001

Word | Frequency in Sample
cancer | 150 (out of 300)
aids | 114 (out of 300)
heart | 98 (out of 300)
basketball | 2 (out of 300)
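The sampling loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the implementation from Callan et al.: `search` is a hypothetical callable standing in for the database's query interface, the toy `docs` collection is made up, and growing the pool of query words from retrieved documents is an assumption about how "pick a word" is done.

```python
import random

def query_based_sample(search, seed_words, k=4, target=300, max_queries=1000, rng=None):
    """Sample documents from a database that can only be reached by querying.

    `search(word, k)` is a hypothetical callable returning up to k
    (doc_id, text) pairs for a one-word query.
    """
    rng = rng or random.Random(0)
    vocabulary = list(seed_words)   # candidate query words
    known = set(vocabulary)
    sample = {}                     # doc_id -> text
    for _ in range(max_queries):
        if len(sample) >= target or not vocabulary:
            break
        word = rng.choice(vocabulary)
        for doc_id, text in search(word, k):
            if doc_id in sample:
                continue
            sample[doc_id] = text
            # Assumption: new words from retrieved documents become query candidates.
            for w in text.lower().split():
                if w not in known:
                    known.add(w)
                    vocabulary.append(w)
    return sample

def sample_frequencies(sample):
    """For each word, count the sampled documents that contain it."""
    counts = {}
    for text in sample.values():
        for word in set(text.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    return counts

# Toy stand-in for a remote database (made-up documents).
docs = {1: "cancer aids", 2: "cancer heart", 3: "basketball cancer"}

def toy_search(word, k):
    return [(i, t) for i, t in docs.items() if word in t.split()][:k]

sample = query_based_sample(toy_search, ["cancer"], k=4, target=3)
freqs = sample_frequencies(sample)
```

The resulting `freqs` are frequencies within the sample only, which is exactly the limitation the next slide points out.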

8 Random Sampling: Problems
No actual word frequencies computed for content summaries, only a “ranking” of words.
Many words missing from content summaries (many rare words).
Many queries return very few or no matches.
Zipf’s law: many words appear in only one or two documents.
(Figure: number of documents vs. word rank.)

9 Our Technique: Focused Probing
Train document classifiers; find representative words for each category.
Use classifier rules to derive a topically-focused sample from the database.
Estimate actual document frequencies for all discovered words.

10 Focused Probing: Training
Start with a predefined topic hierarchy and preclassified documents.
Train document classifiers for each node.
Extract rules from classifiers (SIGMOD 2001):
Root: ibm AND computers → Computers; lung AND cancer → Health
Health: angina → Heart; hepatitis AND liver → Hepatitis

11 Focused Probing: Sampling
Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries for the database.
Transform each rule into a query.
For each query: send it to the database; record the number of matches; retrieve the top-k matching documents.
At the end of the round: analyze the matches for each category; choose the category to focus on.
Output: a representative document sample; actual frequencies for some “important” words.
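One round of this procedure might look like the sketch below. The `search` callable, the rule format, and the match counts are all hypothetical, and focusing on the category with the most matches is a simplifying assumption; the slides do not state the exact criterion used to choose a category.

```python
def probing_round(search, rules, k=4):
    """Run one round of rule-based probing at a single hierarchy node.

    `rules` is a list of (query_terms, category) pairs, e.g.
    (("lung", "cancer"), "Health"). `search(terms, k)` is a hypothetical
    callable returning (num_matches, top_k_documents) for an AND query.
    """
    matches_per_category = {}
    match_counts = {}   # query terms -> number of matches reported
    sample = []         # documents retrieved in this round
    for terms, category in rules:
        num_matches, top_docs = search(terms, k)
        match_counts[terms] = num_matches
        matches_per_category[category] = (
            matches_per_category.get(category, 0) + num_matches
        )
        sample.extend(top_docs)
    # Assumption: focus next on the category with the most matches.
    focus = None
    if matches_per_category:
        focus = max(matches_per_category, key=matches_per_category.get)
    return focus, sample, match_counts

# Made-up match counts for a fake database.
fake_matches = {("ibm", "computers"): 12, ("lung", "cancer"): 9500}

def fake_search(terms, k):
    n = fake_matches.get(terms, 0)
    return n, [f"doc about {' '.join(terms)} #{i}" for i in range(min(k, n))]

root_rules = [(("ibm", "computers"), "Computers"), (("lung", "cancer"), "Health")]
focus, sample, counts = probing_round(fake_search, root_rules)
```

The next round would repeat the same call with the rules attached to the chosen category, so the sample concentrates on the database's dominant topics.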

12 Sample Frequencies and Actual Frequencies
In the sample: “liver” appears in 200 out of 300 documents; “kidney” appears in 100 out of 300 documents; “hepatitis” appears in 30 out of 300 documents.
Document frequencies in the actual database? Query “liver” returned 140,000 matches; query “hepatitis” returned 20,000 matches; “kidney” was not a query probe…
Can exploit the number of matches from one-word queries.

13 Adjusting Document Frequencies
We know the ranking r of words according to document frequency in the sample.
We know the absolute document frequency f of some words from one-word queries.
Mandelbrot’s formula empirically connects word frequency f and ranking r.
We use curve-fitting to estimate the absolute frequency of all words in the sample.
(Figure: frequency f plotted against rank r.)
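As a rough illustration of the curve-fitting step, the sketch below fits the commonly cited form of Mandelbrot's formula, f = P (r + p)^(-B), by least squares in log-log space. Holding p fixed (here at 0) is an assumption made to keep the fit linear, and the function names are hypothetical. The data points reuse the “liver” (rank 1, 140,000 matches) and “hepatitis” (rank 3, 20,000 matches) numbers from the previous slide to estimate “kidney” (rank 2).

```python
import math

def fit_mandelbrot(known, p=0.0):
    """Fit f = P * (r + p) ** (-B) by least squares in log-log space.

    `known` maps sample rank r -> absolute document frequency f, for words
    whose frequency is known from one-word query match counts.
    `p` is held fixed (an assumption that keeps the fit linear).
    Returns (P, B).
    """
    if len(known) < 2:
        raise ValueError("need at least two (rank, frequency) points")
    xs = [math.log(r + p) for r in known]
    ys = [math.log(f) for f in known.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def estimate_frequency(rank, P, B, p=0.0):
    """Estimate the absolute document frequency of the word at `rank`."""
    return P * (rank + p) ** (-B)

# Ranks come from the sample; frequencies from one-word query match counts.
known = {1: 140_000, 3: 20_000}
P, B = fit_mandelbrot(known)
kidney_estimate = estimate_frequency(2, P, B)
```

With only two known points the fit passes through both exactly; with more probe words it becomes a genuine least-squares fit, and the estimate for an unprobed word such as “kidney” falls between its neighbors in rank.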

14 Actual PubMed Content Summary
Number of Documents: 3,868,552
category: Health, Diseases

Word | Document frequency
cancer | 1,398,178
aids | 106,512
heart | 281,506
hepatitis | 23,481
basketball | 907
cpu | 487

Extracted automatically: ~27,500 words in the extracted content summary; fewer than 200 queries sent; at most 4 documents retrieved per query.
The extracted content summary accurately represents the size, contents, and classification of the database.

15 Focused Probing: Contributions
Focuses database sampling on dense topic areas.
Estimates absolute document frequencies of words.
Classifies databases along the way; classification is useful for database selection.

16 Database Selection Problems
How to extract content summaries? Example: Web Database → [basketball 4; cancer 4,532; cpu 23].
How to use the extracted content summaries? Example: the query [cancer] reaches the Metasearcher, which holds these summaries:
Web Database 1: basketball 4; cancer 4,532; cpu 23
Web Database 2: basketball 4; cancer ,298; cpu 0
Web Database 3: basketball 6,340; cancer 2; cpu 0

17 Database Selection and Extracted Content Summaries
Database selection algorithms assume complete content summaries.
Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf's law).
Queries with undiscovered words are problematic.
Database classification helps: similar topics ↔ similar content summaries; extracted content summaries complement each other.

18 Content Summaries for Categories: Example
Cancerlit contains “metastasis”, but the word was not found during sampling.
CancerBacup contains “diabetes”, but the word was not found during sampling.
The content summary of the Cancer category contains both.
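A minimal sketch of how a category summary could fill these gaps is given below. Summing document frequencies across the databases in a category is an assumption (the deck lists other merging techniques as future work), and all counts are made up for illustration.

```python
def merge_summaries(db_summaries):
    """Aggregate per-database summaries into one category summary.

    `db_summaries` maps database name -> {word: document frequency}.
    Assumption: frequencies of the same word are simply added.
    """
    merged = {}
    for summary in db_summaries.values():
        for word, freq in summary.items():
            merged[word] = merged.get(word, 0) + freq
    return merged

# Hypothetical sampled summaries: each one misses a word the database contains.
cancerlit = {"cancer": 900, "diabetes": 110}      # "metastasis" undiscovered
cancerbacup = {"cancer": 95, "metastasis": 35}    # "diabetes" undiscovered

cancer_category = merge_summaries({"Cancerlit": cancerlit, "CancerBacup": cancerbacup})
```

A query containing “metastasis” now matches the Cancer category even though the sampled Cancerlit summary alone would score zero for it.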

19 Hierarchical DB Selection: Outline
Create aggregated content summaries for categories.
Hierarchically direct queries using categories.
Category content summaries are more complete than database content summaries.
Various traversal techniques are possible.

20 Hierarchical DB Selection: Example
To select D databases:
Use a “flat” DB selection algorithm to score categories.
Proceed to the category with the highest score.
Repeat until the category is a leaf, or the category has fewer than D databases.
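The descent described here can be sketched as below. The `Category` class, the toy hierarchy, and the `HeartDB` database are hypothetical, and `flat_score` is a deliberately simple stand-in for a real flat selection algorithm. What happens once the walk stops (for example, how to top up the result when the final category holds fewer than D databases) is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    summary: dict      # word -> aggregated document frequency
    num_docs: int      # documents covered by the category
    databases: list    # names of all databases under this category
    children: list = field(default_factory=list)

def flat_score(query_words, summary, num_docs):
    """Stand-in flat scorer: sum of relative document frequencies."""
    if num_docs == 0:
        return 0.0
    return sum(summary.get(w, 0) / num_docs for w in query_words)

def select_category(root, query_words, d):
    """Walk down the hierarchy, following the best-scoring child category.

    Stops when the current category is a leaf or has fewer than d databases.
    """
    node = root
    while node.children and len(node.databases) >= d:
        node = max(
            node.children,
            key=lambda c: flat_score(query_words, c.summary, c.num_docs),
        )
    return node

# Made-up hierarchy and counts.
cancer = Category("Cancer", {"cancer": 995, "metastasis": 35}, 2000,
                  ["Cancerlit", "CancerBacup"])
heart = Category("Heart", {"heart": 800, "angina": 90}, 1000, ["HeartDB"])
health = Category("Health", {"cancer": 995, "metastasis": 35, "heart": 800}, 3000,
                  ["Cancerlit", "CancerBacup", "HeartDB"], [cancer, heart])
sports = Category("Sports", {"basketball": 6340, "cancer": 2}, 7000, ["ESPN"])
root = Category("Root", {}, 10000,
                ["Cancerlit", "CancerBacup", "HeartDB", "ESPN"], [health, sports])

chosen = select_category(root, ["metastasis"], d=2)
```

Because scoring uses category summaries, the query reaches the Cancer databases even if neither database's own sampled summary is complete.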

21 Experiments: Content Summary Extraction
Focused Probing compared to Random Sampling: better vocabulary coverage; better word ranking; more efficient for the same sample size; more effective for the same sample size.
Reasons: ignores “off-topic” documents; better sample, since each retrieved document “represents” many unretrieved ones, so “on-topic” sampling helps; retrieves the same number of documents using fewer queries; topic detection helps.
(Figure: word rankings in the actual database vs. in the sample, for the words aids, basketball, cancer, heart, pneumonia.)
More results in the paper: 4 types of classifiers (SVM, Ripper, C4.5, Bayes), frequency estimation, different data sets…

22 Experiments: Database Selection
Data set and workload: 50 real Web databases; 50 TREC Web Track queries.
Metric (15 documents per query): for each query, pick 3 databases; retrieve 5 documents from each database; return 15 documents to the user; mark “relevant” and “irrelevant” documents.
Good database selection algorithms choose databases with relevant documents.
(Figure: Query → Database Selection → databases.)
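The evaluation protocol amounts to computing precision over the 15 returned documents. A small sketch follows, with made-up document identifiers and relevance judgments; the function names are hypothetical.

```python
def merge_results(per_database_results, docs_per_db=5):
    """Take the top documents from each selected database."""
    merged = []
    for results in per_database_results:
        merged.extend(results[:docs_per_db])
    return merged

def precision(returned, relevant):
    """Fraction of returned documents that were marked relevant."""
    if not returned:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(returned)

# Three selected databases, each returning ranked document ids (made up).
db_results = [
    ["a1", "a2", "a3", "a4", "a5", "a6"],
    ["b1", "b2", "b3", "b4", "b5"],
    ["c1", "c2", "c3", "c4", "c5"],
]
relevant_docs = {"a1", "a3", "b2", "c5", "a6"}   # "a6" is never returned

returned = merge_results(db_results)
p = precision(returned, relevant_docs)   # 4 relevant out of 15 returned
```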

23 Experiments: Precision of Database Selection Algorithms
Content summary extraction | Hierarchical | Flat
Focused Probing | 0.27 | 0.17
Random Sampling | - | 0.18

Hierarchical database selection improves precision drastically: category content summaries are more complete; topic-based database clustering helps.
Best result for centralized search: ~0.35. Not an option for the Hidden Web!
More results in the paper (different flat selection algorithms, more content summary extraction algorithms…).

24 Contributions
Technique for extracting content summaries from completely autonomous Hidden-Web databases.
Technique for estimating frequencies: possible to distinguish large from small databases.
Hierarchical database selection exploits classification, drastically improving the precision of distributed search.
Content summary extraction implemented and available for download at:

25 Future Work
Different techniques for merging content summaries for category content summary creation.
Effect of frequency estimation on database selection.
Different hierarchy “traversing” algorithms for hierarchical database selection.

