
Livnat Sharabani SDBI 2006 The Hidden Web

2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” (Panagiotis G. Ipeirotis, Luis Gravano, Columbia University, VLDB 2002) and “When one sample is not enough: Improving text database selection using shrinkage” (Panagiotis G. Ipeirotis, Luis Gravano, Columbia University, SIGMOD 2004).

3 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

4 What is the hidden web? The “hidden web” (or “invisible web”) is what you cannot retrieve (“see”) in search results. The “surface web” (or “visible web”) is what you do see in the result pages of general web search engines.

5 “Surface” web vs. “Hidden” web

6 Why Are Some Pages Invisible? Technical barriers: pages that require typing or human judgment, and dynamically generated pages. Pages search engines choose to exclude: links containing ‘?’ (which can be a spider trap) and Flash or Shockwave content (spiders are optimized for HTML).

7 The hidden web - the majority: text databases on the web that are “hidden” behind search interfaces.

8 “Surface” web vs. “Hidden” web. Surface web: link structure; the content is crawlable; the content is indexed by search engines like Google. Hidden web: documents are “hidden” in databases; the content is not crawlable; each collection must be queried individually.

9 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

10 Metasearchers. A metasearcher is a tool for searching over multiple hidden-web databases simultaneously through a single query interface. A metasearcher performs three main tasks: database selection, query translation, and result merging. [Diagram: the user's query goes through the metasearcher to DB1, DB2, DB3 on the web, and the merged results are returned.]

11 DB Content Summary. Statistics that characterize the database content: the document frequencies (df) of the words appearing in the database and the number of documents stored in the database. Examples: CANCERLIT (NumDocs: 148,944): breast df 121,134, cancer df 91,688. CNN.fn (NumDocs: 44,730): breast df 124, cancer df 44.

12 Typical DB Selection Algorithm. A typical database selection algorithm relies on the database content summaries to make its decisions. Given a content summary, the algorithm estimates how relevant the database is for a given query.

13 bGlOSS. The algorithm calculates the number of documents expected to contain all the query words. Example: for the query “breast cancer”, bGlOSS calculates: CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688, so 148,944 * (121,134/148,944) * (91,688/148,944) ≈ 74,569. CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44, so 44,730 * (124/44,730) * (44/44,730) ≈ 0.
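The estimate above is simple enough to capture in a few lines of code. The following sketch is only an illustration (not the paper's implementation); the function and variable names are mine, and the numbers are taken from the slide's example content summaries.

```python
# Minimal sketch (not the paper's implementation): a bGlOSS-style estimate of how
# many documents in a database match all query words, assuming word independence.
def bgloss_score(num_docs, df, query_words):
    """num_docs: total documents in the database; df: dict mapping word -> document frequency."""
    score = num_docs
    for w in query_words:
        score *= df.get(w, 0) / num_docs
    return score

# Numbers from the slide's example content summaries.
cancerlit = {"num_docs": 148_944, "df": {"breast": 121_134, "cancer": 91_688}}
cnn_fn = {"num_docs": 44_730, "df": {"breast": 124, "cancer": 44}}

for name, db in [("CANCERLIT", cancerlit), ("CNN.fn", cnn_fn)]:
    print(name, round(bgloss_score(db["num_docs"], db["df"], ["breast", "cancer"])))
# Prints roughly 74,569 for CANCERLIT and 0 for CNN.fn, matching the slide.
```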

14 Database Selection. Database selection is based on the content summaries. How does the metasearcher obtain a database's content summary? It can be exported by the database itself, generated manually as a description, or extracted automatically from the searchable text database.

15 Content Summary construction. Pioneering work by J. Callan and M. Connell was presented at SIGMOD ’99. Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.

16 Content Summary construction. The algorithm:
1. Start with a comprehensive word dictionary.
2. Pick a word and send it as a query to database D.
3. Retrieve the top k documents returned.
4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling; otherwise return to step 2.
5. For each word w in the retrieved documents, calculate SampleDF(w).
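As a rough illustration of this sampling loop, here is a sketch in Python. The `search(word, k)` callable stands in for the database's query interface and `dictionary` for the external word list; these, and all other names, are assumptions for illustration rather than the paper's code.

```python
import random

# Minimal sketch of Callan & Connell-style query-based sampling (not their code).
def query_based_sample(search, dictionary, k=4, max_docs=300, max_probes=1000):
    """Probe the database with random dictionary words, keep the top-k results of
    each probe, and compute SampleDF over the collected sample."""
    sample = []
    for _ in range(max_probes):
        if len(sample) >= max_docs:          # pre-specified threshold reached: stop sampling
            break
        word = random.choice(dictionary)     # RS-Ord; RS-Lrd would pick a word from `sample`
        for doc in search(word, k):          # top-k documents returned for the probe word
            if doc not in sample:
                sample.append(doc)
    sample_df = {}
    for doc in sample:
        for w in set(doc.lower().split()):   # crude tokenization, for illustration only
            sample_df[w] = sample_df.get(w, 0) + 1
    return sample, sample_df
```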

17 Content Summary construction. There are two main versions of this algorithm, which differ in how they pick words from the dictionary: RS-Ord (Random Sampling, Other resource) picks a random word from the dictionary; RS-Lrd (Random Sampling, Learned resource) picks a word from previously retrieved documents. Neither version retrieves the actual document frequency of each word w, so two databases that differ significantly in size might be assigned similar content summaries.

18 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

19 Database Classification. Classifying a database into a hierarchy of topics is another way to characterize its content. Example: CANCERLIT can be classified under the category “Health”.

20 Topics hierarchy: [diagram of the category tree, e.g. a root with children such as Sport, Health, Arts, and Science, and subcategories such as Basketball and Soccer under Sport].

21 Automatic Document Classifier I. Queries closely associated with a topical category retrieve mainly documents about that category. Example: “breast” AND “cancer” is likely to retrieve documents related to health. By observing the number of matches each query generates at a database, we can classify the database. Example: if a database generates many matches for queries associated with health and few matches for other categories, we can classify it under the category Health.

22 Automatic Document Classifier II. A rule-based document classifier uses a set of rules that define classification decisions. Examples: “Jordan” AND “basketball” => sports; “hepatitis” => health. A database can be classified into more than one category.

23 Automatic Document Classifier III. For each subcategory c_i the algorithm defines: Coverage(c_i), the number of documents estimated to belong to c_i, and Specificity(c_i), the fraction of documents estimated to belong to c_i. The algorithm classifies a database into a category c_i if Coverage(c_i) and Specificity(c_i) both exceed two pre-specified thresholds.

24-33 Example (worked through step by step).
Rules: “soccer” => sport; “basketball” => sport; “diet” => health; “diabetes” => health.
Pre-defined thresholds: Coverage(c_i) = 100, Specificity(c_i) = 0.5.
Document frequencies: soccer 300, basketball 200, diet 140, diabetes 12, cancer 250.
Coverage(sport) = 300 + 200 = 500.
Coverage(health) = 162.
Specificity(sport) = 500 / (500 + 162) = 0.76.
Specificity(health) = 162 / (500 + 162) = 0.24.
Result: sport has coverage 500 and specificity 0.76, so it passes both thresholds; health has coverage 162 but specificity 0.24, so it does not.
The word “cancer” does not appear in any rule, so it affects neither coverage nor specificity.
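The whole classification decision can be summarized in a short sketch. This is not the QProber implementation; it simply plugs the example's rules, thresholds, and document frequencies into the Coverage/Specificity definitions above, and every name in it is mine.

```python
# Minimal sketch (not QProber itself): compute Coverage and Specificity from
# rule-match counts and decide which categories the database belongs to.
# `match_counts` would come from probing the database with each rule's query.
def classify(match_counts, rules, cov_threshold=100, spec_threshold=0.5):
    """rules: word -> category; match_counts: word -> number of matching documents."""
    coverage = {}
    for word, category in rules.items():
        coverage[category] = coverage.get(category, 0) + match_counts.get(word, 0)
    total = sum(coverage.values()) or 1               # avoid division by zero
    specificity = {c: coverage[c] / total for c in coverage}
    return [c for c in coverage
            if coverage[c] > cov_threshold and specificity[c] > spec_threshold]

# The example's rules and document frequencies: only "sport" passes both thresholds.
rules = {"soccer": "sport", "basketball": "sport", "diet": "health", "diabetes": "health"}
match_counts = {"soccer": 300, "basketball": 200, "diet": 140, "diabetes": 12, "cancer": 250}
print(classify(match_counts, rules))                  # ['sport']
```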

34 QProber View Demo

35 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

36 Construct Content Summary. Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.

37 Construct Content Summary (step 1: retrieve a document sample).

38 Document Sample. Document sample for category c:
newdocs = Ø
For each subcategory c_i of c:
  For each query q relevant for c_i:
    newdocs = newdocs U {top k documents returned for q}
    If q consists of a single word w, then ActualDF(w) = #matches returned for q.
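A sketch of this sampling step, under assumptions: `search(query, k)` and `match_count(query)` stand in for the database's query interface, and `category_rules[c]` maps a category to the probe queries of its subcategories (each query is a list of words). None of these names come from the papers.

```python
# Minimal sketch of the per-category document sampling (not the papers' code).
def sample_for_category(search, match_count, category_rules, category, k=4):
    """Probe the database with the queries of each subcategory of `category`,
    collecting a document sample and ActualDF for the single-word probe queries."""
    newdocs, actual_df = [], {}
    for queries in category_rules[category].values():
        for q in queries:                            # e.g. ["jordan", "bulls"] or ["maradona"]
            for doc in search(q, k):                 # top-k documents returned for q
                if doc not in newdocs:
                    newdocs.append(doc)
            if len(q) == 1:                          # single-word query: the reported
                actual_df[q[0]] = match_count(q)     # number of matches is its ActualDF
    return newdocs, actual_df
```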

39 Document Sample - Example I. [Hierarchy diagram: START with children Sport (Basketball, Soccer), Health, Arts, Science.] Rules for Sport: “Jordan” AND “Bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc. Rules for Health: “diabetes”, “diet” AND “fat”, “stomach”, etc. For the single-word queries we know ActualDF(·).

40 Construct Content Summary (step 2: generate a preliminary content summary).

41 Content Summary. Build the content summary for category c: for each word w in newdocs, SampleDF(w) = #documents in newdocs that contain w.

42 Construct Content Summary (step 3: categorize the database).

43 Categorizing the Database. The algorithm is recursive: we go down the topic hierarchy according to Coverage and Specificity. Categorization: if Coverage(c_i) > threshold1 and Specificity(c_i) > threshold2, then call getContentSummary(c_i).
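Putting the pieces together, a hedged sketch of the recursive descent might look like this. It reuses the hypothetical `sample_for_category` above, and it approximates coverage by summing the match counts of each subcategory's probe queries, which is a simplification of the estimators used in the papers; thresholds and names are assumptions.

```python
# Minimal sketch of the recursive getContentSummary descent (not the papers' code).
def get_content_summary(search, match_count, category_rules, category,
                        cov_threshold=100, spec_threshold=0.5, k=4):
    docs, actual_df = sample_for_category(search, match_count, category_rules, category, k)
    coverage = {c: sum(match_count(q) for q in qs)
                for c, qs in category_rules[category].items()}
    total = sum(coverage.values()) or 1
    specificity = {c: cov / total for c, cov in coverage.items()}

    summaries = {category: (docs, actual_df)}
    for c in coverage:
        # Recurse only into subcategories that pass both thresholds and have rules of their own.
        if c in category_rules and coverage[c] > cov_threshold and specificity[c] > spec_threshold:
            summaries.update(get_content_summary(search, match_count, category_rules, c,
                                                 cov_threshold, spec_threshold, k))
    return summaries
```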

44 Document Sample - Example II. [Hierarchy diagram: START with children Sport (Basketball, Soccer), Health, Arts, Science.] Requirements for descending into a subcategory: Coverage(c_i) > x1 and Specificity(c_i) > x2. Example: a database of NBA statistics is pushed down the Sport branch into Basketball.

45 Construct Content Summary (step 4: estimate the absolute frequencies of the words retrieved from the database).

46-49 Estimating absolute document frequencies. To estimate the absolute document frequencies, the paper uses Zipf's law as later refined by Mandelbrot: f = P(r + p)^(-B), where f is the frequency of the word, r is the rank of the word (by its frequency), and P, p, B are parameters of the specific document collection.

50 Estimating absolute document frequencies - Example. Ranks (by SampleDF): r(“Bulls”) = 1, r(“Jordan”) = 2, r(“Maradona”) = 3, r(“Romario”) = 4. Rules for Sport: “Jordan” AND “Bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc. Sample table (SampleDF / ActualDF): Jordan 45 / unknown; Bulls 80 / unknown; Maradona … / known (single-word probe query); Romario 32 / unknown.

51 Estimating absolute document frequencies. Estimate the actual word frequencies:
1. Sort the words in descending order of SampleDF(·) to determine the rank r_i of each word w_i.
2. Estimate P, p, and B from the ActualDF(·) values that are known.
3. Estimate the absolute document frequency of every word in the sample.
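One plausible way to carry out steps 1-3 is to fit the Mandelbrot curve with an off-the-shelf optimizer; the sketch below does that with `scipy.optimize.curve_fit`. It is an illustration under assumptions (the papers describe their own fitting procedure), and `sample_df` / `actual_df` are the dictionaries built in the earlier sketches.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch (not the papers' fitting procedure): fit f = P * (r + p)^(-B)
# to the words whose ActualDF is known, then use the fitted curve to estimate
# absolute frequencies for the rest of the sampled vocabulary.
def estimate_absolute_df(sample_df, actual_df):
    ranked = sorted(sample_df, key=sample_df.get, reverse=True)   # rank words by SampleDF
    rank = {w: i + 1 for i, w in enumerate(ranked)}

    mandelbrot = lambda r, P, p, B: P * (r + p) ** (-B)
    known = [w for w in ranked if w in actual_df]                 # needs at least 3 known words
    r_known = np.array([rank[w] for w in known], dtype=float)
    f_known = np.array([actual_df[w] for w in known], dtype=float)
    (P, p, B), _ = curve_fit(mandelbrot, r_known, f_known,
                             p0=[f_known.max(), 1.0, 1.0], maxfev=10_000)

    # Keep the known values; estimate the rest from the fitted curve.
    return {w: actual_df.get(w, mandelbrot(rank[w], P, p, B)) for w in ranked}
```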

52 Estimating absolute document frequencies - Example. Ranks (by SampleDF): r(“Bulls”) = 1, r(“Jordan”) = 2, r(“Maradona”) = 3, r(“Romario”) = 4. Using the ActualDF of “Maradona” (and the other known ActualDF values), estimate P, p, and B; then estimate the ActualDF of “Jordan”, “Bulls”, etc.

53 Content Summary Problems. The sparse-data problem: the content summary tends to include the most frequent words but misses many words that appear in only a few documents. Example: the word “hemophilia” appears in 0.1% of the PubMed documents. A typical content summary for PubMed will not include “hemophilia”, causing the metasearcher to consider PubMed irrelevant for a query containing “hemophilia”.

54 Content Summary Problems. Disproportion: some words might be disproportionately represented in the content summary. Challenge: improve the quality of the content summary without necessarily increasing the document sample size.

55 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

56 Shrinkage. Databases that correspond to similar topic categories tend to have similar content summaries, so the content summaries of databases under similar topics can complement each other.

57 Category Content Summary. A category's summary combines the summaries of the databases classified under it, weighted by their estimated sizes. Example with two databases under the same category: D1 with estimated size 1,000, df(“hypertension”) = 480, P(“hypertension”) = 0.48; D2 with estimated size 2,000, df(“hypertension”) = 0, P(“hypertension”) = 0. Category estimate: P(“hypertension”) = ((2000*0) + (1000*0.48)) / 3000 = 0.16.
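A minimal sketch of this size-weighted averaging, assuming each database is represented by its estimated size and per-word probabilities (the representation and names are mine):

```python
# Minimal sketch: a category's content summary as the size-weighted average of
# the summaries of the databases classified under it.
def category_summary(db_summaries):
    """db_summaries: list of (estimated_size, {word: P(word | db)}) pairs."""
    total = sum(size for size, _ in db_summaries)
    vocab = {w for _, probs in db_summaries for w in probs}
    return {w: sum(size * probs.get(w, 0.0) for size, probs in db_summaries) / total
            for w in vocab}

# The slide's example: two databases under the same category.
d1 = (1000, {"hypertension": 0.48})
d2 = (2000, {"hypertension": 0.0})
print(category_summary([d1, d2])["hypertension"])   # ≈ 0.16, i.e. (1000*0.48 + 2000*0) / 3000
```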

58 Shrunk content Summary I. To create a shrunk content summary we must first create the category content summaries for all categories in the hierarchy. Consider a path in the topic hierarchy C_1, ..., C_m where c_i = parent(c_{i+1}). [Diagram: Root -> c1 -> c2 -> c3 -> D.]

59 Shrunk content Summary II. A shrunk content summary for a database D classified under categories c_1, ..., c_m is: P_shrunk(w | D) = λ_{m+1} * P(w | D) + Σ_{i=0..m} λ_i * P(w | C_i), where C_0 is the root, C_1, ..., C_m are the categories on the path down to D, and the λ_i are non-negative weights that sum to 1.

60 Shrunk content Summary III. [Diagram: path Root -> C1 -> C2 -> C3 -> D] with P(w | Root) = 0.01, P(w | C1) = 0.3, P(w | C2) = 0.78, P(w | C3) = 0.4, P(w | D) = 0.6. Shrunk content summary: P_shrunk(w | D) = 0.01*λ_0 + 0.3*λ_1 + 0.78*λ_2 + 0.4*λ_3 + 0.6*λ_4.
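A minimal sketch of the shrunk summary computation, assuming the λ weights are already available (the papers compute them separately; the values used below are made up for illustration):

```python
# Minimal sketch of the shrunk probability estimate; the λ weights are assumptions.
def shrunk_probability(p_db, path_probs, lambdas):
    """path_probs: P(w | C_i) from the root down; p_db: P(w | D); the last λ weights p_db."""
    assert len(lambdas) == len(path_probs) + 1
    assert abs(sum(lambdas) - 1.0) < 1e-9                 # the weights must sum to 1
    return sum(l * p for l, p in zip(lambdas, path_probs)) + lambdas[-1] * p_db

# The slide's example path Root -> C1 -> C2 -> C3 -> D, with assumed weights.
path = [0.01, 0.3, 0.78, 0.4]              # P(w|Root), P(w|C1), P(w|C2), P(w|C3)
lambdas = [0.05, 0.1, 0.15, 0.2, 0.5]      # λ_0 .. λ_4; the database's own λ is largest
print(shrunk_probability(0.6, path, lambdas))   # ≈ 0.53 with these assumed weights
```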

61 Shrunk content Summary IV. The category weights are chosen so that λ_{m+1} is the highest among the λ_i's, which means the highest weight is given to the database's original content summary. The shrunk content summary incorporates information from multiple content summaries and thus can be closer to the complete (and unknown) content summary.

62 Shrunk content summary: is it always good? Not always; if the “uncertainty” associated with the score is low, do not use shrinkage. Sample size: if the database sample includes most of the documents in the database (a small database), the sample is already sufficiently complete, so shrinkage is not needed and might even be undesirable. Frequency of the query words: if every query word appears in almost all of the sample documents, the distribution of the words over the database is “certain”; the same holds if every query word appears in almost no sample document.
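The adaptive decision can be phrased as a small predicate. The sketch below is only an illustration of the two criteria just described; the thresholds are assumptions, not values from the papers.

```python
# Illustrative sketch of the "should we shrink?" decision; thresholds are assumptions.
def should_use_shrinkage(sample_size, db_size_estimate, query_words, sample_df,
                         coverage_ratio=0.9, low=0.05, high=0.95):
    if sample_size >= coverage_ratio * db_size_estimate:
        return False                       # the sample already covers most of the database
    fractions = [sample_df.get(w, 0) / sample_size for w in query_words]
    if all(f >= high for f in fractions) or all(f <= low for f in fractions):
        return False                       # the word distribution is already "certain"
    return True                            # otherwise, shrinkage is expected to help
```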

63 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

64 Experimental Results. The papers examine two aspects: content summary quality and database selection accuracy. They show that exploiting the content summaries of similarly classified databases increases content summary quality and improves database selection for a given query.

65 Content summary quality I. Comparing the coverage of the retrieved vocabulary: RS-Ord and RS-Lrd vs. the rule-based probing variants. [Plot: % of retrieved words against specificity.]

66 Content summary quality II. Comparing the ranks of the retrieved words: RS-Ord and RS-Lrd vs. the rule-based probing variants.

67 Content summary quality III. Comparing the number of queries sent to the database: RS-Ord and RS-Lrd vs. the rule-based probing variants.

68 Database selection using shrinkage. Shrinkage improves the selection of relevant databases.

69 Content: What is the hidden web? Content summary. Database classification. Combined algorithm. Shrinkage. Experimental results. Summary.

70 Summary I. Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases. The metasearcher uses the database content summaries to select the most relevant databases for a given query.

71 Summary II. The papers present methods to improve database content summaries: creating content summaries with estimates of the actual document frequencies; categorizing databases in a classification scheme; and exploiting the content summaries of similarly classified databases by combining them using shrinkage.

72 The End "The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" ( /deepweb.asp) QUESTIONS?

73 Appendix: The metasearcher Turbo.