Livnat Sharabani SDBI 2006 The Hidden Web
2 Based on:
- “Distributed search over the hidden web: Hierarchical database sampling and selection” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, VLDB 2002)
- “When one sample is not enough: Improving text database selection using shrinkage” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, SIGMOD 2004)
3 Content
- What is the hidden web?
- Content summary
- Database classification
- Combined algorithm
- Shrinkage
- Experimental results
- Summary
4 What is the hidden web?
- The “hidden-web” / “invisible-web” is what you cannot retrieve ("see") in the search results.
- The “surface-web” / “visible-web” is what you see in the results pages from general web search engines.
5 “Surface” web vs. “Hidden” web
6 Why Are Some Pages Invisible?
- Technical barriers:
  - Typing or judgment is required.
  - Dynamically generated pages.
- Pages search engines choose to exclude:
  - Links containing ‘?’ (can be a spider trap).
  - Flash, Shockwave (spiders are optimized for HTML).
7 The hidden web - majority
Text databases on the web that are “hidden” behind search interfaces.
8 “Surface” web vs. “Hidden” web
Surface web:
- Link structure.
- The content is crawlable.
- The content is indexed by search engines like Google.
Hidden web:
- Documents “hidden” in databases.
- The content is not crawlable.
- Need to query each collection individually.
10 Metasearchers
A metasearcher is a tool for searching over multiple hidden databases simultaneously through a query interface.
A metasearcher performs three main tasks:
- Database selection.
- Query translation.
- Result merging.
[Figure: a query goes to the metasearcher, which searches DB1, DB2, and DB3 on the web and returns the results]
11 DB Content Summary
Statistics that characterize the database content:
- Document frequencies of the words that appear in the database.
- Number of documents stored in the database.
Examples:
CANCERLIT (Num Docs: 148,944)
  Word     df
  breast   121,134
  cancer   91,688
  …
CNN.fn (Num Docs: 44,730)
  Word     df
  breast   124
  cancer   44
  …
12 Typical DB Selection Algorithm
- A typical database selection algorithm relies on the database content summary to make its decision.
- Given a content summary, the algorithm estimates how relevant the database is for a given query.
13 bGlOSS
- The algorithm: estimate the number of documents expected to contain all the words in the query.
- Example: for the query “breast cancer”, bGlOSS calculates:
  - CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688
    148,944 * (121,134/148,944) * (91,688/148,944) ≈ 74,569
  - CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44
    44,730 * (124/44,730) * (44/44,730) ≈ 0
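The estimate above can be reproduced with a few lines of Python. This is a minimal sketch, assuming word occurrences are independent; the function names `bgloss_score` and `rank_databases` and the dictionary layout are illustrative choices, not part of the papers.

```python
def bgloss_score(num_docs, dfs):
    """Estimate how many documents contain every query word,
    assuming the words occur independently of each other."""
    if num_docs == 0:
        return 0.0
    score = float(num_docs)
    for df in dfs:
        score *= df / num_docs
    return score


# Content summaries from the slide: (number of documents, df per word).
summaries = {
    "CANCERLIT": (148_944, {"breast": 121_134, "cancer": 91_688}),
    "CNN.fn": (44_730, {"breast": 124, "cancer": 44}),
}


def rank_databases(query_words, summaries):
    """Return (database, score) pairs, best database first."""
    scores = {}
    for name, (num_docs, df) in summaries.items():
        # A word missing from the summary is treated as df = 0.
        scores[name] = bgloss_score(num_docs, [df.get(w, 0) for w in query_words])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_databases(["breast", "cancer"], summaries)
```

Here `ranking` places CANCERLIT first, with a score close to the 74,569 shown on the slide, while CNN.fn scores well below one document.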
14 Database Selection
- Database selection is based on the content summary.
- How does the metasearcher obtain the DB content summary?
  - Exported by the DB itself.
  - Manually generated description.
  - Use a technique to automate the extraction of content summaries from searchable text DBs.
15 Content Summary construction
- Pioneering work by J. Callan and M. Connell was presented at SIGMOD ’99.
- Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.
16 Content Summary construction
The algorithm:
1. Start with a comprehensive word dictionary.
2. Pick a word and send it as a query to database D.
3. Retrieve the top k documents returned.
4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling. Otherwise return to step 2.
5. For each word w in the retrieved documents, calculate SampleDF(w).
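The sampling loop can be sketched as follows. This is an illustration only: the `search` callable, the toy in-memory database, and the default values for `k`, `max_docs`, and `max_queries` are assumptions made for the example, and words are always picked at random from the dictionary.

```python
import random


def sample_database(search, dictionary, k=4, max_docs=300, max_queries=1000, seed=0):
    """Query-based sampling sketch.

    search(word) is a hypothetical callable that returns a ranked list of
    (doc_id, text) pairs for a one-word query.
    """
    rng = random.Random(seed)
    sample = {}
    for _ in range(max_queries):
        if len(sample) >= max_docs:      # step 4: enough documents retrieved
            break
        word = rng.choice(dictionary)    # step 2: pick a word and query with it
        for doc_id, text in search(word)[:k]:  # step 3: keep the top k documents
            sample[doc_id] = text
    # Step 5: SampleDF(w) = number of sampled documents that contain w.
    sample_df = {}
    for text in sample.values():
        for w in set(text.lower().split()):
            sample_df[w] = sample_df.get(w, 0) + 1
    return sample, sample_df


# Toy stand-in for a searchable text database (hypothetical data).
TOY_DB = {
    1: "breast cancer treatment",
    2: "cancer research funding",
    3: "soccer world cup",
    4: "diet and cancer risk",
}


def toy_search(word):
    return [(i, t) for i, t in sorted(TOY_DB.items()) if word in t.split()]


sample, sample_df = sample_database(
    toy_search, ["cancer", "soccer", "diet"], k=2, max_docs=3
)
```

The `max_queries` cap is an added safeguard so the loop ends even when the database returns too few documents to reach the threshold.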
17 Content Summary construction
There are two main versions of this algorithm that differ in how they pick words from the dictionary:
- RS-Ord (Random Sampling Other Resource): picks a random word from the dictionary.
- RS-Lrd (Random Sampling Learned Resource): picks a word from the previously retrieved documents.
Neither version retrieves the actual document frequency of each word w. Hence two DBs that differ significantly in size might be assigned similar content summaries.
19 Database Classification
- Classifying a database into a hierarchy of topics is another way to characterize the content of a database.
- Example: “CANCERLIT” can be classified under the category “health”.
20 Topics hierarchy
[Figure: topics hierarchy]
21 Automatic Document Classifier I
- Queries closely associated with topical categories retrieve mainly documents about that category.
  Example: a query with “breast” and “cancer” is likely to retrieve documents related to health.
- By observing the number of matches generated for each query at a database, we can classify the database.
  Example: if a database generates a large number of matches for queries associated with health and few matches for other categories, we can classify the database under the category health.
22 Automatic Document Classifier II
- A rule-based document classifier uses a set of rules that define classification decisions.
- Examples:
  - “Jordan” AND “basketball” => sports
  - “hepatitis” => health
- A database can be classified into more than one category.
23 Automatic Document Classifier III
- The algorithm defines for each subcategory c_i:
  - Coverage(c_i): the number of documents estimated to belong to c_i.
  - Specificity(c_i): the fraction of documents estimated to belong to c_i.
- The algorithm classifies a database into a category c_i if the values of Coverage(c_i) and Specificity(c_i) exceed two pre-specified thresholds.
24 Example
Rules:
- “soccer” => sport
- “basketball” => sport
- “diet” => health
- “diabetes” => health
Pre-defined thresholds:
- Coverage(c_i) = 100
- Specificity(c_i) = 0.5
Document frequencies:
  Word        Frequency
  soccer      300
  basketball  200
  diet        140
  diabetes    12
  cancer      250
Coverage:
- Coverage(sport) = 300 + 200 = 500
- Coverage(health) = 140 + 12 = 152
Specificity:
- Specificity(sport) = 500 / (500 + 152) ≈ 0.77
- Specificity(health) = 152 / (500 + 152) ≈ 0.23
Result:
               sport   health
  Coverage     500     152
  Specificity  0.77    0.23
The word “cancer” does not appear in the rules, so it affects neither coverage nor specificity.
Only sport exceeds both thresholds, so the database is classified under sport.
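The coverage and specificity computation in this example can be sketched in Python. The names `RULES`, `MATCHES`, `coverage`, `specificity`, and `classify` are illustrative, and the sketch assumes single-word rules whose match counts are simply summed per category, as on the slide.

```python
# Rules map a probe word to a category; MATCHES holds the number of
# matching documents reported by the database for each word.
RULES = {"soccer": "sport", "basketball": "sport", "diet": "health", "diabetes": "health"}
MATCHES = {"soccer": 300, "basketball": 200, "diet": 140, "diabetes": 12, "cancer": 250}


def coverage(rules, matches):
    """Sum the matches of every rule word into its category."""
    cov = {}
    for word, category in rules.items():
        cov[category] = cov.get(category, 0) + matches.get(word, 0)
    return cov


def specificity(cov):
    """Fraction of the covered documents that falls into each category."""
    total = sum(cov.values())
    return {c: (n / total if total else 0.0) for c, n in cov.items()}


def classify(rules, matches, min_coverage=100, min_specificity=0.5):
    """Return the categories whose coverage and specificity exceed both thresholds."""
    cov = coverage(rules, matches)
    spec = specificity(cov)
    return [c for c in cov if cov[c] > min_coverage and spec[c] > min_specificity]


cov = coverage(RULES, MATCHES)
spec = specificity(cov)
categories = classify(RULES, MATCHES)
```

Words that appear in no rule, such as “cancer”, never enter `coverage`, so they leave both measures unchanged.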
34 QProber View Demo
36 Construct Content Summary
Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.
38 Document Sample
Document sample for category c:
newdocs = Ø
For each subcategory c_i of c:
    For each query q relevant for c_i:
        newdocs = newdocs U {top k documents returned for q}
        If q consists of a single word w
            then ActualDF(w) = #matches returned for q.
39 Document Sample – Example I
[Figure: topics hierarchy. START has the children Sport, Health, Arts, Science; Sport has the children Basketball and Soccer]
Rules:
- Sport: “Jordan” and “bulls”, “Romario” and “soccer”, “Maradona”, “swimming”, etc.
- Health: “diabetes”, “diet” and “fat”, “stomach”, etc.
- …
For the single-word queries we know ActualDF(.).
41 Content Summary
Build content summary for category c:
For each word w in newdocs:
    SampleDF(w) = #documents in newdocs that contain w.
43 Categorizing the Database
- The algorithm is recursive.
- We go down the topics hierarchy according to the “Coverage” and the “Specificity”.
- Categorization:
    If Coverage(c_i) > threshold1 and Specificity(c_i) > threshold2
    Then getContentSummary(c_i)
44 Document Sample – Example II
[Figure: topics hierarchy. START has the children Sport, Health, Arts, Science; Sport has the children Basketball and Soccer. Example database: NBA statistics]
Requirements:
- Coverage(c_i) > x1
- Specificity(c_i) > x2
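The sampling and categorization steps can be combined in one recursive routine. The sketch below is a simplified illustration rather than the papers' implementation: the `search` interface, the `HIERARCHY` and `PROBES` dictionaries, the toy documents, and the threshold values are all assumptions, and coverage is taken to be the raw sum of match counts.

```python
def get_content_summary(category, hierarchy, probes, search, k=4,
                        min_coverage=1, min_specificity=0.5):
    """Sample documents for `category`, then recurse into qualifying subcategories.

    hierarchy: dict mapping a category to its list of subcategories
    probes:    dict mapping a subcategory to its queries (tuples of words)
    search:    hypothetical callable, search(query) -> (num_matches, ranked_docs)
    """
    newdocs, actual_df, coverage = set(), {}, {}
    for sub in hierarchy.get(category, []):
        coverage[sub] = 0
        for query in probes.get(sub, []):
            num_matches, docs = search(query)
            newdocs.update(docs[:k])               # keep the top k documents
            coverage[sub] += num_matches
            if len(query) == 1:                    # single-word query: df is known
                actual_df[query[0]] = num_matches
    total = sum(coverage.values())
    path = [category]
    for sub, cov in coverage.items():
        spec = cov / total if total else 0.0
        if cov > min_coverage and spec > min_specificity:
            sub_docs, sub_df, sub_path = get_content_summary(
                sub, hierarchy, probes, search, k, min_coverage, min_specificity)
            newdocs |= sub_docs
            actual_df.update(sub_df)
            path += sub_path
    return newdocs, actual_df, path


# Toy database and probes (hypothetical data for illustration).
DOCS = [
    "jordan bulls win nba title",
    "bulls beat knicks nba",
    "jordan scores nba record",
    "maradona soccer goal",
]


def toy_search(query):
    hits = [d for d in DOCS if all(w in d.split() for w in query)]
    return len(hits), hits


HIERARCHY = {"root": ["sport", "health"], "sport": ["basketball", "soccer"]}
PROBES = {
    "sport": [("jordan", "bulls"), ("maradona",)],
    "health": [("diabetes",)],
    "basketball": [("nba",)],
    "soccer": [("maradona",)],
}

docs, actual_df, path = get_content_summary("root", HIERARCHY, PROBES, toy_search)
```

On this toy data the routine descends from the root into sport and then basketball, and records ActualDF only for the single-word probes.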
46 Estimating absolute document Frequencies
To estimate the absolute document frequencies, the paper uses Zipf’s observation, which was later refined by Mandelbrot:
$f = P(r+p)^{-B}$
- f: the frequency of the word.
- r: the rank of the word (by its frequency).
- P, p, B: parameters of the specific document collection.
51 Estimating absolute document Frequencies
Estimate actual word frequencies:
1. Sort the words in descending order of their SampleDF(.). Determine the rank r_i of each word w_i.
2. Estimate P, p, and B from the ActualDF(.) values you have.
3. Estimate the absolute document frequency for all words in the sample.
52 Estimating absolute document Frequencies - Example
Rules:
- Sport: “Jordan” and “Bulls”, “Romario” and “soccer”, “Maradona”, “swimming”, etc.
  Word      SampleDF  ActualDF
  Jordan    45        ---
  Bulls     80        ---
  Maradona
  Romario   32        ---
  …
Rank:
- r(“Bulls”) = 1
- r(“Jordan”) = 2
- r(“Maradona”) = 3
- r(“Romario”) = 4
Steps:
- According to “Maradona” (and other known ActualDF values), estimate P, p, and B.
- Estimate the ActualDF of “Jordan”, “Bulls”, etc.
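One way to carry out the parameter estimation is sketched below. The slides do not say how P, p, and B are fitted, so this example swaps in a simple approach: try each p on a grid and solve an ordinary least-squares line for log P and B in log-log space. The function names, the grid, and the synthetic data points are assumptions for illustration.

```python
import math


def fit_mandelbrot(ranks, freqs, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to known (rank, frequency) pairs.

    For each candidate p, log f = log P - B * log(r + p) is a straight line,
    so P and B follow from least squares; the p with the smallest squared
    error wins. Requires positive frequencies and at least two distinct ranks.
    """
    if p_grid is None:
        p_grid = [i / 10 for i in range(0, 101)]  # p in 0.0 .. 10.0
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    my = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r in ranks]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            continue
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_df(rank, P, p, B):
    """Estimated absolute document frequency of the word at the given rank."""
    return P * (rank + p) ** (-B)


# Synthetic "known ActualDF" points generated from P=1000, p=2, B=1.2.
known_ranks = [1, 3, 5, 8, 12]
known_dfs = [1000.0 * (r + 2.0) ** -1.2 for r in known_ranks]

P, p, B = fit_mandelbrot(known_ranks, known_dfs)
df_rank_2 = estimate_df(2, P, p, B)  # estimate for a word with unknown ActualDF
```

In practice the known points would be the words used as single-word probes, with ranks taken from the SampleDF ordering.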
53 Content Summary Problems
The sparse data problem:
- The content summary tends to include the most frequent words but generally misses many other words that appear in only a few documents.
- Example: the word “hemophilia” appears in 0.1% of the PubMed documents. A typical content summary for PubMed will not include “hemophilia”, causing the metasearcher to treat PubMed as non-relevant for a query containing “hemophilia”.
54 Content Summary Problems
Disproportion:
- Some words might be disproportionately represented in the content summary.
Challenge:
- Improving the quality of the content summary without necessarily increasing the document sample size.
56 Shrinkage
- When multiple databases correspond to similar topic categories, they tend to have similar content summaries.
- The content summaries of databases under similar topics can mutually complement each other.
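57 Category Content Summary
[Figure: databases under a category; labels D1, D2, D3]
- One database: estimated size = 1000, Df(“hypertension”) = 480, P(“hypertension”) = 0.48
- Another database: estimated size = 2000, Df(“hypertension”) = 0, P(“hypertension”) = 0
- Category: P(“hypertension”) = ((2000*0) + (1000*0.48)) / 3000 = 0.16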
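The category-level probability is a size-weighted average of the per-database probabilities. A minimal sketch, with `category_probability` and the tuple layout chosen for illustration:

```python
def category_probability(word, databases):
    """Size-weighted average of P(word | database) over a category.

    databases: list of (estimated_size, {word: probability}) pairs.
    """
    total_size = sum(size for size, _ in databases)
    if total_size == 0:
        return 0.0
    weighted = sum(size * probs.get(word, 0.0) for size, probs in databases)
    return weighted / total_size


# The two databases from the slide.
dbs = [
    (1000, {"hypertension": 0.48}),
    (2000, {"hypertension": 0.0}),
]
p_category = category_probability("hypertension", dbs)
```

Weighting by estimated size keeps a small, specialized database from dominating the category summary.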
58 Shrunk content Summary I
- To create a shrunk content summary, we must first create the category content summaries for all the categories in the hierarchy.
- Consider a path in the topic hierarchy c_1, …, c_m, where c_i = parent(c_{i+1}).
[Figure: path Root → c1 → c2 → c3 → D]
59 Shrunk content Summary II
A shrunk content summary for database D classified under categories c_1, …, c_m is:
$P_R(w|D) = \lambda_{m+1} P(w|D) + \sum_{i=0}^{m} \lambda_i P(w|c_i)$
Where:
- c_0 denotes the Root category.
- The weights $\lambda_i$ are non-negative and sum to 1.
60 Shrunk content Summary III
[Figure: path Root → C1 → C2 → C3 → D]
- P(w|D) = 0.6
- P(w|C3) = 0.4
- P(w|C2) = 0.78
- P(w|C1) = 0.3
- P(w|Root) = 0.01
Shrunk content summary: $0.01\lambda_0 + 0.3\lambda_1 + 0.78\lambda_2 + 0.4\lambda_3 + 0.6\lambda_4$
61 Shrunk content Summary IV
- The category weights: $\lambda_{m+1}$ is the highest among the $\lambda_i$’s, which means the highest weight is given to the original content summary.
- The shrunk content summary incorporates information from multiple content summaries, and thus it can be closer to the complete (and unknown) content summary.
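The shrunk probability of a word is a weighted mixture of the database's own estimate and the estimates of the categories on its path. The sketch below uses the probabilities from the example path (Root, C1, C2, C3, D); the λ values are made up for illustration and only respect the property that the last weight, for D itself, is the largest. How the weights are actually learned is not covered here.

```python
def shrink(path_probs, db_prob, lambdas):
    """Mix category and database estimates of P(w) with the given weights.

    path_probs: P(w | category) from the root down to the deepest category
    db_prob:    P(w | D), the database's own estimate
    lambdas:    one weight per category, followed by the weight for D
    """
    if len(lambdas) != len(path_probs) + 1:
        raise ValueError("need one weight per category plus one for the database")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    terms = list(path_probs) + [db_prob]
    return sum(lam * prob for lam, prob in zip(lambdas, terms))


path_probs = [0.01, 0.3, 0.78, 0.4]      # Root, C1, C2, C3
lambdas = [0.05, 0.1, 0.15, 0.2, 0.5]    # hypothetical weights, largest for D
shrunk = shrink(path_probs, 0.6, lambdas)
```

A word that is missing from the database sample (P(w|D) = 0) can still receive a non-zero shrunk probability through the category terms, which is how shrinkage addresses the sparse data problem.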
62 Shrunk content summary – is it always good?
Not always. If the “uncertainty” associated with the score is low, don’t use shrinkage:
- The sample size: if the database sample includes most of the documents from the DB (a small DB), then this sample is sufficiently complete. In this case shrinkage is not needed and might be undesirable.
- The frequency of the query words: if all the query words appear in almost all of the sample documents, then the distribution of the words over the DB is “certain”. The same goes if every query word appears in close to no sample document.
64 Experimental Results
The papers address two aspects:
- Content summary quality.
- Database selection accuracy.
The papers show that exploiting content summaries of similarly classified databases increases the content summary quality and improves the database selection for a given query.
65 Content summary quality I
Comparing coverage of the retrieved vocabulary: RS-Ord and RS-Lrd vs. different Rulers.
[Figure: axes labeled Specificity and % retrieved words]
66 Content summary quality II
Comparing rank of words: RS-Ord and RS-Lrd vs. different Rulers.
67 Content summary quality III
Comparing the number of queries sent to the database: RS-Ord and RS-Lrd vs. different Rulers.
68 Database selection using shrinkage
Shrinkage improves the selection of relevant databases.
70 Summary I
- Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases.
- The metasearcher uses the database content summary to select the most relevant databases for a given query.
71 Summary II
The papers present methods to improve the database content summary:
- Creating a content summary with an estimate of the actual document frequencies.
- Categorizing databases in a classification scheme.
- Exploiting content summaries of similarly classified databases and combining them using shrinkage.
72 The End "The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" ( /deepweb.asp) QUESTIONS?
73 Appendix
The metasearcher Turbo