Classifying and Searching "Hidden-Web" Text Databases


Classifying and Searching "Hidden-Web" Text Databases
Panos Ipeirotis
Computer Science Department, Columbia University

Motivation: "Surface" Web vs. "Hidden" Web
"Surface" Web:
- Link structure
- Crawlable
- Documents indexed by search engines
"Hidden" Web:
- No link structure
- Documents "hidden" in databases
- Documents not indexed by search engines
- Need to query each collection individually

Hidden-Web Databases: Examples
- Search on the U.S. Patent and Trademark Office (USPTO) database at http://patft.uspto.gov/netahtml/search-bool.html: [wireless network] → 25,749 matches
- Search on Google restricted to the USPTO site: [wireless network site:patft.uspto.gov] → 0 matches

Database              Query              Database Matches   Site-Restricted Google Matches
USPTO                 wireless network   25,749             0
Library of Congress   visa regulations   >10,000            n/a
PubMed                thrombopenia       26,460             172

(as of Feb 10th, 2004)

Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories, populated manually (InvisibleWeb.com, SearchEngineGuide.com)
- Searching: Metasearchers

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS

Hierarchically Classifying the ACM Digital Library
[Figure: a topic hierarchy, with question marks indicating the category nodes to which the ACM DL should be assigned]

Text Database Classification: Definition
For a text database D and a category C:
- Coverage(D,C) = number of documents in D about C
- Specificity(D,C) = fraction of documents in D about C
Assign a text database to a category C if:
- Coverage(D,C) is at least Tc, the coverage threshold (e.g., > 100 documents in C)
- Specificity(D,C) is at least Ts, the specificity threshold (e.g., > 40% of documents in C)
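To make the decision rule concrete, here is a minimal Python sketch; the function name and the example numbers are illustrative, not from the talk.

```python
# Classification rule: assign database D to category C if both the
# coverage and the specificity thresholds are met.

def assign_to_category(coverage: int, db_size: int,
                       t_c: int = 100, t_s: float = 0.40) -> bool:
    """Coverage(D,C) >= Tc and Specificity(D,C) >= Ts."""
    specificity = coverage / db_size if db_size else 0.0
    return coverage >= t_c and specificity >= t_s

# Example: 4,500 of a database's 10,000 documents are about "Health"
print(assign_to_category(coverage=4500, db_size=10000))  # True (0.45 >= 0.40)
```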

Brute-Force Classification "Strategy"
1. Extract all documents from the database
2. Classify the documents by topic (using state-of-the-art classifiers: SVMs, C4.5, RIPPER, …)
3. Classify the database according to its topic distribution
Problem: No direct access to the full contents of Hidden-Web databases

Classification: Goal & Challenges
Goal: Discover the database's topic distribution
Challenges:
- No direct access to the full contents of Hidden-Web databases
- Only limited search interfaces are available
- Should not overload the databases
Key observation: Only queries "about" the database's topic(s) generate a large number of matches. For example, in a health-related database the query "cancer" generates many matches, while the query "michael jordan" generates few or none. The next slides show how such queries are generated and how the returned results are exploited.

Query-based Database Classification: Overview
1. Train a document classifier
2. Extract queries from the classifier (e.g., Sports: +nba +knicks; Health: +sars)
3. Adaptively issue the queries to the database (e.g., +sars → 1,254 matches)
4. Identify the topic distribution based on the adjusted number of query matches
5. Classify the database

Training a Document Classifier
- Get a training set (a set of pre-classified documents)
- Select the best features to characterize the documents (Zipf's law + information-theoretic feature selection) [Koller and Sahami 1996]
- Train a classifier (SVM, C4.5, RIPPER, …)
Output: A "black-box" model for classifying documents
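As an illustration of this step, here is a minimal sketch using scikit-learn as a stand-in for the classifiers named on the slide (RIPPER, C4.5, SVMs); the toy training set and the choice of mutual information for feature selection are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy pre-classified training set (illustrative)
docs = [
    "the knicks won the nba game",
    "nba playoffs start next week",
    "new sars treatment in clinical trials",
    "sars outbreak prompts health warning",
]
labels = ["Sports", "Sports", "Health", "Health"]

model = make_pipeline(
    CountVectorizer(binary=True),           # bag-of-words features
    SelectKBest(mutual_info_classif, k=4),  # information-theoretic feature selection
    LinearSVC(),                            # linear classifier (SVM)
)
model.fit(docs, labels)
print(model.predict(["nba finals tonight"]))  # expected: ['Sports']
```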

Extracting Query Probes [ACM TOIS 2003]
Transform the classifier model into queries:
- Trivial for "rule-based" classifiers (RIPPER)
- Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules)
- Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naive Bayes, …)
Example query for Sports: +nba +knicks
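A minimal sketch of the rule-to-query transformation for the rule-based case; the rule representation used here is illustrative:

```python
# A rule like "IF nba AND knicks THEN Sports" becomes the conjunctive
# boolean query "+nba +knicks", exactly as in the slide's example.

rules = [
    (["nba", "knicks"], "Sports"),
    (["sars"], "Health"),
]

def rule_to_query(terms):
    """Render the conjunctive rule body as a boolean query string."""
    return " ".join(f"+{t}" for t in terms)

probes = {rule_to_query(terms): category for terms, category in rules}
print(probes)  # {'+nba +knicks': 'Sports', '+sars': 'Health'}
```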

Querying the Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003]
- Issue each query to the database to obtain the number of matches, without retrieving any documents
- Increase the coverage of the rule's category accordingly (e.g., #Sports = #Sports + 706)
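A minimal sketch of the probing loop; `count_matches` is an assumed stand-in for the database's search interface, here returning canned match counts taken from the slides:

```python
def count_matches(query: str) -> int:
    # Stand-in: a real implementation would submit the query through the
    # database's search form and parse the reported number of matches,
    # without retrieving any documents.
    canned = {"+nba +knicks": 706, "+sars": 1254}
    return canned.get(query, 0)

def probe(probes: dict) -> dict:
    """Accumulate per-category match counts (estimated coverage)."""
    coverage = {}
    for query, category in probes.items():
        coverage[category] = coverage.get(category, 0) + count_matches(query)
    return coverage

print(probe({"+nba +knicks": "Sports", "+sars": "Health"}))
# {'Sports': 706, 'Health': 1254}
```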

Identifying the Topic Distribution from Query Results
Query-based estimates of the topic distribution are not perfect:
- Document classifiers are not perfect: rules for one category match documents from other categories
- Querying is not perfect: queries for the same category might overlap, and queries do not match all documents in a category
Solution: Learn to adjust the results of the query probes

Confusion Matrix Adjustment of Query Probe Results
Rows correspond to the assigned class and columns to the correct class; for example, 10% of "sports" documents match queries for "computers". Multiplying the confusion matrix by the correct (but unknown) topic distribution yields the (incorrect) distribution derived from query probing:

            correct class            Real          Estimated
          comp  sports  health     Coverage         Coverage
comp    [ 0.80   0.10    0.00 ]    [ 1000 ]   [  800 +  500 +  0 = 1300 ]
sports  [ 0.08   0.85    0.04 ]  × [ 5000 ] = [   80 + 4250 +  2 = 4332 ]
health  [ 0.02   0.15    0.96 ]    [   50 ]   [   20 +  750 + 48 =  818 ]

This multiplication can be inverted to obtain a better estimate of the real topic distribution from the probe results.

Confusion Matrix Adjustment of Query Probe Results
Coverage(D) ≈ M⁻¹ · ECoverage(D), where ECoverage(D) is the probing result and Coverage(D) the adjusted estimate of the topic distribution
- M is usually diagonally dominant for "reasonable" document classifiers, hence invertible
- Compensates for errors in the query-based estimates of the topic distribution
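The adjustment itself is a linear solve; here is a minimal NumPy sketch using the numbers from the worked example above:

```python
import numpy as np

# Confusion matrix M: rows = assigned class, columns = correct class
M = np.array([
    [0.80, 0.10, 0.00],   # comp
    [0.08, 0.85, 0.04],   # sports
    [0.02, 0.15, 0.96],   # health
])

ecoverage = np.array([1300, 4332, 818])   # estimated coverage from probing
coverage = np.linalg.solve(M, ecoverage)  # Coverage(D) ~ M^-1 . ECoverage(D)
print(np.round(coverage))                 # [1000. 5000.   50.]
```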

Classification Algorithm (Again)
One-time process:
1. Train a document classifier
2. Extract queries from the classifier
For every database:
3. Adaptively issue the queries to the database
4. Identify the topic distribution based on the adjusted number of query matches
5. Classify the database

Experimental Setup
- 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles (April-May 2000): newsgroups (e.g., comp.hardware, rec.music.classical, rec.photo.*) assigned by hand to hierarchy nodes
- RIPPER trained with 54,000 articles (1,000 per leaf); 27,000 articles used to construct the confusion matrix
- 500 "controlled" databases built using 419,000 newsgroup articles (to run detailed experiments)
- 130 real Web databases picked from InvisibleWeb (the first 5 under each topic)

Experimental Results: Controlled Databases
Accuracy (using the F-measure):
- Above 80% for most <Tc, Ts> threshold combinations tried
- Degrades gracefully with hierarchy depth
- Confusion-matrix adjustment helps
Efficiency:
- Relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried

Experimental Results: Web Databases
Accuracy (using the F-measure):
- ~70% for the best <Tc, Ts> combination
- Learned thresholds that reproduce the human classification
- Tested the threshold choice using 3-fold cross-validation
Efficiency:
- 120 queries per database needed on average for the chosen thresholds, with no documents retrieved
- Only a small part of the hierarchy is "explored"
- Queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)

Other Experiments [ACM TOIS 2003; IEEE Data Engineering Bulletin 2002]
- Effect of the choice of document classifier: RIPPER, C4.5, Naive Bayes, SVM
- Benefits of feature selection
- Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
- Effect of the query-overlap elimination step
- Over crawlable databases: query-based classification is orders of magnitude faster than "brute-force" crawling-based classification

Hidden-Web Database Classification: Summary
- Handles autonomous Hidden-Web databases accurately and efficiently: ~70% F-measure, with only 120 queries issued on average and no documents retrieved
- Handles a large family of document classifiers (and can hence exploit future advances in machine learning)

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS

Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories
- Searching: Metasearchers
Content in databases such as the NYTimes Archives, PubMed, the USPTO, and the Library of Congress is not accessible through Google; a metasearcher routes each query to the appropriate databases.

Metasearchers Provide Access to Distributed Databases
Database selection relies on simple content summaries: vocabulary and word frequencies. For the query [thrombopenia], the metasearcher compares per-database frequencies and routes the query to PubMed:
- PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826
- NYTimes Archives: thrombopenia 0
- USPTO: thrombopenia 18

Extracting Content Summaries from Autonomous Hidden-Web Databases [Callan & Connell 2001]
1. Send random queries to the database
2. Retrieve the top matching documents
3. If 300 documents have been retrieved, stop; else go to Step 1
The content summary contains the words in the sample and the document frequency of each word.
Problems:
- Random sampling retrieves non-representative documents
- Frequencies in the summary are "compressed" to the sample-size range
- Summaries from small samples are highly incomplete
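A minimal sketch of this sampling loop; `search` is an assumed stand-in for the database's query interface (returning the top-k matching documents), and the stopping rule is the 300-document threshold from the slide:

```python
import random
from collections import Counter

def sample_database(search, dictionary, sample_size=300, top_k=4):
    """Build a sample-based content summary: word -> document frequency."""
    sample, seen = [], set()
    while len(sample) < sample_size:
        query = random.choice(dictionary)          # Step 1: random query
        for doc_id, text in search(query, top_k):  # Step 2: top matches
            if doc_id not in seen:
                seen.add(doc_id)
                sample.append(text)
    # Summary: document frequency of each word within the sample
    return Counter(w for text in sample for w in set(text.lower().split()))
```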

Extracting a Representative Document Sample
Problem 1: Random sampling retrieves non-representative documents. Solution (focused sampling):
1. Train a document classifier
2. Create queries from the classifier
3. Adaptively issue the queries to the database
4. Retrieve the top-k matching documents for each query
5. Save the number of matches for each one-word query
6. Identify the topic distribution based on the adjusted number of query matches
7. Categorize the database
8. Generate the content summary from the document sample
Sampling now retrieves documents only from "topically dense" areas of the database.

Sample Frequencies vs. Actual Frequencies
Problem 2: Frequencies in the summary are "compressed" to the sample-size range. For example:
- PubMed (11,868,552 docs): cancer 1,562,477; heart 691,360
- PubMed sample (300 documents): cancer 45; heart 16
Key observation: Query matches reveal frequency information.

Adjusting Document Frequencies [VLDB 2002]
- Zipf's law empirically connects word frequency f and rank r: f = A · (r + B)^c
- We know the document frequency and rank r of the words in the sample
- We know the real document frequency f of some words from the one-word queries
- We use curve fitting to estimate the absolute frequency of all words in the sample
[Plot, built up over four slides: frequency vs. rank; sample frequencies at sample ranks (e.g., ranks 1, 12, 78), known database frequencies for the probed words, and the fitted curve giving the estimated database frequency of every sampled word]
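A minimal sketch of the curve fitting. For simplicity it fixes B = 0 so the fit becomes linear in log-log space (the full three-parameter form would need a nonlinear solver); the pairing of the slide's sample ranks (1, 12, 78) with PubMed-scale frequencies is illustrative:

```python
import numpy as np

# f = A (r + B)^c; with B = 0: log f = log A + c log r
known_ranks = np.array([1.0, 12.0, 78.0])               # ranks in the sample
known_freqs = np.array([1_562_477, 121_491, 24_826.0])  # real df from one-word probes

c, logA = np.polyfit(np.log(known_ranks), np.log(known_freqs), deg=1)
A = np.exp(logA)

def estimated_df(rank: float) -> int:
    """Estimated absolute document frequency of a sampled word, from its rank."""
    return int(A * rank ** c)

print(estimated_df(40))  # estimate for a word that ranked 40th in the sample
```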

Actual PubMed Content Summary (extracted automatically)
- Number of documents: 8,691,360 (actual: 11,868,552)
- Category: Health, Diseases
- cancer 1,562,477; heart 581,506 (actual: 691,360); aids 121,491; hepatitis 73,481 (actual: 121,129); basketball 907 (actual: 1,063); cpu 598
- ~27,500 words in the extracted content summary
- Fewer than 200 queries sent; at most 4 documents retrieved per query
- heart, hepatitis, and basketball were not among the one-word probes

Sampling and Incomplete Content Summaries
Problem 3: Summaries from small samples are highly incomplete
[Plot: log(frequency) vs. rank for the 10% most frequent words in PubMed, with a 300-document sample. Example: "aphasia" appears in ~9,000 documents (~0.1% of PubMed) yet is easily missed by such a sample.]
- Many words appear in "relatively few" documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words

Sample-based Content Summaries
Challenge: Improve content-summary quality without increasing the sample size
Main idea: Database classification helps
- Similar topics ↔ similar content summaries
- Extracted content summaries complement each other

Databases with Similar Topics
- CANCERLIT contains "metastasis", which was not found during sampling
- CancerBACUP contains "metastasis"
- Databases under the same category have similar vocabularies and can complement each other

Content Summaries for Categories
- Databases under the same category share similar vocabulary
- Higher-level category content summaries provide additional useful estimates
- All estimates along the category path are potentially useful

Enhancing Summaries Using "Shrinkage" [SIGMOD 2004]
- Estimates from database content summaries can be unreliable
- Category content summaries are more reliable (based on larger samples) but less specific to the database
- Combining estimates from category and database content summaries gives better estimates

Shrinkage-based Estimations
Adjusted estimate for "metastasis" in D: λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
- The weights λi are selected to maximize the probability that the summary of D comes from a database under each of its parent categories
- Avoids the "sparse data" problem and decreases estimation risk
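A minimal sketch of the weighted combination from the slide; the assignment of the four probabilities to hierarchy levels and the λ values themselves are illustrative (the talk only specifies that the λi are chosen to maximize the likelihood of D's summary):

```python
def shrunk_estimate(estimates, weights):
    """Linear combination of database- and category-level word-probability estimates."""
    assert abs(sum(weights) - 1.0) < 1e-9, "lambda weights must sum to 1"
    return sum(w * p for w, p in zip(weights, estimates))

# P(metastasis) in: database D, parent category, grandparent category, root
estimates = [0.000, 0.092, 0.05, 0.002]
weights   = [0.4, 0.3, 0.2, 0.1]  # lambda_i, learned per database
print(shrunk_estimate(estimates, weights))  # 0.0378
```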

Adaptive Application of Shrinkage
- Database selection algorithms assign a score to each database for each query
- When the frequency estimates are uncertain, the assigned score is uncertain; but sometimes the confidence in the assigned score is high
- Unreliable score estimate: use shrinkage. Reliable score estimate: shrinkage is unnecessary and might hurt
[Figure: two probability distributions over the database score for a query, one wide (unreliable estimate) and one peaked (reliable estimate)]

Extracting Content Summaries: Problems Solved
- Problem 1: Random sampling may retrieve non-representative documents. Solution: focus querying on "topically dense" areas of the database
- Problem 2: Frequencies are "compressed" to the sample-size range. Solution: exploit the number of matches for each query and adjust the estimates using curve fitting
- Problem 3: Summaries based on small samples are highly incomplete. Solution: exploit the database classification and augment summaries using samples from topically similar databases

Searching Algorithm
One-time process:
1. Classify the databases and extract document samples
2. Adjust the frequencies in the samples
For every query:
3. For each database D: assign a score to D (using its extracted content summary)
4. Examine the uncertainty of the score
5. If the uncertainty is high, apply shrinkage and rescore; else keep the existing score
6. Query only the top-K scoring databases
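A minimal sketch of the query-time loop; the scoring function is a toy frequency-based stand-in (the experiments used CORI and language-modeling scorers), and `uncertain` is an assumed placeholder for the score-uncertainty test:

```python
def score(summary, query_words):
    """Toy score: estimated document frequencies, normalized by database size."""
    return sum(summary["df"].get(w, 0) for w in query_words) / summary["num_docs"]

def select_databases(summaries, shrunk_summaries, query_words, k, uncertain):
    scored = []
    for name, summary in summaries.items():
        s = score(summary, query_words)
        if uncertain(summary, query_words):                 # unreliable estimate?
            s = score(shrunk_summaries[name], query_words)  # rescore with shrinkage
        scored.append((s, name))
    return [name for s, name in sorted(scored, reverse=True)[:k]]
```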

Experimental Setup [SIGMOD 2004]
Two standard testbeds from TREC (the "Text Retrieval Conference"):
- 200 databases
- 100 queries with associated human-assigned document relevance judgments
Two sets of experiments:
- Content-summary quality; metrics: precision, recall, Spearman correlation coefficient, KL divergence
- Database selection accuracy; metric: fraction of relevant documents for the queries in the top-scored databases

Experimental Results
Content-summary quality:
- Shrinkage improves the quality of content summaries without increasing the sample size
- Frequency estimation gives accurate (within ±20%) estimates of the actual frequencies
Database selection accuracy:
- Frequency estimation improves performance by 20%-30%
- Focused sampling improves performance by 40%-50%
- Adaptive application of shrinkage improves performance by up to 100%
- Shrinkage is robust: performance improved consistently across many different configurations

Other Experiments [SIGMOD 2004]
- Additional data set: 315 real Web databases
- Choice of database selection algorithm (CORI, bGlOSS, language modeling)
- Effect of stemming
- Effect of stop-word elimination

Classification & Search: Overall Contributions
- Support for browsing and searching Hidden-Web databases
- No need for cooperation: works with autonomous Hidden-Web databases
- Scalable: works with a large number of databases
- Not restricted to "Hidden"-Web databases: works with any searchable text database
- Classification and content-summary extraction are implemented and available for download at http://sdarts.cs.columbia.edu

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS: Protocol and Toolkit for Metasearching

SDARTS: Protocol and Toolkit for Metasearching
[Figure: a query routed through SDARTS to Hidden-Web databases (Harrison's Online, British Medical Journal, PubMed), "local" collections of unstructured text documents (DLI2 corpus) and XML documents, and the local Web]

SDARTS: Protocol and Toolkit for Metasearching [ACM+IEEE JCDL 2001, 2002]
Accomplishments:
- Combines the strengths of existing digital library protocols (SDLIP, STARTS)
- Enables indexing and wrapping of "local" collections of text and XML documents
- Enables "declarative" wrapping of Hidden-Web databases, with no programming
- Extracts the content summary, topical focus, and technical level of each database
- Interfaces with the Open Archives Initiative, an emerging digital library interoperability protocol
- Critical building block for the search component of Columbia's PERSIVAL project (a 5-year, $5M NSF Digital Libraries Phase 2 project)
- Open source, available at http://sdarts.cs.columbia.edu (~1,000 downloads since Jan 2003)
- Supervised and coordinated eight students during development

Current Work: Updating Content Summaries
Databases are not static: their content changes. When should we refresh the content summary?
- Examined 150 real Web databases over 52 weeks
- Modeled changes using "survival analysis" techniques (Cox proportional-hazards model)
- Currently developing updating algorithms that contact a database only when necessary and that improve summary quality by exploiting history
Joint work with Junghoo Cho and Alex Ntoulas (UCLA)

Other Work: Approximate Text Matching [VLDB 2001, WWW 2003]
Matching similar strings within a relational DBMS is important because that is where the data resides. Exact joins are not enough, due to typing mistakes, abbreviations, and different conventions:
- Service A: Jenny Stamatopoulou; John Paul McDougal; Aldridge Rodriguez; Panos Ipeirotis; John Smith
- Service B: Panos Ipirotis; Jonh Smith; Stamatopulou, Jenny; John P. McDougal; Al Dridge Rodriguez
Contributions:
- Introduced algorithms for mapping approximate text joins into SQL: no need to import/export data
- Provides a crucial building block for data-cleaning applications
- Identifies many interesting matches
Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research), and others
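To illustrate the underlying idea, here is a minimal q-gram similarity sketch in Python; the cited papers express the join in SQL inside the DBMS and use weighted (tf-idf/cosine) similarity, so plain Jaccard over q-grams is a deliberate simplification:

```python
def qgrams(s: str, q: int = 3) -> set:
    """q-grams of a padded, lowercased string."""
    s = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a: str, b: str) -> float:
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)

print(round(jaccard("Panos Ipeirotis", "Panos Ipirotis"), 2))       # high: likely match
print(round(jaccard("Panos Ipeirotis", "Al Dridge Rodriguez"), 2))  # low: no match
```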

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in Vancouver now]
Current top Google result (as of March 3rd, 2004): a rock group performance on January 16th, 2004

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in Vancouver now] → query review databases, movie databases, and ticket databases
All the information is already available on the Web:
- Review databases: Rotten Tomatoes, NY Times, …
- Movie databases: All Movie Guide, IMDB, …
- Tickets: Fandango, Moviefone, …
Privacy: should a database know that it was selected for one of my queries? Authorization: if a query is sent to a database I do not have access to, I learn that it contains something relevant.

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in Vancouver now] → query review, movie, and ticket databases
Challenges:
- Short term: learn to interface with different databases; adapt database selection algorithms
- Long term: understand the semantics of the query; extract "query plans" and optimize them for distributed execution; personalization; security and privacy

Panos Ipeirotis, http://www.cs.columbia.edu/~pirot

Classification and Search of Hidden-Web Databases:
- P. Ipeirotis, L. Gravano. When one Sample is not Enough: Improving Text Database Selection using Shrinkage. [SIGMOD 2004]
- L. Gravano, P. Ipeirotis, M. Sahami. QProber: A System for Automatic Classification of Hidden-Web Databases. [ACM TOIS 2003]
- E. Agichtein, P. Ipeirotis, L. Gravano. Modeling Query-Based Access to Text Databases. [WebDB 2003]
- P. Ipeirotis, L. Gravano. Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection. [VLDB 2002]
- L. Gravano, P. Ipeirotis, M. Sahami. Query- vs. Crawling-based Classification of Searchable Web Databases. [DEB 2002]
- P. Ipeirotis, L. Gravano, M. Sahami. Probe, Count, and Classify: Categorizing Hidden-Web Databases. [SIGMOD 2001]

Approximate Text Matching:
- L. Gravano, P. Ipeirotis, N. Koudas, D. Srivastava. Text Joins in an RDBMS for Web Data Integration. [WWW 2003]
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava. Approximate String Joins in a Database (Almost) for Free. [VLDB 2001]
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, L. Pietarinen. Using q-grams in a DBMS for Approximate String Processing. [DEB 2001]

SDARTS: Protocol & Toolkit for Metasearching:
- N. Green, P. Ipeirotis, L. Gravano. SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching. [JCDL 2001]
- P. Ipeirotis, T. Barry, L. Gravano. Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative. [JCDL 2002]

Thank you!

No Good Category for Database
A general issue with supervised learning; example: English vs. Chinese databases. We devised a technique to analyze whether the method can work with a given database:
1. Find candidate text fields
2. Send top-level queries
3. Examine the results and construct a similarity matrix
If the matrix rank is small, many "similar" pages were returned, indicating that:
- the Web form is not a search interface, or
- the text field is not a "keyword" field, or
- the database is in a different language, or
- the database is about an "unknown" topic
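A minimal sketch of the similarity-matrix rank test; the vectorization and the rank threshold are assumptions, since the talk only specifies building a similarity matrix over the returned pages and checking whether its rank is small:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def usable_search_field(result_pages, rank_threshold=3):
    """Low matrix rank => near-identical result pages => probably not a search field."""
    X = TfidfVectorizer().fit_transform(result_pages).toarray()
    similarity = X @ X.T  # pairwise (cosine-like) similarities
    return np.linalg.matrix_rank(similarity) > rank_threshold

pages = ["sorry, no results were found for your query"] * 5  # identical pages
print(usable_search_field(pages))  # False: rank 1, field rejected
```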

Database not Category Focused
- Extract one content summary per topic: focused queries retrieve documents about a known topic
- Each database is then represented multiple times in the hierarchy

Near Future Work: Definition and Analysis of Query-based Algorithms [WebDB 2003]
- Currently, query-based algorithms are evaluated only empirically
- It is possible to model the querying process using random-graph theory, and thereby analyze the properties of the algorithms thoroughly and understand better why, when, and how they work
- Interested in exploring similar directions: adapt hyperlink-based ranking algorithms; use results from graph theory to design sampling algorithms

Database Selection (CORI, TREC6)
[Chart: database selection performance; more results cover stemming vs. no stemming, CORI/LM/bGlOSS, QBS/FPS/RS/CMPL, and stop words]

3-Fold Cross-Validation
[Charts: F-measure values for the three disjoint subsets of the Web set; their behavior is identical across varying thresholds, confirming that the learned thresholds do not overfit the data]

Crawling- vs. Query-based Classification for CNN Sports [IEEE DEB, March 2002]
Efficiency statistics:

            Crawling-based   Query-based
Time        1,325 min        2 min (-99.8%)
Files       270,202          n/a
Size        8 GB             357 KB (-99.9%)
Queries     n/a              112

Accuracy statistics: the crawling-based approach classifies CNN Sports correctly only after downloading 70% of its documents.

Experiments: Precision of Database Selection Algorithms [VLDB 2002, extended version]

Content Summary Generation Technique    CORI (Hierarchical)   CORI (Flat)
FP-SVM-Documents                        0.270                 0.170
FP-SVM-Snippets                         0.200                 0.183
Random Sampling                         n/a                   0.177
QPilot (backlinks + front page)         n/a                   0.050

F-measure vs. Hierarchy Depth [ACM TOIS 2003]
[Chart: classification F-measure as a function of hierarchy depth]

Real Confusion Matrix for Top Node of Hierarchy
[Table: confusion matrix over the top-level categories Health, Sports, Science, Computers, and Arts]

Overlap Elimination

No Support for Conjunctive Queries (Boolean vs. Vector-space)

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in New York now]
Current top Google result (as of March 3rd, 2004): a story in the "Seattle Times" about 9-year-old drummer Rachel Trachtenburg

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in New York now] → query review databases, movie databases, and ticket databases
All the information is already available on the Web:
- Review databases: Rotten Tomatoes, NY Times, TONY, …
- Movie databases: All Movie Guide, IMDB
- Tickets: Moviefone, Fandango, …
Privacy: should a database know that it was selected for one of my queries? Authorization: if a query is sent to a database I do not have access to, I learn that it contains something relevant.

Future Work: Integrated Access to Hidden-Web Databases
Query: [good indie movies playing in New York now] → query review, movie, and ticket databases
Challenges:
- Short term: learn to interface with different databases; adapt database selection algorithms
- Long term: understand the semantics of the query; extract "query plans" and optimize them for distributed execution; personalization; security and privacy