
1 Classifying and Searching "Hidden-Web" Text Databases
Panos Ipeirotis, Department of Information Systems, New York University

2 Motivation: “Surface” Web vs. “Hidden” Web
“Surface” Web: link structure; crawlable; documents indexed by search engines.
“Hidden” Web: no link structure; documents “hidden” in databases; documents not indexed by search engines; need to query each collection individually.

3 Hidden-Web Databases: Examples
Search on the U.S. Patent and Trademark Office (USPTO) database: [wireless network] → 39,270 matches (the USPTO database is at patft.uspto.gov). Search on Google restricted to the USPTO site: [wireless network site:patft.uspto.gov] → 1 match.
Database | Query | Database matches | Site-restricted Google matches
USPTO | wireless network | 39,270 | 1
Library of Congress | visa regulations | >10,000 |
PubMed | thrombopenia | 29,022 | 2
(as of Oct 3rd, 2005)

4 Interacting With Hidden-Web Databases
Browsing: Yahoo!-like directories (InvisibleWeb.com, SearchEngineGuide.com), populated manually.
Searching: metasearchers.

5 Outline of Talk
Classification of Hidden-Web Databases; Search over Hidden-Web Databases; Managing Changes in Hidden-Web Databases

6 Hierarchically Classifying the ACM Digital Library
[Figure: placing the ACM Digital Library (ACM DL) into a topic hierarchy; its category is initially unknown]

7 Text Database Classification: Definition
For a text database D and a category C:
Coverage(D, C) = number of documents in D about C
Specificity(D, C) = fraction of documents in D about C
Assign D to category C if:
Coverage(D, C) is at least Tc, the coverage threshold (e.g., > 100 documents in C), and
Specificity(D, C) is at least Ts, the specificity threshold (e.g., > 40% of the documents in C).
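A minimal Python sketch (not from the talk) of this thresholded assignment rule; the threshold values are the illustrative ones above, and the coverage/specificity inputs are assumed to be already computed:

```python
def assign_categories(coverage, specificity, tau_c=100, tau_s=0.4):
    """Return the categories C that database D is assigned to.

    coverage:    dict mapping category -> number of docs in D about C
    specificity: dict mapping category -> fraction of docs in D about C
    """
    return [c for c in coverage
            if coverage[c] >= tau_c and specificity.get(c, 0.0) >= tau_s]

# Example: a database with 10,000 documents, mostly about health.
coverage = {"Health": 8000, "Sports": 90, "Computers": 1910}
specificity = {c: n / 10000 for c, n in coverage.items()}
print(assign_categories(coverage, specificity))   # ['Health']
```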

8 Brute-Force Classification “Strategy”
Extract all documents from the database. Classify the documents by topic (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, …). Classify the database according to its topic distribution.
Problem: no direct access to the full contents of Hidden-Web databases.

9 Classification: Goal & Challenges
Goal: discover the database topic distribution.
Challenges: no direct access to the full contents of Hidden-Web databases; only limited search interfaces available; should not overload the databases.
Key observation: only queries “about” the database topic(s) generate a large number of matches. For example, in a health-related database a query on “cancer” will generate a large number of matches, while a query on “michael jordan” will generate few or zero matches. We will now see how to generate such queries and how to exploit the returned results.

10 Query-based Database Classification: Overview
Train a document classifier. Extract queries from the classifier. Adaptively issue the queries to the database. Identify the topic distribution based on the adjusted number of query matches. Classify the database.
[Pipeline figure: Train Classifier → Extract Queries (e.g., Sports: +nba +knicks; Health: +sars) → Query Database (+sars → 1254 matches) → Identify Topic Distribution → Classify Database]
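A rough Python sketch of this pipeline (my illustration, not the talk's code): `get_match_count` is a hypothetical stub for the database's search interface, `M_inv` is the inverse confusion matrix introduced a few slides later, and the probing here is a flat, non-adaptive simplification of the hierarchical version:

```python
import numpy as np

def classify_database(get_match_count, rules, M_inv, tau_c, tau_s):
    """rules: dict category -> list of query strings extracted from the classifier;
    M_inv: inverse of the classifier's confusion matrix (same category order)."""
    categories = list(rules)
    # Issue every extracted query and accumulate the reported matches per category.
    ecoverage = np.array([sum(get_match_count(q) for q in rules[c]) for c in categories],
                         dtype=float)
    # Adjust the raw counts with the confusion matrix (explained on the next slides).
    coverage = M_inv @ ecoverage
    specificity = coverage / coverage.sum()
    # Keep the categories that pass the coverage and specificity thresholds.
    return [c for c, cov, spec in zip(categories, coverage, specificity)
            if cov >= tau_c and spec >= tau_s]
```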

11 Training a Document Classifier
Get a training set (a set of pre-classified documents). Select the best features to characterize documents (Zipf’s law + information-theoretic feature selection) [Koller and Sahami 1996]. Train the classifier (SVM, C4.5, RIPPER, …).
Output: a “black-box” model for classifying documents.

13 Extracting Query Probes
ACM TOIS 2003. Transform the classifier model into queries:
Trivial for “rule-based” classifiers (RIPPER).
Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules).
Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve Bayes, …).
Example query for Sports: +nba +knicks

14 Querying Database with Extracted Queries
Issue each query to the database to obtain the number of matches, without retrieving any documents. Increase the coverage of the rule’s category accordingly (e.g., #Sports = #Sports + 706).
SIGMOD 2001, ACM TOIS 2003

15 Identifying Topic Distribution from Query Results
Query-based estimates of the topic distribution are not perfect:
Document classifiers are not perfect: rules for one category match documents from other categories.
Querying is not perfect: queries for the same category might overlap; queries do not match all documents in a category.
Solution: learn to adjust the results of the query probes.

16 Confusion Matrix Adjustment of Query Probe Results
Correct (but unknown) topic distribution vs. the incorrect topic distribution derived from query probing. M × Real coverage = Estimated coverage, with rows of M indexed by the assigned class and columns by the correct class (comp, sports, health):
[ 0.80 0.10 0.00 ]   [ 1000 ]   [ 1300 ]
[ 0.08 0.85 0.04 ] × [ 5000 ] = [ 4332 ]
[ 0.02 0.15 0.96 ]   [   50 ]   [  818 ]
For example, 10% of the “sports” documents match queries for “computers”. This “multiplication” can be inverted to get a better estimate of the real topic distribution from the probe results.

17 Confusion Matrix Adjustment of Query Probe Results
Coverage(D) ≈ M⁻¹ · ECoverage(D): the adjusted estimate of the topic distribution is obtained from the probing results. M is usually diagonally dominant for “reasonable” document classifiers, hence invertible. This compensates for errors in the query-based estimates of the topic distribution.
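A small numpy check (added for illustration) of the slide-16 numbers, showing how multiplying by M distorts the real coverage and how inverting M recovers it:

```python
import numpy as np

# Rows = class assigned by the probes, columns = correct class (comp, sports, health).
M = np.array([[0.80, 0.10, 0.00],
              [0.08, 0.85, 0.04],
              [0.02, 0.15, 0.96]])
real = np.array([1000, 5000, 50])     # correct (unknown) coverage
estimated = M @ real                  # what query probing observes
print(estimated)                      # [1300. 4332.  818.]

# Coverage(D) ~ M^-1 * ECoverage(D): invert to recover the real distribution.
adjusted = np.linalg.solve(M, estimated)
print(np.round(adjusted))             # [1000. 5000.   50.]
```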

18 Classification Algorithm (Again)
One-time process: train the document classifier; extract queries from the classifier.
For every database: adaptively issue queries to the database; identify the topic distribution based on the adjusted number of query matches; classify the database.

19 Experimental Setup
72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes).
500,000 Usenet articles; newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*).
RIPPER trained with 54,000 articles (1,000 articles per leaf); 27,000 articles used to construct the confusion matrix.
500 “controlled” databases built using 419,000 newsgroup articles (to run detailed experiments).
130 real Web databases picked from InvisibleWeb (the first 5 under each topic).

20 Experimental Results: Controlled Databases
Accuracy (using the F-measure): above 80% for most <Tc, Ts> threshold combinations tried; degrades gracefully with hierarchy depth; confusion-matrix adjustment helps.
Efficiency: a relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried.

21 Experimental Results: Web Databases
Accuracy (using the F-measure): ~70% for the best <Tc, Ts> combination; learned thresholds that reproduce the human classification; tested the threshold choice using 3-fold cross-validation.
Efficiency: 120 queries per database needed on average for this choice of thresholds, with no documents retrieved; only a small part of the hierarchy is “explored”; queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases).

22 Hidden-Web Database Classification: Summary
Handles autonomous Hidden-Web databases accurately and efficiently: ~70% F-measure, with only 120 queries issued on average and no documents retrieved. Handles a large family of document classifiers (and can hence exploit future advances in machine learning).

23 Outline of Talk
Classification of Hidden-Web Databases; Search over Hidden-Web Databases; Managing Changes in Hidden-Web Databases

24 Interacting With Hidden-Web Databases
Browsing: Yahoo!-like directories. Searching: metasearchers. The content is not accessible through Google.
[Figure: a metasearcher dispatching a query to Hidden-Web databases such as PubMed, the NYTimes Archives, USPTO, and the Library of Congress]

25 Metasearchers Provide Access to Distributed Databases
Database selection relies on simple content summaries: vocabulary and word frequencies. Example summary for PubMed (11,868,552 documents): aids ,491; cancer 1,562,477; heart ,360; hepatitis ,129; thrombopenia 24,826.
[Figure: for the query “thrombopenia”, the metasearcher must pick among PubMed (24,826 matches), the NYTimes Archives, and USPTO (0 and 18 matches)]
Databases typically do not export such summaries!

26 Extracting Representative Document Sample
Focused sampling:
Train a document classifier. Create queries from the classifier. Adaptively issue the queries to the database. Retrieve the top-k matching documents for each query. Save the number of matches for each one-word query. Identify the topic distribution based on the adjusted number of query matches. Categorize the database. Generate the content summary from the document sample.
Focused sampling retrieves documents only from “topically dense” areas of the database.
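A minimal sketch of the focused-sampling loop (my illustration; `search` is a hypothetical interface returning the match count and the top-k documents for a query):

```python
from collections import Counter

def focused_sample(search, rules, k=4, max_queries=200):
    """search(q, k) -> (match_count, top_k_documents); rules: category -> query list."""
    sample, match_counts, issued = [], {}, 0
    for category, queries in rules.items():
        for q in queries:
            if issued >= max_queries:
                break
            n_matches, top_docs = search(q, k)   # probe the database, keep top-k docs
            issued += 1
            if len(q.split()) == 1:
                match_counts[q] = n_matches      # saved for the frequency-adjustment step
            sample.extend(top_docs)
    # Content summary: document frequency of each word in the retrieved sample.
    summary = Counter(w for doc in sample for w in set(doc.lower().split()))
    return summary, match_counts
```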

27 Sampling and Incomplete Content Summaries
Problem: summaries from small samples are highly incomplete.
[Plot: log frequency vs. rank for the 10% most frequent words in the PubMed database; 95% of the words appear in less than 0.1% of the database]
Many words appear in “relatively few” documents (Zipf’s law).

28 Sampling and Incomplete Content Summaries
Problem: summaries from small samples are highly incomplete.
[Plot: same frequency/rank curve; e.g., “endocarditis” appears in ~10,000 documents, ~0.1% of the database]
Many words appear in “relatively few” documents (Zipf’s law). Low-frequency words are often important.

29 Sampling and Incomplete Content Summaries
Problem: summaries from small samples are highly incomplete.
[Plot: same frequency/rank curve with a sample of 300 documents; “endocarditis” (~9,000 documents, ~0.1% of the database) is easily missed]
Many words appear in “relatively few” documents (Zipf’s law). Low-frequency words are often important. Small document samples miss many low-frequency words.

30 Sample-based Content Summaries
Challenge: Improve content summary quality without increasing sample size Main Idea: Database Classification Helps Similar topics ↔ Similar content summaries Extracted content summaries complement each other

31 Databases with Similar Topics
CANCERLIT contains “metastasis”, not found during sampling CancerBACUP contains “metastasis” Databases under same category have similar vocabularies, and can complement each other

32 Content Summaries for Categories
Databases under same category share similar vocabulary Higher level category content summaries provide additional useful estimates All estimates in category path are potentially useful

33 Enhancing Summaries Using “Shrinkage”
Estimates from database content summaries can be unreliable Category content summaries are more reliable (based on larger samples) but less specific to database By combining estimates from category and database content summaries we get better estimates SIGMOD 2004

34 Shrinkage-based Estimations
Adjust the estimate for metastasis in D:
Pr[metastasis | D] = λ1 · Pr[metastasis | Root] + λ2 · Pr[metastasis | Health] + λ3 · 0.092 + λ4 · 0.000
Select the λi weights to maximize the probability that the summary of D comes from a database under all of its parent categories. This avoids the “sparse data” problem and decreases the estimation risk.

35 Computing Shrinkage-based Summaries
Category path: Root → Health → Cancer → D.
Pr[metastasis | D] = λ1 · Pr[metastasis | Root] + λ2 · Pr[metastasis | Health] + λ3 · 0.092 + λ4 · 0.000
Pr[treatment | D] = λ1 · Pr[treatment | Root] + λ2 · Pr[treatment | Health] + λ3 · 0.179 + λ4 · 0.184
The λi weights are computed automatically using an EM algorithm. The computation is performed offline → no query overhead. This avoids the “sparse data” problem and decreases the estimation risk.
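A minimal sketch of the shrinkage combination (illustration only; in the actual method the λi are learned offline with EM, not hard-coded). The numbers reuse the CANCERLIT example on the next slide:

```python
def shrunk_probability(word, path_estimates, lambdas):
    """path_estimates: dicts Pr[word | node] for Root, ..., D (same order as lambdas)."""
    return sum(l * est.get(word, 0.0) for l, est in zip(lambdas, path_estimates))

lambdas = [0.02, 0.13, 0.20, 0.65]            # Root, Health, Cancer, D (illustrative)
path = [{"metastasis": 0.002},                # Root
        {"metastasis": 0.05},                 # Health
        {"metastasis": 0.092},                # Cancer
        {"metastasis": 0.0}]                  # D (old, sample-based estimate)
print(round(shrunk_probability("metastasis", path, lambdas), 3))   # ~0.025
```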

36 Shrinkage Weights and Summary
Shrinkage weights for CANCERLIT: λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65.
Word | Shrinkage-based (new) | Root | Health | Cancer | CANCERLIT (old)
metastasis | 2.5% | 0.2% | 5% | 9.2% | 0%
aids | 14.3% | 0.8% | 7% | 2% | 20%
For football, the old CANCERLIT estimate was 0%, an ancestor-category estimate is 1%, and the shrinkage-based estimate becomes 0.17%.
Shrinkage increases the estimates for underestimated words (e.g., metastasis) and decreases the word-probability estimates for overestimated words (e.g., aids), but it also introduces (with small probabilities) spurious words (e.g., football).

37 Is Shrinkage Always Necessary?
Shrinkage is used to reduce the uncertainty (variance) of the estimates.
Small samples of large databases → high variance: in the sample, 10 out of 100 documents contain metastasis; in the database, how many out of 10,000,000 documents?
Small samples of small databases → small variance: in the database, how many out of 200 documents?
Shrinkage is less useful (or even harmful) when the uncertainty is low.

38 Adaptive Application of Shrinkage
Database selection algorithms assign a score to each database for each query. When the word-frequency estimates are uncertain, the assigned score has high variance → shrinkage improves the score estimates. When the word-frequency estimates are reliable, the assigned score has small variance → shrinkage is unnecessary (and might hurt).
[Figure: probability distribution of the database score for a query; unreliable (high-variance) estimate → use shrinkage; reliable (low-variance) estimate → shrinkage might hurt]
Solution: use shrinkage adaptively, in a query- and database-specific manner.

39 Searching Algorithm
One-time process: classify the databases and extract document samples; adjust the frequencies in the samples.
For every query, for each database D: assign a score to D (using its extracted content summary); examine the uncertainty of the score; if the uncertainty is high, apply shrinkage and assign the new score, else keep the existing score. Query only the top-K scoring databases.

40 Results: Database Selection
Metric: R(K) = X / Y, where X = the number of relevant documents in the selected K databases and Y = the number of relevant documents in the best K databases. Results are for CORI (a state-of-the-art database selection algorithm) with stemming over the TREC6 testbed.
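A small helper (added for illustration) computing R(K) as defined above, given per-database relevant-document counts:

```python
def r_at_k(relevant_per_db, selected, k):
    """relevant_per_db: dict database -> number of relevant documents it holds."""
    x = sum(relevant_per_db[d] for d in selected[:k])
    y = sum(sorted(relevant_per_db.values(), reverse=True)[:k])
    return x / y if y else 0.0

relevant = {"PubMed": 120, "USPTO": 3, "NYTimes": 40, "LoC": 15}
print(r_at_k(relevant, ["NYTimes", "USPTO"], k=2))   # (40 + 3) / (120 + 40) = 0.26875
```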

41 Outline of Talk
Classification of Hidden-Web Databases; Search over Hidden-Web Databases; Managing Changes in Hidden-Web Databases

42 Never-Update Policy
Naive practice: construct the summary once, never update it. The extracted (old) summary may: miss new words (from new documents); contain obsolete words (from deleted documents); provide inaccurate frequency estimates.
Word | #Docs in NY Times (Oct 29, 2004) | #Docs in NY Times (Mar 29, 2005)
tsunami | (0) | 250
recount | 2,302 | (0)
grokster | 2 | 78

43 Updating Content Summaries: Questions
Do content summaries change over time? Which database properties affect the rate of change? How to schedule updates with constrained resources? ICDE 2005

44 Data for our Study: 152 Web Databases
Study period: Oct 2002 – Oct 2003; 52 weekly snapshots for each database; approximately 5 million pages in each snapshot; 65 GB per snapshot (3.3 TB total).
Examined differences between snapshots: after 1 week, 5% new words and 5% of the old words had disappeared; after 20 weeks, 20% new words and 20% of the old words had disappeared.

45 Survival Analysis
Survival analysis: a collection of statistical techniques for predicting “the time until an event occurs”. Initially used to measure the length of survival of patients under different treatments (hence the name), and to measure the effect of different parameters (e.g., weight, race) on survival time. We want to predict the “time until the next update” and find the database properties that affect this time.

46 Survival Analysis for Summary Updates
“Survival time of a summary”: the time until the current database summary is “sufficiently different” from the old one (i.e., an update is required). The old summary changes at time t if KL divergence(current, old) > τ, where τ is a change-sensitivity threshold. Survival analysis estimates the probability that a database summary changes within time t.
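A minimal sketch of this change test (my illustration; the smoothing constant `eps` is an assumption so the divergence stays finite for words missing from one of the summaries):

```python
import math

def kl_divergence(current, old, eps=1e-9):
    """current, old: dicts word -> probability (each summing to ~1)."""
    vocab = set(current) | set(old)
    return sum(current.get(w, eps) * math.log(current.get(w, eps) / old.get(w, eps))
               for w in vocab)

def needs_update(current, old, tau=0.5):
    # An update is triggered when the divergence exceeds the sensitivity threshold τ.
    return kl_divergence(current, old) > tau
```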

47 Survival Times and “Incomplete” Data
[Figure: “survival times” for a database over the 52-week study; observations that had not changed by week 52 (end of study) are “censored”]
Many observations are “incomplete” (aka “censored”). Censored data give partial information (the database did not change during the observation period).

48 Using “Censored” Data
[Figure: three fits of the survival function S(t): ignoring censored data, using censored data, and using censored data “as-is”]
By ignoring censored cases we get underestimates → we perform more update operations than needed. By using censored cases “as-is” we again get underestimates. Survival analysis “extends” the lifetime of the censored cases.

49 Database Properties and Survival Times
For our analysis, we use Cox Proportional Hazards Regression. It effectively uses “censored” data (i.e., the database did not change within time T). It derives the effect of database properties on the rate of change, e.g., “if you double the size of a database, it changes twice as fast.” It makes no assumptions about the form of the survival function.
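A hedged sketch of a Cox proportional-hazards fit using the `lifelines` Python library (an assumption on my part; the talk does not say which software was used, and the column names and data below are made up):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: observed weeks, whether the summary changed (0 = censored
# at week 52, the end of the study), and two example database properties.
df = pd.DataFrame({
    "weeks":    [10, 52, 23, 52, 7, 35, 52, 18],
    "changed":  [1,  0,  1,  0,  1,  1,  0,  1],
    "log_size": [6.1, 5.8, 3.2, 2.9, 7.0, 4.4, 3.8, 5.1],
    "is_com":   [1,  0,  1,  1,  0,  1,  0,  0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks", event_col="changed")
cph.print_summary()   # hazard ratios show how each property affects the rate of change
```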

50 Baseline Survival Functions by Domain
Effect of domain: GOV changes slower than any other domain; EDU changes fast in the short term, but slower in the long term; COM and other commercial sites change faster than the rest.

51 Results of Cox PH Analysis
Cox PH analysis gives a formula for predicting the time between updates for any database. The rate of change depends on: domain, database size, history of change, and the threshold τ. By knowing the time between updates we can schedule update operations better!

52 Scheduling Updates
Database | Rate of change λ | Average time between updates: 10 weeks | Average time between updates: 40 weeks
Tom’s Hardware | 0.088 | 5 weeks | 46 weeks
USPS | 0.023 | 12 weeks | 34 weeks
With plentiful resources, we update sites according to their rate of change. When resources are constrained, we update less often the sites that change “too frequently”.

53 Classification & Search: Overall Contributions
Support for browsing and searching Hidden-Web databases No need for cooperation: Work with autonomous Hidden-Web databases Scalable and work with large number of databases Not restricted to “Hidden”-Web databases: Work with any searchable text database Classification and content summary extraction implemented and available for download at:

54 Current Work: Integrated Access to Hidden-Web Databases
Query: [good drama movies playing in newark tomorrow] Current top Google result: as of Oct 2nd, 2005

55 Future Work: Integrated Access to Hidden-Web Databases
Query: [good drama movies playing in newark tomorrow] → query review databases, query movie databases, query ticket databases.
All the information is already available on the Web: review databases (Rotten Tomatoes, NY Times, TONY, …), movie databases (All Movie Guide, IMDB), tickets (Moviefone, Fandango, …).
Privacy: should a database know that it was selected for one of my queries? Authorization: if a query is sent to a database that I do not have access to, I learn that it contains something relevant.

56 Future Work: Integrated Access to Hidden-Web Databases
Query: [good drama movies playing in newark tomorrow] → query review databases, query movie databases, query ticket databases.
Challenges: Short term: learn to interface with different databases; adapt the database selection algorithms. Long term: understand the semantics of the query; extract “query plans” and optimize for distributed execution; personalization; security and privacy.
Privacy: should a database know that it was selected for one of my queries? Authorization: if a query is sent to a database that I do not have access to, I learn that it contains something relevant.

57 Thank you!

58 Other Work: Approximate Text Matching
VLDB 2001, WWW 2003. Matching similar strings within a relational DBMS is important: the data resides there.
Service A: Jenny Stamatopoulou; John Paul McDougal; Aldridge Rodriguez; Panos Ipeirotis; John Smith.
Service B: Panos Ipirotis; Jonh Smith; Stamatopulou, Jenny; John P. McDougal; Al Dridge Rodriguez.
Exact joins are not enough: typing mistakes, abbreviations, different conventions. We introduced algorithms for mapping approximate text joins into SQL: no need for import/export of data; provides a crucial building block for data-cleaning applications; identifies many interesting matches. Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research) and others.
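An illustrative Python sketch of q-gram count filtering, one way to find candidate matches for an approximate join (this is not the SQL formulation itself; the q-gram size and threshold below are arbitrary):

```python
from collections import defaultdict

def qgrams(s, q=3):
    s = f"{'#' * (q - 1)}{s.lower()}{'%' * (q - 1)}"     # pad the ends of the string
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def candidate_pairs(table_a, table_b, q=3, min_common=4):
    index = defaultdict(set)                              # q-gram -> row ids in table_b
    for j, s in enumerate(table_b):
        for g in qgrams(s, q):
            index[g].add(j)
    counts = defaultdict(int)                             # (i, j) -> shared q-grams
    for i, s in enumerate(table_a):
        for g in qgrams(s, q):
            for j in index[g]:
                counts[(i, j)] += 1
    # Strings with small edit distance must share many q-grams: keep pairs above threshold.
    return [(table_a[i], table_b[j]) for (i, j), c in counts.items() if c >= min_common]

A = ["Panos Ipeirotis", "John Smith"]
B = ["Panos Ipirotis", "Jonh Smith", "Stamatopulou, Jenny"]
print(candidate_pairs(A, B))   # the two misspelled variants survive the filter
```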

59 No Good Category for Database
This is a general “problem” with supervised learning. Example: English vs. Chinese databases. We devised a technique to analyze whether we can work with a given database: find the candidate text fields; send top-level queries; examine the results and construct a similarity matrix. If the “matrix rank” is small → many “similar” pages are returned: the Web form is not a search interface, the text field is not a “keyword” field, the database is in a different language, or the database is on an “unknown” topic.

60 Database not Category Focused
Extract one content summary per topic: Focused queries retrieve documents about known topic Each database is represented multiple times in hierarchy

61 Near Future Work: Definition and analysis of query-based algorithms
Currently query-based algorithms are evaluated only empirically Possible to model querying process using random graph theory and: Analyze thoroughly properties of the algorithms Understand better why, when, and how the algorithms work Interested in exploring similar directions: Adapt hyperlink-based ranking algorithms Use results in graph theory to design sampling algorithms WebDB 2003

62 Crawling- vs. Query-based Classification for CNN Sports
IEEE DEB, March 2002.
Efficiency statistics:
Crawling-based: 1,325 min, 270,202 files, 8 GB downloaded.
Query-based: 2 min (-99.8%), 112 queries, 357 KB (-99.9%).
Accuracy statistics: crawling-based classification classifies CNN Sports correctly only after downloading 70% of its documents.

63 Real Confusion Matrix for Top Node of Hierarchy
Health Sports Science Computers Arts 0.753 0.018 0.124 0.021 0.017 0.006 0.171 0.016 0.064 0.024 0.255 0.047 0.004 0.042 0.080 0.610 0.031 0.027 0.298

64 Adjusting Document Frequencies
Zipf’s law empirically connects word frequency f and rank r: f = A (r + B)^c. We know the document frequency and rank r of the words in the sample.
[Plot: frequency vs. rank of the words in the sample]
VLDB 2002

65 Adjusting Document Frequencies
Zipf’s law empirically connects word frequency f and rank r: f = A (r + B)^c. We know the document frequency and rank r of the words in the sample, and we know the real document frequency f of some words from one-word queries.
[Plot: real frequencies in the database for the words probed with one-word queries, overlaid on the sample ranks]
VLDB 2002

66 Adjusting Document Frequencies
Zipf’s law empirically connects word frequency f and rank r: f = A (r + B)^c. We know the document frequency and rank r of the words in the sample, and we know the real document frequency f of some words from one-word queries. We use curve fitting to estimate the absolute frequency of all words in the sample.
[Plot: estimated frequency in the database for every word in the sample]
VLDB 2002
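A minimal sketch of this curve-fitting step with scipy (illustration only; the “known” frequencies below are synthetic, generated from an assumed true curve rather than taken from PubMed):

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(r, A, B, c):
    return A * (r + B) ** c

# Synthetic "known" points, standing in for the ranks of words whose real
# document frequency was obtained through one-word queries.
true_A, true_B, true_c = 30000.0, 2.0, -0.9
known_ranks = np.array([1, 5, 20, 100, 400], dtype=float)
known_freqs = zipf(known_ranks, true_A, true_B, true_c)

params, _ = curve_fit(zipf, known_ranks, known_freqs,
                      p0=(known_freqs[0], 1.0, -0.8),
                      bounds=([0.0, 0.0, -5.0], [np.inf, np.inf, 0.0]))
A, B, c = params
all_ranks = np.arange(1, 1001)
estimated = zipf(all_ranks, A, B, c)   # estimated absolute frequency for every sampled word
print(np.round(params, 2))             # recovers roughly [30000, 2, -0.9]
```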

67 Measuring Changes over Time
Recall: how many words in the current summary are also in the old (extracted) summary? Shows how well the old summaries cover the current (unknown) vocabulary; higher values are better.
Precision: how many words in the old (extracted) summary are still in the current summary? Shows how many obsolete words exist in the old summaries; higher values are better.
Results are for complete summaries (similar for approximate summaries).

68 Modeling Goals
Goal: estimate a database-specific survival-time distribution. The exponential distribution S(t) = exp(-λt) is common for survival times; λ captures the rate of change. We need to estimate λ for each database, preferably inferring λ from database properties (with no “training”). The intuitive (and wrong) approach, data + multiple regression, fails because the study contains a large number of “incomplete” observations and the target variable S(t) is typically not Gaussian.

69 Other Experiments
Effect of the choice of document classifier: RIPPER, C4.5, Naïve Bayes, SVM. Benefits of feature selection. Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models. Effect of the query-overlap elimination step. Over crawlable databases: query-based classification is orders of magnitude faster than “brute-force” crawling-based classification.
ACM TOIS 2003, IEEE Data Engineering Bulletin 2002

