Slide 1: Automatic Classification of Text Databases Through Query Probing
Panagiotis G. Ipeirotis and Luis Gravano (Columbia University), Mehran Sahami (E.piphany Inc.)
Slide 2: Search-Only Text Databases
- Sources of valuable information
- Hidden behind search interfaces
- Non-crawlable
- Example: Microsoft Support KB
Slide 3: Interacting With Searchable Text Databases
1. Searching: metasearchers
2. Browsing: Yahoo-like directories
3. Browsing and searching: "category-enabled" metasearchers
Slide 4: Searching Text Databases: Metasearchers
- Select the databases best suited for a query
- Evaluate the query at these databases
- Combine the query results from the databases
- Examples: MetaCrawler, SavvySearch, ProFusion
Slide 5: Browsing Through Text Databases
- Yahoo-like web directories: InvisibleWeb.com, SearchEngineGuide.com, TheBigHub.com
- Example from InvisibleWeb.com: Computers > Publications > ACM DL
- Category-enabled metasearchers: user-defined category (e.g., Recipes)
Slide 6: Problem With the Current Classification Approach
- Databases are classified manually
- This requires a lot of human effort!
Slide 7: How to Classify Text Databases Automatically: Outline
- Definitions of database classification
- Strategies for classifying searchable databases through query probing
- Initial experiments
Slide 8: Database Classification: Two Definitions
- Coverage-based classification: the database contains many documents about the category (e.g., Basketball). Coverage = the number of documents about the category.
- Specificity-based classification: the database contains mainly documents about the category. Specificity = Coverage / |DB|, the fraction of the database devoted to the category.
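In symbols (the notation is mine, not from the slides), for a database D and category c:

```latex
\mathrm{Coverage}(D, c)    = \bigl|\{\, d \in D : d \text{ is about } c \,\}\bigr|,
\qquad
\mathrm{Specificity}(D, c) = \frac{\mathrm{Coverage}(D, c)}{|D|}
```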
Slide 9: Database Classification: An Example
- Category: Basketball
- Coverage-based classification: ESPN.com and NBA.com
- Specificity-based classification: NBA.com, but not ESPN.com (basketball is only part of what ESPN.com covers)
Slide 10: Categorizing a Text Database: Two Problems
- Find the category of a given document
- Find the categories of all the documents inside the database
Slide 11: Categorizing Documents
- Several text classifiers are available, e.g., RIPPER (William Cohen, AT&T Research, 1995)
- Input: a set of pre-classified, labeled documents
- Output: a set of classification rules
Slide 12: Categorizing Documents: RIPPER
Training set (pre-classified documents):
- "Linux as a web server": Computers
- "Linux vs. Windows: ...": Computers
- "Jordan was the leader of the Chicago Bulls": Sports
- "Smoking causes lung cancer": Health
Output, a rule-based classifier:
- IF linux THEN Computers
- IF jordan AND bulls THEN Sports
- IF lung AND cancer THEN Health
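A minimal sketch of how a learned rule set like this one is applied to a new document (plain Python, illustrative only; RIPPER itself learns the rules, it is not this function):

```python
# Illustrative rule-based classification with the rules from the slide.
# A rule fires when every one of its terms appears in the document.

RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document):
    words = set(document.lower().split())
    for terms, category in RULES:
        if terms <= words:      # all rule terms present in the document
            return category
    return None                 # no rule matched

print(classify("Jordan was the leader of the Chicago Bulls"))  # Sports
```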
Slide 13: Precision and Recall of the Document Classifier
During the training phase:
- 100 documents are about computers
- The "Computers" rules matched 50 documents
- Of these 50 documents, 40 were actually about computers
- Precision = 40/50 = 0.8
- Recall = 40/100 = 0.4
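In general terms (the worked numbers are the ones above):

```latex
\mathrm{Precision} = \frac{\text{correct matches}}{\text{all matches}} = \frac{40}{50} = 0.8,
\qquad
\mathrm{Recall} = \frac{\text{correct matches}}{\text{all category documents}} = \frac{40}{100} = 0.4
```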
Slide 14: From Document to Database Classification
- If we knew the categories of all the documents, we would be done!
- But databases do not export such data.
- How can we extract this information?
Slide 15: Our Approach: Query Probing
- Design a small set of queries to probe the database
- Categorize the database based on the probing results
Slide 16: Designing and Implementing Query Probes
- The probes should extract information about the categories of the documents in the database
- Start with a document classifier (RIPPER)
- Transform each rule into a boolean query:
  IF lung AND cancer THEN Health -> +lung +cancer
  IF linux THEN Computers -> +linux
- Get the number of matches for each query (see the sketch below)
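A sketch of the rule-to-query transformation and the probing loop; `count_matches` stands in for whatever search interface the target database exposes (hypothetical here), since only the number of matches is needed:

```python
# Turn each classification rule into a boolean query probe and tally the
# reported match counts per category. Only the counts are used, never the
# documents themselves.

RULES = [
    ({"lung", "cancer"}, "Health"),
    ({"jordan", "bulls"}, "Sports"),
    ({"linux"}, "Computers"),
]

def rule_to_query(terms):
    """IF lung AND cancer THEN Health  ->  '+cancer +lung'."""
    return " ".join("+" + t for t in sorted(terms))

def probe(count_matches):
    """count_matches(query) -> int is the database's search interface."""
    counts = {}
    for terms, category in RULES:
        counts[category] = counts.get(category, 0) + count_matches(rule_to_query(terms))
    return counts
```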
Slide 17: Three Categories and Three Databases
Number of matches reported by each database for each query probe:

Query (category)            ACM DL   NBA.com   PubMed
+linux (computers)             336         0       16
+jordan +bulls (sports)          0      6674        0
+lung +cancer (health)          18       103    81164
Slide 18: Using the Results for Classification
We use the probe results to estimate coverage and specificity values:

Coverage (number of matches):
            ACM DL   NBA.com   PubMed
Computers      336         0       16
Sports           0      6674        0
Health          18       103    81164

Specificity (fraction of matches per database):
            ACM DL   NBA.com   PubMed
Computers     0.95      0         0
Sports        0         0.985     0
Health        0.05      0.015     1.0
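A sketch of how the specificity table can be derived from the coverage counts: each database's match counts are normalized by their sum, which reproduces the figures above (e.g., 336 / (336 + 0 + 18) ≈ 0.95). The dictionary layout and names are mine:

```python
# Derive per-database specificity estimates from the raw match counts.

coverage = {                       # category -> matches in each database
    "Computers": {"ACM": 336, "NBA": 0,    "PubMed": 16},
    "Sports":    {"ACM": 0,   "NBA": 6674, "PubMed": 0},
    "Health":    {"ACM": 18,  "NBA": 103,  "PubMed": 81164},
}

databases = ["ACM", "NBA", "PubMed"]
totals = {db: sum(row[db] for row in coverage.values()) for db in databases}

specificity = {
    cat: {db: row[db] / totals[db] for db in databases}
    for cat, row in coverage.items()
}

print(round(specificity["Computers"]["ACM"], 2))  # 0.95
```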
Slide 19: Adjusting Query Results
- Classifiers are not perfect!
- Queries do not "retrieve" all the documents that belong to a category (imperfect recall)
- Queries for one category "match" documents that do not belong to that category (imperfect precision)
- We use the precision and recall measured during the classifier's training phase to adjust the counts
Slide 20: Precision and Recall Adjustment
Computers category:
- Rule "linux": precision = 0.7
- Rule "cpu": precision = 0.9
- Recall (over all the rules) = 0.4
Probing with the queries for Computers:
- Query +linux: X1 matches -> 0.7*X1 correct matches
- Query +cpu: X2 matches -> 0.9*X2 correct matches
From the X1 + X2 documents found:
- Expect 0.7*X1 + 0.9*X2 to be correct
- Expect (0.7*X1 + 0.9*X2) / 0.4 computer documents in total
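A minimal sketch of this adjustment (the function name and the sample counts X1 = 100, X2 = 50 are mine; the formula is the one on the slide):

```python
# Scale each query's match count by its rule's training precision, sum the
# estimated correct matches, and divide by the category's training recall.

def adjusted_coverage(match_counts, precisions, recall):
    correct = sum(p * x for p, x in zip(precisions, match_counts))
    return correct / recall

# With X1 = 100 (+linux) and X2 = 50 (+cpu):
# (0.7 * 100 + 0.9 * 50) / 0.4 = 287.5 estimated Computers documents.
print(adjusted_coverage([100, 50], [0.7, 0.9], recall=0.4))
```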
Slide 21: Initial Experiments
- Used a collection of 20,000 newsgroup articles
- Formed five categories: Computers (comp.*), Science (sci.*), Hobbies (rec.*), Society (soc.* + alt.atheism), Misc (misc.forsale)
- RIPPER trained with 10,000 newsgroup articles
- Resulting classifier: 29 rules using 32 distinct words, e.g.:
  IF windows AND pc THEN Computers (precision ~ 0.75)
  IF satellite AND space THEN Science (precision ~ 0.9)
Slide 22: Web Databases Probed
Using the newsgroup classifier we probed four web databases:
- Cora (www.cora.jprc.com): CS papers archive (Computers)
- American Scientist (www.amsci.org): science and technology magazine (Science)
- All Outdoors (www.alloutdoors.com): articles about outdoor activities (Hobbies)
- Religion Today (www.religiontoday.com): news and discussion about religions (Society)
Slide 23: Results
- Only 29 queries per web site
- No document retrieval needed!
Slide 24: Conclusions
- Databases can be classified easily using only a small number of queries
- No document retrieval is needed; only a result like "X matches found" is required
- Not limited to search-only databases: every searchable database can be classified this way
- Not limited to topical classification
Slide 25: Current Issues
- Building a comprehensive classification scheme
- Obtaining representative training data
Slide 26: Future Work
- Use a hierarchical classification scheme
- Test different search interfaces: Boolean model, vector-space model, different capabilities
- Compare with document sampling (Callan et al., SIGMOD 1999, adapted for the classification task)
- Study classification efficiency when documents are accessible
Slide 27: Related Work
- Gauch (JUCS 1996)
- Etzioni et al. (JIIS 1997)
- Hawking and Thistlewaite (TOIS 1999)
- Callan et al. (SIGMOD 1999)
- Meng et al. (CoopIS 1999)