Automatic Classification of Text Databases Through Query Probing
Panagiotis G. Ipeirotis, Luis Gravano (Columbia University); Mehran Sahami (E.piphany Inc.)

Search-only Text Databases
- Sources of valuable information
- Hidden behind search interfaces
- Non-crawlable
- Example: Microsoft Support KB

Interacting With Searchable Text Databases
1. Searching: Metasearchers
2. Browsing: Use Yahoo-like directories
3. Browse & search: “Category-enabled” metasearchers

Searching Text Databases: Metasearchers
- Select the good databases for a query
- Evaluate the query at these databases
- Combine the query results from the databases
- Examples: MetaCrawler, SavvySearch, Profusion

Browsing Through Text Databases
- Yahoo-like web directories: InvisibleWeb.com, SearchEngineGuide.com, TheBigHub.com
- Example from InvisibleWeb.com: Computers > Publications > ACM DL
- Category-enabled metasearchers: user-defined category (e.g. Recipes)

Problem With Current Classification Approach
- Classification of databases is done manually
- This requires a lot of human effort!

How to Classify Text Databases Automatically: Outline
- Definition of classification
- Strategies for classifying searchable databases through query probing
- Initial experiments

Database Classification: Two Definitions
- Coverage-based classification: the database contains many documents about the category (e.g. Basketball)
  Coverage: #docs about this category
- Specificity-based classification: the database contains mainly documents about this category
  Specificity: #docs about this category / |DB|, where |DB| is the total number of documents in the database

Database Classification: An Example
Category: Basketball
- Coverage-based classification: ESPN.com, NBA.com
- Specificity-based classification: NBA.com, but not ESPN.com
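The difference between the two definitions can be sketched in a few lines of Python. The document counts and both thresholds below are invented for illustration (the slides give no numbers), and the function names are hypothetical.

```python
def coverage(docs_in_category: int) -> int:
    """Coverage: absolute number of documents about the category."""
    return docs_in_category


def specificity(docs_in_category: int, db_size: int) -> float:
    """Specificity: fraction of the database that is about the category."""
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    return docs_in_category / db_size


# Hypothetical counts, NOT measurements: (basketball docs, total docs).
databases = {
    "ESPN.com": (20_000, 200_000),
    "NBA.com": (15_000, 16_000),
}

# Assumed thresholds; the slides do not specify any.
COVERAGE_THRESHOLD = 10_000
SPECIFICITY_THRESHOLD = 0.5

coverage_based = [
    name for name, (n, size) in databases.items()
    if coverage(n) >= COVERAGE_THRESHOLD
]
specificity_based = [
    name for name, (n, size) in databases.items()
    if specificity(n, size) >= SPECIFICITY_THRESHOLD
]

print("Coverage-based:", coverage_based)
print("Specificity-based:", specificity_based)
```

With these made-up numbers both sites hold many basketball documents, but only the second is mostly about basketball, which matches the ESPN.com versus NBA.com contrast on the slide.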

Categorizing a Text Database: Two Problems
- Find the category of a given document
- Find the category of all the documents inside the database

Categorizing Documents
- Several text classifiers available
- RIPPER (AT&T Research, William Cohen 1995)
- Input: a set of pre-classified, labeled documents
- Output: a set of classification rules

Categorizing Documents: RIPPER
Training set: preclassified documents
- “Linux as a web server”: Computers
- “Linux vs. Windows: …”: Computers
- “Jordan was the leader of Chicago Bulls”: Sports
- “Smoking causes lung cancer”: Health
Output: rule-based classifier
- IF linux THEN Computers
- IF jordan AND bulls THEN Sports
- IF lung AND cancer THEN Health
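As a rough illustration of how rules of this shape can be applied to a document, here is a minimal Python sketch. It is not RIPPER itself (no rule learning happens here); the tokenization and the first-match policy are simplifying assumptions, and `classify` is a hypothetical helper.

```python
import re

# Each rule: (words that must ALL appear, category), copied from the slide.
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]


def classify(text, rules=RULES):
    """Return the category of the first rule whose words all occur in text,
    or None when no rule fires (assumed policy, not taken from the slides)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for required, category in rules:
        if required <= words:  # every required word is present
            return category
    return None


for doc in (
    "Linux as a web server",
    "Jordan was the leader of Chicago Bulls",
    "Smoking causes lung cancer",
):
    print(doc, "->", classify(doc))
```

A document that mentions only "Jordan" without "bulls" matches no rule and yields `None`, which is one reason such a classifier has imperfect recall.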

Precision and Recall of Document Classifier
During the training phase:
- 100 documents about computers
- “Computer” rules matched 50 docs
- From these 50 docs, 40 were about computers
- Precision = 40/50 = 0.8
- Recall = 40/100 = 0.4
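The arithmetic on this slide can be checked with a short Python sketch. The function and variable names are hypothetical; the three counts are the ones given on the slide.

```python
def precision(correct_matches: int, total_matches: int) -> float:
    """Fraction of matched documents that really belong to the category."""
    return correct_matches / total_matches


def recall(correct_matches: int, docs_in_category: int) -> float:
    """Fraction of the category's documents that the rules matched."""
    return correct_matches / docs_in_category


docs_about_computers = 100   # documents about computers in the training set
matched_by_rules = 50        # documents matched by the "Computer" rules
matched_and_correct = 40     # matched documents that were about computers

p = precision(matched_and_correct, matched_by_rules)
r = recall(matched_and_correct, docs_about_computers)
print(f"precision={p:.2f} recall={r:.2f}")
```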

From Document to Database Classification
- If we know the categories of all the documents, we are done!
- But databases do not export such data!
- How can we extract this information?

Our Approach: Query Probing
- Design a small set of queries to probe the databases
- Categorize the database based on the probing results

Designing and Implementing Query Probes
- The probes should extract information about the categories of the documents in the database
- Start with a document classifier (RIPPER)
- Transform each rule into a query:
  IF lung AND cancer THEN health → +lung +cancer
  IF linux THEN computers → +linux
- Get number of matches for each query
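The rule-to-query step can be sketched as follows. This is a minimal illustration, assuming a search interface that accepts `+word` conjunctions and reports a match count; `rule_to_query`, `probe`, and the `count_matches` callback are hypothetical names, and the counts in `fake_counts` are invented stand-ins for a real database.

```python
def rule_to_query(words):
    """Turn the conjunction of a rule's words into a '+word' query string."""
    return " ".join(f"+{w}" for w in words)


def probe(rules, count_matches):
    """Send one query per rule and sum the reported match counts per category.

    count_matches is a caller-supplied function mapping a query string to the
    number of matches reported by the database (no documents are retrieved).
    """
    totals = {}
    for words, category in rules:
        query = rule_to_query(words)
        totals[category] = totals.get(category, 0) + count_matches(query)
    return totals


# Rules from the slide, as (ordered words, category).
RULES = [
    (("lung", "cancer"), "health"),
    (("linux",), "computers"),
]

# Invented match counts standing in for a real search interface.
fake_counts = {"+lung +cancer": 3, "+linux": 120}

totals = probe(RULES, lambda q: fake_counts.get(q, 0))
print(rule_to_query(("lung", "cancer")))
print(totals)
```

Passing the search interface in as a callback keeps the sketch independent of any particular site; a real probe would parse the "X matches found" figure from the result page.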

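Three Categories and Three Databases
- Databases: ACM DL, NBA.com, PubMED
- Probes: lung AND cancer → health; jordan AND bulls → sports; linux → computers
- [Table: number of matches for each category (comp, sports, health) in each database (ACM, NBA, PubM); the values are not preserved in this transcript]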

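Using the Results for Classification
- [Table COV: coverage for each category (comp, sports, health) in each database (ACM, NBA, PubM); the values are not preserved in this transcript]
- [Table SPEC: specificity for each category (comp, sports, health) in each database (ACM, NBA, PubM); the values are not preserved in this transcript]
- We use the results to estimate coverage and specificity values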

Adjusting Query Results
- Classifiers are not perfect!
- Queries do not “retrieve” all the documents that belong to a category
- Queries for one category “match” documents that do not belong to this category
- From the training phase of the classifier we use precision and recall

Precision & Recall Adjustment
Computer category:
- Rule “linux”: precision = 0.7
- Rule “cpu”: precision = 0.9
- Recall (for all the rules) = 0.4
Probing with queries for “Computers”:
- Query +linux → $X_1$ matches → $0.7 X_1$ correct matches
- Query +cpu → $X_2$ matches → $0.9 X_2$ correct matches
From the $X_1 + X_2$ documents found:
- Expect $0.7 X_1 + 0.9 X_2$ to be correct
- Expect $(0.7 X_1 + 0.9 X_2) / 0.4$ total computer docs
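The adjustment on this slide amounts to a precision-weighted sum divided by recall. Here is a minimal Python sketch; the function name is hypothetical, and the match counts (100 and 50) are invented values for X1 and X2, while the precisions and recall are the ones from the slide.

```python
def estimate_category_docs(probe_results, recall):
    """Estimate how many documents of a category a database holds.

    probe_results: iterable of (match_count, rule_precision) pairs, one per query.
    recall: recall of the category's rules taken together (0 < recall <= 1).
    Returns (expected correct matches, estimated total documents in category).
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    expected_correct = sum(matches * prec for matches, prec in probe_results)
    return expected_correct, expected_correct / recall


# Hypothetical probe outcome: X1 = 100 matches for +linux, X2 = 50 for +cpu.
correct, total = estimate_category_docs([(100, 0.7), (50, 0.9)], recall=0.4)
print(f"expected correct matches: {correct:.1f}")
print(f"estimated computer docs:  {total:.1f}")
```

With these assumed counts the weighted sum is 0.7 · 100 + 0.9 · 50 = 115, and dividing by the recall of 0.4 scales it up to about 287.5 documents.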

Initial Experiments
- Used a collection of 20,000 newsgroup articles
- Formed 5 categories:
  Computers (comp.*)
  Science (sci.*)
  Hobbies (rec.*)
  Society (soc.* + alt.atheism)
  Misc (misc.sale)
- RIPPER trained with 10,000 newsgroup articles
- Classifier: 29 rules, 32 words used
  IF windows AND pc THEN Computers (precision ~0.75)
  IF satellite AND space THEN Science (precision ~0.9)

Web-databases Probed
Using the newsgroup classifier we probed four web databases:
- Cora: CS papers archive (Computers)
- American Scientist: science and technology magazine (Science)
- All Outdoors: articles about outdoor activities (Hobbies)
- Religion Today: news and discussion about religions (Society)

Results
- Only 29 queries per web site
- No need for document retrieval!

Conclusions
- Easy classification using only a small number of queries
- No need for document retrieval: only need a result like “X matches found”
- Not limited to search-only databases: every searchable database can be classified this way
- Not limited to topical classification

Current Issues
- Comprehensive classification scheme
- Representative training data

Future Work
- Use a hierarchical classification scheme
- Test different search interfaces: Boolean model, vector-space model, different capabilities
- Compare with document sampling (Callan et al.’s work, SIGMOD99, adapted for the classification task)
- Study classification efficiency when documents are accessible

Related Work
- Gauch (JUCS 1996)
- Etzioni et al. (JIIS 1997)
- Hawking & Thistlewaite (TOIS 1999)
- Callan et al. (SIGMOD 1999)
- Meng et al. (CoopIS 1999)