Ray R. Larson : University of California, Berkeley Clustering and Classification Workshop 1998. Cheshire II and Automatic Categorization. Ray R. Larson, Associate Professor, School of Information Management and Systems.

Similar presentations
Text Categorization.

Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Chapter 5: Introduction to Information Retrieval
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Modern Information Retrieval Chapter 1: Introduction
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
San Diego Supercomputer Center Analyzing the NSDL Collection Peter Shin, Charles Cowart Tony Fountain, Reagan Moore San Diego Supercomputer Center.
IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems.
Information Retrieval in Practice
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
December 9, 2002 Cheshire II at INEX -- Ray R. Larson Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Ray R.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SLIDE 1IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
SLIDE 1IS 240 – Spring 2010 Logistic Regression The logistic function: The logistic function is useful because it can take as an input any.
SLIDE 1IS 202 – FALL 2004 Lecture 13: Midterm Review Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am -
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.
8/28/97Information Organization and Retrieval IR Implementation Issues, Web Crawlers and Web Search Engines University of California, Berkeley School of.
Recall: Query Reformulation Approaches 1. Relevance feedback based vector model (Rocchio …) probabilistic model (Robertson & Sparck Jones, Croft…) 2. Cluster.
11/21/2000Information Organization and Retrieval Thesaurus Design and Development University of California, Berkeley School of Information Management and.
9/14/2000Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Marti Hearst University of California,
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
9/19/2000Information Organization and Retrieval Vector and Probabilistic Ranking Ray Larson & Marti Hearst University of California, Berkeley School of.
9/21/2000Information Organization and Retrieval Ranking and Relevance Feedback Ray Larson & Marti Hearst University of California, Berkeley School of Information.
Information Retrieval
Clustering. What is clustering? Grouping similar objects together and keeping dissimilar objects apart. In Information Retrieval, the cluster hypothesis.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
CHAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Interactive Probabilistic Search for GikiCLEF Ray R Larson School of Information University of California, Berkeley Ray R Larson School of Information.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
Clustering C.Watters CS6403.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Vector Space Models.
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
A Logistic Regression Approach to Distributed IR Ray R. Larson : School of Information Management & Systems, University of California, Berkeley --
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Information Organization: Overview
Information Organization: Clustering
Introduction to Information Retrieval
Text Categorization Berlin Chen 2003 Reference:
CS 430: Information Discovery
Presentation transcript:

Ray R. Larson : University of California, Berkeley Clustering and Classification Workshop 1998

Cheshire II and Automatic Categorization
Ray R. Larson, Associate Professor, School of Information Management and Systems

Overview
- Introduction to Automatic Classification and Clustering
- Classification of Classification Methods
- Classification Clusters and Information Retrieval in Cheshire II
- DARPA Unfamiliar Metadata Project

Classification
- The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically, and the process itself may be manual or automated.
- In document classification, items are grouped together because they are likely to be wanted together
  - For example, items about the same topic.

Automatic Indexing and Classification
- Automatic indexing typically just derives keywords from a document and provides access to all of those words.
- More complex automatic indexing systems attempt to select controlled-vocabulary terms based on the terms in the document.
- Automatic classification attempts to group similar documents automatically, using either:
  - a fully automatic clustering method, or
  - an established classification scheme and a set of documents already indexed by that scheme.

Background and Origins
- Early suggestion by Fairthorne
  - "The Mathematics of Classification"
- Early experiments by Maron (1961) and by Borko and Bernick (1963)
- Work in numerical taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s)
- Early IR clustering work was more concerned with efficiency issues than with semantic issues.

Cluster Hypothesis
- The basic notion behind the use of classification and clustering methods:
- "Closely associated documents tend to be relevant to the same requests."
  - C.J. van Rijsbergen

Classification of Classification Methods
- Class structure
  - Intellectually formulated
    - Manual assignment (e.g., library classification)
    - Automatic assignment (e.g., Cheshire classification mapping)
  - Automatically derived from a collection of items
    - Hierarchic clustering methods (e.g., single link)
    - Agglomerative clustering methods (e.g., Dattola)
    - Hybrid methods (e.g., query clustering)

Classification of Classification Methods
- Relationship between properties and classes
  - monothetic
  - polythetic
- Relationship between objects and classes
  - exclusive
  - overlapping
- Relationship between classes and classes
  - ordered
  - unordered
(Adapted from Sparck Jones)

Properties and Classes
- Monothetic
  - A class defined by a set of properties that are both necessary and sufficient for membership in the class.
- Polythetic
  - A class defined by a set of properties such that each member must possess some number (usually large) of those properties, each property is possessed by a large number of the individuals in the class, and no single property need be possessed by every individual in the class.
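The contrast above can be made concrete with two membership predicates. This is a minimal sketch, not from the talk; the property names and the `min_shared` threshold are illustrative assumptions.

```python
# Classes and items are represented as sets of properties.

def monothetic_member(item_props, class_props):
    # Monothetic: the defining properties are necessary and sufficient,
    # so a member must possess every one of them.
    return class_props <= item_props

def polythetic_member(item_props, class_props, min_shared=2):
    # Polythetic: a member need only possess "some number (usually large)"
    # of the properties; min_shared is an illustrative threshold.
    return len(class_props & item_props) >= min_shared

cls = {"p1", "p2", "p3"}
assert monothetic_member({"p1", "p2", "p3", "p4"}, cls)
assert not monothetic_member({"p1", "p2"}, cls)
assert polythetic_member({"p1", "p2"}, cls)
```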

Monothetic vs. Polythetic
[Figure: a property table for eight objects, A through H, contrasting monothetic and polythetic class membership. Adapted from van Rijsbergen, 1979.]

Exclusive vs. Overlapping
- Exclusive: an item can belong to only a single class.
- Overlapping: items can belong to many classes, sometimes with a "membership weight".

Ordered vs. Unordered
- Ordered classes have some sort of structure imposed on them
  - Hierarchies are typical of ordered classes
- Unordered classes have no imposed precedence or structure; each class is considered to be on the same "level"
  - Typical in agglomerative methods

Clustering Methods
- Hierarchical
- Agglomerative
- Hybrid
- Automatic class assignment

Coefficients of Association
For documents represented as term sets X and Y (the formulas, shown as images in the original slide, are the standard set-based forms):
- Simple: |X ∩ Y|
- Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)
- Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|
- Cosine coefficient: |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))
- Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
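The coefficients can be sketched directly on term sets. A minimal illustration, assuming the standard set-based definitions (after van Rijsbergen); the example documents are invented.

```python
import math

# Classic association coefficients on documents represented as term sets.

def simple(x, y):
    return len(x & y)                              # |X ∩ Y|

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))      # 2|X ∩ Y| / (|X| + |Y|)

def jaccard(x, y):
    return len(x & y) / len(x | y)                 # |X ∩ Y| / |X ∪ Y|

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y)) # |X ∩ Y| / sqrt(|X||Y|)

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))        # |X ∩ Y| / min(|X|, |Y|)

x = {"cat", "dog", "fish"}
y = {"cat", "dog", "bird", "horse"}
```

Note that Dice, Jaccard, cosine, and overlap all normalize the same intersection count, differing only in the denominator; they therefore rank pairs similarly but not identically.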

Hierarchical Methods
- Single link, computed from a dissimilarity matrix
- Hierarchical methods are polythetic, usually exclusive, and ordered
- The resulting clusters are order-independent

Single Link: Threshold = 0.1
[Figure: the dissimilarity matrix, with pairs whose dissimilarity falls below the 0.1 threshold linked into clusters.]

[Two further slides show the single-link clusters at successively higher thresholds; the threshold values appeared only in the figures.]
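The single-link idea at a given threshold can be sketched as follows: link every pair whose dissimilarity is at or below the threshold, then take connected components as clusters. A minimal illustration (not the talk's own code); the example matrix values are invented.

```python
def single_link(dissim, threshold):
    """dissim maps frozenset({a, b}) -> dissimilarity between items a and b."""
    items = set()
    for pair in dissim:
        items |= pair
    # Union-find over the items: linked pairs end up in the same component.
    parent = {i: i for i in items}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for pair, d in dissim.items():
        if d <= threshold:
            a, b = pair
            parent[find(a)] = find(b)
    clusters = {}
    for i in items:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: sorted(c))

dissim = {
    frozenset({"A", "B"}): 0.05,
    frozenset({"A", "C"}): 0.30,
    frozenset({"B", "C"}): 0.25,
    frozenset({"C", "D"}): 0.08,
}
# At threshold 0.1, only A-B and C-D are linked; raising the threshold
# merges clusters, which is the progression the threshold slides illustrate.
```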

Clustering: Agglomerative Methods
- Agglomerative methods are polythetic, exclusive or overlapping, and unordered; the resulting clusters are order-dependent.
- Rocchio's method (similar to current k-means methods):
  1. Select initial centers (i.e., seed the space)
  2. Assign documents to the highest-matching centers and compute centroids
  3. Reassign all documents to the centroid(s)
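The three steps above can be sketched in k-means style. This is a minimal illustration, not the method as implemented in any system discussed here; the sparse term-weight vectors, squared-distance matching, and fixed iteration count are assumptions.

```python
def assign(docs, centers):
    # Step 2: assign each document to its closest (highest-matching) center.
    def dist(d, c):
        return sum((d.get(t, 0.0) - c.get(t, 0.0)) ** 2
                   for t in set(d) | set(c))
    return [min(range(len(centers)), key=lambda k: dist(d, centers[k]))
            for d in docs]

def centroid(docs):
    # Mean term weight over the documents assigned to one center.
    terms = set().union(*docs) if docs else set()
    return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

def cluster(docs, seeds, iterations=5):
    centers = list(seeds)                       # step 1: seed the space
    for _ in range(iterations):
        labels = assign(docs, centers)          # steps 2-3: (re)assign
        centers = [centroid([d for d, l in zip(docs, labels) if l == k])
                   for k in range(len(centers))]
    return assign(docs, centers)

docs = [{"a": 1.0}, {"a": 0.9}, {"b": 1.0}, {"b": 1.1}]
seeds = [{"a": 1.0}, {"b": 1.0}]
```

Because the initial seeds determine which centroids form, different document or seed orders can yield different clusters, which is why these methods are order-dependent.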

Automatic Class Assignment
- Automatic class assignment is polythetic, exclusive or overlapping, and usually ordered; the clusters are order-independent and usually based on an intellectually derived scheme.
1. Create pseudo-documents representing the intellectually derived classes.
2. Search the pseudo-documents using the document's contents as the query.
3. Obtain a ranked list of classes.
4. Assign the document to the N categories ranked above a threshold, or assign it to the top-ranked category only.
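The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the Cheshire search engine: classes are hypothetically represented as pseudo-document term sets and ranked by Dice overlap with the incoming document, rather than by the probabilistic ranking described later.

```python
def rank_classes(doc_terms, pseudo_docs):
    # Steps 2-3: "search" the pseudo-documents with the document's terms
    # and return classes ranked by score (Dice coefficient here).
    def dice(x, y):
        return 2 * len(x & y) / (len(x) + len(y))
    scores = [(name, dice(doc_terms, terms))
              for name, terms in pseudo_docs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

def assign_classes(doc_terms, pseudo_docs, threshold=None):
    # Step 4: top-ranked category only (exclusive), or every category
    # scoring above the threshold (overlapping).
    ranked = rank_classes(doc_terms, pseudo_docs)
    if threshold is None:
        return [ranked[0][0]]
    return [name for name, score in ranked if score >= threshold]

pseudo_docs = {
    "Z699": {"online", "catalogs", "retrieval"},
    "QA76": {"computer", "programming"},
}
doc = {"online", "retrieval", "systems"}
```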

Automatic Categorization in Cheshire II
- The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and at a number of other sites and projects.

Overview of Cheshire II
- Supports SGML as the primary database type.
- A client/server application.
- Uses the Z39.50 Information Retrieval Protocol.
- Supports Boolean searching of all servers.
- Supports probabilistic ranked retrieval in the Cheshire search engine.
- Supports "nearest neighbor" searches, relevance feedback, and two-stage search.
- GUI interface on X Window displays (Tcl/Tk).
- HTTP/CGI interface for the Web (Tcl scripting).

Cheshire II: Cluster Generation
- Define the basis for clustering records:
  - Select the field to form the basis of the cluster.
  - Select the evidence fields to use as the contents of the pseudo-documents.
- During indexing, cluster keys are generated from the basis and evidence of each record.
- Cluster keys are sorted and merged on the basis, and a pseudo-document is created for each unique basis element, containing all of its evidence fields.
- The pseudo-documents (class clusters) are indexed on the combined evidence fields.
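The sort-and-merge step above can be sketched briefly. A minimal illustration of the pipeline, not Cheshire's actual indexing code; the record layout and field names (`lc_class`, `title`) are hypothetical.

```python
from itertools import groupby

def make_pseudo_documents(records, basis_field, evidence_fields):
    # 1. Generate a (basis, evidence) cluster key for each record.
    keys = [(r[basis_field],
             " ".join(r.get(f, "") for f in evidence_fields))
            for r in records]
    # 2. Sort and merge on the basis element: one pseudo-document per
    #    unique basis value, containing all of its evidence fields.
    keys.sort(key=lambda k: k[0])
    return {basis: " ".join(ev for _, ev in group)
            for basis, group in groupby(keys, key=lambda k: k[0])}

records = [
    {"lc_class": "Z699", "title": "Online catalogs"},
    {"lc_class": "Z699", "title": "Information retrieval systems"},
    {"lc_class": "QA76", "title": "Computer programming"},
]
```

The resulting pseudo-documents would then be indexed like ordinary documents, so any evidence term can retrieve its class.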

Cheshire II: Two-Stage Retrieval
- Using the LC Classification System:
  - A pseudo-document is created for each LC class, containing terms derived from "content-rich" portions of the documents in that class (subject headings, titles, etc.).
  - Permits searching by any term in the class.
  - Ranked probabilistic retrieval techniques attempt to present the "best matches" to a query first.
  - The user selects classes to feed back for the "second stage" search of documents.
- Can be used with any classified/indexed collection.

Probabilistic Retrieval: Logistic Regression
- Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables.
- The log odds of relevance is a linear function of the attributes, with the term contributions summed:
    log O(R | Q, D) = c_0 + c_1 X_1 + ... + c_k X_k
- The probability of relevance is recovered by inverting the log odds:
    P(R | Q, D) = e^(log O(R | Q, D)) / (1 + e^(log O(R | Q, D)))
(The equations, shown as images in the original slide, are reconstructed here in the standard logistic form.)

Probabilistic Retrieval: Logistic Regression
- In Cheshire II the probability of relevance is based on a logistic regression fitted over a sample set of TREC documents to determine the values of the coefficients.
- At retrieval time the probability estimate is obtained from six attributes, or "clues", about term usage in the document and the query.

Probabilistic Retrieval: Logistic Regression Attributes
- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between the query and the document (M), logged
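Given the six clue values for a query/document pair, the scoring reduces to a dot product followed by the logistic transform. A sketch only: the coefficient values below are placeholders, not the coefficients fitted on the TREC sample.

```python
import math

def relevance_probability(clues, coefficients, intercept):
    """clues: the six attribute values X_1..X_6 for one query/document pair."""
    # Log odds of relevance: a linear function of the attributes.
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, clues))
    # Probability of relevance: the inverse logistic transform of the log odds.
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Placeholder coefficients for illustration only.
coefficients = [0.2, -0.1, 0.3, -0.05, 0.4, 0.25]
intercept = -1.0
```

Documents are then presented in decreasing order of this probability, which is what allows the "best matches" to appear first in both stages of the two-stage search.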

Cheshire II Demo
- Examples from the Unfamiliar Metadata Project.
- The basis for the clusters is a normalized Library of Congress class number.
- Evidence is provided by terms from record titles (and by subject headings for the "all languages" training set).
- Five different training sets (Russian, German, French, Spanish, and all languages).
- Testing cross-language retrieval and classification.

Resources
- Cheshire II home page
- Unfamiliar Metadata Project: http://info.sims.berkeley.edu/research/metadata
- Cross-Language Classification Clusters

References
- C.J. van Rijsbergen, Information Retrieval, 2nd edition. London: Butterworths, 1979.
- Ray R. Larson, "Experiments in Automatic Library of Congress Classification". Journal of the American Society for Information Science, 43(2), 1992.
- M.E. Maron, "Automatic Indexing: An Experimental Inquiry". Journal of the ACM, July 1961.
- Alan Griffiths, Claire Luckhurst & Peter Willett, "Using Interdocument Similarity Information in Document Retrieval Systems". JASIS, 37(1), 3-11, 1986.

References
- Christian Plaunt & Barbara Norgard, "An Association Based Method for Automatic Indexing with a Controlled Vocabulary". To appear in JASIS.
- Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O'Leary & Ralph Moon, "Cheshire II: Designing a Next-Generation Online Catalog". JASIS, 47(7), 1996.