1 Ray R. Larson : University of California, Berkeley, Clustering and Classification Workshop 1998
Cheshire II and Automatic Categorization
Ray R. Larson, Associate Professor, School of Information Management and Systems

2 Overview
• Introduction to Automatic Classification and Clustering
• Classification of Classification Methods
• Classification Clusters and Information Retrieval in Cheshire II
• DARPA Unfamiliar Metadata Project

3 Classification
• The grouping together of items (including documents or their representations), which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
• In document classification the items are grouped together because they are likely to be wanted together.
  » For example, items about the same topic.

4 Automatic Indexing and Classification
• Automatic indexing is typically the simple derivation of keywords from a document, providing access to all of those words.
• More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document.
• Automatic classification attempts to automatically group similar documents using either:
  » A fully automatic clustering method.
  » An established classification scheme and a set of documents already indexed by that scheme.

5 Background and Origins
• Early suggestion by Fairthorne
  » “The Mathematics of Classification”
• Early experiments by Maron (1961) and Borko and Bernick (1963)
• Work in numerical taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s).
• Early IR clustering work was more concerned with efficiency issues than with semantic issues.

6 Cluster Hypothesis
• The basic notion behind the use of classification and clustering methods:
• “Closely associated documents tend to be relevant to the same requests.”
  » C.J. van Rijsbergen

7 Classification of Classification Methods
• Class Structure
  » Intellectually Formulated
    – Manual assignment (e.g. Library classification)
    – Automatic assignment (e.g. Cheshire Classification Mapping)
  » Automatically derived from collection of items
    – Hierarchic Clustering Methods (e.g. Single Link)
    – Agglomerative Clustering Methods (e.g. Dattola)
    – Hybrid Methods (e.g. Query Clustering)

8 Classification of Classification Methods
• Relationship between properties and classes
  » monothetic
  » polythetic
• Relation between objects and classes
  » exclusive
  » overlapping
• Relation between classes and classes
  » ordered
  » unordered
Adapted from Sparck Jones

9 Properties and Classes
• Monothetic
  » Class defined by a set of properties that are both necessary and sufficient for membership in the class.
• Polythetic
  » Class defined by a set of properties such that each member of the class must have some number (usually large) of those properties, a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

10 Monothetic vs. Polythetic
[Figure: table of individuals 1-8 (rows) against properties A-H (columns), with three properties marked "+" for each individual; one group of individuals is labeled Polythetic and the other Monothetic.]
Adapted from van Rijsbergen, ‘79

11 Exclusive vs. Overlapping
• Exclusive: an item belongs to a single class only.
• Overlapping: items can belong to many classes, sometimes with a “membership weight”.

12 Ordered vs. Unordered
• Ordered classes have some sort of structure imposed on them.
  » Hierarchies are typical of ordered classes.
• Unordered classes have no imposed precedence or structure, and each class is considered on the same “level”.
  » Typical in agglomerative methods.

13 Clustering Methods
• Hierarchical
• Agglomerative
• Hybrid
• Automatic Class Assignment

14 Coefficients of Association
• Simple
• Dice’s coefficient
• Jaccard’s coefficient
• Cosine coefficient
• Overlap coefficient
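
The five coefficients listed above can be sketched over sets of index terms. This is a minimal illustration using the common set-based textbook definitions; the function names and the sample term sets are assumptions made here, not part of the slides, and both sets are assumed to be non-empty.

```python
import math

def simple_match(x: set, y: set) -> int:
    """Simple matching coefficient: number of shared terms."""
    return len(x & y)

def dice(x: set, y: set) -> float:
    """Dice's coefficient: 2|X n Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x: set, y: set) -> float:
    """Jaccard's coefficient: |X n Y| / |X u Y|."""
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    """Cosine coefficient: |X n Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x: set, y: set) -> float:
    """Overlap coefficient: |X n Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))

# Hypothetical term sets for two documents.
doc_a = {"cat", "dog", "fish"}
doc_b = {"dog", "fish", "bird", "horse"}
for fn in (simple_match, dice, jaccard, cosine, overlap):
    print(fn.__name__, fn(doc_a, doc_b))
```

All but the simple coefficient normalize the shared-term count by document size, which keeps long documents from dominating.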

15 Hierarchical Methods
Single Link Dissimilarity Matrix

      1    2    3    4
2    .4
3    .4   .2
4    .3   .3   .3
5    .1   .4   .4   .1

Hierarchical methods: Polythetic, Usually Exclusive, Ordered. Clusters are order-independent.

16 Threshold = .1
Single Link Dissimilarity Matrix

      1    2    3    4
2    .4
3    .4   .2
4    .3   .3   .3
5    .1   .4   .4   .1

Thresholded matrix:

      1    2    3    4
2     0
3     0    0
4     0    0    0
5     1    0    0    1

[Figure: graph of items 1-5 linked at this threshold]

17 Threshold = .2
Single Link Dissimilarity Matrix

      1    2    3    4
2    .4
3    .4   .2
4    .3   .3   .3
5    .1   .4   .4   .1

Thresholded matrix:

      1    2    3    4
2     0
3     0    1
4     0    0    0
5     1    0    0    1

[Figure: graph of items 1-5 linked at this threshold]

18 Threshold = .3
Single Link Dissimilarity Matrix

      1    2    3    4
2    .4
3    .4   .2
4    .3   .3   .3
5    .1   .4   .4   .1

Thresholded matrix:

      1    2    3    4
2     0
3     0    1
4     1    1    1
5     1    0    0    1

[Figure: graph of items 1-5 linked at this threshold]
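
The threshold walk-through on the three slides above can be reproduced with a short sketch: link every pair whose dissimilarity is at or below the threshold, then read off the connected components as single-link clusters. The union-find helper and the function name are assumptions made for illustration; the matrix values are the ones shown on the slides.

```python
# Lower-triangle dissimilarities from the slides, keyed by (row, column).
DISSIM = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}

def single_link_clusters(dissim, items, threshold):
    """Return the connected components formed by pairs with dissimilarity <= threshold."""
    parent = {i: i for i in items}

    def find(i):
        # Follow parent pointers to the component root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), d in dissim.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())

for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(DISSIM, [1, 2, 3, 4, 5], t))
```

Because the result depends only on which pairs fall under the threshold, the order in which pairs are processed does not change the clusters, which matches the slides' point that hierarchical clusters are order-independent.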

19 Clustering
Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered. Clusters are order-dependent.
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
Rocchio’s method (similar to current K-means methods)
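
The three steps above can be sketched as a small seed-assign-reassign loop. This is a K-means-style illustration under stated assumptions, not Rocchio's actual algorithm: documents are term-frequency dictionaries, matching uses cosine similarity, and the function names, the `passes` parameter, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average term weights over a non-empty list of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def assign(docs, centers):
    """Label each document with the index of its highest-matching center."""
    return [max(range(len(centers)), key=lambda k: cosine(d, centers[k]))
            for d in docs]

def cluster(docs, seed_indices, passes=2):
    centers = [dict(docs[i]) for i in seed_indices]   # 1. seed the space
    labels = assign(docs, centers)                    # 2. assign to centers
    for _ in range(passes):
        for k in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                centers[k] = centroid(members)        # 2. compute centroids
        labels = assign(docs, centers)                # 3. reassign to centroids
    return labels

docs = [
    {"cat": 2, "dog": 1},
    {"cat": 1, "dog": 1},
    {"stock": 2, "bond": 1},
    {"stock": 1, "bond": 2},
]
print(cluster(docs, seed_indices=[0, 2]))
```

The result depends on which documents are chosen as seeds, which is the order-dependence the slide attributes to these methods.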

20 Automatic Class Assignment
[Figure: Doc, Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents.
3. Obtain ranked list.
4. Assign document to the N categories ranked over a threshold, OR assign to the top-ranked category.
Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme.
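
Steps 2 to 4 above can be sketched as follows. Jaccard overlap stands in for the search engine's ranking purely for illustration (Cheshire II itself uses probabilistic ranking), and the function names, class labels, and terms are hypothetical. Term lists are assumed to be non-empty.

```python
def rank_classes(doc_terms, pseudo_docs):
    """Rank classes by term overlap between the document and each pseudo-document."""
    doc = set(doc_terms)
    scores = []
    for class_id, terms in pseudo_docs.items():
        class_terms = set(terms)
        # Jaccard overlap as a stand-in for a real retrieval score.
        score = len(doc & class_terms) / len(doc | class_terms)
        scores.append((class_id, score))
    # Highest score first; ties broken by class id for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

def assign_classes(doc_terms, pseudo_docs, threshold=None, top_n=1):
    """Assign the top-ranked class, or up to top_n classes scoring >= threshold."""
    ranked = rank_classes(doc_terms, pseudo_docs)
    if threshold is None:
        return [ranked[0][0]]
    return [c for c, s in ranked if s >= threshold][:top_n]

# Hypothetical pseudo-documents, one per class.
pseudo_docs = {
    "QA": ["computer", "algorithm", "software"],
    "Z": ["library", "catalog", "classification"],
    "HB": ["economics", "market"],
}
doc = ["library", "classification", "algorithm"]
print(assign_classes(doc, pseudo_docs))                          # exclusive assignment
print(assign_classes(doc, pseudo_docs, threshold=0.1, top_n=2))  # overlapping assignment
```

The two calls correspond to the slide's alternatives: a single top-ranked category gives exclusive classes, while a threshold with N greater than 1 gives overlapping classes.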

21 Automatic Categorization in Cheshire II
• The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and for a number of other sites and projects.

22 Overview of Cheshire II
• It supports SGML as the primary database type.
• It is a client/server application.
• Uses the Z39.50 Information Retrieval Protocol.
• Supports Boolean searching of all servers.
• Supports probabilistic ranked retrieval in the Cheshire search engine.
• Supports “nearest neighbor” searches, relevance feedback and Two-Stage Search.
• GUI interface on X window displays (Tcl/Tk).
• HTTP/CGI interface for the Web (Tcl scripting).

23 Cheshire II - Cluster Generation
• Define basis for clustering records.
  » Select field to form the basis of the cluster.
  » Select evidence fields to use as contents of the pseudo-documents.
• During indexing, cluster keys are generated with basis and evidence from each record.
• Cluster keys are sorted and merged on basis, and a pseudo-document is created for each unique basis element, containing all evidence fields.
• Pseudo-documents (class clusters) are indexed on the combined evidence fields.
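
The cluster-generation steps above can be sketched as a group-by over records. This is an illustration of the idea only, not Cheshire II's indexing code; the function name, the field names (`lc_class`, `title`, `subject`), and the sample records are hypothetical.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Merge evidence fields from all records that share a basis value."""
    # Generate one cluster key (basis, evidence) per record that has a basis value.
    keys = []
    for rec in records:
        basis = rec.get(basis_field)
        if not basis:
            continue
        evidence = [rec[f] for f in evidence_fields if rec.get(f)]
        keys.append((basis, evidence))

    # Sort on basis, then merge the evidence for each unique basis element.
    keys.sort(key=lambda key: key[0])
    pseudo = defaultdict(list)
    for basis, evidence in keys:
        pseudo[basis].extend(evidence)
    return dict(pseudo)

# Hypothetical catalog records.
records = [
    {"lc_class": "Z699", "title": "Information retrieval systems",
     "subject": "Information storage"},
    {"lc_class": "QA76", "title": "Algorithms"},
    {"lc_class": "Z699", "title": "Online catalogs"},
]
for basis, evidence in build_pseudo_documents(
        records, "lc_class", ["title", "subject"]).items():
    print(basis, evidence)
```

Each resulting pseudo-document can then be indexed like an ordinary document, so that a query matching its evidence terms retrieves the class.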

24 Cheshire II - Two-Stage Retrieval
• Using the LC Classification System
  » Pseudo-document created for each LC class, containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
  » Permits searching by any term in the class
  » Ranked probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  » User selects classes to feed back for the “second stage” search of documents.
• Can be used with any classified/indexed collection.

25 Probabilistic Retrieval: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.
Log odds of relevance is a linear function of attributes:
Term contributions summed:
Probability of Relevance is inverse of log odds:
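
The first and last statements above correspond to the standard logistic-regression form, written out here as a sketch. The symbols $c_i$, $X_i$, and $n$ are chosen for illustration and are not taken from the slide; the per-term summation the slide mentions is not reconstructed.

```latex
% Log odds of relevance R for query Q and document D,
% as a linear function of n attributes X_1, ..., X_n:
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i

% Probability of relevance recovered from the log odds (logistic function):
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```

A log odds of zero therefore corresponds to a relevance probability of 0.5.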

26 Probabilistic Retrieval: Logistic Regression
In Cheshire II the probability of relevance is based on logistic regression over a sample set of TREC documents, which determines the values of the coefficients.
At retrieval the probability estimate is obtained by:
For 6 attributes or “clues” about term usage in the documents and the query

27 Probabilistic Retrieval: Logistic Regression attributes
• Average Absolute Query Frequency
• Query Length
• Average Absolute Document Frequency
• Document Length
• Average Inverse Document Frequency
• Inverse Document Frequency
• Number of Terms in common between query and document (M), logged
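
Given attribute values and fitted coefficients, the retrieval-time estimate can be computed as in the sketch below. The function name is an assumption, and the coefficient and attribute values in the usage lines are placeholders, not the values fitted for Cheshire II.

```python
import math

def probability_of_relevance(attributes, coefficients, intercept):
    """Logistic estimate: log odds are linear in the attributes."""
    if len(attributes) != len(coefficients):
        raise ValueError("need one coefficient per attribute")
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    # Convert log odds to a probability with the logistic function.
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder values for six attributes and their coefficients (hypothetical).
attributes = [0.7, 2.0, 1.1, 30.0, 3.2, math.log(4)]
coefficients = [1.0, -0.2, 0.5, -0.01, 0.1, 0.8]
print(probability_of_relevance(attributes, coefficients, intercept=-3.0))
```

Documents are then ranked by this estimate, so only the relative order of the probabilities matters for presenting the best matches first.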

28 Cheshire II Demo
• Examples from the Unfamiliar Metadata Project
• Basis for clusters is a normalized Library of Congress Class Number
• Evidence is provided by terms from record titles (and subject headings for the “all languages” set)
• Five different training sets (Russian, German, French, Spanish, and All Languages)
• Testing cross-language retrieval and classification

29 Resources
• Cheshire II Home page
  » http://cheshire.lib.berkeley.edu/
• Unfamiliar MetaData Project
  » http://info.sims.berkeley.edu/research/metadata
• Cross-Language Classification Clusters
  » http://sherlock.berkeley.edu/Language

30 References
• C.J. van Rijsbergen, Information Retrieval, 2nd edition. London: Butterworths, 1979.
  » Available as http://sherlock.berkeley.edu/IS205/IR_CJVR
• Ray R. Larson, “Experiments in Automatic Library of Congress Classification”. Journal of the American Society for Information Science, 43(2) 130-148, 1992.
• M.E. Maron, “Automatic Indexing: An Experimental Inquiry”. Journal of the ACM, 8 404-417, July 1961.
• Alan Griffiths, Claire Luckhurst & Peter Willett, “Using Interdocument Similarity Information in Document Retrieval Systems”. JASIS, 37(1) 3-11, 1986.

31 References
• Christian Plaunt & Barbara Norgard, “An Association Based Method for Automatic Indexing with a Controlled Vocabulary”. To appear in JASIS.
  » Preprint available: http://bliss.berkeley.edu/papers/
• Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O’Leary & Ralph Moon, “Cheshire II: Designing a Next-Generation Online Catalog”. JASIS, 47(7) 555-567, 1996.

