10/21/98 Organization of Information in Collections. Subject Access to Collections: Introduction. University of California, Berkeley School of Information.

10/21/98 Organization of Information in Collections
Subject Access to Collections: Introduction
University of California, Berkeley School of Information Management and Systems
SIMS 245: Organization of Information In Collections

Review
Review of Description
The goal of IR is to retrieve all and only the “relevant” documents in a collection for a particular user with a particular need for information.
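The "all and only" goal is commonly measured with two ratios: recall (how much of the relevant material was retrieved) and precision (how much of what was retrieved is relevant). A minimal sketch follows; the function name and the document IDs are illustrative assumptions, not part of the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the system returned; relevant: IDs judged relevant.
    Both inputs are hypothetical document IDs.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # "only the relevant": share of retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    # "all the relevant": share of relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

A perfect system in the sense of the slide would score 1.0 on both.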

Indexing Languages and Thesauri
Origins and Uses of Controlled Vocabularies for Information Retrieval
Types of Indexing Languages, Thesauri and Classification Systems

Controlled Vocabularies
Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.

What is a “Controlled Vocabulary”?
“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
Similarly, there are too many ways of expressing or explaining the topic of a document.
Controlled vocabularies consist of rules for topic identification and indexing, and a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “Indexing Language”, sometimes with special coding or structures.

Uses of Controlled Vocabularies
Library Subject Headings, Classification and Name Authority Files
Commercial Journal Indexing Services and databases
Yahoo and other Web classification schemes
Online and Manual Systems within organizations
–SunSolve
–MacArthur

Name Authority Files
ID:NAFL ST:p EL:n STH:a MS:c UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d Other Versions: earlier 040 DLC$cDLC$dDLC$dOCoLC 053 PR6005.R Creasey, John Cooke, M. E Cooke, Margaret,$d Cooper, Henry St. John,$d Credo,$d Fecamps, Elise Gill, Patrick,$d Hope, Brian,$d Hughes, Colin,$d Marsden, James Matheson, Rodney Ranger, Ken St. John, Henry,$d Wilde, Jimmy $wnnnc$aAshe, Gordon,$d
Different names for the same person

Name Authority Files
ID:NAFO ST:p EL:n STH:a MS:n UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d OCoLC$cOCoLC Marric, J. J.,$d $wnnnc$aCreasey, John 663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John 670 OCLC : His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric) 670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric) 670 Pseuds. and nicknames dict., c1987$b(Creasey, John, ; British author; pseud.: Marric, J. J.)

Name Authority Files
ID:NAFL ST:p EL:n STH:a MS:c UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d Other Versions: earlier 040 DLC$cDLC$dDLC$dOCoLC Butler, William Vivian,$d Butler, W. V.$q(William Vivian),$d Marric, J. J.,$d His The durable desperadoes, His The young detective's handbook, c1981:$bt.p. (W.V. Butler) 670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name

Indexing Languages
An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.
An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.

Types of Indexing Languages
Uncontrolled Keyword Indexing
Indexing Languages
–Controlled, but not structured
Thesauri
–Controlled and Structured
Classification Systems
–Controlled, Structured, and Coded
Faceted Classification Systems

Indexing Languages
Library of Congress Subject Headings
Yellow Pages Topics
Wilson Indexes (“Reader’s Guide”)

Controlled Vocabulary
Start with the text of the document
Attempt to “control” or regularize:
–The concepts expressed within: mutually exclusive, exhaustive
–The language used to express those concepts: limit the normal linguistic variations, regulate word order and structure of phrases, reduce the number of synonyms or near-synonyms
Also, provide cross-references between concepts and their expression.
See Bates, 1988
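One way to picture "reducing the number of synonyms" is a lead-in table that maps the variant wordings found in a text to a single preferred term. The sketch below is only an illustration: the `LEAD_IN` table and its entries are made-up assumptions, not drawn from any real vocabulary.

```python
# Hypothetical lead-in vocabulary: variant wording -> preferred term.
LEAD_IN = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobile": "Automobiles",
    "motor cars": "Automobiles",
    "films": "Motion pictures",
    "movies": "Motion pictures",
}


def normalize(term):
    """Regularize case and spacing, then follow the cross-reference."""
    key = " ".join(term.lower().split())
    return LEAD_IN.get(key)  # None when the term is outside the vocabulary


def index_terms(candidate_terms):
    """Collapse free-text candidates into a sorted set of preferred terms."""
    preferred = {normalize(t) for t in candidate_terms}
    preferred.discard(None)
    return sorted(preferred)


print(index_terms(["Cars", "movies", "AUTOS", "unicorns"]))
```

Terms with no entry are simply dropped here; a real system would more likely queue them for review as candidate terms.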

Subject Headings vs. Descriptors
Subject Headings:
Describe the contents of an entire document
Designed to be looked up in an alphabetical index
–Look up document under its heading
Few (1-5) headings per document
Descriptors:
Describe one concept within a document
Designed to be used in Boolean searching
–Combine to describe the desired document
Many (5-25) descriptors per document

Subject Heading vs. Descriptor Example
WILSONLINE
–Athletes
–Athletes--Health & Hygiene
–Athletes--Nutrition
–Athletes--Physical Exams
–…
–Athletics
–Athletics -- Administration
–Athletics -- Equipment -- Catalogs
–…
–Sports -- Accidents and injuries
–Sports -- Accidents and injuries -- prevention
ERIC
–Athletes
–Athletic Coaches
–Athletic Equipment
–Athletic Fields
–Athletics
–…
–Sports psychology
–Sportsmanship

Assigning Headings vs. Descriptors
Subject headings -- assign one (or a few) complex heading(s) to the document
Descriptors -- mix and match
–How would we describe recipes using each technique?
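As one possible answer to the recipe question, the sketch below gives each recipe a single pre-coordinated heading for alphabetical browsing and a set of descriptors to be combined at search time with Boolean AND. The recipes, headings and descriptors are invented for illustration.

```python
from collections import defaultdict

# Hypothetical recipe records: one complex heading, several simple descriptors.
recipes = {
    "r1": {"heading": "Cookery, Italian--Pasta",
           "descriptors": {"Italian", "Pasta", "Vegetarian"}},
    "r2": {"heading": "Cookery, Italian--Desserts",
           "descriptors": {"Italian", "Desserts", "Chocolate"}},
    "r3": {"heading": "Cookery, French--Desserts",
           "descriptors": {"French", "Desserts", "Vegetarian"}},
}


def browse_headings(docs):
    """Subject-heading access: an alphabetical list to look things up in."""
    return sorted((d["heading"], doc_id) for doc_id, d in docs.items())


def build_postings(docs):
    """Descriptor access: map each descriptor to the documents carrying it."""
    postings = defaultdict(set)
    for doc_id, d in docs.items():
        for term in d["descriptors"]:
            postings[term].add(doc_id)
    return postings


def boolean_and(postings, *terms):
    """Combine descriptors at search time (post-coordination)."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


postings = build_postings(recipes)
print(browse_headings(recipes)[0])
print(boolean_and(postings, "Desserts", "Vegetarian"))
```

The heading list answers "what is filed under this heading?", while the descriptor index can answer combinations, such as vegetarian desserts, that no single heading anticipates.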

Thesauri
A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms.
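The links named above can be held in a small data structure. This is a rough sketch: the class names, method names and the sample relationships between terms are assumptions made for illustration and do not reproduce any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)    # BT links
    narrower: set = field(default_factory=set)   # NT links
    related: set = field(default_factory=set)    # RT links
    used_for: set = field(default_factory=set)   # non-preferred synonyms


class Thesaurus:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # non-preferred label -> preferred label

    def add(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_broader(self, narrow, broad):
        # Broader and narrower links are kept reciprocal.
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def add_related(self, a, b):
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def add_synonym(self, variant, preferred):
        self.add(preferred).used_for.add(variant)
        self.use[variant] = preferred

    def resolve(self, label):
        """Follow a USE reference; None if the label is unknown."""
        if label in self.use:
            return self.use[label]
        return label if label in self.terms else None

    def expand(self, label):
        """Preferred term plus every narrower term beneath it."""
        start = self.resolve(label)
        if start is None:
            return set()
        seen, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.terms[current].narrower)
        return seen


# Illustrative entries only.
t = Thesaurus()
t.add_broader("Team Sports", "Athletics")
t.add_broader("Basketball", "Team Sports")
t.add_related("Athletics", "Athletes")
t.add_synonym("Sports", "Athletics")
print(sorted(t.expand("Sports")))
```

Expanding a query down the narrower-term links is one typical use of the structure: a search on the broad term also picks up documents indexed under its narrower terms.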

Thesauri (cont.)
National and International Standards for Thesauri
–ANSI/NISO z American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
–ANSI/NISO Draft Standard Z x -- American National Standard Guidelines for Indexes in Information Retrieval
–ISO Documentation -- Guidelines for the establishment and development of monolingual thesauri
–ISO Documentation -- Guidelines for the establishment and development of multilingual thesauri

Thesauri (cont.)
Examples:
–The ERIC Thesaurus of Descriptors
–The Art and Architecture Thesaurus
–The Medical Subject Headings (MESH) of the National Library of Medicine

Development of a Thesaurus
Term Selection.
Merging and Development of Concept Classes.
Definition of Broad Subject Fields and Subfields.
Development of Classificatory Structure.
Review, Testing, Application, Revision.

Categorization Summary
Processes of categorization underlie many of the issues having to do with information organization.
Categorization is messier than our computer systems would like.
Human categories have graded membership, consisting of family resemblances.
Family resemblance is expressed in part by which subset of features are shared.
It is also determined by underlying understandings of the world that do not get represented in most systems.

Classification Systems
A classification system is an indexing language often based on a broad ordering of topical areas.
Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics.
Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.

Classification Systems (cont.)
Examples:
–The Library of Congress Classification System
–The Dewey Decimal Classification System
–The ACM Computing Reviews Categories
–The American Mathematical Society Classification System

Classification Schemes
Classify possible concepts.
Goals:
–Completely distinct conceptual categories (mutually exclusive)
–Complete coverage of conceptual categories (exhaustive)

Hierarchical Classification
Traditional “family-tree”
–Each category is successively broken down into smaller and smaller subdivisions
–Each level is divided out by a “character of division”, also known as a feature.
Example: distinguish Literature based on:
–Language
–Genre
–Time Period

Hierarchical Classification
[Figure: tree with root Literature, divided by language (English, French, Spanish), then by genre (Prose, Poetry, Drama), then by period (16th, 17th, 18th, 19th, ...).]

Labeled Categories for Hierarchical Classification
LITERATURE
–100 English Literature
110 English Prose
–English Prose 16th Century
–English Prose 17th Century
–English Prose 18th Century
–English Poetry
–121 English Poetry 16th Century
–122 English Poetry 17th Century
–English Drama
–130 English Drama 16th Century
–…
–200 French Literature

Faceted Classification
Create a separate, free-standing list for each characteristic of division (feature).
Combine features to create a classification.

Faceted Classification and Labeled Categories
A Language
–a English
–b French
–c Spanish
B Genre
–a Prose
–b Poetry
–c Drama
C Period
–a 16th Century
–b 17th Century
–c 18th Century
–d 19th Century
Aa English Literature
AaBa English Prose
AaBaCa English Prose 16th Century
AbBbCd French Poetry 19th Century
BcCd Drama 19th Century
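The facet lists above can be combined mechanically. The sketch below decodes a notation string into its facet values and builds a label; the facet tables come from the slide, while the function names and the two-characters-per-facet parsing rule are assumptions.

```python
# Facet tables as listed on the slide: facet letter -> (name, value codes).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}
FACET_ORDER = ["Language", "Genre", "Period"]


def decode(notation):
    """Split a notation such as 'AaBaCa' into {facet name: value}."""
    if len(notation) % 2:
        raise ValueError("expected pairs of facet letter + value letter")
    parts = {}
    for i in range(0, len(notation), 2):
        name, values = FACETS[notation[i]]
        parts[name] = values[notation[i + 1]]
    return parts


def label(notation):
    """Build a readable label from whichever facets are present."""
    parts = decode(notation)
    if set(parts) == {"Language"}:
        return parts["Language"] + " Literature"
    return " ".join(parts[n] for n in FACET_ORDER if n in parts)


for code in ["Aa", "AaBa", "AaBaCa", "AbBbCd"]:
    print(code, "->", label(code))
```

Because each facet is a free-standing list, a new period or language is added in one place, and any facet can be omitted from a notation.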

How to use such classification structures?
How to look through them?
How to use them in search?

Automatic Indexing and Classification
Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document.
Automatic classification attempts to automatically group similar documents using either:
–A fully automatic clustering method.
–An established classification scheme and a set of documents already indexed by that scheme.
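The simplest form of automatic indexing, deriving keywords from the text, might look like the sketch below. The stopword list, the tokenizing rule and the frequency-based ranking are all simplifying assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much longer ones.
STOPWORDS = frozenset({"a", "an", "and", "are", "for", "in", "is",
                       "of", "or", "the", "to", "with"})


def derive_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopwords, ties broken alphabetically."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


text = ("The index is a guide to the topics of documents; "
        "an index aids retrieval of documents.")
print(derive_keywords(text, top_n=2))
```

Selecting controlled vocabulary terms instead would need an extra step that maps the derived words onto preferred terms.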

Agglomerative Clustering
[Figure: items A, B, C, D, E, F, G, H, I]

Hierarchical Methods
Single Link
Dissimilarity Matrix
Hierarchical methods: polythetic, usually exclusive, ordered. Clusters are order-independent.

Threshold = .1
Single Link
Dissimilarity Matrix
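Cutting a single-link hierarchy at a threshold amounts to joining any two items whose dissimilarity is at or below that threshold, then reading off the connected groups. The sketch below shows this with a union-find structure; the item labels and dissimilarity values are made up, since the matrix on the slide did not survive in the transcript.

```python
def single_link_clusters(labels, dissim, threshold):
    """Group items linked by a chain of pairs with dissimilarity <= threshold.

    dissim maps (a, b) pairs to a dissimilarity; missing pairs are never linked.
    """
    parent = {x: x for x in labels}

    def find(x):
        # Follow parent pointers to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dissim.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical dissimilarities between five documents.
labels = ["A", "B", "C", "D", "E"]
dissim = {("A", "B"): 0.1, ("B", "C"): 0.2, ("A", "C"): 0.5,
          ("D", "E"): 0.15, ("C", "D"): 0.6}

for threshold in (0.1, 0.2, 0.6):
    print(threshold, single_link_clusters(labels, dissim, threshold))
```

Raising the threshold only ever merges groups, which is what produces the nested, ordered hierarchy. The result does not depend on the order in which pairs are examined.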

Clustering
Agglomerative methods: polythetic, exclusive or overlapping, unordered. Clusters are order-dependent.
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
Rocchio’s method

Automatic Class Assignment
[Figure labels: Doc, Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic class assignment: polythetic, exclusive or overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme.
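The four steps can be sketched with a toy "search engine" that ranks class pseudo-documents by cosine similarity over raw term counts. The class names, pseudo-document texts and the choice of cosine similarity are illustrative assumptions; the slide does not say which ranking function is used.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_classes(doc_text, pseudo_docs):
    """Steps 2-3: use the document as a query, get a ranked list of classes."""
    doc = vectorize(doc_text)
    scores = [(cosine(doc, vectorize(text)), name)
              for name, text in pseudo_docs.items()]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


def assign(doc_text, pseudo_docs, threshold=None):
    """Step 4: every class over the threshold, or only the top-ranked class."""
    ranked = rank_classes(doc_text, pseudo_docs)
    if threshold is None:
        return [ranked[0][1]] if ranked and ranked[0][0] > 0 else []
    return [name for score, name in ranked if score >= threshold]


# Step 1: hypothetical pseudo-documents, one per class.
pseudo_docs = {
    "Astronomy": "star galaxy telescope planet orbit",
    "Film": "star actor movie director film",
    "Sports": "team player game score",
}

doc = "the telescope tracked a star in a distant galaxy"
print(assign(doc, pseudo_docs))
print(assign(doc, pseudo_docs, threshold=0.1))
```

With a threshold, a document may land in several classes (overlapping assignment); taking only the top-ranked class gives exclusive assignment.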

K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
–take a small sample
–group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
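The steps above can be sketched on small numeric vectors. Several choices here are assumptions rather than the lecture's method: Euclidean distance stands in for the pair-wise measure, the bottom-up seeding merges the two groups with the closest centroids, and the points are made-up 2-D "documents".

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def seed_centers(sample, k):
    """Step 2: group a small sample bottom-up until K groups remain."""
    groups = [[p] for p in sample]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = dist(centroid(groups[i]), centroid(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i].extend(groups.pop(j))  # merge the closest pair of groups
    return [centroid(g) for g in groups]


def kmeans(points, centers, max_rounds=10):
    """Steps 3-4: assign to the nearest center, recompute, repeat until stable."""
    clusters = []
    for _ in range(max_rounds):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical document vectors forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sample = [(0, 0), (0, 1), (10, 10), (10, 11)]

centers, clusters = kmeans(points, seed_centers(sample, 2))
print(clusters)
```

Seeding from a sample keeps the expensive pairwise step small; the cheap assign-and-recompute loop then runs over the whole collection.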

Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”

S/G Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated