Subject Analysis and Vocabulary Control Spring 2006, 6 March Bharat Mehra IS 520 (Organization and Representation of Information) School of Information.

1 Subject Analysis and Vocabulary Control Spring 2006, 6 March Bharat Mehra IS 520 (Organization and Representation of Information) School of Information Sciences University of Tennessee

2 Subject and its Representation  Subject reveals what a work is about: the content of the work  Representing subjects of an information object in the most precise and concise linguistic format is necessary for computerized searching: word, phrase, sentence, etc.

3 Questions  Why can’t a computer do a good job in identifying the “aboutness” of a work?  How can you identify “aboutness” for nontextual materials?

4 Subject Analysis  Is part of creating metadata that deals with the conceptual analysis of an information object to determine what it is about and  Translating “aboutness” of an info object to create controlled vocabulary terms for subject headings and classification notations

5 Purpose of Subject Analysis  Provides meaningful subject access via retrieval tool  Provides collocation of objects of a like nature (Cutter)  Provides a logical location for similar objects  Saves user time

6 Conceptual Analysis  What is it? Philosophy, history  What is it for? For a farmer…  What is it about? D. W. Langridge, 1989

7 Methods in Conceptual Analysis  Purposive method: figure out the author’s purpose (statement of purpose)  Figure-ground method (what are the problems in this method?)  Objective method: counting of references (what are the problems in this method?)  Appealing to unity or to rules of selection and rejection: what has been said (selection) and what has not been said (rejection) P. Wilson, 1968

8 Identification of Concepts  Topics  Names (person, corporate bodies, geographic areas, other named entities)  Time periods  Form

9 Subject Access Process  Textual and non-textual info objects  What will be helpful for identifying the “aboutness” of the info object?  What did the user queries of the NLM’s Prints and Photographs Collection reveal?

10 Dewey Decimal Classification  Main classes => divisions => sections  The system is made up of ten categories:  000 Computers, information and general reference  100 Philosophy and psychology  200 Religion  300 Social sciences  400 Language  500 Science and mathematics  600 Technology  700 Arts and recreation  800 Literature  900 History and geography  Number building: 330 for economy + 94 for Europe = 330.94, European economy; 973 for United States + 005 form division for periodicals = 973.005, periodicals concerning the United States generally
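The two number-building examples on this slide can be sketched in a few lines of Python. This is only an illustration of the notation: `build_number` is a hypothetical helper, and it assumes the base is a three-digit class and the extension is a plain digit string. Real DDC number building follows the schedules and tables, which this sketch does not model.

```python
def build_number(base: str, extension: str) -> str:
    """Join a three-digit base class and an extension with a decimal point.

    Hypothetical helper: it mirrors the slide's two examples only and
    does not implement the actual DDC number-building rules.
    """
    if len(base) != 3 or not base.isdigit():
        raise ValueError("base must be a three-digit class number")
    if not extension.isdigit():
        raise ValueError("extension must contain digits only")
    return f"{base}.{extension}"


print(build_number("330", "94"))   # 330.94  (economy + Europe)
print(build_number("973", "005"))  # 973.005 (United States + periodicals)
```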

11 Dewey Decimal Classification  From the divine to the mundane (except 000)  Choosing decimals for its categories allows the system to be purely numerical and infinitely hierarchical  Faceted classification: combines elements from different parts of the structure to construct a number representing the subject content  Except for general works and fiction, works are classified principally by subject, with extensions for subject relationships, place, time or type of material, producing classification numbers of not less than three digits but otherwise of indeterminate length, with a decimal point before the fourth digit, where present  Classmarks are to be read as numbers, in the order: 050, 220, 330.973, 331, etc.
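Reading classmarks "as numbers" can be illustrated with a short sort. The sketch below is an assumption about how one might implement that ordering, not part of the DDC itself; `shelf_order` is a hypothetical name, and it relies on Python's standard `decimal.Decimal` to compare the values while keeping the original strings (including leading zeros).

```python
from decimal import Decimal


def shelf_order(classmarks):
    """Sort classmark strings by numeric value, returning the original strings."""
    return sorted(classmarks, key=Decimal)


marks = ["331", "330.973", "050", "220"]
print(shelf_order(marks))  # ['050', '220', '330.973', '331']
```

Note that 330.973 files before 331 because the digits after the decimal point extend class 330 rather than forming a larger whole number.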

12 Subject Access--The Problems  diverse expressions  linguistic phenomena  cultural diversity  human cognitive factors  individual differences  differences in methods, lack of consistency  exhaustivity: summarization and depth indexing

13 Subject Access--Some Solutions  1. Vocabulary control in indexing  2. Classification systems arranging concepts in hierarchical structure  3. Citations: citing and being cited  4. Hyperlinks

14 Why is controlled vocabulary needed?

15 What can Vocabulary Control Do?  to promote the consistent representation of subject matter by indexer/cataloger and searchers;  to guide users on subject access by clarifying linguistic ambiguity and linking terms with related meanings;  to increase precision as well as recall.

16 Recall and Precision Basic measures used in evaluating search strategies Assumptions: There is a set of records in the DB which is relevant to the search topic Records are assumed to be either relevant or irrelevant (these measures do not allow for degrees of relevancy) The actual retrieval set may not perfectly match the set of relevant records.

17 Recall and Precision RECALL is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage. PRECISION is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved. It is usually expressed as a percentage.
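The two ratios defined above can be computed directly from sets of record identifiers. This is a minimal sketch under the slide's own assumptions (each record is either relevant or irrelevant); the record numbers are made up for illustration, and both sets are assumed to be non-empty.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Relevant records retrieved / all relevant records in the database."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Relevant records retrieved / all records retrieved."""
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical example: 4 relevant records exist, the search returns 3 records,
# and 2 of those 3 are relevant.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}

print(f"recall    = {recall(retrieved, relevant):.0%}")     # recall    = 50%
print(f"precision = {precision(retrieved, relevant):.0%}")  # precision = 67%
```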

18 PR Inverse Relationship  Why is there an inverse relationship?  Issue of language  If the search goal is comprehensive retrieval, then the searcher must include synonyms, related terms, and broad or general terms for each concept  Precision suffers:  Searcher may decide to combine terms using Boolean rather than proximity operators; secondary concepts may get omitted  Because synonyms may not be exact synonyms, the probability of retrieving irrelevant material increases  Recall suffers  Broader terms may result in the retrieval of material which does not discuss the narrower search topic  Using Boolean operators rather than proximity operators may increase the probability that the terms won't be in context

19 Other Problems with P and R Records must be considered either relevant or irrelevant (what about records that are marginally relevant, somewhat irrelevant, very relevant, completely irrelevant) Individual perception: what is relevant to one person may not be relevant to another Measuring recall: difficult to know how many relevant records exist in DB Measures for estimating recall Usefulness of P and R

20 Challenges in Vocabulary Control  Specific vs. general  Synonymous concepts  Word form and one-word forms (e.g., online)  Sequence and form for multiword terms and phrases; inverted order  Abbreviations and acronyms  Popular vs. technical names

21 What is a Controlled Vocabulary?  A limited set of terms for indexing (subject cataloging) and for searching  authorized terms (representing concepts)  scope notes  related concepts  lead-in terms (non-preferred synonym term, not for indexing or searching; a pointer to authorized ones)

22 Types of Control Terminology  Synonyms (more terms for one concept)  Homographs (more than one meaning): qualifiers or preferred term synonym  Homophones  Conceptual relationships  Hierarchical (narrower, broader)  Associative (related)  Cross References

23 Cross reference Structure of Controlled Vocabulary  Term-A  scope note: explains use of the term  UF lead-in term-B “used for”  BT term(s)  SA term(s) “see also”  NT term(s)  -- subdivision  Lead-in term-B  USE Term-A
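The cross-reference structure on this slide maps naturally onto a small record type. The sketch below is one possible representation, not a standard format: the `Term` class and the helper names are hypothetical, and the term names are the slide's own placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """One authorized term with the references listed on the slide."""
    name: str
    scope_note: str = ""
    used_for: List[str] = field(default_factory=list)   # UF: lead-in terms
    broader: List[str] = field(default_factory=list)    # BT
    narrower: List[str] = field(default_factory=list)   # NT
    see_also: List[str] = field(default_factory=list)   # SA


def build_use_index(terms: List[Term]) -> Dict[str, str]:
    """Invert every UF reference into a USE pointer: lead-in -> authorized term."""
    return {lead_in: t.name for t in terms for lead_in in t.used_for}


def resolve(entry: str, use_index: Dict[str, str]) -> str:
    """Follow a USE reference if the entry is a lead-in term; otherwise keep it."""
    return use_index.get(entry, entry)


term_a = Term(name="Term-A", scope_note="explains use of the term",
              used_for=["Lead-in term-B"])
use_index = build_use_index([term_a])

print(resolve("Lead-in term-B", use_index))  # Term-A
print(resolve("Term-A", use_index))          # Term-A
```

Storing only UF on the authorized term and deriving USE from it keeps the two directions of the reference from drifting apart.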

24 Examples Subject Heading Lists  developed in library community  in favor of pre-coordination in card cataloging environment Thesauri  developed as part of IR systems  in favor of post-coordination and somewhat pre-coordination

25 Pre-Coordination  The combination of concepts at the time of cataloging or indexing, e.g.:  Library -- automation -- United States  The above example is one heading in a structured format: Topic -- subtopic -- geography  (LCSH is a highly pre-coordinated controlled vocabulary)  Indexer constructs subject strings with main terms followed by subdivisions

26 Post-Coordination  The combination of concepts at the time of searching for a compound concept, e.g.:  library  automation  United States  The above example indicates three descriptors assigned to a work; no structure exists between them  Examples: ERIC

27 Pre-Coordinated SH Document Number: 195 Title: France importing crops from US and exporting wine to US  SH: crop--export--US  SH: crop--import--France  SH: wine--export--France  SH: wine--import--US Document Number: 44 Title: US importing wine from France  SH: wine--export--France  SH: wine--import--US

28 Pre-Coordinated Indexes  crop--export--US: 195  crop--import--France: 195  wine--export--France: 44, 195  wine--import--US: 44, 195 These facet headings are clear about the direction of the trade between two countries. What happens if the concepts are not combined in the headings?

29 Post-Coordinated Indexes  crop: 195  export: 44, 195  import: 44, 195  France: 44, 195  US: 44, 195  wine: 44, 195 Let’s do a Boolean search: crop AND import AND US. Result: Document 195, which is irrelevant because it describes France, not the US, importing crops.
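The contrast between the two index types can be reproduced with the slides' two documents. This is a minimal sketch: the function names are hypothetical, and splitting each heading on `--` is an assumption about how a post-coordinated index would be derived from the same headings.

```python
from collections import defaultdict

# Subject headings for documents 195 and 44, as given on the slides.
docs = {
    195: ["crop--export--US", "crop--import--France",
          "wine--export--France", "wine--import--US"],
    44: ["wine--export--France", "wine--import--US"],
}


def pre_coordinated_index(docs):
    """Map each complete heading string to the documents carrying it."""
    index = defaultdict(set)
    for doc_id, headings in docs.items():
        for heading in headings:
            index[heading].add(doc_id)
    return index


def post_coordinated_index(docs):
    """Map each single term to the documents carrying it, dropping the structure."""
    index = defaultdict(set)
    for doc_id, headings in docs.items():
        for heading in headings:
            for term in heading.split("--"):
                index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the documents that contain every one of the given terms."""
    return set.intersection(*(index.get(t, set()) for t in terms))


pre = pre_coordinated_index(docs)
post = post_coordinated_index(docs)

# Post-coordinated search retrieves 195 although it is about France importing crops.
print(boolean_and(post, "crop", "import", "US"))  # {195}
# The pre-coordinated heading for "US imports crops" matches nothing.
print(pre.get("crop--import--US", set()))         # set()
```

The false drop arises because the individual terms no longer record which country is importing and which is exporting.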

30 Subject Cataloging--Process 1. Conceptual analysis of a document to identify what the document is about The methods:  purpose of the author (indicative statements)  figure-ground  objective analysis (statistics)

31 Subject Cataloging--Process (cont’d) 2. Translation of the conceptual analysis into a particular vocabulary The methods  look up subject headings  weighted headings  assign headings

32 Various Subjects in MARC  600 610 650 651 MARC tags  600 vs. 100 vs. 700  610 vs. 110 vs. 710  1XX fields (main entries)  4XX fields (series statements)  6XX fields (subject headings)  7XX fields (added entries other than subject or series)  8XX fields (series added entries)  X00 Personal names  X10 Corporate names  X11 Meeting names  X30 Uniform titles  X40 Bibliographic titles  X50 Topical terms  X51 Geographic names For example, 610: subject heading that is a corporate name
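The way a tag such as 610 combines a field group (6XX) with a content type (X10) can be sketched as two lookups. The tables below contain only the entries listed on this slide, and `describe_tag` is a hypothetical helper; many real MARC tags fall outside this pattern and are simply rejected here.

```python
# Field groups, keyed by the first digit of the tag (from the slide).
FIELD_GROUPS = {
    "1": "main entry",
    "4": "series statement",
    "6": "subject heading",
    "7": "added entry other than subject or series",
    "8": "series added entry",
}

# Content types, keyed by the last two digits of the tag (from the slide).
CONTENT_TYPES = {
    "00": "personal name",
    "10": "corporate name",
    "11": "meeting name",
    "30": "uniform title",
    "40": "bibliographic title",
    "50": "topical term",
    "51": "geographic name",
}


def describe_tag(tag: str) -> str:
    """Describe a three-digit tag as '<field group>: <content type>'."""
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError("tag must be three digits")
    group = FIELD_GROUPS.get(tag[0])
    kind = CONTENT_TYPES.get(tag[1:])
    if group is None or kind is None:
        raise ValueError(f"tag {tag} is not covered by the slide's tables")
    return f"{group}: {kind}"


print(describe_tag("610"))  # subject heading: corporate name
print(describe_tag("100"))  # main entry: personal name
print(describe_tag("651"))  # subject heading: geographic name
```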

33 Subject Cataloging Quality  Consistency: works on the same subjects are given the same headings  Exhaustivity: whether the headings cover all aspects of the work -- number of headings  Specificity: whether the heading assigned is at the same hierarchical level of the concept

34 Controlled Vocabularies 1. Subject heading lists: include phrases, precoordinated terms  LCSH  Sears List of SH  MeSH 2. Thesauri: single and bound terms (e.g., Type A Personality) representing single concepts (descriptors); strictly hierarchical; narrower in scope; can be multilingual  Art & Architecture Thesaurus (cultural heritage info)  Thesaurus of ERIC Descriptors (educational resources)  INSPEC Thesaurus (physics and engineering communities)

35 Controlled Vocabularies  1. and 2. provide subject access to info objects by providing terminology that can be consistent (controlled vocabulary)  Choose preferred terms and make references from non-used terms  Provide hierarchies: BT, NT, RT 3. Ontologies: bring together all variant ways of expressing a concept and show relationships via BT, NT, RT; do not select preferred terms; “systematic account of existence”

36 Solution to the “Subject Problem” for Images: Natural Language Analysis  Natural language that people use (linguistic constructs, grammar relationships, syntax, communication vocabulary) can be used for describing and searching in visual information retrieval systems  Content-based natural language processing is understood in terms of syntactic structure in the spoken natural language  Concept-based natural language processing attempts to capture the semantics of an image

37 Critical Reflection 7: The GAME  User-Based Natural Language Analysis for Creation and Evaluation of Visual Information Retrieval Systems in Library and Museum Settings  Your response: On the Black Board space, respond to the questions provided on the handout

38 Exercise 3: Authority Control OBJECTIVES  to observe name authority control  to observe controlled vocabulary for subject access Part I. Name authority  Go to the authority record database in Library of Congress http://authorities.loc.gov/. Search for the popular author, Samuel Clemens.  How many authorized headings are established for him? Attach the most complete MARC Authority record for each authorized heading.  For the MARC Authority format, explain the semantics (meanings) of the fields: 1xx, 4xx and 5xx. Make sure that you mention how authorized and unauthorized headings are cross-referenced.  For each authorized heading, how many bibliographic records are found in LC collection using the heading? If an authorized heading is not used, why so?  Can the user just click on the authorized heading to retrieve bibliographic records by the author?

39 Exercise 3: Authority Control Part II. Authorized subject headings  Go to the authority record database in Library of Congress http://authorities.loc.gov/.  Search for an authorized subject heading for each of the topics: Teapot Dome scandal; Watergate scandal  What are the broader headings (BT)?  What are the narrower headings (NT)?  What are the related headings (RT)?  Construct an alphabetical subject headings list of the headings (BT, NT, RT, the heading itself) and their related headings including both authorized headings and lead-in terms. Under each heading cross-reference the related terms: Used-for, Use, BT, NT, RT.

40 Exercise 3: Authority Control WHAT TO TURN IN?  The authority records for Samuel Clemens in MARC format and your answers to all the questions.  The authority records for the two subject headings in MARC format and the subject headings list.  A brief discussion on the roles of authority control in IR.
