Module 7a: Creating Controlled Vocabularies IMT530: Organization of Information Resources Winter 2008 Michael Crandall.


Steps in Constructing CVs
Define your domain
Gather concepts
  – From user interviews, search logs, content analysis, preexisting vocabularies
Select your approach
Extract terminology
Control your terms
Organize your terms
Maintain, maintain, maintain

Elements of Building CVs
Select your approach
  – Pre- or post-coordinated (sixteenth century lute music or sixteenth century and lutes and music)
  – Open or closed (indexers can add terms or not)
  – Enumeration vs. synthesis (facets)
Extract terms
  – Warrant (from users or domain or both)
Control terms
  – Specificity (cats or Siamese cats?)
  – Control of homographs (qualifications)
  – Term consistency and word form (plurals, etc.)
  – Multiword/phrase sequence and form (inverted, normal form?)
  – Term definitions (scope notes)
  – Syntax (citation order)
  – Semantic factoring
Organize terms
  – Semantic relationships

Different Approaches

Pre- and Post-Coordination
Pre-coordination involves creating terms that combine multiple concepts (not words) into a single term
Post-coordination involves creating terms that contain single concepts only, not multiple ones
Some authors refer to this as “combination”, and say “pre-combined” and “post-combined”

Single or Multiple Concept?
Is “information retrieval systems” a single concept or a multiple concept?
Multiple concepts are often joined with a conjunction (and, or) or a preposition (in, of)
Multiple concepts are often indicated in subdivisions, which may be marked by a dash (--) or a comma (,)
Bottom line: in some cases it’s hard to tell

Examples
Post-Coordinated Terms
  – Animal nutrition
  – Effects
  – Salt
Pre-Coordinated Term
  – Effects of salt on animal nutrition
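The difference between the two approaches is mostly a question of when the concepts get combined: at indexing time (pre-coordination) or at search time (post-coordination). The following is a minimal sketch of that idea, not a description of any real system. The document ids, the "Sugar" term, and all function names are hypothetical and invented for illustration.

```python
from collections import defaultdict

# Hypothetical indexes; document ids and term assignments are invented for illustration.
post_index = defaultdict(set)  # single-concept term -> ids of documents indexed with it
pre_index = defaultdict(set)   # combined heading -> ids of documents indexed with it


def index_post(doc_id, terms):
    """Post-coordination: assign several single-concept terms to one document."""
    for term in terms:
        post_index[term.lower()].add(doc_id)


def index_pre(doc_id, heading):
    """Pre-coordination: assign one heading that already combines the concepts."""
    pre_index[heading.lower()].add(doc_id)


def search_post(*terms):
    """Coordinate the concepts at search time with a Boolean AND."""
    if not terms:
        return set()
    hits = [post_index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits)


def search_pre(heading):
    """Look up a heading whose concepts were coordinated at indexing time."""
    return set(pre_index.get(heading.lower(), set()))


index_post("doc1", ["Animal nutrition", "Effects", "Salt"])
index_post("doc2", ["Animal nutrition", "Effects", "Sugar"])
index_pre("doc1", "Effects of salt on animal nutrition")

print(search_post("Salt", "Animal nutrition"))
print(search_pre("Effects of salt on animal nutrition"))
```

Both searches find the same document here; what differs is who did the combining, the indexer or the searcher.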

More Examples
More pre-coordinate terms
  – France – Textile industries – Skilled Personnel – Training (PRECIS)
  – Plants – Nutrition – Genetic aspects (LCSH)
Pre-coordinate terms often have subdivisions (the words that appear after the dashes above)

Advantages of Pre-Coordination
All of the concepts that may apply to indexing a single document may appear in a single term
Multiple concepts have the context and meaning embedded in syntactic order and constructions
  – they may make more sense
  – they are more precise
  – different syntax means that concepts with different meanings can be represented using the same simple concepts, e.g.:
      Art by children        Music industry
      Art for children       Music for industry
      Art about children     Industrial uses of music
      Children and art       Music about industry
      Children in art

Advantages of Pre-coordination
More terms are available for indexers to use to express the subjects of documents
A multiple-concept search returns a list of terms to select from (not a list of document representations with those words in them)
Thus, a user is able to browse all the topics to get an overview of what is available

Sample Display
Accidents    29    Related LC Subjects
Accidents Aeronautics Military United States
Accidents Aeronautics Statistics Periodicals
Accidents Aerosols
Accidents Agricultural Laborers United States
Accidents Agriculture
Accidents Agriculture Abstracts
Accidents Agriculture Bibliography
Accidents Agriculture Research United States

Disadvantages of Pre-Coordination
Must create more terms, so the vocabulary is more costly to create
Often, complex rules for combination are needed to create pre-coordinated terms. Result: the cost of the CV is increased, training for CV designers is longer and more difficult, and the possibility of error is increased
Makes for a long term list

Disadvantages of Pre-Coordination
Long strings of terms may not be interpretable by users
  – Ethnic groups – young people – ethnic identity – psychotherapy – cultural aspects
In manual systems, access is limited to the first concept listed; only with online keyword access are the other embedded concepts accessible.
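The access point made above can be sketched in a few lines. This is an illustrative assumption about how a browse list and a keyword index behave, not a model of any particular catalog. The headings are the ones used as examples in these slides, stored as tuples of their parts, and the function names are hypothetical.

```python
# Pre-coordinated headings from the slides, stored as ordered tuples of concepts.
HEADINGS = [
    ("Ethnic groups", "Young people", "Ethnic identity", "Psychotherapy", "Cultural aspects"),
    ("France", "Textile industries", "Skilled personnel", "Training"),
    ("Plants", "Nutrition", "Genetic aspects"),
]


def browse_by_lead(concept):
    """Manual-style access: only the first concept in a heading is an entry point."""
    return [h for h in HEADINGS if h[0].lower() == concept.lower()]


def keyword_search(concept):
    """Online keyword access: any embedded concept can be matched."""
    return [h for h in HEADINGS if any(part.lower() == concept.lower() for part in h)]


print(browse_by_lead("Nutrition"))   # no heading starts with this concept
print(keyword_search("Nutrition"))   # found inside the "Plants" heading
```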

Advantages of Post-coordination
Vocabulary is short, because each concept is only represented once
Rules for creation of terms are often simpler
Simple, thus easier to construct, thus less costly
Terms are shorter and easier to read and understand
In a manual system, individual concepts may be more accessible

Disadvantages of Post-coordination
Does not allow for subtle distinctions in meaning
  – Art and children vs. children in art vs. art by children
  – Music in industry vs. industry in music vs. music industry
May have to assign a lot of headings for a single document, thus relying on searching mechanisms to put them together
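One way to see why the distinctions disappear is to reduce each phrase to its bare concepts. The sketch below makes a deliberately naive assumption: a "concept" is any word that is not a connective. The `CONNECTIVES` set and the `factor` function are hypothetical names introduced for illustration only.

```python
# Naive assumption: a "concept" is any word that is not a connective.
CONNECTIVES = {"and", "by", "in", "for", "about", "of"}


def factor(heading):
    """Reduce a phrase to its unordered set of single concepts."""
    return frozenset(w for w in heading.lower().split() if w not in CONNECTIVES)


headings = ["Art and children", "Children in art", "Art by children"]

pre_coordinated = set(headings)                   # syntax kept: three distinct terms
post_coordinated = {factor(h) for h in headings}  # syntax dropped: one concept set

print(len(pre_coordinated), len(post_coordinated))  # prints "3 1"
```

All three phrases collapse to the same pair of concepts, so a search on art AND children cannot tell them apart.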

Disadvantages of Post-Coordination
A multiple-concept search frequently results in a list of document representations with those words in them; these results are not grouped according to similarity, but are often listed in a random order.
The results list of document representations does not give the user an overview of the subject area covered by the words entered in the search

Sample Display
RESULTS OF BOOLEAN “AND” SEARCH ON natural AND disaster
Mapping vulnerability : disasters, development, and people / edited by Greg Bankoff, Georg Frerks, D
Understanding the economic and financial impacts of natural disasters / Charlotte Benson, Edward J.
Cultures of disaster : society and natural hazards in the Philippines / Greg Bankoff
Malaria control during mass population movements and natural disasters / Peter B. Bloland and Holly
Hurricane! : coping with disaster : progress and challenges since Galveston, 1900 / Robert Simpson,
The use of earth observing satellites for hazard support [electronic resource] : assessments & scena
The vulnerability of cities : natural disasters and social resilience / Mark Pelling

Open and Closed Controlled Vocabularies
An open vocabulary is one in which an indexer may add a term at any time if they need it. This is pretty rare in traditional indexing (but common in folksonomies)
A closed vocabulary is one in which an indexer may not add a term at any time. Term additions are controlled by the creators of the CV, not by indexers
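The open/closed distinction comes down to who is allowed to add a term. A minimal sketch follows, assuming a vocabulary is just a set of terms with a flag saying whether indexers may extend it. The `Vocabulary` class and its method names are hypothetical.

```python
class Vocabulary:
    """A term list that is either open or closed to additions by indexers."""

    def __init__(self, terms, open_to_indexers):
        self.terms = {t.lower() for t in terms}
        self.open_to_indexers = open_to_indexers

    def assign(self, term):
        """Return the term to index with, adding it first if the vocabulary is open."""
        term = term.lower()
        if term in self.terms:
            return term
        if self.open_to_indexers:
            self.terms.add(term)  # open: the indexer adds the term on the spot
            return term
        # closed: only the creators of the CV can add terms
        raise LookupError(f"'{term}' is not in the vocabulary")


open_cv = Vocabulary(["cats"], open_to_indexers=True)
closed_cv = Vocabulary(["cats"], open_to_indexers=False)

print(open_cv.assign("Siamese cats"))  # accepted and added
try:
    closed_cv.assign("Siamese cats")
except LookupError as err:
    print("rejected:", err)
```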

Synthesis and Enumeration
The synthesis & enumeration attribute has to do with how a controlled vocabulary is set up to operate: where, and at what point, term creation happens
The creation of terms may be restricted to the CV designer in some cases; in other cases, indexers have some flexibility in creating new terms by using a technique called “synthesis”

Enumeration
An enumerated vocabulary is simply a list of terms. Indexers look at the list, select a term, and use it for indexing
If a term is not present to index a particular document, then the indexer has to either ask the CV designer to add a term, or they are stuck
Many enumerated vocabularies are also closed vocabularies
Enumerative vocabularies came first in history – it probably didn’t occur to anyone that there could be any other way!

Example of Enumeration
Sample list of enumerated terms:
  – bowls
  – plastic bowls
  – wood bowls
  – wood chairs
  – steel chairs
  – wood bookshelves
  – steel bookshelves
Note that if an indexer had a document on “steel bowls”, that term is not available. The indexer using this vocabulary would have to either assign “bowls” (not specific), or would have to ask the CV designer to add the term “steel bowls”
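The "steel bowls" case can be walked through in code. This sketch uses the sample list above and one added assumption that is not part of the slides: the broader fallback term is guessed by taking the last word of the phrase. The `assign` function is a hypothetical name.

```python
# The enumerated list from the example slide.
ENUMERATED = {
    "bowls", "plastic bowls", "wood bowls", "wood chairs",
    "steel chairs", "wood bookshelves", "steel bookshelves",
}


def assign(term):
    """Return (term_to_use, note) for an indexer working from the enumerated list."""
    term = term.lower()
    if term in ENUMERATED:
        return term, "exact match"
    # Assumption for illustration: the last word is the broader concept.
    head = term.split()[-1]
    if head in ENUMERATED:
        return head, "broader term only (less specific)"
    return None, "not available; ask the CV designer to add it"


print(assign("wood chairs"))
print(assign("steel bowls"))
print(assign("plastic chairs"))
```

Note that "plastic chairs" has no fallback at all, because the bare term "chairs" was never enumerated.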

Advantages of Enumeration
Enumerated vocabularies are often easy to use because there are fewer rules for indexers (just look up your term, write it down, and move on!)
All possible terms appear in the vocabulary, so it is easy to search and display all possible terms

Disadvantages of Enumeration
Some terms are not available for the indexer to use; some combinations simply are not there
List of terms may become very long (the Library of Congress Classification, a highly enumerated classification scheme, has 46 volumes!)
Terms may be repeated over and over
  – Wood bowls
  – Wood chairs
  – Wood bookshelves
  – Wood cabinets
  – Wood structures

Synthesis
Synthesis is a technique developed in the 20th century as a means of saving space and time in CV creation, and of extending flexibility to the indexer
In a synthetic system, tables containing single terms are created by the CV designer, and indexers follow rules to combine the terms from different tables to create a new term
We’ll look at this in more detail in a couple of weeks when we discuss faceted classification
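A small sketch may help show how tables plus a combination rule replace a long enumerated list. It assumes two tables (materials and objects, reusing the words from the enumeration example) and a single rule, "material followed by object". The table contents, the rule, and the `synthesize` function are illustrative assumptions, not taken from any real scheme.

```python
from itertools import product

# Tables of single terms, maintained by the CV designer (illustrative contents).
MATERIALS = ("plastic", "wood", "steel")
OBJECTS = ("bowls", "chairs", "bookshelves")


def synthesize(material, obj):
    """Combine one term from each table using the rule 'material, then object'."""
    if material not in MATERIALS:
        raise ValueError(f"'{material}' is not in the materials table")
    if obj not in OBJECTS:
        raise ValueError(f"'{obj}' is not in the objects table")
    return f"{material} {obj}"


# The indexer can build "steel bowls" even though nobody listed it in advance.
print(synthesize("steel", "bowls"))

# Six table entries yield nine possible compound terms.
all_terms = [synthesize(m, o) for m, o in product(MATERIALS, OBJECTS)]
print(len(MATERIALS) + len(OBJECTS), "table entries ->", len(all_terms), "terms")
```

A term built from something outside the tables, such as "glass bowls", is still rejected, which is the sense in which such a system stays closed unless indexers may also edit the tables.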

Synthesis and Enumeration vs. Pre- & Post-coordination
The relationship between enumeration & synthesis and pre- & post-coordination is not one-to-one!
Some enumerated vocabularies are pre-coordinate; others are post-coordinate
Most synthetic vocabularies are pre-coordinate, but it is possible for a synthetic vocabulary to be post-coordinate, particularly where it is exposed to end users
  – Where indexers assign terms from facets, the user has no control over coordination, but where a user can select and combine facets, it’s post-coordinate
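The user-facing case, where the searcher picks and combines facets, can be sketched as a simple filter. The items, the facet names (`material`, `object_type`), and the `facet_search` function are all hypothetical and serve only to illustrate coordination happening at search time.

```python
# Hypothetical items, each described by one value per facet.
ITEMS = {
    "item1": {"material": "wood", "object_type": "bowls"},
    "item2": {"material": "steel", "object_type": "chairs"},
    "item3": {"material": "wood", "object_type": "chairs"},
}


def facet_search(**selected):
    """Return ids of items matching every facet value the user selected."""
    return sorted(
        item_id
        for item_id, facets in ITEMS.items()
        if all(facets.get(name) == value for name, value in selected.items())
    )


print(facet_search(material="wood"))
print(facet_search(material="wood", object_type="chairs"))
```

The second call combines two facets that no indexer ever joined into a single term, so the coordination is done by the user.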

Synthesis and Enumeration vs. Open and Closed Vocabularies
All synthetic systems are open to a limited extent, because indexers may combine simple terms to create new, longer terms; they are closed, however, if indexers may not add new terms to the tables
Synthetic systems are completely open if an indexer is allowed to add terms to the tables, add new tables, and add new rules for term synthesis

Extracting Terminology

Sources and Origins of Terminology
Where do you get terms for a controlled vocabulary?
The sources and origins of terminology may be set out in explicit statements of warrant
Making a conscious decision about warrant demonstrates that as a CV designer you are aware of the different possibilities and have made considered choices

Warrant
Warrant is “the authority that is used to justify decisions about what is included in a system” (Clare Beghtol)
Types of warrant:
  – Literary warrant
  – User warrant
  – Scholarly warrant
  – Cultural warrant
(Beghtol, 2002)

Literary & User Warrant
Literary Warrant
  – terms or organization reflect or are taken directly from resources themselves; this includes dictionaries, encyclopedias, etc. on a topic
User (aka Use, Enquiry) Warrant
  – terms or organization reflect use; user terminology may (or may not) be taken directly from logs of system use or from personal interactions with users
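When gathering candidate terms it can be useful to record which kind of warrant supports each one. The sketch below is one possible way to do that, under stated assumptions: the sample terms are invented, the only normalization is lower-casing and whitespace cleanup, and `normalize` and `gather_candidates` are hypothetical helper names.

```python
from collections import defaultdict


def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def gather_candidates(sources):
    """Map each candidate term to the set of warrant types that support it.

    `sources` maps a warrant type to the raw terms collected under it.
    """
    warrants = defaultdict(set)
    for warrant_type, terms in sources.items():
        for term in terms:
            warrants[normalize(term)].add(warrant_type)
    return dict(warrants)


# Invented sample data: terms from resources (literary) and from search logs (user).
sources = {
    "literary": ["Controlled vocabularies", "Thesauri", "Subject headings"],
    "user": ["thesauri", "controlled  vocabularies", "tags"],
}

candidates = gather_candidates(sources)
for term in sorted(candidates):
    print(term, sorted(candidates[term]))
```

A term backed by both kinds of warrant is a strong candidate; a term with user warrant only, such as "tags" in this invented data, prompts an explicit decision about whether to include it.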

Scholarly & Cultural Warrant
Scholarly Warrant
  – terms or organization reflect the opinions of a panel of human experts
Cultural Warrant
  – terms or organization derived from cultural practice or understanding; for example, Dewey and LCSH reflect American/Western cultural bias; Colon Classification reflects Indian/Eastern cultural bias (this also can be partly a function of literary warrant…)

Questions?
If not, take a break!!!

Exercise 7a
Today we are starting a multiple-part exercise in building controlled vocabularies
The first step is extracting concepts from a defined domain
Form your groups and work through the two parts of exercise 7a
Keep a copy of your concept lists to use in the next exercise