Facetted Classification and Thesauri Introduction University of California, Berkeley School of Information IS 245: Organization of Information In Collections IS 257 – Fall 2009
Lecture Overview
  Facetted Classification
    Traditional vs. Facetted Classification
    Designing Facetted Classifications
  Thesaurus Design (intro)
Controlled Vocabularies
  Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, and classifications) with the intent of aiding the searcher in finding information.
  That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata.
Hierarchical Classification
  Each category is successively broken down into smaller and smaller subdivisions
  No item occurs in more than one subdivision
  Each level is divided out by a “character of division” (also known as a feature)
  Example: distinguish “Literature” based on:
    Language
    Genre
    Time Period
  Slide author: Marti Hearst
Hierarchical Classification
  Literature
    English, French, Spanish
      Prose, Poetry, Drama
        16th, 17th, 18th, 19th, ...
  Slide author: Marti Hearst
Labeled Categories for Hierarchical Classification
  LITERATURE
  100 English Literature
    110 English Prose
      111 English Prose 16th Century
      112 English Prose 17th Century
      113 English Prose 18th Century
      ...
    120 English Poetry
      121 English Poetry 16th Century
      122 English Poetry 17th Century
    130 English Drama
      131 English Drama 16th Century
      …
  200 French Literature
  Slide author: Marti Hearst
Facetted Categories
  Mutually exclusive
    Non-overlapping, distinct categories
  Relational
    Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
  Composable
    Combined using grammars of order and relation to form compound descriptions
Facetted Classification Along With Labeled Categories
  A Language
    a English
    b French
    c Spanish
  B Genre
    a Prose
    b Poetry
    c Drama
  C Period
    a 16th Century
    b 17th Century
    c 18th Century
    d 19th Century
  Aa English Literature
  AaBa English Prose
  AaBaCa English Prose 16th Century
  AbBbCd French Poetry 19th Century
  BcCd Drama 19th Century
  Slide author: Marti Hearst
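The notation on this slide can be decoded mechanically, since each pair of letters names a facet and one focus within it. The following is a minimal sketch, assuming a two-character-per-facet notation; the `FACETS` table layout and the function names `decode` and `label` are illustrative choices, not part of the lecture.

```python
# Facets and foci are copied from the slide; the dict layout is an assumption.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Split a notation such as 'AaBaCa' into {facet name: focus} pairs."""
    if len(notation) % 2:
        raise ValueError(f"odd-length notation: {notation!r}")
    result = {}
    for i in range(0, len(notation), 2):
        facet_code, focus_code = notation[i], notation[i + 1]
        facet_name, foci = FACETS[facet_code]
        result[facet_name] = foci[focus_code]
    return result


def label(notation):
    """Join the decoded foci, in notation order, into a readable label."""
    return " ".join(decode(notation).values())


print(label("AaBaCa"))  # English Prose 16th Century
print(label("AbBbCd"))  # French Poetry 19th Century
```

Facets may be omitted from a notation, so a partial description such as `AaBa` decodes to just the language and genre.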
Ranganathan PMEST Facets
  P(ersonality) WHO: The most important types or names of things for the particular discipline
  M(atter) WHAT: Constituent materials
  E(nergy) HOW: Action or activity terms
  S(pace) WHERE: Where things occur
  T(ime) WHEN: When things occur
“Classical” CRG/BC2 Facet Analysis
  Entity, Kind, Part, Property, Material, Process, Operation, Patient, Product, By-Product, Agent, Space, Time
“Classical” Facet Analysis
  What is being done? Entity, Kind, Product, By-Product
  What are its parts? Part
  What are its properties? Property, Material
  How is this achieved? Process
  By what means? Operation
  By whom? Agent, Patient
  Where? Space
  When? Time
“Classical” Facet Analysis
  Nouns: Entity, Kind, Part, Patient, Product, By-Product, Agent
  Adjectives: Property, Material
  Intransitive Verb: Process
  Transitive Verb: Operation
  Adverb: Space, Time
Semantic and Syntactic Relationships
  Semantic relationships
    Is-A (thing/kind, genus/species)
      Mammals
        Primates
          Humans
    Has-Parts
      Human
        Head
          Eyes
  Syntactic relationships
    Compounds
      Wheat + harvesting = “wheat harvesting”
      Object + operation = operation on object
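The two kinds of relationship can be sketched as different data structures: semantic relations are stored links between terms, while syntactic relations are produced by combining terms on demand. This is only an illustration using the slide's examples; the dictionaries and function names below are assumptions.

```python
# Semantic relations: stored links between terms (examples from the slide).
IS_A = {"Humans": "Primates", "Primates": "Mammals"}   # species -> genus
HAS_PART = {"Human": ["Head"], "Head": ["Eyes"]}       # whole -> parts


def ancestors(term):
    """Follow Is-A links upward and return every broader kind, nearest first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def all_parts(whole):
    """Return every direct and indirect part of a whole, depth first."""
    parts = []
    for part in HAS_PART.get(whole, []):
        parts.append(part)
        parts.extend(all_parts(part))
    return parts


# Syntactic relation: nothing is stored; the compound is built when needed.
def compound(obj, operation):
    """Object + operation = operation on object."""
    return f"{obj} {operation}"


print(ancestors("Humans"))             # ['Primates', 'Mammals']
print(all_parts("Human"))              # ['Head', 'Eyes']
print(compound("wheat", "harvesting"))  # wheat harvesting
```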
Facetted Classification
  Clearly distinguishes between semantic relationships and syntactic relationships
  Semantic relationships
    Within a facet
    Containment relations
  Syntactic relationships
    Across facets
    Combinatoric relations
  Have a “syntax” for syntactic combination of semantic terms
Power of Facet Combinations
  The syntactic relations of facetted classifications enable a small controlled vocabulary to produce
    Many, many structured descriptions
    Complex, but formally structured descriptions using nested compound descriptions
    Descriptions for things we do not have words for
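A quick count makes the combinatoric claim concrete. The sketch below reuses the Language, Genre, and Period facets from the literature example earlier in the lecture; treating an unspecified facet as `None` is an assumption made only for this illustration.

```python
from itertools import product

facets = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}

# Size of the controlled vocabulary: the total number of foci.
vocabulary_size = sum(len(foci) for foci in facets.values())

# Descriptions that pick exactly one focus from every facet.
full = list(product(*facets.values()))

# Descriptions that may leave any facet unspecified (None),
# excluding the empty description with no facet filled in.
partial = [
    combo
    for combo in product(*[[None] + foci for foci in facets.values()])
    if any(combo)
]

print(vocabulary_size)  # 10
print(len(full))        # 36
print(len(partial))     # 79
```

Ten terms yield 36 complete descriptions and 79 descriptions when partial ones are allowed; adding one focus to any facet grows the counts multiplicatively rather than by one.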
Example: Objects
  Red Plastic Glass
  Blue Paper Straw
IS202 Project Team Facetted Classifications (2004)
  007
    Personality: Straw, Glass
    Operation: Drinking, Slurping, Sipping
    Material: Plastic, Paper
    Color: Blue, Red
  ARTery
    Color, Size, Material, Weight, Shape, Radius/Circumference, Density, Volume/Capacity, Function/Use, Hardness/Softness, Yin/Yang
  Culture Feed
    Color: Red, Blue
    Material: Plastic, Paper
    Use: Drink from, Drink with
    Dimensions: Circumference, Height, Diameter
  Picture Portal
    Color: Red, Blue
    Material: Paper, Plastic
    Use: Containment, Transport
    Shape: Torus, Planar
    # Holes: 1
  F.U.N.
    Shape
    Color
    Material
    Rigidity
    Function: Container, Conduit
    Locale
    Weight
    Size
  MNM
    Functionality: What it does, What you can do with it
    Physical Properties: Color, Shape, Material
  pillBox
    Function: Container, Conduit
    Form Shape: Cylinder
    Composition: Paper, Plastic
    Color: Blue, Red
    Size: Tall and skinny, Short and fat
  Team iTour
    Color: Red, Blue
    State: Solid, Non-porous, Flexible
    Material: Plastic, Paper
    Geometry: Cylindrical, Hollow
    Function: Container, Drinking, Sucking, Blowing
Example: Objects
  Gray Metal Glass
  Two Yellow Plastic Straws
Example: Objects
  Function: Drinking
  Form Shape: Cylinder
  Material: Plastic
  Color: Red
  Number: 1
Facetted Classification Design
  Collect examples that need to be classified
  Identify candidates for facets and subfacets
  Test classification scheme on examples for facet orthogonality
  Order foci within facets
  Explicate grammar for ordering and combining facets and subfacets
  Test classification scheme on examples for combinatoric power
  Extend foci for comprehensiveness where applicable
  Create new facets and subfacets where needed
  Test classification scheme on new examples, especially boundary cases
  Iterate and refine throughout
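Two of the testing steps above lend themselves to simple automated checks. The sketch below takes one narrow reading of "orthogonality" (no focus appears under more than one facet) and of boundary-case testing (an example that has no legal focus under some facet); both readings, the function names, and the trial scheme are assumptions for illustration. The scheme deliberately lists "Glass" under two facets, echoing the glass and straw objects from the class exercise.

```python
from itertools import combinations


def shared_foci(scheme):
    """Return {(facet_a, facet_b): overlapping foci} for each overlapping pair."""
    overlaps = {}
    for (name_a, foci_a), (name_b, foci_b) in combinations(scheme.items(), 2):
        common = set(foci_a) & set(foci_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


def unclassifiable(scheme, example):
    """Return the facets under which the example has no legal focus."""
    return [facet for facet, foci in scheme.items()
            if example.get(facet) not in foci]


# Hypothetical trial scheme: "Glass" is both an object type and a material.
scheme = {
    "Personality": ["Straw", "Glass"],
    "Material": ["Plastic", "Paper", "Glass"],
    "Color": ["Blue", "Red"],
}

print(shared_foci(scheme))
# {('Personality', 'Material'): {'Glass'}}

# Boundary case: a yellow plastic straw exposes a gap in the Color facet.
yellow_straw = {"Personality": "Straw", "Material": "Plastic", "Color": "Yellow"}
print(unclassifiable(scheme, yellow_straw))  # ['Color']
```

An overlap is not always an error, but it flags a term whose meaning depends on which facet it sits in, which is worth resolving before the grammar for combining facets is fixed.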
Facet Guidelines
  Terms on the same level in the ontology should be of the same level and type
  Facets, subfacets, and foci should have a discernible order
  Use of capitalization and singular/plural forms should be uniform
  Example:
    Sports
      Team Sports
        Baseball
        Football
        Basketball
      Solo Sports
        Marathon Running
Ordering Foci (“Array”)
  Simple to complex (Locomotions: walk, run, jump, skip, hurdle, cartwheel)
  Common/popular to uncommon/unpopular (Vegetarian Pizza Toppings: mushroom, onion, olive, artichoke, pineapple, pine nuts)
  Spatial, geographical, or geometric (Southwestern States: California, Nevada, Arizona, New Mexico)
  Chronological, historical, or evolutionary (Dinosaur Eras: Triassic, Jurassic, Cretaceous)
  Canonical (pre-established order) (Playground Counting: Eenie, Meenie, Miney, Mo)
  Alphabetical (Boys' Names: Al, Bob, Chuck, David, Ed, Frank, George, Harry)
  Size (T-Shirts: Small, Medium, Large, XL, XXL)
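In software, most of these orderings cannot be derived from the terms themselves and have to be stored as an explicit array. A minimal sketch, assuming the array is kept as a list and that alphabetical order is the fallback when none is given; `order_foci` is an illustrative name.

```python
SIZE_ORDER = ["Small", "Medium", "Large", "XL", "XXL"]
ERA_ORDER = ["Triassic", "Jurassic", "Cretaceous"]


def order_foci(foci, array=None):
    """Sort foci by their position in an explicit array.

    With no array, fall back to alphabetical order.
    Raises KeyError if a focus is missing from the array.
    """
    if array is None:
        return sorted(foci)
    rank = {focus: position for position, focus in enumerate(array)}
    return sorted(foci, key=rank.__getitem__)


eras = ["Jurassic", "Cretaceous", "Triassic"]
print(order_foci(eras))             # ['Cretaceous', 'Jurassic', 'Triassic']
print(order_foci(eras, ERA_ORDER))  # ['Triassic', 'Jurassic', 'Cretaceous']
print(order_foci(["XL", "Small", "Large"], SIZE_ORDER))  # ['Small', 'Large', 'XL']
```

The alphabetical result for the eras is a legal order but hides the chronology, which is the point of choosing an array deliberately.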
Types of Indexing Languages
  Uncontrolled keyword indexing
  Indexing languages
    Controlled, but not structured
  Thesauri
    Controlled and structured
  Classification systems
    Controlled, structured, and coded
  Facetted classification systems
Thesauri
  A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower, and other related terms.
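The links in this definition are usually reciprocal: recording that one term is broader than another implies the narrower link in the other direction. The class below is a minimal sketch of that idea, not a standard API. The attribute names follow the conventional BT (broader term), NT (narrower term), and RT (related term) abbreviations, and the sample terms come from the Sports example in the facet guidelines.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus that keeps hierarchical and associative links reciprocal."""

    def __init__(self):
        self.bt = defaultdict(set)  # term -> broader terms
        self.nt = defaultdict(set)  # term -> narrower terms
        self.rt = defaultdict(set)  # term -> related terms

    def add_bt(self, term, broader):
        """Record BT and the reciprocal NT in one step."""
        self.bt[term].add(broader)
        self.nt[broader].add(term)

    def add_rt(self, a, b):
        """Related-term links are symmetric."""
        self.rt[a].add(b)
        self.rt[b].add(a)

    def narrower_closure(self, term):
        """Return all narrower terms at any depth, e.g. to broaden a search."""
        found, stack = set(), [term]
        while stack:
            for child in self.nt.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


t = Thesaurus()
t.add_bt("Team Sports", "Sports")
t.add_bt("Solo Sports", "Sports")
for sport in ("Baseball", "Football", "Basketball"):
    t.add_bt(sport, "Team Sports")

print(sorted(t.narrower_closure("Team Sports")))
# ['Baseball', 'Basketball', 'Football']
```

Synonym and equivalence links, which point from non-preferred terms to descriptors, are left out here to keep the sketch short.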
Thesaurus Standards
  National and International Standards for Thesauri
    ANSI/NISO Z39.19-1994, American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
    ANSI/NISO Draft Standard Z39.4-199x, American National Standard Guidelines for Indexes in Information Retrieval
    ISO 2788, Documentation: Guidelines for the establishment and development of monolingual thesauri
    ISO 5964, Documentation: Guidelines for the establishment and development of multilingual thesauri
Thesaurus Examples
  Non-Facetted: The ERIC Thesaurus of Descriptors
  Semi-Facetted: The Medical Subject Headings (MESH) of the National Library of Medicine
  Facetted: The Art and Architecture Thesaurus
ERIC Thesaurus – Entry
ERIC Thesaurus – Alphabetic
ERIC Thesaurus – KWIC Index
ERIC Thesaurus – Hierarchies
ERIC Thesaurus – Groups
ERIC Thesaurus – Online http://www.ericfacility.net/extra/pub/thessearch.cfm
MESH – Entry
MESH – Alphabetic
MESH – Tree Structures
MESH – KWOC Index
MESH - Online http://www.nlm.nih.gov/mesh/meshhome.html
AAT – Facets
AAT – Hierarchies (print)
AAT – Hierarchies (online) http://www.getty.edu/research/tools/vocabulary/aat/
AAT – Entry (online)
Lecture Overview
  Thesaurus Design and Development
  Controlled Vocabularies for topical description
  Thesaurus Design
  Steps in Thesaurus Development (intro)
Why Develop a Thesaurus?
  To provide a conceptual structure or “space” for a body of information
  To make it possible to adequately describe the topical content of information resources at an appropriate level of generality or specificity
  To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
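The parenthetical about search effectiveness corresponds to the two standard retrieval measures, recall and precision, although the slide does not name them. The sketch below assumes that reading; the document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = share of retrieved items that are relevant
                ("without too much irrelevant material")
    recall    = share of relevant items that were retrieved
                ("most of the relevant material")
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents returned, 5 relevant in the collection.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
print(precision_recall(retrieved, relevant))  # (0.75, 0.6)
```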
Why Develop a Thesaurus?
  To provide vocabulary (or terminological) control
  When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with
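Vocabulary control of this kind is often implemented as an entry vocabulary: a table that maps every non-preferred term to its descriptor. The following is a minimal sketch under that assumption. The terms, the `USE` table, and the function names are hypothetical, and the matching is limited to case and whitespace normalization.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred descriptor.
USE = {
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "autos": "Automobiles",
}
DESCRIPTORS = {"Automobiles"}


def _key(term):
    """Normalize case and runs of whitespace so trivial variants match."""
    return " ".join(term.casefold().split())


# Descriptors map to themselves so that any starting point resolves.
_LOOKUP = {_key(entry): descriptor for entry, descriptor in USE.items()}
_LOOKUP.update({_key(d): d for d in DESCRIPTORS})


def preferred_term(term):
    """Return the descriptor for a known entry term, or None if unknown."""
    return _LOOKUP.get(_key(term))


print(preferred_term("Motor  Cars"))  # Automobiles
print(preferred_term("automobiles"))  # Automobiles
print(preferred_term("trucks"))       # None
```

Because the indexer and the searcher both pass through the same table, a document indexed from "autos" is found by a search that starts from "motor cars".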
Preliminary Considerations
  What is used now?
    Continue using an existing thesaurus?
    Ad hoc modification of existing thesaurus?
    Develop a new well-structured thesaurus?
  What is the scope and complexity of the subject field?
  What kind of retrieval objects or data will be dealt with?
  How exhaustive and specific is the desired description of objects?
Preliminary Considerations
  The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus
  It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows
  Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists
Development of a Thesaurus
  Term selection
  Merging and development of concept classes
  Definition of broad subject fields and subfields
  Development of classificatory structure
  Review, testing, application, revision
Flow of Work in Thesaurus Construction (based on Soergel, pp. 327-333)
  Select Sources
  Assign Codes
  Select Terms
  Record Selected Terms
  Sort Terms
  Merge Identical Terms
  Define Broad Subject Fields
  Merge Terms in Same Concept Class
  Sort Terms into Broad Subject Fields
  Define Subfields within One Subject Field
  Work Out Detailed Structure of the Subject Field
  Select Preferred Terms
  All Subfields of Broad Subject Finished? (Yes/No)
  All Broad Subjects Finished? (Yes/No)
  Improve Class Structure
  Print Classified Index and Review
  Discuss with Experts and Users
  Select Descriptors and Checklist Items
  Produce Full Thesaurus and Check References
  Assign Notation
  Review and Test
  Many Modifications? (Yes/No)
  Revise as Needed
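The early clerical steps in this flow (record selected terms, sort terms, merge identical terms) are easy to automate. The sketch below assumes that each recorded term carries the code of the source it came from and that "identical" means equal after case and whitespace normalization; both are assumptions, and the source codes `S1` and `S2` are made up.

```python
from collections import defaultdict


def normalize(term):
    """Fold case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.casefold().split())


def merge_identical(candidates):
    """Merge identical terms and return them sorted.

    candidates: iterable of (term, source_code) pairs.
    Returns a list of (normalized term, sorted source codes).
    """
    merged = defaultdict(set)
    for term, source in candidates:
        merged[normalize(term)].add(source)
    return [(term, sorted(sources)) for term, sources in sorted(merged.items())]


recorded = [
    ("Team sports", "S1"),
    ("team  Sports", "S2"),
    ("Baseball", "S1"),
]
print(merge_identical(recorded))
# [('baseball', ['S1']), ('team sports', ['S1', 'S2'])]
```

Keeping the source codes attached to each merged term shows how widely a candidate is used, which can inform the later choice of preferred terms. Merging terms that belong to the same concept class, as opposed to identical strings, still needs human judgment.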