1 Vocabulary & languages in indexing & searching Connection: indexing searching

Presentation transcript:

1 Vocabulary & languages in indexing & searching Connection: indexing searching © Tefko Saracevic

2 Central idea
Indexing and searching are inextricably connected
– you cannot search what was not first indexed in some manner or other; to be searched, everything is and must be indexed somehow, even if it is not called “indexed”
– indexing of documents or objects is done so that they are searchable; there are a great many ways to do indexing
– to index, one needs an indexing language; there are a great many indexing languages
– even taking every word in a document is an indexing language
Knowing searching is knowing indexing

3 ToC
1. Definitions
2. Controlled & uncontrolled vocabularies
3. Inverted indexes
4. Thesaurus

4 Part 1. Definitions: a few concepts, from general to specific

5 Defined concepts, valid for application in indexing & searching
General
– language
– vocabulary
Specific
– index terms
– indexing vocabulary
– indexing language
– descriptors
– keywords
– search terms
– search vocabulary
– query language

6 General definitions [Encarta Dictionary]
Language
1. communication with words: the human use of spoken or written words as a communication system
2. system of communication: a system of communication with its own set of conventions or special words
Vocabulary
1. words of language: all the words used in a language as a whole
2. words of subject area: the set of words associated with a subject or area of activity, or used by an individual person

7 Specific definitions
Starting from the most basic concept:
Index term: a word or phrase that denotes (describes) a concept & connotes (implies) a class
– the index term “table” describes a table and implies many kinds of tables, for which, if desired, we may have more specific index terms

8 More definitions...
Indexing vocabulary: a set of index terms used in a domain or for a set of documents or objects; it could even cover a single document or object, e.g. a book
Indexing language: an indexing vocabulary together with rules (syntax, grammar) for its application and use

9 Variations on index term
Descriptor: a word or phrase used to identify a topic or idea. Part of a controlled vocabulary, normally listed in a thesaurus (defined later). May be used as a search term.
Keyword: a significant word from the text of a record which can be used as a search term in a free-text search to retrieve all the records containing it
– could be assigned manually, but now done mostly automatically
– a key entry in automatic indexing

10 Searching definitions
Question: a request by a user related to the user’s information need, task, or problem at hand
Question analysis: breakdown & elaboration of the concepts in a question, to be translated into search terms
Query: a question or part thereof as stated for searching according to the rules of a given system

11 more...
Search term: a counterpart to index term, also denoting a concept and connoting a class for a search
Search vocabulary: a set of search terms in a domain or available in a system
Query language: a search vocabulary together with rules for its use in searching

12 elaboration …
The question is what the user asks and what you may then have elaborated. The query is what is asked of the computer to match, i.e. what is put in for searching. The question is transformed into a query.
Example:
Question:
– What are some major historical developments in the area of information retrieval?
Transformed into a query:
– history information retrieval (in Google)
– history AND information(w)retrieval (in Dialog), plus you have to select which file(s) to search

13 more … “An index language is the language used to describe documents and requests. The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently. The vocabulary of an index language may be controlled or uncontrolled.” (van Rijsbergen, 1979)

14 Part 2. Controlled & uncontrolled vocabularies: approaches, tensions

15 Controlled vocabulary
Predetermined
– indicates which terms are to be used in indexing
– may show definitions of and relations between terms
– examples: thesaurus, subject heading list, classification
Also indicates terms that may be selected for searching
An indexing AND a searching tool
Human constructed
– and costly to construct and use

16 Example of controlled vocabularies
Medical Subject Headings (MeSH) of the National Library of Medicine
One of the largest & most comprehensive
– used in indexing & searching
More than 22,000 descriptors, with more than 106,000 cross-references
More than 139,000 Supplementary Concept Records
Approximately 50 publication types (Journal Article, News, Editorial, Review, Randomized Controlled Trial, etc.)
Done by indexers
But also experimenting with semi-automatic indexing

17 Uncontrolled vocabulary
Derived from texts (natural language) in documents
– nowadays automatically, using various methods or algorithms
– constantly tested: which algorithm is better?
Used to construct inverted indexes
In turn, inverted indexes are used for free text searching

18 Comparison of vocabularies
Controlled
– the idea of a controlled vocabulary is to reduce the variability of expressions used to characterize documents being indexed & searched for
– manual, costly, time consuming; also semi-automatic in some systems
– dynamic: needs constant changing, updating
Uncontrolled or free
– the idea is to follow natural language expressions as they occur in documents
– could be automatic, which is a great advantage
– algorithms are constantly changing & improving, e.g. parsing phrases, connections
– prevailing in many applications
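
To make "reducing the variability of expressions" concrete, here is a minimal sketch in Python. The table of variants and the descriptor "automobiles" are invented for illustration; they are not taken from MeSH, ERIC, or any other real vocabulary.

```python
# Hypothetical "use" references: variant expression -> authorized descriptor.
USE_REFERENCES = {
    "cars": "automobiles",
    "motor cars": "automobiles",
    "autos": "automobiles",
    "automobiles": "automobiles",
}

def to_descriptor(expression):
    """Return the authorized descriptor for an expression, or None if unknown."""
    key = " ".join(expression.lower().split())  # normalize case and spacing
    return USE_REFERENCES.get(key)

variants = ["Cars", "motor  cars", "Autos"]
descriptors = {to_descriptor(v) for v in variants}
print(descriptors)  # the three expressions collapse to a single descriptor
```

A free-text system would instead keep all three expressions as separate index terms, which is cheaper to build but leaves the searcher to think of every variant.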

19 Controlled vs. free text searching
An endless source of debate & controversy
But each has its place for a given circumstance & retrieval goal
Each has strengths & weaknesses
– can you list them, or find a list comparing them? This is a good search assignment
Users mostly use free text searching
Professional searchers use both as warranted
– they have to know when
Professional credo: KNOW THY CONTROLLED VOCABULARY, so you can apply it in searching as and when needed

20 Part 3. Inverted indexes: use in searching

21 Inverted indexes & searching
It is useful to know how they function in order to understand search & retrieval. Steps:
1. Each document is indexed
– every word in a document is taken as an index term, with the exception of stop words, if any
– position in the text is noted, even for stop words
2. Indexes for all documents are merged
– index terms are arranged alphabetically in the bowels of the system, so they can be searched
– under each index term are the numbers of the documents in which it appears & its position in the text of each document
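
The two steps above can be sketched in a few lines of Python. This is an illustration under simple assumptions (whitespace tokenization, lower-casing, a three-word stop list), not a description of how any particular retrieval system is built. The four one-sentence documents and the stop words are those of the example slide in this part; the function name is made up for the sketch.

```python
from collections import defaultdict

STOP_WORDS = {"a", "of", "in"}  # assumed stop list, as in the example slide

def build_inverted_index(docs):
    """Map each index term to a list of (doc_number, position) postings."""
    index = defaultdict(list)
    for doc_no, text in sorted(docs.items()):
        # Step 1: every word is a candidate index term. Positions start at 1
        # and are counted for stop words too, although those are not indexed.
        for position, word in enumerate(text.lower().split(), start=1):
            if word not in STOP_WORDS:
                index[word].append((doc_no, position))
    # Step 2: merge into a single index with terms in alphabetical order.
    return dict(sorted(index.items()))

docs = {
    1: "Slow brown truck arrived",
    2: "Shipment of brownies damaged in a fire",
    3: "Delivery of brownies arrived in a slow truck",
    4: "Shipment of brownies arrived in a truck",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Keeping the position with every posting is what later makes phrase and proximity searching possible without rereading the documents.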

22 So, when you search for digital AND libraries:
1. the computer takes all documents under digital
2. and all documents under libraries
3. compares them to “see” which documents have both terms, and then
4. provides you the list of those documents that contain both terms, no matter where
This is also called “coordinate indexing”: coordination is done at the time of searching
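
The four steps above amount to intersecting two sets of document numbers. A minimal sketch, assuming postings have already been reduced to document numbers; the data and the function name are hypothetical.

```python
# Postings reduced to document numbers (hypothetical data for illustration).
postings = {
    "digital": {1, 2, 5, 8},
    "libraries": {2, 3, 8, 9},
}

def search_and(postings, *terms):
    """Return the documents that contain every term, wherever it occurs."""
    doc_sets = [postings.get(term, set()) for term in terms]
    if not doc_sets:
        return []
    # Steps 1-3: take the documents under each term and keep those in common.
    return sorted(set.intersection(*doc_sets))

result = search_and(postings, "digital", "libraries")
print(result)  # step 4: the list handed back to the searcher
```

Nothing about the combination "digital libraries" was stored at indexing time; the coordination happens only when the query is run.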

23 Variation: when you search for digital (WITH) libraries or “digital libraries”, i.e. as a phrase:
1. the computer goes through the same steps as before, but then also
2. “looks” for documents where digital is positioned right before libraries
– remember: the computer “knows” the position of each term in each document, each sentence
So searching for a phrase is a form of searching for terms connected with AND, but in a given sequence
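
A sketch of the extra adjacency check, again in Python and again only an illustration. The postings for slow and truck are copied from the example slide that follows; the data layout (term, then document number, then list of positions) and the function name are assumptions of this sketch.

```python
# Positional postings: term -> {doc_number: [positions]}.
index = {
    "slow": {1: [1], 3: [7]},
    "truck": {1: [3], 3: [8], 4: [7]},
}

def search_adjacent(index, first, second):
    """Return documents where `second` occurs immediately after `first`."""
    first_docs = index.get(first, {})
    second_docs = index.get(second, {})
    hits = []
    # Same AND step as before: only documents holding both terms qualify.
    for doc_no in sorted(first_docs.keys() & second_docs.keys()):
        second_positions = set(second_docs[doc_no])
        # Extra step: the second term must sit at position + 1.
        if any(pos + 1 in second_positions for pos in first_docs[doc_no]):
            hits.append(doc_no)
    return hits

print(search_adjacent(index, "slow", "truck"))
```

Document 1 passes the AND step but fails the position check (slow at 1, truck at 3), so only document 3 is returned.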

24 Example of searches in an inverted file
For simplicity, each document has one sentence. Stop words: “a”, “of”, “in”, but their positions are counted.

Documents
Doc #   Text
1       Slow brown truck arrived
2       Shipment of brownies damaged in a fire
3       Delivery of brownies arrived in a slow truck
4       Shipment of brownies arrived in a truck

Inverted index
Term        (doc number:position)
arrived     (1:4), (3:4), (4:4)
brown       (1:2)
brownies    (2:3), (3:3), (4:3)
damaged     (2:4)
delivery    (3:1)
fire        (2:7)
shipment    (2:1), (4:1)
slow        (1:1), (3:7)
truck       (1:3), (3:8), (4:7)

A search for slow AND truck retrieves documents 1 and 3, since both contain slow and truck.
A search for slow (w) truck retrieves only document 3, in which slow is 7th and truck is 8th, so they are right next to each other. Document 1 has both words, but not next to each other, so it is not retrieved.

25 Everything is inverted: consequences for searching
All words in all fields are inverted, no matter whether in
– title, full text, descriptor, author …
Thus all are searchable
In some systems (but not all) phrases are parsed & thus searchable
– but in most, phrases are searched as A(w)B or “A B”
But beware:
– a search for libraries as a descriptor, e.g. libraries/DE in Dialog, will retrieve ALL other descriptors in which libraries appears, in addition to the descriptor libraries itself, e.g. academic libraries, public libraries, special libraries, research libraries …
– but there are search tricks to avoid that
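
The warning above follows directly from word-by-word inversion of the descriptor field. The Python sketch below contrasts matching a word inside any descriptor with matching a descriptor as a whole. The records and function names are hypothetical, and the code illustrates the principle only; it does not reproduce Dialog's implementation or its command syntax.

```python
# Descriptors assigned to three hypothetical records.
records = {
    1: ["libraries"],
    2: ["academic libraries", "library administration"],
    3: ["public libraries"],
}

def search_descriptor_word(records, word):
    """Match the word anywhere inside any descriptor (word-inverted field)."""
    return sorted(
        rec for rec, descriptors in records.items()
        if any(word in d.split() for d in descriptors)
    )

def search_descriptor_exact(records, descriptor):
    """Match only records that carry the descriptor as a whole."""
    return sorted(rec for rec, ds in records.items() if descriptor in ds)

print(search_descriptor_word(records, "libraries"))   # all three records
print(search_descriptor_exact(records, "libraries"))  # record 1 only
```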

26 Part 4. Thesaurus: a major tool for controlled vocabularies in information retrieval (IR)

27 What is a thesaurus? “For writers, it is a tool like Roget’s, one with words grouped and classified to help select the best word to convey a specific nuance of meaning. For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus.” (Milstead, 2000)

28 more… “A thesaurus to an information scientist is a controlled set of the terms used to index information in a database, and therefore also to search for information in that database so the same concepts are represented by the same term.” (Batty, 1998)

29 Thesaurus
Good old Peter Mark Roget had a most useful idea in the 1850s & did a great job
Following this idea, the thesaurus became THE major tool for controlled vocabulary in IR
– starting in the 1950s & to this day, a great many IR thesauri have been developed for all kinds of subjects, including, for instance, information science
– all have a similar structure & function
– but they are difficult & costly to construct & maintain

30 Standards, software
Subject to international standards:
– “Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies”, ANSI/NISO Standard Z39.19
– followed by “Construction of Controlled Vocabularies. A Primer”
A number of software products are available for thesaurus construction and maintenance
– e.g. as listed by the American Society for Indexing

31 Examples of thesauri
Thesauri have been constructed for a great many domains, from A to Z
– here are some lists: international & multilingual thesauri; online thesauri, among them the ERIC Thesaurus (we use it as an example)
– BUT: different thesauri may and do treat the same descriptor (index term) differently, having different, more, or fewer narrower, broader, and related terms; thus it is dangerous to use them interchangeably

32 Basic thesaurus components
For each entry a thesaurus has a classification grid:
– Descriptor (DE): an index term that has
  – Scope note (SN): context in which it is used
  – Broader terms (BT): higher in a hierarchy
  – Narrower terms (NT): lower in a hierarchy
  – Related terms (RT): other connected descriptors
  – Used for (UF): synonyms that are not descriptors
– Note: not all of these may be present for every descriptor
A searcher or indexer can use these as a guide for selection/rejection & for browsing to get ideas

33 Standard structure
With variations on the theme, thesauri have a similar conceptual structure to guide the searcher or indexer:
Descriptor (DE), linked to
– Broader terms (BT)
– Narrower terms (NT)
– Related terms (RT)
– Used for (UF): synonyms
– Scope note (SN)
Note: not every descriptor has to have all of these
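
The DE/SN/BT/NT/RT/UF structure maps naturally onto a small record type. The Python sketch below is one possible representation, not a format prescribed by any thesaurus standard. The sample entry and its relations are invented for illustration and are not copied from the ERIC thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    descriptor: str                                # DE
    scope_note: str = ""                           # SN
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    used_for: list = field(default_factory=list)   # UF

    def display(self):
        """List the entry in SN/BT/NT/RT/UF order, skipping empty parts."""
        lines = [f"DE  {self.descriptor}"]
        if self.scope_note:
            lines.append(f"SN  {self.scope_note}")
        for label, terms in (("BT", self.broader), ("NT", self.narrower),
                             ("RT", self.related), ("UF", self.used_for)):
            lines.extend(f"{label}  {term}" for term in terms)
        return lines

# Hypothetical entry: the relations shown are assumptions, not ERIC data.
entry = ThesaurusEntry(
    descriptor="Libraries",
    narrower=["Academic Libraries", "Public Libraries"],
    related=["Library Services"],
)
print("\n".join(entry.display()))
```

Every part except the descriptor defaults to empty, which mirrors the note that not every descriptor has all of these.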

34 Same thesaurus but …
Examples of the ERIC (Educational Resources Information Center) thesaurus as used differently in different systems:
1. ERIC’s own system
2. ERIC file on Dialog (begin 1)
3. ERIC file on OVID (accessible through RUL)
Notice how each uses the same ERIC thesaurus but displays & searches it in its own way; the principles are still the same
Oh well…

35 ERIC online thesaurus on ERIC
Allows for
– searching for words that are included in descriptors, by category or in all categories
– browsing alphabetically
– browsing in one of about 40 categories
A search for libraries in all categories found 50 descriptors that include “library”
Out of these we selected libraries

36 ERIC online thesaurus on ERIC: descriptor libraries
[Screenshot. Callouts: “Descriptor”; “Other descriptors – one could browse”]

37 ERIC thesaurus on Dialog
In a convoluted way, the ERIC thesaurus (and other ones) can be displayed on Dialog (and other vendors, such as OVID)
How?
– begin in file 1 (ERIC)
– then expand a desired term; here we used the term library
– you will see under RT that certain terms have related terms, meaning that these are thesaurus entries
– then expand on one of those to see its related terms
– then you can browse & choose which ones to use in a search
And here are printed screens of the process

38 Note on the command expand (e) in Dialog
Dialog (and some other systems) has a neat way to display all entries in any inverted index alphabetically
– the command is expand, or e
– it can be done in any of the indexes, basic and additional
For instance:
– e library will provide an alphabetical list of terms around library in the basic index; after expanding again you can see related terms (see next)
– e Au=Saracevic will provide an alphabetical list of all entries in the author additional index around that name

39 going … Expand library

40 going … Callouts: RT indicates related terms; items have library; this one has 14 related terms

41 going … We now chose the descriptor LIBRARY ADMINISTRATION and expand on that one. Neat trick: you can expand on an expand & get related terms out of the ERIC thesaurus

42 going … Related terms for this one are listed. These are now R terms of various types. You can expand on this one to see other RT. You can also select any of these to search

43 going … We have now selected r15 (library services) to search for documents

44 going … And this is the number of items we got. Now we can view some items in a chosen format, or we can further modify this search: add, refine, …

45 gone. This is one of the items we got. Callouts: descriptors used for this item; additional index terms

46 Start ERIC search on OVID (accessed through RUL). Start with

47 Automatically gets you to the thesaurus. This one selected to enlarge

48 Allows you to select the thesaurus (or not). This one selected to enlarge

49 Then go to the ERIC thesaurus on OVID (accessed through RUL). Callouts: Scroll; Descriptor

50 gone
Next, go and select additional terms, or search for libraries only
See the number of results
Select fields and formats by making a check
and happy going …
Suggestion: repeat this exercise
The point being that the same thesaurus is handled differently by different databases

51 Relevance feedback: an important search tactic
A method for using information in items judged relevant to further refine or change the search
– first you find a relevant document (or documents)
– in the relevant document(s) you browse titles, descriptors, identifiers, abstracts … to get leads (e.g. keywords) for further search terms & tactics
– then you search for those
In some advanced systems this may be done automatically
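
One simple way to automate the "get leads" step is to count the descriptors that appear in the records judged relevant and propose the most frequent ones that are not yet in the query. The Python sketch below assumes exactly that; real systems may weight terms quite differently. The records, field name, and function name are hypothetical.

```python
from collections import Counter

def suggest_terms(relevant_records, query_terms, top_n=2):
    """Propose descriptors that recur in relevant records but are not yet searched."""
    counts = Counter()
    for record in relevant_records:
        counts.update(d.lower() for d in record["descriptors"])
    already_used = {t.lower() for t in query_terms}
    candidates = [(t, n) for t, n in counts.items() if t not in already_used]
    # Most frequent first; ties broken alphabetically for a stable result.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in candidates[:top_n]]

# Hypothetical records judged relevant after a first search on "libraries".
relevant = [
    {"descriptors": ["Library Services", "Users", "Libraries"]},
    {"descriptors": ["Library Services", "Information Retrieval"]},
    {"descriptors": ["Information Retrieval", "Library Services"]},
]
suggestions = suggest_terms(relevant, ["libraries"])
print(suggestions)
```

The searcher still decides which suggestions to use; the code only surfaces leads, as in the manual version of the tactic.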

52 Query expansion: another important search tactic
A method for adding, modifying, changing search terms in a query
– to broaden, narrow, focus, change … terms
Many sources can be used
– relevance feedback, thesauri, dictionaries, textbooks, documents, catalogs, & people: users, colleagues, your own mind & experience
Some systems suggest terms for query expansion

53 Query expansion tactics
You can use the same structure for expanding query terms as in a thesaurus
– think of what may be broader, narrower, or related terms or synonyms to use as search terms
Query term, linked to
– Broader terms (BT)
– Narrower terms (NT)
– Related terms (RT)
– Synonyms
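
As a sketch of this tactic, the Python function below expands one query term with the chosen kinds of thesaurus relations and joins the result with OR. The thesaurus entry is invented for illustration, and the quoted OR string is a generic Boolean form assumed for the example, not the syntax of any particular system.

```python
# Hypothetical thesaurus entry; the relations are assumptions for the sketch.
THESAURUS = {
    "libraries": {
        "BT": ["information sources"],
        "NT": ["academic libraries", "public libraries"],
        "RT": ["library services"],
        "UF": ["library facilities"],
    },
}

def expand(term, thesaurus, relations):
    """Build an OR group from a term plus its terms under the chosen relations."""
    entry = thesaurus.get(term, {})
    terms = [term]
    for relation in relations:
        for candidate in entry.get(relation, []):
            if candidate not in terms:  # avoid repeating a term
                terms.append(candidate)
    return " OR ".join(f'"{t}"' for t in terms)

print(expand("libraries", THESAURUS, ["NT"]))        # narrower terms added
print(expand("libraries", THESAURUS, ["BT", "RT"]))  # broader and related
```

Choosing which relations to pass in is the tactical decision: adding BT or RT terms pulls in a wider set of documents, while searching a single NT on its own narrows the search.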

54 Conclusion
At the base of all searching are
– terms
– vocabularies
– languages
– but a variety exists
In reality, in searching there is no completely controlled or uncontrolled vocabulary
– it is a matter of degree
– & most importantly, a matter of mastery

55 symbolically: controlled & free vocabulary

56 thank you!