Vocabulary & languages in searching

Presentation transcript:

Vocabulary & languages in searching
Connection: indexing and searching
© Tefko Saracevic, Rutgers University

Basic assertion
Indexing and searching are inexorably connected:
- you cannot search what was not first indexed in some manner or other
- documents or objects are indexed in order to be searchable
- there are many ways to do indexing
- to index, one needs an indexing language
- there are many indexing languages; even taking every word in a document is an indexing language
Knowing searching is knowing indexing

General definitions [Encarta Dictionary]
Vocabulary: “1. words known LANGUAGE - all the words used by or known to a particular person or group, or contained in a language as a whole”
Language: “1. speech of group the speech of a country, region, or group of people, including its diction, syntax, and grammar 2. system of communication a system of communication with its own set of conventions or special words”

From general to specific
These general definitions carry over to indexing & searching, where they define:
- index terms, indexing vocabulary, indexing language
- search terms, search vocabulary, query (request, search) language

Specific
Index term: a word or phrase that denotes (describes) a concept & connotes (implies) a class
- the index term “table” describes a table and implies many kinds of tables, for which, if desired, we may have more specific index terms

Specific ...
Indexing vocabulary: a set of index terms used in a domain or for a set of documents or objects
- it could even be for a single document or object, e.g. a book
Indexing language: an indexing vocabulary together with rules (syntax, grammar) for its application and use

Specific ...
Search terms: a counterpart to index terms, also denoting a concept and connoting a class, for a search
Search vocabulary: a set of search terms in a domain or available in a system
Query language: a search vocabulary together with rules for its use in searching

More
“An index language is the language used to describe documents and requests. The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently. The vocabulary of an index language may be controlled or uncontrolled.” (van Rijsbergen, 1979)

Controlled vocabulary
Predetermined: indicates what terms are to be used in indexing
- may show definitions of and relations between terms
- examples: thesaurus, subject heading list, classification
Also indicates terms that may be selected for searching
An indexing AND a searching tool
Human-constructed, and costly to construct and use

Uncontrolled vocabulary
Derived from documents
- nowadays automatically, using various methods or algorithms
- constant issue: which way is “better”
Used to construct inverted indexes
- a concordance, such as one of the Bible, indicating the place and position of each word mentioned in the text, is an inverted index
- monks used to do it in 12th century, computers do it today
Inverted indexes are used for free text searching

Controlled vs. free text searching
Endless source of debate & controversy
But each has its place for a given circumstance & retrieval goal
Each has strengths & weaknesses
- can you list them, or find a list comparing them?
Users mostly use free text searching
Professional searchers use both as warranted
As option: KNOW THY CONTROLLED VOCABULARY

Inverted indexes
Useful to know how they function in order to understand search & retrieval. Steps:
1. Each document is indexed
   - every word in a document is taken as an index term, with the exception of stop words
   - its position in the text is noted
2. Indexes for all documents are merged
   - index terms are arranged alphabetically in the bowels of the system
   - under each index term are the numbers of the documents in which it appears & its position in the text of each document
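
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular retrieval system does it: the stop-word list, the two sample documents, and the function names `index_document` and `merge_indexes` are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed stop-word list for this sketch; real systems use longer lists.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def index_document(text):
    """Step 1: map each non-stop word of one document to its positions (1-based)."""
    positions = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in STOP_WORDS:
            positions[word].append(position)
    return positions


def merge_indexes(documents):
    """Step 2: merge per-document indexes into one alphabetically ordered index."""
    merged = defaultdict(list)
    for doc_number in sorted(documents):
        for term, positions in index_document(documents[doc_number]).items():
            merged[term].extend((doc_number, p) for p in positions)
    return dict(sorted(merged.items()))


# Hypothetical two-document collection.
documents = {
    1: "Indexing of digital libraries",
    2: "Libraries and the digital divide",
}
inverted_index = merge_indexes(documents)
for term, postings in inverted_index.items():
    print(term, postings)
```

Each posting is a (document number, position) pair. Stop words are skipped as index terms but still counted when numbering positions, so the distance between two indexed words reflects the original text.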

So, when you search for digital AND libraries:
- the computer takes all documents under digital and all documents under libraries
- compares them to “see” which documents have both terms
- and then provides you the list of those documents, in a default format or in a format you choose
This is also called “coordinate indexing”
- coordination is done at the time of searching
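
The comparison step can be illustrated as a merge of two sorted lists of document numbers. The sketch below is one common way to do it, not necessarily what a given system does; the document numbers under each term are made up, and the function name `intersect` is an assumption of the example.

```python
def intersect(postings_a, postings_b):
    """Return the document numbers present in both sorted posting lists."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # Same document under both terms: it satisfies the AND.
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical document numbers listed under each term in the index.
index = {
    "digital": [2, 5, 8, 13],
    "libraries": [1, 5, 9, 13, 21],
}
hits = intersect(index["digital"], index["libraries"])
print(hits)  # [5, 13]
```

Because both lists are kept sorted, each is read only once, which is why the coordination can be done cheaply at search time.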

Variation: when you search for digital (WITH) libraries or “digital libraries”, i.e. as a phrase:
- the computer goes through the same steps as before
- but then also “looks” for documents where digital is positioned right before libraries
- remember: the computer “knows” the position of each term in each document, each sentence
So searching for a phrase is a form of searching for terms connected with AND, but in a given sequence
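
A small sketch of that extra positional check follows. The index layout (term, then document number, then list of positions), the sample postings, and the function name `phrase_search` are assumptions for illustration only.

```python
def phrase_search(index, first, second):
    """Return documents in which `second` appears right after `first`."""
    hits = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # The AND step: only documents containing both terms are candidates.
    for doc_number in sorted(first_postings.keys() & second_postings.keys()):
        second_positions = set(second_postings[doc_number])
        # The extra step: is some position of `first` followed by `second`?
        if any(p + 1 in second_positions for p in first_postings[doc_number]):
            hits.append(doc_number)
    return hits


# Hypothetical positional index: term -> {document number: [positions]}.
index = {
    "digital": {1: [3], 2: [4], 3: [1, 6]},
    "libraries": {1: [4], 2: [1], 3: [7]},
}
and_hits = sorted(index["digital"].keys() & index["libraries"].keys())
phrase_hits = phrase_search(index, "digital", "libraries")
print(and_hits)     # [1, 2, 3]
print(phrase_hits)  # [1, 3]
```

Document 2 contains both words and so satisfies the AND, but libraries comes before digital there, so it drops out of the phrase search.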

Example of inverted file
For simplicity, each document has one sentence. Stop words: “a,” “of,” “in.” Stop words are not indexed, but they are still counted when positions are numbered.

Doc #   Text
1       Slow brown truck arrived
2       Shipment of brownies damaged in a fire
3       Delivery of brownies arrived in a slow truck
4       Shipment of brownies arrived in a truck

Inverted index
Term       Postings (doc number:position)
arrived    (1:4), (3:4), (4:4)
brown      (1:2)
brownies   (2:3), (3:3), (4:3)
damaged    (2:4)
delivery   (3:1)
fire       (2:7)
shipment   (2:1), (4:1)
slow       (1:1), (3:7)
truck      (1:3), (3:8), (4:7)

A search for slow AND truck gets as results documents 1 and 3, since both contain slow and truck.
A search for slow (w) truck retrieves only document 3, in which slow is 7th and truck is 8th, so they are right next to each other. Doc 1 has both words, but not next to each other, so it is not retrieved.

Thesaurus
Good old Peter Mark Roget had a most useful idea & did a great job
Following this idea, the thesaurus became THE major tool for controlled vocabulary in information retrieval (IR)
- starting in the 1950s & to this day, many IR thesauri have been developed
- all have a similar structure & function
- but they are difficult & costly to construct

What is a thesaurus?
“For writers, it is a tool like Roget’s - one with words grouped and classified to help select the best word to convey a specific nuance of meaning. For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus.” (Milstead, 2000)

more…
“A thesaurus to an information scientist is a controlled set of the terms used to index information in a database, and therefore also to search for information in that database so the same concepts are represented by the same term.” (Batty, 1998)

Basic thesaurus components
For each entry, a thesaurus has a classification grid:
Descriptor (DE): an index term that has
- Scope note (SN): context in which it is used
- Broader terms (BT): higher in a hierarchy
- Narrower terms (NT): lower in a hierarchy
- Related terms (RT): other connected descriptors
- Used for (UF): synonyms that are not descriptors
Note: not all of these may be present for every descriptor
A searcher or indexer can use these as a guide for selection/rejection & for browsing to get ideas
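
One way to picture such an entry is as a small record with one field per component. The sketch below is an assumption-laden illustration: the class name `ThesaurusEntry`, the helper `build_lookup`, and the two sample entries (including their scope note and synonyms) are invented for the example and are not copied from any real thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    descriptor: str                               # DE
    scope_note: str = ""                          # SN
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT
    used_for: list = field(default_factory=list)  # UF


def build_lookup(entries):
    """Map every descriptor and every UF synonym to its authorized descriptor."""
    lookup = {}
    for entry in entries:
        lookup[entry.descriptor.lower()] = entry.descriptor
        for synonym in entry.used_for:
            lookup[synonym.lower()] = entry.descriptor
    return lookup


# Invented entries for illustration only.
entries = [
    ThesaurusEntry(
        descriptor="Libraries",
        narrower=["Electronic Libraries"],
        related=["Library Administration"],
    ),
    ThesaurusEntry(
        descriptor="Electronic Libraries",
        scope_note="Collections held and served in digital form",
        broader=["Libraries"],
        used_for=["Digital Libraries", "Virtual Libraries"],
    ),
]
lookup = build_lookup(entries)
print(lookup["digital libraries"])  # Electronic Libraries
```

The lookup shows what the UF component is for: a searcher's or indexer's synonym is routed to the one authorized descriptor, so the same concept is always represented by the same term. Fields left empty reflect the note that not every descriptor has every component.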

Examples of thesauri
Thesauri have been constructed for a great many domains, from A to Z
Here are some lists:
- international & multilingual thesauri
- online thesauri, among them the ERIC Thesaurus (we use it as an example)
BUT: different thesauri may and do treat the same descriptor (index term) differently, having different, more, or fewer narrower, broader, and related terms
- thus it is dangerous to use them interchangeably

Standard structure
With variations on the theme, thesauri have a similar conceptual structure to guide the searcher or indexer:
- Descriptor - DE
- Broader terms - BT
- Narrower terms - NT
- Related terms - RT
- Used for - UF (synonyms)
- Scope note - SN
Note: not every descriptor has to have all of these

Same thesaurus but …
Examples of the ERIC (Educational Resources Information Center) thesaurus as used differently in different systems:
- ERIC's own system
- ERIC file on DIALOG (begin 1)
- ERIC file on OVID (accessible through RUL)
Notice how each displays & searches the thesaurus in its own way, but the principles are still the same
Oh well…

ERIC online thesaurus on ERIC
Allows for:
- searching for words that are included in descriptors, by category or in all categories
- browsing alphabetically
- browsing in one of about 40 categories
A search for library in all categories found 76 descriptors that include “library”
Out of these, we selected library education

ERIC online thesaurus on ERIC
Descriptor: library education

ERIC thesaurus on DIALOG
In a convoluted way, the ERIC thesaurus (and other ones) can be displayed on DIALOG (and on other vendors, such as OVID)
How?
- begin in file 1 (ERIC)
- then expand a desired term; here we used the term library
- you will see under R that certain terms have related terms, meaning that these are thesaurus entries
- then expand on one of those to see its related terms
- then you can browse & choose which ones to use in a search
And here are Print Screens of the process

going …
Expand library

going …
RT indicates related terms
45237 items have library
This one has 14 related terms

going …
We now chose the descriptor LIBRARY ADMINISTRATION and expand on that one
Neat trick: you can expand on expand & get related terms

going …
These are now R terms of various types
The 14 related terms for this one are listed
You can expand on any of these to see its other RT
You can also select any of these to search

going …
We have now selected r10: library expenditures

going …
And this is what we got
Now we can view some items in a chosen format, or we can further modify this search (add, refine, …)

gone
This is one of the items we got
Descriptors used for this item; descriptors with * are major
Additional index terms

ERIC thesaurus on OVID (accessed through RUL)
For library, ask to map it as a thesaurus term

going …
There are more down there, but we choose this one to expand

going …
Entries for descriptor Electronic Libraries
Continue to search for AND

going …
Retrieved & ready to display

gone
Choose the format you want for this item

Relevance feedback
A method for using information in items judged relevant to further refine or change the search
- e.g. in relevant items we can browse titles, descriptors, identifiers, abstracts … to get leads for further search terms & tactics
- in some advanced systems this may be done automatically
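
As a rough illustration of the automatic variant, the sketch below counts which descriptors recur in items the searcher judged relevant and proposes those not yet in the query. It is one simple approach among many, not the method of any particular system; the item records, the descriptor assignments, and the function name `suggest_terms` are assumptions made for the example.

```python
from collections import Counter


def suggest_terms(relevant_items, query_terms, top_n=3):
    """Rank descriptors of relevant items that the query does not use yet."""
    used = {term.lower() for term in query_terms}
    counts = Counter(
        descriptor
        for item in relevant_items
        for descriptor in item["descriptors"]
        if descriptor.lower() not in used
    )
    return [descriptor for descriptor, _ in counts.most_common(top_n)]


# Hypothetical items judged relevant, each with made-up descriptors.
relevant_items = [
    {"descriptors": ["Library Expenditures", "Library Administration", "Budgeting"]},
    {"descriptors": ["Library Expenditures", "Library Administration",
                     "Budgeting", "Public Libraries"]},
    {"descriptors": ["Library Expenditures", "Library Administration"]},
]
suggestions = suggest_terms(relevant_items, ["Library Expenditures"], top_n=2)
print(suggestions)  # ['Library Administration', 'Budgeting']
```

The same counting could be applied to words from titles or abstracts; descriptors are used here only because they are already normalized.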

Query expansion
A method for adding, modifying, or changing search terms in a query
- to broaden, narrow, focus, change … terms
Many sources can be used
- relevance feedback, thesauri, dictionaries, textbooks, documents, catalogs, & people: users, colleagues, your own mind & experience
Some systems suggest terms for query expansion
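
With a thesaurus as the source, expansion can be sketched as pulling in a term's narrower and related terms and joining them with OR. The thesaurus fragment, the quoted-phrase OR syntax, and the function name `expand_query` are all assumptions for illustration; real systems each have their own query language.

```python
def expand_query(term, thesaurus, include=("NT", "RT")):
    """Build an OR query from a term plus its chosen thesaurus relations."""
    entry = thesaurus.get(term, {})
    terms = [term]
    for relation in include:
        for candidate in entry.get(relation, []):
            if candidate not in terms:
                terms.append(candidate)
    return " OR ".join(f'"{t}"' for t in terms)


# Invented thesaurus fragment for illustration only.
thesaurus = {
    "libraries": {
        "BT": ["information services"],
        "NT": ["electronic libraries", "public libraries"],
        "RT": ["library administration"],
    },
}
broadened = expand_query("libraries", thesaurus)
generalized = expand_query("libraries", thesaurus, include=("BT",))
print(broadened)
print(generalized)
```

Choosing which relations to include is the searcher's decision: adding narrower and related terms casts a wider net around the same concept, while adding a broader term moves the search up the hierarchy. A term missing from the thesaurus is simply returned on its own.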

Conclusion
At the base of all searching are:
- terms
- vocabularies
- languages
but a variety exists
In reality, in searching there is no completely controlled or uncontrolled vocabulary
- it is a matter of degree &, most importantly, a matter of mastery

Symbolically: controlled & free vocabulary

thank you!