LIS618 lecture 2 Thomas Krichel 2002-09-22

Similar presentations
Using CAB Abstracts to Search for Articles. Objectives Learn what CAB Abstracts is Know the main features of CAB Abstracts Learn how to conduct searches.

LIS618 lecture 2 Thomas Krichel Structure Theory: information retrieval performance Practice: more advanced dialog.
LIS618 lecture 3 Thomas Krichel Structure Theory: discussion of the Boolean model Theory: the vector model Practice: Nexis.
LIS618 lecture 3 Thomas Krichel Structure Revision of what was done last week. Theory: discussion of the Boolean model Theory: the vector.
LIS618 lecture 6 Thomas Krichel structure DIALOG –basic vs additional index –initial database file selection (files) Lexis/Nexis.
LIS618 lecture 1 Thomas Krichel Structure of talk Recap on Boolean Before online searching Working with DIALOG –Overview –Search command –Bluesheets.
Chapter 5: Introduction to Information Retrieval
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
LIS618 lecture 6 Thomas Krichel Structure Probabilistic model News from the front line –Open WorldCat Pilot –Amazon Search Inside the book.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Leveraging Your Taxonomy to Increase User Productivity MAIQuery and TM Navtree.
Text Operations: Preprocessing. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis, stopwords elimination,
Engineering Village ™ ® Basic Searching On Compendex ®
IR Models: Overview, Boolean, and Vector
Information Retrieval in Practice
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Search Strategies Online Search Techniques. Universal Search Techniques Precision- getting results that are relevant, “on topic.” Recall- getting all.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
Modern Information Retrieval
WMES3103 : INFORMATION RETRIEVAL
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
Basics Computer Internet Search Strategy. Computer Basics IP address: Internet Protocol Address An identifier for a computer or device on a network The.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
LIS618 lecture 11 i/r performance evaluation Thomas Krichel
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
LIS618 lecture 4 before searching + introduction to dialog Thomas Krichel
LIS510 lecture 3 Thomas Krichel information storage & retrieval this area is now more know as information retrieval when I dealt with it I.
LIS618 lecture 1 Thomas Krichel economic rational for traditional model In olden days the cost of telecommunication was high. database use.
Bio-Medical Information Retrieval from Net By Sukhdev Singh.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
IL Step 2: Searching for Information Information Literacy 1.
Planning a search strategy.  A search strategy may be broadly defined as a conscious approach to decision making to solve a problem or achieve an objective.
Medline on OvidSP. Medline Facts Extensive MeSH thesaurus structure with many synonyms used in mapping and multidatabase searching with Embase Thesaurus.
LIS618 lecture 4 Thomas Krichel Structure of talk Before online searching Introduction to online searching Introduction to DIALOG –Overview.
LIR 10 Week 7 Boolean Searching and Online Periodical Databases.
IL Step 3: Using Bibliographic Databases Information Literacy 1.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
4 1 SEARCHING THE WEB Using Search Engines and Directories Effectively New Perspectives on THE INTERNET.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
LIS618 lecture 3 Thomas Krichel Structure of talk The blue sheet Working with Dialog Nexis.com.
Web- and Multimedia-based Information Systems Lecture 2.
Information Retrieval
1 CS 430: Information Discovery Lecture 5 Ranking.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Information Retrieval in Practice
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
LECTURE 3: DATABASE SEARCHING PRINCIPLES
Text Based Information Retrieval
CS 430: Information Discovery
Data Mining Chapter 6 Search Engines
IL Step 3: Using Bibliographic Databases
IL Step 2: Searching for Information
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Presentation transcript:

LIS618 lecture 2 Thomas Krichel

Structure of talk General round trip on theoretical matters, part –Information retrieval models: vector model, probabilistic model –Retrieval performance evaluation –Query languages Introduction to online searching Introduction to DIALOG –Overview –Bluesheets

vector model associates weights with each index term appearing in the query and in each database document. relevance can be calculated as the cosine of the angle between the two vectors, i.e. their dot product divided by the product of their lengths (the square root of the sum of the squared components of each vector). This measure varies between 0 and 1.
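A minimal sketch of this calculation in Python (the function name and the two weight vectors are invented for illustration; both vectors are assumed to be defined over the same term vocabulary):

    import math

    def cosine(doc_weights, query_weights):
        # dot product of the two vectors
        dot = sum(d * q for d, q in zip(doc_weights, query_weights))
        # Euclidean lengths (square root of the sum of squared components)
        doc_norm = math.sqrt(sum(d * d for d in doc_weights))
        query_norm = math.sqrt(sum(q * q for q in query_weights))
        if doc_norm == 0 or query_norm == 0:
            return 0.0
        return dot / (doc_norm * query_norm)

    # with non-negative term weights the result lies between 0 and 1
    print(cosine([0.5, 0.0, 0.8], [0.7, 0.2, 0.0]))   # roughly 0.51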

tf/idf weighting technique Let n_i be the number of documents in which term i appears. Let F_i_j be the number of times term i appears in document j. The normalized frequency f_i_j is given by f_i_j = F_i_j / max_l(F_l_j), that is, the raw frequency divided by the maximum raw frequency achieved by any term in document j.

tf/idf weighting technique Let N be the total number of documents. Then the most frequently used weighting scheme is w_i_j = f_i_j * log(N/n_i). There are other methods, but they are variations on this one.
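A small sketch of these two formulas over an invented toy corpus (tokenization is assumed to have happened already):

    import math
    from collections import Counter

    docs = [["information", "retrieval", "retrieval"],
            ["vector", "model", "information"],
            ["probabilistic", "model"]]

    N = len(docs)                      # number of documents
    n = Counter()                      # n_i: number of documents containing term i
    for doc in docs:
        n.update(set(doc))

    def tf_idf_weights(doc):
        raw = Counter(doc)             # F_i_j: raw frequency of term i in document j
        max_raw = max(raw.values())    # max_l(F_l_j)
        # w_i_j = f_i_j * log(N / n_i), with f_i_j = F_i_j / max_l(F_l_j)
        return {term: (freq / max_raw) * math.log(N / n[term])
                for term, freq in raw.items()}

    print(tf_idf_weights(docs[0]))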

advantages of vector model term weighting improves performance sorting is possible easy to compute, therefore fast results are difficult to improve without –query expansion –user feedback cycle

probabilistic model (outline only) starts with the assumption that there is a subset of documents that forms the ideal answer set the query process specifies properties of that answer set the query terms can be used to form a probability that a document is part of the answer set then we start an iterative process with the user to learn more about the characteristics of the answer set

recursive method If we assume that the probability that a document among a set of initially retrieved documents is relevant is proportional to the appearance in it of index terms that are part of the query, the probability can be further refined.

probabilistic model For any user requirement, we assume that there is an answer set and that the probability that the user finds a document interesting depends only on the document and the query. Then the similarity of the document to the query can be expressed as s = (probability that the document is part of the answer set) / (probability that it is not part of the answer set). There are ways to calculate this (with some more assumptions).
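Under the usual further assumptions (binary term weights and statistically independent index terms), this odds ratio leads, up to constant factors, to a textbook ranking formula of the form s(d_j, q) ~ sum over query terms i of w_i_q * w_i_j * (log(p_i/(1-p_i)) + log((1-q_i)/q_i)), where p_i is the probability that term i appears in a document of the answer set and q_i the probability that it appears in a document outside it; these probabilities are what the iterative process with the user refines.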

retrieval performance evaluation There are two classic measures. Both assume that there is a known answer set. Recall is the fraction of the relevant documents that the query result has captured. Precision is the fraction of the retrieved documents that are relevant.

recall and precision curves assume that all the retrieved documents arrive at once and are examined in order. during that process, the user discovers more and more relevant documents, so recall increases. during the same process, at least eventually, there will be fewer and fewer useful documents, so precision (usually) declines.

Example let the answer set be the ten documents {4,7,5,3,6,1,0,2,8,9} and let non-relevant documents be represented by letters. A query returns the following ranked result: 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4. for the first document, (recall, precision) is (10%,100%); for the third, (20%,67%); then follow (30%,50%), (40%,40%), (50%,33%).

recall/precision curves Such curves can be formed for each query. An average curve, for each recall level, can be calculated for several queries. Recall and precision levels can also be used to calculate two single-valued summaries.

average precision at seen relevant documents sum the precision levels obtained at each new relevant document and divide by the number of relevant documents seen. In our example, this is (100% + 67% + 50% + 40% + 33%) / 5, roughly 0.58. This measure favors retrieval methods that get the relevant documents to the top.

R-precision a more ad hoc measure. Let R be the size of the answer set. Take the first R results of the query, find the number of relevant documents among them, and divide by R. In our example, 4 of the first 10 results are relevant, so the R-precision is 0.4. An average can be calculated over a number of queries.
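A short sketch that recomputes the numbers used in the example above: the (recall, precision) pairs at each relevant document, the average precision at seen relevant documents, and the R-precision. The document identifiers are the ones from the example.

    # relevant documents (the answer set, R = 10) and the ranked query result
    relevant = {"4", "7", "5", "3", "6", "1", "0", "2", "8", "9"}
    ranking = ["7", "a", "3", "b", "c", "9", "n", "j", "l", "5",
               "r", "o", "s", "e", "4"]

    precisions = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            recall = found / len(relevant)
            precision = found / rank
            precisions.append(precision)
            print(f"recall {recall:.0%}  precision {precision:.0%}")

    # average precision at seen relevant documents (about 0.58 here)
    print(sum(precisions) / len(precisions))

    # R-precision: relevant documents among the first R results, divided by R
    R = len(relevant)
    print(sum(doc in relevant for doc in ranking[:R]) / R)   # 0.4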

critique of recall & precision recall has to be estimated by an expert recall is very difficult to estimate in a large collection the measures are most appropriate to a situation where queries are run in batch mode; they are difficult to reconcile with the idea of interactive use. there are some other measures.

simple queries single-word queries –one word only –hopefully some word combinations are understood as one word, e.g. on-line context queries –phrase queries (be aware of stop words) –proximity queries, which generalize phrase queries Boolean queries

simple pattern queries prefix queries (e.g. "anal" for analogy) suffix queries (e.g. "oral" for choral) substring queries (e.g. "al" for talk) ranges (e.g. from "held" to "hero") within a distance, usually the Levenshtein distance (i.e. the minimum number of insertions, deletions, and replacements), of the query term
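A compact sketch of the Levenshtein distance just mentioned; an engine could then accept any indexed term within, say, distance 1 or 2 of the query term. The example words are arbitrary.

    def levenshtein(a, b):
        # dynamic programming over prefixes; previous[j] holds the distance
        # between the first i-1 characters of a and the first j characters of b
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # replacement
            previous = current
        return previous[-1]

    print(levenshtein("colour", "color"))   # 1
    print(levenshtein("held", "hero"))      # 2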

regular expressions come from UNIX computing built from strings in which certain characters are metacharacters. example: "pro(blem|tein)s?" matches problem, problems, protein and proteins. example: New.*y matches "New Jersey" and "New York City", and also "New Delhy". great variety of dialects, usually very powerful. Extremely important in digital libraries.
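The two patterns above, tried out with Python's re module (the candidate strings are mine):

    import re

    # the first pattern matches problem, problems, protein, and proteins
    word = re.compile(r"pro(blem|tein)s?")
    for candidate in ["problem", "problems", "protein", "proteins", "prose"]:
        print(candidate, bool(word.fullmatch(candidate)))

    # the second pattern matches anything with a "y" somewhere after "New"
    place = re.compile(r"New.*y")
    for candidate in ["New Jersey", "New York City", "New Delhy", "New Delhi"]:
        print(candidate, bool(place.search(candidate)))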

structured queries make use of document structure the simplest example is when the documents are database records; we can then search for terms in a certain field only. if there is sufficient structure to a field's contents, the field can be interpreted as meaning something different from the words it contains. example: dates
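A toy illustration of a fielded search over record-like documents (the records and field names are invented): the date field is compared as a date rather than matched as words.

    from datetime import date

    records = [
        {"title": "vector model notes",   "author": "Krichel", "date": date(2002, 9, 22)},
        {"title": "Boolean model review", "author": "Smith",   "date": date(2001, 3, 1)},
    ]

    # keep records whose title contains "model" and whose date is from 2002 on
    hits = [r for r in records
            if "model" in r["title"] and r["date"] >= date(2002, 1, 1)]
    print([r["title"] for r in hits])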

query protocols There are some standard languages –Z39.50 queries –CCL, the "common command language", is a development of Z39.50 –CD-RDx, "compact disk read only data exchange", is supported by US government agencies such as the CIA and NASA –SFQL, the "structured full-text query language", is built on SQL

document preprocessing operations done on the documents before indexing –lexical analysis –elimination of stop words –stemming of words –selection of index terms –construction of term categorization structures this area has been receiving declining attention

lexical analysis divides a stream of characters into a stream of words seems easy enough, but… should we keep numbers? hyphens: compare "state-of-the-art" with "b-52" removal of punctuation, but what about "333B.C."? casing: compare "bank" and "Bank"
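A toy lexical analyser illustrating one possible set of choices (lower-case everything, keep digits, keep hyphenated forms as single tokens); these choices are illustrative, not the only sensible ones.

    import re

    # one or more letter/digit runs, optionally joined by hyphens
    TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

    def tokenize(text):
        return TOKEN.findall(text.lower())

    # note what happens to "333B.C." and to the two spellings of "bank"
    print(tokenize('The B-52 is state-of-the-art, at least since 333B.C.; ask any Bank or bank.'))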

elimination of stop words some words carry no meaning and should be eliminated in fact any word that appears in 80% of all documents is pretty much useless, but consider a search for "to be or not to be".
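A deliberately tiny stop-word filter, showing how it destroys the query quoted above (the stop list is illustrative only):

    STOP_WORDS = {"to", "be", "or", "not", "the", "of", "a", "and", "in"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words("to be or not to be".split()))   # prints []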

stemming in general, users search for the occurrence of a term irrespective of its grammatical form plurals, gerund forms, and past tenses can be subject to stemming the most important algorithm is by Porter evidence on its effect on retrieval is mixed
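Porter's algorithm is implemented in several packages; a small sketch, assuming the NLTK library is installed:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    # grammatical variants are typically reduced to a common stem
    for word in ["search", "searching", "searched", "searches", "retrieval"]:
        print(word, "->", stemmer.stem(word))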

index term selection some engines try to capture nouns only some nouns that frequently appear together can be considered to be one index term, such as "computer science" Most web engines, however, index all words. why?

thesauri a list of words and, for each word, a list of related words –synonyms –broader terms –narrower terms used –to provide a consistent vocabulary for indexing and searching –to assist users with locating terms for query formulation –to allow users to broaden or narrow a query
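A toy thesaurus entry and a naive automatic expansion, just to make the structure concrete (the vocabulary is invented); the next slide explains why doing this blindly can hurt.

    # each entry lists synonyms, broader terms, and narrower terms
    thesaurus = {
        "car": {"synonyms": ["automobile"],
                "broader":  ["vehicle"],
                "narrower": ["sedan", "hatchback"]},
    }

    def expand(term):
        entry = thesaurus.get(term, {})
        # a naive broadening: add synonyms and narrower terms to the query
        return [term] + entry.get("synonyms", []) + entry.get("narrower", [])

    print(expand("car"))   # ['car', 'automobile', 'sedan', 'hatchback']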

use of thesauri most users want to get a quick response often the selection of terms is erroneous frequently the relationship between terms in the query is badly served by the relationships in the thesaurus. Thus thesaurus expansion of an initial query (if performed automatically) can lead to bad results.

Online database searching

before a search I what is the purpose? –brief overview –comprehensive search what is the perspective on the topic? –scholarly –technical –business –popular

before a search II What type of information? –Full text –Bibliographic –Directory –Numeric Are there any known sources? –Authors –Journals –Papers –Conferences

before a search III What are the language restrictions? What, if any, are the cost restrictions? How current does the data need to be? How much of each record is required?

DIALOG

Literature pdf_sem/dlg1.pdf pdf_sem/dlg2.pdf pdf_sem/dlg3.pdf pdf_sem/dlg4.pdf

databank over 500 different databases –references and abstracts of published literature –business information and financial data –complete text of articles and news stories –statistical tables –directories Two interfaces to all this stuff: –Guided search (for neophytes) –Command search (for masters) → for us!!

Four steps in a search Use the Databases Selection Tool to select databases Identify search terms Use Dialog basic commands to conduct a search View records online

B E S T strategy I begin –b 630,636 –b papersmj, not 630 expand –e co=long island university –e au=krichel, t

B E S T strategy II select –s (mate?(N)drink?) or (lex(N)para?) –s s1 and s2 type –type s1/3/1,6

Command search The first thing to be done is to select a database. There are 8 categories: –Government –Medicine & Pharmaceuticals –News –Science & Technology –Business –Intellectual property –Reference –Social Sciences and Humanities from there we go to command search

databases menus databases are ordered in hierarchical fashion at each level a Boolean search can be executed –on all of Dialog –on the databases in the current hierarchical level

searching the result may be just a blank screen otherwise, a table with the file number, the database name and the number of hits appears wait until the display is complete… sorting of the databases by the number of hits for the current query is possible

blue sheet each database name is linked to a blueish pop-up window called the blue sheet for the database the contents of the bluesheet are covered later at this stage we choose a database and hit "begin". We see that there is a command selected: "be numbers", where the numbers are those of the selected databases, separated by commas.

finding a database file 411 contains the database of databases 'sf category' selects files belonging to a category categories are listed at 'rank files' will rank the results 'b ref,ref' will select databases using rank references.

closer look at the bluesheet file description subject coverage (free vocabulary) format options, lists all formats –by number (internal) –by dialog web format (external, i.e. cross-database) search options –basic index, i.e. subject contents –additional index, i.e. non-subject

search options: basic index 'select' without qualifiers searches all fields in the basic index the bluesheet lists the field indicators available for a database also note whether a field is indexed by word or by phrase. proximity searching only works with word indices.

other search options the additional indices section lists those terms that can lead a query; often, these are phrase indexed. the special features section lists other features that the database has and that can be used in queries

Thank you for your attention!