Information Retrieval: Indexing Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)

Roadmap
What is a document?
Representing the content of documents
– Luhn's analysis
– Generation of document representatives
– Weighting
Inverted files

Indexing Language Language used to describe documents and queries Index terms – selected subset of words – derived from the text or arrived at independently Keyword searching – statistical analysis of documents based on word occurrence frequency – automated, efficient and potentially inaccurate Searching using controlled vocabularies – more accurate results, but time consuming if documents are manually indexed

Luhn's analysis Resolving power of significant words: – the ability of words to discriminate document content – peaks at the rank-order position halfway between the two cut-offs

Generating document representatives

Input text: full text, abstract, title Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in input text Document indexed by a class name if one of its significant words occurs as a member of that class Phases: – identify words - Lexical Analysis (Tokenising) – removal of high frequency words – suffix stripping (stemming) – detecting equivalent stems – thesauri – others (noun-phrase, noun group, logical formula, structure) – Index structure creation

Process View: Document → Lexical Analysis → Stopword Removal → Stemming → Indexing Features

Lexical Analysis The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms) – treating digits, hyphens, punctuation marks, and the case of the letters.
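A minimal sketch of the lexical-analysis step, in Python. The policy shown (lower-casing, splitting on hyphens and punctuation, discarding purely numeric tokens) is one common choice rather than the one the slides prescribe, and the function name is ours.

```python
import re

def tokenize(text):
    """Lexical analysis: turn a character stream into candidate index terms.
    Assumed policy: lower-case, split hyphenated words, drop punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("State-of-the-art retrieval, 2024: The Quick Brown Fox!"))
# ['state', 'of', 'the', 'art', 'retrieval', 'the', 'quick', 'brown', 'fox']
```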

Stopword Removal Removal of high frequency words – a list of stop words implements Luhn's upper cut-off – filters out words with very low discrimination value for retrieval purposes – examples: "been", "a", "about", "otherwise" – compare the input text with the stop list – reduction: between 30 and 50 per cent

Conflation Conflation reduces word variants into a single form – similar words generally have similar meaning – retrieval effectiveness increased if the query is expanded with those which are similar in meaning to those originally contained within it. Stemming algorithm is a conflation procedure – reduces all words with same root into a single root

Different forms – stemming
Stemming – matching the query term "forests" to "forest" and "forested" – "choke", "choking", "choked"
Suffix removal – removal of suffixes, e.g. worker → work – Porter algorithm: remove the longest suffix
Porter algorithm – errors occur, e.g. "equal" → "eq": the rules are heuristic – stems are more effective for retrieval than ordinary word forms
Detecting equivalent stems – example: ABSORB- and ABSORPT-
Stemmers remove affixes – what about prefixes? e.g. megavolt

Plural stemmer Plurals in English – if the word ends in "ies" but not "eies" or "aies": "ies" → "y" – if the word ends in "es" but not "aes", "ees" or "oes": "es" → "e" – if the word ends in "s" but not "us" or "ss": "s" → "" – the first applicable rule is the one used
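The three rules above translate directly into code; the sketch below (a Harman-style "S" stemmer, with a function name of our choosing) applies the first rule that matches.

```python
def plural_stem(word):
    """Plural ('S') stemmer: apply the first applicable rule, as on the slide."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"       # queries -> query
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"       # judges -> judge
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]             # forests -> forest
    return word                      # glass, corpus -> unchanged

for w in ["queries", "judges", "forests", "glass", "corpus"]:
    print(w, "->", plural_stem(w))
```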

Processing “The destruction of the amazon rain forests” Case normalisation Stop word removal. – From fixed list – “destruction amazon rain forests” Suffix removal (stemming). – “destruct amazon rain forest”
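A sketch of the full preprocessing chain applied to the example above. The tiny stop list and the crude suffix-stripping rule are illustrative stand-ins for a real stop list and a proper stemmer such as Porter's; the names are ours.

```python
import re

STOPWORDS = {"the", "of", "a", "an", "and"}      # tiny illustrative stop list

def strip_suffix(word):
    """Crude stand-in for a real stemmer: remove one common suffix if the stem stays long enough."""
    for suffix in ("ion", "ions", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lexical analysis + case normalisation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop word removal
    return [strip_suffix(t) for t in tokens]            # suffix removal (stemming)

print(preprocess("The destruction of the amazon rain forests"))
# ['destruct', 'amazon', 'rain', 'forest']
```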

Thesauri A collection of terms along with some structure or relationships between them, plus scope notes etc. 1. provide a standard vocabulary for indexing and searching 2. assist the user in locating terms for proper query formulation 3. provide a classification hierarchy for broadening and narrowing the current query according to user need – Equivalence: synonyms, preferred terms – Hierarchical: broader/narrower terms (BT/NT) – Association: related terms across the hierarchy (RT)
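A minimal sketch of how the three relationship types might be stored and used to broaden a query; the entries in this toy thesaurus are invented for illustration.

```python
# USE = equivalence (synonyms), BT/NT = broader/narrower terms, RT = related terms.
THESAURUS = {
    "rainforest": {"USE": ["rain forest"], "BT": ["forest"], "NT": [], "RT": ["amazon"]},
    "forest": {"USE": ["woodland"], "BT": ["vegetation"], "NT": ["rainforest"], "RT": ["deforestation"]},
}

def broaden(term):
    """Broaden a query term with its synonyms and broader terms."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get("USE", []) + entry.get("BT", [])

print(broaden("rainforest"))   # ['rainforest', 'rain forest', 'forest']
```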

Thesauri Examples: WordNet

Faceted Classification

Thesauri Examples: AAT Art and Architecture Thesaurus

Hierarchical Classifications Alphanumeric coding schemes Subject classifications A taxonomy that represents a classification or kind-of hierarchy. Examples: Dewey Decimal, AAT, SHIC, ICONCLASS. ICONCLASS example (the "door" hierarchy):
– 41A32 Door
– 41A322 Closing the door (action associated with a door)
– 41A323 Monumental door (kind of a door)
– 41A324 Metalwork of a door (something attached to a door)
– 41A3241 Door-knocker
– 41A325 Threshold
– 41A327 Door-keeper, houseguard

Terminology/Controlled vocabulary The descriptors from a thesaurus form a controlled vocabulary Normalise indexing concepts Identification of indexing concepts with clear semantics Retrieval based on concepts rather than terms Good for specific domains (e.g., medical) Problematic for general domains (large, new, dynamic)

No One Classification

Generating document representatives - Outcome Class – words with the same stem Class name – the stem Document representative – list of class names (index terms or keywords) Same process applied to the query

Precision and Recall Precision – Ratio of the number of relevant documents retrieved to the total number of documents retrieved. – The number of hits that are relevant Recall – Ratio of number of relevant documents retrieved to the total number of relevant documents – The number of relevant documents that are hits

Precision and Recall (figure): the document space, with the set of retrieved documents and the set of relevant documents overlapping to different degrees – low precision/low recall, low precision/high recall, high precision/low recall, high precision/high recall

Precision and Recall The user isn't usually given the answer set A at once The documents in A are sorted by degree of relevance (ranking), which the user examines Recall and precision vary as the user proceeds with the examination of the answer set A With R the set of relevant documents and A the set of retrieved documents:
Precision = |R ∩ A| / |A|
Recall = |R ∩ A| / |R|
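A small sketch computing the two measures from sets of document identifiers; the example sets are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision = |R ∩ A| / |A|,  Recall = |R ∩ A| / |R|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 relevant in total, 3 relevant hits.
print(precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d5", "d9"}))
# (0.75, 0.6)
```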

Precision and Recall Trade Off Increasing the number of documents retrieved is likely to retrieve more of the relevant documents and thus increase recall, but typically also retrieves more inappropriate documents and thus decreases precision (figure: precision falls as recall approaches 100%)

Index term weighting Effectiveness of an indexing language: Exhaustivity – number of different topics indexed – high exhaustivity: high recall and low precision Specificity – ability of the indexing language to describe topics precisely – high specificity: high precision and low recall

Index term weighting Exhaustivity – related to the number of index terms assigned to a given document Specificity – number of documents to which a term is assigned in a collection – related to the distribution of index terms in collection Index term weighting – index term frequency: occurrence frequency of a term in document – document frequency: number of documents in which a term occurs

IR as Clustering A query is a vague specification of a set of objects A IR is reduced to the problem of determining which documents are in the set A and which are not Intra-cluster similarity: – what are the features that better describe the objects in A? Inter-cluster dissimilarity: – what are the features that better distinguish the objects in A from the remaining objects in the collection C? (figure: the document collection C with the retrieved documents A forming a cluster inside it)

Index term weighting Weight(t,d) = tf(t,d) × idf(t)
– N: number of documents in the collection
– n(t): number of documents in which term t occurs
– idf(t): inverse document frequency of term t
– occ(t,d): number of occurrences of term t in document d
– t_max: the term in document d with the highest number of occurrences
– tf(t,d): term frequency of t in document d

Index term weighting Intra-cluster similarity – the raw frequency of a term t inside a document d – a measure of how well the term describes the document contents Inter-cluster dissimilarity – inverse document frequency – the inverse of the frequency of term t among the documents in the collection – terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one
Normalised frequency of term t in document d: tf(t,d) = occ(t,d) / occ(t_max,d)
Inverse document frequency: idf(t) = log(N / n(t))
Weight(t,d) = tf(t,d) × idf(t)

Term weighting schemes
Best known scheme (term frequency × inverse document frequency):
weight(t,d) = (occ(t,d) / occ(t_max,d)) × log(N / n(t))
Variation for query term weights:
weight(t,q) = (0.5 + 0.5 × occ(t,q) / occ(t_max,q)) × log(N / n(t))
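A sketch of both weighting formulas in Python. The function names are ours, base-10 logarithms are assumed, and the counts below are hypothetical except where they echo the worked example on the next slide.

```python
import math

def tf(term, doc_counts):
    """Normalised term frequency: occ(t,d) / occ(t_max,d)."""
    return doc_counts.get(term, 0) / max(doc_counts.values())

def idf(term, n_docs, doc_freq):
    """Inverse document frequency: log(N / n(t)); base-10 log assumed."""
    return math.log10(n_docs / doc_freq[term])

def doc_weight(term, doc_counts, n_docs, doc_freq):
    return tf(term, doc_counts) * idf(term, n_docs, doc_freq)

def query_weight(term, query_counts, n_docs, doc_freq):
    """Query-term variation: (0.5 + 0.5 * occ(t,q) / occ(t_max,q)) * idf(t)."""
    norm = 0.5 + 0.5 * query_counts.get(term, 0) / max(query_counts.values())
    return norm * idf(term, n_docs, doc_freq)

doc_counts = {"machines": 19, "people": 25, "poverty": 5}   # occurrences in one document
doc_freq = {"machines": 50, "people": 80, "poverty": 2}     # document frequencies over N = 100 docs
print(round(doc_weight("machines", doc_counts, 100, doc_freq), 2))   # 19/25 * log10(100/50) ≈ 0.23
```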

Example Term occurrence counts in a document: nuclear 7, computer 9, poverty 5, unemployment 1, luddites 3, machines 19, people 25, and 49 ("and" is removed as a stopword, so "people", with 25 occurrences, is the most frequent remaining term). With N = 100 documents in the collection and base-10 logs:
Weight(machine) = 19/25 × log(100/50) = 0.76 × 0.30 ≈ 0.23
Weight(luddite) = 3/25 × log(100/2) = 0.12 × 1.70 ≈ 0.20
Weight(poverty) = 5/25 × log(100/2) = 0.2 × 1.70 ≈ 0.34

Inverted Files Word-oriented mechanism for indexing text collections to speed up searching Searching: – vocabulary search (query terms) – retrieval of occurrences – manipulation of occurrences

Original Document view (figure): a term–document table showing, for each of the documents D1, D2 and D3, which of the terms cosmonaut, astronaut, moon, car and truck it contains

Inverted view: term–document matrix
          D1 D2 D3
cosmonaut  1  0  0
astronaut  0  1  0
moon       1  1  0
car        1  0  1
truck      1  0  1

Inverted index (term → postings list)
cosmonaut → D1
astronaut → D2
moon → D1, D2
car → D1, D3
truck → D1, D3
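A sketch of building such an index from raw text; the three toy documents are our own reconstruction of the cosmonaut/astronaut example, not taken verbatim from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "D1": "cosmonaut moon car truck",
    "D2": "astronaut moon",
    "D3": "car truck",
}
print(build_inverted_index(docs))
# {'cosmonaut': ['D1'], 'moon': ['D1', 'D2'], 'car': ['D1', 'D3'],
#  'truck': ['D1', 'D3'], 'astronaut': ['D2']}
```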

Inverted File The speed of retrieval is maximised by considering only those terms that have been specified in the query This speed is achieved only at the cost of very substantial storage and processing overheads

Components of an inverted file
– a vocabulary (header) entry for each term: field type, frequency, and a pointer into the postings file
– a postings file entry for each occurrence: document number and within-document frequency

Producing an Inverted file (figure): eight short documents ("the quick brown fox...", "now is the time for all good men...", etc.) are tokenised; each vocabulary term (aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time) is paired with a postings list of the numbers of the documents in which it occurs

An Inverted file (figure): the resulting inverted file, mapping each term in the vocabulary to its postings list of document numbers

Searching Algorithm For each document D, set Score(D) = 0 For each query term – search the vocabulary list – pull out the postings list – for each document J in the postings list, Score(J) = Score(J) + 1
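A runnable sketch of this term-at-a-time scoring loop over an inverted index like the one built earlier; the index literal here is made up.

```python
from collections import defaultdict

def search(query_terms, inverted_index):
    """Each query term found in a document adds 1 to that document's score."""
    scores = defaultdict(int)
    for term in query_terms:                              # vocabulary search
        for doc_id in inverted_index.get(term, []):       # postings retrieval
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index = {"car": ["D1", "D3"], "moon": ["D1", "D2"], "truck": ["D1", "D3"]}
print(search(["car", "moon"], index))   # [('D1', 2), ('D3', 1), ('D2', 1)]
```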

What Goes in a Postings File? Boolean retrieval – Just the document number Ranked Retrieval – Document number and term weight (TF*IDF,...) Proximity operators – Word offsets for each occurrence of the term Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
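A sketch of what the three kinds of postings entries could look like as data structures; the shapes and numbers are illustrative, except the proximity example, which mirrors the slide's Doc 3 / Doc 13 offsets.

```python
# Boolean retrieval: just the document numbers.
boolean_postings = {"moon": [1, 2]}

# Ranked retrieval: (document number, term weight) pairs, e.g. tf*idf.
ranked_postings = {"moon": [(1, 0.23), (2, 0.11)]}

# Proximity operators: document number -> word offsets of each occurrence,
# e.g. the term occurring in Doc 3 at offsets 17 and 36, and in Doc 13 at 3 and 45.
positional_postings = {"term": {3: [17, 36], 13: [3, 45]}}
```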

How Big Is the Postings File? Very compact for Boolean retrieval – About 10% of the size of the documents If an aggressive stopword list is used Not much larger for ranked retrieval – Perhaps 20% Enormous for proximity operators – Sometimes larger than the documents But access is fast - you know where to look

Indexing and matching (figure): documents and the query each pass through tokenisation, stop word removal and stemming; the resulting indexing features are stored in an inverted index, the query features are matched against it, and each document di, dj, dk receives a score s1, s2, s3 with s1 > s2 > s3 > ... determining the ranking

Similarity Matching The process in which we compute the relevance of a document for a query A similarity measure comprises – a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance – a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
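The slides do not fix a particular similarity coefficient, so the sketch below shows one common choice, the cosine between the query and document term-weight vectors; the weights in the example are hypothetical.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Similarity coefficient: cosine of the angle between two term-weight vectors."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

print(cosine_similarity({"forest": 0.8, "amazon": 0.6}, {"forest": 0.23, "rain": 0.4}))
```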