1
Inf 722 Information Organisation
Class notes: Information Retrieval
Jagdish S. Gangolly
1/17/2019 Inf 722 Information Organisation (Fall 2007) (Gangolly)
2
FOA Process
- Asking a question (query formulation)
- Constructing an answer (retrieval algorithms)
- Assessing the answer (feedback on relevance)
3
FOA Process
Query language
- Natural or artificial
- Vocabulary
- Syntax: operators, arguments
- Query expansion, specialization, disambiguation, relevance feedback
4
FOA Process
Constructing the answer
- Is the information need accurately translated into the query?
- How can the answer be provided in a form suitable to the user?
- Can background be provided so that the user can verbalise the information need better?
- How can the query and the corpus be represented efficiently and effectively?
5
FOA Process
Constructing the answer (contd.)
- Generate a set of index terms that renders the documents in the collection as different from one another as possible
- Conflation algorithms:
  - Removal of function/fluff/stop words (usually closed-class words)
  - Stripping suffixes (stemming; lemmatization instead maps each word to its dictionary form)
  - Detection of equivalent/associated words
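The first two conflation steps can be sketched as follows. This is a minimal illustration, not the method used in the course: the stop list and suffix list are hypothetical and far too small for real use, and the suffix stripper is a crude stand-in for a proper stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix list, tried in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase, tokenize on runs of letters, drop stop words, strip suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]


print(index_terms("The indexes are retrieving the indexed documents"))
# ['index', 'retriev', 'index', 'document']
```

Note that "indexes" and "indexed" conflate to the same term, which is the point of the exercise; the non-word stem "retriev" is typical of suffix stripping.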
6
FOA Process
- Facets of documents: structure (DTD), format (CSS, XSL), content (XSD)
- Unit of interest
- Tagging of corpora: content tagging, grammatical tagging
7
FOA Process
- Step 1: Selection of corpora to build: the population from which the documents to be included are selected (domain, genre, ...)
- Step 2: Selection of tagging, if necessary: grammatical or other tagging schemes
8
FOA Process
Step 3: Indexing
- Index: doc_i -> {kw_j}
- Index^-1 (inverted index): {kw_j} -> doc_i
Extracting lexical features:
- Step a: Selection of tokens and separators
- Step b: Stemming decisions on number, gender (for some languages), hyphenation, phrases, idioms, morphological features, ...
- Step c: Removal of stop words using a list
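A minimal sketch of the two mappings, assuming whitespace tokenization and a made-up three-document corpus. The names `corpus`, `keywords`, `index` and `inverted_index` are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
corpus = {
    "doc1": "information retrieval and indexing",
    "doc2": "indexing of document collections",
    "doc3": "retrieval of information",
}
STOP_WORDS = {"and", "of"}  # illustrative stop list (Step c)


def keywords(text):
    """Step a (whitespace tokens) plus Step c (stop-word removal); no stemming here."""
    return {tok for tok in text.lower().split() if tok not in STOP_WORDS}


# Index: each document maps to its set of keywords.
index = {doc_id: keywords(text) for doc_id, text in corpus.items()}

# Inverted index: each keyword maps to the set of documents containing it.
inverted_index = defaultdict(set)
for doc_id, kws in index.items():
    for kw in kws:
        inverted_index[kw].add(doc_id)

print(sorted(inverted_index["retrieval"]))  # ['doc1', 'doc3']
```

Retrieval algorithms generally work from the inverted index, since a query names keywords and asks for documents.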
9
FOA Process
Use of Zipf’s Law in indexing
10
FOA Process
Zipf’s Law
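Zipf's law is usually stated as: the frequency f of a word is roughly inversely proportional to its rank r in the frequency table, f(r) ≈ C / r, so rank times frequency stays roughly constant. The sketch below checks this on a hypothetical token stream constructed so that the relationship holds exactly; real corpora follow it only approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Hypothetical data: frequencies 12, 6, 4, 3 chosen to fit f(r) = 12 / r exactly.
tokens = ["the"] * 12 + ["of"] * 6 + ["index"] * 4 + ["corpus"] * 3

for row in rank_frequency(tokens):
    print(row)
# (1, 'the', 12, 12)
# (2, 'of', 6, 12)
# (3, 'index', 4, 12)
# (4, 'corpus', 3, 12)
```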
11
FOA Process
Explanations of Zipf’s Law
- Zipf: Principle of Least Effort
- Mandelbrot: a more general version of Zipf’s law, and its similarity to Cantor dust (fractals)
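Mandelbrot's generalization is commonly written f(r) = C / (r + b)^a, which reduces to Zipf's form when a = 1 and b = 0. A small sketch follows; the parameter names `c`, `a`, `b` and the values used are assumptions chosen for illustration, not fitted to any corpus.

```python
def zipf_mandelbrot(rank, c, a=1.0, b=0.0):
    """Predicted frequency of the word at the given rank: c / (rank + b) ** a."""
    return c / (rank + b) ** a


# With the defaults (a = 1, b = 0) this is plain Zipf: frequency = c / rank.
print([zipf_mandelbrot(r, c=12) for r in (1, 2, 3, 4)])  # [12.0, 6.0, 4.0, 3.0]

# A positive b flattens the curve at the top ranks (illustrative values only).
print([round(zipf_mandelbrot(r, c=12, a=1.1, b=2.7), 2) for r in (1, 2, 3, 4)])
```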
12
FOA Process
Word occurrences as a Poisson process and the detection of stop words
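One common way to use the Poisson idea, sketched here under stated assumptions: if a word's occurrences were scattered over documents at random with rate cf / N (collection frequency over number of documents), the expected number of documents containing it at least once would be N * (1 - exp(-cf / N)). Function words tend to match this prediction, while content words cluster in fewer documents than predicted. The counts and the function names below are hypothetical.

```python
import math


def poisson_expected_df(collection_freq, n_docs):
    """Documents expected to contain the word at least once under a Poisson model."""
    rate = collection_freq / n_docs  # mean occurrences per document
    return n_docs * (1.0 - math.exp(-rate))


def poisson_ratio(collection_freq, observed_df, n_docs):
    """Predicted over observed document frequency; near 1 for Poisson-like words."""
    return poisson_expected_df(collection_freq, n_docs) / observed_df


# Hypothetical counts for a 1000-document corpus: word -> (collection freq, observed df).
n_docs = 1000
words = {"of": (3000, 950), "fractal": (300, 40)}

for word, (cf, df) in words.items():
    print(word, round(poisson_ratio(cf, df, n_docs), 2))
```

With these made-up counts, "of" comes out very close to 1 (a stop-word candidate), while "fractal" comes out well above 1 because its occurrences are concentrated in few documents.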
13
FOA Process
Resolving power of words in discriminating between documents
- Relationship between word frequency and word significance (for non-function words), i.e., authors use a word more frequently to signal its importance
- To be index terms, words must help discriminate between documents
14
FOA Process
Precision v. Recall
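Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. A minimal sketch with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both arguments as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]

print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```

Retrieving more documents can only raise or hold recall but usually lowers precision, which is why the two are traded off against each other.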
15
FOA Process
Specificity v. Exhaustivity
- An index is specific if it reflects the information needs of the users
- An index is exhaustive if it reflects all topics covered by the documents
- There is tension between the two
16
FOA Process
- Word frequency: the number of times that a word is used in a document
- Document frequency: the number of documents in the corpus in which a word is used; inverse document frequency decreases as this number grows (commonly log(N / document frequency) for a corpus of N documents)
- Robertson/Sparck Jones weighting
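The two counts combine into the familiar tf-idf weight. The sketch below assumes the plain variant, weight = tf * log(N / df), on a hypothetical tokenized corpus; it is one common choice among many and is not the Robertson/Sparck Jones relevance weight, which additionally uses relevance judgements.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus (already stopped and stemmed).
docs = {
    "d1": ["index", "term", "index"],
    "d2": ["query", "term"],
    "d3": ["query", "feedback"],
}


def tf_idf(docs):
    """Return {doc_id: {word: tf * log(N / df)}} for every word in every document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each word once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return weights


weights = tf_idf(docs)
for word, weight in sorted(weights["d1"].items()):
    print(word, round(weight, 3))
```

In "d1", the word "index" outweighs "term" for two reasons: it occurs twice in the document, and it occurs in only one document of the corpus.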
17
Vector Space Model
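In the vector space model, documents and queries are represented as vectors of term weights and are typically compared by the cosine of the angle between them. A minimal sketch, assuming sparse vectors stored as dicts; the weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors given as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical query and document vectors.
query = {"index": 1.0, "term": 1.0}
doc_a = {"index": 2.0, "term": 1.0}
doc_b = {"query": 1.0, "feedback": 1.0}

print(round(cosine(query, doc_a), 3))  # shares both terms with the query
print(cosine(query, doc_b))            # 0.0, no terms in common
```

Ranking documents by decreasing cosine with the query gives an ordered answer list rather than the yes/no set of a Boolean match.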