Chapter 2 Information Retrieval Ms. Malak Bagais [textbook]: Chapter 2.

Objectives
By the end of this lecture, students will be able to:
- List information retrieval components
- Describe document representation
- Apply Porter’s Algorithm
- Compare and apply different retrieval models
- Evaluate retrieval performance

Information Retrieval
- summarization
- searching
- indexing

Information Retrieval Components
- Document representation
- Query representation
- Ranking the documents
- Evaluation of the quality of retrieval

Document Representation
Transforming a text document into a weighted list of keywords

Stopwords

Sample Document
Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.

Delete stopwords
ad algorithms analysis analyze archives artificial assimilate availability billion bits challenges commercial complexity computational computing concept data database databases detect distributed driving dynamic efficiently emerged enterprises excess exciting expected family field fields force form hidden high hoc information intelligence interesting intersection large largely learning lies machine manage market mining nature nuggets online pattern petabyte-scale potentially practice presence present quick recognition recognize refers relationships science size software span statistics store systems techniques theoretical thrust time underpinnings valuable years
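The stopword-deletion step above can be sketched in Python. This is only an illustration: the tiny `STOPWORDS` set and the tokenizing regular expression are assumptions, since the slides do not say which stopword list or tokenizer was used.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption, not the slides' list).
STOPWORDS = {"a", "an", "and", "as", "for", "has", "in", "is", "most", "of", "one", "the", "to"}

def delete_stopwords(text):
    """Return the sorted list of distinct non-stopword tokens in text."""
    # Lowercase, then keep runs of letters, allowing internal hyphens (e.g. petabyte-scale).
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return sorted(set(tokens) - STOPWORDS)

sentence = ("Data Mining has emerged as one of the most exciting "
            "and dynamic fields in computing science.")
print(delete_stopwords(sentence))
```

Applied to the first sentence of the sample document, the surviving words all appear in the keyword list on the slide.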

Stemming
A given word may occur in a variety of syntactic forms:
- plurals
- past tense
- gerund forms
Example: connector, connected, preconnection, connection, connecting, postconnection, connections, connects

Stemming
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
- Stem: connect
- Forms with suffixes: connector, connection, connections, connected, connecting, connects
- Forms with prefixes: preconnection, postconnection

Porter’s Algorithm
- Letters A, E, I, O, and U are vowels
- A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
- The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant
- For example, Y in synopsis is a vowel, while in toy it is a consonant
- A consonant in the algorithm description is denoted by c, and a vowel by v

- m is the measure of vc repetition (for example, OATS has m = 1)
- *S: the stem ends with S (similarly for other letters)
- *v*: the stem contains a vowel
- *d: the stem ends with a double consonant (e.g., -TT)
- *o: the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)

Porter’s Algorithm
What is the value of m in the following words?
BY, PRIVATE, OATEN, ORRERY, IVY, TROUBLES, TREES, TROUBLE, OATS, Y, TREE, EE, TR
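One way to check answers to this exercise is a short Python sketch that applies the vowel and consonant rules from the earlier slide and counts vowel-to-consonant transitions. It assumes Porter's usual description of a word as [C](VC)^m[V]; the function names `is_vowel` and `measure` are hypothetical.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    """A, E, I, O, U are vowels; Y is a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    if ch == "y":
        return i > 0 and not is_vowel(word, i - 1)
    return False

def measure(word):
    """Return m, the number of vowel-run to consonant-run transitions in word."""
    word = word.lower()
    m = 0
    prev_vowel = False
    for i in range(len(word)):
        cur_vowel = is_vowel(word, i)
        if prev_vowel and not cur_vowel:
            m += 1  # a vowel run just ended in a consonant: one more VC pair
        prev_vowel = cur_vowel
    return m

for w in ["TR", "EE", "TREE", "Y", "BY", "TROUBLE", "OATS", "TREES",
          "IVY", "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(w, measure(w))
```

Under these rules the first five words have m = 0, the next four have m = 1, and the last four have m = 2.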

Porter’s Algorithm: Step 1
Step 1 handles plurals and past participles.

Porter’s Algorithm: Steps 2, 3, and 4
Steps 2–4 perform straightforward stripping of suffixes.

Example  generalizations  Step1: GENERALIZATION  Step2: GENERALIZE  Step3: GENERAL  Step4: GENER  OSCILLATORS  Step1: OSCILLATOR  Step2: OSCILLATE  Step4: OSCILL  Step5: OSCIL

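Porter’s Algorithm
In an experiment reported on Porter’s site, suffix stripping was applied to a vocabulary of 10,000 words:
- Number of words reduced in step 1: 3597
- Number of words reduced in step 2: 766
- Number of words reduced in step 3: 327
- Number of words reduced in step 4: 2424
- Number of words reduced in step 5: 1373
- Number of words not reduced: 3650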

Term-Document Matrix
- A term-document matrix (TDM) is a two-dimensional representation of a document collection
- Rows of the matrix represent the documents
- Columns correspond to the index terms
- Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row)
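As a rough illustration of this layout, the following Python sketch builds a raw-frequency TDM with documents as rows and index terms as columns. The three toy documents and the name `build_tdm` are made up for the example and are not the lecture's data.

```python
from collections import Counter

def build_tdm(docs):
    """Return (terms, matrix): one row per document, one column per index term."""
    counts = [Counter(doc.split()) for doc in docs]
    terms = sorted(set().union(*counts))            # column order: alphabetical
    matrix = [[c[t] for t in terms] for c in counts]  # missing terms count as 0
    return terms, matrix

# Hypothetical, already stemmed documents.
docs = ["data mine data", "mine pattern", "data pattern pattern pattern"]
terms, tdm = build_tdm(docs)
print(terms)
for row in tdm:
    print(row)
```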

Term-Document matrix

Sparse Matrices: Triples

Sparse Matrices: Pairs
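The content of these two slides did not survive in the transcript, so the Python sketch below rests on an assumption about what they showed: "triples" is taken to mean one (row, column, value) entry per nonzero cell, and "pairs" to mean a per-document list of (column, value) entries. The matrix is a made-up example.

```python
def to_triples(matrix):
    """Assumed 'triples' form: (row, column, value) for every nonzero cell."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]

def to_pairs(matrix):
    """Assumed 'pairs' form: for each row, a list of (column, value) for nonzero cells."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in matrix]

# Hypothetical raw-frequency TDM (rows = documents, columns = terms).
tdm = [[2, 1, 0],
       [0, 1, 1],
       [1, 0, 3]]
print(to_triples(tdm))
print(to_pairs(tdm))
```

Both forms store only nonzero cells, which is where the saving comes from when most terms are absent from most documents.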

Normalization
- Raw frequency values are not useful for a retrieval model
- Normalized weights, usually between 0 and 1, are preferred for each term in a document
- Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization
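The simple normalization described above (divide each frequency by the largest frequency in the same document) can be sketched in Python as follows. The matrix is a made-up example, and mapping an all-zero row to zeros is a choice of this sketch rather than something the slides specify.

```python
def normalize_rows(matrix):
    """Divide every frequency in a row by that row's largest frequency."""
    normalized = []
    for row in matrix:
        largest = max(row, default=0)
        # Guard against empty or all-zero documents (an assumption of this sketch).
        normalized.append([f / largest if largest else 0.0 for f in row])
    return normalized

# Hypothetical raw-frequency TDM (rows = documents, columns = terms).
tdm = [[2, 1, 0],
       [0, 1, 1],
       [1, 0, 3]]
for row in normalize_rows(tdm):
    print(row)
```

Every weight then lies between 0 and 1, and the most frequent term in each document gets weight 1.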

Normalized Term-Document Matrix

Vector Representation
Vector representation of the sample document showing the terms, their frequencies and normalized frequencies
Terms: ad algorithm analysi analyz archiv artifici assimil avail billion bit challeng commerci complex comput concept data databas detect distribut drive dynam effici emerg enterpris excess excit expect famili field forc form hidden high hoc inform intellig interest intersect knowledg larg learn li machin manag market mine natur nugget onlin pattern petabyte potenti practic presenc present quick recogn recognit refer relationship scienc size softwar span statist store system techniqu theoret thrust time underpin valuabl year

Retrieval Models
Retrieval models match a query with documents in order to:
- separate documents into relevant and non-relevant classes
- rank the documents according to their relevance
Types of retrieval models:
- Boolean model
- Vector space model (VSM)
- Probabilistic models

Boolean Retrieval Model
- One of the simplest and most efficient retrieval mechanisms
- Based on set theory and Boolean algebra
- Uses the conventional numeric representations of false as 0 and true as 1
- The Boolean model is interested only in the presence or absence of a term in a document
- In the term-document matrix, replace all the nonzero values with 1

Boolean Term-document Matrix

Examples
- Document sets
  - DocSet(K0) = {D1, D3, D5}
  - DocSet(K4) = {D2, D3, D4, D6}
- Queries
  - K0 and K4: DocSet(K0) ∩ DocSet(K4) = {D3}
  - K0 or K4: DocSet(K0) ∪ DocSet(K4) = {D1, D2, D3, D4, D5, D6}
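The two example queries map directly onto set intersection and union. A minimal Python sketch, using the document sets from the slide (the dictionary name `doc_set` is just a label chosen here):

```python
# Document sets from the slide, keyed by index term.
doc_set = {
    "K0": {"D1", "D3", "D5"},
    "K4": {"D2", "D3", "D4", "D6"},
}

and_result = doc_set["K0"] & doc_set["K4"]  # K0 and K4 -> intersection
or_result = doc_set["K0"] | doc_set["K4"]   # K0 or K4  -> union

print(sorted(and_result))  # ['D3']
print(sorted(or_result))   # ['D1', 'D2', 'D3', 'D4', 'D5', 'D6']
```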

Boolean Query
- User Boolean queries are usually simple Boolean expressions
- A Boolean query can be represented in “disjunctive normal form” (DNF)
  - disjunction corresponds to or
  - conjunction corresponds to and
- A DNF consists of a disjunction of conjunctive Boolean expressions

DNF
- K0 or (not K3 and K5) is in DNF
- DNF query processing can be very efficient
  - If any one of the conjunctive expressions is true, the entire DNF is true
  - Short-circuit the expression evaluation
  - Stop matching the expression against a document as soon as a conjunctive expression matches the document, and label the document as relevant to the query
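The short-circuit idea can be sketched in Python with `any` and `all`, which both stop as soon as the outcome is decided. The query encoding used here, a list of conjunctions where each literal is a `(term, negated)` pair, is an assumption made for the example.

```python
def matches_dnf(doc_terms, dnf):
    """Return True if the document's term set satisfies the DNF query.

    doc_terms: set of index terms present in the document.
    dnf: list of conjunctions; each conjunction is a list of (term, negated) literals.
    """
    # any() stops at the first satisfied conjunction;
    # all() stops at the first literal that fails.
    return any(
        all((term in doc_terms) != negated for term, negated in conjunction)
        for conjunction in dnf
    )

# K0 or (not K3 and K5), in the hypothetical encoding described above.
query = [[("K0", False)], [("K3", True), ("K5", False)]]

print(matches_dnf({"K0", "K3"}, query))  # True: first conjunction matches
print(matches_dnf({"K5"}, query))        # True: K3 absent and K5 present
print(matches_dnf({"K3", "K5"}, query))  # False: neither conjunction matches
```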

Boolean Model Advantages
- Simplicity and efficiency of implementation
- Binary values can be stored using bits
  - reduced storage requirements
  - retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
- Boolean queries are akin to database queries
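To illustrate the point about bits and bitwise operations, here is a small Python sketch that packs each term's column of the Boolean matrix into one integer, with bit i set when document Di contains the term. The helper names and the bit layout are choices made for this example; the document numbers reuse the K0 and K4 sets from the earlier slide.

```python
def to_bitmask(doc_ids):
    """Pack document numbers into an integer: bit i is 1 if document Di has the term."""
    mask = 0
    for d in doc_ids:
        mask |= 1 << d
    return mask

def from_bitmask(mask):
    """Unpack an integer bitmask back into a sorted list of document numbers."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

k0 = to_bitmask([1, 3, 5])      # DocSet(K0) = {D1, D3, D5}
k4 = to_bitmask([2, 3, 4, 6])   # DocSet(K4) = {D2, D3, D4, D6}

print(from_bitmask(k0 & k4))    # K0 and K4 -> [3]
print(from_bitmask(k0 | k4))    # K0 or K4  -> [1, 2, 3, 4, 5, 6]
```

Each query then costs a single integer operation per term rather than a scan over the documents.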

Boolean Model Disadvantages
- A document is either relevant or non-relevant to the query
- It is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users
- Boolean queries retrieve too few or too many documents
  - K0 and K4 retrieved only 1 out of 6 documents
  - K0 or K4 retrieved all 6 of those documents

Vector Space Model
- Treats both the documents and the queries as vectors
- Each term is given a weight based on its frequency in the document

Graphical representation of the VSM Model

Computing the similarity
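The similarity formula on this slide is not preserved in the transcript. The Python sketch below assumes cosine similarity, a common choice for the vector space model, and uses made-up document and query vectors, so its scores are not the ones reported in the lecture.

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two weight vectors (assumed similarity measure)."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0

def rank(docs, query):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(vec, query)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical normalized weights over three index terms.
docs = {
    "D1": [1.0, 0.5, 0.0],
    "D2": [0.0, 1.0, 1.0],
    "D3": [0.2, 0.0, 1.0],
}
query = [1.0, 0.0, 0.5]

for name, score in rank(docs, query):
    print(f"{name} ({score:.4f})")
```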

Relevance Values and Ranking
- Similarity between the documents and the query
- Ranking based on the similarity: D0 (0.7774), D6 (0.4953), D2 (0.3123), D1 (0.2590), D5 (0.2122), D4 (0.1727), D3 (0.1084)

Variations of VSM
- Variations of the normalized frequency
- Inverse document frequency (idf)
- The idf for the j-th term is computed from:
  - N = number of documents
  - n_j = number of documents containing the j-th term
- Modified weights combine the normalized frequency with the idf
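The idf and modified-weight formulas are missing from the transcript. As a sketch only, the Python below assumes the widely used forms idf_j = log(N / n_j) and w_ij = f_ij * idf_j, with a natural logarithm; the lecture may have used a different base or variant, and the matrix is a made-up example.

```python
import math

def idf(matrix):
    """Assumed idf_j = log(N / n_j) for each column j of a term-document matrix."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    values = []
    for j in range(n_terms):
        n_j = sum(1 for row in matrix if row[j] > 0)  # documents containing term j
        values.append(math.log(n_docs / n_j) if n_j else 0.0)
    return values

def apply_idf(matrix):
    """Assumed modified weights: normalized frequency multiplied by idf."""
    weights = idf(matrix)
    return [[f * w for f, w in zip(row, weights)] for row in matrix]

# Hypothetical normalized TDM: term 0 occurs in every document, term 1 in half.
tdm = [[1.0, 0.5],
       [1.0, 0.0],
       [0.5, 0.0],
       [1.0, 1.0]]
print(idf(tdm))
print(apply_idf(tdm))
```

Under this assumption a term that occurs in every document gets idf 0 and stops contributing to similarity, while rarer terms are weighted up.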

Inverse Document Frequencies

TDM using idf

Similarity and ranking using idf
- Similarity between the documents and the query
- Ranking based on the similarity: D0 (0.7867), D6 (0.4953), D2 (0.3361), D1 (0.2590), D5 (0.2215), D4 (0.1208), D3 (0.0969)

VSM vs. Boolean
- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector, in the same way as a document
- Matching between a query and a document is not exact: each document is assigned a degree of similarity
- Documents are ranked by their similarity scores instead of being split into relevant and non-relevant classes
- Users can go through the ranked list until their information needs are met

Evaluation of Retrieval Performance
Evaluation should include:
- Functionality
- Response time
- Storage requirement
- Accuracy

Accuracy Testing
- Early days: batch testing
  - Document collection such as cacm.all
  - Query collection such as query.text
- Present day: interactive tests are used
  - Difficult to conduct and time consuming
  - Batch testing is still important

Precision and Recall
- Precision: How many of the retrieved documents are relevant?
- Recall: How many of the relevant documents are retrieved?

Example

F-measure
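The precision and recall questions above translate into simple set arithmetic. The Python sketch below also computes an F-measure; because the slide's formula is not in the transcript, it assumes the common balanced form F = 2PR / (P + R). The retrieved and relevant sets are made up.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    F         = 2PR / (P + R)   (assumed balanced form)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Hypothetical query result: 4 documents retrieved, 3 documents actually relevant.
p, r, f = precision_recall_f({"D1", "D2", "D3", "D4"}, {"D2", "D4", "D5"})
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.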

Average Precision
- The choice of three retrieved documents was arbitrary

Relationship between precision and recall