Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

Presentation transcript:

Information Retrieval: Introduction/Overview
Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Data Mining: Introductory and Advanced Topics by Margaret H. Dunham.

CSE 5331/7331 F07
Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
- Library science
- Digital libraries
- Web search engines
Traditionally keyword based.
Sample query: find all documents about “data mining”.

DB vs IR
- Records (tuples) vs. documents
- Well-defined results vs. fuzzy results
- DB grew out of files and traditional business systems
- IR grew out of library science and the need to categorize, group, and access books and articles

DB vs IR (cont’d)
- Data retrieval
  - which docs contain a set of keywords?
  - well-defined semantics
  - a single erroneous object implies failure!
- Information retrieval
  - information about a subject or topic
  - semantics is frequently loose
  - small errors are tolerated
- IR system:
  - interprets the contents of information items
  - generates a ranking which reflects relevance
  - the notion of relevance is most important

Motivation
- IR in the last 20 years:
  - classification and categorization
  - systems and languages
  - user interfaces and visualization
- Still, the area was seen as of narrow interest
- The advent of the Web changed this perception once and for all
  - universal repository of knowledge
  - free (low cost) universal access
  - no central editorial board
  - many problems though: IR seen as key to finding the solutions!

Basic Concepts: Logical View of the Documents
Document representation is viewed as a continuum: the logical view of the docs might shift.
[Figure: logical view of the documents. Labels: docs; structure; accents, spacing; stopwords; noun groups; stemming; manual indexing. Resulting views: structure, full text, index terms.]

The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flow labels: user need, text, logical view, query, user feedback, inverted file, retrieved docs, ranked docs.]

IR is Fuzzy
[Figure: two panels labelled Simple and Fuzzy, each with Accept and Reject labels.]

Information Retrieval
Similarity: a measure of how close a query is to a document. Documents which are “close enough” are retrieved.
Metrics:
$$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$
$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$
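The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall`, the document ids, and the choice to return 0.0 when a set is empty are all assumptions made here.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgement: docs 1-4 are relevant, the system returned docs 3, 4, 5.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
print(p, r)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).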

Indexing
- IR systems usually adopt index terms to process queries
- Index term:
  - a keyword or group of selected words
  - any word (more general)
- Stemming might be used:
  - connect: connecting, connection, connections
- An inverted file is built for the chosen index terms

Indexing
[Figure: indexing. Labels: Docs, Information Need, Index Terms, doc, query, match, Ranking.]

Inverted Files
There are two main elements:
- Vocabulary: the set of unique terms
- Occurrences: where those terms appear
The occurrences can be recorded as term offsets or byte offsets. Term offsets are good for retrieving concepts such as proximity, whereas byte offsets allow direct access.
[Table: columns Vocabulary and Occurrences (byte offset); rows elided in the source.]
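As a rough sketch of the two elements, the Python below builds a vocabulary whose occurrence lists record both kinds of offset. The function name `build_inverted_file`, the sample text, the `\w+` tokenization, and the 0-based numbering are assumptions for illustration; for plain ASCII text the character offsets used here coincide with byte offsets.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Map each lowercased word to a list of (term_offset, char_offset) pairs."""
    index = defaultdict(list)
    for term_offset, match in enumerate(re.finditer(r"\w+", text)):
        # match.start() is the 0-based character offset of the word in the text
        index[match.group().lower()].append((term_offset, match.start()))
    return dict(index)


# Hypothetical sample text, not taken from the slides.
text = "new home sales rise; home sales top forecasts"
index = build_inverted_file(text)
print(sorted(index))   # the vocabulary
print(index["home"])   # occurrences of one term
```

The term offsets say that "home" is the 2nd and 5th word, which is what a proximity check needs; the character offsets say where to seek in the text to read the word directly.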

Inverted Files
- The space needed for the indexed terms is often several orders of magnitude smaller than the size of the documents (MBs vs. GBs)
- The space consumed by the occurrence lists is not trivial: each time a term appears, it must be added to a list in the inverted file
- That may lead to considerable index overhead

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6

Ranking
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the query.
A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
Each set of premises leads to a distinct IR model.

Classic IR Models - Basic Concepts
- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Usually, index terms are nouns, because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full text representation)

Classic IR Models - Basic Concepts
The importance of the index terms is represented by weights associated with them:
- $k_i$: an index term
- $d_j$: a document
- $w_{ij}$: a weight associated with the pair $(k_i, d_j)$
The weight $w_{ij}$ quantifies the importance of the index term for describing the document contents.

Classic IR Models - Basic Concepts
- $t$ is the total number of index terms
- $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
- $w_{ij} \geq 0$ is a weight associated with $(k_i, d_j)$
- $w_{ij} = 0$ indicates that the term does not belong to the doc
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$
- $g_i(\vec{d}_j) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$

The Boolean Model
- Simple model based on set theory
- Queries are specified as Boolean expressions, with precise semantics and a neat formalism
- Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
- Consider:
  - $q = k_a \wedge (k_b \vee \neg k_c)$
  - $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  - $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
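To see where the disjunctive normal form comes from, the sketch below reads the slide's example query as q = k_a AND (k_b OR NOT k_c) and enumerates every binary weight vector (k_a, k_b, k_c) that satisfies it. The helper names `q` and `matches` are hypothetical, and the enumeration is only practical for a handful of terms.

```python
from itertools import product


def q(ka, kb, kc):
    """The example query: ka AND (kb OR NOT kc), over binary weights."""
    return bool(ka and (kb or not kc))


# Each satisfying assignment of (ka, kb, kc) is one conjunctive component of q_dnf.
dnf = [bits for bits in product((1, 0), repeat=3) if q(*bits)]
print(dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]


def matches(doc_weights):
    """A document matches when its binary weights for (ka, kb, kc) equal some component."""
    return tuple(doc_weights) in dnf


print(matches((1, 1, 0)), matches((0, 1, 1)))  # True False
```

Under the Boolean model a document either matches or it does not, so there is no ranking among the matching documents.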

The Vector Model
- The use of binary weights is too limiting
- Non-binary weights allow partial matches to be taken into account
- These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides better matching

The Vector Model
- $w_{ij} > 0$ whenever $k_i$ appears in $d_j$
- $w_{iq} \geq 0$ is the weight associated with the pair $(k_i, q)$
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
- $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$
- Each term $k_i$ is associated with a unit vector $\vec{i}$
- The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The $t$ unit vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space in which queries and documents are represented as weighted vectors
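The slide does not name the similarity measure. One common choice for weighted vectors in such a space is the cosine of the angle between the document and query vectors, sketched below under that assumption. The weights, document names, and the function name `cosine_similarity` are made up for illustration.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention assumed here: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)


# Hypothetical weights over t = 3 index terms.
docs = {"d1": (0.5, 0.8, 0.0), "d2": (0.0, 0.2, 0.9)}
query = (1.0, 1.0, 0.0)

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(ranking)
```

Unlike the Boolean model, d2 still receives a small non-zero score because it shares one weighted term with the query, which is the partial matching the slides refer to.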

Query Languages
- Keyword Based
- Boolean
- Weighted Boolean
- Context Based (Phrasal & Proximity)
- Pattern Matching
- Structural Queries

Keyword Based Queries
- Basic Queries
  - Single word
  - Multiple words
- Context Queries
  - Phrase
  - Proximity

Boolean Queries
Keywords combined with Boolean operators:
- OR: ($e_1$ OR $e_2$)
- AND: ($e_1$ AND $e_2$)
- BUT: ($e_1$ BUT $e_2$): satisfy $e_1$ but not $e_2$
Negation is only allowed using BUT, so that the inverted index can be used efficiently by filtering another efficiently retrievable set.
Naïve users have trouble with Boolean logic.

Boolean Retrieval with Inverted Indices
- Primitive keyword: retrieve the containing documents using the inverted index.
- OR: recursively retrieve $e_1$ and $e_2$ and take the union of the results.
- AND: recursively retrieve $e_1$ and $e_2$ and take the intersection of the results.
- BUT: recursively retrieve $e_1$ and $e_2$ and take the set difference of the results.
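The four rules above map directly onto a small recursive evaluator. The sketch below assumes a query is either a keyword string or a tuple `(op, e1, e2)`, and that the inverted index is a dictionary from keyword to a set of document ids; both representations, and the name `evaluate`, are choices made here for illustration.

```python
def evaluate(query, index):
    """Evaluate a Boolean query against an inverted index (keyword -> set of doc ids)."""
    if isinstance(query, str):
        # Primitive keyword: look up the containing documents.
        return set(index.get(query, ()))
    op, e1, e2 = query
    left, right = evaluate(e1, index), evaluate(e2, index)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference: e1 but not e2
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical index over five documents.
index = {"data": {1, 2, 3}, "mining": {2, 3, 4}, "gold": {3, 5}}
print(evaluate(("BUT", ("AND", "data", "mining"), "gold"), index))  # {2}
```

Because BUT only ever filters a set that was already retrieved, the evaluator never has to enumerate "all documents not containing a term", which is the efficiency argument made for restricting negation.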

Phrasal Queries
- Retrieve documents with a specific phrase (an ordered list of contiguous words), e.g. “information theory”
- May allow intervening stop words and/or stemming
  - “buy camera” matches: “buy a camera”, “buying the cameras”, etc.

Phrasal Retrieval with Inverted Indices
- Must have an inverted index that also stores the positions of each keyword in a document.
- Retrieve documents and positions for each individual word, intersect the documents, and then finally check for ordered contiguity of keyword positions.
- Best to start the contiguity check with the least common word in the phrase.
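A minimal sketch of these steps follows. It assumes a positional index of the form term -> {doc id -> list of word positions}, whitespace tokenization, and no stemming or stop-word handling; the helper names and sample documents are hypothetical. "Least common" is taken here to mean the fewest total occurrences across the collection.

```python
def build_positional_index(docs):
    """docs: {doc_id: text}. Returns term -> {doc_id: [word positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_search(phrase, index):
    """Return the ids of documents containing the words of phrase contiguously, in order."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Step 1: intersect the document sets of the individual words.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    # Step 2: anchor the contiguity check on the least common word.
    anchor = min(range(len(words)),
                 key=lambda i: sum(len(p) for p in index[words[i]].values()))
    hits = set()
    for doc in candidates:
        for pos in index[words[anchor]][doc]:
            start = pos - anchor  # where the phrase would have to begin
            if all(start + i in index[w][doc] for i, w in enumerate(words)):
                hits.add(doc)
                break
    return hits


docs = {
    1: "information theory is a theory of information",
    2: "theory and information",
    3: "an information theory course",
}
index = build_positional_index(docs)
print(phrase_search("information theory", index))  # {1, 3}
```

Document 2 contains both words and survives the intersection, but fails the contiguity check, which is why positions have to be stored in the index.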

Proximity Queries
- A list of words with specific maximal distance constraints between terms.
- Example: “dogs” and “race” within 4 words matches “…dogs will begin the race…”
- May also perform stemming and/or not count stop words.
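The distance check can be sketched as below. The slide does not define "within 4 words" precisely; this sketch assumes it means the two word positions differ by at most 4, in either order, which is consistent with the slide's example. Tokenization is plain whitespace splitting, with no stemming or stop-word removal, and the function name is hypothetical.

```python
def proximity_match(text, word_a, word_b, max_distance):
    """True if some occurrence of word_a and word_b lie at most max_distance positions apart."""
    tokens = text.lower().split()
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    # Compare every pair of occurrences; fine for a sketch, quadratic in the worst case.
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)


print(proximity_match("dogs will begin the race", "dogs", "race", 4))  # True
print(proximity_match("dogs will begin the race", "dogs", "race", 3))  # False
```

In a real system the positions would come from the positional inverted index rather than from re-scanning the text.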

Pattern Matching
- Allow queries that match strings rather than word tokens.
- Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

Simple Patterns
- Prefixes: a pattern that matches the start of a word. “anti” matches “antiquity”, “antibody”, etc.
- Suffixes: a pattern that matches the end of a word. “ix” matches “fix”, “matrix”, etc.
- Substrings: a pattern that matches an arbitrary contiguous sequence of characters within a word. “rapt” matches “enrapture”, “velociraptor”, etc.
- Ranges: a pair of strings that matches any word lexicographically (alphabetically) between them. “tin” to “tix” matches “tip”, “tire”, “title”, etc.
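The four pattern types can be illustrated against a small sorted vocabulary. This is only a sketch: the prefix, suffix, and substring checks scan the whole vocabulary, which is exactly the inefficiency that the more sophisticated data structures mentioned earlier are meant to avoid. The function names are hypothetical, and treating the range endpoints as inclusive is an assumption.

```python
import bisect

# Hypothetical vocabulary built from the slide's own example words; kept sorted.
vocabulary = sorted([
    "antibody", "antiquity", "enrapture", "fix", "matrix",
    "tin", "tip", "tire", "title", "tix", "velociraptor",
])


def prefix_matches(vocab, prefix):
    return [w for w in vocab if w.startswith(prefix)]


def suffix_matches(vocab, suffix):
    return [w for w in vocab if w.endswith(suffix)]


def substring_matches(vocab, pattern):
    return [w for w in vocab if pattern in w]


def range_matches(vocab, low, high):
    """Words w with low <= w <= high, located by binary search on the sorted list."""
    start = bisect.bisect_left(vocab, low)
    end = bisect.bisect_right(vocab, high)
    return vocab[start:end]


print(prefix_matches(vocabulary, "anti"))     # ['antibody', 'antiquity']
print(suffix_matches(vocabulary, "ix"))       # ['fix', 'matrix', 'tix']
print(substring_matches(vocabulary, "rapt"))  # ['enrapture', 'velociraptor']
print(range_matches(vocabulary, "tin", "tix"))
```

Range queries are the one case here that a sorted vocabulary already answers efficiently, since both endpoints can be found by binary search.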