ISP 433/533 Week 2 IR Models.

Outline
IR defined
IR tasks
IR processes
Boolean model
Break
Vector space model
Probabilistic model

User Information Needs
Satisfying user information needs is the goal of IR, and it is a hard problem:
People have different and highly varied needs for information.
People often do not know what they want, or may not be able to express it in a usable form.

Some Definitions of IR
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”

Examples of IR
Conventional (library catalog): search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
Question answering systems (AskJeeves, NSIR, Answerbus): search in (restricted) natural language.

Key Terms Used in IR
QUERY: a representation of what the user is looking for; it can be a list of words or a phrase.
DOCUMENT: an information entity that the user wants to retrieve.
COLLECTION: a set of documents.
INDEX: a representation of information that makes querying easier.
TERM: a word or concept that appears in a document or a query.
RANKING: an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query.

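Basic IR Process
[Diagram: docs are represented by index terms; the information need is expressed as a query; the document and query representations are matched, and the matches are ranked.]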

IR Task – ad hoc
[Diagram: queries Q1 to Q5 issued against a relatively stable collection.]

IR Task - filtering
[Diagram: a documents stream is compared with the User 1 profile and the User 2 profile, producing docs filtered for User 1 and docs filtered for User 2.]

Process of IR
[Diagram components: user interface, text operations, query operations, indexing, searching, ranking, DB manager, index, text database.]

Document Process Steps

Classic IR models
Each document is represented by a set of representative keywords, or index terms.
Not all terms are equally useful for representing the document contents: less frequent terms identify a narrower set of documents.
Let $k_i$ be an index term, $d_j$ a document, and $w_{ij}$ a weight associated with the pair $(k_i, d_j)$.
The weight $w_{ij}$ quantifies the importance of the index term for describing the document contents.

Boolean Model
Simple model based on set theory.
Queries are specified as Boolean expressions, with precise semantics and a neat formalism using Boolean logic.
E.g., Queryx = ka  (kb  kc)
Terms are either present or absent; thus $w_{ij} \in \{0, 1\}$.

Boolean Logic
Named after the logician and mathematician George Boole.
Logical connectives: AND, OR, NOT.
Warning: these are inspired by, but not the same as, usual English usage.
AND: “Each thing must satisfy ALL conditions”
OR: “Each thing must satisfy at least one condition”
NOT: “Each thing must NOT satisfy the given condition”

Logical AND ($\wedge$) (Set Intersection)
$A \cap B$ is the set of things in common, i.e., in both sets A and B.
Example: $A \cap B$ (aged, blind people), where A = Aged and B = Blind.

Logical OR ($\vee$) (Set Union)
$A \cup B$ is the set of things in either A, B, or both.
Example: $A \cup B$ (people who are either aged or blind or both), where A = Aged and B = Blind.

Logical NOT ($\neg$) (Set Complement)
$\neg B$ is the set of things outside the set B.
Example: $\neg B$ (people who aren’t blind), where B = Blind.

Example Combination
$A \cap (\neg B)$ (old people who aren’t blind), where A = Aged and B = Blind.

Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information  retrieval”
Q2 = “information   computer”
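The mapping from Boolean connectives to set operations can be tried out on the four exercise documents. The following is a minimal sketch, not a reference solution: the helper `having` and the three queries are assumptions made for illustration, and their connectives are not necessarily those of Q1 and Q2.

```python
# Exercise documents as sets of terms (binary weights: a term is present or absent).
docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}
all_docs = set(docs)  # the universe, needed to take a complement


def having(term):
    """Return the names of the documents that contain the term (hypothetical helper)."""
    return {name for name, terms in docs.items() if term in terms}


# Illustrative queries with assumed connectives.
and_query = having("information") & having("retrieval")                   # AND -> intersection
or_query = having("information") | having("retrieval")                    # OR  -> union
and_not_query = having("information") & (all_docs - having("computer"))   # NOT -> complement

print(sorted(and_query))      # ['D1']
print(sorted(or_query))       # ['D1', 'D2', 'D3', 'D4']
print(sorted(and_not_query))  # ['D3']
```

Each result is an unordered set: a document either satisfies the expression or it does not, and nothing in the output says which match is better.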

Drawbacks of the Boolean Model
Retrieval is based on a binary decision criterion, with no notion of partial matching.
No ranking of the documents is provided (absence of a grading scale).
The information need has to be translated into a Boolean expression, which most users find awkward.
The Boolean queries formulated by users are most often too simplistic.
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

Break

Vector Model
Non-binary weights allow partial matches to be taken into account.
These term weights are used to compute a degree of similarity between a query and each document.
A ranked set of documents provides for better matching.

Vector Space
Assume that terms are independent of each other and that each term defines a dimension.
This gives a t-dimensional space, where t is the number of terms.
In this space, queries and documents are represented as weighted vectors.
A weight $w_{iq} \ge 0$ is associated with the pair $(k_i, q)$.
$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$
$\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$

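Example Vector Space using term frequency
D1 = “computer information retrieval”
D2 = “computer retrieval”
Q1 = “information, retrieval”
[Diagram: three axes labeled computer, information, and retrieval.]
With the components ordered as (computer, information, retrieval):
D1 = (1, 1, 1)
D2 = (1, 0, 1)
Q1 = (0, 1, 1)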
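The term-frequency vectors in this example can be rebuilt mechanically. The sketch below assumes whitespace tokenization, treats the comma in the query as a separator, and fixes the component order as (computer, information, retrieval); the function name `tf_vector` is hypothetical.

```python
from collections import Counter


def tf_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (raw term frequency)."""
    counts = Counter(text.replace(",", " ").lower().split())
    # A Counter returns 0 for terms that never occur, which gives the zero components.
    return tuple(counts[term] for term in vocabulary)


vocabulary = ("computer", "information", "retrieval")  # one dimension per term

d1 = tf_vector("computer information retrieval", vocabulary)
d2 = tf_vector("computer retrieval", vocabulary)
q1 = tf_vector("information, retrieval", vocabulary)

print(d1, d2, q1)  # (1, 1, 1) (1, 0, 1) (0, 1, 1)
```

Here every term occurs at most once per text, so the term-frequency weights happen to coincide with binary weights; a repeated term would produce a component greater than 1.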

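Similarity Measure
[Diagram: the vectors $\vec{d_j}$ and $\vec{q}$, separated by the angle $\theta$.]
$\mathrm{sim}(q, d_j) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d_j}|\,|\vec{q}|}$
Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le \mathrm{sim}(q, d_j) \le 1$.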

Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
Q1 = “information, retrieval”
Given the above query, rank the two documents by relevance using the vector model.
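One way to check an answer to this exercise is to compute the cosine measure directly. The sketch below assumes term-frequency weights and the component order (computer, information, retrieval); the function name `cosine` and the zero-vector convention are choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_u * norm_v)


query = (0, 1, 1)                                # Q1 = "information, retrieval"
documents = {"D1": (1, 1, 1), "D2": (1, 0, 1)}

scores = {name: cosine(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

D1 scores 2/√6, about 0.816, and D2 scores 1/2, so D1 is ranked first. D2 still receives a nonzero score even though it lacks the term "information", which is the partial matching that the Boolean model cannot express.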

Pros and Cons of the Vector Model
Advantages:
Term weighting improves the quality of the answer set.
Partial matching allows retrieval of docs that approximate the query conditions.
The cosine ranking formula sorts documents according to their degree of similarity to the query.
Disadvantages:
It assumes independence of index terms, although it is not clear that this is a bad assumption.

Probabilistic Model
Given a user query, there is an ideal answer set.
Querying is treated as the specification of the properties of this ideal answer set (clustering).
But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
Improve by iteration.

Probabilistic Ranking Principle
Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant.
$\mathrm{sim}(q, d_j) = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$
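The ratio above turns a relevance probability into odds. The sketch below assumes that the two probabilities sum to 1 for a given document and query, so that P(non-relevant) = 1 − P(relevant); the probability estimates and the document names are made-up values for illustration, not the output of any estimation procedure.

```python
def odds_of_relevance(p_relevant):
    """sim(q, dj) as P(relevant) / P(non-relevant), assuming the two sum to 1."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("expected a probability in [0, 1)")
    return p_relevant / (1.0 - p_relevant)


# Hypothetical estimates of P(dj relevant to q) for three documents.
estimates = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

sims = {name: odds_of_relevance(p) for name, p in estimates.items()}
ranking = sorted(sims, key=sims.get, reverse=True)

print(ranking)  # ['d1', 'd2', 'd3']
```

Because the odds grow monotonically with the probability, ranking by this ratio gives the same order as ranking by the probability of relevance itself.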

Performance of Probabilistic Model
Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections.
This also seems to be the view of the research community.
