INFO624 - Week 2: Models of Information Retrieval
Dr. Xia Lin, Associate Professor
College of Information Science and Technology, Drexel University

Review of Last Week
Challenges of Information Retrieval
  - Translate the user’s information needs into queries.
  - Match queries to stored information.
  - Evaluate whether the query results match the user’s information needs.
Differences between
  - Data, information, and knowledge
  - Data retrieval and information retrieval

Assignment 1
Some of my favorite search software packages
  - IBM’s Content Management (high-cost)
  - AOL PLS Search Engine (free)
  - GreenStone Digital Library Software (open source)
  - SWISH (open source)
  - mnoGoSearch (free)
  - Apache Lucene (open source components)

Documents
Documents are logical units of text
  - Units of records (text and other components)
  - Units that can be stored, retrieved, and displayed as a unique entity
  - Units of semantic entity
      - Units of text grouped together for a purpose
  - Units of unformatted text
      - Text as written by the authors of documents

Document Models
Documents need to be processed and represented in concise and identifiable formats/structures.
  - Documents are full of text.
  - Not every word of the text is meaningful for searching/retrieval.
  - Documents themselves do not have identifiable attributes such as authors and titles.

Figure 1.2: Logical view of a document: from full text to a set of index terms.

Document Representation
Documents should be represented to help users identify and receive information from the system:
  - to identify authors and titles
  - to identify subjects
  - to provide summaries/abstracts
  - to classify subject categories

Document Surrogates
Each document should have one or more short and descriptive labels/attributes
  - Level 1: Title, Author, Keywords
  - Level 2: Level 1 + Abstract
  - Level 3: Level 2 + full text

A Formal IR Model
An information retrieval model is a quadruple (D, Q, F, R(qi, dj)) where
  - D is a set composed of logical views (or representations) for the documents in the collection.
  - Q is a set composed of logical views (or representations) for the information needs. Such representations are called queries.
  - F is a framework for modeling document representations, queries, and their relationships.
  - R(qi, dj) is a ranking function which associates a real number with a query qi and a document representation dj. Such a ranking defines an ordering among the documents with regard to the query qi.

Computerized Indexing
Title indexing:
  - Sort all the titles alphabetically.
  - Ignore a leading “a” or “the”.
  - Convert all letters to uppercase.
  - Matching always starts from the beginning of the title (not from individual words).
  - Most early IR systems (such as library catalogs) used title indexing.
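
The title-indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular catalog system; the function names and the sample titles are hypothetical.

```python
import bisect

LEADING_ARTICLES = {"a", "the"}  # the slide names only these two

def title_key(title):
    """Build the sort key: drop a leading article, then uppercase."""
    words = title.split()
    if words and words[0].lower() in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(words).upper()

def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def find_by_prefix(index, prefix):
    """Match from the beginning of the title only, never from inner words."""
    prefix = prefix.upper()
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key, title in index[start:]:
        if not key.startswith(prefix):
            break
        matches.append(title)
    return matches

# Hypothetical sample titles
titles = [
    "The Art of Indexing",
    "A Guide to Retrieval",
    "Information Retrieval",
    "Modern Search",
]
index = build_title_index(titles)
print(find_by_prefix(index, "art"))        # finds the title filed under "ART ..."
print(find_by_prefix(index, "retrieval"))  # empty: no title *starts* with it
```

The second lookup returns nothing even though two titles contain "Retrieval", which is the limitation that word indexing addresses.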

Word Indexing
Parse every individual word from documents
  - First decision: What is a word?
  - Are digits words?
      - How about letter and digit combinations: B6, B12?
      - Is F-16 one word or two words?
  - Hyphens
      - Online, on-line, on line?
      - F-16
  - Singular or plural?
List all the words alphabetically with pointers back to documents: this is inverted indexing.
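
To make the "what is a word?" decision concrete, here is a small tokenizer sketch with one switch for hyphen handling. The regular expressions and the `keep_hyphens` parameter are illustrative assumptions; real systems make many more choices (plurals, case, punctuation).

```python
import re

def tokenize(text, keep_hyphens=True):
    """Split text into lowercase word tokens.

    keep_hyphens=True treats "F-16" and "on-line" as single words;
    keep_hyphens=False splits them at the hyphen.
    Letter-digit combinations such as "B12" are kept whole either way.
    """
    if keep_hyphens:
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    else:
        pattern = r"[A-Za-z0-9]+"
    return [token.lower() for token in re.findall(pattern, text)]

sample = "The F-16 manual covers vitamin B12 and on-line search."
print(tokenize(sample, keep_hyphens=True))
print(tokenize(sample, keep_hyphens=False))
```

Whichever policy is chosen must be applied to queries as well, otherwise a search for "on-line" will not meet the index entries built from the documents.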

Inverted Indexing
An inverted index consists of an ordered list of indexing terms; each indexing term is associated with some document identification numbers.
Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents.

Example: Create an inverted index for the following:

Document Number    Terms
001                T3, T4, T6, T12, T15
002                T1, T3, T4, T7, T9, T13
003                T5, T12, T15
004                T11, T12, T15, T15
005                T2, T3, T5, T7, T8, T12
006                T1, T4, T5
007                T3, T5, T6, T7
008                T1, T2, T7, T9, T12
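
One way to work this exercise is sketched below. It assumes the index stores each document number at most once per term (so the repeated T15 in document 004 collapses to a single posting); the function name is hypothetical.

```python
from collections import defaultdict

# Document number -> terms, copied from the table above
documents = {
    "001": ["T3", "T4", "T6", "T12", "T15"],
    "002": ["T1", "T3", "T4", "T7", "T9", "T13"],
    "003": ["T5", "T12", "T15"],
    "004": ["T11", "T12", "T15", "T15"],
    "005": ["T2", "T3", "T5", "T7", "T8", "T12"],
    "006": ["T1", "T4", "T5"],
    "007": ["T3", "T5", "T6", "T7"],
    "008": ["T1", "T2", "T7", "T9", "T12"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)  # a set drops repeated terms
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(documents)

# Print terms in numeric order (T1, T2, ..., T15) rather than string order
for term in sorted(index, key=lambda t: int(t[1:])):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, for example `index["T12"]`, instead of a scan over all eight documents.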

Boolean Logic
Logical operators defined on sets
  - True and false:
      - A set is a collection of items with certain common characteristics.
      - Any item either belongs to the set (true) or does not belong to the set (false).
  - AND
      - Combines two sets, A and B, to create a smaller (or at least not larger) set C.
      - Any item in C must be in BOTH set A and set B.
  - OR
      - Union of two sets, A and B, to create a larger (or at least not smaller) set C.
      - Any item in C must be in set A, in set B, or in both.
  - NOT
      - Excludes items from a set.

Example:
Given:
  A = {1, 3, 7, 12, 14, 25, 36}
  B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
  C = {2, 4, 6, 8, 10, 11, 12, 13, 14}
Derive:
  - A AND B
  - A OR B
  - A AND B AND C
  - (A AND B) NOT C
  - (A AND B) OR C
  - (A OR B) AND C
  - A AND (B OR C)
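
The exercise can be checked with Python's built-in set type, where `&` is AND (intersection), `|` is OR (union), and `-` plays the role of NOT (set difference). The dictionary of labelled results is just a convenience for printing.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

results = {
    "A AND B": A & B,
    "A OR B": A | B,
    "A AND B AND C": A & B & C,
    "(A AND B) NOT C": (A & B) - C,
    "(A AND B) OR C": (A & B) | C,
    "(A OR B) AND C": (A | B) & C,
    "A AND (B OR C)": A & (B | C),
}

for expression, members in results.items():
    print(f"{expression:16} = {sorted(members)}")
```

Note that A AND (B OR C) comes out equal to A AND B for these particular sets, because no item of C is in A without also being in B.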

Boolean Logic
Venn Diagram
  - Graphical representation of Boolean logic
  - A and (B or C)
  - A and B or (C and D)

Boolean Query
Terms connected by Boolean operators
The system retrieves a set of documents based on the Boolean logic of the query.
Examples:
  - (network or networks or structured or system or systems) and (information or retrieval)
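
A sketch of how such a query could be evaluated against an inverted index follows. The postings (term to document-number sets) are made up for illustration, and the helper names are hypothetical; a term missing from the index simply contributes no documents.

```python
# Hypothetical inverted index: term -> set of document numbers
postings = {
    "network": {1, 2},
    "networks": {3},
    "system": {2, 4},
    "information": {1, 4, 5},
    "retrieval": {2, 5},
}

def docs_for(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return postings.get(term, set())

def any_of(*terms):
    """OR: union of the posting sets of all the given terms."""
    result = set()
    for term in terms:
        result |= docs_for(term)
    return result

# (network or networks or structured or system or systems)
#   and (information or retrieval)
left = any_of("network", "networks", "structured", "system", "systems")
right = any_of("information", "retrieval")
hits = left & right  # AND: intersection of the two groups

print(sorted(hits))
```

Document 3 matches the first group but neither "information" nor "retrieval", and document 5 matches only the second group, so both are excluded from the result.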

Advantages of Boolean Search
Simple and specific
Effective
  - AND reduces the number of hits very quickly
  - OR expands the search scope
Strongly logic-based
  - Proven mathematical foundations

Problems of Boolean Search
Boolean search is an exact search
  - A document is either retrieved or not retrieved.
  - Requesting “computer” would not find “computing” unless more programming is done.
No weighting can be done on terms
  - In the query A and B, you can’t specify that A is more important than B.

No ranking
  - Retrieved sets cannot be ordered based on Boolean logic.
  - Every retrieved document is treated equally.
Possible order confusion
  - A AND B OR C: does it mean (A AND B) OR C, or A AND (B OR C)?
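
The two readings of A AND B OR C can give different results, as this small sketch with made-up sets shows. In Python's set operators, `&` binds more tightly than `|`, so the unparenthesized form is read as (A AND B) OR C; a given search system may or may not follow the same convention.

```python
# Made-up document-number sets for three query terms
A = {1, 2}
B = {2, 3}
C = {4}

and_first = (A & B) | C   # (A AND B) OR C
or_first = A & (B | C)    # A AND (B OR C)
unparenthesized = A & B | C  # Python applies & before |

print(sorted(and_first))
print(sorted(or_first))
print(unparenthesized == and_first)
```

Writing the parentheses explicitly in a query avoids depending on whichever precedence rule the system applies.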

Vectors
A numerical representation for a point in a multi-dimensional space.
  - (x_1, x_2, ..., x_n)
  - The dimensions of the space need to be defined.
  - A measure of the space needs to be defined.

Vector Representation of Document Space
  - Each indexing term is a dimension.
  - Each document is a vector:
      D_i = (t_i1, t_i2, t_i3, t_i4, ..., t_in)
      D_j = (t_j1, t_j2, t_j3, t_j4, ..., t_jn)
  - Document similarity is defined as [formula not preserved in this transcript]
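
One common similarity measure for such vectors is cosine similarity. It is consistent with the scores in the worked example that follows (for instance S(q, A1) = 0.71 for q = (1, 1, 0) and A1 = (1, 0, 0)), but treating it as the formula intended on this slide is an assumption:

```latex
S(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}
                   {\sqrt{\sum_{k=1}^{n} t_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} t_{jk}^{2}}}
```

The value is 1 when two vectors point in the same direction and 0 when they share no terms.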

Example:
A document space is defined by three terms:
  - hardware, software, user
A set of documents is defined as:
  - A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
  - A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
  - A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
If the query is “hardware and software”, what documents should be retrieved?

In Boolean query matching:
  - Documents A4 and A7 will be retrieved by “ANDing” the two query terms.
  - A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved if the two query terms are “ORed” together.
In vector query matching:
  - q = (1, 1, 0)
  - S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
  - S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
  - S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
  - Retrieved set (in ranked order) = {A4, A7, A1, A2, A5, A6, A8, A9}
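
The vector scores above can be reproduced with cosine similarity. That the slide used cosine is an assumption, though the numbers agree with it; the `cosine` helper and the tie-breaking by document order are choices made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Dimensions: (hardware, software, user)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

scores = {name: cosine(q, vec) for name, vec in docs.items()}

# Keep documents with a non-zero score, highest first.
# sorted() is stable, so ties keep the A1..A9 order.
ranked = sorted(
    (name for name, s in scores.items() if s > 0),
    key=lambda name: -scores[name],
)

for name in ranked:
    print(name, round(scores[name], 2))
```

Unlike the Boolean result, the output is an ordered list: A4 matches the query exactly, A7 contains both terms plus an extra one, and the remaining documents match only one query term.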

Weights in the Vector Space
A main advantage of the vector representation is that items in vectors don’t have to be just 0 or 1 (true or false).
  - A1 = (0.7, 0.5, 0.3)
  - A2 = (0.5, 0.2, 0.7)
  - A3 = (0.3, 0.6, 0.9)
  - A4 = (0.7, 0.9, 1.0)
Queries may also be weighted:
  - Q = (0.7, 0.3, 0)
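
As an illustration, the weighted documents above can be ranked against the weighted query. The sketch assumes cosine similarity as the matching function, which the slide does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {
    "A1": (0.7, 0.5, 0.3),
    "A2": (0.5, 0.2, 0.7),
    "A3": (0.3, 0.6, 0.9),
    "A4": (0.7, 0.9, 1.0),
}
Q = (0.7, 0.3, 0)

scores = {name: cosine(Q, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 2))
```

Under this assumption the order is A1, A4, A2, A3: A1 puts most of its weight on the first term, which is also the term the query emphasizes.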

TF and IDF
TF: term frequency
  - Number of times a term occurs in a document
DF: document frequency
  - Number of documents that contain the term
IDF: inverse document frequency
  - IDF = log(N/n_i)
  - N: the total number of documents
  - n_i: the number of documents that contain term i

Salton’s Vector Space
A document is represented as a vector:
  - (W_1, W_2, ..., W_n)
  - Binary:
      - W_i = 1 if the corresponding term is in the document
      - W_i = 0 if the term is not in the document
  - TF (term frequency):
      - W_i = tf_i, where tf_i is the number of times the term occurs in the document
  - TF*IDF (term frequency times inverse document frequency):
      - W_i = tf_i * idf_i = tf_i * (1 + log(N/df_i)), where df_i is the number of documents that contain term i, and N is the total number of documents in the collection
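
The weighting formulas can be tried on a toy collection. In this sketch the four-document corpus is made up, the natural logarithm is an assumption (the slides do not state the base), and the function names are hypothetical.

```python
import math
from collections import Counter

def idf(n_docs, doc_freq):
    """Inverse document frequency log(N / n_i); natural log is an assumption."""
    return math.log(n_docs / doc_freq)

def tf_idf_vector(doc_tokens, vocabulary, doc_freqs, n_docs):
    """Weights W_i = tf_i * (1 + log(N / df_i)), one per vocabulary term."""
    counts = Counter(doc_tokens)  # tf_i; terms absent from the document count as 0
    return [
        counts[term] * (1 + idf(n_docs, doc_freqs[term]))
        for term in vocabulary
    ]

# Made-up corpus: each document is a list of already-parsed words
corpus = [
    ["hardware", "software", "software"],
    ["software", "user"],
    ["user", "user", "hardware"],
    ["software"],
]
vocabulary = ["hardware", "software", "user"]
n_docs = len(corpus)

# df_i: number of documents that contain term i
doc_freqs = {term: sum(term in doc for doc in corpus) for term in vocabulary}

vectors = [tf_idf_vector(doc, vocabulary, doc_freqs, n_docs) for doc in corpus]
for doc, vec in zip(corpus, vectors):
    print(doc, [round(w, 2) for w in vec])
```

A term that appears in every document has log(N/df_i) = 0, so under the 1 + log form it still keeps its raw term frequency as weight, whereas under the plain log(N/n_i) form its weight would drop to zero.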

In vector space, documents and queries are treated the same.
  - It is easier to do similarity search:
      “find documents like this one”
  - It is easier to cluster documents:
      “group documents into categories and subcategories”
  - It is easier to display search results graphically:
      “giving meaning to place or location in the multi-dimensional space”

Web Indexing
Most web indexing is vector-based indexing, with variations:
  - Robot indexing software keeps traversing the web to collect more pages and terms.
  - Servers build a huge inverted index and vector index database.
  - Search engines conduct different types of vector query matching.
  - Only a few search engines implement true Boolean query matching.

The real differences among search engines are
  - their index weighting schemes
  - their query processing methods
  - their ranking algorithms
None of these are published by any of the search engine firms.

Alternative IR Models
Probabilistic Model
  - Given a document d, how likely would the user consider it relevant?
  - How likely would the user consider it not relevant?
  - If these two are known, the similarity of document d and query q can be defined as:
      S(d, q) = (probability that d is relevant to q) / (probability that d is not relevant to q)

Examples:
  - If a document is 80% likely to be relevant to query q, what is its (probabilistic) similarity?
  - If a document is only 30% likely to be relevant, what is the similarity?

If there are 100 documents and 10 are relevant to a query,
  - What is the probability of relevance for a randomly selected document?
  - What is the similarity of this document to the query?
  - Any retrieval system must do much better than that.
  - In general, retrieval systems should retrieve those documents with S > 1.
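
The questions on the last two slides can be checked numerically. The sketch assumes the probability of non-relevance is simply one minus the probability of relevance, so the similarity is the odds of relevance; the function name is hypothetical.

```python
def odds_similarity(p_relevant):
    """S(d, q) = P(relevant) / P(not relevant), with P(not relevant) = 1 - P(relevant)."""
    if not 0 <= p_relevant < 1:
        raise ValueError("p_relevant must be in [0, 1)")
    return p_relevant / (1 - p_relevant)

cases = {
    "80% likely relevant": 0.8,
    "30% likely relevant": 0.3,
    "random pick, 10 relevant out of 100": 10 / 100,
}

for label, p in cases.items():
    s = odds_similarity(p)
    verdict = "retrieve" if s > 1 else "do not retrieve"
    print(f"{label}: S = {s:.2f} ({verdict})")
```

S exceeds 1 exactly when a document is more likely relevant than not, which is why S > 1 is a natural retrieval threshold; a randomly chosen document in the 10-out-of-100 case scores far below it.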

Advantages of the probabilistic model
  - Documents can be ranked by their relevance probability.
  - Relevance probability can be improved through the interaction process.
  - Good mathematical model
Disadvantages:
  - Involves many assumptions
  - Not very practical

Fuzzy Set Model
Fuzzy Set Theory
  - Extension of Boolean set theory
  - Instead of a binary membership definition, fuzzy set membership is continuously defined between 0 and 1.
  - Example:
      - {male students in our class}
      - {tall students in our class}
      - One is a Boolean set and one is a fuzzy set.
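
The contrast between the two example sets can be sketched as two membership functions. The linear ramp and the 160 cm and 190 cm thresholds for "tall" are arbitrary assumptions chosen for illustration, as are the student records.

```python
def is_male(student):
    """Boolean membership: exactly 0 or 1."""
    return 1 if student["sex"] == "M" else 0

def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy membership in 'tall': 0 below `low`, 1 above `high`, linear in between.

    The thresholds are hypothetical; any monotone function into [0, 1] would do.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

# Hypothetical students
students = [
    {"name": "Pat", "sex": "M", "height_cm": 175},
    {"name": "Sam", "sex": "F", "height_cm": 190},
    {"name": "Lee", "sex": "M", "height_cm": 158},
]

for s in students:
    print(s["name"], is_male(s), tall_membership(s["height_cm"]))
```

A 175 cm student is neither clearly in nor clearly out of the "tall" set, and the fuzzy membership value expresses that directly.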

The set of retrieved documents should be considered as a fuzzy set.
  - Documents are not just relevant or not relevant.
  - Documents can be somewhat relevant.
  - Documents can be 80% likely to be relevant.
Good mathematical model, but not widely implemented and tested.

Latent Semantic Indexing Model
Map documents from a high-dimensional space to a lower-dimensional space, while maintaining document relationships.
  - For clustering
  - For visualization
It’s a popular advanced retrieval technique.
It’s computationally expensive.

Neural Network Model
Organize the document collection as a semantic network through learning
  - Use known queries/relevant documents to train the network, and later allow the network to predict relevance for new queries (supervised learning).
  - Use document-document relationships to “self-organize” the network and move relevant documents close to each other (unsupervised learning).

The Fusion Model
Retrieve documents based on text indexing (Boolean model, Vector Space Model, etc.)
Retrieve documents based on link models (citations, Google’s PageRank, etc.)
Retrieve documents based on classification models (classification schemes, thesauri, Yahoo categories, etc.)
“Fuse” the results together before responding to the user.

Models for Browsing
Flat model
  - No particular organization of materials
Hierarchical model
  - Assign documents to a hierarchical structure.
Hypertext model
  - Define appropriate links among related documents.
