Chapter 5: Introduction to Information Retrieval

Slides:

Advertisements

Similar presentations

Text Categorization.

Advertisements

WEB MINING. Why IR ？ Research & Fun

INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.

Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.

Introduction to Information Retrieval

Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.

IR Models: Overview, Boolean, and Vector

Information Retrieval in Practice

Search Engines and Information Retrieval

Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.

Information Retrieval Review

Chapter 7: Text mining UIC - CS 594 Bing Liu 1 1.

Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.

T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.

Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.

Ch 4: Information Retrieval and Text Mining

Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.

Computer comunication B Information retrieval. Information retrieval: introduction 1 This topic addresses the question on how it is possible to find relevant.

Evaluating the Performance of IR Sytems

Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:

1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.

Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.

Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.

Information Retrieval

Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.

Chapter 5: Information Retrieval and Web Search

Overview of Search Engines

1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.

Modeling (Chap. 2) Modern Information Retrieval Spring 2000.

Search Engines and Information Retrieval Chapter 1.

Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.

1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.

Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.

Google & Document Retrieval Qing Li School of Computing and Informatics Arizona State University.

Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.

Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.

Chapter 6: Information Retrieval and Web Search

Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.

Introduction to Digital Libraries hussein suleman uct cs honours 2003.

Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.

GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.

Information Retrieval and Web Search

Evaluation of Agent Building Tools and Implementation of a Prototype for Information Gathering Leif M. Koch University of Waterloo August 2001.

Web- and Multimedia-based Information Systems Lecture 2.

Web Mining ( 網路探勘 ) WM06 TLMXM1A Wed 8,9 (15:10-17:00) U705 Information Retrieval and Web Search ( 資訊檢索與網路搜尋 ) Min-Yuh Day 戴敏育 Assistant Professor.

Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.

1 Data Mining: Text Mining. 2 Information Retrieval Techniques Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms.

CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.

Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.

Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.

Data e Web Mining. - S. Orlando 1 Information Retrieval and Web Search Salvatore Orlando Bing Liu. “Web Data Mining: Exploring Hyperlinks, Contents”, and.

1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.

3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.

Information Retrieval in Practice

Plan for Today’s Lecture(s)

Search Engine Architecture

Text Based Information Retrieval

Information Retrieval and Web Search

IST 516 Fall 2011 Dongwon Lee, Ph.D.

Multimedia Information Retrieval

Representation of documents and queries

Text Categorization Assigning documents to a fixed set of categories

Data Mining Chapter 6 Search Engines

From frequency to meaning: vector space models of semantics

Chapter 5: Information Retrieval and Web Search

Information Retrieval and Web Design

Information Retrieval and Web Design

Presentation transcript:

Chapter 5: Introduction to Information Retrieval

Text mining It refers to data mining using text documents as data. There are many special techniques for pre-processing text documents to make them suitable for mining. Most of these techniques are from the field of “Information Retrieval”. CS583, Bing Liu, UIC

Information Retrieval (IR) Conceptually, information retrieval (IR) is the study of finding needed information. I.e., IR helps users find information that matches their information needs. Historically, information retrieval is about document retrieval, emphasizing document as the basic unit. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. IR has become a center of focus in the Web era. CS583, Bing Liu, UIC

Information Retrieval User Information Search/select Info. Needs Queries Stored Information Translating info. needs to queries Matching queries To stored information Query result evaluation Does information found match user’s information needs? CS583, Bing Liu, UIC

Text Processing Word (token) extraction Stop words Stemming Frequency counts CS583, Bing Liu, UIC 4

Stop words Reduce indexing (or data) file size Many of the most frequently used words in English are worthless in IR and text mining – these words are called stop words. the, of, and, to, …. Typically about 400 to 500 such words For an application, an additional domain specific stop words list may be constructed Why do we need to remove stop words? Reduce indexing (or data) file size stopwords accounts 20-30% of total word counts. Improve efficiency stop words are not useful for searching or text mining stop words always have a large number of hits CS583, Bing Liu, UIC

Stemming Techniques used to find out the root/stem of a word: E.g., user engineering users engineered used engineer using stem: use engineer Usefulness improving effectiveness of IR and text mining matching similar words reducing indexing size combing words with same roots may reduce indexing size as much as 40-50%. CS583, Bing Liu, UIC

Basic stemming methods remove ending if a word ends with a consonant other than s, followed by an s, then delete s. if a word ends in es, drop the s. if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter. …... transform words if a word ends with “ies” but not “eies” or “aies” then “ies --> y.” CS583, Bing Liu, UIC

Frequency counts Counts the number of times a word occurred in a document. Counts the number of documents in a collection that contains a word. Using occurrence frequencies to indicate relative importance of a word in a document. if a word appears often in a document, the document likely “deals with” subjects related to the word. CS583, Bing Liu, UIC

Vector Space Representation A document is represented as a vector: (W1, W2, … … , Wn) Binary: Wi= 1 if the corresponding term i (often a word) is in the document Wi= 0 if the term i is not in the document TF: (Term Frequency) Wi= tfi where tfi is the number of times the term occurred in the document TF*IDF: (Inverse Document Frequency) Wi =tfi*idfi=tfi*log(N/dfi)) where dfi is the number of documents contains term i, and N the total number of documents in the collection. CS583, Bing Liu, UIC 17

Vector Space and Document Similarity Each indexing term is a dimension. A indexing term is normally a word. Each document is a vector Di = (ti1, ti2, ti3, ti4, ... tin) Dj = (tj1, tj2, tj3, tj4, ..., tjn) Document similarity is defined as (cosine similarity) CS583, Bing Liu, UIC

Query formats Query is a representation of the user’s information needs Normally a list of words. Query as a simple question in natural language The system translates the question into executable queries Query as a document “Find similar documents like this one” The system defines what the similarity is CS583, Bing Liu, UIC 4

An Example A document Space is defined by three terms: hardware, software, users A set of documents are defined as: A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) A7=(1, 1, 1) A8=(1, 0, 1). A9=(0, 1, 1) If the Query is “hardware and software” what documents should be retrieved? CS583, Bing Liu, UIC

An Example (cont.) In Boolean query matching: document A4, A7 will be retrieved (“AND”) retrieved:A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) In similarity matching (cosine): q=(1, 1, 0) S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 Document retrieved set (with ranking)= {A4, A7, A1, A2, A5, A6, A8, A9} CS583, Bing Liu, UIC

Relevance feedback To improve retrieval effectiveness, we allow the use to give some feedback. Given some retrieved results, the user will tell the system that some documents are relevant and some are not. This give us a text classification problem! Rocchio was an early system for relevance feedback, and text classification. Given training documents compute a prototype vector for each class. Given test doc, assign to topic whose prototype (centroid) is nearest using cosine similarity. CS583, Bing Liu, UIC

Vector Space Representation Each doc j is a vector, one component for each term (= word). Have a vector space terms are attributes n docs live in this space even with stop word removal and stemming, we may have 10000+ dimensions, or even 1,000,000+ CS583, Bing Liu, UIC

Rocchio Algorithm Constructing document vectors into a prototype vector for each class cj.  and  are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally,  = 16 and  = 4. CS583, Bing Liu, UIC

Relevance judgment for IR A measurement of the outcome of a search or retrieval The judgment on what should or should not be retrieved. There is no simple answer to what is relevant and what is not relevant: need human users. difficult to define subjective depending on knowledge, needs, time,, etc. The central concept of information retrieval CS583, Bing Liu, UIC

Precision and Recall Given a query: Measures for system performance: Are all retrieved documents relevant? Have all the relevant documents been retrieved ? Measures for system performance: The first question is about the precision of the search The second is about the completeness (recall) of the search. CS583, Bing Liu, UIC

Web Search as a huge IR system A Web crawler (robot) crawls the Web to collect all the pages. Servers establish a huge inverted indexing database and other indexing databases At query (search) time, search engines conduct different types of vector query matching CS583, Bing Liu, UIC

Different search engines The real differences among different search engines are their indexing weight schemes their query process methods their ranking algorithms Few of these are published by any of the search engine company. CS583, Bing Liu, UIC