Information Retrieval (IR)

Information retrieval generally refers to the process of extracting relevant documents according to the information specified in a query.

IRS vs. DBMS
– DBMS: structured data; formal query formulation; deterministic; returns all relevant data; one-off queries
– IRS: unstructured data; casual query style; non-deterministic; returns the most relevant data; supports relevance feedback
Basic Components of an IRS

[Diagram: the IRS core connects Users, Linguistic Information, and a Knowledge Base; input documents come from text editors, file systems, and internet files; users submit a Query and receive Relevant Documents.]
Information Retrieval

The IR technology:
– Knowledge base: dictionary and rules
– Basic information representation model
– Indexing of documents for retrieval
– Relevance calculation

Oriental languages vs. English in IR: the main difference is in what is considered useful information in each language.
– Different NLP knowledge and variants of the common methods need to be used
Vector Space Model for Document Representation

Document D: an article in text form.
Terms T: basic language units, such as words or phrases.
D = D(T_1, T_2, ..., T_i, ..., T_n), where T_i and T_j may refer to the same word appearing in different places, and the order in which the terms appear is also relevant.
Term weight: each T_i has an associated weight W_i indicating the importance of T_i to D:
D = D(T_1 W_1; T_2 W_2; ...; T_n W_n)
For a given D, if we do not consider word repetition and order, and the terms are drawn from a known vocabulary T = (t_1, t_2, ..., t_K), where K is the number of words in the vocabulary, then D(T_1 W_1; T_2 W_2; ...; T_n W_n) can be represented in the Vector Space Model as:
D = D(W_1, W_2, ..., W_K)
Vector Space Model for Document Representation (continued)

(W_1, W_2, ..., W_K) can be considered a vector; (t_1, t_2, ..., t_K) defines the K-dimensional coordinate system, where K is a fixed number. Coordinate i holds the weight of term t_i. Different documents are then different vectors in the VSM.
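To make the representation concrete, here is a minimal Python sketch of mapping a document onto a weight vector over a fixed vocabulary. The vocabulary, sample document, and the use of raw term frequency as the weight W_k are illustrative assumptions, not part of the lecture's formal definition.

```python
# A minimal Vector Space Model sketch: each document becomes a
# K-dimensional weight vector over a fixed vocabulary (t_1, ..., t_K).
vocabulary = ["information", "retrieval", "database", "query", "index"]

def to_vector(tokens, vocab):
    """Map a tokenized document to (W_1, ..., W_K); here W_k is simply
    the raw frequency of vocabulary term t_k in the document (0 if absent)."""
    return [tokens.count(term) for term in vocab]

doc = "information retrieval uses an index to answer a query".split()
print(to_vector(doc, vocabulary))  # [1, 1, 0, 1, 1]
```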
Similarity: the Degree of Relevance of Two Documents

The degree of relevance of D_1 and D_2 can be described by a so-called similarity function, Sim(D_1, D_2), which describes their distance. There are many different definitions of Sim(D_1, D_2).
– One simple definition (inner product):
  Sim(D_1, D_2) = \sum_{k=1}^{n} w_{1k} w_{2k}
– Example: D_1 = (1, 0, 0, 1, 1, 1), D_2 = (0, 1, 1, 0, 1, 0)
  Sim(D_1, D_2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0 = 1
– Another definition (cosine):
  Sim(D_1, D_2) = cos θ = (\sum_{k=1}^{n} w_{1k} w_{2k}) / \sqrt{(\sum_{k=1}^{n} w_{1k}^2)(\sum_{k=1}^{n} w_{2k}^2)}
For information retrieval, D_2 can be a query Q. Suppose there are I documents D_i, where i = 1 to I. Rank the values Sim(D_i, Q): the higher the value, the more relevant D_i is to Q.
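The two similarity functions above, and the use of a query as just another vector, can be sketched in a few lines of Python. The document vectors are the slide's own example; the query vector Q is an assumed illustration.

```python
import math

def sim_inner(d1, d2):
    """Inner-product similarity: sum of pairwise weight products."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

def sim_cosine(d1, d2):
    """Cosine similarity: inner product normalized by vector lengths."""
    norm = math.sqrt(sim_inner(d1, d1) * sim_inner(d2, d2))
    return sim_inner(d1, d2) / norm if norm else 0.0

# The slide's example vectors:
D1 = [1, 0, 0, 1, 1, 1]
D2 = [0, 1, 1, 0, 1, 0]
print(sim_inner(D1, D2))   # 1, as computed on the slide
print(sim_cosine(D1, D2))  # about 0.29

# Retrieval: treat the query Q as a vector and rank documents by similarity.
Q = [1, 0, 0, 0, 1, 1]  # assumed query vector
for name, d in sorted({"D1": D1, "D2": D2}.items(),
                      key=lambda kv: sim_cosine(kv[1], Q), reverse=True):
    print(name, round(sim_cosine(d, Q), 3))  # D1 ranks above D2
```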
Term Selection for Indexing

T can be considered the set of all terms that can be indexed:
– Approach 1: simply use a word dictionary
– Approach 2: terms in a dictionary + terms segmented dynamically => T is not necessarily static
Every document D_i needs to be segmented.
– The vocabulary for indexing is normally much smaller than the vocabulary of the documents: not every word T_k in D_i that is in T will be indexed.
– A term T_k in D_i that is in T but is not indexed is considered to have weight w_{ik} = 0; in other words, all indexed terms of D_i have weight greater than zero.
The process of selecting the terms in a D_i for indexing is called term selection.

Word frequency in a document is related to the information the article intends to convey; thus word frequency was often used in early term selection and weight assignment algorithms.

Zipf's law in information theory: for a given document set, rank the terms according to their frequency. Then
Freq(t_i) × rank(t_i) ≈ constant
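A toy numeric check of Zipf's law, with made-up frequencies chosen to follow the curve: ranking terms by frequency and multiplying each frequency by its rank gives a roughly constant product.

```python
# Toy check of Zipf's law with assumed frequencies: rank terms by
# frequency; freq * rank stays roughly constant down the ranking.
freqs = {"the": 1200, "of": 600, "to": 400, "query": 300, "index": 240}
ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (term, freq) in enumerate(ranked, start=1):
    print(term, freq * rank)  # 1200 for every term in this toy data
```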
The H.P. Luhn Method (Term Selection)

Suppose N documents form a document set D_set = { D_i, 1 <= i <= N }.
(1) Freq_{ik}: the frequency of T_k in D_i; TotFreq_k: the frequency of T_k in D_set.
(2) Then TotFreq_k = \sum_{i=1}^{N} Freq_{ik}
(3) Sort the TotFreq_k in descending order; select an upper bound C_u-b and a lower bound C_l-b, and index only the terms whose total frequency falls between the two bounds.
The method works on absolute frequency, and its quality depends on the choice of C_u-b and C_l-b.
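A sketch of Luhn's cutoff selection in Python. The tiny corpus and the bound values are invented for illustration; in practice the bounds must be tuned for the document set.

```python
from collections import Counter

# Assumed toy document set (already segmented into terms).
documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a cat and a dog met".split(),
]

# TotFreq_k: frequency of each term over the whole document set.
tot_freq = Counter()
for doc in documents:
    tot_freq.update(doc)

C_UB, C_LB = 3, 2  # assumed upper/lower cutoffs on absolute frequency
index_terms = {t for t, f in tot_freq.items() if C_LB <= f <= C_UB}
print(sorted(index_terms))  # ['a', 'cat', 'dog']
# 'the' (freq 4) falls above C_UB; one-off terms fall below C_LB.
```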
H.P. Luhn's method is a very rough association of term frequency with the information a document conveys.
– Some low-frequency terms may be very important to a particular article (document), and that may be exactly why they do not appear often in other articles.
Keys in term selection for indexing: completeness and accuracy.
– Related to the article so that it can be indexed for retrieval (completeness)
– Distinguishes one article from other articles (accuracy and representativeness)
Example: the term "電腦" ("computer") is not an important term in a document set about computers; however, it is probably important in a "hardware devices" set.
Relative frequency: judge a term by its frequency relative to the document set rather than by its absolute count alone.
Weight Assignment Algorithm

Assumptions:
– The higher Freq_{ik} (the frequency of t_k in D_i), the more important t_k is to D_i.
– The higher TotFreq_k (the frequency of t_k in D_set), the less important t_k is to D_i => use \log_2(N / TotFreq_k).
The weight is assigned based on these assumptions:
W_{ik} = Freq_{ik} + Freq_{ik} (\log_2 N - \log_2 TotFreq_k) = Freq_{ik} (1 + \log_2(N / TotFreq_k))
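The weight formula can be computed directly; the counts below are assumed numbers, purely for illustration.

```python
import math

def weight(freq_ik, tot_freq_k, n_docs):
    """W_ik = Freq_ik + Freq_ik * (log2 N - log2 TotFreq_k)
            = Freq_ik * (1 + log2(N / TotFreq_k))"""
    return freq_ik * (1 + math.log2(n_docs / tot_freq_k))

# Assumed numbers: N = 1000 documents; t_k occurs 5 times in D_i
# and 50 times across the whole set.
print(weight(freq_ik=5, tot_freq_k=50, n_docs=1000))  # 5*(1+log2 20) ≈ 26.6
```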
More Considerations in Relevance

Long articles are more likely to be retrieved => discount by a length factor.
Document feature terms: the most frequently used terms tend to appear in many articles, so they do not serve to distinguish one article from others.
Expressiveness of terms: high-frequency, low-frequency, normal-frequency. Normal-frequency and low-frequency terms convey more about an article's features/theme.
Word class: nouns convey more information (實詞, content words, vs. 虛詞, function words).
– Use of PoS tags and also a stoplist (Slist)
Syntactic word class information: nouns are more related to concepts.
人口 的 自然 增長 是 由 出生 和 死亡 之間 的 差額 所 形成 的
("The natural growth of the population is formed by the difference between births and deaths.")
Class: 名 (noun), 助 (particle), 副 (adverb), 名, 動 (verb), 介 (preposition), 名, 連 (conjunction), 名, 名, 助, 名, 助, 動, 助
Slist: the stoplist eliminates words that the class filter alone cannot remove, such as "是" ("is"). Only terms not on the stoplist are used in the frequency calculation; a sketch follows below.
Semantic word classification: extracting concepts, e.g. mapping the content words of
人口的自然增長是由出生和死亡之間的差額形成的。
to concepts such as 人 (person), 必然 (inevitable), 增多 (increase), 誕生 (be born), 死 (die), 期間 (period), 數量 (quantity), 興起 (arise).
Thesaurus: co-occurrence of related terms
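A sketch of class-plus-stoplist filtering on the slide's example sentence. The content-class set and the one-word stoplist are toy assumptions; a real system would rely on a PoS tagger and a fuller stoplist.

```python
# The slide's sentence, tokenized and tagged with its word classes.
tagged = [("人口", "名"), ("的", "助"), ("自然", "副"), ("增長", "名"),
          ("是", "動"), ("由", "介"), ("出生", "名"), ("和", "連"),
          ("死亡", "名"), ("之間", "名"), ("的", "助"), ("差額", "名"),
          ("所", "助"), ("形成", "動"), ("的", "助")]

CONTENT_CLASSES = {"名", "動"}  # keep nouns and verbs (content words)
STOPLIST = {"是"}               # "是" (copula) passes the class filter
                                # but carries no content, so stoplist it

index_terms = [w for w, c in tagged
               if c in CONTENT_CLASSES and w not in STOPLIST]
print(index_terms)
# ['人口', '增長', '出生', '死亡', '之間', '差額', '形成']
```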
Indexing of Phrases (Grammatical Analysis)
– Example: artificial intelligence (人工智能). It is more relevant to index "artificial intelligence" (人工智能) as one unit rather than as two independent terms.
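One simple way to realize phrase indexing is to merge adjacent tokens that match a phrase dictionary before indexing. The dictionary and the greedy two-word match below are illustrative assumptions, not the lecture's method (which appeals to grammatical analysis).

```python
# Assumed phrase dictionary; adjacent tokens matching an entry are
# merged into one indexing unit.
PHRASES = {("artificial", "intelligence"), ("information", "retrieval")}

def merge_phrases(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:          # index the phrase as one term
            out.append(" ".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("research on artificial intelligence methods".split()))
# ['research', 'on', 'artificial intelligence', 'methods']
```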
Information Retrieval System Architecture

[Diagram: document sources feed an indexing engine, which builds an indexed document repository; customer inquiries are turned into an indexed query expression; relevance calculation retrieves documents, which are evaluated and returned by ranking; relevance feedback drives query optimization before the result is returned.]
Bilingual (Multilingual) IR

Retargetable IR approach: monolingual, but the IR system can be configured for different languages.
Do we need bi- or multilingual IR?
– Bi- and multilingual communities read text in more than one language, and want to find text in more than one language!
– Retrieval of legal information in Chinese and English.
– Retrieval of reports on one person in newspapers written in different languages.
Dictionary approach
– Normalize indexing and searching into one language (saves storage).
– Translation equivalences (often multiple per term) must be determined during indexing and during extraction of the query's term vector (increases indexing time):
  - it is not easy to obtain good translation equivalences
  - many proper nouns are not in the translation dictionary and need to be found and mapped to their corresponding target translations
– Inflexible: cannot use user-specified translation equivalences.
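A minimal sketch of the normalization step of the dictionary approach: terms from either language are mapped into the single indexing language before indexing or query-vector extraction. The two-entry dictionary and the single-translation rule are assumptions; as the slide notes, real systems face multiple translation equivalents and out-of-dictionary proper nouns.

```python
# Assumed two-entry translation dictionary; real systems must handle
# multiple equivalents per term and unknown proper nouns.
ZH_TO_EN = {"人工智能": "artificial intelligence", "電腦": "computer"}

def normalize(term):
    """Map a term into the single indexing language (here English);
    terms without a dictionary entry are kept unchanged."""
    return ZH_TO_EN.get(term, term)

query = ["電腦", "retrieval"]
print([normalize(t) for t in query])  # ['computer', 'retrieval']
```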
Multilingual indexing approach
– Indexing for all the different languages:
  - higher storage cost
  - different indexing techniques for different languages (e.g. English and Chinese)
– Flexible: can use system-supplied or user-supplied translation equivalences.
– Supports exact match in different languages.