Information Retrieval (IR)

Information retrieval generally refers to the process of extracting relevant documents according to the information specified in a query.

IRS vs. DBMS
– DBMS: structured data; formal query formulation; deterministic; returns all relevant data; one-off queries
– IRS: unstructured data; casual query style; non-deterministic; returns the most relevant data; supports relevance feedback
Basic Components of an IRS

[Diagram: the IRS core connects Users, Linguistic Information, and a Knowledge Base; input documents come from text editors, file systems, and internet files; users submit a Query and receive Relevant Documents.]
Information Retrieval

The IR technology:
– Knowledge base: dictionary and rules
– Basic information representation model
– Indexing of documents for retrieval
– Relevance calculation

Oriental languages vs. English in IR: the main difference is in what is considered useful information in each language.
– Different NLP knowledge and variants of the common methods need to be used
Vector Space Model for Document Representation

Document D: an article in text form.
Terms T: basic language units, such as words or phrases.
D = D(T_1, T_2, ..., T_i, ..., T_n), where T_i and T_j may refer to the same word appearing in different places, and the order in which the terms appear is also relevant.
Term weight: each T_i has an associated weight W_i indicating the importance of T_i to D:
D = D(T_1 W_1; T_2 W_2; ...; T_n W_n)
For a given D, if we do not consider word repetition and order, and the terms are drawn from a known vocabulary T = (t_1, t_2, ..., t_K), where K is the number of words in the vocabulary, then D(T_1 W_1; T_2 W_2; ...; T_n W_n) can be represented in the Vector Space Model as:
D = D(W_1, W_2, ..., W_K)
Vector Space Model for Document Representation (continued)

(W_1, W_2, ..., W_K) can be considered a vector; (t_1, t_2, ..., t_K) defines the K-dimensional coordinate system, where K is a fixed number. Coordinate i holds the weight of term t_i. Different documents are then different vectors in the VSM.
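To make the representation concrete, here is a minimal Python sketch of mapping a document onto a weight vector over a fixed vocabulary. The vocabulary, sample document, and the use of raw term frequency as the weight W_k are illustrative assumptions, not part of the lecture's formal definition.

```python
# A minimal Vector Space Model sketch: each document becomes a
# K-dimensional weight vector over a fixed vocabulary (t_1, ..., t_K).
vocabulary = ["information", "retrieval", "database", "query", "index"]

def to_vector(tokens, vocab):
    """Map a tokenized document to (W_1, ..., W_K); here W_k is simply
    the raw frequency of vocabulary term t_k in the document (0 if absent)."""
    return [tokens.count(term) for term in vocab]

doc = "information retrieval uses an index to answer a query".split()
print(to_vector(doc, vocabulary))  # [1, 1, 0, 1, 1]
```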
Similarity: the Degree of Relevance of Two Documents

The degree of relevance of D_1 and D_2 can be described by a so-called similarity function, Sim(D_1, D_2), which describes their distance. There are many different definitions of Sim(D_1, D_2).
– One simple definition (inner product):
  Sim(D_1, D_2) = \sum_{k=1}^{n} w_{1k} w_{2k}
– Example: D_1 = (1, 0, 0, 1, 1, 1), D_2 = (0, 1, 1, 0, 1, 0)
  Sim(D_1, D_2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0 = 1
– Another definition (cosine):
  Sim(D_1, D_2) = cos θ = (\sum_{k=1}^{n} w_{1k} w_{2k}) / \sqrt{(\sum_{k=1}^{n} w_{1k}^2)(\sum_{k=1}^{n} w_{2k}^2)}
For information retrieval, D_2 can be a query Q. Suppose there are I documents D_i, where i = 1 to I. Rank the values Sim(D_i, Q): the higher the value, the more relevant D_i is to Q.
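The two similarity functions above, and the use of a query as just another vector, can be sketched in a few lines of Python. The document vectors are the slide's own example; the query vector Q is an assumed illustration.

```python
import math

def sim_inner(d1, d2):
    """Inner-product similarity: sum of pairwise weight products."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

def sim_cosine(d1, d2):
    """Cosine similarity: inner product normalized by vector lengths."""
    norm = math.sqrt(sim_inner(d1, d1) * sim_inner(d2, d2))
    return sim_inner(d1, d2) / norm if norm else 0.0

# The slide's example vectors:
D1 = [1, 0, 0, 1, 1, 1]
D2 = [0, 1, 1, 0, 1, 0]
print(sim_inner(D1, D2))   # 1, as computed on the slide
print(sim_cosine(D1, D2))  # about 0.29

# Retrieval: treat the query Q as a vector and rank documents by similarity.
Q = [1, 0, 0, 0, 1, 1]  # assumed query vector
for name, d in sorted({"D1": D1, "D2": D2}.items(),
                      key=lambda kv: sim_cosine(kv[1], Q), reverse=True):
    print(name, round(sim_cosine(d, Q), 3))  # D1 ranks above D2
```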
Term Selection for Indexing

T can be considered the set of all terms that can be indexed:
– Approach 1: simply use a word dictionary
– Approach 2: terms in a dictionary + terms segmented dynamically => T is not necessarily static
Every document D_i needs to be segmented.
– The vocabulary for indexing is normally much smaller than the vocabulary of the documents: not every word T_k in D_i that is in T will be indexed.
– A term T_k in D_i that is in T but is not indexed is considered to have weight w_{ik} = 0; in other words, all indexed terms of D_i have weight greater than zero.
The process of selecting the terms in a D_i for indexing is called term selection.

Word frequency in a document is related to the information the article intends to convey; thus word frequency was often used in early term selection and weight assignment algorithms.

Zipf's law in information theory: for a given document set, rank the terms according to their frequency. Then
Freq(t_i) × rank(t_i) ≈ constant
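A toy numeric check of Zipf's law, with made-up frequencies chosen to follow the curve: ranking terms by frequency and multiplying each frequency by its rank gives a roughly constant product.

```python
# Toy check of Zipf's law with assumed frequencies: rank terms by
# frequency; freq * rank stays roughly constant down the ranking.
freqs = {"the": 1200, "of": 600, "to": 400, "query": 300, "index": 240}
ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (term, freq) in enumerate(ranked, start=1):
    print(term, freq * rank)  # 1200 for every term in this toy data
```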
The H.P. Luhn Method (Term Selection)

Suppose N documents form a document set D_set = { D_i, 1 <= i <= N }.
(1) Freq_{ik}: the frequency of T_k in D_i; TotFreq_k: the frequency of T_k in D_set.
(2) Then TotFreq_k = \sum_{i=1}^{N} Freq_{ik}
(3) Sort the TotFreq_k in descending order; select an upper bound C_u-b and a lower bound C_l-b, and index only the terms whose total frequency falls between the two bounds.
The method works on absolute frequency, and its quality depends on the choice of C_u-b and C_l-b.
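A sketch of Luhn's cutoff selection in Python. The tiny corpus and the bound values are invented for illustration; in practice the bounds must be tuned for the document set.

```python
from collections import Counter

# Assumed toy document set (already segmented into terms).
documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a cat and a dog met".split(),
]

# TotFreq_k: frequency of each term over the whole document set.
tot_freq = Counter()
for doc in documents:
    tot_freq.update(doc)

C_UB, C_LB = 3, 2  # assumed upper/lower cutoffs on absolute frequency
index_terms = {t for t, f in tot_freq.items() if C_LB <= f <= C_UB}
print(sorted(index_terms))  # ['a', 'cat', 'dog']
# 'the' (freq 4) falls above C_UB; one-off terms fall below C_LB.
```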
H.P. Luhn's method is a very rough association of term frequency with the information a document conveys.
– Some low-frequency terms may be very important to a particular article (document), and that may be exactly why they do not appear often in other articles.
Keys in term selection for indexing: completeness and accuracy.
– Related to the article so that it can be indexed for retrieval (completeness)
– Distinguishes one article from other articles (accuracy and representativeness)
Example: the term "電腦" ("computer") is not an important term in a document set about computers; however, it is probably important in a "hardware devices" set.
Relative frequency: judge a term by its frequency relative to the document set rather than by its absolute count alone.
Weight Assignment Algorithm

Assumptions:
– The higher Freq_{ik} (the frequency of t_k in D_i), the more important t_k is to D_i.
– The higher TotFreq_k (the frequency of t_k in D_set), the less important t_k is to D_i => use \log_2(N / TotFreq_k).
The weight is assigned based on these assumptions:
W_{ik} = Freq_{ik} + Freq_{ik} (\log_2 N - \log_2 TotFreq_k) = Freq_{ik} (1 + \log_2(N / TotFreq_k))
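The weight formula can be computed directly; the counts below are assumed numbers, purely for illustration.

```python
import math

def weight(freq_ik, tot_freq_k, n_docs):
    """W_ik = Freq_ik + Freq_ik * (log2 N - log2 TotFreq_k)
            = Freq_ik * (1 + log2(N / TotFreq_k))"""
    return freq_ik * (1 + math.log2(n_docs / tot_freq_k))

# Assumed numbers: N = 1000 documents; t_k occurs 5 times in D_i
# and 50 times across the whole set.
print(weight(freq_ik=5, tot_freq_k=50, n_docs=1000))  # 5*(1+log2 20) ≈ 26.6
```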
More Considerations in Relevance

Long articles are more likely to be retrieved => discount by a length factor.
Document feature terms: the most frequently used terms tend to appear in many articles, so they do not serve to distinguish one article from others.
Expressiveness of terms: high-frequency, low-frequency, normal-frequency. Normal-frequency and low-frequency terms convey more about an article's features/theme.
Word class: nouns convey more information (實詞, content words, vs. 虛詞, function words).
– Use of PoS tags and also a stoplist (Slist)
Syntactic word class information: nouns are more related to concepts.
人口 的 自然 增長 是 由 出生 和 死亡 之間 的 差額 所 形成 的
("The natural growth of the population is formed by the difference between births and deaths.")
Class: 名 (noun), 助 (particle), 副 (adverb), 名, 動 (verb), 介 (preposition), 名, 連 (conjunction), 名, 名, 助, 名, 助, 動, 助
Slist: the stoplist eliminates words that the class filter alone cannot remove, such as "是" ("is"). Only terms not on the stoplist are used in the frequency calculation; a sketch follows below.
Semantic word classification: extracting concepts, e.g. mapping the content words of
人口的自然增長是由出生和死亡之間的差額形成的。
to concepts such as 人 (person), 必然 (inevitable), 增多 (increase), 誕生 (be born), 死 (die), 期間 (period), 數量 (quantity), 興起 (arise).
Thesaurus: co-occurrence of related terms
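A sketch of class-plus-stoplist filtering on the slide's example sentence. The content-class set and the one-word stoplist are toy assumptions; a real system would rely on a PoS tagger and a fuller stoplist.

```python
# The slide's sentence, tokenized and tagged with its word classes.
tagged = [("人口", "名"), ("的", "助"), ("自然", "副"), ("增長", "名"),
          ("是", "動"), ("由", "介"), ("出生", "名"), ("和", "連"),
          ("死亡", "名"), ("之間", "名"), ("的", "助"), ("差額", "名"),
          ("所", "助"), ("形成", "動"), ("的", "助")]

CONTENT_CLASSES = {"名", "動"}  # keep nouns and verbs (content words)
STOPLIST = {"是"}               # "是" (copula) passes the class filter
                                # but carries no content, so stoplist it

index_terms = [w for w, c in tagged
               if c in CONTENT_CLASSES and w not in STOPLIST]
print(index_terms)
# ['人口', '增長', '出生', '死亡', '之間', '差額', '形成']
```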
Indexing of Phrases (Grammatical Analysis)
– Example: artificial intelligence (人工智能). It is more relevant to index "artificial intelligence" (人工智能) as one unit rather than as two independent terms.
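One simple way to realize phrase indexing is to merge adjacent tokens that match a phrase dictionary before indexing. The dictionary and the greedy two-word match below are illustrative assumptions, not the lecture's method (which appeals to grammatical analysis).

```python
# Assumed phrase dictionary; adjacent tokens matching an entry are
# merged into one indexing unit.
PHRASES = {("artificial", "intelligence"), ("information", "retrieval")}

def merge_phrases(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:          # index the phrase as one term
            out.append(" ".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("research on artificial intelligence methods".split()))
# ['research', 'on', 'artificial intelligence', 'methods']
```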
Information Retrieval System Architecture

[Diagram: document sources feed an indexing engine, which builds an indexed document repository; customer inquiries are turned into an indexed query expression; relevance calculation retrieves documents, which are evaluated and returned by ranking; relevance feedback drives query optimization before the result is returned.]
Bilingual (Multilingual) IR

Retargetable IR approach: monolingual, but the IR system can be configured for different languages.
Do we need bi- or multilingual IR?
– Bi- and multilingual communities read text in more than one language, and want to find text in more than one language!
– Retrieval of legal information in Chinese and English.
– Retrieval of reports on one person in newspapers written in different languages.
Dictionary approach
– Normalize indexing and searching into one language (saves storage).
– Translation equivalences (often multiple per term) must be determined during indexing and during extraction of the query's term vector (increases indexing time):
  - it is not easy to obtain good translation equivalences
  - many proper nouns are not in the translation dictionary and need to be found and mapped to their corresponding target translations
– Inflexible: cannot use user-specified translation equivalences.
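A minimal sketch of the normalization step of the dictionary approach: terms from either language are mapped into the single indexing language before indexing or query-vector extraction. The two-entry dictionary and the single-translation rule are assumptions; as the slide notes, real systems face multiple translation equivalents and out-of-dictionary proper nouns.

```python
# Assumed two-entry translation dictionary; real systems must handle
# multiple equivalents per term and unknown proper nouns.
ZH_TO_EN = {"人工智能": "artificial intelligence", "電腦": "computer"}

def normalize(term):
    """Map a term into the single indexing language (here English);
    terms without a dictionary entry are kept unchanged."""
    return ZH_TO_EN.get(term, term)

query = ["電腦", "retrieval"]
print([normalize(t) for t in query])  # ['computer', 'retrieval']
```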
Multilingual indexing approach
– Indexing for all the different languages:
  - higher storage cost
  - different indexing techniques for different languages (e.g. English and Chinese)
– Flexible: can use system-supplied or user-supplied translation equivalences.
– Supports exact match in different languages.