Information Retrieval and Vector Space Model
Presented by Jun Miao, York University

What is Information Retrieval (IR)?
IR: retrieving information that is relevant to your need
• Search engines
• Question answering
• Information extraction
• Information filtering
• Information recommendation

In the old days…
• The term "information retrieval" may have been coined by Calvin Mooers.
• Early IR applications were used in libraries.
• Set-based retrieval: the system partitions the corpus into two subsets of documents, those it considers relevant to the search query and those it does not.

Nowadays
• Ranked retrieval: the system responds to a search query by ranking all documents in the corpus by its estimate of their relevance to the query.
◦ A free-form query expresses the user’s information need
◦ Documents are ranked by decreasing likelihood of relevance
◦ Many studies have found it superior to set-based retrieval

An Information Retrieval Process (borrowed from Prof. Nie’s slides)
[Diagram labels: Document collection, Info. need, Query, IR system, Retrieval, Answer list]

Inside an IR system
[Diagram labels: Documents, Indexing, Indices, Query, Analysis, Rank, List]

Indexing
Document → Break documents into words → Stop list → Stemming → Construct index

Lexical Analysis
• What counts as a word or token in the indexing scheme?
• A big topic

Stop List
• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• Stop list: contains the stop words, which are not used as index terms
◦ Prepositions
◦ Articles
◦ Pronouns
◦ Some adverbs and adjectives
◦ Some frequent words (e.g. document)
• The removal of stop words usually improves IR effectiveness.
• A few “standard” stop lists are commonly used.

Stemming
Reason:
◦ Different word forms may bear similar meaning (e.g. search, searching), so we create a “standard” representation for them
Stemming:
◦ Removing some endings of words
◦ Example: dancer, dancers, dance, danced, dancing → dance

Stemming (Cont’d)
Two main methods:
Linguistic/dictionary-based stemming
• Higher stemming accuracy
• Higher implementation and processing costs, and higher coverage
Porter-style stemming
• Lower stemming accuracy
• Lower implementation and processing costs, and lower coverage
• Usually sufficient for IR
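
The indexing steps above (breaking text into words, applying a stop list, stemming) can be sketched roughly as follows. This is only an illustration and not the Porter algorithm: the stop list, the suffix list, and the function names `tokenize`, `crude_stem`, and `index_terms` are assumptions made for this example.

```python
import re

# Hypothetical, very small stop list; real systems use a longer "standard" list.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

# Suffixes stripped by this toy stemmer, checked in this order.
# This is a crude stand-in for a Porter-style stemmer, not the real rules.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s", "e")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


stems = {crude_stem(w) for w in ["dancer", "dancers", "dance", "danced", "dancing"]}
terms = index_terms("Searching with the dancers")
```

With this toy rule set, all five word forms of "dance" collapse to the same stem. The stem is not itself a dictionary word, which is typical of rule-based stemmers.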

Flat file indexing
Each document is represented by a set of weighted keywords (terms):
D1 → {(t1, w1), (t2, w2), …}
e.g.
D1 → {(comput, 0.2), (architect, 0.3), …}
D2 → {(comput, 0.1), (network, 0.5), …}

Inverted Index
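
As a rough sketch of the idea, an inverted index maps each term to the documents that contain it, which is the flat-file representation turned around. The function name `build_inverted_index` and the posting format (a list of `(doc_id, weight)` pairs) are assumptions chosen for illustration. The weights reuse the flat-file example values.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Turn {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # sorted so that postings come out in a stable order
        for term, weight in docs[doc_id].items():
            index[term].append((doc_id, weight))
    return dict(index)


# Flat-file representation: one weighted keyword set per document.
docs = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}
index = build_inverted_index(docs)
```

Looking up `index["comput"]` then gives the postings for both documents without scanning every document.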

Query Analysis
• Parse query
• Clean stop words
• Stemming
• Get terms
• Adjacent operations: connect related terms together

Models
Matching score model
◦ Document D = a set of weighted keywords
◦ Query Q = a set of non-weighted keywords
◦ $R(D, Q) = \sum_i w(t_i, D)$, where $t_i$ is in Q
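
A minimal sketch of the matching score, assuming a document is stored as a dictionary from term to weight and a query as a plain list of terms. The name `matching_score` and the example query are hypothetical.

```python
def matching_score(doc_weights, query_terms):
    """R(D, Q): sum of the document weights of the terms that occur in the query."""
    return sum(doc_weights.get(term, 0.0) for term in set(query_terms))


d1 = {"comput": 0.2, "architect": 0.3}
d2 = {"comput": 0.1, "network": 0.5}
query = ["comput", "network"]

score_d1 = matching_score(d1, query)  # only "comput" matches
score_d2 = matching_score(d2, query)  # both query terms match
```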

Models (Cont’d)
• Boolean Model
• Vector Space Model
• Probability Model
• Language Model
• Neural Network Model
• Fuzzy Set Model
• …

tf*idf weighting scheme
tf = term frequency
◦ Frequency of a term/keyword in a document
◦ The higher the tf, the higher the importance (weight) for the document
df = document frequency
◦ Number of documents containing the term
◦ Distribution of the term
idf = inverse document frequency
◦ The unevenness of term distribution in the corpus
◦ The specificity of a term to a document
◦ $\mathrm{idf}(t) = \log(d / \mathrm{df}(t))$, where d = total number of documents
◦ The more evenly a term is distributed, the less specific it is to a document
$\mathrm{weight}(t, D) = \mathrm{tf}(t, D) \cdot \mathrm{idf}(t)$
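
The weighting above can be sketched as follows. The base of the logarithm is not fixed by the formula, so base 10 is an assumption here, as are the function name `tf_idf_weights` and the toy corpus.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: {doc_id: [term, ...]}. Returns (weights per document, idf per term)."""
    d = len(docs)                      # total number of documents
    df = Counter()                     # document frequency of each term
    for terms in docs.values():
        df.update(set(terms))          # count each term once per document
    idf = {t: math.log10(d / n) for t, n in df.items()}
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)            # raw term frequency in this document
        weights[doc_id] = {t: tf[t] * idf[t] for t in tf}
    return weights, idf


toy_docs = {
    "D1": ["comput", "architect", "comput"],
    "D2": ["comput", "network"],
}
weights, idf = tf_idf_weights(toy_docs)
```

In this toy corpus "comput" occurs in every document, so its idf and its weight are 0 even where its tf is 2. That is the "evenly distributed terms are not specific" point in numbers.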

Evaluation
• Given a result list for a query, what is its performance?
[Venn diagram: the set of relevant documents and the set of retrieved documents overlap; the overlap is the retrieved relevant documents]

Metrics often used (together):
• Precision = retrieved relevant docs / retrieved docs
• Recall = retrieved relevant docs / relevant docs

Precision-Recall Trade-off
• Usually, higher precision comes with lower recall, and higher recall with lower precision
• Returning all documents gives a recall of 1, but the precision is very low
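
A small sketch of both metrics and of the "return everything" extreme. The function name `precision_recall` and the document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set-based result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


corpus = [f"d{i}" for i in range(1, 11)]   # hypothetical 10-document collection
relevant = {"d1", "d3"}

p_small, r_small = precision_recall(["d1", "d2"], relevant)   # short result list
p_all, r_all = precision_recall(corpus, relevant)             # return all documents
```

Returning the whole collection pushes recall to 1.0 while precision drops to 2/10, which is the trade-off stated above.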

For a Ranked List
Consider the result lists of two IR systems, S1 and S2, for the same query.
[Figure: two ranked lists (1 and 2) with the relevant documents marked]
Which one is better?

Average Precision
$AP = \frac{1}{n} \sum_{i=1}^{n} \frac{R(x_i)}{P(x_i)}$
◦ $x_i$ ∈ set of retrieved relevant documents
◦ $P(x_i)$: rank of $x_i$ in the retrieved list
◦ $R(x_i)$: rank of $x_i$ in the list of retrieved relevant documents
◦ $n$: number of retrieved relevant documents
List 1: AP1 = ((1/1) + (2/3) + (3/6) + (4/9) + (5/10)) / 5 = 0.622

Average Precision (Cont’d)
List 2: AP2 = ((1/1) + (2/2) + (3/3) + (4/5) + (5/6)) / 5 = 0.927
S2 is better than S1.
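
The two AP values above can be checked with a short sketch. It follows the definition used here and divides by the number of retrieved relevant documents. The ranks of the relevant documents are read off the denominators in the two sums. The list length of 10 and the name `average_precision` are assumptions.

```python
def average_precision(relevance_flags):
    """relevance_flags: booleans in rank order, True where the document is relevant."""
    precisions = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1                        # R(x_i): rank among the relevant docs
            precisions.append(hits / rank)   # R(x_i) / P(x_i)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevant documents sit at these ranks in each list.
list1 = [rank in {1, 3, 6, 9, 10} for rank in range(1, 11)]
list2 = [rank in {1, 2, 3, 5, 6} for rank in range(1, 11)]

ap1 = average_precision(list1)
ap2 = average_precision(list2)
```

Rounded to three decimals, `ap1` and `ap2` match the 0.622 and 0.927 worked out by hand.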

Evaluating over multiple queries
• Mean Average Precision (MAP): arithmetic mean of the average precisions over all queries
• Example: 5 queries (topics) and 2 IR systems
[Table: columns AP1 to AP5 and MAP; rows S1 and S2]
• S1 is better than S2

Other Measurements
• Precision@N
• R-Precision
• F-measure
• E-measure
• …
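
These measures are only named above. A hedged sketch using their usual textbook definitions follows: Precision@N is the precision over the top N results, R-Precision is the precision at rank R where R is the number of relevant documents for the query, F is the weighted harmonic mean of precision and recall, and E = 1 − F. The function names and the example list are assumptions.

```python
def precision_at_n(relevance_flags, n):
    """Fraction of the top n results that are relevant."""
    return sum(relevance_flags[:n]) / n


def r_precision(relevance_flags, num_relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_n(relevance_flags, num_relevant)


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def e_measure(precision, recall, beta=1.0):
    """E-measure, taken here as 1 - F."""
    return 1.0 - f_measure(precision, recall, beta)


# Hypothetical ranked list with relevant documents at ranks 1, 3, 6, 9, 10.
flags = [rank in {1, 3, 6, 9, 10} for rank in range(1, 11)]
p_at_5 = precision_at_n(flags, 5)
r_prec = r_precision(flags, 5)   # assumes 5 relevant documents exist in total
```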

Problem
• Sometimes the collection contains so many documents that not all of them can be judged for relevance.
• In that case it is hard to calculate recall.

Pooling
Step 1. Take the top N documents from the results of each IR system to form a document pool.
Step 2. Experts check the pool and tag each document as relevant or non-relevant for each topic.
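
Step 1 can be sketched as a set union over the top N results of each system. The name `build_pool`, the run format, and the document identifiers are assumptions. Step 2 is manual judging, so it appears only as an empty table waiting for the experts' labels.

```python
def build_pool(runs, n):
    """runs: {system_name: ranked list of doc ids}. Pool the top n of every run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:n])
    return pool


# Hypothetical ranked results of two systems for one topic.
runs = {
    "S1": ["d3", "d1", "d7", "d2"],
    "S2": ["d1", "d5", "d3", "d9"],
}
pool = build_pool(runs, n=2)

# Step 2 placeholder: experts fill in True (relevant) or False (non-relevant).
judgments = {doc_id: None for doc_id in sorted(pool)}
```

Only pooled documents get judged, so recall is then estimated against the relevant documents found in the pool rather than against the whole collection.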

Difficulties in text IR
• Vocabulary mismatch
◦ Synonymy: e.g. car vs. automobile
◦ Polysemy: e.g. table
• Queries are ambiguous; they are only a partial specification of the user’s need
• Content representation may be inadequate and incomplete
• The user is the ultimate judge, but we don’t know how the judge judges…
◦ The notion of relevance is imprecise, and context- and user-dependent

Difficulties in web IR
• No stable document collection (spider, crawler)
• Invalid documents, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation in document quality
• Multilingual problem
• …

NLP in IR
• Simple methods: stop words, stemming
• Higher-level processing: chunking, parsing, word sense disambiguation
• Research on using NLP in IR needs more attention

Popular systems
• SMART
• Terrier
• Okapi
• Lemur
• etc.

Conferences and Journals
Conferences
• SIGIR
• TREC
• CLEF
• WWW
• ECIR
• …
Journals
• ACM Transactions on Information Systems (TOIS)
• ACM Transactions on Asian Language Information Processing (TALIP)
• Information Processing & Management (IP&M)
• Information Retrieval

Idea
• Convert documents and queries into vectors, and use a Similarity Coefficient (SC) to measure their similarity
• Proposed by Gerard Salton et al. in 1975 and implemented in the SMART IR system
• Premise: all terms are independent

Construct Vector
• Each dimension corresponds to a separate term.
• $W_{i,j}$ = weight of term j in document or query i
[Figure: example vectors for the query, Document 1 and Document 2]

Doc-Term Matrix
N documents and M terms

      D1      D2      D3      …    Dn
T1    W1,1    W2,1    W3,1    …    Wn,1
T2    W1,2    W2,2    W3,2    …    Wn,2
T3    W1,3    W2,3    W3,3    …    Wn,3
…     …       …       …       …    …
Tm    W1,m    W2,m    W3,m    …    Wn,m

Three Key Problems
1. Term selection
2. Term weighting
3. Similarity coefficient calculation

Term Selection
• Terms represent the content of documents
• Term purification
◦ Stemming
◦ Stop list
◦ Only choose nouns

Term Weight
• Boolean weight: 1 if the term appears, 0 if it does not
• Term frequency:
◦ tf
◦ 1 + log(tf)
◦ 1 + log(1 + log(tf))
• Inverse document frequency:
◦ tf*idf

Term Weight (Cont’d)
Document length
Two opinions:
• Longer documents contain more terms
• Longer documents have more information
Punish long documents and compensate short documents
Pivoted normalization: $1 - b + b \cdot \frac{\mathrm{doclen}}{\mathrm{avgdoclen}}$, with $b \in (0, 1)$
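
A small sketch of how the pivoted normalization factor behaves. The formula does not say how the factor is applied; dividing the raw term weight by it, as done below, is an assumption. The function names and the values of `b`, `doclen`, and `avgdoclen` are likewise chosen only for illustration.

```python
def pivoted_norm(doclen, avgdoclen, b=0.5):
    """Normalization factor 1 - b + b * doclen / avgdoclen, with b in (0, 1)."""
    if not 0 < b < 1:
        raise ValueError("b must lie strictly between 0 and 1")
    return 1 - b + b * doclen / avgdoclen


def normalized_weight(raw_weight, doclen, avgdoclen, b=0.5):
    """Assumed usage: divide the raw weight by the normalization factor."""
    return raw_weight / pivoted_norm(doclen, avgdoclen, b)


avg_len = 100
long_doc = normalized_weight(3.0, doclen=200, avgdoclen=avg_len)    # factor 1.5
average_doc = normalized_weight(3.0, doclen=100, avgdoclen=avg_len)  # factor 1.0
short_doc = normalized_weight(3.0, doclen=50, avgdoclen=avg_len)    # factor 0.75
```

A document of average length is left unchanged, a longer one is scaled down, and a shorter one is scaled up. A larger `b` makes the correction stronger.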

Similarity Coefficient Calculation
• Dot product
• Cosine
• Dice
• Jaccard
[Figure: vectors D and Q plotted on the term axes t1 and t2]
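
The formulas for the four coefficients are not written out above. The sketch below uses the commonly cited weighted-vector forms, which may differ in detail from the original slide. Vectors are stored as dictionaries from term to weight; the function names and the example vectors are assumptions.

```python
import math


def dot(q, d):
    """Dot product: sum of q_i * d_i over the shared terms."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def cosine(q, d):
    """Dot product divided by the product of the two vector lengths."""
    denom = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / denom if denom else 0.0


def dice(q, d):
    """2 * dot / (sum of q_i^2 + sum of d_i^2)."""
    denom = dot(q, q) + dot(d, d)
    return 2 * dot(q, d) / denom if denom else 0.0


def jaccard(q, d):
    """dot / (sum of q_i^2 + sum of d_i^2 - dot)."""
    denom = dot(q, q) + dot(d, d) - dot(q, d)
    return dot(q, d) / denom if denom else 0.0


Q = {"t1": 1.0, "t2": 1.0}
D = {"t1": 1.0}
scores = {f.__name__: f(Q, D) for f in (dot, cosine, dice, jaccard)}
```

Unlike the raw dot product, the other three are normalized by vector size, so a long document does not score higher merely because it contains more weight overall.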

Example
Q: “gold silver truck”
D1: “Shipment of gold delivered in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Document frequency of the j-th term: $df_j$
• Inverse document frequency: $idf_j = \log_{10}(n / df_j)$, with n = 3 documents
• tf*idf is used as the term weight here

Example (Cont’d)
tf*idf is used here.
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062
The ranking is D2, D3, D1. This SC uses the dot product.
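
The three scores can be reproduced with a short sketch. It assumes whitespace tokenization, lowercasing, no stop list and no stemming (so "delivered" and "delivery" stay distinct terms). The helper names are hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold delivered in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def terms(text):
    """Lowercase and split on whitespace; no stop list or stemming."""
    return text.lower().split()


n = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(terms(text)))            # document frequency df_j
idf = {t: math.log10(n / count) for t, count in df.items()}


def tf_idf_vector(text):
    """Term -> tf * idf, using the idf values of the collection."""
    tf = Counter(terms(text))
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}


q_vec = tf_idf_vector(query)
scores = {}
for doc_id, text in docs.items():
    d_vec = tf_idf_vector(text)
    # Dot product as the similarity coefficient.
    scores[doc_id] = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())

ranking = sorted(scores, key=scores.get, reverse=True)
```

D2 comes out on top mainly because "silver" occurs twice in it and in no other document, so both its tf and its idf are high.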

Advantages of VSM
• Fairly cheap to compute
• Yields decent effectiveness
• Very popular: SMART is one of the most commonly used academic prototypes

Disadvantages of VSM
• No theoretical foundation
• Weights in the vectors are very arbitrary
• Assumes term independence
• Sparse matrix

