Slide 1: Vector Space Model
Rong Jin
Slide 2: Basic Issues in a Retrieval Model
- How to represent text objects?
- What similarity function should be used?
- How to refine the query according to users' feedback?
Slide 3: Basic Issues in IR
- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to use users' feedback to improve retrieval performance?
Slide 4: IR: Formal Formulation
- Vocabulary V = {w_1, w_2, ..., w_n} of a language
- Query q = q_1, ..., q_m, where q_i ∈ V
- Collection C = {d_1, ..., d_k}
- Document d_i = (d_i1, ..., d_im_i), where d_ij ∈ V
- Set of relevant documents R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a "hint" about which documents are in R(q)
- Task: compute R'(q), an approximation of R(q)
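The formal setup above can be sketched in a few lines of code; the vocabulary, query, and documents below are toy data invented purely for illustration, and the final "R'(q)" is the crudest possible approximation (any shared word):

```python
# A minimal sketch of the formal IR setup (toy data, invented for illustration).
V = {"java", "coffee", "microsoft", "starbucks"}        # vocabulary
q = ["java", "coffee"]                                  # query: a sequence of words from V
C = {                                                   # collection of documents
    "d1": ["java", "coffee", "starbucks"],
    "d2": ["microsoft", "java"],
    "d3": ["starbucks", "coffee", "coffee"],
}
# Sanity check: every query and document word is drawn from V.
assert all(w in V for w in q)
assert all(w in V for d in C.values() for w in d)

# R(q) is generally unknown; the task is to compute an approximation R'(q).
# Trivial R'(q) for illustration: documents sharing at least one query word.
R_prime = {name for name, d in C.items() if set(q) & set(d)}
print(sorted(R_prime))   # every toy document shares some query word here
```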
Slide 5: Computing R(q)
Strategy 1: Document selection
- Classification function f(d, q) ∈ {0, 1}: outputs 1 for relevant, 0 for irrelevant
- R(q) is determined as the set {d ∈ C | f(d, q) = 1}
- The system must decide whether each document is relevant or not ("absolute relevance")
- Example: Boolean retrieval
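Strategy 1 can be sketched as Boolean retrieval; the documents and the conjunctive query below are toy examples, not from the slides:

```python
# Sketch of document selection as Boolean AND retrieval (toy data).
docs = {
    "d1": {"java", "coffee", "starbucks"},
    "d2": {"microsoft", "java"},
    "d3": {"starbucks", "coffee"},
}

def f(d, required_terms):
    """Classification function f(d, q) -> {0, 1}: 1 iff d contains every query term."""
    return 1 if required_terms <= d else 0

query = {"java", "coffee"}
# R(q) = {d in C | f(d, q) = 1}: an absolute relevance decision per document.
R = {name for name, d in docs.items() if f(d, query) == 1}
print(sorted(R))   # only d1 contains both query terms
```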
6
Document Selection Approach 6 + + + + - - - - - - - - - - - - - - + - - True R(q) Classifier C(q)
Slide 7: Computing R(q)
Strategy 2: Document ranking
- Similarity function f(d, q): outputs a similarity score between document d and query q
- Cutoff θ: the minimum similarity for a document to count as relevant
- R(q) is determined as the set {d ∈ C | f(d, q) > θ}
- The system must only decide whether one document is more likely to be relevant than another ("relative relevance")
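Strategy 2 amounts to sorting by score and thresholding; the scores below are made up for illustration:

```python
# Sketch of document ranking: score every document, sort, apply a cutoff theta.
scores = {"d1": 0.98, "d2": 0.95, "d3": 0.83, "d4": 0.21}   # toy f(d, q) values
theta = 0.5

ranked = sorted(scores, key=scores.get, reverse=True)       # best first
# R(q) = {d | f(d, q) > theta}, kept in ranked order.
R = [d for d in ranked if scores[d] > theta]
print(R)   # documents above the cutoff, best first
```

The ranking itself is what the user sees; θ only determines where the list is truncated.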
Slide 8: Document Selection vs. Ranking
[Figure: documents ranked by f(d, q), with true relevance labels:
d1 0.98 (+), d2 0.95 (+), d3 0.83 (-), d4 0.80 (+), d5 0.76 (-),
d6 0.56 (-), d7 0.34 (-), d8 0.21 (+), d9 0.21 (-);
R'(q) is a top prefix of this ranking, compared against the true R(q)]
Slide 9: Document Selection vs. Ranking
[Figure: the same ranked list shown alongside a binary document-selection output f(d, q) ∈ {0, 1}; selection commits to a single R'(q), while ranking lets the cutoff vary]
Slide 10: Ranking Is Often Preferred
- A similarity function is more general than a classification function
- A classifier is unlikely to be accurate
  - Ambiguous information needs, short queries
  - Relevance is a subjective concept
- Absolute relevance vs. relative relevance
Slide 11: Probability Ranking Principle
As stated by Cooper, ranking documents by probability of relevance maximizes the utility of an IR system:
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
Slide 12: Vector Space Model
- Any text object can be represented by a term vector
  - Examples: documents, queries, sentences, ...
  - A query is viewed as a short document
- Similarity is determined by the relationship between two vectors
  - e.g., the cosine of the angle between the vectors, or the distance between them
- The SMART system: developed at Cornell University, 1960-1999; still widely used
Slide 13: Vector Space Model: Illustration

         Java  Starbucks  Microsoft
  D1      1       1          0
  D2      0       1          1
  D3      1       0          1
  D4      1       1          1
  Query   1       0.1        1
Slide 14: Vector Space Model: Illustration
[Figure: D1-D4 and the query plotted as vectors in the three-dimensional space spanned by the terms Java, Microsoft, and Starbucks]
Slide 15: Vector Space Model: Similarity
- Represent both documents and queries as word-histogram vectors
  - n: the number of unique words
- A query q = (q_1, q_2, ..., q_n)
  - q_i: occurrence count of the i-th word in the query
- A document d_k = (d_k,1, d_k,2, ..., d_k,n)
  - d_k,i: occurrence count of the i-th word in the document
- Similarity of a query q to a document d_k: measured via the angle θ between the vectors q and d_k
16
Some Background in Linear Algebra Dot product (scalar product) Example: Measure the similarity by dot product 16
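A minimal illustration of the dot product as a similarity score; the vectors are toy term counts:

```python
# Dot product of two term vectors as a similarity measure (toy numbers).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q  = [1, 0, 1]      # query vector over a 3-term vocabulary
d1 = [1, 1, 0]      # document vectors
d2 = [1, 0, 1]
print(dot(q, d1), dot(q, d2))   # d2 matches q on more terms, so it scores higher
```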
Slide 17: Some Background in Linear Algebra
- Length of a vector: |a| = sqrt(Σ_i a_i^2)
- Angle between two vectors q and d_k: cos θ = (q · d_k) / (|q| |d_k|)
18
Some Background in Linear Algebra Example: Measure similarity by the angle between vectors 18 q dkdk
Slide 19: Vector Space Model: Similarity
Given:
- A query q = (q_1, q_2, ..., q_n), where q_i is the occurrence count of the i-th word in the query
- A document d_k = (d_k,1, d_k,2, ..., d_k,n), where d_k,i is the occurrence count of the i-th word in the document
Similarity of q to d_k: the cosine of the angle between q and d_k
Slide 20: Vector Space Model: Similarity
sim(q, d_k) = cos θ = (q · d_k) / (|q| |d_k|) = (Σ_i q_i d_k,i) / (sqrt(Σ_i q_i^2) · sqrt(Σ_i d_k,i^2))
21
Vector Space Model: Similarity 21 q dkdk
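The cosine similarity defined above can be sketched directly; the vectors are toy counts, and the example highlights why cosine is preferred over the raw dot product:

```python
import math

# Cosine similarity between query and document term vectors (toy counts).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

q  = [1, 0, 1]
d1 = [1, 0, 1]      # same direction as q
d2 = [2, 0, 2]      # same direction, twice the length
d3 = [0, 1, 0]      # orthogonal to q: no shared terms
print(round(cosine(q, d1), 3))   # 1.0
print(round(cosine(q, d2), 3))   # 1.0: cosine ignores document length
print(round(cosine(q, d3), 3))   # 0.0
```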
Slide 22: Term Weighting
- w_k,i: the importance of the i-th word for document d_k
- Why weighting? Some query terms carry more information than others
- TF.IDF weighting
  - TF (Term Frequency): within-document frequency
  - IDF (Inverse Document Frequency)
  - TF normalization: avoid a bias toward long documents
Slide 23: TF Weighting
- A term is important if it occurs frequently in a document
- Formulas (f(t, d): occurrence count of word t in document d):
  - Raw term frequency: TF(t, d) = f(t, d)
  - Maximum-frequency normalization, e.g. TF(t, d) = f(t, d) / max_w f(w, d)
Slide 24: TF Weighting
- A term is important if it occurs frequently in a document
- "Okapi/BM25 TF" term-frequency normalization:
  TF(t, d) = f(t, d) · (k + 1) / (f(t, d) + k · (1 - b + b · doclen(d) / avg_doclen))
  - f(t, d): occurrence count of word t in document d
  - doclen(d): the length of document d
  - avg_doclen: average document length in the collection
  - k, b: predefined constants
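The Okapi/BM25 TF formula above is a one-liner in code; the constants k = 1.2 and b = 0.75 are common defaults, not values given on the slides:

```python
# Sketch of the Okapi/BM25 TF normalization, with conventional defaults
# k = 1.2 and b = 0.75 (constants assumed here, not taken from the slides).
def bm25_tf(f, doclen, avg_doclen, k=1.2, b=0.75):
    return (f * (k + 1)) / (f + k * (1 - b + b * doclen / avg_doclen))

# Repeated occurrences saturate: going from 1 to 10 occurrences gains
# far less than going from 0 to 1.
print(round(bm25_tf(1, 100, 100), 3))    # 1.0 for an average-length doc
print(round(bm25_tf(10, 100, 100), 3))   # well below 10x the single-occurrence weight
# Longer documents are penalized for the same raw count:
print(round(bm25_tf(1, 400, 100), 3))
```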
Slide 25: TF Normalization
- Why normalize?
  - Document length varies
  - Repeated occurrences are less informative than the first occurrence
- Two views of document length
  - A document is long because it uses more words
  - A document is long because it has more content
- Generally penalize long documents, but avoid over-penalizing (pivoted normalization)
Slide 26: TF Normalization
[Figure: normalized TF plotted against raw TF, illustrating "pivoted normalization"]
Slide 27: IDF Weighting
- A term is discriminative if it occurs in only a few documents
- Formula: IDF(t) = 1 + log(n / m)
  - n: total number of documents
  - m: number of documents containing term t (document frequency)
- Can be interpreted as mutual information
Slide 28: TF-IDF Weighting
- TF-IDF weighting: the importance of a term t to a document d
  weight(t, d) = TF(t, d) * IDF(t)
- Frequent in the document: high TF, hence high weight
- Rare in the collection: high IDF, hence high weight
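Putting the two factors together, TF-IDF as defined above can be sketched on a toy collection (the documents are invented for illustration; raw counts serve as TF):

```python
import math

# TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t), IDF(t) = 1 + log(n / m).
# Toy collection, invented for illustration; raw counts serve as TF.
docs = {
    "d1": ["java", "java", "coffee"],
    "d2": ["java", "microsoft"],
    "d3": ["starbucks", "coffee"],
}
n = len(docs)

def idf(t):
    m = sum(1 for d in docs.values() if t in d)   # document frequency of t
    return 1 + math.log(n / m)

def weight(t, d):
    return docs[d].count(t) * idf(t)              # raw TF times IDF

# "java" occurs in 2 of 3 docs, "starbucks" in only 1:
# the rarer term gets the higher IDF, hence more weight per occurrence.
print(idf("starbucks") > idf("java"))   # True
print(round(weight("java", "d1"), 3))   # TF = 2 in d1, scaled by IDF
```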
Slide 29: TF-IDF Weighting
- weight(t, d) = TF(t, d) * IDF(t)
- In the earlier similarity formula, both q_i and d_k,i were binary values, i.e., the presence or absence of a word in the query and the document; TF-IDF replaces them with weighted counts
Slide 30: Problems with the Vector Space Model
- Still limited to word-based matching
  - A document will never be retrieved if it contains none of the query words
- How can the vector space model be modified?
Slides 31-35: Choice of Bases
[Figure sequence: a query Q and documents D and D1 plotted in the space spanned by Java, Microsoft, and Starbucks; step by step, D and Q are re-expressed in a new basis, yielding D' and Q']
Slide 36: Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a vector in the concept space

        c1  c2  c3  c4  c5  m1  m2  m3  m4
  A1     1   1   1   1   1   0   0   0   0
  A2     0   0   0   0   0   1   1   1   1
Slide 37: Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a mixture of concepts
  (same concept table A1/A2 as on the previous slide)
Slide 38: Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a mixture of concepts
- How should the "basic concepts" be defined or selected?
  - In the VS model, each term is viewed as an independent concept
Slides 39-40: Basics: Matrix Multiplication
[Figures: worked examples of matrix multiplication]
Slide 41: Linear Algebra Basics: Eigen Analysis
- Eigenvectors (for a square m × m matrix S): S v = λ v
  - λ: eigenvalue; v: (right) eigenvector
- [Figure: worked example]
Slide 42: Linear Algebra Basics: Eigen Analysis
[Figure: worked eigen-analysis example]
Slides 43-44: Linear Algebra Basics: Eigen Decomposition
S = U Λ U^T
[Figures: worked examples of the decomposition]
Slide 45: Linear Algebra Basics: Eigen Decomposition
S = U Λ U^T
- This holds generally for a symmetric square matrix S
- Columns of U are the eigenvectors of S
- Diagonal elements of Λ are the eigenvalues of S
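The decomposition S = U Λ U^T can be checked numerically on a small symmetric matrix (the matrix here is a toy example):

```python
import numpy as np

# Eigen decomposition of a small symmetric matrix: S = U @ diag(lam) @ U.T
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, U = np.linalg.eigh(S)           # eigh is the solver for symmetric matrices
S_rebuilt = U @ np.diag(lam) @ U.T   # reassemble from eigenvectors/eigenvalues

print(np.allclose(S, S_rebuilt))     # True: the decomposition reproduces S
print(sorted(lam))                   # eigenvalues, approximately [1.0, 3.0]
```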
Slide 46: Singular Value Decomposition
For an m × n matrix A of rank r there exists a factorization (the Singular Value Decomposition, SVD):
  A = U Σ V^T
- U is m × m; its columns are the left singular vectors
- Σ is a diagonal matrix holding the singular values
- V is n × n; its columns are the right singular vectors
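The factorization is directly available in numpy; the small term-document matrix below is a toy example (with `full_matrices=False`, the "thin" SVD is returned):

```python
import numpy as np

# Thin SVD of a small term-document matrix A: A = U @ diag(s) @ Vt.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])          # 4 terms x 3 documents (toy counts)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)        # (4, 3) (3,) (3, 3)

# The factorization reconstructs A, and numpy returns the singular
# values sorted in decreasing order.
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
```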
Slides 47-49: Singular Value Decomposition
[Figures: illustration of SVD dimensions and sparseness]
Slides 50-52: Low-Rank Approximation
Approximate a matrix using only the largest singular values and their singular vectors.
[Figures: the approximation built up term by term]
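Keeping only the k largest singular values and vectors gives the truncated SVD; the sketch below also checks the well-known Eckart-Young property that the Frobenius error equals the energy in the discarded singular values (toy matrix):

```python
import numpy as np

# Rank-k approximation from the truncated SVD (toy matrix for illustration).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep the k largest components

print(np.linalg.matrix_rank(A_k) <= k)        # True: rank is at most k
# Frobenius error of the truncated SVD equals the energy in the
# discarded singular values (Eckart-Young):
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```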
Slide 53: Latent Semantic Indexing (LSI)
- Computation: apply singular value decomposition (SVD), keeping the m largest singular values and their singular vectors, where m is the number of concepts
- The left singular vectors represent the concepts in term space; the right singular vectors represent the concepts in document space
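A minimal LSI sketch, assuming toy data: documents and the query are projected into the m-concept space through the truncated left singular vectors, then compared by cosine there (this projection is one common fold-in convention, not the only one):

```python
import numpy as np

# Sketch of LSI on a toy term-document matrix (rows = terms, cols = documents).
A = np.array([[1.0, 1.0, 0.0],    # "java"
              [1.0, 0.0, 0.0],    # "coffee"
              [0.0, 1.0, 1.0],    # "microsoft"
              [0.0, 0.0, 1.0]])   # "windows"

m = 2                              # number of concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_m = U[:, :m]                     # columns: concepts represented in term space

def to_concept_space(x):
    """Project a term vector into the m-dimensional concept space."""
    return U_m.T @ x

doc_vecs = [to_concept_space(A[:, j]) for j in range(A.shape[1])]
q = np.array([1.0, 1.0, 0.0, 0.0])          # query "java coffee"
q_vec = to_concept_space(q)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by cosine similarity in concept space instead of term space.
sims = [cosine(q_vec, d) for d in doc_vecs]
best = int(np.argmax(sims))
print(best)   # 0: the "java coffee" document matches the query best
```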
Slide 54: Finding "Good Concepts"
[Figure]
Slides 55-58: SVD Example (m = 2)
[Figures: a term-document matrix X, its SVD, and the rank-2 approximation X']
Slide 59: SVD: Orthogonality
- The singular vectors are orthogonal: u_1 · u_2 = 0 and v_1 · v_2 = 0
[Figure: the orthogonal singular vectors of X]
Slide 60: SVD: Properties
- rank(S): the maximum number of row (or column) vectors of a matrix S that are linearly independent
- SVD produces the best low-rank approximation
  - X: rank(X) = 9
  - X': rank(X') = 2
Slide 61: SVD: Visualization
[Figure: the matrix X]
Slide 62: SVD: Visualization
- SVD tries to preserve the Euclidean distances between document vectors