Chapter 4 Matching Process

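Matching Process
- A document need not be retrieved just because one word of the query appears in it.
  - The query may contain several words.
  - The word may not be important in context.
    - e.g. "This document is not about..."
  - This chapter assumes that a document matches a query with some uncertainty.
    - The focus is on how strong the relevance is.
  - The focus is on the topicality of the document.
    - The degree to which the topic of the document agrees with the topic of the query.
    - The user's knowledge, background, and preferences: Chapter 6.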

4.1 Relevance and Similarity Measure
- document space: an organized set of documents
- the document space doesn't contain queries
  - mapping from the document space into the query space (Boolean systems)
  - characteristic function whose value gives a document's relevance to the query: [0, 1]
- the document space contains queries
  - the query is a point in the document space
  - relevant documents: a cluster near the query point
  - evaluation function: defines a contour
- measure
  - basis for the evaluation of each document
  - some computable function

measure
- whether a document is relevant to a query
- as relevance is ultimately in the mind of the user, it is difficult to measure directly
- IR systems rely primarily on measurements from document and query representations
- most systems equate relevance with lexical similarity: the matching of words

4.2 Boolean-Based Matching
- tests whether a document contains a given term
- the query is a logical function of given words; the document is not.
  - there is no structural similarity between them: characteristic function
- no basis for the development of significant similarity judgments: a document either satisfies the query or it does not.
  - example of a modification: grade the results of 'A OR B OR C'
- Since Boolean systems operate on the basis of the presence or absence of terms, many such systems do not include term frequency data.
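The contrast between a plain Boolean OR and a graded OR can be sketched as follows. This is a minimal illustration, assuming that the grade is simply the number of distinct query terms the document contains; the function names and the sample documents are hypothetical, not taken from the slides.

```python
def boolean_or_match(doc_terms, query_terms):
    """Characteristic function: 1 if any query term is present, else 0."""
    return 1 if set(doc_terms) & set(query_terms) else 0


def graded_or_match(doc_terms, query_terms):
    """Assumed grading rule: count the distinct query terms present."""
    return len(set(doc_terms) & set(query_terms))


query = ["a", "b", "c"]          # stands for the query 'A OR B OR C'
doc1 = ["a", "x"]                # contains one of the three terms
doc2 = ["a", "b", "c", "y"]      # contains all three terms

# Both documents satisfy the Boolean query equally,
# but the graded version ranks doc2 above doc1.
print(boolean_or_match(doc1, query), boolean_or_match(doc2, query))
print(graded_or_match(doc1, query), graded_or_match(doc2, query))
```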

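4.3 Vector-Based Matching: Metrics
- metrics: distance measures and angular measures
  - distance measure
    - assumes that documents close together in the vector space are similar
  - angular measure
    - assumes that documents lying in similar directions in the vector space are similar
- the distance of a document from itself is 0.
  - a metric is therefore not a similarity measure but a dissimilarity measure
  - a conversion is needed
- a linear conversion from a metric to a similarity measure is generally not desirable.
  - if a metric $\delta$ is converted to a similarity $\sigma$ by $\sigma = k - \delta$, it is difficult to choose an appropriate value of k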

4.3 Vector-Based Matching: Metrics (continued)
- inversion transform that maps the distance into a fixed positive range of numbers
  - $\sigma = b^{-P(\delta)}$, where $b > 1$ and $P(\delta)$ is monotonically increasing
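A minimal sketch of one transform that satisfies the stated conditions (b > 1, P monotonically increasing), assuming the form sigma = b ** (-P(delta)). The function name, the default base b = 2, and the identity default for P are assumptions made for illustration only.

```python
def similarity_from_distance(delta, b=2.0, p=lambda d: d):
    """Map a distance delta >= 0 to a similarity in (0, 1].

    Assumed form: sigma = b ** (-p(delta)), with b > 1 and p
    monotonically increasing (identity by default, so p(0) = 0).
    """
    if b <= 1:
        raise ValueError("b must be greater than 1")
    if delta < 0:
        raise ValueError("a distance cannot be negative")
    return b ** (-p(delta))


# A document at distance 0 from the query gets the maximum similarity,
# and similarity shrinks toward 0 as the distance grows.
for delta in (0.0, 1.0, 3.0):
    print(delta, similarity_from_distance(delta))
```

Unlike the linear conversion k - delta, this form needs no upper bound on the distance and never produces a negative similarity.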

4.4 Vector-Based Matching: Cosine Measure
- this is not a distance measure, but an angular measure.
  $$\sigma(D, Q) = \frac{\sum_k t_k q_k}{\sqrt{\sum_k t_k^2}\,\sqrt{\sum_k q_k^2}}$$
  where $t_k$ is the value of term k in the document and $q_k$ is its value in the query
- this is the inner product of the document and query vectors, normalized by their lengths.
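The cosine measure described above, the inner product of the document and query vectors divided by the product of their lengths, can be sketched in a few lines. Returning 0.0 for a zero-length vector is an assumption of this sketch; the slides do not say how that case is handled.

```python
import math


def cosine_measure(doc, query):
    """Inner product of two term vectors, normalized by their lengths."""
    dot = sum(t * q for t, q in zip(doc, query))
    doc_len = math.sqrt(sum(t * t for t in doc))
    query_len = math.sqrt(sum(q * q for q in query))
    if doc_len == 0 or query_len == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (doc_len * query_len)


# Vectors pointing in the same direction score 1 regardless of length;
# vectors with no terms in common score 0.
print(cosine_measure([1, 3], [2, 6]))
print(cosine_measure([1, 0], [0, 1]))
```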

Measure comparison
- distance measures
  - similarity depends only on how far a given document is from the point
- angular measures
  - do not consider the distance of each document from the origin, but only the direction
  - two documents that lie along the same vector from the origin will be judged identically, despite the fact that they may be far apart in the document space.

Measure comparison (continued)
- ex) D1=, D2=, D3=
  - cosine measure
    - $\sigma$(D1, D2) = 1.0, $\sigma$(D1, D3) = 0.6
  - Euclidean distance
    - $\delta$(D1, D2) = 314.96, $\delta$(D1, D3) = 2.83
  - the cosine measure judges D1 and D2 to be more similar, while the distance measure judges D1 and D3 to be more similar
- In practice, distance and angular measures seem to give results of similar quality
  - when the documents are sufficiently far from the origin
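The disagreement between the two kinds of measure can be reproduced with a small sketch. The three vectors below are hypothetical values chosen for illustration (they are not the vectors of the slide's example): d2 lies along the same direction as d1 but far from it, while d3 lies close to d1 in a different direction.

```python
import math


def cosine_measure(a, b):
    """Angular measure: inner product normalized by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    """Distance measure: straight-line distance between two points."""
    return math.dist(a, b)


# Hypothetical document vectors (assumed, for illustration only).
d1 = (1.0, 3.0)
d2 = (100.0, 300.0)   # same direction as d1, much farther from the origin
d3 = (3.0, 1.0)       # near d1, different direction

print("cosine   D1-D2:", cosine_measure(d1, d2), " D1-D3:", cosine_measure(d1, d3))
print("distance D1-D2:", euclidean_distance(d1, d2), " D1-D3:", euclidean_distance(d1, d3))
```

With these values the cosine measure ranks D2 as the closer match to D1, while the Euclidean distance ranks D3 as closer, which is the qualitative behaviour the comparison describes.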

4.5 Missing Terms and Term Relationship
- one problem: missing terms
  - a 0 has two meanings: no occurrence, or no information about occurrence (e.g., )
  - a term may be missing from a document description because an indexer did not think it significant, rather than because it does not occur in the document. A term may likewise be left out of a query by the user.

4.5 Missing Terms and Term Relationship (continued)
- Another problem: term relationships
  - vector operations assume that the elements are independent of one another
  - this can produce wrong results: e.g. "digital computer"
- Final problem: symmetry
  - distance and angular measures both treat the query and the document symmetrically
  - the user wants documents that fit the query, not queries that fit the document
    - e.g. an encyclopedia: the entry that answers a user's query contains a great many words that do not appear in the query

4.6 Probabilistic Matching
- focus attention on models that include uncertainties more directly
- calculate the probability that the document is relevant to the query
- assumptions
  - at any given time a single query is being used
  - the number of documents within the database that are relevant to the query is known

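4.6 Probabilistic Matching (continued)
- probabilities when a document is selected at random, where N is the number of documents in the database and n the number relevant to the query
  - P(rel) = n/N
  - P(¬rel) = 1 - P(rel) = (N - n)/N
  - in practice, documents are selected by comparing the words of the query and the document
- P(computer|digital) > P(computer|?)
- Case 1
  - a selected set of documents S is relevant if P(rel|selected) > P(¬rel|selected) for every document in S
  - discriminant function: $dis(selected) = \frac{P(rel \mid selected)}{P(\neg rel \mid selected)}$
  - a set is retrieved if dis(selected) > 1 for every document in the set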

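4.6 Probabilistic Matching (continued)
- Bayes's theorem: $P(rel \mid selected) = \frac{P(selected \mid rel)\,P(rel)}{P(selected)}$
- applying this to the discriminant function, $dis(selected) = \frac{P(selected \mid rel)\,P(rel)}{P(selected \mid \neg rel)\,P(\neg rel)}$
- assume that a document is represented by terms and that these terms are statistically independent.
- $P(selected \mid rel) = P(t_1 \mid rel)\,P(t_2 \mid rel) \cdots P(t_n \mid rel)$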

4.6 Probabilistic Matching (continued)
- If estimates for the probability of occurrence of various terms in relevant documents and in nonrelevant documents can be obtained, then the probability that a document will be retrieved can be estimated.

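4.6 Probabilistic Matching (continued)
- Example
  - proportion of relevant documents among all documents = 0.1
  - the discriminant value is less than 1, so the document is not retrieved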
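The discriminant calculation under the term-independence assumption can be sketched as below. Only the prior P(rel) = 0.1 comes from the example; the per-term probabilities are hypothetical values invented for illustration, and the function name is an assumption.

```python
import math


def discriminant(p_terms_rel, p_terms_nonrel, p_rel):
    """dis = P(selected|rel) P(rel) / (P(selected|not rel) P(not rel)).

    Assumes statistically independent terms, so each conditional
    probability is the product of the per-term probabilities.
    """
    p_selected_rel = math.prod(p_terms_rel)
    p_selected_nonrel = math.prod(p_terms_nonrel)
    return (p_selected_rel * p_rel) / (p_selected_nonrel * (1.0 - p_rel))


p_rel = 0.1                     # proportion of relevant documents
p_terms_rel = [0.8, 0.5]        # hypothetical P(t_i | rel)
p_terms_nonrel = [0.3, 0.2]     # hypothetical P(t_i | not rel)

dis = discriminant(p_terms_rel, p_terms_nonrel, p_rel)
print(dis, "retrieve" if dis > 1 else "do not retrieve")
```

With these assumed numbers the terms favour relevance, but the low prior pulls the discriminant below 1, so the document is not retrieved.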

4.7 Fuzzy Matching
- probabilistic matching involves much calculation and many assumptions.
- In fuzzy matching the calculation is based on defined membership grades for terms.
- this computation is simpler than that for probabilistic retrieval, since it involves simple functions of the membership grades for each document: it is based on fuzzy arithmetic
  - e.g. Avg(max($\mu_{D1}$(t1), $\mu_{D2}$(t1), ...), max($\mu_{D1}$(t2), $\mu_{D2}$(t2), ...))
- the open issue is how such terms translate into the membership functions associated with fuzzy retrieval.
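The Avg(max(...), max(...)) expression can be sketched directly: for each query term take the largest membership grade over a set of documents, then average over the terms. The membership grades and names below are hypothetical, and treating a missing term as grade 0.0 is an assumption of this sketch.

```python
def fuzzy_set_score(memberships, query_terms):
    """Average, over query terms, of the maximum membership grade
    that any document in the set assigns to the term."""
    per_term_max = [
        max(grades.get(term, 0.0) for grades in memberships.values())
        for term in query_terms
    ]
    return sum(per_term_max) / len(per_term_max)


# Hypothetical membership grades mu_D(t) for two documents.
memberships = {
    "D1": {"t1": 0.8, "t2": 0.2},
    "D2": {"t1": 0.4, "t2": 0.6},
}

score = fuzzy_set_score(memberships, ["t1", "t2"])
print(score)
```

Only max and an average are needed, which is why the computation is lighter than the product of probabilities used in probabilistic matching.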

4.8 Proximity Matching
- a much older and more widely used matching method involves the proximity of terms in a text.
- Frequently proximity measures are used as additional criteria to further refine the set of documents identified by one of the other matching methods.
- Modifications of proximity criteria can increase their effectiveness.
  - e.g. ordered proximity
    - "junior college" vs. "college junior"
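Ordered versus unordered proximity can be sketched as a check on token positions. The function name, the whitespace tokenization, and the default window of one position are assumptions for illustration.

```python
def proximity_match(tokens, first, second, window=1, ordered=True):
    """True if `first` and `second` occur within `window` positions.

    With ordered=True, `first` must come before `second`.
    """
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and 0 < gap <= window:
                return True
            if not ordered and 0 < abs(gap) <= window:
                return True
    return False


text_a = "she attends a junior college".split()
text_b = "he is a college junior".split()

# Ordered proximity separates the two phrases; unordered does not.
print(proximity_match(text_a, "junior", "college"))
print(proximity_match(text_b, "junior", "college"))
print(proximity_match(text_b, "junior", "college", ordered=False))
```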

4.9 Effects of Weighting
- Not all terms are equally important in a query.
- Weighting of terms modifies the calculations upon which relevance judgments are made.
- Weighting can also be applied at a broader level than individual terms.
  - (beef and broccoli):5; (beef but not broccoli):2; noodles:1; snow peas:1
- Filtering without weighting: more complex calculations will be confined to a relatively small set of documents.
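The weighted query in the example can be sketched as a list of (condition, weight) pairs. Summing the weights of the satisfied conditions is an assumed combination rule; the slides give the weights but not how they are combined, and the names below are hypothetical.

```python
# Each clause pairs a condition on the document's term set with a weight.
weighted_query = [
    (lambda terms: "beef" in terms and "broccoli" in terms, 5),
    (lambda terms: "beef" in terms and "broccoli" not in terms, 2),
    (lambda terms: "noodles" in terms, 1),
    (lambda terms: "snow peas" in terms, 1),
]


def weighted_score(doc_terms, query=weighted_query):
    """Assumed rule: add up the weights of all satisfied clauses."""
    terms = set(doc_terms)
    return sum(weight for condition, weight in query if condition(terms))


print(weighted_score(["beef", "broccoli", "noodles"]))
print(weighted_score(["beef", "snow peas"]))
print(weighted_score(["tofu"]))
```

Because the first two clauses weight combinations of terms rather than single terms, this illustrates weighting applied at a broader level than the individual term.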

4.10 Effects of Scaling
- the impact of the size of the document collection can be major.
  - the question is whether a technique will be feasible to apply to real document collections
- false drops become more likely
  - documents that appear to match the query but are not appropriate
  - in a collection of computer documents a false drop for "object-oriented programming" is unlikely, but in a general collection it is likely (a TV is also treated as an object)
- Information filtering
  - produce a relatively small set containing a high proportion of relevant documents.
  - extract a small set of candidate documents with a simple technique, then process the extracted set with a complex technique: similar to the refining of gold
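The two-stage filtering idea can be sketched as a cheap term-overlap filter followed by a costlier ranking of the survivors. Using a cosine ranking over raw term counts for the second stage is an assumption of this sketch, as are the function names and the tiny sample collection.

```python
import math
from collections import Counter


def cheap_filter(docs, query_terms):
    """Stage 1: keep only documents sharing at least one query term."""
    wanted = set(query_terms)
    return {doc_id: text for doc_id, text in docs.items()
            if wanted & set(text.split())}


def cosine(a, b):
    """Cosine of two term-count vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a)
    len_a = math.sqrt(sum(v * v for v in a.values()))
    len_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (len_a * len_b) if len_a and len_b else 0.0


def two_stage_search(docs, query_terms):
    """Stage 2: rank only the filtered candidates, best match first."""
    candidates = cheap_filter(docs, query_terms)
    query_vec = Counter(query_terms)
    scored = [(cosine(Counter(text.split()), query_vec), doc_id)
              for doc_id, text in candidates.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


docs = {
    "d1": "object oriented programming in python",
    "d2": "the tv is an object in the room",
    "d3": "cooking noodles",
}
print(two_stage_search(docs, ["object", "oriented", "programming"]))
```

Here d2 is a false drop that passes the cheap filter because it contains "object", and the second stage ranks it below d1.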

4.11 Data Fusion
- the observation that no single retrieval technique works equally well in all situations has led to data fusion
  - the study of techniques for merging the results of multiple search techniques on multiple databases to produce the best possible response to a query
- to develop a retrieval technique that can adapt
  - the standardization of databases is a problem
- to determine a method to combine results fairly
  - measures of different natures have to be combined
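One simple way to combine measures of different natures is to rescale each result list to a common range before adding the scores. The sketch below uses min-max normalization followed by a sum; this is an assumed fusion rule chosen for illustration, not a method stated in the slides, and all names and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one technique's scores to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def fuse(*result_lists):
    """Sum the normalized scores and rank documents, best first."""
    fused = {}
    for scores in result_lists:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical outputs of two techniques on very different scales.
cosine_scores = {"d1": 0.9, "d2": 0.3, "d3": 0.6}
term_counts = {"d1": 2, "d2": 10, "d3": 8}

print(fuse(cosine_scores, term_counts))
```

Without normalization the raw term counts would dominate the cosine scores, which is the fairness problem noted above. A dissimilarity such as a distance would first have to be converted to a similarity before being fused this way.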

4.12 A User-Centered View
- Each user has an individual vocabulary
- retrieval systems commonly miss some documents that might have been informative to the user and retrieve others that the user does not find helpful
