Chapter 4 Matching Process


1 Chapter 4 Matching Process

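2 Matching Process
- A document need not be retrieved just because one word of the query appears in it.
  - The query may contain several words.
  - The word may be unimportant in context.
    - e.g. "This document is not about..."
  - This chapter assumes that documents match a query with some uncertainty.
    - The focus is on how strong the relevance is.
  - The focus is on the topicality of the document.
    - That is, the degree to which the topic of the document matches the topic of the query.
    - The user's knowledge, background, and preferences are covered in Chapter 6.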

3 4.1 Relevance and Similarity Measure
- Document space: an organized set of documents.
- When the document space does not contain queries:
  - There is a mapping from the document space into the query space (Boolean systems).
  - A characteristic function takes a value in [0, 1] according to each document's relevance to the query.
- When the document space contains queries:
  - The query is a point in the document space.
  - Relevant documents form a cluster near the query point.
  - The evaluation function defines a contour.
- Measure
  - The basis for the evaluation of each document.
  - Some computable function.

4 Measure
- A measure indicates whether a document is relevant to a query.
- Because relevance is ultimately in the mind of the user, it is difficult to measure directly.
- IR systems rely primarily on measurements taken from the document and query representations.
- Most systems equate relevance with lexical similarity, that is, the matching of words.

5 4.2 Boolean-Based Matching
- Matching tests whether a document contains a given term.
- The query is a logical function of the given words; the document is not.
  - There is no structural similarity between the two, hence the characteristic function.
- There is no basis for developing significant similarity judgments: a document either satisfies the query or does not.
  - A modified approach: assign grades to the results of 'A OR B OR C'.
- Because Boolean systems operate on the presence or absence of terms, many such systems do not include term frequency data.
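
The slide does not say how the results of 'A OR B OR C' are graded. A minimal sketch, assuming the grade is simply the number of distinct query terms a document contains (the function names and sample documents are hypothetical):

```python
def boolean_or_match(doc_terms, query_terms):
    """Characteristic function: 1 if any query term occurs in the document, else 0."""
    return int(any(term in doc_terms for term in query_terms))


def graded_or_match(doc_terms, query_terms):
    """Assumed grading rule: count the distinct query terms present in the document."""
    return sum(1 for term in set(query_terms) if term in doc_terms)


# Hypothetical documents, each represented as a set of index terms.
docs = {
    "d1": {"a", "b", "c"},
    "d2": {"a"},
    "d3": {"x"},
}
query = ["a", "b", "c"]

# Plain Boolean OR cannot tell d1 from d2; the grade can.
ranked = sorted(docs, key=lambda d: graded_or_match(docs[d], query), reverse=True)
```

Under plain Boolean OR both d1 and d2 satisfy the query equally; with the assumed grade, d1 ranks ahead of d2.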

6 4.3 Vector-Based Matching: Metrics
- Metrics: distance measures and angular measures.
  - Distance measure
    - Assumes that documents close together in the vector space are similar.
  - Angular measure
    - Assumes that documents lying in similar directions in the vector space are similar.
- The distance of a document from itself is 0.
  - A metric is therefore a dissimilarity measure, not a similarity measure.
  - A conversion is needed.
- A linear conversion from a metric to a similarity measure is generally not desirable.
  - If a metric δ is converted to a similarity σ by σ = k - δ, it is hard to choose an appropriate value of k.

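7 4.3 Vector-Based Matching: Metrics
- An inversion transform maps the distance into a fixed positive range of numbers.
- The transform uses a base b > 1 and a monotonically increasing function P(δ); the formula itself is not preserved in this transcript.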
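
The transform's formula is missing from the transcript, which keeps only the conditions b > 1 and P(δ) monotonically increasing. A minimal sketch, assuming the form σ = b^(-P(δ)) with P(δ) = δ as the default; both the form and the default are assumptions, not taken from the slide:

```python
def similarity_from_distance(delta, b=2.0, p=lambda d: d):
    """Map a distance delta >= 0 to a similarity in (0, 1].

    Assumed form: sigma = b ** (-p(delta)), with b > 1 and p monotonically
    increasing, so a larger distance always gives a smaller similarity.
    """
    if b <= 1:
        raise ValueError("b must be greater than 1")
    if delta < 0:
        raise ValueError("a distance cannot be negative")
    return b ** (-p(delta))


# With p(0) = 0, distance 0 maps to similarity 1, and similarity decays toward 0.
values = [similarity_from_distance(d) for d in (0.0, 1.0, 2.0, 10.0)]
```

Unlike the linear conversion σ = k - δ, this kind of transform needs no upper bound k on the distance and never produces a negative similarity.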

8 4.4 Vector-Based Matching: Cosine Measure
- This is not a distance measure but an angular measure:
  $$\sigma(D, Q) = \frac{\sum_k t_k q_k}{\sqrt{\sum_k t_k^2}\,\sqrt{\sum_k q_k^2}}$$
  where $t_k$ is the value of term $k$ in the document and $q_k$ is its value in the query.
- This is the inner product of the document and query vectors, normalized by their lengths.
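
The cosine measure described here (inner product of the document and query vectors, divided by the product of their lengths) can be sketched as follows. Returning 0.0 for an all-zero vector is a choice made for this sketch, since the slide does not cover that case:

```python
import math


def cosine_similarity(doc, query):
    """Inner product of the two term vectors, normalized by their lengths."""
    if len(doc) != len(query):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(t * q for t, q in zip(doc, query))
    doc_length = math.sqrt(sum(t * t for t in doc))
    query_length = math.sqrt(sum(q * q for q in query))
    if doc_length == 0 or query_length == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (doc_length * query_length)
```

Vectors pointing in the same direction score close to 1 regardless of their lengths, and vectors with no terms in common score 0.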

9 Measure comparison
- Distance measures
  - Similarity depends only on how far a given document is from the query point.
- Angular measures
  - They do not consider the distance of each document from the origin, only its direction.
  - Two documents that lie along the same vector from the origin will be judged identical, even though they may be far apart in the document space.

10 Measure comparison
- Example: three document vectors D1, D2, and D3 (their components are not preserved in this transcript).
  - Cosine measure
    - σ(D1, D2) = 1.0, σ(D1, D3) = 0.6
  - Euclidean distance
    - δ(D1, D2) = 314.96, δ(D1, D3) = 2.83
  - The cosine measure judges D1 and D2 to be the more similar pair, whereas the distance measure judges D1 and D3 to be the more similar pair.
- In practice, distance and angular measures seem to give results of similar quality.
  - This holds when the documents are sufficiently far from the origin.
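
The vectors for this example did not survive in the transcript. As an illustration only, the sketch below uses hypothetical vectors D1 = (1, 3), D2 = (100, 300), D3 = (3, 1). They reproduce the cosine values 1.0 and 0.6 and the D1-D3 distance 2.83, but give a D1-D2 distance of about 313.07 rather than the 314.96 shown, so the slide's vectors may have differed or that figure may contain an arithmetic slip.

```python
import math

# Hypothetical vectors chosen for illustration; not recovered from the slide.
D1 = (1, 3)
D2 = (100, 300)
D3 = (3, 1)


def cosine(a, b):
    """Angular measure: inner product normalized by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean(a, b):
    """Distance measure: straight-line distance between the two points."""
    return math.dist(a, b)


cos_12, cos_13 = cosine(D1, D2), cosine(D1, D3)
dist_12, dist_13 = euclidean(D1, D2), euclidean(D1, D3)

# The two kinds of measure disagree about which document is closer to D1.
cosine_prefers_d2 = cos_12 > cos_13
distance_prefers_d3 = dist_13 < dist_12
```

D2 lies in the same direction as D1 but far from it, while D3 lies nearby in a different direction, which is exactly the situation in which angular and distance measures rank documents differently.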

11 4.5 Missing Terms and Term Relationship
- One problem: missing terms.
  - A value of 0 has two meanings: the term does not occur, or there is no information about whether it occurs.
  - A term may be missing from a document description because an indexer did not think it significant, rather than because it does not occur in the document. A term may likewise be left out of a query by the user.

12 4.5 Missing Terms and Term Relationship
- Another problem: term relationships.
  - Vector operations assume that the components are independent of one another.
  - This can produce wrong results, e.g. "digital computer".
- Final problem: symmetry.
  - Distance and angular measures both treat the query and the document symmetrically.
  - Users want documents that fit the query, not queries that fit the document.
    - e.g. an encyclopedia: the entry that matches a user's query contains a great many words that do not appear in the query.

13 4.6 Probabilistic Matching
- Attention is focused on models that include uncertainties more directly.
- The goal is to calculate the probability that the document is relevant to the query.
- Assumptions
  - At any given time a single query is being used.
  - The number of documents within the database that are relevant to the query is known.

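14 4.6 Probabilistic Matching
- Probabilities when a document is selected at random, where n of the N documents in the database are relevant:
  - P(rel) = n/N
  - P(¬rel) = 1 - P(rel) = (N - n)/N
  - In practice, documents are selected by comparing the words of the query and the document.
- P(computer|digital) > P(computer|?)
- Case 1
  - A selected set S of documents is judged relevant if P(rel|selected) > P(¬rel|selected) for every document in S.
  - Discriminant function: dis(selected) = P(rel|selected) / P(¬rel|selected)
  - A set is retrieved if dis(selected) > 1 for every document in it.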

15 4.6 Probabilistic Matching
- Case 2
  - Condition: the probability of relevance is more than 3 times the probability of nonrelevance.
    - P(rel|selected) > 3 P(¬rel|selected)
    - P(rel|selected) > 3 (1 - P(rel|selected))
    - P(rel|selected) + 3 P(rel|selected) > 3
    - P(rel|selected) > 0.75
  - The discriminant function criterion is then
    - dis(selected) > 3
- To judge the relevance of a single document, the formula above is applied term by term.
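
The two cases can be checked numerically. This sketch takes the discriminant function to be the ratio of the relevance probability to the nonrelevance probability, which is what the criteria dis(selected) > 1 and dis(selected) > 3 imply; the function names are hypothetical:

```python
import math


def discriminant(p_rel_given_selected):
    """Ratio P(rel|selected) / P(not rel|selected)."""
    p = p_rel_given_selected
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie in [0, 1]")
    if p == 1.0:
        return math.inf  # certain relevance: the ratio is unbounded
    return p / (1.0 - p)


def is_retrieved(p_rel_given_selected, threshold=1.0):
    """Retrieve when the discriminant strictly exceeds the threshold."""
    return discriminant(p_rel_given_selected) > threshold


# Case 1 (threshold 1) is equivalent to P(rel|selected) > 0.5.
case1 = [is_retrieved(p) for p in (0.4, 0.5, 0.6)]
# Case 2 (threshold 3) is equivalent to P(rel|selected) > 0.75.
case2 = [is_retrieved(p, threshold=3.0) for p in (0.6, 0.75, 0.8)]
```

In general a threshold c on the discriminant corresponds to the probability bound P(rel|selected) > c / (1 + c), which gives 0.5 for c = 1 and 0.75 for c = 3.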

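16 4.6 Probabilistic Matching
- Bayes's theorem:
  $$P(\mathrm{rel} \mid \mathrm{selected}) = \frac{P(\mathrm{selected} \mid \mathrm{rel})\,P(\mathrm{rel})}{P(\mathrm{selected})}$$
- Applying this to the discriminant function:
  $$\mathrm{dis}(\mathrm{selected}) = \frac{P(\mathrm{selected} \mid \mathrm{rel})\,P(\mathrm{rel})}{P(\mathrm{selected} \mid \neg\mathrm{rel})\,P(\neg\mathrm{rel})}$$
- Assume that a document is represented by terms and that these terms are statistically independent:
  $$P(\mathrm{selected} \mid \mathrm{rel}) = P(t_1 \mid \mathrm{rel})\,P(t_2 \mid \mathrm{rel}) \cdots P(t_n \mid \mathrm{rel})$$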

17 4.6 Probabilistic Matching
- If estimates can be obtained for the probability of occurrence of various terms in relevant documents and in nonrelevant documents, then the probability that a document will be retrieved can be estimated.

18 4.6 Probabilistic Matching
- Example
  - Fraction of relevant documents in the whole collection = 0.1
  - Because the resulting value is less than 1, the document is not retrieved.
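
The calculation for this example is not preserved in the transcript. The sketch below shows the kind of computation involved, combining Bayes's theorem with the term-independence assumption. Only the prior of 0.1 comes from the slide; the per-term probabilities are hypothetical values chosen for illustration:

```python
def term_discriminant(p_rel, term_probs):
    """Discriminant under term independence.

    dis = [P(rel) / P(not rel)] * product over terms of
          P(t | rel) / P(t | not rel)

    term_probs is a list of (P(t|rel), P(t|not rel)) pairs, one pair for
    each term that represents the document.
    """
    value = p_rel / (1.0 - p_rel)  # prior odds of relevance
    for p_t_rel, p_t_nonrel in term_probs:
        value *= p_t_rel / p_t_nonrel
    return value


p_rel = 0.1  # from the slide: fraction of relevant documents
# Hypothetical estimates for two terms; not taken from the slide.
term_probs = [(0.8, 0.3), (0.6, 0.4)]

dis = term_discriminant(p_rel, term_probs)
retrieved = dis > 1
```

With these assumed numbers the terms favor relevance by a factor of 4, but the low prior odds (0.1 / 0.9) pull the discriminant down to roughly 0.44, so the document is not retrieved.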

19 4.7 Fuzzy Matching
- Probabilistic matching involves much calculation and many assumptions.
- In fuzzy matching the calculation is based on defined membership grades for terms.
- This computation is simpler than that for probabilistic retrieval, since it involves simple functions of the membership grades for each document; it is based on fuzzy arithmetic.
  - e.g. Avg(max(μ_D1(t1), μ_D2(t1), ...), max(μ_D1(t2), μ_D2(t2), ...))
- The open question is how such terms translate into the membership functions associated with fuzzy retrieval.
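
As an illustration of "simple functions of the membership grades", the sketch below uses the standard fuzzy-set operators (min for AND, max for OR, 1 - μ for NOT). Choosing these operators is an assumption, since the slide does not name them, and the membership grades are hypothetical:

```python
def fuzzy_and(*grades):
    """Standard fuzzy intersection: the smallest membership grade."""
    return min(grades)


def fuzzy_or(*grades):
    """Standard fuzzy union: the largest membership grade."""
    return max(grades)


def fuzzy_not(grade):
    """Standard fuzzy complement."""
    return 1.0 - grade


# Hypothetical membership grades of terms t1..t3 in two documents.
memberships = {
    "D1": {"t1": 0.9, "t2": 0.2, "t3": 0.0},
    "D2": {"t1": 0.4, "t2": 0.7, "t3": 0.5},
}


def score(doc_id):
    """Evaluate the query: t1 AND (t2 OR t3)."""
    m = memberships[doc_id]
    return fuzzy_and(m["t1"], fuzzy_or(m["t2"], m["t3"]))


scores = {doc_id: score(doc_id) for doc_id in memberships}
```

Each document gets a grade between 0 and 1 from a handful of comparisons, with no probability estimates or independence assumptions required.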

20 4.8 Proximity Matching
- A much older and more widely used matching method involves the proximity of terms in a text.
- Proximity measures are frequently used as additional criteria to further refine the set of documents identified by one of the other matching methods.
- Modifications of proximity criteria can increase their effectiveness.
  - e.g. ordered proximity
    - "junior college" vs. "college junior"
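
The difference between ordered and unordered proximity can be sketched as follows. The function name, the whitespace tokenization, and the window parameter are assumptions made for this illustration:

```python
def within_proximity(tokens, first, second, window=1, ordered=True):
    """Return True if `first` and `second` occur within `window` positions.

    With ordered=True, `first` must come before `second`; with
    ordered=False, either order is accepted.
    """
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and 0 < gap <= window:
                return True
            if not ordered and 0 < abs(gap) <= window:
                return True
    return False


text_a = "she attends a junior college".split()
text_b = "he is a college junior".split()

ordered_a = within_proximity(text_a, "junior", "college")
ordered_b = within_proximity(text_b, "junior", "college")
unordered_b = within_proximity(text_b, "junior", "college", ordered=False)
```

Unordered proximity matches both texts, while ordered proximity accepts only the text in which "junior" precedes "college".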

21 4.9 Effects of Weighting
- Not all terms are equally important in a query.
- Weighting of terms modifies the calculations on which relevance judgments are made.
- Weighting can also be applied at a broader level than individual terms.
  - (beef and broccoli):5; (beef but not broccoli):2; noodles:1; snow peas:1
- Filtering without weighting can come first, so that the more complex calculations are confined to a relatively small set of documents.
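
The weighted query from the slide can be sketched as a list of clauses with weights. Summing the weights of the satisfied clauses is an assumed scoring rule, since the slide gives the weights but not how they are combined; the sample documents are hypothetical:

```python
# Each clause is a (predicate over a document's term set, weight) pair.
# The weights are the ones on the slide.
weighted_clauses = [
    (lambda d: "beef" in d and "broccoli" in d, 5),
    (lambda d: "beef" in d and "broccoli" not in d, 2),
    (lambda d: "noodles" in d, 1),
    (lambda d: "snow peas" in d, 1),
]


def weighted_score(doc_terms):
    """Assumed rule: add up the weights of all satisfied clauses."""
    return sum(weight for clause, weight in weighted_clauses if clause(doc_terms))


# Hypothetical documents represented as sets of index terms.
docs = {
    "recipe1": {"beef", "broccoli", "noodles"},
    "recipe2": {"beef", "snow peas"},
    "recipe3": {"tofu"},
}
scores = {name: weighted_score(terms) for name, terms in docs.items()}
```

Because the first two clauses are mutually exclusive, a document containing beef earns either 5 or 2 from them, never both.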

22 4.10 Effects of Scaling
- The impact of the size of the document collection can be major.
  - The question is whether a technique will be feasible to apply to real document collections.
- False drops become more likely.
  - False drops are documents that appear to match the query but are not appropriate.
  - In a collection of computer documents, "object-oriented programming" is unlikely to produce false drops, but in a general collection it is likely to (a TV is also treated as an object).
- Information filtering
  - Produces a relatively small set containing a high proportion of relevant documents.
  - A simple technique extracts a small set of candidate documents, and a more complex technique then processes that set, much like the refining of gold.
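
The two-stage idea (a cheap filter followed by a costlier measure on the survivors) can be sketched as below. The particular pairing used here, a shared-term filter followed by cosine ranking over raw term counts, is an assumption for illustration, as are the sample documents:

```python
import math
from collections import Counter


def cheap_filter(docs, query_terms):
    """Stage 1: keep only documents that share at least one query term."""
    wanted = set(query_terms)
    return {doc_id: text for doc_id, text in docs.items()
            if wanted & set(text.lower().split())}


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    length_a = math.sqrt(sum(c * c for c in a.values()))
    length_b = math.sqrt(sum(c * c for c in b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)


def two_stage_search(docs, query):
    """Stage 2: rank only the filtered candidates with the costlier measure."""
    query_terms = query.lower().split()
    candidates = cheap_filter(docs, query_terms)
    query_vector = Counter(query_terms)
    scored = [(cosine(Counter(text.lower().split()), query_vector), doc_id)
              for doc_id, text in candidates.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


docs = {
    "d1": "object oriented programming in java",
    "d2": "the television is an object in the room",
    "d3": "cooking noodles at home",
}
ranking = two_stage_search(docs, "object oriented programming")
```

The filter discards d3 outright, and the ranking stage then places the likely false drop d2 well below d1.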

23 4.11 Data Fusion
- The recognition that no single retrieval technique will work equally well in all situations has led to data fusion.
  - Data fusion is the study of techniques for merging the results of multiple search techniques on multiple databases to produce the best possible response to a query.
- One goal is to develop a retrieval technique that can adapt.
  - Standardization of databases is a problem.
- Another goal is to determine a method for combining results fairly.
  - This means combining measures of different kinds.
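
The slide does not prescribe a fusion method. One common approach, sketched here as an assumption, is to rescale each technique's scores to [0, 1] with min-max normalization and then sum them per document; the score values are hypothetical:

```python
def min_max_normalize(scores):
    """Rescale one technique's scores to [0, 1] so that scales are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}  # no spread to rescale
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(*result_sets):
    """Sum the normalized scores per document and rank by the total."""
    fused = {}
    for scores in result_sets:
        for doc_id, s in min_max_normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores from two techniques that use very different scales.
technique_a = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
technique_b = {"d1": 10.0, "d2": 40.0, "d4": 20.0}

ranking = fuse(technique_a, technique_b)
```

A document missing from one result set simply contributes nothing for that technique, which penalizes documents found by only one method; whether that is fair is exactly the kind of question the slide raises.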

24 4.12 A User-Centered View
- Each user has an individual vocabulary.
- Retrieval systems commonly miss some documents that might have been informative to the user and retrieve others that the user does not find helpful.

