1 3. Weighting and Matching against Indices
2007.1.20, Artificial Intelligence Laboratory, 송승미
Text: Finding Out About, pages 60-104
2 Microscopic Semantics and the Statistics of Communication
Table 3.1: English letter frequency. Character frequencies are useful for simple ciphers, crosswords, and the like.
Example ciphertext:
UZQSOVUOHXMOPVGPOZPEVSGZWSZOPFPESXUDBMETSXAIZVUEPHZHMDZSHZOWSFPAPPDTSVPQUZWTMXUZUHSXEPTEPOPDZSZUFPOMBZWPFUPZHMDJUDTMOHMQ
Frequency count: P occurs 16 times and Z occurs 14 times, so they are the most likely candidates for E and T.
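The frequency count above is easy to reproduce; the following is a minimal sketch (the ciphertext string is taken from the slide, everything else is illustrative):

```python
from collections import Counter

# Ciphertext from the slide (a simple substitution cipher example).
ciphertext = (
    "UZQSOVUOHXMOPVGPOZPEVSGZWSZOPFPESXUDBMETSXAIZVUEPHZHMDZSHZOWS"
    "FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPDZSZUFPOMBZWPFUPZHMDJUDTMOHMQ"
)

# Count how often each ciphertext letter occurs.
counts = Counter(ciphertext)

# Print letters from most to least frequent; the top letters are the
# best candidates for common English letters such as E and T.
for letter, n in counts.most_common():
    print(letter, n)
```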
3 In this Chapter…
– What are we counting?
– What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use?
– What can we tell about the meaning of these features, based on such statistics?
– How can we find meaning in text?
– How are such attempts to be distinguished?
4 Remember Zipf
The linguist George Kingsley Zipf counted every word appearing in a sample of English books and tabulated their frequencies.
The words used most often in American English: the (1000) → of (500) → and (250) → to (125).
Only a handful of words are used very frequently; most other words occur with similarly small counts.
5 F(w): the number of times word w occurs anywhere in the corpus.
Sort the vocabulary according to frequency, so that
– r = 1 is the most frequently occurring word
– r = 2 is the next most frequently used word, and so on.
6 Zipf's law
– An empirical observation
– F(r): frequency of the word with rank r
– F(r) = C / r^α, with α ≈ 1 and C ≈ 0.1
(A mathematical derivation of Zipf's law is given in Chapter 5.)
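As a rough illustration of the F(r) = C / r^α relationship, the sketch below ranks words by frequency and prints r * F(r), which should stay roughly constant when α ≈ 1. The helper name zipf_check and the toy input are illustrative assumptions, not from the text:

```python
from collections import Counter

def zipf_check(text: str) -> None:
    """Rank words by frequency and print rank, frequency, and rank * frequency.

    Under Zipf's law F(r) ≈ C / r (with α ≈ 1), the product r * F(r)
    should stay roughly constant across ranks.
    """
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    for r, (word, f) in enumerate(ranked, start=1):
        print(f"rank={r:3d}  word={word:12s}  F(r)={f:5d}  r*F(r)={r * f}")

# Any reasonably large plain-text corpus will do; a short string is used
# here only so the sketch runs as-is.
zipf_check("the cat sat on the mat and the dog sat on the log")
```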
7 Zipfian Distribution of AIT Words
Figure: word frequency as a function of its frequency rank, plotted on log/log axes; the curve is nearly linear with a negative slope.
8 Principle of Least Effort
Words as tools.
– Unification: authors would like to always use a single word.
– Diversification: readers would like a unique word for each purpose.
– Vocabulary balance: use existing words and avoid coining new ones.
9 WWW Surfing Behavior
A recent example of Zipf-like distributions.
10 A Statistical Basis for Keyword Meaning
– Non-noise words vs. noise words (noise words occur very frequently)
– External keywords vs. internal keywords
11 Word Occurrence as a Poisson Process
– Function words (of, the, but, ...) occur essentially at random throughout arbitrary text.
– Content words do not: their occurrences cluster in the documents that are about them.
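To make the Poisson intuition concrete, here is a minimal sketch that compares a word's observed per-document counts with the Poisson distribution implied by its average rate. The toy corpus, the helper names, and the printed comparison are illustrative assumptions rather than anything from the text:

```python
import math
from collections import Counter

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of seeing k occurrences under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def compare_to_poisson(word: str, documents: list[str]) -> None:
    # Observed number of occurrences of `word` in each document.
    per_doc = [doc.lower().split().count(word) for doc in documents]
    lam = sum(per_doc) / len(documents)  # mean rate per document

    observed = Counter(per_doc)
    print(f"word={word!r}, mean occurrences per document={lam:.2f}")
    for k in range(max(per_doc) + 1):
        expected = poisson_pmf(k, lam) * len(documents)
        print(f"  k={k}: observed {observed.get(k, 0)} docs, "
              f"Poisson expects about {expected:.1f}")

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "grep is a search tool",
    "the index maps keywords to documents",
]
compare_to_poisson("the", docs)   # function word: close to Poisson
compare_to_poisson("cat", docs)   # content word: concentrated in a few docs
```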
12 Resolving Power (1/2)
– Repetition is an indication of emphasis.
– Resolving power: the ability of words to discriminate content.
– Resolving power is maximal at the middle ranks.
– Thresholds filter out the rest: high-frequency noise words and low-frequency, rare words.
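A minimal sketch of the thresholding idea, assuming a precomputed document-frequency table; the particular cut-off values are placeholders chosen for illustration, not taken from the text:

```python
def middle_rank_terms(doc_freq: dict[str, int], n_docs: int,
                      low: float = 0.01, high: float = 0.5) -> list[str]:
    """Keep terms whose document frequency falls between two thresholds.

    Terms appearing in more than `high` of all documents are treated as
    noise words; terms appearing in fewer than `low` are too rare to help.
    Both cut-offs are illustrative and would normally be tuned per corpus.
    """
    kept = []
    for term, df in doc_freq.items():
        fraction = df / n_docs
        if low <= fraction <= high:
            kept.append(term)
    return kept

doc_freq = {"the": 980, "of": 950, "retrieval": 120, "zipf": 3, "quux": 1}
print(middle_rank_terms(doc_freq, n_docs=1000))  # ['retrieval']
```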
13 Resolving Power (2/2)
– High-frequency words appear in so many documents that they are of little use for distinguishing or representing them.
– Very rare words are likewise of little help for discriminating among documents in general.
14 Language Distribution
– Exhaustivity: the number of topics indexed.
– Specificity: the ability to describe the FOA information need precisely.
– An index is a balance between user and corpus: not too exhaustive, not too specific.
15 Exhaustivity ≈ N(terms assigned to a document)
– Exhaustive ▷ high recall, low precision
– A document-oriented "representation" bias
Specificity ≈ 1 / N(documents assigned the same term)
– Specific ▷ low recall, high precision
– A query-oriented "discrimination" bias
16 Specificity/Exhaustivity Trade-Offs
17 Indexing Graph
18 Weighting the Index Relation
Weight: the strength of the association between a keyword and a document, expressed as a single real number.
19 Informative Signals vs. Noise Words
– The least informative words (noise words) occur uniformly across the corpus, e.g. the.
– Informative signals: a measure used to weight the keyword-document relation.
20 Hypothetical Word Distributions
Figure: a perfectly uniform distribution of a word across the corpus rarely happens in practice.
21 Inverse Document Frequency
Up to this point we have counted word occurrences; what we really want to know is the number of documents containing a keyword ▷ IDF.
– IDF (inverse document frequency): the reciprocal of the number of documents, out of the whole corpus, in which keyword k occurs.
– The comparison is in terms of documents, not just word occurrences.
– High IDF ▷ few documents contain keyword k; low IDF ▷ many documents contain it.
For example, with Ndoc = 1000 documents, a keyword that appears in 10 of them gets idf = log(1000/10), while one that appears in 500 gets only log(2).
22 Vector Space
– With vectors we can measure how far, and in which direction, something lies from a reference point.
– The vector space model was introduced to make the question of how to compute the similarity between documents more mathematically precise.
23 (Figure: a query and two documents plotted in a two-dimensional keyword space.)
Consider a simple 2-D plane based on just two index terms, the coordinate system (information, retrieval):
– Document 1: D1 = (0.8, 0.3)
– Document 2: D2 = (0.2, 0.7)
– Query "information retrieval": Q = (0.4, 0.8)
Which of Document 1 and Document 2 is closer to the query?
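One way to answer the question is cosine similarity, the usual vector-space measure; the sketch below applies it to the coordinates above. The function name and the choice of cosine (rather than, say, Euclidean distance) are illustrative assumptions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Coordinates from the slide: (information, retrieval) weights.
d1 = [0.8, 0.3]
d2 = [0.2, 0.7]
q = [0.4, 0.8]

for name, doc in [("D1", d1), ("D2", d2)]:
    print(name, round(cosine_similarity(q, doc), 3))
# With these two-keyword weights, D2 comes out closer to the query;
# the three-keyword example on the next slide ranks D1 first.
```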
24 With three keywords, the document closest to the query is D1.
25 Calculating TF-IDF Weighting
– TF: term frequency; IDF: inverse document frequency
– idf_k = log(Ndoc / D_k)
– W_kd = F_kd * idf_k
where F_kd is the frequency with which keyword k occurs in document d, Ndoc is the total number of documents in the corpus, and D_k is the number of documents containing keyword k.
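A minimal sketch of the W_kd = F_kd * idf_k weighting defined above, assuming whitespace-tokenized documents; the helper names and the toy corpus are illustrative:

```python
import math
from collections import Counter

def tfidf_weights(documents: list[str]) -> list[dict[str, float]]:
    """Return, for each document d, a map keyword k -> W_kd = F_kd * idf_k."""
    tokenized = [doc.lower().split() for doc in documents]
    ndoc = len(tokenized)

    # D_k: number of documents containing keyword k.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # F_kd
        weights.append({
            k: f * math.log(ndoc / doc_freq[k])
            for k, f in term_freq.items()
        })
    return weights

docs = [
    "information retrieval and indexing",
    "indexing weights every keyword",
    "zipf studied word frequency",
]
for d, w in enumerate(tfidf_weights(docs), start=1):
    print(f"doc {d}:", {k: round(v, 2) for k, v in w.items()})
```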
26 SMART Weighting Specification
27 IDF variants: inverse, squared, probabilistic, frequency
(Formula table not reproduced here.)