Presentation transcript:

1 Signal-to-Noise Ratio
- Based on information theory (Claude Shannon, 1948)
- Information (Shannon's definition): the unexpectedness of a message, independent of its meaning
- Information content of a choice: H(p_1, p_2, ..., p_n)
  - n messages (events), where p_i is the probability that message i occurs
  - p_1 + p_2 + ... + p_n = 1, each p_i nonnegative
- Goal: measure the information content of the choice of one message from this set of messages

2 Signal-to-Noise Ratio
- Three assumptions used to define H
  - H is a continuous function of the p_i
    - a small change in the probabilities causes only a small change in H
  - If all probabilities are equal (p_i = 1/n), H is a monotonically increasing function of n
    - the more candidate messages there are, the larger H becomes
  - If one choice can be split into two successive choices, the weighted sum of the H values after the split must equal the original H

3 Signal-to-Noise Ratio
- An example illustrating the third assumption
  - p_1 = 1/2, p_2 = 1/3, p_3 = 1/6
  - Choosing one of the three messages directly:
    - H(1/2, 1/3, 1/6)
  - Choosing first between the first message and the rest:
    - H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3)
  - Choosing first between the second message and the rest:
    - H(1/2, 1/3, 1/6) = H(2/3, 1/3) + 2/3 H(3/4, 1/4)

4 Signal-to-Noise Ratio
- The only function satisfying all three assumptions on H is the entropy function from physics
  - H = -K * sum_i p_i log2 p_i
  - With K = 1, H = sum_i p_i log2(1/p_i)
    - the average information content
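
A minimal sketch of this formula in Python, including a numeric check of the decomposition property from slide 3; the function name and the toy probabilities are illustrative choices, not from the slides.

import math

def entropy(probs, k=1.0):
    # H = -K * sum(p_i * log2 p_i) = K * sum(p_i * log2(1/p_i)); probs are
    # assumed nonnegative and to sum to 1, and p_i = 0 terms contribute nothing.
    return k * sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Decomposition property from slide 3:
# H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 * H(2/3, 1/3)
print(entropy([1/2, 1/3, 1/6]))                          # ~1.459 bits
print(entropy([1/2, 1/2]) + 1/2 * entropy([2/3, 1/3]))   # ~1.459 bits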

5 [Summary] Two kinds of information content
1. Information content of an event
  - the unexpectedness of the event occurring
  - log2(1/p_i)
2. Information content of a choice among events
  - the probabilities of the candidate events sum to 1
  - the average of the information contents of the candidate events
  - H = sum_i p_i log2(1/p_i)
  - higher when the event probabilities are close to one another
  - even if the information content of the choice is low, the occurrence of a low-probability event (one with high information content) still has high unexpectedness

6 Signal-to-Noise Ratio (continued)
- Signal-to-noise ratio s_k: measures the value of an index term from an information-theoretic viewpoint
- Weight: w_ik = f_ik * s_k
  - noise of term k: n_k = sum_i (f_ik / t_k) log2(t_k / f_ik) = sum_i log2[(t_k / f_ik)^(f_ik / t_k)]
    - t_k: the total frequency of term k in the collection
    - f_ik: the frequency of term k in document i
  - signal of term k: s_k = log2(t_k) - n_k (nonnegative; why?)
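
A small Python sketch of these definitions; the term-document frequency matrix freq and the helper name signal_noise_weights are hypothetical, chosen only to illustrate the formulas above.

import math

def signal_noise_weights(freq):
    # freq[i][k] = f_ik, the frequency of term k in document i.
    n_docs, n_terms = len(freq), len(freq[0])
    totals = [sum(freq[i][k] for i in range(n_docs)) for k in range(n_terms)]  # t_k

    noise, signal = [], []
    for k in range(n_terms):
        t_k = totals[k]
        n_k = 0.0
        for i in range(n_docs):
            f = freq[i][k]
            if f > 0:
                n_k += (f / t_k) * math.log2(t_k / f)
        noise.append(n_k)
        signal.append(math.log2(t_k) - n_k)   # s_k = log2(t_k) - n_k

    # w_ik = f_ik * s_k
    weights = [[freq[i][k] * signal[k] for k in range(n_terms)]
               for i in range(n_docs)]
    return noise, signal, weights

# Term 0 is concentrated in one document (high signal); term 1 is spread
# evenly across all documents (signal 0).
freq = [[4, 1],
        [0, 1],
        [0, 1],
        [0, 1]]
print(signal_noise_weights(freq))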

7 Term Discrimination Value
- How well a term distinguishes one document from another
  - requires a measure of the similarity of two documents
    - do they contain the same key terms?
    - document similarity delta(D_1, D_2): 1 if the documents are very similar, 0 if they are completely different
- Average similarity of a document collection
  - 1/(N(N-1)) * sum over all pairs of delta(D_i, D_j), which costs O(N^2)
  - a simpler computation:
    - use the centroid document D*, which costs O(N)
    - f*_k = sum_i f_ik / N = t_k / N; the average similarity is Delta* = c * sum_i delta(D*, D_i)

8 Term Discrimination Value
- Discrimination value of term k: DV_k = Delta*_k - Delta*
  - Delta*_k: the average similarity computed with term k deleted from the collection
  - Delta*: the average similarity with term k present
  - DV_k > 0: term k increases the dissimilarity between documents
  - DV_k < 0: term k decreases the dissimilarity
  - the better a term is as a discriminator, the larger its positive DV_k
- Weight: w_ik = f_ik * DV_k
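
A sketch of this computation in Python. The slides leave the concrete similarity function open, so cosine similarity stands in for delta(D_i, D_j) here, and the 3-document frequency matrix is a hypothetical example.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def avg_similarity_to_centroid(freq, skip_term=None):
    # Average similarity of each document to the centroid D* (f*_k = t_k / N),
    # optionally with one term column deleted.
    n_docs, n_terms = len(freq), len(freq[0])
    cols = [k for k in range(n_terms) if k != skip_term]
    centroid = [sum(freq[i][k] for i in range(n_docs)) / n_docs for k in cols]
    docs = [[freq[i][k] for k in cols] for i in range(n_docs)]
    return sum(cosine(d, centroid) for d in docs) / n_docs

def discrimination_values(freq):
    # DV_k = (average similarity with term k deleted) - (average similarity).
    base = avg_similarity_to_centroid(freq)
    return [avg_similarity_to_centroid(freq, skip_term=k) - base
            for k in range(len(freq[0]))]

# Hypothetical 3-document, 3-term frequency matrix; a positive DV_k marks
# term k as a good discriminator, a negative DV_k as a poor one.
freq = [[2, 0, 1],
        [0, 3, 1],
        [1, 1, 1]]
print(discrimination_values(freq))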

9 Other methods of analysis
- A document carries more than simple statistical information
  - e.g., natural language processing
- Pragmatic factors
  - trigger phrases
    - signal that a particular type of information is present
    - figure, table, for example, conclusion, ...
  - source of the document
    - a well-known author, a prestigious journal, ...
  - information about the user
    - high school student or Ph.D.? well versed in the topic or not?

10 Document Similarity
- Similarity
  - the key concept behind information storage and retrieval
  - goal: retrieve documents whose content is similar to the information expressed by the query
- Lexically based measures are dominant
  - normalized similarity measures are used to reduce variation caused by document length and other factors

11 Lexically based measures
- Basic representation: vector form
  - D = <t_1, t_2, ..., t_N>
    - t_i: the value for the i-th term in the vocabulary
    - t_1, t_2, ..., t_N are term frequencies, or 0/1 indicators of term occurrence
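
A minimal Python sketch of both representations; the vocabulary and the sample sentence are hypothetical.

def to_vectors(text, vocabulary):
    # Represent a document over a fixed vocabulary, either as term
    # frequencies or as 0/1 occurrence indicators.
    tokens = text.lower().split()
    freq = [tokens.count(term) for term in vocabulary]
    occurrence = [1 if f > 0 else 0 for f in freq]
    return freq, occurrence

vocab = ["information", "retrieval", "query", "document"]
text = "the query is matched against every document and every document is scored"
print(to_vectors(text, vocab))
# ([0, 0, 1, 2], [0, 0, 1, 1])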

12 Occurrence-oriented (0-1 vectors)
- Basic comparison unit: delta(D_1, D_2) = w - (n_1 * n_2 / N)
  - can be greater or less than 0 (the larger, the more similar the documents)
  - equals 0 when w takes its independence value (w = n_1 * n_2 / N)
  - n_1 = w + x
  - n_2 = w + y
  - N = w + x + y + z
  - w = the number of terms for which t_1i = t_2i = 1
  - x = the number of terms for which t_1i = 1, t_2i = 0
  - y = the number of terms for which t_1i = 0, t_2i = 1
  - z = the number of terms for which t_1i = 0, t_2i = 0
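
A short Python sketch of this comparison; the two 0-1 vectors are hypothetical.

def contingency(d1, d2):
    # Count w, x, y, z for two 0-1 term vectors of the same length N.
    w = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)
    x = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 0)
    y = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 1)
    z = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)
    return w, x, y, z

def delta(d1, d2):
    # delta(D1, D2) = w - n1*n2/N; it is 0 when w equals its independence value.
    w, x, y, z = contingency(d1, d2)
    n1, n2, N = w + x, w + y, w + x + y + z
    return w - (n1 * n2) / N

d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 0]
print(delta(d1, d2))   # 2 - 3*3/6 = 0.5, i.e. more overlap than chance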

13 Occurrence-oriented (0-1 vectors)

14 Occurrence-oriented (0-1 vectors)
- Coefficient of association (correlation coefficient): C(D_1, D_2) = delta(D_1, D_2) divided by a normalizing coefficient
  - delta on its own can become very large, so the value divided by the normalizing coefficient is used as the final association (similarity) coefficient
  - e.g., with N = 10,000, w = 1,000, n_1 = 1,000, n_2 = 1,000, delta is 900
- Separation coefficient
  - the degree to which two documents are separated (the opposite notion of similarity), between 0 and 1
  - similarity = average separation - separation between the two documents
  - the average separation is N/2

15 Occurrence-oriented (0-1 vectors)

16 Occurrence-oriented (0-1 vectors)
- Other coefficients

17 Occurrence-oriented (0-1 vectors)
- Association coefficients that do not use delta
  - Dice's coefficient
    - does not use the independence value
    - uses only the w term, divided by the arithmetic mean of n_1 and n_2: 2w / (n_1 + n_2)
  - Cosine coefficient: w / sqrt(n_1 * n_2)
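
Both coefficients in a few lines of Python for 0-1 vectors; the example vectors are hypothetical, and the two values coincide here only because n_1 = n_2.

import math

def dice(d1, d2):
    # Dice: w divided by the arithmetic mean of n1 and n2, i.e. 2w / (n1 + n2).
    w = sum(1 for a, b in zip(d1, d2) if a and b)
    n1, n2 = sum(d1), sum(d2)
    return 2 * w / (n1 + n2) if (n1 + n2) else 0.0

def cosine01(d1, d2):
    # Cosine for 0-1 vectors: w divided by the geometric mean, w / sqrt(n1 * n2).
    w = sum(1 for a, b in zip(d1, d2) if a and b)
    n1, n2 = sum(d1), sum(d2)
    return w / math.sqrt(n1 * n2) if (n1 and n2) else 0.0

d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 0]
print(dice(d1, d2), cosine01(d1, d2))   # 0.666..., 0.666...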

18 Frequency-oriented
- Frequency-based similarity
  - based on a metric (distance measure)
  - three assumptions (the metric axioms)
    - nonnegative, and the distance between identical documents is 0
    - symmetric
    - triangle inequality: d(A, B) + d(B, C) >= d(A, C)
  - similarity is inversely related to distance
  - pseudo-metric
    - allows the distance between documents that are actually different to be 0
    - this arises when documents are represented by a list of key terms; suitable for full-text retrieval

19 Frequency-oriented
- Similarity is a decreasing function of distance
  - e.g., if d is the distance, e^(-d) can serve as the similarity function
- L_p metrics: d_p(D_1, D_2) = (sum_i |t_1i - t_2i|^p)^(1/p)
  - typical choices of p:
    - p = 1: city block (or Manhattan) distance
    - p = 2: Euclidean distance
    - p = infinity: maximal direction distance

20 Frequency-oriented
- Example: frequency vectors D_1 = <...>, D_2 = <...>, D_3 = <...>, D_4 = <...>
  - the relative distances from D_1 to D_2 and to D_4 change depending on which metric is used
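
The point can be reproduced with a small Python sketch; the vectors below are hypothetical stand-ins, since the slide's own example values are not preserved in this transcript.

def lp_distance(d1, d2, p):
    # Minkowski (L_p) distance: p=1 city block, p=2 Euclidean,
    # p=float("inf") maximal direction distance.
    diffs = [abs(a - b) for a, b in zip(d1, d2)]
    if p == float("inf"):
        return max(diffs)
    return sum(x ** p for x in diffs) ** (1.0 / p)

D1, D2, D4 = [3, 0, 1], [1, 2, 1], [3, 3, 1]
for p in (1, 2, float("inf")):
    print(p, lp_distance(D1, D2, p), lp_distance(D1, D4, p))
# L1 puts D4 closer to D1 (3 vs 4); L2 and L_inf put D2 closer (2.83 vs 3, 2 vs 3).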

21 7. Problems of using an uncontrolled vocabulary
- The impact of very common terms: stop list
- Variants of a given term: stemming
- The use of different terms with similar meanings: thesaurus

22 Stop list (negative dictionary)
- The most common words in English (the, of, and, ...) account for 50% or more of any given text
- Maintaining a stop list can improve performance
- But the use of stop words should be considered carefully
  - e.g., "To be, or not to be" consists entirely of very common words
  - adding a subject-dependent stop list to the general one can mitigate this problem to some extent
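
A minimal Python sketch of stop-word removal; the stop list below is a tiny illustrative sample, not a full general-purpose list.

STOP_WORDS = {"the", "of", "and", "a", "an", "to", "in", "is", "or", "not", "be"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Drop every token that appears in the stop list.
    return [t for t in text.lower().split() if t not in stop_words]

print(remove_stop_words("the analysis of the collected documents"))
# ['analysis', 'collected', 'documents']
print(remove_stop_words("to be or not to be"))
# []  (the whole query disappears, the caution raised on this slide)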

23 Stemming
- A given word may occur in many different forms
- For example: computer, computers, computing, compute, computes, computation, computational, computationally
- A stemming algorithm can improve performance
- Stemming mostly removes suffixes, applied repeatedly
  - the goal is to find the longest matching suffix at the end of the word

24 Stemming
- Why prefixes are not used
  - it is hard to tell whether a prefix is really a prefix or part of the word
    - inner, interior, into
  - removing a prefix may change the meaning of the word substantially
    - negative prefixes (unfortunate vs. fortunate)
- Problems
  - the result of stemming can change the meaning of a word
    - e.g., breed = bre + ed
  - the stem itself changes in some English plurals
    - e.g., knives = knive + s (the singular is knife)
  - stemming the full text is very expensive
    - alternative: stem only the query and use a wildcard *
      - computers -> comput*
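
A naive iterative suffix-stripping stemmer, sketched in Python only to illustrate the longest-suffix idea and the failure cases named above; the suffix list is a small hypothetical sample, and a real system would use a rule-based stemmer such as Porter's, which also handles exceptions.

SUFFIXES = ["ational", "ation", "ers", "ies", "ing", "es", "ed", "er", "s"]

def naive_stem(word, min_stem=3):
    # Repeatedly strip the longest matching suffix, keeping at least
    # min_stem characters of stem.
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                changed = True
                break
    return word

for w in ["computers", "computing", "computation", "breed", "knives"]:
    print(w, "->", naive_stem(w))
# computers -> comput, computing -> comput, computation -> comput
# breed -> bre    (the meaning-change problem above)
# knives -> kniv  (not "knife": the stem-change problem above)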

25 Thesaurus
- Different terms can take on similar meanings
  - e.g., post a letter = mail a letter
- A thesaurus contains
  - synonyms and antonyms
  - broader and narrower terms
  - closely related terms
- During the storage process,
  - control the vocabulary
  - replace each term variant with a standard term chosen on the basis of the thesaurus

26 Thesaurus
- During the query process,
  - broaden the query, which helps ensure that relevant documents are not missed
- Problems
  - Homographs
    - two words with distinct meanings but identical spellings
    - distinguishing them requires syntactic, semantic, and pragmatic analysis
    - e.g., I can can a can.
  - Homonyms (an issue for multimedia documents)
    - words that sound alike but have distinct meanings
    - e.g., bore vs. boar
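
A toy Python sketch of thesaurus-based query broadening; the synonym table and the function name are hypothetical stand-ins for a real thesaurus.

THESAURUS = {
    "post": {"mail", "send"},
    "mail": {"post", "send"},
    "letter": {"correspondence"},
}

def expand_query(query_terms, thesaurus=THESAURUS):
    # Broaden the query by adding the thesaurus entries for each term, so that
    # relevant documents that use a synonymous term are not missed.
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return sorted(expanded)

print(expand_query(["post", "letter"]))
# ['correspondence', 'letter', 'mail', 'post', 'send']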