1 Text Analysis. 2 t Indexing t Matrix Representations t Term Extraction and Analysis t Term Association t Lexical Measures of Term Significance t Document.

1 Text Analysis

2
- Indexing
- Matrix Representations
- Term Extraction and Analysis
- Term Association
- Lexical Measures of Term Significance
- Document Similarity
- Problems of using an uncontrolled vocabulary

3 1. Indexing
- Indexing
  - the act of assigning index terms to a document
  - done manually or automatically
- Indexing language (vocabulary)
  - controlled or uncontrolled
    - controlled: limited to a predefined set of index terms
    - uncontrolled: allows use of any term that fits some broad criteria

4 1. Indexing
- Purpose
  - to permit easy location of documents by topic
  - to define topic areas, and hence relate one document to another
  - to predict the relevance of a given document to a specified information need
- Characteristics
  - exhaustivity: the breadth of coverage of the index terms
  - specificity: the depth of coverage

5 Manual indexing
- Generally, manual indexing uses uncontrolled indexing
- Problems
  - lack of consistency
    - exhaustivity and specificity differ from indexer to indexer
    - using a controlled vocabulary introduces a different problem
      - it can be hard to represent the content of a document accurately
  - indexer-user mismatch
    - the same concept is expressed with different terms
    - hard to resolve even with a controlled vocabulary

6 Manual indexing (continued)
- Characterizing the occurrence of terms
  - link
    - terms occur together or have a semantic relationship
      - e.g. digital and computer
    - indicated using conjunctions
  - role
    - indicates a term's function or usage
      - e.g. the name of a flower may appear in a botanical definition, or in a sentence describing its use in a garden
    - indicated using prepositional phrases
- Cross-referencing
  - enhances the usability of an indexing language
  - See, See also (RT), Broader term (BT), Narrower term (NT)

7 Automatic indexing
- Index terms are determined by an algorithm
  - almost always based on the frequency of occurrence
  - guiding principles
    - words can be divided into two subsets
      - grammatical/relational and content-bearing
    - among content-bearing words, those that occur more often are more important
    - a word whose occurrence in a document differs significantly from its average occurrence in the document collection can be used to distinguish that document

8 Automatic indexing (continued)
- Does not settle the issue of a controlled vocabulary vs. an uncontrolled one
- Recent trends
  - use of linguistic knowledge
    - syntactic structure
    - semantics and concepts
    - e.g. DR-LINK (both): distinguishes proper nouns, common nouns, etc.
  - inferencing techniques
- A major use of the index
  - inverted file: lists the documents containing each term
    - terms are matched to documents only once (the result is shared by all queries)
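The inverted file idea can be sketched in a few lines of Python. This is a minimal illustration, not the structure of any particular retrieval system: the function names `build_inverted_file` and `lookup`, the whitespace tokenization, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def lookup(inverted, query):
    """Return ids of documents containing every query term (boolean AND)."""
    term_sets = [set(inverted.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []

# Hypothetical sample collection.
docs = {
    1: "digital computer design",
    2: "computer music",
    3: "digital music archive",
}
inverted = build_inverted_file(docs)       # built once, reused by every query
print(inverted["computer"])                # [1, 2]
print(lookup(inverted, "digital music"))   # [3]
```

The term-to-document matching happens once, inside `build_inverted_file`; each query then only intersects posting lists that already exist.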

9 2. Matrix Representations
- Many-to-many relationship between terms and documents
  - three matrices are used to make the relationships explicit
    - term-document matrix
    - term-term matrix
    - document-document matrix

10 2. Matrix Representations (continued)
- term-document matrix, A
  - rows: vocabulary terms
  - columns: documents
  - 0: the term does not occur; 1 or N: the term occurs
- term-term matrix, T
  - rows and columns: vocabulary terms
  - nonzero (1 or N)
    - the ith and jth terms occur together in some document
    - or have some other relationship

11 2. Matrix Representations (continued)
- document-document matrix, D
  - rows and columns: documents
  - nonzero
    - the documents have some terms in common
    - or have some other relationship
      - e.g. an author in common
- These matrices are sparse, so storing the empty cells should be avoided
  - e.g. use a list of terms instead of the term-document matrix
    - a list of documents is attached to each term
    - when frequency matters, store 'frequency-document identifier' pairs
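A sparse version of the three matrices might look like the sketch below. The dictionary layouts, the variable names (`term_docs`, `term_term`, `doc_doc`), and the sample documents are illustrative assumptions. Here the term-term entry counts the documents in which two terms co-occur, and the document-document entry counts shared distinct terms, which are only two of the possible "relationships" the slides allow.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample collection.
docs = {
    "d1": "digital computer design",
    "d2": "computer music",
    "d3": "digital music archive",
}

# Sparse term-document matrix A: term -> {document id: frequency}.
# Zero cells are simply absent.
term_docs = {}
for doc_id, text in docs.items():
    for term, freq in Counter(text.split()).items():
        term_docs.setdefault(term, {})[doc_id] = freq

# Sparse term-term matrix T: (term, term) -> number of documents
# in which both terms occur. Keys are stored in alphabetical order.
term_term = Counter()
for text in docs.values():
    for a, b in combinations(sorted(set(text.split())), 2):
        term_term[(a, b)] += 1

# Sparse document-document matrix D: (doc, doc) -> number of shared distinct terms.
doc_doc = {}
for (i, text_i), (j, text_j) in combinations(docs.items(), 2):
    shared = set(text_i.split()) & set(text_j.split())
    if shared:
        doc_doc[(i, j)] = len(shared)

print(term_docs["computer"])
print(term_term[("computer", "digital")])
print(doc_doc[("d1", "d2")])
```

`term_docs` is the "list of terms with attached document lists" from the slide, with the frequency stored alongside each document identifier.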

12 3. Term Extraction and Analysis
- Frequency variation
  - one basis for selecting automatic indexing terms
- Zipf's law
  - $\text{rank} \times \text{frequency} \approx \text{constant}$, if the words are ranked in order of decreasing frequency
  - implies that frequency falls off rapidly among the most frequent words
- Frequently occurring words
  - grammatical necessity: the, of, and, a
  - half of any given text is made up of approximately 250 words
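One way to inspect the rank-frequency relationship is to tabulate rank × frequency for each word. The sketch below assumes whitespace tokenization, and `rank_frequency` is a hypothetical helper name. The sample sentence is far too short to exhibit Zipf's law and only shows the mechanics; a real corpus would be needed to see the product stay roughly constant.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(text.lower().split())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Hypothetical toy text; substitute a large corpus to test the law itself.
sample = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(sample)
for row in rows:
    print(row)
```

Note that words with equal frequency still receive different ranks, so the product cannot stay constant across a run of ties.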

13 3. Term Extraction and Analysis
- Why frequent words are unsuitable as index terms
  - nearly every document contains them
  - they are unrelated to the main ideas of the document
- Why rare words are unsuitable as index terms
  - they may be related to the ideas of the document, but a search on such words returns too few documents
    - inability to retrieve many documents
- Two thresholds for defining index terms
  - upper: cuts off high-frequency terms
  - lower: cuts off rare words
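The two-threshold rule can be sketched as a simple filter. The slides do not say which frequency the thresholds apply to; this example assumes document frequency (the number of documents containing the term), and the function name `select_index_terms`, the threshold values, and the sample documents are all illustrative.

```python
from collections import Counter

def select_index_terms(docs, lower, upper):
    """Keep terms whose document frequency lies in [lower, upper]."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if lower <= df <= upper)

# Hypothetical sample collection.
docs = [
    "the design of the digital computer",
    "the history of computer music",
    "the digital music archive",
    "the garden of roses",
]
selected = select_index_terms(docs, lower=2, upper=2)
print(selected)  # ['computer', 'digital', 'music']
```

With these settings "the" and "of" fall above the upper threshold, while words that occur in a single document fall below the lower one.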

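14 3. Term Extraction and Analysis
- Zipf's law is only a general guideline
  - if there are 100 words that each occur exactly once, the formula does not hold: each of them gets a different rank
- It contradicts "the most frequent 20% of the text words account for 70% of term usage."
  - $f = k r^{-1}$
    - the total number of word occurrences is the area under this curve, which can be found by integration; however, this total area is infinite
    - therefore no finite portion amounts to 70% of the total area
  - the same holds for $f = k r^{-(1-\varepsilon)}$ ($\varepsilon > 0$) and $f = k r^{-(1+\varepsilon)}$ ($\varepsilon > 0$)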
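The divergence argument can be written out as follows. This is a sketch under stated assumptions: rank $r$ is treated as a continuous variable, the area is taken over $0 < r < \infty$ (the slide does not give the integration limits), and $\varepsilon$ is the symbol used here for the small positive offset in the exponent.

```latex
\begin{align*}
\int_0^\infty k\,r^{-1}\,dr
  &= k\,\bigl[\ln r\bigr]_0^\infty = \infty
  && \text{(diverges at both ends)} \\
\int_1^\infty k\,r^{-(1-\varepsilon)}\,dr
  &= \frac{k}{\varepsilon}\,\bigl[r^{\varepsilon}\bigr]_1^\infty = \infty
  && \text{(diverges as } r \to \infty\text{)} \\
\int_0^1 k\,r^{-(1+\varepsilon)}\,dr
  &= -\frac{k}{\varepsilon}\,\bigl[r^{-\varepsilon}\bigr]_0^1 = \infty
  && \text{(diverges as } r \to 0\text{)}
\end{align*}
```

In each case the total area is infinite, so no finite portion can be 70% of it. The third case depends on the lower limit: if ranks are taken to start at 1, then $\int_1^\infty k\,r^{-(1+\varepsilon)}\,dr = k/\varepsilon$ is finite.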

15 4. Term Association
- Word pairs or phrases with sufficiently high frequency should be included in the indexing vocabulary
- Word proximity
  - depends on
    - a given number of intervening words,
    - the words appearing in the same sentence, etc.
  - word order, punctuation
- Several kinds of document collections must be considered
  - "digital computer" is significant in collections on medicine or music, but in computing it is too frequent to be significant, and in philosophy it is too rare to be significant
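Proximity based on a number of intervening words can be sketched as a windowed pair count. The function name `cooccurrence_within`, the `max_gap` parameter, and the sample sentence are assumptions for illustration; word order is preserved by counting ordered pairs, and sentence boundaries and punctuation are ignored here.

```python
from collections import Counter

def cooccurrence_within(tokens, max_gap):
    """Count ordered word pairs separated by at most max_gap intervening words."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # The slice holds the next (max_gap + 1) tokens after position i.
        for right in tokens[i + 1 : i + 2 + max_gap]:
            pairs[(left, right)] += 1
    return pairs

# Hypothetical sample text.
tokens = "the digital computer and the analog computer".split()

adjacent = cooccurrence_within(tokens, max_gap=0)   # strictly adjacent words
loose = cooccurrence_within(tokens, max_gap=4)      # up to four words in between
print(adjacent[("digital", "computer")])
print(loose[("digital", "computer")])
```

Pairs whose count exceeds some chosen threshold would be candidates for inclusion in the indexing vocabulary as phrases.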

16 5. Lexical Measures of Term Significance
- Development of an indexing language
  - begins with analysis of the words and phrases occurring
  - frequency per document -> frequency over the whole collection
- A term list is more practical than a term-document matrix
  - sparseness
- Frequency of a word phrase
  - cannot be derived directly from the frequencies of its component words, but its range is known
  - $f(AB) \le \min(f(A), f(B))$
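The bound on phrase frequency can be checked on a token list. This sketch assumes a phrase means two adjacent words in a fixed order; `phrase_bound` and the sample text are hypothetical.

```python
from collections import Counter

def phrase_bound(tokens, a, b):
    """Return (f(AB), min(f(A), f(B))) for the adjacent phrase 'a b'."""
    word_freq = Counter(tokens)
    phrase_freq = sum(1 for x, y in zip(tokens, tokens[1:]) if (x, y) == (a, b))
    return phrase_freq, min(word_freq[a], word_freq[b])

# Hypothetical sample text.
tokens = "digital computer music and digital archive of computer music".split()
f_ab, bound = phrase_bound(tokens, "digital", "computer")
print(f_ab, bound)
```

Every occurrence of the phrase uses up one occurrence of each component word, which is why the phrase frequency can never exceed the smaller of the two word frequencies.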

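17 5. Lexical Measures of Term Significance
- Absolute term frequency
  - can be very misleading
  - documents and document collections vary in size
- Relative term frequency
  - a value adjusted for sizes and characteristics
    - frequency within the document / length of the document (number of words)
  - frequency relative to the whole collection
    - total frequency of the word / sum of the frequencies of all words in the collection
    - number of documents containing the word / total number of documents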
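The three relative frequencies listed above can be computed as in the sketch below. The function names (`within_document`, `collection_share`, `document_share`) and the sample documents are illustrative assumptions, and tokenization is by whitespace.

```python
# Hypothetical sample collection.
docs = {
    "d1": "digital computer design for digital music",
    "d2": "computer music",
    "d3": "garden design",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

def within_document(term, doc_id):
    """Frequency in the document / length of the document in words."""
    tokens = tokenized[doc_id]
    return tokens.count(term) / len(tokens)

def collection_share(term):
    """Total frequency of the term / total number of word occurrences."""
    total_words = sum(len(tokens) for tokens in tokenized.values())
    return sum(tokens.count(term) for tokens in tokenized.values()) / total_words

def document_share(term):
    """Number of documents containing the term / number of documents."""
    return sum(term in tokens for tokens in tokenized.values()) / len(tokenized)

print(within_document("digital", "d1"))   # 2 of 6 words
print(collection_share("digital"))        # 2 of 10 words
print(document_share("computer"))         # 2 of 3 documents
```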

18 5. Lexical Measures of Term Significance (continued)
- Inverse document frequency weight
- Signal-to-noise ratio
- Term discrimination value

19 Inverse Document Frequency Weight
- The frequency of occurrence of a term is weighted by the number of documents that contain the term
  - a term that appears in many documents gets a low weight
- Inverse document frequency (idf)
  - $\log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$
    - $d_k$: the number of documents containing term $k$
    - $N$: the number of documents in the collection
  - minimum value = 1 (reached when $d_k = N$)

20 Inverse Document Frequency Weight (continued)
- Inverse document frequency weight (tf.idf)
  - $w_{ik} = f_{ik}\,[\log_2 N - \log_2 d_k + 1]$
  - increases with the frequency of the term in the document
  - decreases with the number of documents containing the term
  - logarithm: insensitive to growth in the size of the collection
    - if the collection doubles in size (with $d_k$ unchanged), the idf value increases by 1
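The weight $w_{ik} = f_{ik}\,[\log_2 N - \log_2 d_k + 1]$ can be computed directly from raw counts. The sketch below assumes $f_{ik}$ is the raw frequency of term $k$ in document $i$ and uses whitespace tokenization; the function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc id: {term: w_ik}} with w_ik = f_ik * (log2 N - log2 d_k + 1)."""
    n_docs = len(docs)
    term_freqs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter()  # d_k: number of documents containing term k
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())
    weights = {}
    for d, counts in term_freqs.items():
        weights[d] = {
            term: f * (math.log2(n_docs) - math.log2(doc_freq[term]) + 1)
            for term, f in counts.items()
        }
    return weights

# Hypothetical sample collection with N = 4.
docs = {
    "d1": "digital computer digital design",
    "d2": "computer music",
    "d3": "computer archive",
    "d4": "computer garden",
}
w = tf_idf(docs)
print(w["d1"]["computer"])  # in all 4 documents: idf = 1, weight = 1.0
print(w["d1"]["digital"])   # f = 2, idf = log2(4) - log2(1) + 1 = 3, weight = 6.0
```

"computer" appears in every document and so receives the minimum idf of 1, while "digital" is both frequent in d1 and confined to it, which gives it the highest weight in that document.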
