Chapter 2. Extracting Lexical Features
January 23, 2007, Artificial Intelligence Lab, 조선호
Text: FINDING OUT ABOUT, pp. 39-59
Preview
2.1 Building Useful Tools
2.2 Inter-document Parsing
2.3 Intra-document Parsing
    Stemming and Other Morphological Processing
    Noise Words
    Summary
2.4 Example Corpora
2.5 Implementation
    Basic Algorithm
    Fine Points
    Software Libraries
2.1 Building Useful Tools
Introduces the example of an IR system. Search engine development has three main phases:
1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, each an indexed string of the terms it contains.
2. Second phase: build an efficient data structure that inverts the index relation, so that we can find all documents containing a given keyword (more useful than finding all keywords contained in a given document).
3. Third phase: match queries against the indices to retrieve the documents most similar to each query.
Extracting lexical features is used mainly in the first and second phases: the goal is to extract the set of features that will carry meaning in later analysis. The specification of this atomic feature set, obtained through this work, is important.
Level of analysis: documents, words, roots, characters, ...
2.2 Inter-document Parsing
The step that turns a corpus (an arbitrary "pile of text") into individual, retrievable documents. Examples: AI theses (AIT) and email.
Multiple text fields: implemented by concatenation. Abstracts (annotations) can be used as proxies in hit lists, or given special emphasis.
Pre-filters for special document classes: deTeX; HTML and XML parsers (SAX, DOM) are examples that exploit the structural information in a document's composition. E.g., markup languages: TeX, XML, HTML. Filters exist to extract the meaningful text from such formats, as sketched below.
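As a sketch of such a pre-filter (assuming Python's standard html.parser module; the chapter does not prescribe any particular library), the following extracts the meaningful text from an HTML document while dropping tags and the bodies of script/style elements:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Pre-filter: keep text content, drop tags plus script/style bodies."""
        def __init__(self):
            super().__init__()
            self.chunks = []
            self.skip = 0          # nesting depth inside <script>/<style>

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1

        def handle_data(self, data):
            if not self.skip:
                self.chunks.append(data)

    def strip_html(source):
        parser = TextExtractor()
        parser.feed(source)
        return " ".join(" ".join(parser.chunks).split())

    # strip_html("<p>Hello <b>world</b><script>x()</script></p>") -> "Hello world"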
2.3 Intra-document Parsing
A file can be treated simply as a stream of characters. Processing a string of characters: assemble characters into tokens (tokenizer); choose which tokens to index.
Lexical analyzer generators, e.g., lex/yacc. The basic idea is a finite state machine: triples of (input state, transition token, output state).
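A minimal sketch of the finite-state idea behind a tokenizer, assuming just two states and an alphanumeric/other character classification (the states and classes are illustrative, not prescribed by the text):

    def tokenize(stream):
        """Two-state FSM: OUT (between tokens) and IN (inside a token).
        Each (state, character class) pair determines the next state;
        leaving IN emits a completed token."""
        state, current, tokens = "OUT", [], []
        for ch in stream:
            alnum = ch.isalnum()
            if state == "OUT" and alnum:        # OUT --letter--> IN
                state, current = "IN", [ch]
            elif state == "IN" and alnum:       # IN --letter--> IN
                current.append(ch)
            elif state == "IN" and not alnum:   # IN --other--> OUT: emit token
                tokens.append("".join(current).lower())
                state = "OUT"
        if state == "IN":                       # flush the final token
            tokens.append("".join(current).lower())
        return tokens

    # tokenize("Lord of the rings!") -> ['lord', 'of', 'the', 'rings']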
Lexical Analyzer
The output of the lexical analyzer is a string of tokens; all remaining operations work on these tokens. We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search. The same lexical analysis must be applied to both documents and queries!
Stemming and Other Morphological Processing
Conflation; stemming; rewrite rules; the Porter stemmer; other approaches; phrases.
Stemming
Additional processing at the token level (covered earlier this semester). Turn words into a canonical form: "cars" into "car", "children" into "child", "walked" into "walk". Decreases the total number of different tokens to be processed. Decreases the precision of a search, but increases its recall.
Conflation
Stemming
In stemming, suffixes are removed. The following are plural-to-singular pairs:
- WOMAN / WOMEN
- LEAF / LEAVES
- FERRY / FERRIES
- ALUMNUS / ALUMNI
- DATUM / DATA
Rewrite rules (see the sketch below).
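A minimal sketch of suffix rewrite rules, showing why irregular plurals such as WOMEN or ALUMNI must be handled by an exception table rather than by the rules themselves (the rule set and table are illustrative, not from the text):

    # Ordered (suffix, replacement) rewrite rules; the first match wins.
    RULES = [("ies", "y"), ("ves", "f"), ("es", ""), ("s", "")]

    # Irregular forms that no suffix rule can reach.
    IRREGULAR = {"women": "woman", "children": "child",
                 "alumni": "alumnus", "data": "datum"}

    def singularize(word):
        w = word.lower()
        if w in IRREGULAR:
            return IRREGULAR[w]
        for suffix, repl in RULES:
            if w.endswith(suffix) and len(w) > len(suffix) + 1:
                return w[:-len(suffix)] + repl
        return w

    # singularize("ferries") -> "ferry"; singularize("leaves") -> "leaf"
    # singularize("alumni")  -> "alumnus"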
Porter stemmer
Rules and rule matching (see the sketch below).
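A hedged sketch of Porter-style rule matching, not the full Porter stemmer: each rule is a (suffix, replacement) pair guarded by a condition on the measure m of the remaining stem (the number of vowel-consonant sequences), and the longest matching suffix wins. The three rules shown are genuine Porter rules; the surrounding code is illustrative, and Porter's special handling of 'y' is omitted for brevity:

    def measure(stem):
        """Porter's m: the number of vowel-consonant sequences in the stem."""
        pattern = ""
        for ch in stem.lower():
            cls = "V" if ch in "aeiou" else "C"
            if not pattern or pattern[-1] != cls:
                pattern += cls          # collapse runs: 'relat' -> 'CVCVC'
        return pattern.count("VC")

    # (suffix, replacement, m threshold): three genuine Porter rules.
    RULES = [("ational", "ate", 0),     # (m>0) ATIONAL -> ATE
             ("izer", "ize", 0),        # (m>0) IZER    -> IZE
             ("ement", "", 1)]          # (m>1) EMENT   -> (empty)

    def apply_rules(word, rules):
        """The longest matching suffix whose measure condition holds wins."""
        for suffix, repl, min_m in sorted(rules, key=lambda r: -len(r[0])):
            if word.endswith(suffix) and measure(word[:-len(suffix)]) > min_m:
                return word[:-len(suffix)] + repl
        return word

    # apply_rules("relational", RULES)  -> "relate"
    # apply_rules("replacement", RULES) -> "replac"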
Other approaches
Phrases
Noise Words
a.k.a. stop words, negative dictionaries. Function words that contribute little or nothing to meaning; very frequent words. If a word occurs in every document, it is not useful in choosing among documents. However, we need to be careful, because this is corpus-dependent. Often implemented as a discrete list (see the sketch below).
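A sketch of a corpus-dependent stop list, combining a fixed function-word list with any term whose document frequency exceeds a threshold; the threshold and word list are illustrative assumptions:

    from collections import Counter

    FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "you"}

    def stop_words(docs, df_threshold=0.9):
        """Corpus-dependent stop list: fixed function words plus any term
        occurring in at least df_threshold of all documents."""
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))       # count each term once per document
        frequent = {t for t, n in df.items() if n >= df_threshold * len(docs)}
        return FUNCTION_WORDS | frequent

    def remove_noise(tokens, stoplist):
        return [t for t in tokens if t not in stoplist]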
Summary
A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}. Highly efficient; makes learning far simpler and easier. The order of words is not that important for certain applications.
Stemming: reduces dimensionality; identifies a word by its root, e.g., flying, flew → fly.
Stop words: identifies the most common words, which are unlikely to help with text mining, e.g., "the", "a", "an", "you".
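The bag-of-words representation from this summary, as a short sketch over already-tokenized text (word order is discarded; occurrence counts are kept):

    from collections import Counter

    def bag_of_words(tokens):
        """Represent a document by its words and their occurrence counts."""
        return Counter(tokens)

    # bag_of_words(["the", "lord", "of", "the", "rings"])
    # -> Counter({'the': 2, 'lord': 1, 'of': 1, 'rings': 1})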
2.4 Example Corpora
We are assuming a fixed corpus. Some sample corpora: AIT; anyone's email; the Reuters corpus; the Brown corpus. A corpus will contain textual fields and maybe structured attributes.
Textual: free, unformatted, no meta-information; NLP is mostly needed here.
Structured: additional information beyond the content.
AI Theses (AIT)
AIT Year Distribution
Structured Fields for an Email Message
Header – From, To, Cc, Subject, Date
Text fields for email
Subject: format is structured, content is arbitrary. Captures the most critical part of the content. A proxy for content, but may be inaccurate.
Body of email: highly irregular, informal English. The entire document, not a summary. Spelling and grammar irregularities. Structure and length vary.
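A sketch of separating the structured header fields from the free-text body, assuming Python's standard email.parser module (the message itself is invented for illustration):

    from email.parser import Parser

    # An invented message for illustration.
    raw = ("From: alice@example.org\n"
           "To: bob@example.org\n"
           "Cc: carol@example.org\n"
           "Subject: Re: stemming question\n"
           "Date: Tue, 23 Jan 2007 10:00:00 +0900\n"
           "\n"
           "Quick follow-up: does the Porter stemmer handle 'alumni'?\n")

    msg = Parser().parsestr(raw)
    header = {f: msg[f] for f in ("From", "To", "Cc", "Subject", "Date")}
    subject = header["Subject"]   # structured format, arbitrary content
    body = msg.get_payload()      # free text: irregular, informal English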
2.5 Implementation
Indexing: we have a tokenized, stemmed sequence of words. The next step is to parse each document, extracting index terms. Assume that each token is a word and that we don't want to recognize any structures more complex than single words. When all documents have been processed, create the index.
Basic Algorithm
[Figure 2.4: Basic posting data structure]
Basic Indexing Algorithm
For each document in the corpus: get the next token; create or update an entry in a list of (doc ID, frequency) pairs. For each token found in the corpus: calculate the number of documents and the total frequency; sort by frequency.
Often called an "inverted index", because it inverts the "words in a document" relation into a "documents containing words" relation. May be built on the fly or created after indexing. A sketch follows.
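A sketch of the basic indexing algorithm above, assuming documents arrive as a dict of already-tokenized word lists (the data layout is illustrative):

    from collections import defaultdict

    def build_index(corpus):
        """Invert 'words in a document' into 'documents containing a word':
        term -> {doc_id: frequency}, plus per-term corpus statistics."""
        postings = defaultdict(dict)
        for doc_id, tokens in corpus.items():
            for token in tokens:
                postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
        # Per-term (#docs, total frequency), sorted by total frequency.
        stats = sorted(((term, len(p), sum(p.values()))
                        for term, p in postings.items()),
                       key=lambda row: -row[2])
        return postings, stats

    # corpus = {"d1": ["the", "lord", "of", "the", "rings"],
    #           "d2": ["lord", "jim"]}
    # build_index(corpus)[0]["lord"] -> {"d1": 1, "d2": 1}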
Refined Posting Data Structures
Minimizing OS dependencies.
Fine Points
Dynamic corpora (e.g., the web) require incremental algorithms.
Higher-resolution data (e.g., character position): supports highlighting; supports phrase searching; useful in relevance ranking. (See the sketch below.)
Giving extra weight to proxy text (typically by doubling or tripling the frequency count).
Document-type-specific processing: in HTML, we want to ignore tags; in email, we may want to ignore quoted material.
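A sketch of the higher-resolution posting idea, storing word positions (character offsets would work the same way) so that highlighting and phrase searching become possible; all names are illustrative:

    from collections import defaultdict

    def positional_index(corpus):
        """Higher-resolution postings: term -> doc_id -> word positions."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, tokens in corpus.items():
            for pos, token in enumerate(tokens):
                index[token][doc_id].append(pos)
        return index

    def phrase_match(index, doc_id, phrase):
        """True if the phrase's tokens occur at consecutive positions."""
        starts = index.get(phrase[0], {}).get(doc_id, [])
        return any(all(pos + i in index.get(tok, {}).get(doc_id, [])
                       for i, tok in enumerate(phrase[1:], start=1))
                   for pos in starts)

    # idx = positional_index({"d1": ["lord", "of", "the", "rings"]})
    # phrase_match(idx, "d1", ["of", "the"]) -> True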
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
[Venn diagram: within All Documents, the Retrieved and Relevant sets overlap in the Relevant & Retrieved region]
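A worked example of both measures over sets of document IDs (the IDs are invented):

    def precision_recall(retrieved, relevant):
        """Precision = |relevant & retrieved| / |retrieved|
           Recall    = |relevant & retrieved| / |relevant|"""
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # retrieved = {"d1", "d2", "d3", "d4"}; relevant = {"d2", "d4", "d7"}
    # precision_recall(retrieved, relevant) -> (0.5, 0.666...)
    # Half of what we retrieved was relevant; two thirds of what was
    # relevant got retrieved.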