Published by Charlene Singleton. Modified over 9 years ago.
1
Chapter 2. Extracting Lexical Features
January 23, 2007, Artificial Intelligence Laboratory, 조선호
Text: FINDING OUT ABOUT, pp. 39-59
2
Preview
2.1 Building Useful Tools
2.2 Inter-document Parsing
2.3 Intra-document Parsing
  2.3.1 Stemming and Other Morphological Processing
  2.3.2 Noise Words
  2.3.3 Summary
2.4 Example Corpora
2.5 Implementation
  2.5.1 Basic Algorithm
  2.5.2 Fine Points
  2.5.3 Software Libraries
3
2.1 Building Useful Tools
Introduces an example IR system.
The three main phases of search engine development:
1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, each indexed by the string of terms it contains.
2. Second phase: build an efficient data structure that inverts the index relation, so that all documents containing a particular keyword can be found (more useful than looking up the keywords contained in a particular document).
3. Third phase: match queries against the index to retrieve the documents most similar to the query.
Extracting lexical features is used mainly in the first and second phases. The goal is to extract a set of features that carry meaning for later analysis, so the specification of this basic feature set is important.
Levels of analysis: documents, words, roots, characters, ...
4
2.2 Inter-document Parsing
The step that turns a corpus (an arbitrary “pile of text”) into individual, retrievable documents.
Examples: AI theses (AIT) and email
Multiple text fields
- implemented as a concatenation
- use annotations
- used as proxies in the hitlist
- used for special emphasis
Pre-filters for special document classes
- deTeX
- HTML, XML parsers (SAX, DOM)
Email is an example of structural information that follows from how the text is composed.
Ex) mark-up languages: TeX, XML, HTML
☞ Filters exist that extract the meaningful text.
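A pre-filter for HTML can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not the tool the slides refer to: the names `TextExtractor` and `strip_html` are hypothetical, and the sketch keeps all character data (including the contents of `<script>` or `<style>` elements, which a real filter would probably drop).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for the text found between tags.
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of `markup` with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.chunks).split())
```

For example, `strip_html("<p>Hello <b>world</b></p>")` yields the plain string `Hello world`, which can then be passed on to the tokenizer.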
8
2.3 Intra-document Parsing
A file can be assumed to be simply a stream of characters.
Process a string of characters
- assemble characters into tokens (tokenizer)
- choose tokens to index
Lexical analyzer generator, e.g., Lex / yacc
Basic idea is a finite state machine
- triples of input state, transition token, output state
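The finite state machine idea can be illustrated with a hand-written two-state tokenizer. This is a sketch under assumptions that the slides do not state: tokens are maximal runs of alphanumeric characters, and case is folded to lower case. A generated lexer from Lex would be table-driven rather than written out like this.

```python
def tokenize(text):
    """Split `text` into tokens with a two-state machine.

    States: OUTSIDE (between tokens) and INSIDE (within a token).
    Assumption: a token is a maximal run of alphanumeric characters.
    """
    tokens = []
    current = []
    state = "OUTSIDE"
    for ch in text:
        if ch.isalnum():
            # Any state + alphanumeric character -> INSIDE
            current.append(ch.lower())
            state = "INSIDE"
        elif state == "INSIDE":
            # INSIDE + separator -> emit the token, go to OUTSIDE
            tokens.append("".join(current))
            current = []
            state = "OUTSIDE"
    if state == "INSIDE":
        # Flush the last token when the stream ends mid-token.
        tokens.append("".join(current))
    return tokens
```

With these rules, `tokenize("Lord of the Rings!")` gives `['lord', 'of', 'the', 'rings']`. Note that an apostrophe acts as a separator here, so a word such as "don't" is split in two; whether that is desirable is a design decision for the lexical analyzer.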
9
Lexical Analyzer
- Output of the lexical analyzer is a string of tokens
- Remaining operations are all on these tokens
- We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search
- Same lexical analysis for both documents and queries!
10
Stemming and Other Morphological Processing
- Conflation
- Stemming
- Rewrite rules
- Porter stemmer
- Other approaches
- Phrases
11
Stemming
Additional processing at the token level
We covered this earlier this semester
Turn words into a canonical form:
- “cars” into “car”
- “children” into “child”
- “walked” into “walk”
Decreases the total number of different tokens to be processed
Decreases the precision of a search, but increases its recall
12
Conflation
13
Stemming
In stemming, suffixes are removed. The following pairs show plurals reduced to their singular forms.
- WOMAN / WOMEN
- LEAF / LEAVES
- FERRY / FERRIES
- ALUMNUS / ALUMNI
- DATUM / DATA
Rewrite rules
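Suffix rewrite rules of this kind can be sketched as an ordered list of (suffix, replacement) pairs plus a small table for irregular forms. This is a deliberately crude illustration and not the Porter stemmer; the rule set, the `IRREGULAR` table, and the name `stem` are all assumptions made for the example.

```python
# Ordered rewrite rules: the first matching suffix wins (illustrative only).
RULES = [
    ("ies", "y"),  # ferries -> ferry
    ("ves", "f"),  # leaves  -> leaf
    ("s", ""),     # cars    -> car
]

# Irregular plurals cannot be handled by suffix rules, so look them up.
IRREGULAR = {
    "women": "woman",
    "alumni": "alumnus",
    "data": "datum",
    "children": "child",
}


def stem(word):
    """Reduce a plural to its singular form using the rules above."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith(("ss", "us")):
        # Guard against over-stemming words such as "class" or "alumnus".
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word
```

The sketch only covers plurals; a form such as "walked" passes through unchanged, and many real words would still be stemmed incorrectly, which is why practical systems use a fuller rule set.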
14
Porter stemmer Rules Rule matching
15
Other approaches Phrases
16
Noise Words
- a.k.a. stop words, negative dictionaries
- Function words that contribute little or nothing to meaning
- Very frequent words
- If a word occurs in every document, it is not useful in choosing among documents
- However, need to be careful, because this is corpus-dependent
- Often implemented as a discrete list
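Because noise words are corpus-dependent, one option is to derive the list from document frequencies rather than hard-coding it. The following is a sketch of that idea; the function name `corpus_stop_words` and the `threshold` parameter are hypothetical, and the slides themselves only mention a discrete list.

```python
def corpus_stop_words(docs, threshold=1.0):
    """Return words that occur in at least `threshold` of the documents.

    `docs` is a list of token lists. With threshold=1.0 only words that
    occur in every document are returned.
    """
    doc_freq = {}
    for tokens in docs:
        # Count each word once per document, however often it repeats.
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    n_docs = len(docs)
    return {w for w, df in doc_freq.items() if df / n_docs >= threshold}
```

Lowering `threshold` makes the list more aggressive, at the risk of discarding words that are frequent in this corpus but still meaningful to its users.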
17
Summary
Text document is represented by the words it contains (and their occurrences)
- e.g., “Lord of the rings” → { “the”, “Lord”, “rings”, “of” }
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
Stemming
- Reduces dimensionality
- Identifies a word by its root
- e.g., flying, flew → fly
Stop words
- Identifies the most common words that are unlikely to help with text mining
- e.g., “the”, “a”, “an”, “you”
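The "words and their occurrences" representation can be sketched with `collections.Counter`. The sketch assumes whitespace tokenization and lower-casing, neither of which is prescribed by the slides, and the name `bag_of_words` is chosen for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lower-cased, whitespace-separated word to its count."""
    return Counter(text.lower().split())
```

Since word order is discarded, `bag_of_words("Lord of the rings")` and `bag_of_words("rings of the Lord")` compare equal, which is exactly the simplification the summary describes.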
18
2.4 Example Corpora
We are assuming a fixed corpus. Some sample corpora:
- AIT
- Email (anyone’s email)
- Reuters corpus
- Brown corpus
Will contain textual fields, maybe structured attributes
- Textual: free, unformatted, no meta-information. NLP is mostly needed here
- Structured: additional information beyond the content
19
AI Theses (AIT)
20
AIT year Distribution
21
Structured Fields for Email An Email Message Header – From, To, Cc, Subject, Date
22
Text fields for Email
Subject
- Format is structured, content is arbitrary.
- Captures most critical part of content.
- Proxy for content, but may be inaccurate.
Body of email
- Highly irregular, informal English.
- Entire document, not summary.
- Spelling and grammar irregularities.
- Structure and length vary.
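Separating the structured header fields from the free-text body can be sketched with Python's standard `email` package. The helper name `split_email` is hypothetical, and the sketch assumes a simple, non-multipart message; for multipart mail `get_payload()` returns a list of parts rather than a string.

```python
from email import message_from_string

# Header fields treated as structured attributes in this sketch.
HEADER_FIELDS = ("From", "To", "Cc", "Subject", "Date")


def split_email(raw):
    """Return (structured header fields, body text) for a raw message."""
    msg = message_from_string(raw)
    # Missing headers come back as None.
    structured = {name: msg.get(name) for name in HEADER_FIELDS}
    body = msg.get_payload()
    return structured, body
```

The Subject value can then be indexed as proxy text for the message, while the body goes through the normal tokenizing and stemming steps.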
23
2.5 Implementation
Indexing
- We have a tokenized, stemmed sequence of words
- Next step is to parse the document, extracting index terms
- Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
- When all documents are processed, create the index
24
Basic algorithm Figure 2.4 Basic Posting Data Structure
25
Basic Indexing Algorithm
For each document in the corpus
- Get the next token
- Create or update an entry in a list: doc ID, frequency
For each token found in the corpus
- calculate #docs, total frequency
- sort by frequency
Often called a “reverse index”, because it reverses the “words in a document” index to be a “documents containing words” index.
May be built on the fly or created after indexing.
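The two loops of the basic indexing algorithm can be sketched with in-memory dictionaries. This is an illustration only: the names `build_index` and `term_statistics` are hypothetical, the corpus is assumed to be a dict from doc ID to an already tokenized and stemmed word list, and a real system would keep the postings on disk.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Build postings: token -> {doc ID: frequency in that document}."""
    postings = defaultdict(dict)
    for doc_id, tokens in corpus.items():
        for token, freq in Counter(tokens).items():
            postings[token][doc_id] = freq
    return postings


def term_statistics(postings):
    """Return (token, #docs, total frequency), most frequent first."""
    stats = [
        (token, len(docs), sum(docs.values()))
        for token, docs in postings.items()
    ]
    # Sort by descending total frequency; break ties alphabetically.
    stats.sort(key=lambda row: (-row[2], row[0]))
    return stats
```

For a corpus such as `{1: ["car", "walk", "car"], 2: ["car", "child"]}`, the postings for "car" are `{1: 2, 2: 1}`, so the token appears in 2 documents with a total frequency of 3.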
26
Refined Posting Data Structures Minimizing OS dependencies
27
Fine Points
Dynamic corpora (e.g., the web): require incremental algorithms
Higher-resolution data (e.g., character position)
- Supports highlighting
- Supports phrase searching
- Useful in relevance ranking
Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
Document-type-specific processing
- In HTML, want to ignore tags
- In email, maybe want to ignore quoted material
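Storing positions in the postings is what makes phrase searching possible. The sketch below records token positions rather than character positions to keep the example short; the names `build_positional_index` and `phrase_search` are hypothetical and the layout is an assumption, not the posting structure from the text.

```python
from collections import defaultdict


def build_positional_index(corpus):
    """Build token -> {doc ID: [token positions]} from doc ID -> tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in corpus.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents where the tokens of `phrase` are adjacent."""
    if not phrase or any(token not in index for token in phrase):
        return set()
    hits = set()
    for doc_id, starts in index[phrase[0]].items():
        for start in starts:
            # Every later token must sit at the next consecutive position.
            if all(
                start + offset in index[token].get(doc_id, ())
                for offset, token in enumerate(phrase)
            ):
                hits.add(doc_id)
                break
    return hits
```

The same position lists could drive highlighting, and proximity between query terms could feed into relevance ranking, at the cost of a noticeably larger index.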
29
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
[Figure: the Relevant and Retrieved sets within All Documents; their overlap is Relevant & Retrieved]
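Both measures follow directly from the sizes of the retrieved set, the relevant set, and their overlap. A minimal sketch, assuming documents are identified by hashable IDs and that an empty set gives a score of 0.0 (a convention chosen here, not taken from the slides):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If 4 documents are retrieved and 2 of them are among the 3 relevant ones, precision is 2/4 and recall is 2/3; multiply by 100 to express them as percentages.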