Chapter 2. Extracting Lexical Features. January 23, 2007. Artificial Intelligence Laboratory, 조선호. Text: FINDING OUT ABOUT, pages 39-59.


1 Chapter 2. Extracting Lexical Features
- January 23, 2007
- Artificial Intelligence Laboratory, 조선호
- Text: FINDING OUT ABOUT, pages 39-59

2 Preview
- 2.1 Building Useful Tools
- 2.2 Inter-document Parsing
- 2.3 Intra-document Parsing
  - 2.3.1 Stemming and Other Morphological Processing
  - 2.3.2 Noise Words
  - 2.3.3 Summary
- 2.4 Example Corpora
- 2.5 Implementation
  - 2.5.1 Basic Algorithm
  - 2.5.2 Fine Points
  - 2.5.3 Software Libraries

3 2.1 Building Useful Tools
- Introduces an example IR system.
- The three main phases of search engine development:
  1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, each indexed by the string of terms it contains.
  2. Second phase: build an efficient data structure that inverts the index relation, so that all documents containing particular keywords can be found. This is more useful than going from a particular document to all the keywords it contains.
  3. Third phase: match queries against the indices to retrieve the documents most similar to the query.
- Extracting lexical features is used mainly in the first and second phases. The goal is to extract a set of features that remain meaningful in later analysis.
- The specification of the unit feature set produced by this step is important.
- Level of analysis: documents, words, roots, characters, ...

4 2.2 Inter-document Parsing
- The step that turns a corpus (an arbitrary "pile of text") into individually retrievable documents
  - Examples: AI theses (AIT) and email
- Multiple text fields
  - Implemented as concatenation
- Annotations
  - Used as proxies in the hitlist
  - Used for special emphasis
- Pre-filters for special document classes
  - deTeX
  - HTML, XML parsers (SAX, DOM)
- Email is an example of structural information that follows from how the text is composed
  - E.g., mark-up languages: TeX, XML, HTML. Filters exist that extract the meaningful text.
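
A pre-filter for HTML can be sketched with Python's standard `html.parser` module. This is a minimal illustration and not the filter used in the source text; the names `TextExtractor` and `strip_html` are made up for this example, and skipping `script` and `style` content is an added assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical pre-filter: keeps text nodes, drops tags, scripts, styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(markup):
    """Return the visible text of an HTML string, with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The extracted text can then be handed to the same tokenizer as any other document.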

8 2.3 Intra-document Parsing
- A file can simply be assumed to be a stream of characters.
- Process a string of characters
  - Assemble characters into tokens (tokenizer)
  - Choose tokens to index
- Lexical analyzer generators, e.g., Lex / yacc
  - The basic idea is a finite state machine
  - Triples of input state, transition token, output state
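
The finite-state idea can be sketched as a hand-written two-state tokenizer driven by a table of (input state, character class) to output state entries. This is an illustrative sketch, not generated Lex output; the state names, the `TRANSITIONS` table, and the choice to treat alphanumeric characters as word characters and to lowercase them are all assumptions.

```python
OUTSIDE, INSIDE = "OUTSIDE", "INSIDE"

# (input state, character class) -> output state
TRANSITIONS = {
    (OUTSIDE, "word"): INSIDE,
    (OUTSIDE, "other"): OUTSIDE,
    (INSIDE, "word"): INSIDE,
    (INSIDE, "other"): OUTSIDE,
}


def tokenize(text):
    """Split a character stream into lowercase alphanumeric tokens."""
    state = OUTSIDE
    current = []   # characters of the token being assembled
    tokens = []
    for ch in text:
        char_class = "word" if ch.isalnum() else "other"
        next_state = TRANSITIONS[(state, char_class)]
        if next_state == INSIDE:
            current.append(ch.lower())
        elif state == INSIDE:
            # Leaving a token: emit it and reset the buffer.
            tokens.append("".join(current))
            current = []
        state = next_state
    if current:
        tokens.append("".join(current))   # token that ends at end of input
    return tokens
```

For example, `tokenize("Finding Out About: text, 2007!")` yields `["finding", "out", "about", "text", "2007"]`.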

9 Lexical Analyzer
- The output of the lexical analyzer is a string of tokens.
- All remaining operations work on these tokens.
- Some information has already been thrown away. This makes processing more efficient, but somewhat limits the power of the search.
- Use the same lexical analysis for both documents and queries!

10 Stemming and Other Morphological Processing
- Conflation
- Stemming
- Rewrite rules
- Porter stemmer
- Other approaches
- Phrases

11 Stemming
- Additional processing at the token level
- We covered this earlier this semester
- Turn words into a canonical form:
  - "cars" into "car"
  - "children" into "child"
  - "walked" into "walk"
- Decreases the total number of different tokens to be processed
- Decreases the precision of a search, but increases its recall
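
The effect on recall can be illustrated with a toy lookup table built from the three examples on this slide. The table `CANONICAL` and the helpers `conflate` and `matches` are hypothetical names for this sketch; a real stemmer would use rules rather than an explicit table.

```python
# Toy canonical-form table taken from the slide's examples.
CANONICAL = {"cars": "car", "children": "child", "walked": "walk"}


def conflate(token):
    """Map a token to its canonical form, or return it unchanged."""
    return CANONICAL.get(token, token)


def matches(query, document_tokens, stem=False):
    """True if the query term occurs in the document, optionally after conflation."""
    if stem:
        query = conflate(query)
        document_tokens = [conflate(t) for t in document_tokens]
    return query in document_tokens


doc = ["children", "walked", "past", "cars"]
print(matches("car", doc))             # exact matching misses the document
print(matches("car", doc, stem=True))  # conflation retrieves it
```

The same conflation also shrinks the vocabulary: "car" and "cars" become a single index entry, which is where the loss of precision comes from.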

12 Conflation

13 Stemming
- In stemming, suffixes are removed. The following pairs illustrate reducing plurals to their singular form:
  - WOMAN / WOMEN
  - LEAF / LEAVES
  - FERRY / FERRIES
  - ALUMNUS / ALUMNI
  - DATUM / DATA
- Rewrite rules
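
One way to handle the pairs above is an exception table for the irregular forms plus a short ordered list of suffix rewrite rules. The sketch below is a toy: the rule list is an assumption chosen to cover only these examples, and it will mis-handle many English words (for instance, "knives" would become "knif").

```python
# Irregular plurals that no suffix rule covers (from the slide's examples).
IRREGULAR = {"women": "woman", "alumni": "alumnus", "data": "datum"}

# Ordered suffix rewrite rules: (plural suffix, singular replacement).
SUFFIX_RULES = [("ies", "y"), ("ves", "f"), ("s", "")]


def singularize(word):
    """Reduce a plural to a singular form using the toy rules above."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character of stem before the suffix.
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)] + replacement
    return w
```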

14 Porter stemmer
- Rules
- Rule matching
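
As an illustration of rules and rule matching, the sketch below implements only step 1a of the Porter algorithm (SSES to SS, IES to I, SS to SS, S to nothing), where the rule with the longest matching suffix wins. The full stemmer has several more steps and conditions on the stem; the function name `porter_step_1a` is chosen for this example.

```python
# Step 1a rules, ordered so the longest matching suffix is tried first.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    """Apply the first (longest) matching step 1a rule to a lowercase word."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word   # no rule matched
```

Note that the "SS to SS" rule exists only to stop the shorter "S" rule from firing on words such as "caress".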

15
- Other approaches
- Phrases

16 Noise Words
- Also known as stop words or negative dictionaries
- Function words that contribute little or nothing to meaning
- Very frequent words
  - If a word occurs in every document, it is not useful in choosing among documents
  - However, care is needed, because this is corpus-dependent
- Often implemented as a discrete list
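
Both views of noise words can be sketched briefly: a fixed list applied as a filter, and a corpus-dependent set derived from document frequency. The list contents, the function names, and the `threshold` parameter are assumptions for illustration.

```python
# A small discrete stop list (illustrative, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "you"}


def remove_noise_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def corpus_noise_words(documents, threshold=1.0):
    """Return tokens that occur in at least `threshold` of all documents.

    `documents` is a list of token lists. With threshold=1.0 a token must
    occur in every document to count as noise for this corpus.
    """
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):   # count each document at most once
            doc_freq[token] = doc_freq.get(token, 0) + 1
    n_docs = len(documents)
    return {t for t, df in doc_freq.items() if df / n_docs >= threshold}
```

Deriving the set from the corpus is one way to respect the warning that noise words are corpus-dependent.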

17 Summary
- A text document is represented by the words it contains (and their occurrences)
  - e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}
  - Highly efficient
  - Makes learning far simpler and easier
  - Order of words is not that important for certain applications
- Stemming
  - Reduces dimensionality
  - Identifies a word by its root
  - e.g., flying, flew → fly
- Stop words
  - Identifies the most common words that are unlikely to help with text mining
  - e.g., "the", "a", "an", "you"
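
The word-occurrence representation can be sketched with `collections.Counter`. Splitting on whitespace and lowercasing are simplifying assumptions; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text by its words and their occurrence counts."""
    return Counter(text.lower().split())


# Word order is discarded: both phrases map to the same bag.
same = bag_of_words("Lord of the rings") == bag_of_words("rings of the Lord")
print(same)
```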

18 2.4 Example Corpora
- We are assuming a fixed corpus. Some sample corpora:
  - AIT
  - Email. Anyone's email.
  - Reuters corpus
  - Brown corpus
- A corpus will contain textual fields and possibly structured attributes
  - Textual: free, unformatted, no meta-information. NLP is mostly needed here.
  - Structured: additional information beyond the content

19 AI Theses (AIT)

20 AIT year Distribution

21 Structured Fields for Email
- An email message header: From, To, Cc, Subject, Date
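
The structured header fields can be pulled out with Python's standard `email` package. The message below is an invented sample (addresses, subject, and date are placeholders, not data from the source), and `structured_fields` is a hypothetical helper name.

```python
from email import message_from_string

# Invented sample message for illustration only.
RAW = """From: alice@example.org
To: bob@example.org
Cc: carol@example.org
Subject: Chapter 2 slides
Date: Tue, 23 Jan 2007 10:00:00 +0900

Body text here.
"""


def structured_fields(raw_message):
    """Return the five header fields named on the slide as a dict."""
    msg = message_from_string(raw_message)
    return {name: msg[name] for name in ("From", "To", "Cc", "Subject", "Date")}


fields = structured_fields(RAW)
print(fields["Subject"])
```

A header that is absent from the message comes back as `None`, so the caller can tell missing fields from empty ones.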

22 Text Fields for Email
- Subject
  - Format is structured, content is arbitrary.
  - Captures the most critical part of the content.
  - Proxy for content, but may be inaccurate.
- Body of email
  - Highly irregular, informal English.
  - Entire document, not a summary.
  - Spelling and grammar irregularities.
  - Structure and length vary.

23 2.5 Implementation
- Indexing
  - We have a tokenized, stemmed sequence of words
  - The next step is to parse the document, extracting index terms
  - Assume that each token is a word and that we don't want to recognize any structures more complex than single words
  - When all documents have been processed, create the index

24 Basic algorithm
- Figure 2.4: Basic Posting Data Structure

25 Basic Indexing Algorithm
- For each document in the corpus
  - Get the next token
  - Create or update an entry in a list: doc ID, frequency
- For each token found in the corpus
  - Calculate #docs, total frequency
  - Sort by frequency
- Often called a "reverse index" (also known as an inverted index), because it reverses the "words in a document" index to be a "documents containing words" index.
- May be built on the fly or created after indexing.
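
The two loops above can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: the corpus is a dict from doc ID to an already tokenized list, postings are plain dicts, and the names `build_index` and `summarize` are invented for this example.

```python
from collections import defaultdict


def build_index(corpus):
    """Build postings: token -> {doc_id: frequency of token in that doc}."""
    postings = defaultdict(dict)
    for doc_id, tokens in corpus.items():
        for token in tokens:
            # Create or update the (doc ID, frequency) entry for this token.
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return postings


def summarize(postings):
    """Return (token, #docs, total frequency) rows, most frequent first."""
    rows = [(tok, len(p), sum(p.values())) for tok, p in postings.items()]
    # Ties on total frequency are broken alphabetically for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


corpus = {1: ["lexical", "features", "lexical"], 2: ["features", "index"]}
index = build_index(corpus)
print(index["features"])   # documents containing "features"
```

Looking up a keyword is then a single dictionary access, which is the point of inverting the relation.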

26
- Refined Posting Data Structures
- Minimizing OS dependencies

27 Fine Points
- Dynamic corpora (e.g., the web): require incremental algorithms
- Higher-resolution data (e.g., character position)
  - Supports highlighting
  - Supports phrase searching
  - Useful in relevance ranking
- Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
- Document-type-specific processing
  - In HTML, we want to ignore tags
  - In email, we may want to ignore quoted material
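
Phrase searching over higher-resolution postings can be sketched as follows. As a simplification, this example stores token positions rather than character positions; the function names and the in-memory dict layout are assumptions.

```python
from collections import defaultdict


def build_positional_index(corpus):
    """Build token -> {doc_id: [positions of the token in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in corpus.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return doc IDs in which the phrase tokens occur consecutively."""
    if not phrase or any(token not in index for token in phrase):
        return set()
    hits = set()
    for doc_id, starts in index[phrase[0]].items():
        for start in starts:
            # The i-th phrase token must sit exactly i positions after start.
            if all(start + i in index[token].get(doc_id, [])
                   for i, token in enumerate(phrase)):
                hits.add(doc_id)
                break
    return hits
```

The same stored positions are what make highlighting of matches possible.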

29 Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
- Recall: the percentage of documents relevant to the query that were in fact retrieved
[Figure: Venn diagram over All Documents showing the Relevant set, the Retrieved set, and their overlap, Relevant & Retrieved]
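
Both measures follow directly from the set overlap. The sketch below returns fractions rather than percentages; the function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant   # relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving documents 1, 2, 3, 4 when documents 2, 4, 6, 8, 10 are relevant gives a precision of 2/4 and a recall of 2/5.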

