Chapter 2. Extracting Lexical Features (January 23, 2007, Artificial Intelligence Lab, 조선호) Text: FINDING OUT ABOUT, pp. 39–59.


Preview
- 2.1 Building Useful Tools
- 2.2 Inter-document Parsing
- 2.3 Intra-document Parsing
  - Stemming and Other Morphological Processing
  - Noise Words
  - Summary
- 2.4 Example Corpora
- 2.5 Implementation
  - Basic Algorithm
  - Fine Points
  - Software Libraries

2.1 Building Useful Tools
- Introduces the example IR system.
- Search engine development has three main phases:
  1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, in which the string of terms contained in each document is indexed.
  2. Second phase: build an efficient data structure that inverts the index relation, so that all documents containing a given keyword can be found (more useful than finding all keywords contained in a given document).
  3. Third phase: match queries against the index to retrieve the documents most similar to each query.
- Extracting lexical features is used mainly in the first and second phases: the goal is to extract a set of features that remain meaningful in later analysis.
- The specification of the unit-level feature set obtained through this work is important.
- Levels of analysis: documents, words, roots, characters, ...

2.2 Inter-document Parsing
- The step that turns a corpus (an arbitrary "pile of text") into individually retrievable documents; examples: AI theses (AIT) and email.
- Multiple text fields: implemented as concatenation.
- Annotations: used as proxies in a hitlist, or for special emphasis.
- Pre-filters for special document classes:
  - deTeX
  - HTML and XML parsers (SAX, DOM) illustrate structural information derived from document composition.
- Ex) mark-up languages (TeX, XML, HTML): filters exist to extract the meaningful text.
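A pre-filter of this kind can be sketched with Python's standard `html.parser` module; this is a minimal illustration (not the book's implementation), and the sample markup is invented:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pre-filter for HTML: keep only character data, discard all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><h1>AI Theses</h1><p>A corpus of <b>documents</b>.</p></body></html>")
print(extractor.text())  # AI ThesesA corpus of documents.
```

A real filter would also normalize whitespace and might give extra weight to text from emphasis tags, as discussed later under proxy text.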

2.3 Intra-document Parsing
- A file can be treated simply as a stream of characters.
- Process a string of characters:
  - assemble characters into tokens (tokenizer)
  - choose tokens to index
- Lexical analyzer generators, e.g., lex / yacc
- Basic idea is a finite state machine
- Triples of (input state, transition token, output state)
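The finite-state idea can be sketched as a two-state scanner: the machine is either inside a token or outside one, and each character drives a transition. This is a simplified illustration, not a lex-generated analyzer:

```python
def tokenize(stream):
    """Two-state tokenizer: IN_TOKEN while reading letters/digits, OUT otherwise."""
    tokens, current = [], []
    for ch in stream:
        if ch.isalnum():
            # Transition OUT -> IN_TOKEN (or stay IN_TOKEN): accumulate the character.
            current.append(ch)
        elif current:
            # Transition IN_TOKEN -> OUT: emit the finished token.
            tokens.append("".join(current))
            current = []
    if current:                      # flush a token that ends at end-of-stream
        tokens.append("".join(current))
    return tokens

print(tokenize("Index the corpus, then match queries."))
# ['Index', 'the', 'corpus', 'then', 'match', 'queries']
```

Note that punctuation is silently discarded here, an example of information thrown away at this stage, as the next slide observes.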

Lexical Analyzer
- Output of the lexical analyzer is a string of tokens.
- All remaining operations work on these tokens.
- We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search.
- The same lexical analysis is applied to both documents and queries!

Stemming and Other Morphological Processing
- Conflation
- Stemming
  - Rewrite rules
  - Porter stemmer
- Other approaches
- Phrases

Stemming
- Additional processing at the token level (covered earlier this semester).
- Turn words into a canonical form:
  - "cars" into "car"
  - "children" into "child"
  - "walked" into "walk"
- Decreases the total number of different tokens to be processed.
- Decreases the precision of a search, but increases its recall.

Conflation

Stemming
- In stemming, suffixes are removed. The following are plural-to-singular pairs:
  - WOMAN / WOMEN
  - LEAF / LEAVES
  - FERRY / FERRIES
  - ALUMNUS / ALUMNI
  - DATUM / DATA
- Rewrite rules

- Porter stemmer
  - Rules
  - Rule matching
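The rule-matching idea can be sketched with a tiny ordered rule table. This is a drastically simplified sketch in the spirit of Porter's rewrite rules, not the full Porter algorithm (which also checks measure conditions on the remaining stem):

```python
# Ordered rewrite rules: the first (longest) matching suffix wins,
# mirroring Porter's longest-match rule selection.
RULES = [("sses", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # Require at least 2 characters of stem so "is" is not reduced to "i".
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([stem(w) for w in ["caresses", "ponies", "walked", "flying", "cars"]])
# ['caress', 'pony', 'walk', 'fly', 'car']
```

Irregular forms such as WOMEN or ALUMNI fall outside suffix rules entirely and need an exception dictionary, which is why rewrite rules alone are not a complete solution.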

- Other approaches
- Phrases

Noise Words
- a.k.a. stop words, negative dictionaries
- Function words that contribute little or nothing to meaning
- Very frequent words: if a word occurs in every document, it is not useful in choosing among documents
- However, care is needed, because this is corpus-dependent
- Often implemented as a discrete list
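A discrete stop list is typically just a set membership test applied after tokenization. The short list below is illustrative only; as the slide notes, a real list should be tuned to the corpus:

```python
# Illustrative stop list only; real lists are corpus-dependent and much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "you"}

def remove_noise(tokens):
    """Drop tokens found in the stop list (case-insensitively)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_noise(["the", "Lord", "of", "the", "rings"]))  # ['Lord', 'rings']
```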

Summary
- A text document is represented by the words it contains (and their occurrences)
  - e.g., "Lord of the rings" → { "the", "Lord", "rings", "of" }
  - Highly efficient; makes learning far simpler and easier
  - Order of words is not that important for certain applications
- Stemming
  - Reduces dimensionality by identifying a word with its root
  - e.g., flying, flew → fly
- Stop words
  - Identify the most common words, which are unlikely to help with text mining
  - e.g., "the", "a", "an", "you"

2.4 Example Corpora
- We are assuming a fixed corpus. Some sample corpora:
  - AIT
  - Email: anyone's email
  - Reuters corpus
  - Brown corpus
- Corpora will contain textual fields, and maybe structured attributes
  - Textual: free, unformatted, no meta-information; NLP is mostly needed here
  - Structured: additional information beyond the content

AI Theses (AIT)

AIT Year Distribution (figure: number of theses per year)

Structured Fields for Email
- An email message header: From, To, Cc, Subject, Date

Text Fields for Email
- Subject
  - Format is structured, content is arbitrary.
  - Captures the most critical part of the content.
  - Proxy for content, but may be inaccurate.
- Body of email
  - Highly irregular, informal English.
  - Entire document, not a summary.
  - Spelling and grammar irregularities.
  - Structure and length vary.

2.5 Implementation
- Indexing
  - We have a tokenized, stemmed sequence of words.
  - The next step is to parse each document, extracting index terms.
  - Assume that each token is a word and we don't want to recognize any structures more complex than single words.
  - When all documents are processed, create the index.

- Basic algorithm (Figure 2.4: Basic Posting Data Structure)

Basic Indexing Algorithm
- For each document in the corpus:
  - get the next token
  - create or update an entry in a list: (doc ID, frequency)
- For each token found in the corpus:
  - calculate the number of documents and the total frequency
  - sort by frequency
- Often called a "reverse index", because it reverses the "words in a document" index into a "documents containing words" index.
- May be built on the fly or created after indexing.
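The loop above can be sketched directly: for each document, walk its tokens and update a posting entry of (doc ID, frequency) per token. This is a minimal sketch of the posting structure, not Figure 2.4's exact layout, and the two-document corpus is invented:

```python
from collections import defaultdict

def build_index(corpus):
    """Build a reverse index: token -> {doc_id: frequency} postings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(corpus):
        for token in text.lower().split():          # tokenizing/stemming assumed done
            # Create or update the (doc ID, frequency) entry for this token.
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

corpus = ["the lord of the rings", "lord of war"]
index = build_index(corpus)
print(index["lord"])   # {0: 1, 1: 1}
print(index["the"])    # {0: 2}
```

From these postings, the per-token document count is `len(index[token])` and the total frequency is `sum(index[token].values())`, the two quantities the second loop computes.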

- Refined posting data structures
- Minimizing OS dependencies

Fine Points
- Dynamic corpora (e.g., the web) require incremental algorithms.
- Higher-resolution data (e.g., character position):
  - supports highlighting
  - supports phrase searching
  - useful in relevance ranking
- Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
- Document-type-specific processing:
  - in HTML, we want to ignore tags
  - in email, we may want to ignore quoted material

Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
- Recall: the percentage of documents relevant to the query that were, in fact, retrieved
(Venn diagram: within All Documents, the Relevant and Retrieved sets overlap in Relevant & Retrieved)
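These two definitions reduce to one set intersection. A minimal sketch, with invented document-ID sets standing in for real judgments:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from sets of document IDs."""
    hit = len(retrieved & relevant)          # the Relevant & Retrieved overlap
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}     # documents the system returned
relevant = {2, 4, 5}         # documents a judge marked relevant
p, r = precision_recall(retrieved, relevant)
print(p, r)                  # precision 2/4, recall 2/3
```

The stemming slide's trade-off is visible here: conflating more word forms tends to enlarge the retrieved set, which can raise recall while lowering precision.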