Stemming Algorithms. Information Retrieval and Recommendation Techniques: Midterm Report. Advisor: Prof. 黃三益. Students: 9142608 黃哲修, 9142609 張家豪.



Outline
- Introduction
- Types of stemming algorithms
- Experimental evaluations of stemming
- Stemming to compress inverted files
- Summary
- Appendix

Introduction
- Stemming is one technique for finding morphological variants of search terms.
- It is used to improve retrieval effectiveness and to reduce the size of indexing files.
- Taxonomy for stemming algorithms

Introduction (cont'd)
Criteria for judging stemmers
- Correctness
  - Overstemming: too much of a term is removed.
  - Understemming: too little of a term is removed.
- Retrieval effectiveness: measured with recall and precision, and also by speed, size, and so on.
- Compression performance
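As a rough illustration of how recall and precision are computed for one query, the sketch below compares an unstemmed and a stemmed run. The document ids and result sets are hypothetical and are not taken from any experiment in these slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}
unstemmed_run = {"d1", "d2"}                  # exact-term matches only
stemmed_run = {"d1", "d2", "d3", "d7", "d8"}  # variants match too, plus some noise

print(precision_recall(unstemmed_run, relevant))
print(precision_recall(stemmed_run, relevant))
```

In this made-up case the stemmed run finds more of the relevant documents but also returns more irrelevant ones, which is the usual trade-off attributed to stemming.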

Types of stemming algorithms
- Table lookup approach
- Successor variety
- n-gram stemmers
- Affix removal stemmers

Table lookup approach
- Store a table of all index terms and their stems, so that terms from queries and indexes can be stemmed very fast.
- Problems
  - No such data exists for English, and some terms are domain dependent.
  - Such a table has a storage overhead, though trading size for time is sometimes warranted.
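A minimal sketch of the table lookup idea, assuming the table fits in an in-memory dictionary. The entries in `STEM_TABLE` are hypothetical, and returning an unknown term unchanged is a design assumption, not something the slides specify.

```python
# Hypothetical term -> stem table; a real table would have to cover the
# whole indexing vocabulary, which is exactly the problem noted above.
STEM_TABLE = {
    "engineer": "engineer",
    "engineered": "engineer",
    "engineering": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lowercased term itself if it is not in the table."""
    key = term.lower()
    return table.get(key, key)


print(lookup_stem("Engineering"))
print(lookup_stem("stemming"))  # not in the table, so returned as-is
```

Each lookup is a single hash probe, which is where the speed comes from; the cost is the memory needed to hold every term.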

Successor Variety approach
- Determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.
- The successor variety of a string is the number of different characters that follow it in words in some body of text.
- The successor variety of substrings of a term decreases as more characters are added, until a segment boundary is reached, where it rises sharply.

Successor Variety approach (cont'd)
Test word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE

Prefix      Successor Variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (Blank)
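The READABLE example can be reproduced with a short sketch. One assumption is needed to match the listed letters: the end of a word (the blank) is counted as a successor only when no letter follows the prefix anywhere in the corpus.

```python
def successor_varieties(word, corpus):
    """Return a list of (prefix, successor letters) for every prefix of word."""
    rows = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters that follow this prefix in any corpus word.
        letters = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        # Assumption: count the blank only if nothing else follows the prefix.
        if not letters and prefix in corpus:
            letters = {"(blank)"}
        rows.append((prefix, sorted(letters)))
    return rows


CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

for prefix, letters in successor_varieties("READABLE", CORPUS):
    print(f"{prefix:<10}{len(letters)}  {','.join(letters)}")
```

The successor variety is the length of each letter list; the jump at READ is what the segmentation methods look for.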

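Successor Variety approach (cont'd)
Methods for segmenting a word once its successor varieties are known:
- Cutoff method: some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.
- Peak and plateau method: a segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
- Complete word method: a break is made after a segment that is a complete word in the corpus.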
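A hedged sketch of the cutoff and peak and plateau methods, applied to the successor varieties of READABLE (3, 2, 1, 3, 1, 1, 1, 1 for its eight prefixes). The cutoff value of 3 is an arbitrary choice for illustration, and the function names are made up here.

```python
def cutoff_breaks(varieties, cutoff):
    """Break after every prefix whose successor variety reaches the cutoff."""
    return [i + 1 for i, v in enumerate(varieties) if v >= cutoff]


def peak_plateau_breaks(varieties):
    """Break after a character whose variety exceeds both of its neighbours."""
    return [i + 1 for i in range(1, len(varieties) - 1)
            if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]]


def segment(word, breaks):
    """Split word at the given break positions (number of characters from the left)."""
    cuts = [0] + [b for b in breaks if 0 < b < len(word)] + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


VARIETIES = [3, 2, 1, 3, 1, 1, 1, 1]  # prefixes R, RE, REA, READ, ..., READABLE

print(segment("READABLE", peak_plateau_breaks(VARIETIES)))
print(segment("READABLE", cutoff_breaks(VARIETIES, cutoff=3)))
```

With these numbers the peak and plateau method yields READ + ABLE, while a cutoff of 3 also cuts after the initial R, which illustrates how sensitive the cutoff method is to the chosen value.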

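Successor Variety approach (cont'd)
Entropy method
- $|D_{\alpha i}|$: the number of words in a text body beginning with the $i$-length sequence of letters $\alpha$
- $|D_{\alpha ij}|$: the number of words in $D_{\alpha i}$ with the successor $j$
- The probability that a member of $D_{\alpha i}$ has the successor $j$ is $\frac{|D_{\alpha ij}|}{|D_{\alpha i}|}$
- The entropy of $|D_{\alpha i}|$ is

$$H_{\alpha i} = -\sum_{j=1}^{26} \frac{|D_{\alpha ij}|}{|D_{\alpha i}|} \log_2 \frac{|D_{\alpha ij}|}{|D_{\alpha i}|}$$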
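The entropy can be computed directly from successor counts. The sketch below assumes the standard form, the negative sum over successors j of p_j log2 p_j with p_j = |D_αij| / |D_αi|, and it assumes that words ending exactly at the prefix contribute no successor. The corpus is the one from the READABLE example.

```python
import math
from collections import Counter


def successor_entropy(prefix, corpus):
    """Entropy (in bits) of the distribution of letters that follow prefix."""
    i = len(prefix)
    # Count how many corpus words continue the prefix with each letter.
    followers = Counter(w[i] for w in corpus if w.startswith(prefix) and len(w) > i)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in followers.values())


CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

for prefix in ("R", "RE", "REA", "READ"):
    print(prefix, round(successor_entropy(prefix, CORPUS), 3))
```

Unlike the raw successor variety, the entropy takes frequencies into account: a successor seen in only one word counts for less than one shared by many words.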

Successor Variety approach (cont'd)
Two criteria are used to evaluate the various segmentation methods:
1. the number of correct segment cuts divided by the total number of cuts
2. the number of correct segment cuts divided by the total number of true boundaries
After segmenting, if the first segment occurs in more than 12 words in the corpus, it is probably a prefix.

Successor Variety approach (cont'd)
The successor variety stemming process has three parts:
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem

n-gram stemmers
Association measures are calculated between pairs of terms based on shared unique digrams.

statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti

Dice's coefficient (similarity):

$$S = \frac{2C}{A + B}$$

A and B are the numbers of unique digrams in the first and the second words. C is the number of unique digrams shared by the two words. For the example above, A = 7, B = 8, and C = 6, so S = (2 × 6) / (7 + 8) = 0.80.
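The digram comparison can be sketched as follows, assuming the usual form of Dice's coefficient, S = 2C / (A + B). The function names are illustrative only.

```python
def unique_digrams(word):
    """Set of distinct adjacent letter pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(first, second):
    """Dice's coefficient over the unique digrams of two words."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0  # both words are too short to have any digram
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(dice("statistics", "statistical"))
```

For statistics and statistical the sets have 7 and 8 members with 6 in common, so the score is 12 / 15.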

n-gram stemmers (cont'd)
- Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix.
- Once such a similarity matrix is available, terms are clustered using a single link clustering method (as described in Ch. 16).
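A small sketch of the clustering step, under two assumptions that are not in the slides: similarities are Dice scores over unique digrams, and single link clustering is cut at a fixed threshold (0.6 here), so two clusters merge whenever any pair of their terms is at least that similar. The four terms are a toy vocabulary.

```python
from itertools import combinations


def dice(first, second):
    """Dice's coefficient over the unique digrams of two words."""
    a = {first[i:i + 2] for i in range(len(first) - 1)}
    b = {second[i:i + 2] for i in range(len(second) - 1)}
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def single_link_clusters(terms, threshold):
    """Merge clusters while any cross-cluster pair reaches the threshold."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(dice(x, y) >= threshold for x in c1 for y in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # the cluster list changed, so restart the scan
    return clusters


TERMS = ["statistics", "statistical", "stemmer", "stemming"]
for cluster in single_link_clusters(TERMS, threshold=0.6):
    print(sorted(cluster))
```

Each resulting cluster plays the role of a stem class: all terms in it are treated as variants of one another.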

Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. A simple example (Harman 1991):
- If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
- If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
- If a word ends in "s" but not "us" or "ss", then "s" -> NULL
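The three rules translate almost directly into code. This sketch assumes that the rules are tried in the order listed and that only the first rule whose full condition holds is applied; the function name `s_stem` is made up for illustration.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule from the list above."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for example in ("queries", "horses", "trees", "cats", "glass", "corpus"):
    print(example, "->", s_stem(example))
```

Note that `str.endswith` accepts a tuple of suffixes, which keeps the exception lists compact.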

The Porter algorithm
The Porter algorithm consists of a set of condition/action rules. The conditions fall into three classes:
- Conditions on the stem
- Conditions on the suffix
- Conditions on the rules

Conditions on the stem
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences.

Measure   Examples
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN
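One way to compute the measure is to count how many times a run of vowels is followed by a consonant, which matches Porter's description of a word as [C](VC)^m[V]. The sketch below assumes Porter's convention that y is a vowel when it follows a consonant and a consonant otherwise; the function names are illustrative.

```python
def is_consonant(stem, i):
    """True if stem[i] is a consonant; y counts as a vowel after a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True


def measure(stem):
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


for example in ("TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"):
    print(example, measure(example))
```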

Conditions on the stem (cont'd)
2. *X: the stem ends with a given letter X
3. *v*: the stem contains a vowel
4. *d: the stem ends in a double consonant
5. *o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
Suffix conditions take the form: (current_suffix == pattern)
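The remaining stem conditions can be sketched as small predicates. The helper names are hypothetical, and the treatment of y (a vowel only when it follows a consonant) follows Porter's convention rather than anything stated on the slide.

```python
def is_consonant(stem, i):
    """True if stem[i] is a consonant; y counts as a vowel after a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True


def ends_with(stem, letter):
    """*X: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, final consonant not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("tr"), ends_double_consonant("hopp"), ends_cvc("hop"), ends_cvc("snow"))
```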

Conditions on the rules
The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply.

    {
        step1a(word);
        step1b(stem);
        if (the second or third rule of step 1b was used)
            step1b1(stem);
        step1c(stem);
        step2(stem);
        step3(stem);
        step4(stem);
        step5a(stem);
        step5b(stem);
    }
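To show what "only one rule from a step can apply" means in practice, here is a sketch of a single step. The rule list is step 1a as given in Porter's published algorithm (SSES -> SS, IES -> I, SS -> SS, S -> null); it does not appear on the slide itself, and the remaining steps are omitted.

```python
# Step 1a rules in the order they are examined: (suffix, replacement).
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1a(word):
    """Apply the first rule whose suffix matches, then stop."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word  # no rule matched


for example in ("caresses", "ponies", "caress", "cats"):
    print(example, "->", step1a(example))
```

Because the scan stops at the first match, "caress" is caught by the SS -> SS rule and never reaches the plain S rule, which would otherwise strip a letter too many.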

Experimental evaluations of stemming

Stemming studies: conclusions
- Most of the reported effects of stemming on retrieval performance have been positive.
- Stemming is as effective as manual conflation.
- The effect of stemming depends on the nature of the vocabulary used.
- There appears to be little difference between the retrieval effectiveness of different full stemmers.

Stemming to compress inverted files
- Lennon et al. report the following compression percentages for various stemmers and databases.
- It is obvious that the savings in storage can be substantial.
- Compression rates also increase for affix removal stemmers as the number of suffixes increases.

Summary
- Stemmers are used to conflate terms in order to improve retrieval effectiveness and/or to reduce the size of indexing files.
- Stemming will increase recall at the cost of decreased precision.
- Stemming can have a marked effect on the size of indexing files, sometimes decreasing the file size by as much as 50 percent.