This Class
• How stemming is used in IR
• Stemming algorithms
• Frakes: Chapter 8
• Kowalski: pages 67-76

Stemming algorithms
• Affix removing stemmers
• Dictionary lookup stemmers
• n-gram stemmers
• Successor variety stemmers

Stemming
• Conflation: combining morphological term variants
• Done manually or automatically
• Automatic algorithms are called stemmers

Stemming algorithms
Conflation methods
• Manual
• Automatic
  – Affix removal
      Longest match
      Simple removal
  – Successor variety
  – Dictionary lookup
  – n-grams

Stemming is used to:
• Enhance query formulation (and improve recall) by providing term variants
• Reduce the size of index files by combining term variants into a single index term

Stemming during indexing
• Index terms are stemmed words
• Saves dictionary space
• One inverted index list for all variants
• Saves inverted index file space when position information in the document is not included
• Query terms are also stemmed
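A minimal sketch of the dictionary-space saving, assuming a deliberately crude stand-in stemmer. `toy_stem`, `build_index` and the three sample documents are hypothetical names and data chosen for illustration, not part of the slides or of any real system.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in stemmer: strip the first matching suffix."""
    for suffix in ("ers", "ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, stem=None):
    """Map each index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term = stem(token) if stem else token
            index[term].add(doc_id)
    return index

# Illustrative documents (assumed data).
docs = {1: "system users", 2: "using the system", 3: "a user used it"}

unstemmed = build_index(docs)
stemmed = build_index(docs, stem=toy_stem)

# The query term is stemmed with the same function before lookup.
hits = stemmed[toy_stem("users")]
```

With the stemmed index, the variants user, users, used and using share one posting list, so the dictionary shrinks, at the cost of no longer being able to tell the variants apart.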

Index is not stemmed
• In this case the index contains words
• No compression is achieved
• No information is lost
• Enables wild card searches
• Enables long phrase searches when position information is included

Providing term variants during search
• A stemming algorithm generates term variants
• Term variants are added to the query automatically (query expansion), or
• The user is provided with term variants and decides which ones to include

Example
• A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"

Example (cont.)
Search term: users

    Term     Occurrences
 1. user     15
 2. users    1
 3. used     3
 4. using    2

• User selects variants to include in query

Stemmer correctness
• A stemmer can be incorrect by either
  – under-stemming, or
  – over-stemming
• Over-stemming can reduce precision
• Under-stemming can reduce recall

Over-stemming
• Terms with different meanings are conflated
• "considerate", "consider" and "consideration" should not be stemmed to "con", together with "contra", "contact", etc.

Under-stemming
• Prevents related terms from being conflated
• Under-stemming "consideration" to "considerat" prevents conflating it with "consider"

Evaluating stemmers
• In information retrieval, stemmers are evaluated by their:
  – effect on retrieval, and
  – compression rate,
  – not by linguistic correctness

Evaluating stemmers
• Studies have shown that stemming has a positive effect on retrieval
• The performance of the different algorithms is comparable
• Results vary between test collections

Affix removal stemmers
• Remove
  – suffixes and/or
  – prefixes
  from terms, leaving a stem

Affix removal stemmers
• In English, stemmers are suffix removers
• In other languages, for example Hebrew, both prefixes and suffixes are removed

Affix removal stemmers
• Most affix removal stemmers in use are:
  – iterative: for example, "consideration" is stemmed first to "considerat", then to "consider"
  – longest match stemmers using a set of stemming rules

A simple stemmer
• Harman experimented and concluded that minimal stemming is helpful
• Her simple stemmer changes:
  – Plural to singular
  – Third person to first person

A simple stemmer
• Algorithm changes:
• "skies" to "sky": ies -> y
• "retrieves" to "retrieve": es -> e, and
• "doors" to "door": s -> NULL
• (leaves "corpus" and "wellness" unchanged)
• "…ies" to "…y"

A simple stemmer
1. If a word ends in "ies" but not "eies" or "aies", change the ending to "y"
2. If a word ends in "es" but not "aes", "ees" or "oes", change the ending to "e"
3. If a word ends in "s" but not "us" or "ss", remove the "s"
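The three rules above can be sketched as a single function. This is a minimal sketch, assuming that only the first rule whose condition holds is applied to a word; the function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the first of the three simple-stemmer rules whose condition holds."""
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

examples = {w: s_stem(w) for w in ["skies", "retrieves", "doors", "corpus", "wellness"]}
```

The exceptions in rules 2 and 3 are what keep "corpus" and "wellness" intact.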

The Paice/Husk stemmer
• Uses a table of rules grouped into sections
• One section for each last letter of a suffix (rules for forms ending in a, then b, etc.)
• A form is any word or part of a word considered for stemming

The Paice/Husk stemmer
• Each rule specifies a deletion or a replacement of an ending
• The order of the rules in each section is important
• Rules are tried until one can be applied, and the current form is updated

Rule structure
• Each rule contains 5 parts (2 are optional):
• An ending (one or more characters in reverse order)
• An optional "intact" flag "*", denoting that the form has not yet been stemmed

Rule structure
• A digit (>= 0) specifying the number of characters to remove
• An optional string to append (after removal)
• A final symbol: ">" denotes that stemming should continue, "." terminates the stemming process

Examples of rules
• "sei3y>"
• If the form ends in "ies", replace the last 3 letters by "y" and continue stemming ("…ries" becomes "…ry")

Examples of rules
• "mu*2."
• If the form ends with "um" and the word is intact, remove the last 2 letters and terminate stemming
• "maximum" is stemmed to "maxim", but "presum" from "presumably" remains unchanged

Examples of rules
• "ylp0.": if the word terminates in "ply", terminate. The next rule, "yl2>", therefore does not remove "ly" from "multiply"
• "nois4j>" causes "sion" to be replaced by "j"
• "j" acts as a dummy ending
• "provision" is converted to "provij" and then to "provid"
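The five-part rule notation can be unpacked mechanically. The sketch below assumes lowercase rules with a single-digit removal count; `Rule`, `RULE_RE` and `parse_rule` are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple

class Rule(NamedTuple):
    ending: str        # ending to match, restored to normal reading order
    intact_only: bool  # True if the rule carries the "*" intact flag
    remove: int        # number of characters to remove
    append: str        # string to append after removal (may be empty)
    terminate: bool    # True for ".", False for ">"

# reversed ending, optional "*", one digit, optional append string, ">" or "."
RULE_RE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")

def parse_rule(text):
    """Split a rule such as 'sei3y>' into its five parts."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid rule: {text!r}")
    reversed_ending, star, digit, append, final = match.groups()
    return Rule(reversed_ending[::-1], star == "*", int(digit), append, final == ".")

parsed = [parse_rule(r) for r in ("sei3y>", "mu*2.", "ylp0.", "nois4j>")]
```

Storing the ending reversed means the first character of every rule is the last letter of the form it applies to, which is what allows the table to be split into one section per final letter.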

Acceptability conditions
• A rule is not applied unless the conditions are satisfied
• An attempt to prevent over-stemming
• Without them "rent", "rant", "rice", "rate", "ration" and "river" all reduce to "r"
• There are 2 rules:

Acceptability conditions
• If a form starts with a vowel, at least 2 letters must remain (owed/owing -> ow, but not ear -> e)
• If a form starts with a consonant, at least 3 letters must remain, and at least one must be a vowel or "y" (saying -> say, crying -> cry, but not string -> str, meant -> me, or cement -> ce)

Acceptability conditions
• These rules cause errors in the stemming of some short-rooted words
• (doing, dying, being)
• These could be dealt with separately with a table lookup
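The two acceptability conditions can be written as a small predicate on the stem that a rule would leave behind. This is a sketch; the name `is_acceptable` is hypothetical, and treating only a, e, i, o, u as vowels is an assumption.

```python
VOWELS = set("aeiou")

def is_acceptable(stem):
    """Check the two acceptability conditions on a candidate stem."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Starts with a vowel: at least 2 letters must remain.
        return len(stem) >= 2
    # Starts with a consonant: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(ch in VOWELS or ch == "y" for ch in stem)

accepted = [s for s in ("ow", "say", "cry") if is_acceptable(s)]
rejected = [s for s in ("e", "str", "me", "ce", "do", "dy", "be") if not is_acceptable(s)]
```

The same predicate rejects "do", "dy" and "be", which is why "doing", "dying" and "being" are not reduced to their short roots.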

Example with Paice stemming
• "separately": use the "y" section
• Mismatch with ylb1>, yli3y>, ylp0.
• Match with yl2>; the form becomes "separate"
• Use rule "e1>" in the "e" section
• The form changes to "separat"; use the "t" section
• Mismatch with "tacilp4y.", match with "ta2>"; the form changes to "separ"
• Use the "r" section, match with "ra2.", so the final stem is "sep"
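The walk-through above can be reproduced with a small rule interpreter. This is a sketch under stated assumptions: the table holds only the rules quoted in these slides (the real table is much larger), the acceptability test is applied to the stem a rule would produce, and `build_sections`, `is_acceptable` and `paice_stem` are hypothetical names.

```python
import re

RULE_RE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")

# Only the rules quoted in the slides, in section order (assumed subset).
RULES = ["sei3y>", "mu*2.", "ylb1>", "yli3y>", "ylp0.", "yl2>",
         "e1>", "tacilp4y.", "ta2>", "ra2."]

def build_sections(rule_texts):
    """Group parsed rules by the last letter of the ending they match."""
    sections = {}
    for text in rule_texts:
        rev, star, digit, append, final = RULE_RE.match(text).groups()
        rule = (rev[::-1], star == "*", int(digit), append, final == ".")
        sections.setdefault(rev[0], []).append(rule)
    return sections

def is_acceptable(stem):
    """Acceptability conditions: minimum length and a vowel or 'y'."""
    if not stem:
        return False
    if stem[0] in "aeiou":
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in "aeiouy" for ch in stem)

def paice_stem(word, sections):
    """Return the final stem and the list of intermediate forms."""
    form, intact, trace = word, True, [word]
    while True:
        for ending, intact_only, remove, append, terminate in sections.get(form[-1], []):
            if not form.endswith(ending) or (intact_only and not intact):
                continue
            candidate = form[: len(form) - remove] + append
            if not is_acceptable(candidate):
                continue
            if candidate != form:
                trace.append(candidate)
            form, intact = candidate, False
            if terminate:
                return form, trace
            break  # rule applied: restart in the section for the new last letter
        else:
            return form, trace  # no rule in this section applies

SECTIONS = build_sections(RULES)
stem, trace = paice_stem("separately", SECTIONS)
```

For "separately" the trace passes through "separate", "separat" and "separ" before stopping at "sep", matching the slide. With the same table, "maximum" stops at "maxim" and "multiply" is left unchanged by the protective rule "ylp0.".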

Other examples

n-grams
• Fixed-length consecutive series of "n" characters
• Bigrams:
  – Sea colony -> (se ea co ol lo on ny)
• Trigrams:
  – Sea colony -> (sea col olo lon ony), or
  – -> (#se sea ea# #co col olo lon ony ny#)
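A short sketch of n-gram extraction that reproduces the lists above. The function name `ngrams` and its `pad` flag are hypothetical; n-grams are taken within each word and never across the space.

```python
def ngrams(text, n, pad=False):
    """Return the n-grams of each word in text, in order of appearance."""
    grams = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"  # mark word boundaries, as in the padded trigram example
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

bigrams = ngrams("Sea colony", 2)
trigrams = ngrams("Sea colony", 3)
padded_trigrams = ngrams("Sea colony", 3, pad=True)
```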

Usage of n-grams
• Used in World War II by cryptographers
• Spell checking
• Text compression
• Signature files
• Stemming

n-gram stemmers
• Adamson and Boreham (1974)
• Method for grouping term variants
• Language independent

n-gram stemmers
• Each term is transformed into its n-grams
• A similarity value is computed for every pair of terms in the database, resulting in a similarity matrix

n-gram stemmers
• A clustering method (single link) groups highly similar terms into clusters
• Most matrix elements had the value 0
• They used a cutoff value of 0.6 for their clustering algorithm
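A minimal sketch of the grouping step, assuming the similarity value is the Dice coefficient over unique bigrams (the measure defined on the following slides). Single-link clustering at a fixed cutoff is implemented here as connected components of the graph that links every pair with similarity at or above the cutoff. The function names and the six sample terms are hypothetical.

```python
from itertools import combinations

def bigram_set(term):
    """Set of unique bigrams of a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    """Dice coefficient of two non-empty bigram sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def single_link_clusters(terms, cutoff=0.6):
    """Merge terms that are linked by a chain of pairs with similarity >= cutoff."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative (with path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    grams = {t: bigram_set(t) for t in terms}
    for s, t in combinations(terms, 2):
        if dice(grams[s], grams[t]) >= cutoff:
            parent[find(s)] = find(t)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())

# Illustrative vocabulary (assumed data).
terms = ["statistics", "statistical", "statistician",
         "engineer", "engineering", "engineered"]
clusters = single_link_clusters(terms)
```

Each resulting cluster plays the role of one conflation class, so no language-specific suffix list is needed.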

Dice Coefficient
• Many formulas for computing set similarity
• Dice coefficient: $S = \frac{2\,|A \cap B|}{|A| + |B|}$
  – $0 \le S \le 1$
  – $S = 1$ if $A = B$; $S = 0$ if $A \cap B = \emptyset$

Sets of Unique Bigrams
• Let $A$ and $B$ denote the sets of unique bigrams associated with two terms, and let $C = A \cap B$
• statistics -> (st ta at ti is st ti ic cs)
• Set of unique bigrams for statistics: $A$ = {at cs ic is st ta ti}, $|A| = 7$

n-gram stemmers
• statistical -> (st ta at ti is st ti ic ca al)
• Set of unique bigrams for statistical: $B$ = {al at ca ic is st ta ti}, $|B| = 8$
• $C$ = {at ic is st ta ti}, $|C| = 6$
• $S = \frac{2\,|C|}{|A| + |B|} = \frac{2 \times 6}{7 + 8} = 0.8$
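The worked example can be checked with a few lines of code. This is a sketch; `unique_bigrams` and `dice` are hypothetical helper names.

```python
def unique_bigrams(term):
    """Set of unique bigrams of a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    """Dice coefficient S = 2|A ∩ B| / (|A| + |B|) for non-empty sets."""
    return 2 * len(a & b) / (len(a) + len(b))

A = unique_bigrams("statistics")
B = unique_bigrams("statistical")
C = A & B
S = dice(A, B)
```

Since 0.8 is above the 0.6 cutoff, the two terms would fall into the same cluster.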

Table lookup method
• Ideally, a table is constructed with the stem for every word
• Stemming: look up the word, find its stem
• No such table exists for English
• Systems use a combination of dictionary lookup and conflation rules

Dictionary lookup method
• INQUERY uses Kstem
• Kstem is a morphological analyzer that conflates word variants to a root form

Dictionary lookup method
• Tries to avoid collapsing words with different meanings to the same root
• The original word or a stemmed version is looked up in a dictionary and replaced by the best stem

Successor variety stemmer
• Based on work in structural linguistics (Hafer and Weiss)
• Performed less well than affix removing stemmers
• Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set

Successor variety stemmers
• Terms: {able, axle, accident, ape, about, apply, application, applies}
• The SV of "ap" is 2: "ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies"
• The SV of "a" is 4: "a" is followed in the set by "b", "x", "c" and "p"
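The definition translates directly into code. This is a sketch, assuming the end of a word is not counted as a successor character; `successor_variety` is a hypothetical name.

```python
def successor_variety(prefix, words):
    """Number of distinct characters that follow prefix in the word set."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

WORDS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

sv_a = successor_variety("a", WORDS)
sv_ap = successor_variety("ap", WORDS)
```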

SVs for "apply" and "applies" (* denotes a break point at a peak)

SVs for "application"

Segmenting words
• 4 ways:
  – A cut-off SV is reached
  – SV peaks
  – A substring of a word is equal to another word in the set: "readable" breaks into "read" and "able"
  – Entropy-based method
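A sketch of the peak method only, under two assumptions: a break is placed after a prefix whose SV is strictly greater than the SV of both its neighbours, and the end of a word is not counted as a successor. The SV values it prints may therefore differ from tables built with other conventions. All function names are hypothetical.

```python
def successor_variety(prefix, words):
    """Number of distinct characters that follow prefix in the word set."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

def sv_profile(word, words):
    """SV of every prefix of word, from the first letter to the whole word."""
    return [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]

def segment_at_peaks(word, words):
    """Cut the word after each prefix whose SV is a strict local maximum."""
    sv = sv_profile(word, words)
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

WORDS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

profile = sv_profile("apply", WORDS)
segments = segment_at_peaks("apply", WORDS)
```

On this small word set, "apply" has one peak, after "appl". "applies" has no strict peak because "appl" and "appli" share the same SV, which illustrates how sensitive the method is to the size of the word set.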

Selecting a stem
• The first segment is selected if it occurs in at most 12 words
• Otherwise the second segment is selected (3 segments are unlikely)
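The selection rule can be sketched as follows. Reading "occurs in" as "is a prefix of" is an assumption made here, as are the names `select_stem` and `max_words` and the small word lists.

```python
def select_stem(segments, words, max_words=12):
    """Pick the first segment unless it starts too many words in the set."""
    first = segments[0]
    # Assumption: a segment "occurs in" a word when the word starts with it.
    occurrences = sum(1 for w in words if w.startswith(first))
    if occurrences <= max_words or len(segments) == 1:
        return first
    return segments[1]

WORDS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

stem = select_stem(["appl", "y"], WORDS)
```

The intuition is that a first segment shared by many words is probably a prefix rather than a stem, so the second segment is the better candidate in that case.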

Summary
• All automatic stemmers are sometimes incorrect
• The n-gram method can be used for different languages
• In general, affix removing stemmers are more "correct"
• Longest match stemming does not always generate satisfactory word stems
