LECTURE 6: Natural Language Processing - Practical
Stemming words
Stemming is a technique for removing affixes from a word, leaving only the stem.
Porter Stemming Algorithm
One of the most common stemming algorithms is the Porter Stemming Algorithm, by Martin Porter. It is designed to remove and replace well-known suffixes of English words.
Example:
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'
Lancaster Stemming Algorithm
The LancasterStemmer is used in the same way as the PorterStemmer (create an instance, then call stem()), but it applies a different set of rules, so it can produce different stems for the same word.
Example:
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
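To see where the two stemmers agree and disagree, it can help to run them side by side over the same words. The following is a minimal sketch; the helper name compare_stemmers and the word list are illustrative assumptions, not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def compare_stemmers(words):
    """Return {word: (porter_stem, lancaster_stem)} for each input word.

    compare_stemmers is a hypothetical helper written for this lecture.
    """
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return {w: (porter.stem(w), lancaster.stem(w)) for w in words}


if __name__ == "__main__":
    # The word list is only an example; replace it with your own vocabulary.
    sample = ["cooking", "cookery", "maximum", "running"]
    for word, (p_stem, l_stem) in compare_stemmers(sample).items():
        print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
```

Words for which the two columns differ are the ones where the choice of stemmer matters for your application.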
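RegexpStemmer
You can also construct your own stemmer using the RegexpStemmer. It takes a single regular expression (either compiled or as a string) and removes every part of the word that matches. Unless the pattern is anchored, a match is removed wherever it occurs, whether it is a prefix, a suffix, or in the middle of the word.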
Example:
>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'
A RegexpStemmer should only be used in very specific cases that are not covered by the PorterStemmer or LancasterStemmer.
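The 'ingleside' result shows that an unanchored pattern also strips matches at the start of a word. A possible refinement, sketched here under the assumption that only the suffix should be removed, is to anchor the pattern with $ and to use the constructor's min parameter so that very short words are left alone. The variable name suffix_stemmer and the threshold of 5 are arbitrary choices for illustration.

```python
from nltk.stem import RegexpStemmer

# 'ing$' matches only at the end of the word.
# min=5 returns words shorter than five characters unchanged.
suffix_stemmer = RegexpStemmer(r"ing$", min=5)

print(suffix_stemmer.stem("cooking"))    # cook
print(suffix_stemmer.stem("ingleside"))  # ingleside (no trailing 'ing')
print(suffix_stemmer.stem("king"))       # king (shorter than min)
```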
SnowballStemmer
New in NLTK 2.0b9 is the SnowballStemmer, which supports 13 non-English languages. To use it, create an instance with the name of the language you are using, then call its stem() method. Here is a list of all the supported languages, and an example using the Spanish SnowballStemmer:
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'
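Because the list of supported languages depends on the installed NLTK version (later releases may list more than the 13 shown above), it can be safer to check SnowballStemmer.languages before building a stemmer. The sketch below assumes Python 3, where the stem is returned as a plain str without the u prefix; make_stemmer is a hypothetical helper, not an NLTK function.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language):
    """Return a SnowballStemmer for `language`, or raise a descriptive error.

    make_stemmer is a hypothetical convenience wrapper for this lecture.
    """
    if language not in SnowballStemmer.languages:
        supported = ", ".join(SnowballStemmer.languages)
        raise ValueError(f"{language!r} is not supported; choose one of: {supported}")
    return SnowballStemmer(language)


spanish_stemmer = make_stemmer("spanish")
print(spanish_stemmer.stem("hola"))  # hol
```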
Exercise
Stem these words:
Meaning (use PorterStemmer)
Translation (use RegexpStemmer)