Information Retrieval and Recommendation Techniques: Midterm Report. Advisor: Prof. 黃三益. Students: 黃哲修, 張家豪

1 Stemming Algorithms
Information Retrieval and Recommendation Techniques: Midterm Report. Advisor: Prof. 黃三益. Students: 9142608 黃哲修, 9142609 張家豪. From …, modified by Sumanta

2 The Porter Algorithm
Word = Stem + Affix(es)
E.g., generalizations = general + ization + s
Stemming is the determination of the stem of a given word.
Porter’s stemmer is a rule-based algorithm.
E.g., ational → ate (apply: relational → relate)
Porter’s stemmer is heuristic, in that it is a practical method not guaranteed to be optimal.
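The rewrite rule on this slide can be sketched in a few lines of Python. This is only an illustration of a single unconditional suffix rule; the function name `apply_suffix_rule` is an assumption made for this example, and the conditions that real Porter rules carry are left out here.

```python
def apply_suffix_rule(word: str, s1: str, s2: str) -> str:
    """Replace the suffix s1 with s2 if the word ends with s1; otherwise return the word unchanged."""
    if word.endswith(s1):
        # Cut off the matched suffix and attach the replacement.
        return word[: len(word) - len(s1)] + s2
    return word


print(apply_suffix_rule("relational", "ational", "ate"))  # relate
print(apply_suffix_rule("general", "ational", "ate"))     # general (rule does not match)
```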

3 The Porter Stemmer: Definitions
CONSONANT: a letter other than A, E, I, O, U, and other than Y preceded by a consonant (in TOY, the consonants are T, Y; in SYZYGY they are S, Z, G)
VOWEL: any other letter
With this definition, all words and parts of words are of the form: [C](VC)^m[V]
C = string of one or more consonants (con+); [C] indicates that it is optional
V = string of one or more vowels; [V] indicates that it is optional
E.g., TROUBLES = TR OU BL E S = C V C V C = C(VC)^2
m is the measure of the word
m = 0: TR, EE, TREE
m = 1: TROUBLE, OATS, TREES
m = 2: TROUBLES, PRIVATE, OATEN
Spring 2002 NLE
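As a rough check of these definitions, the measure m can be computed by collapsing a word into alternating C and V runs and counting the VC pairs. The sketch below assumes plain alphabetic input; the names `is_consonant`, `cv_pattern` and `measure` are chosen for this example only.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under the slide's definition (word must be lowercase)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when preceded by a consonant; at the start of a
        # word or after a vowel it counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Collapse a word into runs, e.g. 'troubles' -> 'CVCVC'."""
    word = word.lower()
    runs = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs)


def measure(word: str) -> int:
    """Number of VC pairs in the collapsed form [C](VC)^m[V]."""
    return cv_pattern(word).count("VC")


for w in ["tr", "ee", "tree", "trouble", "oats", "trees", "troubles", "private", "oaten"]:
    print(w, cv_pattern(w), measure(w))
```

Running it over the slide's word lists gives a quick way to confirm which group a word belongs to.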

4 Rule Format
Rules are of the form (condition) S1 → S2, where S1 and S2 are suffixes. Given a set of rules, only the one with the longest matching suffix S1 is applied.
Conditions:
1. m --- the measure of the stem: m = k or m > k, where k is an integer
2. *X --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y (e.g., wil, hop)
Rules are divided into sets, and in each successive step one set of rules is applied.
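One possible way to encode this rule format is a list of (S1, S2, condition) triples, where the condition is a predicate on the stem left after removing S1. The sketch below assumes lowercase input; all names (`cv`, `m_gt`, `apply_rule_set`, `STEP_1A`, ...) are illustrative, and the Step 1a rule set shown is taken from Porter's published algorithm rather than from these slides.

```python
def cv(stem: str) -> str:
    """Per-letter C/V string; 'y' after a consonant counts as a vowel."""
    out = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and out[i - 1] == "C")
        out += "V" if is_vowel else "C"
    return out


def measure(stem: str) -> int:
    return cv(stem).count("VC")


# The five condition types from the slide, as predicates on the stem.
def m_gt(k):                       # m > k
    return lambda stem: measure(stem) > k

def m_eq(k):                       # m = k
    return lambda stem: measure(stem) == k

def ends_with(letter):             # *X
    return lambda stem: stem.endswith(letter)

def contains_vowel(stem):          # *v*
    return "V" in cv(stem)

def ends_double_consonant(stem):   # *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv(stem)[-1] == "C"

def ends_cvc(stem):                # *o
    return cv(stem)[-3:] == "CVC" and stem[-1] not in "wxy"

def always(stem):                  # rule without a condition
    return True


def apply_rule_set(word, rules):
    """Pick the rule with the longest matching S1; apply it only if its condition holds."""
    matching = [rule for rule in rules if word.endswith(rule[0])]
    if not matching:
        return word
    s1, s2, condition = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


STEP_1A = [("sses", "ss", always), ("ies", "i", always), ("ss", "ss", always), ("s", "", always)]
ATIONAL = [("ational", "ate", m_gt(0))]

print(apply_rule_set("caresses", STEP_1A))    # caress
print(apply_rule_set("relational", ATIONAL))  # relate
print(apply_rule_set("rational", ATIONAL))    # rational (stem "r" has m = 0)
```

Note that when the longest matching rule fails its condition, no shorter rule is tried; the word is left unchanged for that step.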

5 Porter Steps
Each step corresponds to a set of rules. The rules in a step are examined in sequence, and only one rule from a step can apply.
{
  step1a(word);
  step1b(stem);
  if (the second or third rule of step 1b was used)
    step1b1(stem);
  step1c(stem);
  step2(stem);
  step3(stem);
  step4(stem);
  step5a(stem);
  step5b(stem);
}
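The control flow above, including the conditional call to step 1b1, might look as follows in Python. This is a partial sketch, not a full stemmer: only steps 1a, 1b and 1b1 are implemented (with rules taken from Porter's published algorithm, since the slides that listed them did not survive), input is assumed to be lowercase, and the remaining steps are passed in through the hypothetical `later_steps` parameter.

```python
def cv(stem):
    """Per-letter C/V string; 'y' after a consonant counts as a vowel."""
    out = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and out[i - 1] == "C")
        out += "V" if is_vowel else "C"
    return out


def measure(stem):
    return cv(stem).count("VC")


def step1a(word):
    # Rules are listed longest suffix first, so the first match is the longest match.
    for s1, s2 in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word


def step1b(word):
    """Return (word, fired); fired is True only if the second or third rule removed a suffix."""
    if word.endswith("eed"):                      # rule 1: (m > 0) EED -> EE
        stem = word[:-3]
        return (stem + "ee" if measure(stem) > 0 else word), False
    for s1 in ("ed", "ing"):                      # rules 2 and 3: (*v*) ED -> , (*v*) ING ->
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            return (stem, True) if "V" in cv(stem) else (word, False)
    return word, False


def step1b1(word):
    for s1, s2 in (("at", "ate"), ("bl", "ble"), ("iz", "ize")):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    pattern = cv(word)
    double = len(word) >= 2 and word[-1] == word[-2] and pattern[-1] == "C"
    if double and word[-1] not in "lsz":          # (*d and not (*L or *S or *Z)) -> single letter
        return word[:-1]
    if measure(word) == 1 and pattern[-3:] == "CVC" and word[-1] not in "wxy":
        return word + "e"                         # (m = 1 and *o) -> E
    return word


def porter_steps(word, later_steps=()):
    """Run steps 1a, 1b and (conditionally) 1b1, then any later steps supplied by the caller."""
    word = step1a(word)
    word, fired = step1b(word)
    if fired:
        word = step1b1(word)
    for step in later_steps:                      # placeholders for steps 1c through 5b
        word = step(word)
    return word


for w in ["computers", "singing", "hopping", "filing"]:
    print(w, "->", porter_steps(w))
```

With no later steps supplied, computers stops at computer; reaching comput would need step 4, which is not part of this sketch.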

7 Step 1b1: applied only if the second or third rule of step 1b was used

12 Examples/Problems
computers → computer (Step 1a) → comput (Step 4)
singing → sing (Step 1b)
generalizations →
information →
instructor →
Try words of your own …

13 Porter’s Mishaps
On-line Porter’s at … gives:
gas (noun) → ga
gases (plural) → gase
gasses (verb, present tense) → gass
gassing (verb, present continuous) → gass
gaseous (adjective) → gaseou
This is not good: all these words should ideally reduce to the same stem.
Trade-off: more rules (accurate but slow) vs. fewer rules (efficient but sometimes wrong).
Google does give different results for gas and gases, so maybe they use these Porter rules :-)
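One simple way to quantify this kind of mishap is to group related words by the stem they receive and count the groups. The sketch below does not run a stemmer; it uses the stems reported on this slide as input data, and the helper name `group_by_stem` is an assumption for this example.

```python
from collections import defaultdict

# Stems as reported on the slide for the five "gas" word forms.
reported = {
    "gas": "ga",
    "gases": "gase",
    "gasses": "gass",
    "gassing": "gass",
    "gaseous": "gaseou",
}


def group_by_stem(word_to_stem):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word, stem in word_to_stem.items():
        groups[stem].append(word)
    return dict(groups)


groups = group_by_stem(reported)
for stem, words in groups.items():
    print(stem, words)
print(len(groups), "distinct stems for", len(reported), "related words")
```

An ideal stemmer would put all five words in one group; here only gasses and gassing share a stem.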

