Learning to Extract Keyphrases from Text Paper by: Peter Turney National Research Council of Canada Technical Report (1999) Presented by: Prerak Sanghvi.

Learning to Extract Keyphrases from Text
Paper by: Peter Turney, National Research Council of Canada, Technical Report (1999)
Presented by: Prerak Sanghvi, Computer Science and Engineering Department, State University of New York at Buffalo

GenEx Algorithm
GenEx is a special-purpose algorithm for extracting keyphrases from text. Its performance was compared with that of C4.5 applied to the same task and found to be superior. Performance is measured by comparing the list of keyphrases extracted by the algorithm with the list of keyphrases suggested by the authors of the document. GenEx consists of two parts: the Extractor algorithm and the Genitor algorithm.

Extractor
1. Find Single Stems
2. Score Single Stems
3. Select Top Stems
4. Find Stem Phrases
5. Score Stem Phrases
6. Expand Single Stems
7. Drop Duplicates
8. Add Suffixes
9. Add Capitals
10. Final Output

Extractor Algorithm
Step 1: Find Single Stems: Make a list of all unique words. Drop stop words (‘and’, ‘or’, ‘if’, ‘he’, ‘she’) and words with fewer than three characters. Stem the remaining words by truncating them at STEM_LENGTH characters.
Step 2: Score Single Stems: For each unique stem, count how often the stem appears in the text and note where it first appears. The score of a stem is the number of times it appears, multiplied by a factor that depends on how early the stem first appears in the document.
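Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give the actual form of the earliness factor, so `EARLY_CUTOFF` and `EARLY_FACTOR` are hypothetical names and values, and the stop list adds ‘the’ and ‘of’ to the examples above.

```python
import re

# Hypothetical settings for illustration; in GenEx such values are tuned by Genitor.
STEM_LENGTH = 5
STOP_WORDS = {"and", "or", "if", "he", "she", "the", "of"}
EARLY_CUTOFF = 0.25  # assumption: "early" means first seen in the first quarter of the text
EARLY_FACTOR = 2.0   # assumption: early stems get their count doubled


def score_single_stems(text):
    """Return {stem: score} for Steps 1 and 2 of the Extractor."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = {}
    first_pos = {}
    for i, word in enumerate(words):
        # Step 1: drop stop words and words with fewer than three characters.
        if word in STOP_WORDS or len(word) < 3:
            continue
        stem = word[:STEM_LENGTH]  # stemming by truncation
        counts[stem] = counts.get(stem, 0) + 1
        first_pos.setdefault(stem, i)  # remember the first occurrence only
    scores = {}
    for stem, n in counts.items():
        # Step 2: frequency multiplied by an earliness factor.
        early = first_pos[stem] < EARLY_CUTOFF * len(words)
        scores[stem] = n * (EARLY_FACTOR if early else 1.0)
    return scores
```

Truncation is a crude stemmer: ‘extractor’ and ‘extraction’ both map to ‘extra’, which is the point, but so would any unrelated word sharing the same first five letters.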

Extractor Algorithm
Step 3: Select Top Single Stems: Rank the stems in order of descending score and make a list of the top NUM_WORKING single stems.
Step 4: Find Stem Phrases: Make a list of all phrases in the input text (excluding stop words). A phrase is a sequence of one, two or three words that appear consecutively in the text. Stem each phrase by truncating each word in the phrase at STEM_LENGTH characters.
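A minimal sketch of Steps 3 and 4 follows. It assumes that a stop word or a punctuation mark ends a phrase; that is one reading of "excluding stop words", not something the slides spell out. The stop list and the parameter values are placeholders.

```python
import re

# Hypothetical settings for illustration.
STEM_LENGTH = 5
NUM_WORKING = 3
STOP_WORDS = {"and", "or", "if", "he", "she", "the", "of"}


def top_stems(scores, num_working=NUM_WORKING):
    """Step 3: the num_working best stems, highest score first (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda stem: (-scores[stem], stem))
    return ranked[:num_working]


def find_stem_phrases(text, max_words=3):
    """Step 4: every run of 1..max_words consecutive non-stop words, stemmed by truncation."""
    # Words become stems; punctuation marks and stop words become None (phrase breaks).
    tokens = re.findall(r"[a-z]+|[^a-z\s]", text.lower())
    stems = [
        tok[:STEM_LENGTH] if "a" <= tok[0] <= "z" and tok not in STOP_WORDS else None
        for tok in tokens
    ]
    phrases = []
    for start in range(len(stems)):
        for n in range(1, max_words + 1):
            window = stems[start:start + n]
            if len(window) < n or None in window:
                break  # ran off the end of the text or hit a phrase break
            phrases.append(" ".join(window))
    return phrases
```

The phrases are returned in order of occurrence and with repeats, so a later step can count them and note where each one first appears.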

Extractor Algorithm
Step 5: Score Stem Phrases: The score of a stem phrase is based on how often the phrase appears in the document. It is also based on two other factors: how early the phrase first appears in the document, and how many words it contains.
Step 6: Expand Single Stems: For each stem in the list of the top NUM_WORKING single stems, find the highest scoring stem phrase. The result is a list of NUM_WORKING stem phrases.
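Steps 5 and 6 might look like the sketch below. The slides name the three ingredients of the phrase score but not how they are combined, so the multiplicative form and the names `EARLY_CUTOFF`, `EARLY_FACTOR` and `LENGTH_FACTOR` are assumptions. The sketch also assumes that "the highest scoring stem phrase" for a stem means the best phrase that contains that stem.

```python
from collections import Counter

# Hypothetical settings for illustration.
EARLY_CUTOFF = 0.25                        # assumption: first quarter of the phrase list is "early"
EARLY_FACTOR = 2.0                         # assumption: boost for early phrases
LENGTH_FACTOR = {1: 1.0, 2: 1.5, 3: 2.0}   # assumption: longer phrases get a boost


def score_stem_phrases(phrases):
    """Step 5: phrases is a list of stem phrases in order of occurrence, with repeats."""
    counts = Counter(phrases)
    first_pos = {}
    for i, phrase in enumerate(phrases):
        first_pos.setdefault(phrase, i)
    scores = {}
    for phrase, n in counts.items():
        early = first_pos[phrase] < EARLY_CUTOFF * len(phrases)
        scores[phrase] = n * (EARLY_FACTOR if early else 1.0) * LENGTH_FACTOR[len(phrase.split())]
    return scores


def expand_single_stems(top_stems, phrase_scores):
    """Step 6: replace each top stem with the best-scoring phrase that contains it."""
    expanded = []
    for stem in top_stems:
        candidates = [p for p in phrase_scores if stem in p.split()]
        # Fall back to the stem itself so the result keeps one entry per top stem.
        best = max(candidates, key=lambda p: (phrase_scores[p], p)) if candidates else stem
        expanded.append(best)
    return expanded
```

Two different top stems can expand to the same phrase (for example ‘genet’ and ‘algor’ both expanding to ‘genet algor’), which is why a duplicate-removal step is needed afterwards.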

Extractor Algorithm
Step 7: Drop Duplicates.
Step 8: Add Suffixes: For each stem phrase, find the most frequent corresponding whole phrase in the input text.
Step 9: Add Capitalization: Not important for our purposes.
Step 10: Final Output: Each output phrase must not be in the list of supplied stop phrases.
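Steps 7, 8 and 10 are sketched below; Step 9 is skipped, as in the slides. The function names are illustrative. As a simplification, the suffix search ignores punctuation and stop-word boundaries and works on lowercased text.

```python
import re
from collections import Counter

STEM_LENGTH = 5  # hypothetical value; must match the value used when stemming


def drop_duplicates(stem_phrases):
    """Step 7: remove repeated stem phrases, keeping the first occurrence and the order."""
    return list(dict.fromkeys(stem_phrases))


def add_suffixes(stem_phrase, text):
    """Step 8: the most frequent whole phrase in text whose truncated words give stem_phrase."""
    words = re.findall(r"[a-z]+", text.lower())
    n = len(stem_phrase.split())
    matches = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if " ".join(w[:STEM_LENGTH] for w in window) == stem_phrase:
            matches[" ".join(window)] += 1
    return matches.most_common(1)[0][0] if matches else None


def final_output(phrases, stop_phrases):
    """Step 10: drop unmatched phrases and anything on the supplied stop-phrase list."""
    return [p for p in phrases if p is not None and p not in stop_phrases]
```

For example, if a text contains ‘genetic algorithms’ twice and ‘genetic algorithm’ once, the stem phrase ‘genet algor’ is written out as ‘genetic algorithms’.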

Genitor
Genitor is a steady-state genetic algorithm, used to tune the parameters of the Extractor. Extractor is tuned on a dataset consisting of documents paired with target lists of keyphrases. The learning process involves adjusting the parameters to maximize the match between the output of Extractor and the target keyphrase lists.
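The steady-state idea can be illustrated with a small generic sketch: each trial breeds one child and replaces the worst member of the population, so the best individual is never lost. This is not Genitor's actual code. The rank-biased parent choice, uniform crossover, Gaussian mutation and the toy fitness function are all assumptions made for illustration; in GenEx the fitness would come from running Extractor on the training documents and comparing its output with the target keyphrases.

```python
import random


def steady_state_ga(fitness, bounds, pop_size=20, trials=500, seed=0):
    """Maximize fitness over real-valued parameter vectors within bounds."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    pop.sort(key=fitness, reverse=True)  # best individual first

    def pick_parent():
        # Rank-biased choice: the smaller of two random ranks favors fitter individuals.
        return pop[min(rng.randrange(pop_size), rng.randrange(pop_size))]

    for _ in range(trials):
        a, b = pick_parent(), pick_parent()
        child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
        k = rng.randrange(len(child))                     # mutate one gene
        lo, hi = bounds[k]
        child[k] = min(hi, max(lo, child[k] + rng.gauss(0, 0.1 * (hi - lo))))
        pop[-1] = child                                   # replace the worst individual
        pop.sort(key=fitness, reverse=True)
    return pop[0]


# Toy stand-in for "match between Extractor output and target keyphrases":
# it pretends the best settings are a stem length of 5 and an earliness factor of 2.
def toy_fitness(params):
    stem_length, early_factor = params
    return -((stem_length - 5.0) ** 2) - (early_factor - 2.0) ** 2


BOUNDS = [(1.0, 10.0), (1.0, 5.0)]  # hypothetical parameter ranges
best = steady_state_ga(toy_fitness, BOUNDS)
```

Because only one individual changes per trial, a steady-state algorithm keeps good parameter settings in the population for as long as nothing better displaces them, unlike a generational algorithm that rebuilds the whole population each round.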
