1 Discussion Class 3 Stemming Algorithms

2 Discussion Classes
Format: Question. Ask a member of the class to answer. Provide opportunity for others to comment.
When answering: Give your name and make sure that the TA hears it. Stand up. Speak clearly so that all the class can hear.

3 Question 1: Conflation methods
(a) Define the terms: stem, suffix, prefix, conflation, morpheme
(b) Define the terms in the following diagram:

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram

4 Question 2: Table look-up (a) What are the advantages and disadvantages of table look-up methods? (b) When would you use table look-up?

5 Question 3: Successor variety methods
Hafer and Weiss defined their technique as: Let α be a word of length n; α_i is a length i prefix of α. Let D be the corpus of words. D_{α_i} is defined as the subset of D containing the terms whose first i letters match α_i exactly. The successor variety of α_i, denoted by S_{α_i}, is then defined as the number of distinct letters that occupy the (i+1)st position of words in D_{α_i}. A test word of length n has n successor varieties S_{α_1}, S_{α_2}, ..., S_{α_n}.
Explain this definition, using the word "computation" as an example.
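The definition can be traced in a few lines of Python. This is a minimal sketch, not the Hafer and Weiss implementation: the function name `successor_varieties` and the small corpus are hypothetical, chosen only for illustration, and a corpus word that is exactly as long as the prefix is assumed to contribute no successor letter.

```python
def successor_varieties(word, corpus):
    """Return [S_1, ..., S_n] for `word` against `corpus` (a list of words)."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters in position i+1 (index i) of corpus words that
        # start with the length-i prefix; words of length i have no such letter.
        successors = {t[i] for t in corpus if len(t) > i and t.startswith(prefix)}
        varieties.append(len(successors))
    return varieties


# Hypothetical corpus, for illustration only.
corpus = ["compute", "computer", "computing", "computation",
          "computational", "comparison", "company", "cat"]

print(successor_varieties("computation", corpus))
# prints [2, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1]
```

With this corpus the largest value falls after the prefix "comput", where three different letters (e, i, a) can follow.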

6 Question 4: Successor variety methods
With successor variety methods, how do the following methods of segmentation work?
(a) cutoff method
(b) peak and plateau method
(c) complete word method
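Two of these segmentation rules can be sketched in Python once a word's successor varieties are known. This is a rough illustration under stated assumptions: the cutoff rule is taken to break after any prefix whose successor variety reaches a chosen threshold, and the peak rule to break after a prefix whose successor variety exceeds that of both neighbouring prefixes. The function names and the list of varieties are hypothetical.

```python
def cutoff_breaks(varieties, cutoff):
    # varieties[k] is the successor variety of the prefix of length k + 1.
    # The final prefix (the whole word) is never a break point.
    return [k + 1 for k, s in enumerate(varieties[:-1]) if s >= cutoff]


def peak_breaks(varieties):
    # Break after a prefix whose variety is strictly greater than the
    # varieties of the prefixes immediately before and after it.
    return [k + 1 for k in range(1, len(varieties) - 1)
            if varieties[k] > varieties[k - 1] and varieties[k] > varieties[k + 1]]


def split_at(word, breaks):
    # Cut `word` into segments at the given prefix lengths.
    segments, start = [], 0
    for b in breaks:
        segments.append(word[start:b])
        start = b
    segments.append(word[start:])
    return segments


# Hypothetical successor varieties for "computation", for illustration only.
word = "computation"
varieties = [2, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1]

print(split_at(word, cutoff_breaks(varieties, 3)))  # ['comput', 'ation']
print(split_at(word, peak_breaks(varieties)))       # ['comp', 'ut', 'ation']
```

The complete word method is not shown; it would need the corpus itself, because it breaks where a segment is a complete corpus word.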

7 Question 5: n-gram methods
(a) Explain the following notation:
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti
(b) Calculate the similarity using Dice's coefficient: S = 2C / (A + B), where A is the number of unique digrams in the first term, B is the number of unique digrams in the second term, and C is the number of shared unique digrams.
(c) How would you use this approach for stemming?
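The digram notation and Dice's coefficient can be checked with a short Python sketch. The function names `digrams` and `dice` are hypothetical; the formula is the one given on the slide, S = 2C / (A + B).

```python
def digrams(term):
    # All adjacent letter pairs, in order, e.g. "stat" -> ["st", "ta", "at"].
    return [term[i:i + 2] for i in range(len(term) - 1)]


def dice(term1, term2):
    a = set(digrams(term1))          # unique digrams of the first term
    b = set(digrams(term2))          # unique digrams of the second term
    shared = a & b                   # unique digrams common to both terms
    return 2 * len(shared) / (len(a) + len(b))


print(digrams("statistics"))         # st ta at ti is st ti ic cs
print(sorted(set(digrams("statistics"))))
print(dice("statistics", "statistical"))
```

For the two terms on the slide, A = 7, B = 8 and C = 6, so S = 12 / 15 = 0.8.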

8 Question 6: Porter's algorithm (a) What is an iterative, longest match stemmer? (b) How is longest match achieved in the Porter algorithm?

9 Question 7: Porter's algorithm

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed, agreed -> agree
(*v*)        ed       null          plastered -> plaster, bled -> bled
(*v*)        ing      null          motoring -> motor, sing -> sing

(a) Explain this table
(b) How does this table apply to: "exceeding", "ringed"?
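The three rules in this table can be sketched in Python. This is a simplified illustration, not a full Porter stemmer: it covers only these rules and omits the cleanup that Porter's algorithm applies after removing "ed" or "ing". All function names are hypothetical. It assumes Porter's usual definitions: m counts vowel-consonant sequences in the stem, *v* means the stem contains a vowel, and "y" counts as a vowel when it follows a consonant.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    # m = number of vowel-to-consonant transitions, i.e. [C](VC)^m[V].
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem); "eed" is listed before "ed"
# so that the longest matching suffix is tried first.
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
]


def apply_rules(word):
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Only the first (longest) matching suffix is considered; if its
            # condition fails, the word is left unchanged.
            return stem + replacement if condition(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", apply_rules(w))
```

Listing the rules longest suffix first is what keeps "feed" intact: "eed" matches, its condition fails because the stem "f" has m = 0, and the shorter "ed" rule is never tried.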

10 Question 8: Evaluation (a) What is the overall effectiveness of stemming? (b) Give a possible reason why Stemmer A might be better than Stemmer B on Collection X but worse on Collection Y.