Classifying Parts of Speech Based on Sparse Data
Katherine Brainard
The Problem
- Sparse data has little contextual information
- Many words fall into this category
- Automatic PoS taggers and finders are useful
Approach
- Relatively easy to learn categories from frequent words
- Infrequent words are often more "regular" than their common counterparts
- Learn the frequent words first, then use them to classify the infrequent ones
- Uses clustering for the frequent words
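The two-stage idea above can be sketched roughly as follows. The slides do not say which features or clustering algorithm were used, so everything here is an assumption for illustration: context features are left/right neighbour counts restricted to frequent words, similarity is cosine, clustering is a simple greedy single pass, and the names `two_stage`, `min_count`, and `threshold` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, feature_words):
    """Count left/right neighbours of every word, keeping only neighbours in feature_words."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if i > 0 and tokens[i - 1] in feature_words:
            vecs[w][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in feature_words:
            vecs[w][("R", tokens[i + 1])] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_frequent(vecs, frequent, threshold):
    """Greedy single-pass clustering (an assumed stand-in, not the author's algorithm):
    each word joins the most similar existing cluster if similarity >= threshold,
    otherwise it starts a new cluster."""
    clusters = []  # each entry: (summed context vector, list of member words)
    for w in frequent:
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(vecs[w], c[0])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append((Counter(vecs[w]), [w]))
        else:
            best[0].update(vecs[w])
            best[1].append(w)
    return clusters


def nearest_cluster(vec, clusters):
    """Index of the cluster whose summed context vector is most similar to vec."""
    return max(range(len(clusters)), key=lambda i: cosine(vec, clusters[i][0]))


def two_stage(tokens, min_count=2, threshold=0.5):
    """Stage 1: cluster frequent words. Stage 2: assign each infrequent word to a cluster."""
    counts = Counter(tokens)
    frequent = [w for w, c in counts.items() if c >= min_count]
    vecs = context_vectors(tokens, set(frequent))
    clusters = cluster_frequent(vecs, frequent, threshold)
    rare = {w: nearest_cluster(vecs[w], clusters)
            for w, c in counts.items() if c < min_count}
    return [words for _, words in clusters], rare


tokens = ("the cat runs . a dog sleeps . the dog runs . "
          "a cat sleeps . the fox runs .").split()
groups, rare = two_stage(tokens)
print(groups)
print(rare)
```

In this toy corpus "fox" occurs only once, yet its single context (after "the", before "runs") is enough to place it with "cat" and "dog", because those categories were already learned from the frequent words.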
Evaluating the Model
- Somewhat tricky: want an evaluation function that doesn't encourage degenerate behavior
- Evaluation is separated from clustering
- Used both a bigram probability model and comparison with already-tagged data
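As an illustration of the bigram half of the evaluation, here is a minimal sketch of one common formulation, a class-based bigram model in which P(w_i | w_{i-1}) is approximated by P(c_i | c_{i-1}) * P(w_i | c_i). The slides do not give the exact model, so this form, the unsmoothed maximum-likelihood estimates, and the function name `class_bigram_log_likelihood` are assumptions.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(tokens, word_to_class):
    """Log-likelihood of tokens[1:] under an assumed class-based bigram model,
    with all probabilities estimated by relative frequency on the same tokens."""
    classes = [word_to_class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    prev_counts = Counter(classes[:-1])            # how often each class is a predecessor
    pair_counts = Counter(zip(classes, classes[1:]))
    ll = 0.0
    for i in range(1, len(tokens)):
        c_prev, c, w = classes[i - 1], classes[i], tokens[i]
        p_trans = pair_counts[(c_prev, c)] / prev_counts[c_prev]  # P(c | c_prev)
        p_emit = word_counts[w] / class_counts[c]                 # P(w | c)
        ll += math.log(p_trans * p_emit)
    return ll


tokens = "the cat runs the dog runs".split()
sensible = {"the": "D", "cat": "N", "dog": "N", "runs": "V"}
one_lump = {w: "X" for w in tokens}
print(class_bigram_log_likelihood(tokens, sensible))
print(class_bigram_log_likelihood(tokens, one_lump))
```

On this toy input the sensible grouping scores higher than the single lump. Note that when scored on the same data used for estimation, a likelihood of this kind never gets worse as classes are split further, so one class per word would score at least as well. That is one example of the degenerate behavior an evaluation function has to guard against, for instance by scoring on held-out text.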
Results
- Improvement of ~36% from delaying processing of data
- About 2.5 times better than lumping all infrequent words into a single class
- Using contextual data alone gave the best performance