POS Tagging and Morphological Analysis
Xiaofei Lu
APLNG 596D
July 10, 2009
Agenda
- Assignment for credit
- POS tagging and the Stanford POS tagger
- Lemmatization and MORPHA
- Partial replication of Biber (2006) Ch. 3
Assignment for credit
- Formulate one research question that involves some sort of corpus analysis
- Conduct a small-scale pilot study based on your research question
- Submit a short description of your study, including your research question, procedure, and results
POS tagging
- What is the task?
  - Input and output format?
- What is it useful for?
  - Linguistic analysis?
  - NLP tasks?
- What are the issues involved?
  - Tagset?
  - Ambiguous words?
  - Unknown words?
- Approaches to POS tagging
  - Supervised and unsupervised (see Lu, 2005)
POS Tagset
- Effect on linguistic analysis
  - Overspecification vs. underspecification
- Effect on tagger accuracy
- Example tagsets
  - Penn Treebank POS tagset
  - BNC tagset
Working with the terminal
- Important: follow the demonstrations carefully so that you don't get lost
- Open a terminal
- mkdir data
- mkdir tools
- Download wsj_0001.txt to your data folder
- Other commands: cd, cp, more, wc
- Paths: read wsj_0001.txt from the tools folder
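The commands on this slide can be tried out safely in a scratch directory. The following is a minimal sketch, assuming a throwaway directory created with mktemp and a one-line stand-in for wsj_0001.txt (the real file is the one downloaded in class).

```shell
# Work in a throwaway directory so nothing in your home folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create the two folders used throughout the session.
mkdir data tools

# Stand-in for the downloaded file (assumption: the real wsj_0001.txt
# is much longer and comes from the course materials).
printf 'Pierre Vinken will join the board .\n' > data/wsj_0001.txt

# cp: keep a backup copy inside the data folder.
cp data/wsj_0001.txt data/wsj_0001.bak

# cd plus a relative path: read the data file from the tools folder.
cd tools
cat ../data/wsj_0001.txt    # `more` would page through a longer file

# wc -w counts whitespace-separated words; tr strips padding some systems add.
word_count=$(wc -w < ../data/wsj_0001.txt | tr -d ' ')
echo "words: $word_count"
```

The path ../data/wsj_0001.txt means "go up one folder, then into data", which is how a file is reached from tools without changing directory.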
Activity
- Download wsj_0001.txt to your data folder
- Tag the file manually using the Penn tagset
- Compare your results with a classmate's and then with the Penn Treebank tagging
Stanford POS Tagger
- Download the basic tagger
- Move it to your tools directory and install it
  - tar -zxf stanford-postagger-2009-9-28.tar.gz
- Read the readme file
- Use the tagger to tag wsj_0001.txt
- Compare the results with the Penn Treebank tagging
- Query the tagged file with AntConc
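Comparing tagger output with a reference tagging can be done token by token on the command line. The sketch below assumes both taggings have already been converted to one word_TAG token per line with identical tokenization; the file names gold.txt and tagger.txt and their contents are hypothetical stand-ins, not real tagger output.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature files: one word_TAG token per line.
printf 'Pierre_NNP\nVinken_NNP\nwill_MD\njoin_VB\nthe_DT\nboard_NN\n._.\n' > gold.txt
printf 'Pierre_NNP\nVinken_NNP\nwill_MD\njoin_VB\nthe_DT\nboard_VB\n._.\n' > tagger.txt

# paste aligns the two files line by line (tab-separated);
# awk counts the lines on which the two tokens are identical.
agreement=$(paste gold.txt tagger.txt |
  awk -F'\t' '{ total++; if ($1 == $2) same++ }
              END { printf "%d/%d", same, total }')
echo "matching tokens: $agreement"

# Print only the disagreements (reference first, tagger second).
mismatches=$(paste gold.txt tagger.txt | awk -F'\t' '$1 != $2')
echo "$mismatches"
```

Line-by-line comparison only works when both files split the text into the same tokens, so tokenization differences should be resolved first.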
Lemmatization
- What is the task?
  - Classifying morphologically related words under one headword
- Why is it useful?
Issues in lemmatization
- Defining what lemmas are
  - Go, went, goes, going?
  - Differ, different, difference?
  - Can as a modal verb, a verb, and a noun?
- Simple stemming is not enough
  - Longer/long vs. better/bett
- Requires POS tagging
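The longer/long vs. better/bett contrast is easy to reproduce. The sketch below uses a deliberately naive suffix stripper (the function name naive_stem and its suffix list are illustrative assumptions, not part of any tool used in this session) to show where blind stemming goes wrong.

```shell
# Naive stemmer: strip one common suffix, with no dictionary and no POS information.
naive_stem() {
  printf '%s\n' "$1" | sed -E 's/(ing|es|er|s)$//'
}

# Regular forms come out fine; "better" is mangled and "went" is never linked to "go".
for w in longer better goes going went; do
  printf '%s -> %s\n' "$w" "$(naive_stem "$w")"
done
```

A lemmatizer avoids these errors by combining rules with exception lists, and it needs the POS tag to decide, for example, whether a form ending in -er is a comparative adjective or something else.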
MORPHA
- Download flex 2.5.4a and MORPHA
- Move them to your tools folder
- Install flex first and then MORPHA
- Copy verbstem.list from the morph folder to your data folder
- Experiment with morpha from your data folder
  - ../tools/morph/morpha < input_file > output_file
- Experiment with the -a, -c, -t options
Analyses in Biber (2006) Ch. 3
- Classroom teaching versus textbooks
  - Number of types at different frequency levels
  - Selected types with very high frequencies
  - Number of types at three frequency levels, by POS
  - Distribution of specialized types in registers, by POS
- Number of word types across academic disciplines
- Distribution of specialized types in disciplines, by POS
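Counting types at different frequency levels amounts to building a frequency list and then binning it. The sketch below assumes a lemmatized corpus stored one lemma per line; the file lemmas.txt, its contents, and the cutoffs are purely illustrative and are not the thresholds used in Biber (2006).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical lemma list, one lemma per line.
printf '%s\n' the the the the of of of study study data > lemmas.txt

# Frequency list: count each type, most frequent first.
sort lemmas.txt | uniq -c | sort -k1,1nr -k2,2 > freq.txt

# Count the types that fall into three bands.
# The cutoffs (>= 4, 2 to 3, 1) are illustrative only.
bands=$(awk '{ if ($1 >= 4) high++; else if ($1 >= 2) mid++; else low++ }
             END { printf "high=%d mid=%d low=%d", high, mid, low }' freq.txt)
echo "$bands"
```

For corpora of different sizes, raw counts would first need to be normalized (for example, to a rate per million words) before the bands are comparable.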
Replicating Biber (2006) Ch. 3
- Tagging and lemmatization
- Frequency lists using AntConc
- Terminal commands: awk and comm
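One way awk and comm can work together is to extract type lists from two tagged files and then separate shared types from those found in only one register. This is a sketch under stated assumptions: the files textbook.txt and classroom.txt, their word_TAG contents, and the one-token-per-line layout are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical word_TAG tokens for two registers, one token per line.
printf '%s\n' the_DT study_NN shows_VBZ the_DT data_NNS > textbook.txt
printf '%s\n' the_DT study_NN okay_UH the_DT guys_NNS > classroom.txt

# awk splits each token at the underscore and prints the lowercased word
# (assumes the words themselves contain no underscore);
# sort -u yields a sorted list of types, which comm requires.
for reg in textbook classroom; do
  awk -F'_' '{ print tolower($1) }' "$reg.txt" | LC_ALL=C sort -u > "$reg.types"
done

# comm prints three columns: only in file 1, only in file 2, in both.
# -12 keeps the shared column; -23 and -13 keep one register's unique types.
shared=$(LC_ALL=C comm -12 textbook.types classroom.types | paste -sd' ' -)
textbook_only=$(LC_ALL=C comm -23 textbook.types classroom.types | paste -sd' ' -)
classroom_only=$(LC_ALL=C comm -13 textbook.types classroom.types | paste -sd' ' -)

echo "shared: $shared"
echo "textbook only: $textbook_only"
echo "classroom only: $classroom_only"
```

Printing $2 instead of $1 in the awk step would give the tags rather than the words, which is a starting point for the by-POS breakdowns.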