1
POS Tagging and Morphological Analysis
Xiaofei Lu APLNG 596D July 10, 2009
2
Agenda
Assignment for credit
POS tagging and the Stanford POS tagger
Lemmatization and MORPHA
Partial replication of Biber (2006) Ch. 3
3
Assignment for credit
Formulate one research question that involves some sort of corpus analysis
Conduct a small-scale pilot study based on your research question
Submit a short description of your study, including your research question, procedure, and results
4
POS tagging
What is the task?
Input and output format?
What is it useful for?
Linguistic analysis? NLP tasks?
What are the issues involved?
Tagset? Ambiguous words? Unknown words?
Approaches to POS tagging
Supervised and unsupervised (see Lu, 2005)
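To make the input/output question concrete, here is a minimal sketch of reading tagged output in the terminal. It assumes a word_TAG format with an underscore separator; the actual separator depends on the tagger and its settings, and the sample sentence is a made-up illustration, not tagger output.

```shell
# Toy tagged line in an assumed word_TAG format (illustrative, not real tagger output).
tagged='The_DT can_NN can_MD hold_VB water_NN ._.'

# Put one token per line, then split each token at its last underscore.
pairs=$(printf '%s\n' "$tagged" | tr ' ' '\n' | sed 's/_\([^_]*\)$/ \1/')
printf '%s\n' "$pairs"

# Collect every tag given to the ambiguous word "can".
can_tags=$(printf '%s\n' "$pairs" | awk '$1 == "can" { print $2 }' | sort -u | paste -sd ' ' -)
echo "tags for can: $can_tags"
```

The same word form receiving two tags (NN and MD) is the ambiguity issue listed above.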
5
POS Tagset
Effect on linguistic analysis
Overspecification vs underspecification
Effect on tagger accuracy
Example tagset
Penn Treebank POS tagset
BNC Tagset
6
Working with the terminal
Important: follow demonstrations carefully so that you don’t get lost
Open a terminal
mkdir data
mkdir tools
Download wsj_0001.txt to your data folder
Other commands: cd, cp, more, wc
Paths: read wsj_0001.txt from the tools folder
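The commands on this slide can be tried out as in the sketch below. It runs in a scratch directory created with mktemp, and the file contents are a two-line stand-in (an assumption for illustration), not the real WSJ text.

```shell
# Work in a scratch directory so nothing in your home folder is touched.
work=$(mktemp -d)
cd "$work"
mkdir data tools

# Stand-in for the downloaded file (two made-up lines, not the real WSJ text).
printf 'first line\nsecond line here\n' > data/wsj_0001.txt
cp data/wsj_0001.txt data/copy.txt

cd tools
# From tools, the file is reached by going up one level (..) and into data.
lines=$(wc -l < ../data/wsj_0001.txt | tr -d ' ')
words=$(wc -w < ../data/wsj_0001.txt | tr -d ' ')
echo "$lines lines, $words words"
```

The relative path ../data/wsj_0001.txt is the point of the "Paths" item: the file is named relative to the folder you are currently in.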
7
Activity
Download wsj_0001.txt to your data folder
Tag the file manually using the Penn tagset
Compare your results with a classmate’s and then with the Penn Treebank tagging
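One way to do the comparison in the terminal is sketched below. It assumes each tagging is saved with one word_TAG pair per line and that both files contain the same tokens in the same order; the file names and the five tokens are hypothetical.

```shell
work=$(mktemp -d)
# Two hypothetical taggings of the same five tokens, one word_TAG pair per line.
printf '%s\n' The_DT can_NN can_MD rust_VB ._. > "$work/mine.txt"
printf '%s\n' The_DT can_NN can_MD rust_NN ._. > "$work/gold.txt"

# paste puts the two files side by side (tab-separated); awk counts identical pairs.
agreement=$(paste "$work/mine.txt" "$work/gold.txt" |
  awk -F '\t' '$1 == $2 { same++ } END { printf "%d/%d", same, NR }')
echo "agreement: $agreement"

# List the tokens on which the two taggings differ.
diffs=$(paste "$work/mine.txt" "$work/gold.txt" | awk -F '\t' '$1 != $2')
printf '%s\n' "$diffs"
```

Looking at the disagreements, rather than only the score, usually shows which tag distinctions are hard to apply consistently.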
8
Stanford POS Tagger
Download the basic tagger
Move it to your tools directory and install it
tar -zxf stanford-postagger tar.gz
Read the readme file
Use the tagger to tag wsj_0001.txt
Compare the results with the Penn Treebank tagging
Query the tagged file with AntConc
9
Lemmatization
What is the task?
Classifying morphologically-related words under one head-word
Why is it useful?
10
Issues in lemmatization
Defining what lemmas are
Go, went, goes, going?
Differ, different, difference?
Can as a modal verb, a verb, and a noun?
Simple stemming not enough
Longer/long vs. better/bett
Requires POS tagging
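The longer/long vs. better/bett contrast can be reproduced with a one-rule stemmer. This is a deliberately naive sketch (strip a final "er"); the function name naive_stem is made up for illustration, and this is not how MORPHA works.

```shell
# Naive stemmer: strip a final "er" (an illustrative rule only).
naive_stem() {
  printf '%s\n' "$1" | sed 's/er$//'
}

longer_stem=$(naive_stem longer)
better_stem=$(naive_stem better)
echo "longer -> $longer_stem"
echo "better -> $better_stem"
```

The rule gives a usable result for "longer" but a non-word for "better", which is why suffix stripping alone is not enough.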
11
MORPHA
Download flex 2.5.4a and MORPHA
Move them to your tools folder
Install flex first and then MORPHA
Copy verbstem.list from the morph folder to your data folder
Experiment with morpha from your data folder
../tools/morph/morpha < input_file > output_file
Experiment with the -a, -c, -t options
12
Analyses in Biber (2006) Ch3
Classroom teaching versus textbooks
Number of types at different frequency levels
Selected types with very high frequencies
Number of types at 3 freq levels, by POS
Distribution of specialized types in registers by POS
Number of word types across academic disciplines
Distribution of specialized types in disciplines by POS
13
Replicating Biber (2006) Ch3
Tagging and lemmatization
Frequency lists using AntConc
Terminal commands: awk and comm
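As a rough sketch of how awk and comm might be used here: the two one-line files below are made-up stand-ins for lemmatized classroom and textbook text, and the file names are hypothetical. The frequency threshold of 2 is also just an example, not one of Biber's levels.

```shell
work=$(mktemp -d)
# Two tiny made-up "registers" (stand-ins for lemmatized text).
printf 'the class read the text\n' > "$work/classroom.txt"
printf 'the text defines the term\n' > "$work/textbook.txt"

# Frequency list: one word per line, count duplicates, sort by count (descending).
tr ' ' '\n' < "$work/classroom.txt" | sort | uniq -c | sort -rn

# awk: number of types at or above a frequency threshold (here 2).
frequent=$(tr ' ' '\n' < "$work/classroom.txt" | sort | uniq -c |
  awk '$1 >= 2 { n++ } END { print n + 0 }')

# Sorted type lists; comm requires sorted input.
tr ' ' '\n' < "$work/classroom.txt" | sort -u > "$work/classroom.types"
tr ' ' '\n' < "$work/textbook.txt" | sort -u > "$work/textbook.types"

# comm -12: types in both files; comm -23: types only in the first file.
shared=$(comm -12 "$work/classroom.types" "$work/textbook.types" | paste -sd ' ' -)
only_classroom=$(comm -23 "$work/classroom.types" "$work/textbook.types" | paste -sd ' ' -)
echo "types with freq >= 2: $frequent"
echo "shared: $shared"
echo "classroom only: $only_classroom"
```

With real data, the same pipeline would be run on the tagged and lemmatized files, with the type lists split by POS where an analysis calls for it.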