POS Tagging and Morphological Analysis

POS Tagging and Morphological Analysis Xiaofei Lu APLNG 596D July 10, 2009

Agenda Assignment for credit POS tagging and the Stanford POS tagger Lemmatization and MORPHA Partial replication of Biber (2006) Ch. 3

Assignment for credit Formulate one research question that involves some sort of corpus analysis Conduct a small-scale pilot study based on your research question Submit a short description of your study, including your research question, procedure, and results

POS tagging What is the task? Input and output format? What is it useful for? Linguistic analysis? NLP tasks? What are the issues involved? Tagset? Ambiguous words? Unknown words? Approaches to POS tagging: supervised and unsupervised (see Lu, 2005)
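
To make the input and output format concrete, here is a minimal sketch. It takes the first sentence of wsj_0001 with its Penn Treebank tags and splits it into one token per line. The word_TAG notation is an assumption for illustration; some tools write word/TAG instead, so check the output of the tagger you actually use.

```shell
# Input to a tagger is plain text; output pairs every token with a tag.
# Assumed notation: word_TAG, tokens separated by spaces.
tagged='Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.'

# Put one token on each line, then split at the LAST underscore,
# so a word that itself contains "_" is still handled.
columns=$(printf '%s\n' "$tagged" | tr ' ' '\n' | awk '{
    i = match($0, /_[^_]*$/)      # position of the last underscore
    print substr($0, 1, i - 1) "\t" substr($0, i + 1)
}')
printf '%s\n' "$columns"

# Count how often each tag occurs, most frequent first.
printf '%s\n' "$columns" | cut -f2 | sort | uniq -c | sort -rn
```

The two-column form (word, tab, tag) is convenient for later steps because cut, sort, and awk can address each column directly.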

POS Tagset Effect on linguistic analysis Effect on tagger accuracy Overspecification vs. underspecification Example tagsets: Penn Treebank POS tagset, BNC tagset

Working with the terminal Important – follow demonstrations carefully so that you don’t get lost Open a terminal mkdir data mkdir tools Download wsj_0001.txt to your data folder Other commands: cd, cp, more, wc Paths: read wsj_0001.txt from the tools folder
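
The commands on this slide can be tried out safely with a sketch like the one below. It works in a temporary scratch directory, and the file content is a placeholder standing in for the real wsj_0001.txt, which you would download instead.

```shell
# Work in a scratch directory so nothing in your home folder is touched.
work=$(mktemp -d)
cd "$work"

# Create the two folders used throughout the session.
mkdir data tools

# Placeholder text standing in for the downloaded file (NOT the real wsj_0001.txt).
printf 'This is a placeholder sentence .\nIt stands in for the real file .\n' > data/wsj_0001.txt

# Move into tools and read the file through a relative path:
# ".." climbs to the parent folder, "data" descends into the data folder.
cd tools
cat ../data/wsj_0001.txt      # use "more" instead to page through a long file
wc ../data/wsj_0001.txt       # prints line, word, and byte counts

# Copy the file into the current folder under a new name and count its words.
cp ../data/wsj_0001.txt copy.txt
words=$(wc -w < copy.txt | tr -d ' ')
echo "word count: $words"
```

The same relative path, ../data/wsj_0001.txt, is what later commands use when a tool is run from one folder on a file kept in another.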

Activity Download wsj_0001.txt to your data folder Tag the file manually using the Penn Treebank tagset Compare your results with a classmate’s and then with the Penn Treebank tagging

Stanford POS Tagger Download the basic tagger Move it to your tools directory and install it: tar -zxf stanford-postagger-2009-9-28.tar.gz Read the readme file Use the tagger to tag wsj_0001.txt Compare the results with the Penn Treebank tagging Query the tagged file with AntConc
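
One way to compare tagger output with the Penn Treebank tagging is a token-by-token check, sketched below. The two files are hypothetical four-token stand-ins in word, tab, tag format, and the sketch assumes both files are tokenized identically so that line N in one corresponds to line N in the other.

```shell
work=$(mktemp -d)

# Hypothetical stand-ins: gold.txt = reference tagging, auto.txt = tagger output.
printf 'the\tDT\nboard\tNN\nwill\tMD\njoin\tVB\n' > "$work/gold.txt"
printf 'the\tDT\nboard\tNN\nwill\tNN\njoin\tVB\n' > "$work/auto.txt"

# paste joins the files side by side: word, gold tag, word, tagger tag.
# First list the tokens where the two tags disagree.
paste "$work/gold.txt" "$work/auto.txt" |
    awk -F'\t' '$2 != $4 { print $1 ": gold=" $2 " tagger=" $4 }'

# Then compute the percentage of tokens with matching tags.
accuracy=$(paste "$work/gold.txt" "$work/auto.txt" |
    awk -F'\t' '$2 == $4 { same++ } END { printf "%.2f", 100 * same / NR }')
echo "token accuracy: $accuracy%"
```

If the tokenizations differ, the lines drift out of step and the figure becomes meaningless, so it is worth checking that both files have the same number of lines first.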

Lemmatization What is the task? Classifying morphologically related words under one head-word Why is it useful?

Issues in lemmatization Defining what lemmas are Go, went, goes, going? Differ, different, difference? Can as a modal verb, verb and a noun? Simple stemming not enough Longer/long vs. better/bett Requires POS tagging
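
The longer/long vs. better/bett contrast can be reproduced with a deliberately naive stemmer. This is an illustrative sketch only: the stem function is a made-up helper that strips a final "er" with no knowledge of POS or irregular forms, and it is not how MORPHA works.

```shell
# Naive stemmer (hypothetical helper): remove a word-final "er".
stem() { printf '%s\n' "$1" | sed 's/er$//'; }

for w in longer better; do
    printf '%s -> %s\n' "$w" "$(stem "$w")"
done
```

The rule gives the right base for "longer" but produces the non-word "bett" for "better", which is why lemmatization needs lexical knowledge and POS information in addition to suffix rules.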

MORPHA Download flex 2.5.4a and MORPHA Move them to your tools folder Install flex first and then MORPHA Copy verbstem.list from the morph folder to your data folder Experiment with morpha from your data folder: ../tools/morph/morpha < input_file > output_file Experiment with the -a, -c, -t options

Analyses in Biber (2006) Ch3 Classroom teaching versus textbooks Number of types at different frequency levels Selected types with very high frequencies Number of types at 3 freq levels, by POS Distribution of specialized types in registers by POS Number of word types across academic disciplines Distribution of specialized types in disciplines by POS
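
Counting the number of types at different frequency levels can be sketched with awk. The frequency list and the two cut-off values below are invented for illustration; they are not the levels used in Biber (2006), and with real data the cut-offs should follow the study being replicated.

```shell
# Hypothetical frequency list: count<TAB>type.
freqlist=$(printf '120\tthe\n45\tof\n7\tcorpus\n3\ttagger\n1\tlemma\n1\tflex\n')

# Assign each type to one of three bands by its count.
# The cut-offs 40 and 3 are illustrative assumptions only.
bands=$(printf '%s\n' "$freqlist" | awk -F'\t' '
    $1 >= 40 { high++; next }
    $1 >= 3  { mid++;  next }
             { low++ }
    END { printf "high=%d mid=%d low=%d\n", high, mid, low }')
echo "$bands"
```

To break the counts down by POS, the same pattern works with a third column holding the tag and the counters indexed by that column.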

Replicating Biber (2006) Ch3 Tagging and lemmatization Frequency lists using AntConc Terminal commands: awk and comm
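
The awk and comm steps might look like the sketch below. The two input files are tiny hypothetical stand-ins for lemmatized classroom and textbook data, one token per line, and the file names are made up for the example.

```shell
# Use one collation order for sort and comm so they agree on ordering.
export LC_ALL=C
work=$(mktemp -d)

# Hypothetical stand-ins for two lemmatized registers, one token per line.
printf 'the\nstudent\nread\nthe\nchapter\n' > "$work/classroom.txt"
printf 'the\nchapter\ndiscuss\nthe\nmodel\n' > "$work/textbook.txt"

# Frequency list with awk: count each type, then sort by descending count.
awk '{ n[$0]++ } END { for (w in n) print n[w] "\t" w }' "$work/classroom.txt" |
    sort -k1,1nr -k2,2 > "$work/classroom.freq"
cat "$work/classroom.freq"

# comm expects sorted input; sort -u also reduces tokens to types.
sort -u "$work/classroom.txt" > "$work/classroom.types"
sort -u "$work/textbook.txt"  > "$work/textbook.types"

# -12 keeps lines found in both files, -23 only in the first, -13 only in the second.
shared=$(comm -12 "$work/classroom.types" "$work/textbook.types")
only_classroom=$(comm -23 "$work/classroom.types" "$work/textbook.types")
only_textbook=$(comm -13 "$work/classroom.types" "$work/textbook.types")
printf 'shared:\n%s\nclassroom only:\n%s\ntextbook only:\n%s\n' \
    "$shared" "$only_classroom" "$only_textbook"
```

Types that appear in only one list are candidates for register-specific vocabulary, which is the kind of comparison the specialized-types analyses rely on.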