Tools for Natural Language Processing Applications
Guruprasad Saikumar & Kham Nguyen
OUTLINE: Natural Language Toolkit; Part-of-Speech Taggers; Parsers; Language Modeling; Back-off N-gram Language Model
Natural Language Toolkit
Created as part of a computational linguistics course in the Dept. of Computer & Information Science, University of Pennsylvania (2001). The Natural Language Toolkit (NLTK) can be used as a teaching tool, an individual study tool, and a platform for prototyping and building research systems. NLTK is organized as a flat hierarchy of packages and modules.
NLTK contents: Python modules, tutorials, problem sets, reference documentation, and technical documentation. Current NLTK modules cover basic operations (tokenization, tree structures, etc.), tagging, parsing, and visualization. Useful link:
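As an illustration of the kind of basic operations NLTK packages, here is a plain-Python sketch of regex tokenization (this is not NLTK's own API, just the underlying idea):

```python
import re

def tokenize(text):
    # Split into word tokens and standalone punctuation marks,
    # roughly what a simple regex-based tokenizer does.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("NLTK is organized as a flat hierarchy of modules.")
print(tokens[:3])  # ['NLTK', 'is', 'organized']
```

NLTK itself wraps operations like this (and much more, e.g. tree structures and taggers) behind reusable modules.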
Part-of-Speech Taggers
Stanford Log-linear Part-of-Speech Tagger
By Kristina Toutanova. A maximum-entropy-based POS tagger with a Java implementation. Two trained tagger models are provided for English, using the Penn Treebank tag set. Link to download software: References: Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong. Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003.
TreeTagger Developed at the Institute for Computational Linguistics of the University of Stuttgart. Based on a modified version of the ID3 decision-tree algorithm: transition probabilities such as p(NN | DET, ADJ) are estimated with a decision tree. Taggers are available for English, German, French, Italian, Spanish, and Greek. Download link:
Brill Tagger Developed by Eric Brill; a transformation-based POS tagger.
The rule-based tagger works by first assigning each word its most likely tag; transformation rules are then learned that use contextual cues to improve tagging accuracy. Download link: References: Eric Brill. 1994. Some Advances in Rule-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
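The two-step process above can be sketched in plain Python. This is a toy illustration, not Brill's learned rule set: the lexicon and the single contextual rule are invented, though the rule shown (NN becomes VB after the TO tag) is a classic example of the kind of transformation the learner discovers.

```python
# Toy "most likely tag" lexicon (invented for illustration).
LEXICON = {"to": "TO", "race": "NN", "the": "DT", "horse": "NN"}

# Each rule: (from_tag, to_tag, condition on the tag context).
RULES = [
    ("NN", "VB", lambda tags, i: i > 0 and tags[i - 1] == "TO"),
]

def brill_tag(words):
    # Step 1: assign each word its most likely tag.
    tags = [LEXICON.get(w, "NN") for w in words]
    # Step 2: apply learned contextual transformations in order.
    for from_tag, to_tag, cond in RULES:
        for i, t in enumerate(tags):
            if t == from_tag and cond(tags, i):
                tags[i] = to_tag
    return tags

print(brill_tag(["to", "race"]))  # ['TO', 'VB']
```

The real tagger learns its rule list greedily, at each step picking the transformation that most reduces tagging errors on the training corpus.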
TnT (Trigrams'n'Tags): a statistical POS tagger
Developed by Thorsten Brants, Saarland University. Implements the Viterbi algorithm for second-order Markov models. Language models are available for German and English, and the tagger can be trained for new languages. Useful link:
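To make the decoding step concrete, here is a first-order Viterbi sketch in plain Python (TnT itself is second-order, conditioning on the two previous tags; all probabilities below are toy numbers invented for illustration):

```python
import math

def viterbi(words, tags, trans, emit, start):
    # Dynamic-programming decoder for a first-order HMM tagger.
    # trans[(t_prev, t)] and emit[(t, w)] are log-probabilities;
    # a second-order (trigram) model as in TnT would condition on
    # the two previous tags instead of one.
    V = [{t: start.get(t, -math.inf) + emit.get((t, words[0]), -math.inf)
          for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + trans.get((p, t), -math.inf))
            col[t] = (V[-1][prev] + trans.get((prev, t), -math.inf)
                      + emit.get((t, w), -math.inf))
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the most probable final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model parameters (invented for illustration).
lp = math.log
tags = ["DET", "NN"]
start = {"DET": lp(0.9), "NN": lp(0.1)}
trans = {("DET", "NN"): lp(0.9), ("DET", "DET"): lp(0.1),
         ("NN", "NN"): lp(0.5), ("NN", "DET"): lp(0.5)}
emit = {("DET", "the"): lp(0.9), ("NN", "dog"): lp(0.9)}
print(viterbi(["the", "dog"], tags, trans, emit, start))  # ['DET', 'NN']
```

Working in log space avoids numerical underflow when sentences are long, which is why practical taggers like TnT sum log-probabilities rather than multiplying raw ones.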
Parsers
Stanford Parser Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning. A Java implementation of a probabilistic natural language parser. Online parser link: Download link: Reference: Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
Language Modeling
SRI Language Modeling Toolkit
SRILM was mainly developed by Andreas Stolcke at the Speech Technology and Research Laboratory, SRI International, CA. SRILM is a collection of C++ libraries, executable programs, and helper scripts. Its main application is statistical language modeling for speech recognition; the LMs are based on N-gram statistics. Reference: Andreas Stolcke. 2002. "SRILM - An Extensible Language Modeling Toolkit". In Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, September 2002. Download link:
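A minimal usage sketch, assuming SRILM is installed and its tools are on the PATH (the file names are placeholders): `ngram-count` estimates a back-off model from counts, and `ngram` evaluates it on held-out text.

```shell
# Count trigrams in a tokenized training file and estimate a
# back-off LM with Kneser-Ney discounting, written in ARPA format.
ngram-count -order 3 -text train.txt -lm model.arpa -kndiscount -interpolate

# Score held-out text with the model (reports log-probability
# and perplexity).
ngram -order 3 -lm model.arpa -ppl test.txt
```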
Text Analysis and Summarization Tools
System Quirk: a text-analysis tool; generates word lists, indexing, and tracking. Download link: MEAD: a summarization and evaluation tool. Some of its features: multiple-document summarization, query-based summarization, and various evaluation methods.
Back-off N-gram Language Model
Kham Nguyen Spring 2006 Northeastern University
Outline: Statistical language modeling; What is an N-gram language model?; Back-off N-grams
What is “statistical language modeling”?
A statistical language model (LM) provides a mechanism for computing P(<s>, w1, w2, …, wn, </s>). It is used in speech recognition, and also in machine translation, language identification, etc.
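By the chain rule, this joint probability factors into per-word conditional probabilities; an N-gram model then approximates each conditional by truncating the history to the previous N-1 words:

```latex
P(w_1, \dots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```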
N-gram Language Model The simplest estimate of an N-gram probability is just the relative frequency of the N-gram in the training data.
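As a sketch (toy corpus, bigram case), the maximum-likelihood estimate is just a ratio of counts, P(w | v) = C(v, w) / C(v):

```python
from collections import Counter

def mle_bigram(corpus):
    # Relative-frequency (maximum-likelihood) bigram estimate:
    # P(w | v) = C(v, w) / C(v).
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return lambda v, w: bigrams[(v, w)] / unigrams[v]

corpus = "<s> the dog barks </s> <s> the cat sleeps </s>".split()
p = mle_bigram(corpus)
print(p("<s>", "the"))  # 1.0: "the" always follows "<s>" here
print(p("the", "dog"))  # 0.5
```

The weakness is immediately visible: any bigram not seen in training gets probability zero, which is exactly what back-off is designed to fix.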
Back-off N-gram Language Model
Back-off handles unseen N-grams by "backing off" to lower-order N-grams. For simplicity, abbreviate the history (w1, w2, …, wn-1) as h, and the truncated history (w2, …, wn-1) as h'. n=1: unigram; n=2: bigram; n=3: trigram.
Back-off weight A back-off N-gram assigns each unseen N-gram a weighted lower-order probability, and the probability axiom requires that the resulting distribution over wi sums to one for every history h. Define two disjoint sets of wi: ¬BO(wi|h), the set of all wi for which the N-gram (h, wi) is seen in the training data, and BO(wi|h), the set of all wi for which (h, wi) is unseen.
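In the standard Katz-style formulation (where P* denotes a discounted estimate and α(h) the back-off weight), the back-off N-gram and the normalization requirement can be written as:

```latex
P_{BO}(w_i \mid h) =
  \begin{cases}
    P^{*}(w_i \mid h), & w_i \in \overline{BO}(w_i \mid h) \ \text{(seen)} \\
    \alpha(h)\, P_{BO}(w_i \mid h'), & w_i \in BO(w_i \mid h) \ \text{(unseen)}
  \end{cases}
\qquad
\sum_{w_i} P_{BO}(w_i \mid h) = 1
```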
Back-off weight (cont.)
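Solving the normalization constraint for the back-off weight gives the standard closed form (using the seen set ¬BO(wi|h) from the previous slide; P* is the discounted estimate):

```latex
\alpha(h) \;=\;
  \frac{1 - \sum_{w_i \in \overline{BO}(w_i \mid h)} P^{*}(w_i \mid h)}
       {1 - \sum_{w_i \in \overline{BO}(w_i \mid h)} P(w_i \mid h')}
```

Intuitively, the numerator is the probability mass left over after discounting the seen N-grams, and the denominator rescales the lower-order distribution so that mass is spread only over the unseen words.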
Perplexity The "quality" of an LM is typically measured by its perplexity, or "branching factor". Roughly, the perplexity is the average number of words that can follow a given history.
N-gram LM for Speech Recognition
The language model is one of the knowledge sources used in automatic speech recognition (ASR). Almost all state-of-the-art ASR systems use a back-off N-gram LM, with N typically 3 (a trigram).