Tools for Natural Language Processing Applications
Guruprasad Saikumar & Kham Nguyen
OUTLINE: Natural Language Toolkit; Part-of-Speech Taggers; Parsers; Language Modeling; Back-off N-gram Language Model
Natural Language Toolkit
Created as part of a computational linguistics course in the Dept. of Computer & Information Science, University of Pennsylvania (2001). The Natural Language Toolkit (NLTK) can be used as a teaching tool, an individual study tool, and a platform for prototyping and building research systems. NLTK is organized as a flat hierarchy of packages and modules.
NLTK contents: Python modules, tutorials, problem sets, reference documentation, and technical documentation. Current NLTK modules cover basic operations (tokenization, tree structures, etc.), tagging, parsing, and visualization. Useful link:
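As an illustration of the kind of basic operations NLTK packages, here is a plain-Python sketch of regex tokenization (this is not NLTK's own API, just the underlying idea):

```python
import re

def tokenize(text):
    # Split into word tokens and standalone punctuation marks,
    # roughly what a simple regex-based tokenizer does.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("NLTK is organized as a flat hierarchy of modules.")
print(tokens[:3])  # ['NLTK', 'is', 'organized']
```

NLTK itself wraps operations like this (and much more, e.g. tree structures and taggers) behind reusable modules.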
Part-of-Speech Taggers
Stanford Log-linear Part-of-Speech Tagger
By Kristina Toutanova. A maximum-entropy-based POS tagger with a Java implementation. Two trained tagger models are provided for English, using the Penn Treebank tag set. Link to download software: References: Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong. Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003.
TreeTagger Developed at the Institute for Computational Linguistics of the University of Stuttgart. Based on a modified version of the ID3 decision-tree algorithm: transition probabilities such as p(NN | DET, ADJ) are estimated with a decision tree. Taggers are available for English, German, French, Italian, Spanish, and Greek. Download link:
Brill Tagger Developed by Eric Brill; a transformation-based POS tagger.
The rule-based tagger works by first assigning each word its most likely tag; transformation rules are then learned that use contextual cues to improve tagging accuracy. Download link: References: Eric Brill. 1994. Some Advances in Rule-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
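The two-step process above can be sketched in plain Python. This is a toy illustration, not Brill's learned rule set: the lexicon and the single contextual rule are invented, though the rule shown (NN becomes VB after the TO tag) is a classic example of the kind of transformation the learner discovers.

```python
# Toy "most likely tag" lexicon (invented for illustration).
LEXICON = {"to": "TO", "race": "NN", "the": "DT", "horse": "NN"}

# Each rule: (from_tag, to_tag, condition on the tag context).
RULES = [
    ("NN", "VB", lambda tags, i: i > 0 and tags[i - 1] == "TO"),
]

def brill_tag(words):
    # Step 1: assign each word its most likely tag.
    tags = [LEXICON.get(w, "NN") for w in words]
    # Step 2: apply learned contextual transformations in order.
    for from_tag, to_tag, cond in RULES:
        for i, t in enumerate(tags):
            if t == from_tag and cond(tags, i):
                tags[i] = to_tag
    return tags

print(brill_tag(["to", "race"]))  # ['TO', 'VB']
```

The real tagger learns its rule list greedily, at each step picking the transformation that most reduces tagging errors on the training corpus.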
TnT (Trigrams'n'Tags): a statistical POS tagger
Developed by Thorsten Brants, Saarland University. Implements the Viterbi algorithm for second-order Markov models. Language models are available for German and English, and the tagger can be trained for new languages. Useful link:
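To make the decoding step concrete, here is a first-order Viterbi sketch in plain Python (TnT itself is second-order, conditioning on the two previous tags; all probabilities below are toy numbers invented for illustration):

```python
import math

def viterbi(words, tags, trans, emit, start):
    # Dynamic-programming decoder for a first-order HMM tagger.
    # trans[(t_prev, t)] and emit[(t, w)] are log-probabilities;
    # a second-order (trigram) model as in TnT would condition on
    # the two previous tags instead of one.
    V = [{t: start.get(t, -math.inf) + emit.get((t, words[0]), -math.inf)
          for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + trans.get((p, t), -math.inf))
            col[t] = (V[-1][prev] + trans.get((prev, t), -math.inf)
                      + emit.get((t, w), -math.inf))
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the most probable final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model parameters (invented for illustration).
lp = math.log
tags = ["DET", "NN"]
start = {"DET": lp(0.9), "NN": lp(0.1)}
trans = {("DET", "NN"): lp(0.9), ("DET", "DET"): lp(0.1),
         ("NN", "NN"): lp(0.5), ("NN", "DET"): lp(0.5)}
emit = {("DET", "the"): lp(0.9), ("NN", "dog"): lp(0.9)}
print(viterbi(["the", "dog"], tags, trans, emit, start))  # ['DET', 'NN']
```

Working in log space avoids numerical underflow when sentences are long, which is why practical taggers like TnT sum log-probabilities rather than multiplying raw ones.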
Parsers
Stanford Parser Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning. A Java implementation of a probabilistic natural language parser. Online parser link: Download link: Reference: Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
Language Modeling
SRI Language Modeling Toolkit
SRILM was mainly developed by Andreas Stolcke at the Speech Technology and Research Laboratory, SRI International, CA. SRILM is a collection of C++ libraries, executable programs, and helper scripts. Its main application is statistical language modeling for speech recognition; the LMs are based on N-gram statistics. Reference: Andreas Stolcke. 2002. "SRILM - An Extensible Language Modeling Toolkit". In Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, September 2002. Download link:
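A minimal usage sketch, assuming SRILM is installed and its tools are on the PATH (the file names are placeholders): `ngram-count` estimates a back-off model from counts, and `ngram` evaluates it on held-out text.

```shell
# Count trigrams in a tokenized training file and estimate a
# back-off LM with Kneser-Ney discounting, written in ARPA format.
ngram-count -order 3 -text train.txt -lm model.arpa -kndiscount -interpolate

# Score held-out text with the model (reports log-probability
# and perplexity).
ngram -order 3 -lm model.arpa -ppl test.txt
```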
Text Analysis and Summarization Tools
System Quirk: a text-analysis tool; generates word lists, indexing, and tracking. Download link: MEAD: a summarization and evaluation tool. Some of its features: multiple-document summarization, query-based summarization, and various evaluation methods.
Back-off N-gram Language Model
Kham Nguyen Spring 2006 Northeastern University
Outline: Statistical language modeling; What is an N-gram language model?; Back-off N-grams
What is “statistical language modeling”?
A statistical language model (LM) provides a mechanism for computing P(<s>, w1, w2, …, wn, </s>). It is used in speech recognition, and also in machine translation, language identification, etc.
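By the chain rule, this joint probability factors into per-word conditional probabilities; an N-gram model then approximates each conditional by truncating the history to the previous N-1 words:

```latex
P(w_1, \dots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```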
N-gram Language Model The simplest estimate of an N-gram probability is just the relative frequency of the N-gram in the training data.
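As a sketch (toy corpus, bigram case), the maximum-likelihood estimate is just a ratio of counts, P(w | v) = C(v, w) / C(v):

```python
from collections import Counter

def mle_bigram(corpus):
    # Relative-frequency (maximum-likelihood) bigram estimate:
    # P(w | v) = C(v, w) / C(v).
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return lambda v, w: bigrams[(v, w)] / unigrams[v]

corpus = "<s> the dog barks </s> <s> the cat sleeps </s>".split()
p = mle_bigram(corpus)
print(p("<s>", "the"))  # 1.0: "the" always follows "<s>" here
print(p("the", "dog"))  # 0.5
```

The weakness is immediately visible: any bigram not seen in training gets probability zero, which is exactly what back-off is designed to fix.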
Back-off N-gram Language Model
Back-off handles unseen N-grams by "backing off" to lower-order N-grams. For simplicity, abbreviate the history (w1, w2, …, wn-1) as h, and the truncated history (w2, …, wn-1) as h'. n=1: unigram; n=2: bigram; n=3: trigram.
Back-off weight A back-off N-gram assigns each unseen N-gram a weighted lower-order probability, and the probability axiom requires that the resulting distribution over wi sums to one for every history h. Define two disjoint sets of wi: ¬BO(wi|h), the set of all wi for which the N-gram (h, wi) is seen in the training data, and BO(wi|h), the set of all wi for which (h, wi) is unseen.
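In the standard Katz-style formulation (where P* denotes a discounted estimate and α(h) the back-off weight), the back-off N-gram and the normalization requirement can be written as:

```latex
P_{BO}(w_i \mid h) =
  \begin{cases}
    P^{*}(w_i \mid h), & w_i \in \overline{BO}(w_i \mid h) \ \text{(seen)} \\
    \alpha(h)\, P_{BO}(w_i \mid h'), & w_i \in BO(w_i \mid h) \ \text{(unseen)}
  \end{cases}
\qquad
\sum_{w_i} P_{BO}(w_i \mid h) = 1
```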
Back-off weight (cont.)
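Solving the normalization constraint for the back-off weight gives the standard closed form (using the seen set ¬BO(wi|h) from the previous slide; P* is the discounted estimate):

```latex
\alpha(h) \;=\;
  \frac{1 - \sum_{w_i \in \overline{BO}(w_i \mid h)} P^{*}(w_i \mid h)}
       {1 - \sum_{w_i \in \overline{BO}(w_i \mid h)} P(w_i \mid h')}
```

Intuitively, the numerator is the probability mass left over after discounting the seen N-grams, and the denominator rescales the lower-order distribution so that mass is spread only over the unseen words.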
Perplexity The "quality" of an LM is typically measured by its perplexity, or "branching factor". Roughly, the perplexity is the average number of words that can follow a given history.
N-gram LM for Speech Recognition
The language model is one of the knowledge sources used in automatic speech recognition (ASR). Almost all state-of-the-art ASR systems use a back-off N-gram LM, with N typically 3 (a trigram).