Tools for Natural Language Processing Applications


1 Tools for Natural Language Processing Applications
Guruprasad Saikumar & Kham Nguyen

2 Outline
Natural Language Toolkit; Part-of-Speech Taggers; Parsers; Language Modeling; Back-off N-gram Language Model

3 Natural Language Toolkit
Created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science, University of Pennsylvania. The Natural Language Toolkit (NLTK) can be used as a teaching tool, an individual study tool, and a platform for prototyping and building research systems. NLTK is organized as a flat hierarchy of packages and modules.

4 NLTK contents
NLTK contents: Python modules, tutorials, problem sets, reference documentation, technical documentation.
Current NLTK modules: basic operations such as tokenization and tree structures; tagging; parsing; visualization.
Useful link:

5 Part-of-Speech Taggers

6 Stanford Log-linear Part-of-speech Tagger
By Kristina Toutanova. Maximum entropy based POS tagger. Java implementation. Two trained tagger models for English, using the Penn Treebank tag set. Link to download software:
References:
Kristina Toutanova and Christopher D. Manning. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages

7 TreeTagger
Developed at the Institute for Computational Linguistics of the University of Stuttgart. Based on a modified version of the ID3 decision tree algorithm, which is used to estimate tag probabilities such as p(NN|DET,ADJ). Taggers are available for English, German, French, Italian, Spanish, and Greek. Download link:

8 Brill Tagger
Developed by Eric Brill. Transformation-based POS tagger. The rule-based part-of-speech tagger works by first assigning each word its most likely tag. Rules are then learned that use contextual cues to improve tagging accuracy. Download link:
Reference: Eric Brill. Some advances in rule-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA, 1994.
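The two stages described on this slide can be sketched in a few lines of Python. This is only an illustration, not Brill's implementation: the toy training data, the function names, and the single hand-written rule ("change NN to VB when the previous tag is TO") are assumptions made for the example, and the rule-learning step itself is not shown.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Stage 1: map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def apply_rules(tags, rules):
    """Stage 2: apply each contextual rule (from_tag, to_tag, prev_tag) in order."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def tag(words, lexicon, rules, default="NN"):
    """Tag words with their most likely tag, then correct them with the rules."""
    initial = [lexicon.get(word, default) for word in words]
    return list(zip(words, apply_rules(initial, rules)))

# Hypothetical toy training data: "race" is a noun twice and a verb once.
TRAIN = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("a", "DT"), ("race", "NN"), ("began", "VBD")],
    [("they", "PRP"), ("race", "VBP")],
    [("to", "TO"), ("win", "VB")],
]
# Hand-written here; a real transformation-based tagger learns such rules.
RULES = [("NN", "VB", "TO")]

lexicon = train_baseline(TRAIN)
result = tag(["to", "race"], lexicon, RULES)
print(result)
```

The baseline alone would tag "race" as NN; the contextual rule corrects it to VB after "to".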

9 TnT (Trigrams'n'Tags)
A statistical POS tagger developed by Thorsten Brants, Saarland University. Implementation of the Viterbi algorithm for second-order Markov models. Language models are available for German and English, and the tagger can be adapted to new languages. Useful link:

10 Parsers

11 Stanford Parser
Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Java implementation of a probabilistic natural language parser. Online parser link: Download link:
Reference: Dan Klein and Christopher D. Manning. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.

12 Language Modeling

13 SRI Language Modeling Toolkit
SRILM was mainly developed by Andreas Stolcke, Speech Technology and Research Laboratory, SRI International, CA. SRILM is a collection of C++ libraries, executable programs, and helper scripts. Its main application is statistical language modeling for speech recognition, with LMs based on n-gram statistics.
Reference: Andreas Stolcke. "SRILM - An Extensible Language Modeling Toolkit". In Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, September 2002.
Download link:

14 Text Analysis and Summarization Tools
System Quirk: text analysis; generates word lists; indexing; Tracker. Download link:
MEAD: summarization and evaluation tool. Some features of the tool are multiple-document summarization, query-based summarization, and various evaluation methods.

15 Back-off N-gram Language Model
Kham Nguyen Spring 2006 Northeastern University

16 Outline
Statistical language modeling; What is an N-gram language model; Back-off N-gram

17 What is “statistical language modeling”?
A statistical language model (LM) provides a mechanism for computing the probability of a word sequence: P(<s>, w1, w2, …, wn, </s>), where <s> and </s> mark the start and end of the sentence. LMs are used in speech recognition, and also in machine translation, language identification, etc.
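As a brief aside not spelled out on the slide, the joint probability is usually decomposed with the chain rule, one conditional probability per word. The sketch below assumes P(<s>) = 1 and treats the end marker as word n+1; the indexing is chosen for illustration.

```latex
P(\langle s\rangle, w_1, \ldots, w_n, \langle/s\rangle)
  = \prod_{i=1}^{n+1} P(w_i \mid \langle s\rangle, w_1, \ldots, w_{i-1}),
  \qquad w_{n+1} = \langle/s\rangle
```

Each factor conditions on the full preceding history, which is what an N-gram model later truncates to the most recent N-1 words.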

18 N-gram Language Model
The simplest form of an N-gram probability is just the relative frequency of the N-gram: P(wn | w1, ..., wn-1) = C(w1, ..., wn) / C(w1, ..., wn-1), where C(·) is a count in the training data.
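A minimal sketch of the relative-frequency estimate, assuming a tiny hypothetical corpus and padding each sentence with `<s>` and `</s>`; the function names are illustrative and not taken from any toolkit mentioned in these slides.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-word histories, with sentence padding."""
    ngrams, histories = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return ngrams, histories

def relative_frequency(word, history, ngrams, histories):
    """P(word | history) = C(history, word) / C(history); 0.0 if history is unseen."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0
    return ngrams[history + (word,)] / histories[history]

# Hypothetical toy corpus.
CORPUS = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
ngrams, histories = ngram_counts(CORPUS, 2)

p_tea = relative_frequency("tea", ["like"], ngrams, histories)    # 2 of 3 "like" bigrams
p_milk = relative_frequency("milk", ["like"], ngrams, histories)  # never seen
print(p_tea, p_milk)
```

The unseen bigram "like milk" receives probability zero, which is the problem that back-off addresses.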

19 Back-off N-gram Language Model
Back-off is necessary to handle unseen N-grams: the model "backs off" to lower-order N-grams. For simplicity, abbreviate the history (w1, w2, ..., wn-1) as h, and (w2, ..., wn-1) as h'. n=1: unigram; n=2: bigram; n=3: trigram.

20 Back-off weight
A back-off N-gram is: The probability axiom requires: Define 2 disjoint sets of wi's: ¬BO(wi|h), the set of all wi for which wi|h was seen in the training data, and BO(wi|h), the set of all wi for which wi|h was unseen in the training data.
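The formulas for this slide did not survive in the transcript. The following is the standard textbook form of a back-off N-gram and its normalization constraint, offered as a likely reconstruction rather than the authors' exact notation. Here P* denotes a discounted estimate for seen N-grams, α(h) is the back-off weight, and the seen set is written with an overline; all three symbols are assumptions.

```latex
P(w_i \mid h) =
\begin{cases}
  P^{*}(w_i \mid h) & \text{if } w_i \in \overline{BO}(w_i \mid h) \\[4pt]
  \alpha(h)\, P(w_i \mid h') & \text{if } w_i \in BO(w_i \mid h)
\end{cases}

\sum_{w_i} P(w_i \mid h) = 1
\;\Longrightarrow\;
\sum_{w_i \in \overline{BO}} P^{*}(w_i \mid h)
  + \alpha(h) \sum_{w_i \in BO} P(w_i \mid h') = 1

\alpha(h) =
\frac{1 - \sum_{w_i \in \overline{BO}} P^{*}(w_i \mid h)}
     {\sum_{w_i \in BO} P(w_i \mid h')}
```

Because the two sets are disjoint and cover the vocabulary, the denominator can equivalently be written as one minus the lower-order probability mass of the seen words.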

21 Back-off weight (cont.)
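The content of this slide is missing from the transcript. As a hedged illustration of how a back-off weight is typically computed, the sketch below builds a bigram model that backs off to unigrams. Absolute discounting with a fixed discount of 0.5, the unsmoothed unigram distribution, the toy corpus, and all function names are assumptions for the example, not the authors' method.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.5  # assumed absolute discount subtracted from every seen bigram count

def train(sentences):
    """Collect unigram counts and, for each history word, counts of its followers."""
    unigrams = Counter()
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[1:])  # <s> is never predicted, so it is excluded
        for prev, word in zip(tokens, tokens[1:]):
            followers[prev][word] += 1
    return unigrams, followers

def p_unigram(word, unigrams):
    """Lower-order estimate P(w): plain relative frequency (unsmoothed)."""
    return unigrams[word] / sum(unigrams.values())

def p_seen(word, h, followers):
    """Discounted estimate P*(w|h); only meaningful when (h, w) was seen."""
    return (followers[h][word] - DISCOUNT) / sum(followers[h].values())

def backoff_weight(h, unigrams, followers):
    """alpha(h): mass freed by discounting divided by unigram mass of unseen words."""
    seen = followers.get(h)
    if not seen:
        return 1.0  # unseen history: rely entirely on the lower-order model
    left_over = 1.0 - sum(p_seen(w, h, followers) for w in seen)
    unseen_mass = 1.0 - sum(p_unigram(w, unigrams) for w in seen)
    return left_over / unseen_mass if unseen_mass > 0 else 0.0

def p_backoff(word, h, unigrams, followers):
    """Use the discounted bigram if seen, otherwise the weighted unigram."""
    seen = followers.get(h)
    if seen and seen[word] > 0:
        return p_seen(word, h, followers)
    return backoff_weight(h, unigrams, followers) * p_unigram(word, unigrams)

# Hypothetical toy corpus.
CORPUS = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
unigrams, followers = train(CORPUS)
alpha = backoff_weight("like", unigrams, followers)
total = sum(p_backoff(w, "like", unigrams, followers) for w in unigrams)
print(alpha, total)
```

Summing the back-off probabilities over the whole vocabulary for the history "like" gives 1 (up to floating-point error), which is exactly the constraint the back-off weight is chosen to satisfy.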

22 Perplexity
The "quality" of an LM is typically measured by its perplexity, or "branching factor". Roughly, the perplexity is the average number of words that can follow a history; lower perplexity indicates a better model.
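A small sketch of the usual definition, perplexity = exp(-(1/N) Σ log P(wi | hi)), computed from the per-word probabilities a model assigns to a test text. The function name and the example probabilities are illustrative assumptions.

```python
import math

def perplexity(word_probabilities):
    """Perplexity of a test text given the model probability of each word.

    Equals the inverse geometric mean of the probabilities. A zero
    probability makes math.log raise ValueError, which is one reason
    unseen N-grams need back-off or smoothing.
    """
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))

# A model that always spreads its mass evenly over 10 words has a
# branching factor of 10, regardless of the length of the text.
uniform = perplexity([0.1] * 20)
mixed = perplexity([0.5, 0.125])
print(uniform, mixed)
```

The uniform case shows why perplexity is read as a branching factor: choosing evenly among 10 words at every step gives a perplexity of 10.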

23 N-gram LM for Speech Recognition
The language model is one of the knowledge sources used in automatic speech recognition (ASR). Almost all state-of-the-art ASR systems use a back-off N-gram LM, typically with N = 3 (a trigram model).

