A new framework for Language Model Training
David Huggins-Daines
January 19, 2006

Overview
- Current tools
- Requirements for new framework
- User interface
- Examples
- Design and API

Current status of LM training
- The CMU SLM toolkit
  - Efficient implementation of basic algorithms
  - Doesn't handle all tasks of building an LM:
    - Text normalization
    - Vocabulary selection
    - Interpolation/adaptation
  - Requires an expert to "put the pieces together"
- Lots of scripts: SimpleLM, Communicator, CALO, etc.
- Other LM toolkits: SRILM, Lemur, others?

Requirements: LM training should be
- Repeatable: an "end-to-end" rebuild should produce the same result
- Configurable: it should be easy to change parameters and rebuild the entire model to see their effect
- Flexible: should support many types of source text and methods of training
- Extensible: modular structure to allow new methods and data sources to be easily implemented

Tasks of building an LM
- Normalize source texts
  - They come in many different formats!
  - The LM toolkit expects a stream of words
  - What is a "word"? Compound words, acronyms, non-lexemes (filler words, pauses, disfluencies)
  - What is a "sentence"? Segmentation of input data
  - Annotate source texts with class tags
- Select a vocabulary
  - Determine optimal vocabulary size
  - Collect words from training texts
  - Define vocabulary classes
  - Vocabulary closure
- Build a dictionary (pronunciation modeling)

Tasks, continued
- Estimate N-gram model(s)
  - Choose the appropriate smoothing parameters
  - Find the appropriate divisions of the training set
- Interpolate N-gram models
  - Use a held-out set representative of the test set
  - Find weights for the different models which maximize likelihood (minimize perplexity) on this domain
- Evaluate the language model
  - Jointly minimize perplexity and OOV rate (they tend to move in opposite directions)
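The interpolation and evaluation steps above have standard formulations, not spelled out on the slide. Linear interpolation combines component models P_i under non-negative weights that sum to one, and perplexity is the exponentiated average negative log-likelihood on held-out text:

```latex
% Linear interpolation of component models P_i with weights \lambda_i
P(w \mid h) = \sum_i \lambda_i \, P_i(w \mid h),
\qquad \sum_i \lambda_i = 1,\ \lambda_i \ge 0

% Perplexity on held-out text w_1 \dots w_N (lower is better)
\mathrm{PP} = \exp\!\Big( -\frac{1}{N} \sum_{j=1}^{N} \ln P(w_j \mid h_j) \Big)
```

The weights are typically fit on the held-out set with EM; maximizing held-out likelihood is exactly minimizing held-out perplexity, which is why the slide treats the two as interchangeable.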

A Simple Switchboard Example
Annotations on the example configuration:
- Top-level tag (must be only one)
- A set of transcripts
- The input filter to use
- A list of files
- Exclude singletons
- Backreference to a named object
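The XML configuration itself did not survive transcription; a hypothetical sketch matching the annotations above might look like the following (every element and attribute name here is a guess, not the toolkit's actual schema):

```xml
<!-- Hypothetical configuration; element names are illustrative only -->
<lmtraining id="swb">                      <!-- top-level tag: must be only one -->
  <transcripts filter="HUB5">              <!-- a set of transcripts; the input filter to use -->
    <file>swb1.lsn</file>                  <!-- a list of files -->
    <file>swb2.lsn</file>
  </transcripts>
  <vocabulary id="swb-vocab" exclude_singletons="true"/>  <!-- exclude singletons -->
  <ngram vocabulary="swb-vocab"/>          <!-- backreference to a named object -->
</lmtraining>
```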

A More Complicated Example swb.test.lsn icsi.test.mrt BRAZIL cmu.test.trs (Interpolation of ICSI and Switchboard) Vocabularies can be nested (merged) Files can be listed directly in element contents Words can be listed directly in element contents Held-out set for interpolation Interpolate previously named LMs

Command-line Interface
- lm_train: "runs" an XML configuration file
- build_vocab: build vocabularies, normalize transcripts
- ngram_train: train individual N-gram models
- ngram_test: evaluate N-gram models
- ngram_interpolate: interpolate and combine N-gram models
- ngram_pronounce: build a pronunciation lexicon from a language model or vocabulary

Programming Interface
- NGramFactory: builds an NGramModel from an XML specification (as seen previously)
- NGramModel: trains a single N-gram LM from some transcripts
- Vocabulary: builds a vocabulary from transcripts or other vocabularies
- InputFilter: reads transcripts in some format and outputs a word stream; subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc.

Design in Plain English
- NGramFactory builds an NGramModel
- NGramModel has a Vocabulary
- NGramModel and Vocabulary can have Transcripts
- NGramModel and Vocabulary use an InputFilter (or maybe they don't)
- NGramModel can merge two other NGramModels using a set of Transcripts
- Vocabulary can merge another Vocabulary
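The relationships above can be sketched with a toy Perl mock-up. Only the class names come from the slides; the method names and internals here are hypothetical stand-ins, not the toolkit's actual API:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in: only the *relationships* from the slide are modeled.
package Vocabulary;
sub new   { my ($class) = @_; return bless { words => {} }, $class }
sub add   { my ($self, @words) = @_; $self->{words}{$_} = 1 for @words }
sub merge { my ($self, $other) = @_;            # "Vocabulary can merge another Vocabulary"
            $self->add(keys %{ $other->{words} }) }
sub size  { my ($self) = @_; return scalar keys %{ $self->{words} } }

package NGramModel;
sub new   { my ($class, %args) = @_;
            return bless { vocab => $args{vocab} }, $class }  # "NGramModel has a Vocabulary"
sub vocab { return $_[0]{vocab} }

package main;
my $v1 = Vocabulary->new;
$v1->add(qw(hello world));
my $v2 = Vocabulary->new;
$v2->add(qw(world again));
$v1->merge($v2);                        # merged vocabulary: hello, world, again
my $lm = NGramModel->new(vocab => $v1);
print $lm->vocab->size, "\n";           # prints 3
```

The point of the mock-up is the ownership direction: models hold vocabularies, and merging happens at the vocabulary level, mirroring the bullet list above.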

A very simple InputFilter (InputFilter/Simple.pm)

use strict;                    # please!!! (this is just good practice)
package InputFilter::Simple;
require InputFilter;
use base 'InputFilter';        # subclass of InputFilter

sub process_transcript {
    my ($self, $file) = @_;
    local ($_, *FILE);
    # Read the input file
    open FILE, "<$file" or die "Failed to open $file: $!";
    while (<FILE>) {
        chomp;
        # Tokenize, normalize, etc.
        my @words = split;
        # Pass each sentence to this method
        # (the method call itself was lost in transcription)
    }
}
1;

Where to get it
- Currently in CVS on fife.speech: ":ext:fife.speech.cs.cmu.edu:/home/CVS", module LMTraining
- Future: CPAN and cmusphinx.org
- Possibly integrated with the CMU SLM toolkit in the future

Stuff TODO
- Class LM support
  - Communicator-style class tags are recognized and supported
  - NGramModel will build .lmctl and .probdef files
  - However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
  - Automatic tagging would be nice…
- Support for languages other than English
  - Text normalization conventions
  - Word segmentation (for Asian languages)
  - Character set support (case conversions etc.)
  - Unicode (also a CMU-SLM problem)

Questions?