SRILM - The SRI Language Modeling Toolkit

Name: SRILM - The SRI Language Modeling Toolkit
Uploaded: 2017-07-11T02:22:19+00:00
Duration: PTM7S28
Channel: Alice May
Description: SRILM - The SRI Language Modeling Toolkit

SRILM - The SRI Language Modeling Toolkit
Presented by Yeon JongHeum, Intelligent Database Systems Laboratory, SNU

Contents
Environment
Download
Compile
Making Corpus
Execution
Result

Environment
Hardware: IBM ThinkPad T41, Intel(R) Pentium(R) M processor 1600MHz, 1GiB DDR RAM
OS: Ubuntu Linux 8.04

Download http://www.speech.sri.com/projects/srilm/download.html

Compile
These steps assume an Ubuntu environment, so the commands may differ somewhat on other Linux distributions.
Install the required packages, such as csh, tcl, gcc, g++, and gawk:
sudo aptitude install csh tcl tcl-dev build-essential gawk

Compile (cont’d)
Extract the downloaded SRILM archive:
tar xvfz srilm.tgz

Compile (cont’d)
Add write permission.

Compile (cont’d)
Edit the SRILM environment variable in the Makefile.

Compile (cont’d)
Edit CC, CXX, TCL_INCLUDE, and related settings in common/Makefile.machine.ARCH.
ARCH is the platform SRILM runs on; find it by running sbin/machine-type.

Compile (cont’d)
Compile with the make World command.

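Corpus
Convert the encoding of the morphologically analyzed files from euc-kr to utf-8.
Extract the morphemes from the converted files and build one large file.
Split that file into a Training Set and a Test Set.
Each line holds one sentence, and the morphemes are separated by spaces.
For the scripts, see the reference.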
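The preparation scripts themselves are not part of this transcript. As a minimal sketch of the two mechanical steps, re-encoding and splitting, something like the following could be used. The function names `convert_encoding` and `split_corpus`, the 10% test ratio, and the fixed random seed are all assumptions made for illustration, not the presenter's actual procedure.

```python
import random


def convert_encoding(raw_bytes, src="euc-kr", dst="utf-8"):
    """Re-encode a byte string from the source encoding to the target encoding."""
    return raw_bytes.decode(src).encode(dst)


def split_corpus(sentences, test_ratio=0.1, seed=0):
    """Shuffle one-sentence-per-line text and split it into (train, test) lists.

    test_ratio and seed are illustrative defaults, not values from the slides.
    """
    # Drop empty lines so every remaining entry is one sentence.
    cleaned = [s.strip() for s in sentences if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    n_test = int(len(cleaned) * test_ratio)
    return cleaned[n_test:], cleaned[:n_test]
```

Each list can then be written out with one sentence per line to produce files such as train_morCorpus.txt and testCorpus.txt.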

Corpus - Example

Execution
SRILM: Training Set -> ngram-count -> Language Model
SRILM: Language Model + Test Set -> ngram -> Perplexity

ngram-count
Command (Default)
ngram-count -text train_morCorpus.txt -lm lm_default.txt
Default: Trigram, Good-Turing discounting, Katz backoff
Parameter
-text : corpus to read
-lm : output file of language model

Good-Turing Discounting Parameters
Command
ngram-count -text train_morCorpus.txt -lm lm_gt_3_7.txt -order 3 -gt1min 3 -gt1max 7 -gt2min 3 -gt2max 7 -gt3min 3 -gt3max 7
Parameter
-gtNmin count : Min Count
-gtNmax count : Max Count
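As described in the ngram-count manual page, -gtNmin sets the minimum count an N-gram of order N needs in order to be kept in the model, and -gtNmax sets the largest count that is still discounted; N-grams seen more often than that keep their maximum-likelihood estimates. The sketch below illustrates the idea with the textbook Good-Turing adjusted count r* = (r + 1) n_{r+1} / n_r. It is not SRILM's exact formula (SRILM applies a Katz-style renormalized discount), and the function name `good_turing_adjusted_counts` is hypothetical.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts, gt_min=1, gt_max=7):
    """Return textbook Good-Turing adjusted counts for a dict of n-gram counts.

    N-grams with count < gt_min are dropped; counts > gt_max are left unchanged.
    This is a simplified illustration, not SRILM's implementation.
    """
    # n_r: how many distinct n-grams occur exactly r times.
    count_of_counts = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if r < gt_min:
            continue  # too rare: excluded from the model
        n_r = count_of_counts[r]
        n_r1 = count_of_counts.get(r + 1, 0)
        if r > gt_max or n_r1 == 0:
            # Frequent counts are trusted as-is; also fall back to the raw
            # count when n_{r+1} is zero and the estimate is undefined.
            adjusted[ngram] = float(r)
        else:
            adjusted[ngram] = (r + 1) * n_r1 / n_r
    return adjusted
```

With -gt1min 3 -gt1max 7 as in the command above, unigrams seen fewer than 3 times would be dropped and only counts from 3 to 7 would be discounted.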

Format of Language Model
e.g., lm_default.txt
\data\
ngram 1=200989
ngram 2=
ngram 3=
\1-grams:
무조
무조건
무조소
\2-grams:
군종 교구
군종 교구장
군종 사목
Columns: Log probability (Base 10), n-gram, Log of Backoff Weight
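To make the column layout concrete, here is a minimal sketch of reading one n-gram entry from such a file. It assumes the fields are tab-separated (log probability, the n-gram, and an optional backoff weight), which is how ngram-count normally writes them. The function name `parse_arpa_entry` and the numeric values in the usage comment are made up for illustration and do not come from lm_default.txt.

```python
def parse_arpa_entry(line):
    """Parse one n-gram line into (log10_prob, words, log10_backoff_or_None).

    Assumes tab-separated fields: logprob, n-gram, optional backoff weight.
    """
    fields = line.rstrip("\n").split("\t")
    logprob = float(fields[0])
    words = tuple(fields[1].split())
    # Highest-order n-grams carry no backoff weight, so the field is optional.
    backoff = float(fields[2]) if len(fields) > 2 and fields[2] else None
    return logprob, words, backoff


def to_probability(log10_value):
    """Convert a base-10 log value back to a plain probability or weight."""
    return 10.0 ** log10_value


# Hypothetical entry, not taken from the slides:
# parse_arpa_entry("-2.5\t군종 교구\t-0.3")
```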

Ney’s absolute discounting
Command
ngram-count -text train_morCorpus.txt -lm lm_absoulte0.5_3gram.txt -order 3 -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
Parameter
-order n : generate n-grams up to order n. If omitted, n-grams up to trigrams are generated.
-cdiscountN value : value is the constant to subtract from the counts of N-grams
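The effect of the constant can be shown on the counts that follow a single history. The sketch below subtracts a fixed discount D from every observed count and reports the probability mass freed for unseen words. It is a simplified illustration of absolute discounting in general, with a hypothetical function name, and does not reproduce how SRILM distributes the freed mass through backoff.

```python
def absolute_discount(counts, discount=0.5):
    """Apply absolute discounting to the word counts following one history.

    Returns (probs, reserved): discounted probabilities for seen words and
    the leftover mass available to unseen words.
    """
    total = sum(counts.values())
    # Subtract the constant from each count, never going below zero.
    probs = {w: max(c - discount, 0.0) / total for w, c in counts.items()}
    reserved = 1.0 - sum(probs.values())
    return probs, reserved
```

For example, with counts {"a": 3, "b": 1} and D = 0.5, the seen words receive 2.5/4 and 0.5/4, and the remaining 0.25 is reserved for unseen words.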

Witten-Bell discounting
Command
ngram-count -text train_morCorpus.txt -lm lm_witten_3gram.txt -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3
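For comparison with the constant-based method, Witten-Bell discounting takes no parameter: the mass reserved for unseen words depends on how many distinct word types follow a history. A minimal sketch of the standard formulation for a single history is shown below, where c is the total count and T the number of distinct following types. The function name is hypothetical and the code is not SRILM's implementation.

```python
def witten_bell(counts):
    """Witten-Bell estimate for the word counts following one history.

    Seen words get count / (c + T); unseen words share T / (c + T),
    where c is the total count and T the number of distinct word types.
    """
    total = sum(counts.values())
    types = len(counts)
    denom = total + types
    probs = {w: c / denom for w, c in counts.items()}
    reserved = types / denom
    return probs, reserved
```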

Ristad's natural discounting
Command
ngram-count -text train_morCorpus.txt -lm lm_nd_3gram.txt -order 3 -ndiscount1 -ndiscount2 -ndiscount3

Chen and Goodman's modified Kneser-Ney discounting
Command
ngram-count -text train_morCorpus.txt -lm lm_knd_5gram.txt -order 3 -kndiscount1 -kndiscount2 -kndiscount3

Original Kneser-Ney discounting
Command
ngram-count -text train_morCorpus.txt -lm lm_uknd_5gram.txt -order 3 -ukndiscount1 -ukndiscount2 -ukndiscount3

Discounting with Interpolate
Original Kneser-Ney discounting + Interpolate
Command
ngram-count -text train_morCorpus.txt -lm lm_uknd_inter_5gram.txt -order 3 -ukndiscount1 -ukndiscount2 -ukndiscount3 -interpolate1 -interpolate2 -interpolate3
Parameter
-interpolateN
Only Witten-Bell, absolute discounting, and (original or modified) Kneser-Ney smoothing currently support interpolation.

Compute Perplexity
Command
ngram -lm lm_default.txt -ppl testCorpus.txt
Parameter
-lm : Language Model
-ppl : Compute sentence scores (log probabilities) and perplexities from the sentences in textfile
Result
file testCorpus.txt: sentences, words, OOVs
0 zeroprobs, logprob= e+06 ppl= ppl1=
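The two perplexity figures in the result line are derived from the other fields. Following the definitions in the SRILM documentation, ppl counts one end-of-sentence token per sentence while ppl1 does not, and OOVs and zero-probability words are excluded from both. The sketch below recomputes them; the function name `srilm_perplexities` is hypothetical and the numbers in the usage note are invented, since the actual values did not survive in this transcript.

```python
def srilm_perplexities(logprob, sentences, words, oovs=0, zeroprobs=0):
    """Recompute (ppl, ppl1) from the fields printed by `ngram -ppl`.

    logprob is the total base-10 log probability of the test text.
    """
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10.0 ** (-logprob / (scored_words + sentences))
    # ppl1 is normalized by the scored words only.
    ppl1 = 10.0 ** (-logprob / scored_words)
    return ppl, ppl1
```

For instance, a made-up test set with logprob = -300, 50 sentences, and 100 words would give ppl = 10^(300/150) = 100 and ppl1 = 10^(300/100) = 1000, which shows why ppl1 is always the larger of the two.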

Result
Perplexity (ppl and ppl1) compared across smoothing methods: Absolute Discounting, Witten-Bell, Ristad's Natural, modified Kneser-Ney, original Kneser-Ney, original Kneser-Ney + Interpolate, Good-Turing, No Smoothing, Smoothing +1.
Values remaining from the results table, in their original order and without row or column assignment:
ppl: 810.01, 75.714, 70.247, 72.076, 8257
ppl1: 959.83, 2346.1, 131.32, 100.44, 86.538
