Digital Speech Processing Homework 3

Slides:

Advertisements

Similar presentations

專題研究 WEEK 4 - LIVE DEMO Prof. Lin-Shan Lee TA. Hsiang-Hung Lu,Cheng-Kuan Wei.

Advertisements

ICS103 Programming in C Lecture 1: Overview of Computers & Programming

BETA E-hand-in: Guidelines for Home assignments 1.

Zhang Hongyi CSCI2100B Data Structures Tutorial 2

Perl Practical Extraction and Report Language Senior Projects II Jeff Wilson.

1 Language Model (LM) LING 570 Fei Xia Week 4: 10/21/2009 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAA A A.

Almost-Spring Short Course on Speech Recognition Instructors: Bhiksha Raj and Rita Singh Welcome.

Lane Medical Library & Knowledge Management Center Perl Programming for Biologists SESSION 2: Tue Feb 10 th 2009 Yannick Pouliot,

Matlab Software To Do Analyses as in Marron’s Talks Matlab Available from UNC Site License Download Software: Google “Marron Software”

Agenda What is Computer Programming? The Programming Process

12/13/2007Chia-Ho Ling1 SRILM Language Model Student: Chia-Ho Ling Instructor: Dr. Veton Z. K ë puska.

Name:Venkata subramanyan sundaresan Instructor:Dr.Veton Kepuska.

CMU-Statistical Language Modeling & SRILM Toolkits

Arthur Kunkle ECE 5525 Fall Introduction and Motivation  A Large Vocabulary Speech Recognition (LVSR) system is a system that is able to convert.

SAS Workshop Lecture 1 Lecturer: Annie N. Simpson, MSc.

IT-101 Section 001 Lecture #3 Introduction to Information Technology.

Statistical Language Modeling using SRILM Toolkit

Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.

CS OPERATING SYSTEMS Project 2 Matrix Multiplication Project.

Compiled Matlab on Condor: a recipe 30 th October 2007 Clare Giacomantonio.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG.

Input & Output In Java. Input & Output It is very complicated for a computer to show how information is processed. Although a computer is very good at.

Operating Systems Session 1. Contact details TA: Alexander(Sasha) Apartsin ◦ ◦ Office hours: Homepage:

Computer Programming I Hour 2 - Writing Your First C Program.

IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.

Operating Systems Session 1. Contact details TA: Alexander(Sasha) Apartsin ◦ ◦ Office hours: TA: Sasha Alperovich.

File Input and Output July 2nd, Inputs and Outputs Inputs Keyboard Mouse storage(hard drive) Networks O utputs Graphs Images Videos(image stacks)

Compact WFSA based Language Model and Its Application in Statistical Machine Translation Xiaoyin Fu, Wei Wei, Shixiang Lu, Dengfeng Ke, Bo Xu Interactive.

SRILM - The SRI Language Modeling Toolkit

NRC Report Conclusion Tu Zhaopeng NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based.

IR Homework #1 By J. H. Wang Mar. 5, Programming Exercise #1: Indexing Goal: to build an index for a text collection using inverted files Input:

Latent Topic Modeling of Word Vicinity Information for Speech Recognition Kuan-Yu Chen, Hsuan-Sheng Chiu, Berlin Chen ICASSP 2010 Hao-Chin Chang Department.

Homework #4: Operator Overloading and Strings By J. H. Wang Apr. 17, 2009.

Speech Analysis TA : 林賢進 HW /10/28 1. Goal This homework is aimed to analyze speech from spectrogram, and try to distinguish different initials/

Digital Speech Processing HW3

Object Oriented Programming COP3330 / CGS5409.  Assignment Submission Overview  Compiling with g++  Using Makefiles  Misc. Review.

1 Lecture 4 Post-Graduate Students Advanced Programming (Introduction to MATLAB) Code: ENG 505 Dr. Basheer M. Nasef Computers & Systems Dept.

Homework 1 (due:April 10th) Deadline : April 10th 11:59pm Where to submit? eClass 과제방 ( How to submit? Create a folder. The name.

DISCRETE HIDDEN MARKOV MODEL IMPLEMENTATION DIGITAL SPEECH PROCESSING HOMEWORK #1 DISCRETE HIDDEN MARKOV MODEL IMPLEMENTATION Date: Oct, Revised.

Homework 1 (due:April 8th) Deadline : April 8th 11:59pm Where to submit? eClass “ 과제방 ” ( How to submit? Create a folder. The name.

1 Project 7: Looping. Project 7 For this project you will produce two Java programs. The requirements for each program will be described separately on.

Homework 2 (due:April 17th) Deadline : April 17th 11:59pm Where to submit? eClass “ 과제방 ” ( How to submit? Create a folder. The.

HW2-2 Speech Analysis TA: 林賢進

Digital Speech Processing Homework /05/04 王育軒.

+ Auto-Testing Code for Teachers & Beginning Programmers Dr. Ronald K. Smith Graceland University.

Homework 1 (due:April 13th) Deadline : April 13th 11:59pm Where to submit? eClass 과제방 ( How to submit? Create a folder. The name.

ENEE150 Discussion 01 Section 0101 Adam Wang.

Date: October, Revised by 李致緯

Homework 3 (due:May 27th) Deadline : May 27th 11:59pm

Prof. Lin-shan Lee TA. Roy Lu

Digital Speech Processing Homework 3

Speech Analysis TA:Chuan-Hsun Wu

ICS103 Programming in C Lecture 1: Overview of Computers & Programming

專題研究 week3 Language Model and Decoding

Prof. Lin-shan Lee TA. Lang-Chi Yu

Digital Speech Processing

Homework 3 (due:June 5th)

Digital Speech Processing Homework 3

Operation System Program 4

Language-Model Based Text-Compression

Introduction to Primitive Data types

Homework 3 May 王育軒.

Prof. Lin-shan Lee TA. Po-chun, Hsu

A message-based cocktail watermarking system

Homework 2 (due:May 15th) Deadline : May 15th 11:59pm

Homework 2 (due:May 5th) Deadline : May 5th 11:59pm

Prof. Lin-shan Lee TA. Roy Lu

Computer network HW2 -Retransmission + Congest control

Introduction to Primitive Data types

Presentation transcript:

Digital Speech Processing Homework 3 蔡政昱吳全勳 2014/5/21

Outline Introduction SRILM Requirement Submission Format

Outline Introduction SRILM Requirement Submission Format

Introduction 讓他十分ㄏ怕只ㄒ望ㄗ己明ㄋ度別再這ㄇㄎ命了演ㄧㄩ樂產ㄧㄐ入積ㄐㄓ型提ㄕ競爭ㄌ Your HW3 讓他十分害怕只希望自己明年度別再這麼苦命了演藝娛樂產業加入積極轉型提升競爭力

Introduction In general, we can use a language model For example, let Z = 演ㄧㄩ樂產ㄧ P(Z) is independent of W W = w1w2…wN , Z = z1z2…zN Available from Bigram Language Model

Introduction 演ㄧㄩ樂業藝餘娛於 0.02 0.2 0.1 0.01 0.3

So… We need to build a bigram character-based language model. Use the language model to decode the sequence. There is a nice toolkit to help you.

Outline Introduction SRILM Requirement Submission Format

SRILM SRI Language Model Toolkit http://www.speech.sri.com/projects/srilm/ A toolkit for building and applying various statistical language models C++ classes in SRILM are very useful Using and reproducing some programs of SRILM in this homework

SRILM Download the executable from the course website Different platform: i686 for 32-bit GNU/Linux i686-m64 for 64-bit GNU/Linux (CSIE workstation) Cygwin for 32-bit Windows with cygwin environment If you want to use the C++ library, you can build it from the source code

SRILM You are strongly recommended to read FAQ on the course website Possibly useful codes in SRILM $SRIPATH/misc/src/File.cc (.h) $SRIPATH/lm/src/Vocab.cc (.h) $SRIPATH/lm/src/ngram.cc (.h) $SRIPATH/lm/src/testError.cc (.h)

SRILM perl separator.pl corpus.txt > corpus_seg.txt

SRILM ./ngram-count –text corpus_seg.txt –write lm.cnt –order 2 -text: input text filename -write: output count filename -order: order of ngram language model ./ngram-count –read lm.cnt –lm bigram.lm –unk –order 2 -read: input count filename -lm: output language model name -unk: view OOV as <unk> without this, all the OOV will be removed

Example 在國民黨失去政權後第一次參加元旦總統府升旗典禮 corpus_seg.txt 在國民黨失去政權後第一次參加元旦總統府升旗典禮有立委感慨國民黨不團結才會失去政權有立委則猛批總統陳水扁人人均顯得百感交集 …. trigram.lm \data\ ngram 1=6868 ngram 2=1696830 ngram 3=4887643 \1-grams: -1.178429 </s> -99 <s> -2.738217 -1.993207 一 -1.614897 -4.651746 乙 -1.370091 ...... lm.cnt 夏 11210 俸 267 鴣 7 衹 1 微 11421 檎 27 …… Log Probability

SRILM ./disambig –text $file –map $map –lm $LM –order $order -text: input filename -map: a mapping from (注音/國字) to (國字) You should generate this mapping by yourself from the given utf8-ZhuYin.map, either using EXCEL or writing a simple program on your own. -lm: input language model

SRILM Be aware of polyphones (破音字). There should be spaces between all characters. utf8-ZhuYin.map 一ㄧˊ/ㄧˋ/ㄧ_ 乙ㄧˇ 丁ㄉㄧㄥ_ ㄑㄧ_ 乃ㄋㄞˇ ㄐㄧㄡˇ … 長ㄔㄤˊ/ㄓㄤˇ 行ㄒㄧㄥˊ/ㄏㄤˊ ZhuYin-utf8.map ㄅ八匕卜不卞巴比丙包 … 八八匕匕卜卜 … ㄆ仆匹片丕叵平扒扑疋 … 仆仆匹匹

Outline Introduction SRILM Requirement Submission Format

Requirement (I) Segment corpus and all test data into characters perl separator.pl corpus.txt corpus_seg.txt perl separator.pl <testdata/xx.txt> <testdata/xx.txt> Train character-based bigram LM Get counts: ./ngram-count –text corpus_seg.txt –write lm.cnt –order 2 Compute probability: ./ngram-count –read lm.cnt –lm bigram.lm –unk –order 2 Generate the map from utf8-ZhuYin.map See FAQ 4 Using disambig to decode testdata/xx.txt ./disambig –text $file –map $map –lm $LM –order $order > $output

Requirement (II) Implement your version of disambig. Using dynamic programming (Viterbi). The vertical axes are the candidate characters.

Requirement (II) ex: <s> 這是一個範例格式 </s> You have to use C++ or Matlab. You are strongly recommended to use C++… Speed Using SRILM’s library will save you a lot of time (please refer to FAQ) Your output format should be consistent with srilm. ex: <s> 這是一個範例格式 </s> There are an <s> at the beginning of a sentence, a </s> at the end, and whitespaces in between all characters.

How to deal with utf8 string s="ㄏㄏ^^ 先洗澡~~"; for(int i=0; i<s.length();){ if(s[i]>char(0)){ //it’s not a Chinese character i++; } else if (s[i]==char(0xe3) && s[i+1]==char(0x84) && (s[i+2]>=char(0x85) && s[i+2]<=char(0xa9))){ // it’s Bopomofo cout<<s[i]<<s[i+1]<<s[i+2]; i+=3; } else { //simply treat others as Chinese characters } ------ Output: ㄏㄏ A Chinese character in utf8 is always 3 bytes, and the three bytes are always 1110xxxx, 10xxxxxx and 10xxxxxx. The ZhuYin characters in utf8 are from [E3][84][85] to [E3][84][A9]. So if we want to know whether a character is ZhuYin …

Outline Introduction SRILM Requirement Submission Format

Submission format Put all the files into the directory [Your_Student_ID], and rename it after your own ID. Files required: The ZhuYin-utf8.map you generated The decoded results of 10 test data produced by SRILM’s disambig result1/1.txt ~ 10.txt The decoded results of 10 test data produced by your disambig result2/1.txt ~ 10.txt All source codes (your disambig & your program for the map generation) Makefile (if C++ is used) Report Neither SRILM related files nor corpus_seg.txt nor LMs Compress the directory [Your_Student_ID] into zip file (so the zip file will only have one directory in it), and then upload to CEIBA. Any wrong file format will lose 10 points.

Submission format (report) The report should include: 1. Your environment (CSIE workstation, Cygwin, …) 2. How to “compile” your program (if C++ is used) 3. How to “execute” your program (give me examples) ex: ./program –a xxx –b yyy 4. What you have done 5. NO more than two A4 pages. 6. NO “what you have learned”

Grading Requirement I (40%) Requirement II (40%) Report (20%) Note that if you use C++ and there’s no Makefile in the submitted zip file, score in this part will be halved. Report (20%) Bonus (15%) Character-based trigram language model (10%) (need pruning for speed) Other strategies (5%)

If you have any questions… FAQ http://speech.ee.ntu.edu.tw/homework/DSP_HW3/faq.html 蔡政昱 r02942067@ntu.edu.tw 吳全勳 r02922002@ntu.edu.tw