Syntactic And Sub-lexical Features For Turkish Discriminative Language Models ICASSP 2010 Ebru Arısoy, Murat Saraçlar, Brian Roark, Izhak Shafran Bang-Xuan.


1 Syntactic And Sub-lexical Features For Turkish Discriminative Language Models
ICASSP 2010
Ebru Arısoy, Murat Saraçlar, Brian Roark, Izhak Shafran
Bang-Xuan Huang, Department of Computer Science & Information Engineering, National Taiwan Normal University

2 Outline
Introduction
Sub-lexical language models
Feature sets for DLM
–Morphological Features
–Syntactic Features
–Sub-lexical Features
Experiments
Conclusions and Discussion

3 Introduction
In this paper we make use of both sub-lexical recognition units and discriminative training in Turkish language models.
Turkish is an agglutinative language: most words are formed by joining morphemes together. Its agglutinative nature leads to a high number of out-of-vocabulary (OOV) words, which degrade ASR accuracy.
To handle the OOV problem, vocabularies composed of sub-lexical units have been proposed for agglutinative languages.
Presenter's notes: "syntactic" (句法, syntax) refers to the sentence level, e.g. 今天 下午 需要 開會 ("today / afternoon / need / hold a meeting"); "lexical" refers to the word level.

4 Introduction
DLM is a complementary approach to the baseline language model. In contrast to the generative language model, it is trained on acoustic sequences with their transcripts to optimize discriminative objective functions using both positive (reference transcriptions) and negative (recognition errors) examples.
DLM is a feature-based language modeling approach. Therefore, each candidate hypothesis in the DLM training data is represented as a feature vector of the acoustic input, x, and the candidate hypothesis, y.
[Figure: a sentence x and its candidate hypotheses (Ex: N-best, lattice), each mapped to a feature vector with components 0, 1, 2, 3, ..., i]
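To make the feature-vector idea concrete, here is a minimal sketch in Python. It assumes n-gram count features over the hypothesis tokens (words or morphs) and a linear score, and it ignores the acoustic input x. The names `ngram_features`, `score`, and `rerank` are hypothetical and the weights are hand-set; the slides do not specify the features' exact form or the training algorithm.

```python
from collections import Counter


def ngram_features(tokens, max_order=2):
    """Map one candidate hypothesis to a sparse vector of n-gram counts.

    Assumption: features are unigram and bigram counts with sentence
    boundary markers; the paper's actual feature templates may differ.
    """
    padded = ["<s>"] + list(tokens) + ["</s>"]
    feats = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(padded) - order + 1):
            feats[tuple(padded[i:i + order])] += 1
    return feats


def score(feats, weights):
    """Linear model: dot product of the sparse features with the weights."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def rerank(nbest, weights):
    """Return the highest-scoring hypothesis from an N-best list."""
    return max(nbest, key=lambda hyp: score(ngram_features(hyp), weights))


# Hypothetical two-entry N-best list and hand-set weights.
nbest = [["a", "b"], ["a", "c"]]
weights = {("a", "c"): 1.0}
best = rerank(nbest, weights)
```

In a trained model the weights would be learned from the positive and negative examples described above rather than set by hand.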

5 Sub-lexical models
In this approach, the recognition lexicon is composed of sub-lexical units instead of words.
Both grammatically derived units (stems, affixes, or their groupings) and statistically derived units (morphs) have been proposed as lexical items for Turkish ASR.
Morphs are learned statistically from words by the Morfessor algorithm, which uses the Minimum Description Length principle to learn a sub-word lexicon in an unsupervised manner.
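The Minimum Description Length idea can be illustrated with a toy two-part cost: the bits needed to spell out the lexicon plus the bits needed to encode the corpus with that lexicon. The sketch below is only an illustration of the principle, not Morfessor's actual cost function or search procedure. The function name `description_length`, the uniform 32-symbol alphabet, and the four-word corpus are assumptions made for the example.

```python
import math
from collections import Counter


def description_length(segmented_corpus, alphabet_size=32):
    """Toy two-part code length in bits (illustrative, not Morfessor's).

    segmented_corpus: list of words, each given as a list of units.
    """
    units = [u for word in segmented_corpus for u in word]
    counts = Counter(units)
    total = len(units)
    # Lexicon cost: spell each distinct unit with a uniform code per letter.
    lexicon_bits = sum(len(u) * math.log2(alphabet_size) for u in counts)
    # Corpus cost: encode each token using its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# Hypothetical corpus: the same four Turkish words, unsegmented and segmented.
as_words = [["evler"], ["evlerde"], ["kitaplar"], ["kitaplarda"]]
as_morphs = [["ev", "ler"], ["ev", "ler", "de"],
             ["kitap", "lar"], ["kitap", "lar", "da"]]

dl_words = description_length(as_words)
dl_morphs = description_length(as_morphs)
```

On this tiny corpus the segmented version has the shorter description, because the shared units `ev`, `kitap`, `ler`, and `lar` are stored once in the lexicon and reused.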

6 Feature sets for DLM
–Morphological Features
–Syntactic Features
–Sub-lexical Features
• Clustering of sub-lexical units
• Brown et al.’s algorithm
• minimum edit distance (MED)
• Long distance triggers

7 Feature sets for DLM
Root (base form), ex: able => dis-able, en-able, un-able, comfort-able-ly, ...
Inflectional groups (IG)
Brown et al.’s algorithm: semantically-based, syntactically-based
Minimum edit distance (MED): the minimum number of edit operations (insertion, deletion, substitution) needed to turn one string into another.
Ex: intention -> execution
del ‘i’ => ntention
sub ‘n’ to ‘e’ => etention
sub ‘t’ to ‘x’ => exention
ins ‘u’ => exenution
sub ‘n’ to ‘c’ => execution
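The edit-distance example above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch assuming unit cost for each insertion, deletion, and substitution; the slides do not state the cost scheme used in the paper, and the function name `min_edit_distance` is hypothetical.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit costs (an assumed cost scheme)."""
    # prev[j] is the distance from the first i-1 source characters
    # to the first j target characters.
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]  # deleting i characters reaches the empty target prefix
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,             # delete s
                curr[j - 1] + 1,         # insert t
                prev[j - 1] + (s != t),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


distance = min_edit_distance("intention", "execution")
```

With unit costs, `distance` is 5, matching the five edit steps listed in the example.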

8 Feature sets for DLM
Long distance triggers
Considering initial morphs as stems and non-initial morphs as suffixes, we assume that the existence of a morph can trigger another morph in the same sentence.
We extract all the morph pairs between the morphs of any two words in a sentence as the candidate morph triggers.
Among the possible candidates, we try to select only the pairs where morphs are occurring together for a special function.
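The candidate-extraction step can be sketched as follows. The sketch assumes a sentence is given as a list of words, each already split into morphs, and that pairs keep sentence order. The names `candidate_triggers` and `count_triggers` are hypothetical, and the selection step is left out because the slides do not say how the useful pairs are chosen.

```python
from collections import Counter
from itertools import combinations


def candidate_triggers(sentence):
    """Collect morph pairs whose two morphs come from different words.

    sentence: list of words, each a list of morphs, in sentence order.
    """
    pairs = set()
    for first_word, second_word in combinations(sentence, 2):
        for m1 in first_word:
            for m2 in second_word:
                pairs.add((m1, m2))
    return pairs


def count_triggers(corpus):
    """Count in how many sentences each candidate pair occurs."""
    counts = Counter()
    for sentence in corpus:
        counts.update(candidate_triggers(sentence))
    return counts


# Hypothetical two-word sentence, already segmented into morphs.
sentence = [["ev", "ler"], ["kitap", "lar"]]
pairs = candidate_triggers(sentence)
counts = count_triggers([sentence, sentence])
```

Pairs inside a single word, such as `("ev", "ler")`, are deliberately excluded, since the candidates are defined between the morphs of two different words.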

9 Experiments

10 Conclusions and Discussion
The main contributions of this paper are: (i) syntactic information is incorporated into Turkish DLM; (ii) the effect of language modeling units on DLM is investigated; (iii) morpho-syntactic information is explored when using sub-lexical units.
It is shown that DLM with basic features yields more improvement for morphs than for words.
Our final observation is that the high number of features is masking the expected gains of the proposed features, mostly due to the sparseness of the observations per parameter. This will make feature selection a crucial issue for our future research.

11 Weekly report
Generate word graph
Recognition result

             character  word
ML_training  83.54      76.24
MPE_iter1    84.83      77.77

12 MDLM-D + prior

Sigma              Train     Test      Dev
-      Train_best
       Dev_best
900    Train_best  0.937     0.855     0.862
       Dev_best    0.923     0.857     0.864
1600   Train_best  0.939     0.856     0.863
       Dev_best    0.924     0.857     0.865
2500   Train_best  0.940     0.856     0.864
       Dev_best    0.935     0.858     0.866
3600   Train_best  0.941     0.857     0.864
       Dev_best    0.932     0.858     0.866
8100   Train_best  0.941561  0.857374  0.864554
       Dev_best    0.933     0.858     0.866

13 MDLM-F vs MDLM-D + prior

                     Train  Test   Dev
MDLM-F   Train_best  0.949  0.860  0.867
         Dev_best    0.948  0.861  0.868
MDLM-D   Train_best
         Dev_best
MDLM-D+  Train_best  0.940  0.856  0.864
         Dev_best    0.935  0.858  0.866

