Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research and Development Division, National Electronics and Computer Technology Center (NECTEC), Thailand
Outline Introduction Related Works The Approach Experiments Conclusions
Introduction The Thai-English Keyboard System, Problems and Solutions
The Thai-English Keyboard System 174 characters: 83 Thai, 52 English, 39 symbols. 4 characters per button: 2 languages, 2 sets of characters. Special keys select the language and distinguish the set of characters. Example button: English A (shift) and a (no shift); Thai ฤ (shift) and ฟ (no shift).
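The "4 characters per button" layout can be pictured as a small lookup table. The sketch below is illustrative only: the names `LAYOUT` and `char_for` are hypothetical, and only the single button shown on the slide is listed.

```python
# One physical button carries four characters, selected by (language, shift).
# Only the button from the slide is listed; a full layout would have many more.
LAYOUT = {
    "a": {
        ("eng", False): "a",
        ("eng", True): "A",
        ("thai", False): "ฟ",
        ("thai", True): "ฤ",
    },
}


def char_for(button, lang, shift):
    """Return the character produced by a button in a given language/shift state."""
    return LAYOUT[button][(lang, shift)]
```

The two selectors in the key `(lang, shift)` correspond to the two decisions the user otherwise makes by hand: the language-switching key and the shift key.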
The Problems Thai users face two annoyances when typing Thai-English bilingual text: switching between languages with a language-switching key, and selecting a set of characters with a combination of the shift key and a character key. Other multilingual users face the same problems.
Related Works Language Identification and Key Prediction
Language Identification Microsoft Word: automatic language identification that considers only the surrounding text. Problems: no key prediction; it often detects the wrong language; switching to the proper language is difficult.
Key Prediction Zheng et al.: automatic word prediction based on a lexical tree, with word-class n-grams. Problems: no language identification; requires a large amount of memory.
The Approach Overview, Language Identification, Key Prediction and Pattern Shortening
The System Overview There are two main processes in our system: automatic language identification and automatic Thai key prediction. [Figure: flowchart with nodes Key Input, Language? (branches Eng and Thai), Thai Key Prediction, Output]
Language Identification [Equations (1) and (2)]
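A minimal sketch of bigram-based language identification over the first m keystrokes: one character-bigram model is trained per language, and the language whose model gives the key sequence the higher probability is chosen. Everything here is an assumption for illustration, not the authors' implementation: the function names, the add-one smoothing, the `vocab_size` value, and the toy training strings (`thai_keys` is a made-up stand-in for Thai text written as its QWERTY keystrokes).

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character bigrams and the left-context characters they start from."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts


def log_prob(keys, model, vocab_size=128):
    """Add-one smoothed bigram log-probability of a key sequence (assumed smoothing)."""
    bigrams, contexts = model
    total = 0.0
    for prev, cur in zip(keys, keys[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total


def identify_language(keys, models, m=6):
    """Score only the first m keystrokes under each model and return the best label."""
    prefix = keys[:m]
    return max(models, key=lambda lang: log_prob(prefix, models[lang]))


# Toy training data (hypothetical): English text and Thai text as keystrokes.
english_keys = "the language of the keyboard is english and the text is english "
thai_keys = "lkl9 k[lkl9 mpklkl9 kkklkl9 lkl9 k[lkl9 "
models = {"eng": train_bigram(english_keys), "thai": train_bigram(thai_keys)}
```

With these toy models, `identify_language("the text", models)` picks `"eng"` and `identify_language("kklkl9", models)` picks `"thai"`. The default `m=6` mirrors the deck's observation that the first 6 characters are enough.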
Thai Key Prediction [Equation (3)]
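A rough sketch of trigram key prediction: the user types every key unshifted, and for each key the model picks the shifted or unshifted character that is more likely given the two preceding characters. This is a simplification under stated assumptions: decoding is greedy and left-to-right rather than a search over the whole sequence, Latin lower/upper case stands in for unshifted/shifted Thai characters, and the names `train_trigram`, `candidates`, and `predict` are hypothetical.

```python
from collections import Counter


def train_trigram(text):
    """Count character trigrams and their two-character contexts ('^' pads the start)."""
    padded = "^^" + text
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts


def candidates(key):
    """Unshifted and shifted character for a key (here: lower and upper case)."""
    return [key, key.upper()] if key.upper() != key else [key]


def predict(keys, model, vocab_size=128):
    """Greedily choose, per key, the candidate with the highest smoothed trigram score."""
    trigrams, contexts = model
    out = "^^"
    for key in keys:
        context = out[-2:]

        def score(ch):
            return (trigrams[context + ch] + 1) / (contexts[context] + vocab_size)

        # On a tie, max() keeps the first candidate, i.e. the unshifted character.
        out += max(candidates(key), key=score)
    return out[2:]


# Toy corpus (hypothetical) in which some characters need the shift key.
model = train_trigram("Ann met Bob. Ann met Bob. ")
```

For example, `predict("ann met bob.", model)` returns `"Ann met Bob."`: the shift state is restored from context without the user pressing shift.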
Error-correction Rules Some character sequences often generate errors. Error patterns are collected for the error-correction process. Mutual information is used to optimize the patterns. [Figure: diagram with labels Training corpus, Tri-gram prediction, Errors from tri-gram prediction, Error-correction rules]
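Collecting error patterns can be sketched as running a predictor over the training corpus and recording, at every mismatch, the key sequence around the error together with the correct text. The sketch assumes a fixed left-context length and does not include the mutual-information optimization mentioned on the slide; `collect_error_rules`, `to_keys`, and `context_len` are hypothetical names, and the predictor is passed in as a plain function.

```python
def collect_error_rules(corpus, predict, to_keys, context_len=4):
    """Map erroneous key sequences to their correct patterns.

    corpus:  correct text
    to_keys: function turning correct text into the raw key sequence
    predict: function turning a key sequence into predicted text (same length)
    """
    keys = to_keys(corpus)
    predicted = predict(keys)
    rules = {}
    for i, (guess, truth) in enumerate(zip(predicted, corpus)):
        if guess != truth:
            start = max(0, i - context_len)
            # Key sequence ending at the error -> correct characters for that span.
            rules[keys[start:i + 1]] = corpus[start:i + 1]
    return rules


# Stand-in predictor (assumption): always outputs the unshifted character.
rules = collect_error_rules("Ann met Bob", predict=lambda k: k, to_keys=str.lower)
```

Here lower/upper case again stands in for unshifted/shifted characters, so the collected rules are `{"a": "A", "met b": "met B"}`.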
Error-correction Rule Reduction
Pattern No.   Error Key Sequences   Correct Patterns
1             k[lkl9
2             mpklkl9
3             kkklkl9
4             lkl9
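Patterns 1 to 3 in the table end in the same keys as pattern 4, which suggests that a group of long error sequences can be reduced to a shared shorter one. The sketch below only computes that shared suffix; it is an illustration of the table, not the mutual-information criterion the deck uses to decide how far to shorten, and `longest_common_suffix` is a hypothetical helper.

```python
def longest_common_suffix(sequences):
    """Return the longest string that every sequence ends with."""
    if not sequences:
        return ""
    shortest = min(len(s) for s in sequences)
    n = 0
    # Extend the suffix while all sequences agree on the next character from the end.
    while n < shortest and len({s[-1 - n] for s in sequences}) == 1:
        n += 1
    first = sequences[0]
    return first[len(first) - n:]


# Error key sequences 1 to 3 from the table.
shared = longest_common_suffix(["k[lkl9", "mpklkl9", "kkklkl9"])
```

For the three sequences from the table, `shared` is `"lkl9"`, which is pattern 4.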
Pattern Shortening [Equations (4) and (5)]
The Error-correction Process Finite-automaton pattern matching is applied. The longest pattern is selected. [Figure: flowchart with nodes Key Input, Finite Automata, Terminal? (branches No and Yes), Correction]
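Longest-pattern matching with a finite automaton can be sketched with a trie: each character advances the state, terminal states carry a correction, and the last terminal state reached wins. The rule set, the `"$fix"` marker, and the function names below are assumptions for illustration; the rules are invented and are not taken from the deck.

```python
def build_trie(rules):
    """Build a trie (a simple automaton) from {error pattern: correction}."""
    root = {}
    for pattern, correction in rules.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["$fix"] = correction  # marks a terminal state; cannot clash with 1-char keys
    return root


def correct(text, trie):
    """Scan left to right, replacing the longest matching pattern at each position."""
    out = []
    i = 0
    while i < len(text):
        node, j, best = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$fix" in node:
                best = (j, node["$fix"])  # remember the longest terminal state so far
        if best:
            out.append(best[1])
            i = best[0]
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical rules in which one pattern is a prefix of another.
trie = build_trie({"ab": "AB", "abc": "ABC"})
```

With these rules, `correct("xabcx", trie)` applies the longer pattern and yields `"xABCx"`, while `correct("xabx", trie)` falls back to the shorter one and yields `"xABx"`.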
Experiments Language Identification and Key Prediction
Language Identification Result A bigram model is employed. An artificial corpus is used for testing. The first 6 characters are enough to identify the language. [Table: m (number of first characters) against Identification Accuracy (%)]
Key Prediction Corpus Information 25 MB training corpus. 5 MB test corpus. [Table: rows Unshift Alphabets and Shift Alphabets; columns Training Corpus (%) and Test Corpus (%)]
Key Prediction Result [Table: rows Tri-gram Prediction and Tri-gram Prediction + Error Correction; columns Prediction Accuracy from Training Corpus (%) and Prediction Accuracy from Test Corpus (%)]
Conclusions The Approach and Future Works
The Approach We have applied the tri-gram model and error-correction rules to intelligent Thai key prediction and English-Thai language identification. The experiments report 99 percent accuracy.
Future Works We expect this technique to be applicable to other Asian languages and multilingual systems. Our future work is to apply the algorithm to mobile phones, handheld devices and multilingual input systems.
The End Questions and Answers