Published by Pearl Carpenter. Modified over 9 years ago.
1 A preliminary study on unknown word problem in Chinese word segmentation
Authors: Ming-Yu Lin, Tung-Hui Chiang, Keh-Yih Su
Speaker: Jbc
2 Abstract
Unknown words are the main factor affecting the performance of word segmentation (WS). To handle unknown words, this paper proposes two approaches:
- Morphological rules: for regular unknown words.
- Statistical model: for irregular unknown words.
3 Outline
- Introduction
- System architecture
- Overview of the baseline model
- The morphological analysis
- Tagging part of speech
- Unknown word modeling
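5 Introduction-(1)
- Word: the basic unit for many Chinese language processing tasks.
- Chinese text has no explicit word boundaries, which makes segmentation difficult.
- Unknown words have a considerable impact on WS.
- Categories of unknown words:
  - Regular, e.g. times and dates (11:50, 11/12), reduplication.
  - Irregular, e.g. proper names, compound nouns.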
7 Introduction-(2)
Strategies for the different types of unknown word:
- Regular: identified with morphological rules.
- Irregular: identified with a statistical model.
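To make the "regular" case concrete, here is a minimal sketch that spots the two regular types named earlier (times such as 11:50 and dates such as 11/12) with regular expressions. The patterns `TIME` and `DATE` and the function `find_regular_tokens` are illustrative assumptions, not the paper's morphological rules.

```python
import re

# Hypothetical patterns for two "regular" unknown-word types mentioned on the
# slides. They are illustrative only and are NOT the paper's 17 rules.
TIME = re.compile(r"\d{1,2}:\d{2}")
DATE = re.compile(r"\d{1,2}/\d{1,2}")


def find_regular_tokens(text):
    """Return (start, end, label, surface) for each match, sorted by position."""
    hits = []
    for label, pattern in (("time", TIME), ("date", DATE)):
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), label, m.group()))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_regular_tokens("會議在11/12的11:50開始"):
        print(hit)
```

Spans found this way could be treated as single words before segmentation; reduplication would need its own pattern and is left out of this sketch.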
8 System Architecture-(1)
10 System Architecture-(2)
Lexicon: 89,590 entries, 49 tags.

# of characters / word    # of entries
1                                1,734
2                               35,492
3                               19,650
4                               24,054
5                                6,140
6                                2,020
>=7                                500
Total                           89,590
11 System Architecture-(3)
- Morphological rules: 17 rules (listed in Appendix A at the end).
- Corpus:
12 Morphological Rules
13 Statistics of Corpora
14 Overview of the Baseline Model-(1) The baseline model:
15 Overview of the Baseline Model-(2) Baseline vs. Max match:
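For readers unfamiliar with the comparison method, the following is a minimal sketch of greedy forward maximum matching, the usual meaning of "max match". It is not the baseline model itself, which the slides do not spell out. The function name, the `max_len` parameter, and the toy lexicons are assumptions for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Toy lexicons (assumed). When the full word is missing from the lexicon,
    # it is split into known pieces, which is how unknown words hurt WS.
    print(forward_max_match("轉換器", {"轉換", "轉換器", "器"}))
    print(forward_max_match("轉換器", {"轉換", "器"}))
```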
17 Overview of the Baseline Model-(3)
Two error patterns:
- s_ns (mis-combined error). Ex.: | 一 | 個 | 人 | becomes | 一 | 個人 |
- ns_s (over-segmentation error). Ex.: | 轉換器 | becomes | 轉換 | 器 |
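The two error patterns can be counted by comparing word boundaries in a reference segmentation against the system output. The sketch below assumes boundary-level counting; the paper may tally errors differently, and the function names are hypothetical.

```python
def boundaries(words):
    """Return the set of internal character offsets where a word boundary falls."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def count_error_patterns(reference, output):
    """Count s_ns (boundary lost) and ns_s (boundary added) between two segmentations."""
    if "".join(reference) != "".join(output):
        raise ValueError("segmentations must cover the same text")
    ref, out = boundaries(reference), boundaries(output)
    return {"s_ns": len(ref - out), "ns_s": len(out - ref)}


if __name__ == "__main__":
    # The two examples from the slide.
    print(count_error_patterns(["一", "個", "人"], ["一", "個人"]))
    print(count_error_patterns(["轉換器"], ["轉換", "器"]))
```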
18 Statistics of Error Patterns
19 The Morphological Analysis-(1)
- This paper proposes using morphological rules to find regular unknown words.
- Rule ordering: uses the SFS (sequential forward selection) procedure.
- Cost = w_r * (1 - P_r) + w_p * (1 - P_p)
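As a rough illustration of how SFS could order rules under this cost, the sketch below greedily appends whichever rule lowers the cost most. Several things here are assumptions: P_r and P_p are presumably recall and precision (the slide does not define them), `evaluate` is a hypothetical callback that runs the segmenter with a given rule list, stopping when no rule helps is one possible stopping criterion, and the toy numbers are made up purely to exercise the code.

```python
def cost(p_r, p_p, w_r=0.5, w_p=0.5):
    """Cost = w_r * (1 - P_r) + w_p * (1 - P_p), as given on the slide."""
    return w_r * (1 - p_r) + w_p * (1 - p_p)


def sfs_order(rules, evaluate, w_r=0.5, w_p=0.5):
    """Order rules by sequential forward selection.

    `evaluate(ordered_rules)` is a hypothetical callback returning (P_r, P_p)
    for the system run with that ordered rule list.
    """
    def cost_of(ordered):
        p_r, p_p = evaluate(ordered)
        return cost(p_r, p_p, w_r, w_p)

    selected, remaining = [], list(rules)
    best_cost = cost_of(selected)
    while remaining:
        # Score each remaining rule when appended to the current list.
        scored = [(cost_of(selected + [r]), r) for r in remaining]
        trial_cost, trial_rule = min(scored, key=lambda pair: pair[0])
        if trial_cost >= best_cost:
            break  # assumed stopping criterion: no remaining rule lowers the cost
        selected.append(trial_rule)
        remaining.remove(trial_rule)
        best_cost = trial_cost
    return selected, best_cost


def toy_evaluate(ordered):
    """Made-up evaluator: each rule shifts (P_r, P_p) by a fixed toy amount."""
    gains = {"date": (0.10, 0.05), "time": (0.05, 0.02), "noisy": (0.01, -0.20)}
    p_r = 0.60 + sum(gains[r][0] for r in ordered)
    p_p = 0.80 + sum(gains[r][1] for r in ordered)
    return p_r, p_p


if __name__ == "__main__":
    print(sfs_order(["noisy", "time", "date"], toy_evaluate))
```

In this toy run the rule that hurts P_p is never selected, which illustrates why a combined cost is used rather than P_r alone.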
20 The Morphological Analysis-(2)
21 The Morphological Analysis-(3) Baseline model + morphological rule:
22 The Morphological Analysis-(4) Improvement in s_ns and ns_s errors after applying the morphological rules:
23 Tagging part of speech-(1)
24 Tagging part of speech-(2)
25 Tagging part of speech-(3)
26 Tagging part of speech-(4)
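27 Unknown word modeling-(1)
5 unknown word categories:
- Words that should be added to the dictionary. Ex: 爭議
- Words that should be covered by morphological rules. Ex: 牛肝, 牛心.
- Abbreviations. Ex: 國大.
- Proper nouns. Ex: 胡適.
- Others (e.g. misprinted words such as 吩付, and words not found in the dictionary).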
28 Unknown word modeling-(2)
Use the unknown word model to find irregular unknown words:
- Determine whether an unknown word exists in the predicted region.
- If so, locate which segment is the unknown word.
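The two steps can be pictured as a detect-then-locate search. The skeleton below is only a structural sketch: `has_unknown` and `span_score` are hypothetical callbacks standing in for the statistical models, which the slides do not spell out, and the restriction to spans of 2 to `max_len` characters is an assumption.

```python
def detect_and_locate(region, has_unknown, span_score, max_len=4):
    """Two-stage search over a predicted region of characters.

    Returns (start, end) offsets of the best-scoring span, or None.
    """
    # Stage 1: decide whether the region contains an unknown word at all.
    if not has_unknown(region):
        return None
    # Stage 2: score every candidate span of 2..max_len characters, keep the best.
    best = None
    for start in range(len(region)):
        for end in range(start + 2, min(start + max_len, len(region)) + 1):
            score = span_score(region[start:end])
            if best is None or score > best[0]:
                best = (score, start, end)
    if best is None:
        return None
    return best[1], best[2]


if __name__ == "__main__":
    # Toy stand-ins for the models (assumed), using the proper-name example 胡適.
    found = detect_and_locate(
        "他叫胡適",
        has_unknown=lambda r: "胡適" in r,
        span_score=lambda s: 1.0 if s == "胡適" else 0.0,
    )
    print(found)
```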
29 Unknown word modeling-(3) Detecting whether an unknown word exists:
30 Unknown word modeling-(4) Locating which segment it is:
31 Result-(1)
32 Result-(2)