
1 Tsinghua University. Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation. Wei Qiao, Maosong Sun and Wolfgang Menzel. State Key Lab of Intelligent Tech. & Sys., Tsinghua University; Department of Informatics, Hamburg University.

2 Part Ⅰ: Background

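3 Introduction. Chinese word segmentation faces two kinds of ambiguity. Combination ambiguity: 火把 (torch) vs. 火 (fire) 把 (make). Overlapping ambiguity: a. 先解决其主要问题,再解决其次要问题 (first solve its main problems, then solve its secondary problems): 其 次要 (the subordinate). b. 首先要关注整体,其次要注意细节 (first attend to the whole, then pay attention to the details): 其次 要 (secondly we should).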

4 Related Terms. Overlapping ambiguity string (OAS): described by its length, order, intersection length, and structure. Maximal overlapping ambiguity string (MOAS). True / pseudo ambiguity MOAS: e.g. 其次要 (TM): 其次 要 & 其 次要; e.g. 部长篇小说 (PM): 部 (measure word) 长篇小说. [Figure: diagram with labels "order 1" and "order 2", character positions 0 to 3, and word spans 0-2, 1-3.]
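
The terms on this slide can be illustrated with a minimal sketch. It assumes a simplified reading of MOAS: collect every multi-character lexicon word in a sentence, link words whose spans cross (each starts or ends inside the other without containment), and report the span covered by each linked group. The function name `find_moas`, the toy lexicon, and the `max_word_len` parameter are illustrative assumptions, not the authors' implementation. "Order" is not defined in the transcript, so it is not computed here.

```python
def find_moas(sentence, lexicon, max_word_len=4):
    """Return (text, start, structure) for each group of crossing lexicon words.

    Simplified sketch: two word spans are linked when they cross,
    i.e. s1 < s2 < e1 < e2. Each linked group of two or more words
    is reported as one candidate MOAS.
    """
    n = len(sentence)
    # Step 1: spans [start, end) of all multi-character lexicon words.
    spans = []
    for start in range(n):
        for end in range(start + 2, min(n, start + max_word_len) + 1):
            if sentence[start:end] in lexicon:
                spans.append((start, end))

    # Step 2: union-find over spans that cross each other.
    parent = list(range(len(spans)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (s1, e1) in enumerate(spans):
        for j, (s2, e2) in enumerate(spans):
            if s1 < s2 < e1 < e2:
                parent[find(i)] = find(j)

    # Step 3: group spans by root and keep groups with at least two words.
    groups = {}
    for i, span in enumerate(spans):
        groups.setdefault(find(i), []).append(span)

    results = []
    for members in groups.values():
        if len(members) < 2:
            continue
        lo = min(s for s, _ in members)
        hi = max(e for _, e in members)
        # Structure: word spans as offsets relative to the start of the string.
        structure = ", ".join(f"{s - lo}-{e - lo}" for s, e in sorted(members))
        results.append((sentence[lo:hi], lo, structure))
    return sorted(results, key=lambda r: r[1])


# Toy lexicon (hypothetical); a real run would use a full word list.
toy_lexicon = {"首先", "关注", "整体", "其次", "次要", "注意", "细节"}
print(find_moas("首先要关注整体，其次要注意细节", toy_lexicon))
```

For 其次要 the sketch reports length 3 and the structure 0-2, 1-3, matching the spans shown on the slide. It finds candidates only; deciding between true and pseudo ambiguity still needs the judgment described later in the deck.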

5 Previous Work. [Sun et al., 1999]: 100-million-character corpus; a core set of MOASs is found. [Li et al., 2003]: 650-million-character corpus; a similar method is used to improve the performance of the segmenter.

6 Motivation. Two basic issues remain unsolved in their work: (1) only news data are included, so the results need further validation; (2) the core of pseudo OA strings still has to be determined, both for general-purpose and for domain-specific text.

7 Part Ⅱ: Statistical Properties of MOAS. From General Corpus; From Domain-specific Corpus.

8 From General Corpus: Data Set. CBC: 929,963,468 characters, rich in content (from the 1920s onward), covering categories such as novels, essays, and news. Chinese word list: Peking University, 74,191 entries. In total, 733,066 distinct MOAS types are found automatically in CBC.

9 From General Corpus: Detailed Distribution. Perspective 1: Length.

10 From General Corpus: Perspective 2: Order.

11 From General Corpus: Perspective 3: Intersection Length.

12 From General Corpus: Perspective 4: Structure Distribution.

13 From General Corpus: Top N Frequent MOAS (core candidates). Top 3,500: about 50.78%; top 7,000: about 60.43%; top 40,000: about 80.39%.
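
As a rough illustration of how such percentages could be computed, the sketch below assumes the figures are token coverage: the share of all MOAS occurrences accounted for by the N most frequent MOAS types. The function name `top_n_coverage` and the toy counts are hypothetical and are not taken from the CBC data.

```python
from collections import Counter


def top_n_coverage(freq, n):
    """Fraction of all MOAS tokens covered by the n most frequent types."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in freq.most_common(n))
    return covered / total


# Toy frequencies (made up for illustration only).
toy_freq = Counter({"其次要": 50, "出国门": 30, "部长篇小说": 15, "先天地": 5})
for n in (1, 2, 4):
    print(n, top_n_coverage(toy_freq, n))
```

Because frequencies are heavily skewed, a small set of frequent types can cover a large share of tokens, which is what makes a fixed core list attractive.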

14 From General Corpus: Stability vs. Corpus Size. Number of MOAS vs. corpus size; number of top N MOAS vs. corpus size (top 7,000).

15 From General Corpus: Pseudo MOAS Detection. Relax the definition of "pseudo", e.g. 出国门: 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) with small probability. 5,507 PM and 1,439 TM judged by hand. Token coverage of PM and TM over CBC.

16 From Domain-specific Corpora: Domain-Specific Corpora. Ency55: 90.02 million characters. Web55: 54.97 million characters. Common parts.

17 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=3,500).

18 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=7,000).

19 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=40,000).

20 From Domain-specific Corpora: PM and TM Distribution over Domain Corpora. 42% of overlapping ambiguities in any Chinese text can be 100% solved.

21 Part Ⅲ: Disambiguation

22 Disambiguation Method: Current Performance on OA. Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs, e.g. 公安局 长 是 主管 这一 事故 的: "The police chief (公安 局长) is the person in charge of this accident." Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs, e.g. 核电站的特殊性 质: "The special properties (特殊 性质) of the nuclear power station."

23 Disambiguation Method. Performance of CRF-based [Lafferty 2001] CWS on OAs, e.g. 这一 现状 先 天地 决定 了 他们 的 使命: "This situation congenitally (先天 地) makes them take on the mission." About 2% of OASs are mistakenly segmented; it is a net gain.

24 Disambiguation Method: Individual-based Method. Simple table lookup: record the PMs and their correct segmentations in a table. Advantages: satisfactory token coverage of MOASs; full correctness for the segmentation of pseudo MOASs; low cost in time and space.
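
The table lookup can be sketched as follows. This is a minimal illustration, assuming the table maps each pseudo MOAS string to its single correct segmentation and that lookup is a longest-match scan over the raw sentence. The names `PSEUDO_TABLE` and `presegment` are hypothetical; the two entries come from the examples on earlier slides. A fuller system would presumably first detect MOAS spans and only then consult the table.

```python
# Hypothetical table: pseudo MOAS string -> its fixed segmentation.
PSEUDO_TABLE = {
    "部长篇小说": ["部", "长篇小说"],
    "出国门": ["出", "国门"],
}


def presegment(sentence, table):
    """Split a sentence into ("fixed", words) chunks for table hits
    and ("raw", text) chunks for everything else."""
    max_len = max(map(len, table), default=0)
    chunks = []
    buffer = []  # characters not covered by any table entry
    i = 0
    while i < len(sentence):
        hit = None
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in table:
                hit = candidate
                break
        if hit is None:
            buffer.append(sentence[i])
            i += 1
        else:
            if buffer:
                chunks.append(("raw", "".join(buffer)))
                buffer = []
            chunks.append(("fixed", table[hit]))
            i += len(hit)
    if buffer:
        chunks.append(("raw", "".join(buffer)))
    return chunks


print(presegment("他写了一部长篇小说", PSEUDO_TABLE))
```

The "raw" chunks would still be passed to an ordinary segmenter; only the "fixed" chunks are resolved by the table. Each lookup is a dictionary access, which is consistent with the low time and space cost claimed on the slide.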

25 Conclusion. An extension of [Sun et al., 1999]: adjusts the existing results using large corpora and further verifies the properties on domain-specific corpora. A disambiguation strategy is proposed: over 42% of overlapping ambiguities can be resolved without any mistake, and it will be more effective when facing running text.

26 References
Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference of ICML, pages 282-289.
Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)
Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN'2003, pages 1-7.
Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.
Sun M.S., C.N. Huang, and B.K.Y. T'sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)
Sun M.S., Z.P. Zuo, and B.K.Y. T'sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)

27 Thank you. Any comments?

