1
Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification 黃居仁 Chu-Ren Huang Academia Sinica http://cwn.ling.sinica.edu.tw/huang/huang.htm April 11, 2007, Hong Kong Polytechnic University
2
Citation Please note that this is our ongoing work that will be presented later as Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. To appear in the proceedings of the 2007 ACL Annual Meeting.
3
Outline Introduction: modeling and theoretical challenges Previous Models –Segmentation as Tokenization –Character classification model A radical model Implementation and experiment Conclusion/Implications
4
Introduction: modeling and theoretical challenges Back to the basics: The goal of Chinese word segmentation is to identify wordbreaks –Such that these segmented units can be used as processing units (i.e. words) Crucially –Words are not identified before segmentation –Wordbreaks in Chinese fall at character-breaks only, and at no other places
5
Challenge I Segmentation is the prerequisite task for all Chinese processing applications, hence a realistic solution to segmentation must be Robust: perform consistently regardless of language variations Scalable: be applicable to all variants of Chinese and require minimal training Portable: be applicable for real-time processing of all kinds of texts, all the time
6
Challenge II Chinese speakers perform segmentation subconsciously without mistakes, hence if we simulate human segmentation, it must: Be robust, scalable, portable Not assume prior lexical knowledge Be equally sensitive to known and unknown words
7
So Far Not so good All existing algorithms perform reasonably well but require –Large sets of training data –Long training time –A comprehensive lexicon –And the training process must be repeated with every new variant (topic/style/genre) But Why?
8
Previous Models I Segmentation as Tokenization The Classical Model (Chen and Liu 1992 etc.) Segmentation is interpreted as identification of tokens (e.g. words) in a text, hence contains two steps –Dictionary Lookup –Unknown Word (or OOV) Resolution
9
Segmentation as Tokenization 2 Find all sequences Ci, …, Ci+m such that [Ci, …, Ci+m] is a token, i.e. –it is an entry in the lexicon, or –it is not a lexical entry but is predicted to be one by an unknown word resolution algorithm Ambiguity Resolution: needed when there is a Cj such that both [x, Cj, y] and [y, Cj, z] are entries in the lexicon
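The dictionary-lookup step and the overlapping ambiguity it produces can be illustrated with a minimal sketch. This is not the classical model's actual algorithm: the toy lexicon, the `max_len` cap, and both function names are assumptions made for illustration, and unknown word resolution is left out entirely.

```python
def lexicon_matches(text, lexicon, max_len=4):
    """Return every (start, end) span of text that is an entry in the lexicon."""
    spans = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                spans.append((i, j))
    return spans


def overlapping_ambiguities(spans):
    """Return pairs of spans that share characters without one containing the other."""
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


# Toy lexicon, hypothetical and far smaller than any real one.
lexicon = {"研究", "研究生", "生命", "命", "起源"}
text = "研究生命起源"

spans = lexicon_matches(text, lexicon)
conflicts = overlapping_ambiguities(spans)
print(spans)
print(conflicts)
```

Here 研究生 and 生命 both match and compete for the character 生, which is the kind of conflict the ambiguity resolution step has to settle.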
10
Segmentation as Tokenization 3 High Complexity: –mapping tens of thousands of lexical entries to even more possible matching strings –Overlapping ambiguity estimated to be up to 20% depending on texts and lexica Not Robust –Dependent on lexicon (and lexica are notoriously easy to change and expensive to build) –OOV?
11
Previous Models II: Character Classification Currently Popular Model (Xue 2003, Gao et al. 2004) Segmentation is re-interpreted as classification of character positions. –Classify and tag each character according to its position in a word (initial, final, middle etc.) –Learn the distribution of such classification from a corpus –Predict segmentation based on positional classification of a character in a string
12
Character Classification 2 Character Classification: –Each character Ci is associated with a 3-tuple Ci: <Ini_i, Mid_i, Fin_i>, where Ini_i, Mid_i, Fin_i are the probabilities for Ci to be in Initial, Middle, or Final positions respectively. Ambiguity Resolution: –Multiple classification of a character: A character does not occur exclusively as initial or final etc. –Conflicting classifications of neighboring characters.
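As a rough illustration of how such a 3-tuple could be estimated, the sketch below counts character positions in a segmented corpus by relative frequency. The toy corpus and the function name are hypothetical, and counting the character of a single-character word as Initial is an assumption; published character classification systems typically use richer tag sets and trained classifiers.

```python
from collections import Counter, defaultdict


def positional_profile(segmented_sentences):
    """Map each character to (P(initial), P(middle), P(final)) by relative frequency."""
    counts = defaultdict(Counter)
    for words in segmented_sentences:
        for word in words:
            for k, ch in enumerate(word):
                if k == 0:
                    # Assumption: a single-character word is counted as initial.
                    counts[ch]["ini"] += 1
                elif k == len(word) - 1:
                    counts[ch]["fin"] += 1
                else:
                    counts[ch]["mid"] += 1
    profile = {}
    for ch, c in counts.items():
        total = c["ini"] + c["mid"] + c["fin"]
        profile[ch] = (c["ini"] / total, c["mid"] / total, c["fin"] / total)
    return profile


# Toy segmented corpus, hypothetical.
corpus = [["研究", "生命"], ["研究生", "的", "生活"]]
profile = positional_profile(corpus)
print(profile["生"])
print(profile["究"])
```

The character 生 appears both word-initially and word-finally in this toy corpus, which is the multiple-classification ambiguity the slide mentions.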
13
Character Classification 3 Less Complexity: –6,000 characters x 3 to 10 positional classes Higher Performance: 97% f-score on SigHAN bakeoff (Huang and Zhao 2006)
14
Character Classification 4 Inherent Modeling Problems Segmentation becomes a second order decision dependent on first order decision on character classification –Unnecessary complexity involved –Inherent ceiling set (segmentation cannot outperform character classification) Still highly dependent on lexicon –Character positions must be defined with prior lexical knowledge of a word
15
Our New Proposal Naïve but Radical Segmentation is nothing but segmentation –Possible segmentation sites are well-defined without ambiguity. They are simply the character-breaks clearly marked in any text. –The task is simply to identify all CB which also function as Wordbreak (WB) –Based on distributional information extracted from the contexts surrounding CB’s (i.e. characters)
16
Simple Formalization Any Chinese text is envisioned as a sequence of character-breaks (CB’s), evenly distributed among a sequence of characters (c’s): CB_0 c_1 CB_1 c_2 ... CB_{i-1} c_i CB_i ... CB_{n-1} c_n CB_n NB: A psycholinguistic experiment with an eye-tracking machine shows that the eyes can fixate on the edges of a character when reading Chinese. (J.L. Tsai, p.c.)
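Under this formalization, a segmented sentence of n characters yields n+1 character-breaks, each either a wordbreak or not. A minimal sketch of that conversion follows; the function name and the example sentence are illustrative assumptions.

```python
def wordbreak_labels(words):
    """Return (text, labels) where labels[i] is True when CB_i is a wordbreak.

    For a text of n characters there are n + 1 breaks, CB_0 .. CB_n.
    CB_0 and CB_n (the text edges) are always wordbreaks.
    """
    text = "".join(words)
    labels = [False] * (len(text) + 1)
    labels[0] = True
    pos = 0
    for word in words:
        pos += len(word)
        labels[pos] = True
    return text, labels


text, labels = wordbreak_labels(["研究", "生命", "起源"])
print(text)
print(labels)
```

Segmentation then amounts to predicting this boolean list from the unsegmented text alone.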
17
How to Model Distributional Information of blanks? There is no overt difference between CB’s and WB’s, unlike English, where the CB spaces are small but the WB spaces are BIG. –Hence distributional information must come from the context. CB_0 c_1 CB_1 c_2 ... CB_{i-1} c_i CB_i ... CB_{n-1} c_n CB_n –Overtly, CB’s carry no distributional info. –However, c’s do carry information about the status of a CB/WB in their neighborhood (based on a tagged corpus, or human experience)
18
Range of Relevant Context CB_{i-2} CB_{i-1} c_i CB_{i+1} CB_{i+2} Recall that CB’s carry no overt information, while c’s do. Linguistically, it is attested that initial, final, second, and penultimate positions are morphologically significant. –In other words, a linguistic element can carry explicit information about the immediately adjacent CB’s as well as the CB’s immediately adjacent to those two 2CB-Model: Taking the two immediately adjacent ones 4CB-Model: Taking two more
19
Collecting Distributional Information CB_{i-2} CB_{i-1} c_i CB_{i+1} CB_{i+2} Adopt either 2CBM or 4CBM Collect a 2-tuple or 4-tuple for each character from a segmented corpus Sum up the n-tuple values for all tokens belonging to the same character type to form a distributional vector

Character  V1      V2      V3      V4
的         0.0127  0.9866  0.9917  0.0081
一         0.1008  0.8744  0.6500  0.2819
是         0.1902  0.8051  0.9708  0.0286
不         0.0683  0.9055  0.4653  0.4657
有         0.2491  0.7397  0.8408  0.1253

Table 2. Character table for 4CBM
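One way the 4-tuples might be collected is sketched below. The slides do not spell out how V1 to V4 are defined, so this assumes they correspond to the breaks two to the left, immediately left, immediately right, and two to the right of the character, averaged over all tokens of that character. The `look_beyond_wb` flag, the treatment of breaks outside the text as non-wordbreaks, and the toy corpus are likewise assumptions.

```python
from collections import defaultdict


def collect_vectors(segmented_sentences, look_beyond_wb=True):
    """Average, per character type, the wordbreak status of its four nearest breaks."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for words in segmented_sentences:
        text = "".join(words)
        wb = [False] * (len(text) + 1)
        wb[0] = True
        pos = 0
        for word in words:
            pos += len(word)
            wb[pos] = True
        for p, ch in enumerate(text):
            # Breaks around character p: two left, left, right, two right.
            # Breaks outside the text are treated as non-wordbreaks (assumption).
            vals = [0 <= k <= len(text) and wb[k] for k in (p - 1, p, p + 1, p + 2)]
            if not look_beyond_wb:
                # Ignore an outer break when the inner break is already a wordbreak.
                if vals[1]:
                    vals[0] = False
                if vals[2]:
                    vals[3] = False
            for k in range(4):
                sums[ch][k] += vals[k]
            counts[ch] += 1
    return {ch: tuple(s / counts[ch] for s in sums[ch]) for ch in sums}


# Toy segmented corpus, hypothetical.
corpus = [["我", "的", "書"], ["美麗", "的"]]
vectors = collect_vectors(corpus)
vectors_no_look = collect_vectors(corpus, look_beyond_wb=False)
print(vectors["的"])
print(vectors_no_look["的"])
```

The flag corresponds to the question of whether to look beyond an existing wordbreak, which the deck raises as a theoretical issue for the 4CBM.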
20
Estimating Distributional Features of CB’s c_{-2} c_{-1} CB c_{+1} c_{+2} For each CB, distributional information is contributed by 2 or 4 adjacent characters Each character carries the four-element vector given above; align the vector positions and then sum up Note that no knowledge from a lexicon is involved (while the character classification model makes an explicit decision about the position of that character in a word)
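The alignment step can be sketched as follows, reusing the values from Table 2. It assumes V1 to V4 refer to the breaks two left, left, right, and two right of a character, so that for a given CB the character c_{-2} contributes its V4, c_{-1} its V3, c_{+1} its V2, and c_{+2} its V1. That reading, the zero contribution for missing or unseen characters, and the function name are assumptions.

```python
# Character vectors (V1, V2, V3, V4) copied from Table 2.
CHAR_VECTORS = {
    "的": (0.0127, 0.9866, 0.9917, 0.0081),
    "一": (0.1008, 0.8744, 0.6500, 0.2819),
    "是": (0.1902, 0.8051, 0.9708, 0.0286),
    "不": (0.0683, 0.9055, 0.4653, 0.4657),
    "有": (0.2491, 0.7397, 0.8408, 0.1253),
}


def cb_vector(text, b, char_vectors):
    """Aligned contributions of c_-2, c_-1, c_+1, c_+2 to the break before text[b]."""
    # (character offset relative to b, index into that character's vector)
    alignment = [(-2, 3), (-1, 2), (0, 1), (1, 0)]
    vec = []
    for offset, k in alignment:
        p = b + offset
        if 0 <= p < len(text) and text[p] in char_vectors:
            vec.append(char_vectors[text[p]][k])
        else:
            # Missing or unseen characters contribute nothing (assumption).
            vec.append(0.0)
    return vec


vec = cb_vector("是不是", 1, CHAR_VECTORS)
score = sum(vec)
print(vec)
print(score)
```

Only character-level statistics enter the computation; no lexicon entry is consulted at any point.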
21
Aligning Vector Positions [Figure: the vectors of c_{-2}, c_{-1}, c_{+1} and c_{+2} aligned around the CB in the sequence c_{-2} c_{-1} CB c_{+1} c_{+2}]
22
Theoretical Issues in Modeling Do we look beyond WB’s (in 4CBM)? –No, characters cannot contribute to boundary conditions beyond an existing boundary. –Yes, we cannot assume lexical knowledge a priori (and the model is more elegant) One or Two features (in 4CBM)? –No, positive information (that there is a WB) and negative information (that there is no WB) should be complementary –Yes (especially when the answer to the above question is no), since there are under-specified cases
23
Size of Distributional Info The Sinica Corpus 5.0 contains 6820 types of c’s (characters, numbers, punctuation, Latin alphabet etc.) The 10 million word corpus is converted into 14.4 million labeled CB vectors. In this first study we implement a CB only model, without any preprocessing of punctuation marks.
24
How to Model Decision I Assuming that each character represents an independent event, hence all relevant vectors can be summed up and evaluated –Simple heuristic by sum and threshold –Decision Tree trained on segmented corpus –Machine-learning trained on segmented corpus?
25
Simple Sum and Threshold Heuristic The means of the sums of CB vectors were computed for each class, S and -S (mean for S = 2.90445651112, -S = 1.89855870063) A one-standard-deviation difference between each CB vector and the threshold values was used as the segmentation heuristic Result: 88% accuracy Error analysis: CB vectors are not linearly separable
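The slide does not give the exact decision rule, so the sketch below substitutes a simpler nearest-mean rule: a CB is called a wordbreak when its summed score is closer to the mean for S than to the mean for -S. The two means are the ones reported above; the function names and the toy labelled data are hypothetical, and the standard-deviation margin used in the actual heuristic is not reproduced.

```python
from statistics import mean

# Class means reported on the slide.
MEAN_S = 2.90445651112      # CBs that are wordbreaks
MEAN_NOT_S = 1.89855870063  # CBs that are not wordbreaks


def fit_means(labelled_scores):
    """Compute the two class means from (score, is_wordbreak) pairs."""
    wb = [s for s, is_wb in labelled_scores if is_wb]
    non_wb = [s for s, is_wb in labelled_scores if not is_wb]
    return mean(wb), mean(non_wb)


def is_wordbreak(score, mean_wb=MEAN_S, mean_non_wb=MEAN_NOT_S):
    """Nearest-mean decision: a simplification, not the slide's exact rule."""
    return abs(score - mean_wb) < abs(score - mean_non_wb)


# Toy labelled scores, hypothetical.
toy = [(3.0, True), (2.8, True), (1.9, False), (2.1, False)]
toy_mean_wb, toy_mean_non_wb = fit_means(toy)
print(toy_mean_wb, toy_mean_non_wb)
print(is_wordbreak(2.7), is_wordbreak(2.0))
```

A single cut-off of this kind is a linear decision on the summed score, so it cannot do well on data that are not linearly separable, which is consistent with the error analysis on the slide.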
26
Decision Tree A decision tree classifier (YaDT, Ruggieri 2004) is trained on a sample of 900,000 CB vectors, with 100,000 boundary vectors used for the testing phase. It achieves up to 97% accuracy in the inside test, including numbers, punctuation and foreign words.
27
Evaluation: SigHAN Bakeoff Note that our method is NOT designed for the SigHAN bakeoff, where resources are devoted to fine-tuning for a small extra edge in scoring This radical model aims to be robust in a real-world situation, where it can perform reliably without extra tuning when it encounters different texts No manual pre-processing, texts input as seen
28
Evaluation Closed test, but without any lexical knowledge
29
Discussion The method is basically sound We still need to develop an effective algorithm for adaptation to new variants Automatic pre-processing on punctuation marks and foreign symbols should improve the performance What role should lexical knowledge play? The character as independent event assumption may be incorrect
30
How to Model Decision II Assume that the characters in a string are not independent events, hence certain combinations (as well as single characters) can contribute to the WB decision. One possible implementation: c’s as committee members, decision by vote –Five voting blocks by simple majority: c_{-2}c_{-1}, c_{-1}, c_{+1}c_{-1}, c_{+1}, c_{+1}c_{+2} Context: c_{-2} c_{-1} CB c_{+1} c_{+2}
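A committee vote of this kind could look roughly like the sketch below. The slides only name the five blocks, so the rest is assumed: each block votes for a wordbreak when it co-occurred with a wordbreak in more than half of its training occurrences, blocks never seen in training abstain, and ties or a lack of evidence default to no wordbreak. The block spanning the break is taken here as the two characters on either side of it, and the toy corpus is hypothetical.

```python
from collections import defaultdict


def blocks(text, b):
    """Five context blocks around break b; None where the text is too short."""
    def span(lo, hi):
        return text[lo:hi] if 0 <= lo and hi <= len(text) else None
    # Left bigram, left character, bigram across the break, right character, right bigram.
    return [span(b - 2, b), span(b - 1, b), span(b - 1, b + 1),
            span(b, b + 1), span(b, b + 2)]


def train(segmented_sentences):
    """Count, per (block id, string), how often it co-occurs with a wordbreak."""
    stats = defaultdict(lambda: [0, 0])  # [wordbreak count, total count]
    for words in segmented_sentences:
        text = "".join(words)
        wb, pos = set(), 0
        for word in words:
            pos += len(word)
            wb.add(pos)
        for b in range(1, len(text)):  # text-internal breaks only
            for k, s in enumerate(blocks(text, b)):
                if s is not None:
                    stats[(k, s)][0] += b in wb
                    stats[(k, s)][1] += 1
    return stats


def decide(text, b, stats):
    """Simple majority among the blocks that have training evidence."""
    yes = no = 0
    for k, s in enumerate(blocks(text, b)):
        if s is not None and (k, s) in stats:
            wb_count, total = stats[(k, s)]
            if 2 * wb_count > total:
                yes += 1
            else:
                no += 1
    return yes > no


# Toy segmented corpus, hypothetical.
corpus = [["研究", "生命"], ["研究", "起源"], ["生命", "起源"]]
stats = train(corpus)
decisions = [decide("生命研究", b, stats) for b in (1, 2, 3)]
print(decisions)
```

In this toy run the unseen string 生命研究 gets a wordbreak only at the middle break, even though the sequence as a whole never occurs in the training data.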
31
Conclusion I We propose a radical but elegant model for Chinese Word Segmentation Where the task is reduced to binary classification of CB’s into WB’s and non-WB’s The model does not presuppose any lexical knowledge and relies only on distributional information of characters as the context for CB’s
32
Conclusion II In principle, this model should be robust and scalable for all different variants of texts Preliminary experimental results are promising yet leave room for improvement Work is still ongoing You are welcome to adopt this model and experiment with your favorite algorithm!