Semantic classification of Chinese unknown words Huihsin Tseng Linguistics University of Colorado at Boulder ACL 2003 Student Research Workshop.


1 Semantic classification of Chinese unknown words Huihsin Tseng Linguistics University of Colorado at Boulder ACL 2003 Student Research Workshop

2 2007/10/29 Huang, Ting-Hao 2 / 30 Introduction
The biggest problem: incompleteness of dictionaries
• In the Sinica Corpus, articles contain on average 3.51% words that are not listed in the Chinese Electronic Dictionary (1998).
• Unknown words make NLP tasks difficult, e.g. segmentation and word sense disambiguation.

3 Introduction (cont.)
• Caraballo (1999): uses contextual information to assign nouns to their hypernyms.
• Roark and Charniak (1998): use the co-occurrence of words as features to classify nouns.
→ Context is clearly an important feature.

4 Introduction (cont.)
This paper focuses on non-contextual features. Following Ciaramita (2002), the feature used is morphological similarity to words whose semantic category is known.

5 Introduction (cont.)
Two ways to generate new Chinese words:
1. Compounding: a compound is a word made up of other words (e.g. 光幻覺).
2. Affixation: a word is formed by affixation when a stem is combined with a prefix or suffix morpheme (e.g. 科學家, "scientist").

6 Introduction (cont.)

7 The CiLin thesaurus
《同義詞詞林》 CiLin (Mei et al. 1986)
A – human             G – mental action
B – object            H – activity
C – time and space    I – state
D – abstract          J – association
E – attribute         K – auxiliary
F – action            L – respect

8 The CiLin thesaurus (cont.)

9 Corpus analysis of Chinese unknown words
• Unknown words are Sinica Corpus words that are listed neither in the Chinese Electronic Dictionary (80,000 entries) nor in CiLin.
• Most other research on Chinese unknown words focuses on identifying proper nouns, but the majority of unknown words in Chinese are lexical words.

10 Corpus analysis of Chinese unknown words (cont.)

11 Corpus analysis of Chinese unknown words (cont.)
• Compounds: Chinese compounds are made up of words linked by morpho-syntactic relations such as modifier-head, verb-object, and so on.
• Affixation: a Chinese affix is a much weaker cue to the semantic category of a word than English -ist or -ian, because it is more ambiguous (e.g. 家: expert / family and home / house).

12 Semantic classification
• Baseline: assign to each word the semantic category of its morphological head.
• Example-based semantic classification: adopt a more sophisticated nearest-neighbor approach, in which the distance between an unknown word and examples from the CiLin thesaurus is computed from the word's morphological structure.
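
The baseline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the toy lexicon `TOY_CILIN`, its category labels, and the choice of the last morpheme as the head (as in modifier-head words) are all hypothetical.

```python
# Toy lexicon mapping a morpheme to one CiLin top-level category (A-L).
# The entries are illustrative assumptions, not real CiLin data.
TOY_CILIN = {"家": "A", "場": "C", "房": "B"}


def baseline_category(morphemes, lexicon=TOY_CILIN):
    """Return the category of the morphological head, or None if unknown.

    Assumption: the head is the last morpheme of the word.
    """
    head = morphemes[-1]
    return lexicon.get(head)


print(baseline_category(["舞蹈", "家"]))  # A
```

The sketch ignores ambiguity: a real head such as 家 may belong to several categories, which is why the baseline is a weak cue.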

13 An example-based semantic classification – Step 1
Morphological analyzer (Tseng and Chen 2002)
1. Segments the word into a sequence of morphemes.
2. Tags the syntactic category of each morpheme.
3. Predicts the morpho-syntactic relationship between morphemes, such as modifier-head, verb-object, or resultative verb.
• Ex. 舞蹈家 ("dancer") → 舞蹈 + 家 → modifier-head

14 An example-based semantic classification – Step 1 (cont.)

15 An example-based semantic classification – Step 2
Finding similar entries (examples)
• The CiLin thesaurus is searched for words that share at least one morpheme with the unknown word, in the same position.
• Ex. Unknown word: 舞蹈家 → List: 歌唱家, 回家, 富貴家
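
A minimal Python sketch of this lookup, assuming words are already segmented into morphemes and that "same position" means a shared first or a shared last morpheme. The thesaurus contents and the function names are hypothetical.

```python
# Toy thesaurus of pre-segmented entries (illustrative, not real CiLin data).
TOY_THESAURUS = [
    ["歌唱", "家"],
    ["回", "家"],
    ["富貴", "家"],
    ["家", "庭"],  # shares 家, but in a different position
    ["輪", "船"],  # shares no morpheme
]


def shares_morpheme_in_position(unknown, entry):
    """True if both words start or both end with the same morpheme."""
    return unknown[0] == entry[0] or unknown[-1] == entry[-1]


def find_examples(unknown, thesaurus=TOY_THESAURUS):
    """Return the entries sharing a morpheme with `unknown` in the same position."""
    return [e for e in thesaurus if shares_morpheme_in_position(unknown, e)]


examples = find_examples(["舞蹈", "家"])
print(["".join(e) for e in examples])  # ['歌唱家', '回家', '富貴家']
```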

16 An example-based semantic classification – Step 3
Morpho-syntactic relationship filter
• Discard the examples from step 2 whose morpho-syntactic relationship differs from that of the unknown word.
• If no examples remain, the system falls back to the baseline classification method.
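
The filter and its fallback can be sketched as follows. The relation tags on the toy examples and the two callables `example_based` and `baseline` are hypothetical stand-ins for the later steps and for the baseline classifier.

```python
def filter_examples(unknown_relation, examples):
    """Keep only examples with the same morpho-syntactic relation."""
    return [ex for ex in examples if ex["relation"] == unknown_relation]


def classify(unknown_relation, examples, example_based, baseline):
    """Use the example-based classifier if any example survives the filter,
    otherwise fall back to the baseline."""
    kept = filter_examples(unknown_relation, examples)
    if not kept:
        return baseline()
    return example_based(kept)


# Relation tags below are assumptions made for illustration only.
TOY_EXAMPLES = [
    {"word": "歌唱家", "relation": "modifier-head"},
    {"word": "回家", "relation": "verb-object"},
    {"word": "富貴家", "relation": "modifier-head"},
]

kept = filter_examples("modifier-head", TOY_EXAMPLES)
print([ex["word"] for ex in kept])  # ['歌唱家', '富貴家']
```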

17 An example-based semantic classification – Step 4
Compute the distance
• The distance is computed between the unknown word and each example selected in step 3.
• Reference: Chen, C. J., M. H. Bai and K. J. Chen (1997), Category Guessing for Chinese Unknown Words.
• The similarity of two words is the information content (IC) of their least common ancestor.

18 An example-based semantic classification – Step 4 (cont.)
Compute the distance
• Information content: IC(category) = Entropy(System) − Entropy(category)
• Similarity (assuming all leaves are equally probable): (formula not preserved in the transcript)
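
A minimal Python sketch of the IC definition above, assuming equally probable leaves, so that the entropy of a node with n leaves is log2(n) and IC = log2(N) − log2(n) for a taxonomy with N leaves. The toy hierarchical codes are hypothetical, and the sketch returns raw, unnormalized IC values, so they are not comparable to the similarity figures on the slides.

```python
import math

# Toy taxonomy: each word (leaf) has a hierarchical code; a shared code
# prefix is a shared ancestor. The codes are illustrative assumptions.
TOY_CODES = {
    "舞蹈": ("H", "h", "01"),
    "歌唱": ("H", "h", "02"),
    "回": ("H", "j", "01"),
    "富貴": ("E", "d", "01"),
}


def common_prefix(a, b):
    """Code of the least common ancestor of two leaves."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def leaves_under(prefix, codes=TOY_CODES):
    """Number of leaves dominated by the node with this code prefix."""
    return sum(1 for code in codes.values() if code[:len(prefix)] == prefix)


def information_content(prefix, codes=TOY_CODES):
    """IC = Entropy(System) - Entropy(node), with uniform leaf probabilities."""
    return math.log2(len(codes)) - math.log2(leaves_under(prefix, codes))


def lca_similarity(w1, w2, codes=TOY_CODES):
    """Similarity of two words as the IC of their least common ancestor."""
    return information_content(common_prefix(codes[w1], codes[w2]), codes)


print(lca_similarity("舞蹈", "歌唱"))  # 1.0
print(lca_similarity("舞蹈", "富貴"))  # 0.0
```

Words whose only common ancestor is the root get similarity 0, because the root carries no information.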

19 An example-based semantic classification – Step 4 (cont.)

20 An example-based semantic classification – Step 4 (cont.)

21 An example-based semantic classification – Step 4 (cont.)
Run recursively
• 跑碼頭 (unknown) / 跑旱船 (known) → compare 碼頭 (known) with 旱船 (unknown) → first guess the category of 旱船 from examples such as 輪船 / 帆船 …
→ No word is left without a similarity measurement.

22 An example-based semantic classification – Step 5
Assign the category
• Compare 舞蹈家 with 歌唱家 / 回家 / 富貴家:
  Sim(舞蹈, 歌唱) = 0.87
  Sim(舞蹈, 回) = 0.26
  Sim(舞蹈, 富貴) = 0
→ 舞蹈家 is most likely to belong to the same category as 歌唱家.

23 An example-based semantic classification – Step 5 (cont.)
Assign the category
• Compute the average distance to the K nearest neighbors.
• The category with the lowest distance is assigned to the unknown word.
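
One possible reading of this rule, sketched in Python: for each candidate category, average the distances of its (at most) K nearest examples, then pick the category with the lowest average. The grouping by category and the function name are assumptions; the slides do not spell out the exact procedure.

```python
from collections import defaultdict


def assign_category(neighbors, k=5):
    """Pick the category whose k nearest examples are closest on average.

    `neighbors` is a list of (category, distance) pairs, one per example.
    """
    by_category = defaultdict(list)
    for category, distance in neighbors:
        by_category[category].append(distance)

    def average_of_k_nearest(distances):
        nearest = sorted(distances)[:k]
        return sum(nearest) / len(nearest)

    return min(by_category, key=lambda c: average_of_k_nearest(by_category[c]))


# Hypothetical (category, distance) pairs for illustration.
print(assign_category([("A", 0.1), ("A", 0.9), ("B", 0.4)], k=5))  # B
print(assign_category([("A", 0.1), ("A", 0.9), ("B", 0.4)], k=1))  # A
```

The two calls show how K changes the outcome: with K = 1 only the single closest example per category counts, while a larger K penalizes categories whose other examples are far away.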

24 An example-based semantic classification – Step 5 (cont.)
Parameters: K = 5, α = 0.5

25 Experiment
56,830 words in CiLin
• Training set: 80%
• Development set: 10%
• Test set: 10% (assumed unknown)
Proper nouns are filtered out.
In evaluation, any one of the categories of an ambiguous word is considered correct.
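
The setup can be sketched in Python as an 80/10/10 split plus a lenient accuracy that accepts any gold category of an ambiguous word. The seeded random shuffle and both function names are assumptions; the slides do not say how the split was drawn.

```python
import random


def split_80_10_10(words, seed=0):
    """Shuffle and split into training, development and test sets."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def lenient_accuracy(predictions, gold):
    """Fraction of words whose predicted category is among the gold categories.

    `predictions` maps word -> category; `gold` maps word -> set of categories.
    """
    correct = sum(1 for word, cat in predictions.items() if cat in gold[word])
    return correct / len(predictions)


train, dev, test = split_80_10_10(range(56830))
print(len(train), len(dev), len(test))  # 45464 5683 5683
```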

26 Experiment (cont.)

27 Experiment (cont.)

28 Experiment (cont.)
Error analysis
• Data errors: idioms, metaphors, and slang (e.g. 母老虎, 看門狗)
• Classifier errors:
  ◦ Lack of examples (e.g. 鐵欄杆)
  ◦ The similarity measurement is not precise enough (e.g. 運動場: C. time and space 商場 / 屠宰場 / 會場; D. abstract 球場)
  ◦ The CiLin taxonomy is ambiguous (e.g. 體操房: B. object 刑房 / 書房 / 暗房 / 廚房; D. abstract 牢房 / 彈子房)

29 Conclusion
Main contributions
• First attempt at adding semantic knowledge to Chinese unknown words
• Uses no contextual information
Future work
• Use contextual information

30 Thank you!

