Semantic classification of Chinese unknown words
Huihsin Tseng, Linguistics, University of Colorado at Boulder
ACL 2003 Student Research Workshop

2007/10/29 Huang, Ting-Hao

Introduction
- The biggest problem: incompleteness of dictionaries
- In the Sinica Corpus, on average 3.51% of the words in an article are not listed in the Chinese Electronic Dictionary (1998)
- Unknown words make NLP tasks difficult (Ex. segmentation, word sense disambiguation)

Introduction (cont.)
- Caraballo's (1999) system uses contextual information to assign nouns to their hyponyms.
- Roark and Charniak (1998) use the co-occurrence of words as features to classify nouns.
→ Context is clearly an important feature.

Introduction (cont.)
- This paper focuses on non-contextual features.
- Following Ciaramita (2002), the feature used is morphological similarity to words whose semantic category is known.

Introduction (cont.)
Two ways to generate new Chinese words:
1. Compounding: a compound is a word made up of other words. (Ex. 光幻覺)
2. Affixation: a word is formed by affixation when a stem is combined with a prefix or a suffix morpheme. (Ex. 科學家)

The CiLin thesaurus 《同義詞詞林》
CiLin (Mei et al. 1986)
A – human             G – mental action
B – object            H – activity
C – time and space    I – state
D – abstract          J – association
E – attribute         K – auxiliary
F – action            L – respect

Corpus analysis of Chinese unknown words
- Unknown words are the Sinica Corpus words that are listed neither in the Chinese Electronic Dictionary (80,000 entries) nor in CiLin.
- Most other research on Chinese unknown words focuses on identifying proper nouns, but the majority of unknown words in Chinese are lexical words.

Corpus analysis of Chinese unknown words (cont.)
- Compounds: Chinese compounds are made up of words that are linked together by morpho-syntactic relations such as modifier-head, verb-object, and so on.
- Affixation: a Chinese affix is a much weaker cue to the semantic category of a word than English -ist or -ian, because it is more ambiguous. (Ex. 家: expert / family and home / house)

Semantic classification
- Baseline: assign to each word the semantic category of its morphological head.
- Example-based semantic classification: adopt a more sophisticated nearest-neighbor approach in which the distance between an unknown word and examples from the CiLin thesaurus is computed based on the word's morphological structure.
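
The baseline can be sketched roughly as follows. This is only an illustration: it assumes the head is the last morpheme (which fits modifier-head words but not necessarily other relations), and the names `TOY_LEXICON` and `baseline_category` as well as the lexicon entries are hypothetical, not taken from CiLin.

```python
# Toy morpheme-to-category lexicon; the entries are illustrative, not CiLin data.
TOY_LEXICON = {"家": "A", "場": "C", "房": "B"}


def baseline_category(morphemes, lexicon, head_index=-1):
    """Return the category of the morphological head, or None if it is not listed.

    Assumption: the head is the last morpheme unless head_index says otherwise.
    """
    if not morphemes:
        return None
    return lexicon.get(morphemes[head_index])


print(baseline_category(["舞蹈", "家"], TOY_LEXICON))  # prints: A
```

A word whose head is missing from the lexicon simply gets no category here; how the original system handles that case is not stated on the slide.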

Example-based semantic classification – Step 1
Morphological analyzer (Tseng and Chen 2002)
1. Segments a word into a sequence of morphemes
2. Tags the syntactic categories of the morphemes
3. Predicts the morpho-syntactic relationships between morphemes, such as modifier-head, verb-object, and resultative verb
Ex. 舞蹈家 → 舞蹈 + 家 → modifier-head

Example-based semantic classification – Step 2
Finding similar entries (examples)
- The CiLin thesaurus is searched for words that share at least one morpheme with the unknown word, in the same position.
- Ex. Unknown word: 舞蹈家 → List: 歌唱家、回家、富貴家
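
A minimal sketch of this lookup, assuming that "same position" means sharing either the first or the last morpheme and that every thesaurus entry is already segmented. The function name `find_candidates` and the toy data are hypothetical.

```python
def find_candidates(unknown_morphemes, thesaurus):
    """Return thesaurus words sharing the first or the last morpheme with the unknown word.

    `thesaurus` maps each known word to its (non-empty) list of morphemes.
    """
    matches = []
    for word, morphemes in thesaurus.items():
        same_first = morphemes[0] == unknown_morphemes[0]
        same_last = morphemes[-1] == unknown_morphemes[-1]
        if same_first or same_last:
            matches.append(word)
    return matches


# Toy segmented thesaurus; the segmentations are illustrative.
TOY_THESAURUS = {
    "歌唱家": ["歌唱", "家"],
    "回家": ["回", "家"],
    "富貴家": ["富貴", "家"],
    "運動場": ["運動", "場"],
}

print(find_candidates(["舞蹈", "家"], TOY_THESAURUS))
```

With the toy data, 運動場 is excluded because it shares no morpheme with 舞蹈家, which matches the list given in the example above.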

Example-based semantic classification – Step 3
Morpho-syntactic relationship filter
- Delete the examples output by Step 2 whose morpho-syntactic relationship differs from that of the unknown word.
- If no examples remain, the system falls back to the baseline classification method.
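
The filter and its fallback could look something like the sketch below. The relation labels attached to the candidates are illustrative guesses rather than analyzer output, and the names `filter_by_relation` and `select_examples` are hypothetical.

```python
def filter_by_relation(unknown_relation, candidates):
    """Keep the candidates whose morpho-syntactic relation matches the unknown word's."""
    return [word for word, relation in candidates if relation == unknown_relation]


def select_examples(unknown_relation, candidates):
    """Return (examples, use_baseline); use_baseline is True when nothing survives the filter."""
    examples = filter_by_relation(unknown_relation, candidates)
    return examples, not examples


# Candidate list from Step 2 with assumed relation labels (illustrative only).
CANDIDATES = [
    ("歌唱家", "modifier-head"),
    ("回家", "verb-object"),
    ("富貴家", "modifier-head"),
]

print(select_examples("modifier-head", CANDIDATES))
print(select_examples("resultative verb", CANDIDATES))
```

Returning an explicit flag keeps the fallback decision visible to the caller instead of hiding it inside the filter.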

Example-based semantic classification – Step 4
Compute the distance
- The distance is computed between the unknown word and each example selected in Step 3.
- Reference: Chen, C. J., M. H. Bai and K. J. Chen (1997), Category Guessing for Chinese Unknown Words.
- The similarity of two words is the information content (IC) of their least common ancestor.

Example-based semantic classification – Step 4 (cont.)
Compute the distance
- Information content (IC) = Entropy(System) − Entropy(Semantic category)
- Similarity (assuming all leaves are equally probable):
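
If all N leaves are equally probable, the entropy of the whole system is log2 N and the entropy of a category covering n leaves is log2 n, so IC = log2 N − log2 n. The sketch below computes this on a tiny made-up taxonomy and takes the IC of the least common ancestor as the similarity. The taxonomy (`PARENT`, `LEAVES`) is hypothetical, and dividing by log2 N to scale the result to [0, 1] is an assumption made here, not a formula taken from the slides.

```python
import math

# Hypothetical toy taxonomy: child -> parent. "ROOT" stands for the whole system.
PARENT = {
    "A": "ROOT", "B": "ROOT",
    "Aa": "A", "Ab": "A",
    "w1": "Aa", "w2": "Aa", "w3": "Ab", "w4": "B",
}
LEAVES = ["w1", "w2", "w3", "w4"]


def ancestors(node):
    """Return the path from node up to ROOT, including node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def leaf_count(node):
    """Number of leaves at or below node."""
    return sum(1 for leaf in LEAVES if node in ancestors(leaf))


def information_content(node):
    """Entropy(System) - Entropy(node) under equiprobable leaves."""
    return math.log2(len(LEAVES)) - math.log2(leaf_count(node))


def least_common_ancestor(a, b):
    """First node on a's path to ROOT that also lies on b's path."""
    b_path = set(ancestors(b))
    for node in ancestors(a):
        if node in b_path:
            return node
    return "ROOT"


def similarity(a, b):
    """IC of the least common ancestor, scaled to [0, 1] (the scaling is an assumption)."""
    return information_content(least_common_ancestor(a, b)) / math.log2(len(LEAVES))


print(similarity("w1", "w2"), similarity("w1", "w3"), similarity("w1", "w4"))
```

Words that meet only at the root get similarity 0, and the deeper the shared ancestor, the higher the score.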

Example-based semantic classification – Step 4 (cont.)
Run recursively
- 跑碼頭 (unknown) / 跑旱船 (known)
→ 碼頭 (known) / 旱船 (unknown)
→ guess the category of 旱船 ( 輪船 / 帆船 …) …
→ No word is left without a similarity measurement

Example-based semantic classification – Step 5
Assign the category
- 舞蹈家 compared with 歌唱家 / 回家 / 富貴家
  Sim( 舞蹈, 歌唱 ) = 0.87
  Sim( 舞蹈, 回 ) = 0.26
  Sim( 舞蹈, 富貴 ) = 0
→ 舞蹈家 is most likely to share the category of 歌唱家

Example-based semantic classification – Step 5 (cont.)
Assign the category
- Compute the average distance to the K nearest neighbors.
- The category with the lowest distance is assigned to the unknown word.

Example-based semantic classification – Step 5 (cont.)
K = 5, α = 0.5
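
One possible reading of Step 5, sketched under several assumptions: distance is taken as 1 − similarity, the average is computed per category over that category's K nearest examples, and the parameter α is left out because the transcript does not say what it weights. The function name `assign_category` and the category labels given to the three examples are hypothetical.

```python
from collections import defaultdict


def assign_category(examples, k=5):
    """Pick the category with the lowest mean distance over its k nearest examples.

    `examples` is a list of (category, similarity) pairs; distance = 1 - similarity
    is an assumption of this sketch. Returns None when there are no examples.
    """
    distances_by_category = defaultdict(list)
    for category, sim in examples:
        distances_by_category[category].append(1.0 - sim)

    best_category, best_distance = None, float("inf")
    for category, distances in distances_by_category.items():
        nearest = sorted(distances)[:k]
        mean_distance = sum(nearest) / len(nearest)
        if mean_distance < best_distance:
            best_category, best_distance = category, mean_distance
    return best_category


# Similarities from the 舞蹈家 example; the category labels are illustrative.
EXAMPLES = [("A", 0.87), ("H", 0.26), ("D", 0.0)]
print(assign_category(EXAMPLES, k=5))  # prints: A
```

With K = 5 as on the slide, a category backed by a single close example can still win over one with many moderately close examples; whether the original system guards against that is not recoverable from the transcript.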

Experiment
- 56,830 words in CiLin
  - Training set: 80%
  - Development set: 10%
  - Test set: 10% (assumed unknown)
- Proper nouns are filtered out.
- In evaluation, any one of the categories of an ambiguous word is considered correct.
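
The split and the lenient scoring rule can be sketched as follows. The shuffling, the seed, and the names `split_words` and `accuracy` are assumptions for illustration; the slides do not say how the split was drawn.

```python
import random


def split_words(words, seed=0):
    """Shuffle and split into 80% training, 10% development, and the rest as test."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def accuracy(predictions, gold):
    """Fraction of words whose predicted category is among the word's gold categories.

    `predictions` maps word -> category; `gold` maps word -> set of categories,
    so an ambiguous word counts as correct if any of its categories is predicted.
    """
    if not predictions:
        return 0.0
    correct = sum(1 for word, category in predictions.items() if category in gold[word])
    return correct / len(predictions)


train, dev, test = split_words(range(56830))
print(len(train), len(dev), len(test))  # prints: 45464 5683 5683
```

Integer arithmetic is used for the split sizes so that rounding cannot shift a word between the sets.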

Experiment (cont.)
Error analysis
- Data error
  - Idioms, metaphors, and slang (片語、隱喻、行話) (Ex. 母老虎、看門狗)
- Classifier error
  - Lack of examples (Ex. 鐵欄杆)
  - The similarity measurement is not precise enough (Ex. 運動場: C. time and space 商場 / 屠宰場 / 會場; D. abstract 球場)
  - The CiLin taxonomy is ambiguous (Ex. 體操房: B. object 刑房 / 書房 / 暗房 / 廚房; D. abstract 牢房 / 彈子房)

Conclusion
- Main contributions
  - First attempt at adding semantic knowledge to Chinese unknown words
  - Works without contextual information
- Future work
  - Using contextual information

Thank you!