ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong

Slides:

Advertisements

Similar presentations

A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.

Advertisements

Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)

ICONIP 2005 Improve Naïve Bayesian Classifier by Discriminative Training Kaizhu Huang, Zhangbing Zhou, Irwin King, Michael R. Lyu Oct

Large-Scale Entity-Based Online Social Network Profile Linkage.

计算机科学与技术学院 Chinese Semantic Role Labeling with Dependency-driven Constituent Parse Tree Structure Hongling Wang, Bukang Wang Guodong Zhou NLP Lab, School.

TEMPLATE DESIGN © Identifying Noun Product Features that Imply Opinions Lei Zhang Bing Liu Department of Computer Science,

Recognizing Implicit Discourse Relations in the Penn Discourse Treebank Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng Department of Computer Science National.

GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.

Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.

Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.

SUPPORT VECTOR MACHINES PRESENTED BY MUTHAPPA. Introduction Support Vector Machines(SVMs) are supervised learning models with associated learning algorithms.

Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.

U.S. SENATE BILL CLASSIFICATION & VOTE PREDICTION Alessandra Paulino Rick Pocklington Serhat Selcuk Bucak.

1/1/ Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach Min-Yuh.

Presentation in IJCNN 2004 Biased Support Vector Machine for Relevance Feedback in Image Retrieval Hoi, Chu-Hong Steven Department of Computer Science.

Sentence Classifier for Helpdesk s Anthony 6 June 2006 Supervisors: Dr. Yuval Marom Dr. David Albrecht.

Parsing the NEGRA corpus Greg Donaker June 14, 2006.

A Framework for Named Entity Recognition in the Open Domain Richard Evans Research Group in Computational Linguistics University of Wolverhampton UK

Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.

Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

Zhenghua Li, Jiayuan Chao, Min Zhang, Wenliang Chen {zhli13, minzhang, Soochow University, China Coupled Sequence.

To Trust of Not To Trust? Predicting Online Trusts using Trust Antecedent Framework Viet-An Nguyen 1, Ee-Peng Lim 1, Aixin Sun 2, Jing Jiang 1, Hwee-Hoon.

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews K. Dave et al, WWW 2003, citations Presented by Sarah.

Attention Deficit Hyperactivity Disorder (ADHD) Student Classification Using Genetic Algorithm and Artificial Neural Network S. Yenaeng 1, S. Saelee 2.

Richard Socher Cliff Chiung-Yu Lin Andrew Y. Ng Christopher D. Manning

Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.

Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.

Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1.

Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.

1 A Unified Relevance Model for Opinion Retrieval (CIKM 09’) Xuanjing Huang, W. Bruce Croft Date: 2010/02/08 Speaker: Yu-Wen, Hsu.

Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang † Huanhuan Liu † Chu-Ren Huang.

1 Statistical NLP: Lecture 9 Word Sense Disambiguation.

Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad

Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.

A Weakly-Supervised Approach to Argumentative Zoning of Scientific Documents Yufan Guo Anna Korhonen Thierry Poibeau 1 Review By: Pranjal Singh Paper.

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.

A Language Independent Method for Question Classification COLING 2004.

Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification 黃居仁 Chu-Ren Huang Academia Sinica

An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee

INFORMATION NETWORKS DIVISION COMPUTER FORENSICS UNCLASSIFIED 1 DFRWS2002 Language and Gender Author Cohort Analysis of .

Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.

A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart

Kyoungryol Kim Extracting Schedule Information from Korean .

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Tokenization & POS-Tagging

TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.

Presenter: Jinhua Du ( 杜金华 ) Xi’an University of Technology 西安理工大学 NLP&CC, Chongqing, Nov , 2013 Discriminative Latent Variable Based Classifier.

CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.

USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Page 1 NAACL-HLT 2010 Los Angeles, CA Training Paradigms for Correcting Errors in Grammar and Usage Alla Rozovskaya and Dan Roth University of Illinois.

Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.

Semantic classification of Chinese unknown words Huihsin Tseng Linguistics University of Colorado at Boulder ACL 2003 Student Research Workshop.

Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.

From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:

Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.

Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.

Extracting Opinion Topics for Chinese Opinions using Dependence Grammar Guang Qiu, Kangmiao Liu, Jiajun Bu*, Chun Chen, Zhiming Kang Reporter: Chia-Ying.

Word Sense and Subjectivity (Coling/ACL 2006) Janyce Wiebe Rada Mihalcea University of Pittsburgh University of North Texas Acknowledgements: This slide.

Twitter as a Corpus for Sentiment Analysis and Opinion Mining

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Chinese Named Entity Recognition using Lexicalized HMMs.

1 A Statistical Matching Method in Wavelet Domain for Handwritten Character Recognition Presented by Te-Wei Chiang July, 2005.

Language Identification and Part-of-Speech Tagging

Sentiment analysis algorithms and applications: A survey

Guillaume-Alexandre Bilodeau

Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning Shizhu He, Cao liu, Kang Liu and Jun Zhao.

By Hossein Hematialam and Wlodek Zadrozny Presented by

Extracting Why Text Segment from Web Based on Grammar-gram

Presentation transcript:

ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 2:: ACBIMA.ORG ::

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 3:: ACBIMA.ORG ::

Introduction ● NLP tasks usually focus on segmented words ● Morphology is how words are composed with morphemes ● Usages of Chinese morphological structures o Sentiment Analysis (Ku, 2009; Huang, 2009) o POS Tagging (Qiu, 2008) o Word Segmentation (Gao, 2005) o Parsing (Li, 2011; Li, 2012; Zhang, 2013) ● Challenge for Chinese morphology o Lack of complete theories o Lack of category schema o Lack of toolkits 4:: ACBIMA.ORG :: 抗菌 anti bacteria verb object

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 5:: ACBIMA.ORG ::

Related Work 6:: ACBIMA.ORG :: ● Focus on longer unknown words o Tseng, 2002; Tseng, 2005; Lu, 2008; Qiu, 2008 ● Focus on the functionality of morphemic characters o Bruno, 2010 ● Focus on Chinese bi-character words o Huang, et al., 2010 (LREC) o 52% multi-character Chinese tokens are bi-character o analyze Chinese morphological types o developed a suite of classifiers for type prediction Issue: covers only a subset of Chinese content words and has limited scalability

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 7:: ACBIMA.ORG ::

Morphological Type Scheme 8:: ACBIMA.ORG :: Chinese Bi-Char Content Word Single-Morpheme Word Synthetic Word Derived Word Compound Word dup, pfx, sfx, neg, ec a-head, conj, n-head, nsubj, v-head, vobj, vprt, els els

Derived Word 9:: ACBIMA.ORG ::

Compond Word 10:: ACBIMA.ORG ::

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 11:: ACBIMA.ORG ::

Morphological Type Classification 12:: ACBIMA.ORG :: ● Assumption: Chinese morphological structures are independent from word-level contexts (Tseng, 2002; Li, 2011) ● Derived words o Rule-based approach ● Compound words o ML-based approach

Derived Word: Rule-Based 13:: ACBIMA.ORG :: ● Idea o a morphologically derived word can be recognized based on its formation ● Approach o pattern matching rules ● Evaluation o Data: Chinese Treebank 7.0 o Result: o 2.9% of bi-char content words are annotated as derived words o Precision = 0.97 Rule-based methods are able to effectively recognizing derived words.

Compond Word: ML-Based 14:: ACBIMA.ORG :: ● Idea o The characteristics of individual characters can help decide the type of compond words ● ML classification models o Naïve Bayes o Random Forest o SVM

Classification Feature 15:: ACBIMA.ORG :: o Dict: Revised Mandarin Chinese Dictionary (MoE, 1994) o CTB: Chinese Treebank 5.1 (Xue et al., 2005)

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 16:: ACBIMA.ORG ::

ACBiMA Corpus :: ACBIMA.ORG :: ● Initial Set o 3,052 words o Extracted from CTB5 o Annotated with difficulty level ● Whole Set o 11,366 words o Initial Set + 3k words from CTB k words from (Huang, 2010)

Baseline Models 18:: ACBIMA.ORG :: 1)Majority 2)Stanford Dependency Map 3)Tabular Models o Step 1: assign the POS tags to each known character based on different heuristics o Step 2: assign the most frequent morphological type obtained from training data to each POS combination, e.g., “(VV, NN) = vobj”

Experimental Result 19:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approachnsubj v- head a- head n- head vprtvobjconjelsMFACC Majority Stanford Dep. Map Tabular Stanford CTB Dict Tablular approaches perform better among all baselines.

Experimental Result 20:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approachnsubj v- head a- head n- head vprtvobjconjelsMFACC Majority Stanford Dep. Map Tabular Stanford CTB Dict Naïve Base Random Forest SVM ML-based methods outperform all baselines, where SVM & RF perform best.

Experimental Result 21:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approachnsubj v- head a- head n- head vprtvobjconjelsMFACC Majority Stanford Dep. Map Tabular Stanford CTB Dict Naïve Base Random Forest SVM Avg Difficulty

Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 22:: ACBIMA.ORG ::

Conclusion & Future Work 23:: ACBIMA.ORG :: ● Contribution o Linguistic  Propose a morphological type scheme  Develop a corpus containing about 11K words o Technical  Develop an effective morphological classifier o Practical  Data and tool available  Additional features for any Chinese task ● Future o Improve other NLP tasks by using ACBiMA

Q & A Thanks for your attentions!! 24 ●