1/1/ An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification Min-Yuh Day 1,2, Cheng-Wei Lee 1, Shih-Hung Wu 3,

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

Recognizing Human Actions by Attributes CVPR2011 Jingen Liu, Benjamin Kuipers, Silvio Savarese Dept. of Electrical Engineering and Computer Science University.
Integrated Instance- and Class- based Generative Modeling for Text Classification Antti PuurulaUniversity of Waikato Sung-Hyon MyaengKAIST 5/12/2013 Australasian.
Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong.
Made with OpenOffice.org 1 Sentiment Classification using Word Sub-Sequences and Dependency Sub-Trees Pacific-Asia Knowledge Discovery and Data Mining.
Recognition of Fragmented Characters Using Multiple Feature-Subset Classifiers Recognition of Fragmented Characters Using Multiple Feature-Subset Classifiers.
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots Chao-Yeh Chen and Kristen Grauman University of Texas at Austin.
1/1/ Integrating Genetic Algorithms with Conditional Random Fields to Enhance Question Informer Prediction Min-Yuh Day 1, 2, Chun-Hung Lu 1, 2, Chorng-Shyong.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Automatic Set Expansion for List Question Answering Richard C. Wang, Nico Schlaefer, William W. Cohen, and Eric Nyberg Language Technologies Institute.
1/1/ Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach Min-Yuh.
1/1/ Designing an Ontology-based Intelligent Tutoring Agent with Instant Messaging Min-Yuh Day 1,2, Chun-Hung Lu 1,3, Jin-Tan David Yang 4, Guey-Fa Chiou.
Scalable Text Mining with Sparse Generative Models
1 Drawing Dynamic Geometry Figures with Natural Language Wing-Kwong Wong a, Sheng-Kai Yin b, Chang-Zhe Yang c a Department of Electronic Engineering b.
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.
A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto.
Kuang Ru; Jinan Xu; Yujie Zhang; Peihao Wu Beijing Jiaotong University
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.
Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Intelligent DataBase System Lab, NCKU, Taiwan Josh Jia-Ching Ying 1, Eric Hsueh-Chan Lu 2 and Vincent S. Tseng 1 1 Institute of Computer Science and Information.
Summary  The task of extractive speech summarization is to select a set of salient sentences from an original spoken document and concatenate them to.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
A Language Independent Method for Question Classification COLING 2004.
Date: 2014/02/25 Author: Aliaksei Severyn, Massimo Nicosia, Aleessandro Moschitti Source: CIKM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Building.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
Experiments of Opinion Analysis On MPQA and NTCIR-6 Yaoyong Li, Kalina Bontcheva, Hamish Cunningham Department of Computer Science University of Sheffield.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.
NTCIR /21 ASQA: Academia Sinica Question Answering System for CLQA (IASL) Cheng-Wei Lee, Cheng-Wei Shih, Min-Yuh Day, Tzong-Han Tsai, Tian-Jian Jiang,
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, Li Guo Chinese Academy of Sciences State Grid Energy Institute, China Efficient Behavior Targeting Using SVM Ensemble.
Summary  Extractive speech summarization aims to automatically select an indicative set of sentences from a spoken document to concisely represent the.
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Carolyn Penstein Rosé Language Technologies Institute Human-Computer Interaction Institute School of Computer Science With funding from the National Science.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Yu Cheng Chen Author: YU-SHENG.
Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University SIGIR 2009.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Subjectivity Recognition on Word Senses via Semi-supervised Mincuts Fangzhong Su and Katja Markert School of Computing, University of Leeds Human Language.
1 Measuring the Semantic Similarity of Texts Author : Courtney Corley and Rada Mihalcea Source : ACL-2005 Reporter : Yong-Xiang Chen.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Question Classification using Support Vector Machine Dell Zhang National University of Singapore Wee Sun Lee National University of Singapore SIGIR2003.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Text Categorization by Boosting Automatically Extracted Concepts Lijuan Cai and Tommas Hofmann Department of Computer Science, Brown University SIGIR 2003.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
Incremental Reduced Support Vector Machines Yuh-Jye Lee, Hung-Yi Lo and Su-Yun Huang National Taiwan University of Science and Technology and Institute.
UIC at TREC 2006: Blog Track Wei Zhang Clement Yu Department of Computer Science University of Illinois at Chicago.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Boosting the Feature Space: Text Classification for Unstructured.
An Automatic Method for Selecting the Parameter of the RBF Kernel Function to Support Vector Machines Cheng-Hsuan Li 1,2 Chin-Teng.
Facial Smile Detection Based on Deep Learning Features Authors: Kaihao Zhang, Yongzhen Huang, Hong Wu and Liang Wang Center for Research on Intelligent.
Compact Bilinear Pooling
CATEGORIZATION OF NEWS ARTICLES USING NEURAL TEXT CATEGORIZER
CRF &SVM in Medication Extraction
Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science
PEBL: Web Page Classification without Negative Examples
Cost Sensitive Evaluation Measures for F-term Classification
SVM Based Learning System for F-term Patent Classification
Hierarchical, Perceptron-like Learning for OBIE
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

1/1/ An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification Min-Yuh Day 1,2, Cheng-Wei Lee 1, Shih-Hung Wu 3, Chorng-Shyong Ong 2, Wen-Lian Hsu 1 1 Institute of Information Science, Academia Sinica, Taipei 2 Department of Information Management, National Taiwan University, Taipei 3 Dept. of Computer Science and Information Engineering, Chaoyang Univ. of Technology, Taichung IEEE NLPKE 2005

Min-Yuh Day (SINICA; NTU) 2/2/ Outline Introduction Chinese Question Classification (CQC) Proposed Approach Knowledge-based Approach: INFOMAP Machine Learning Approach: SVM Integration of SVM and INFOMAP Hybrid Approach Experimental Results and Discussion Related Works Conclusions

Min-Yuh Day (SINICA; NTU) 3/3/ Introduction Question Answering TREC QA NTCIR CLQA Chinese Question Classification Goal: accurately classify a Chinese question into a question type and then map it to an expected answer type Chinese Question: 奧運的發源地在哪裡? Where is the originating place of the Olympics? Question Type: Q_LOCATION| 地 Question Types Answer extraction and answer filtering Improve the accuracy of the overall question answering system

Min-Yuh Day (SINICA; NTU) 4/4/ Introduction Problem of Question Classification 36.4% of the errors occur in the question classification module (Moldovan et al., 2003) Approaches to Question Classification (QC) Rule-based approaches Statistical approaches

Min-Yuh Day (SINICA; NTU) 5/5/ Proposed Approach Chinese Question Taxonomy Question Type Filter for Expected Answer Type (EAT) Knowledge-based Approach: INFOMAP Machine Learning Approach: SVM Hybrid Approach: Integration of SVM and INFOMAP

Min-Yuh Day (SINICA; NTU) 6/6/ Chinese Question Taxonomy for NTCIR CLQA Factoid Question Answering

Min-Yuh Day (SINICA; NTU) 7/7/ Question Type (QType) Filter for Expected Answer Type (EAT)

Min-Yuh Day (SINICA; NTU) 8/8/ INFOMAP (Knowledge-based Approach) INFOMAP: Knowledge Representation Framework Extracts important concepts from a natural language text Feature of INFOMAP represent and match complicated template structures hierarchical matching regular expressions semantic template matching frame (non-linear relations) matching graph matching We adopt INFOMAP as the knowledge-based approach for CQC Using INFOMAP, we can identify the question category from a Chinese question

Min-Yuh Day (SINICA; NTU) 9/9/ Knowledge Representation of Chinese Question Chinese Question: 2004 年奧運在哪一個城市舉行 ? (In which city were the Olympics held in 2004?) [5 Time]:[3 Organization]:[7 Q_Location]:([9 LocaitonRelatedEvent])

Min-Yuh Day (SINICA; NTU) 10/ Knowledge representation for CQC in INFOMAP

Min-Yuh Day (SINICA; NTU) 11/ Knowledge representation for CQC in INFOMAP 2004 年奧運在哪一個城市舉行 ? (In which city were the Olympics held in 2004?)

Min-Yuh Day (SINICA; NTU) 12/ Knowledge representation for CQC in INFOMAP 2004 年奧運在哪一個城市舉行 ? (In which city were the Olympics held in 2004?)

Min-Yuh Day (SINICA; NTU) 13/ Knowledge representation for CQC in INFOMAP 2004 年奧運在哪一個城市舉行 ? (In which city were the Olympics held in 2004?) [5 Time]: [3 Organization] :[7 Q_Location]: ([9 LocaitonRelatedEvent])

Min-Yuh Day (SINICA; NTU) 14/ SVM (Machine Learning Approach) Two types of feature used for CQC Syntactic features Bag-of-Words character-based bigram (CB) word-based bigram (WB) Part-of-Speech (POS) AUTOTAG POS tagger developed by CKIP, Academia Sinica Semantic Features HowNet Senses HowNet Main Definition (HNMD) HowNet Definition (HND)

Min-Yuh Day (SINICA; NTU) 15/ Integration of SVM and INFOMAP (Hybrid Approach) The integrated module selects the question type with the highest confidence score from the INFOMAP or the SVM model If the question matches the templates or rules represented in INFOMAP and obtains the question type, we use the question type obtain from INFOMAP first. If no question type can be obtained from INFOMAP, we use the result from the SVM model. If multiple question types are obtained from INFOMAP, we choose the one obtained from SVM first. If one question type with a high positive score is obtained from SVM and one question type obtained from INFOMAP, which is not the same as the one from SVM, we choose the one from SVM with a high positive score. SVM INFOMAP 0 1 >=2 HP

Min-Yuh Day (SINICA; NTU) 16/ Experimental Results and Discussion Datasets Training: 1350 questions 500 questions from CLQA ’ s development dataset 300 questions for Japanese news 200 questions for Traditional Chinese news 850 questions manually build for our proposed question taxonomy 518 questions in SVM 332 questions in INFOMAP Testing: 200 questions 200 Questions from CLQA ’ s formal run dataset We use different features to train the SVM model based on a total of 1350 questions and their labeled question type

Min-Yuh Day (SINICA; NTU) 17/ Experimental Results of CLQA’s development dataset SVM Training data: CLQAS300 (300 questions for Japanese news) SVM Testing data: CLQAS200 (200 questions for Chinese news) 48.00% 66.00% 52.57% 74.30%

Min-Yuh Day (SINICA; NTU) 18/ Experimental Results of CLQA’s Formal Run dataset Training dataset: 1350 questions 300 (Development dataset for Japanese News) (Development dataset for Chinese News) (SVM) (INFOMAP) Features: CB+HNMD Testing dataset: 200 questions CLQA ’ s formal run

Min-Yuh Day (SINICA; NTU) 19/ Experimental Results of CLQA’s Formal Run dataset

Min-Yuh Day (SINICA; NTU) 20/ Discussion Integrated approach performs better than the individual knowledge-based or machine learning approach knowledge-based approach performs well with easy questions using the templates and rules Easy questions are defined as follows: Clear words that show the question type and indicate the words that are not question types Ex: “ 誰 (Who) ”, “ 哪一位 (Which person) ”, “ 首位 (the first person) ” Explicit words that identify the question type. If words are easy to identify, it means they overlap with a question type Ex: “ 隊伍 (team) ” and “ 運動隊伍 (sports team) ” Interrogative words that connect with question type words in question Ex: “ 那個人 (Which Person) ”

Min-Yuh Day (SINICA; NTU) 21/ Related Works Li and Roth (2002) 6 coarse classes and 50 fine classes for TREC factoid question answering Sparse Network of Windows (SNoW) Over 90% accuracy Zhang and Lee (2003) Support Vector Machines (SVMs) Surface text features (bag-of-words and bag-of-ngrams) coarse-grained: 86% accuracy fine-grained: approximately 80% accuracy. Adding syntactic information coarse-grained: accuracy of 90% Suzuki et al. (2003) Hierarchical SVM Four feature sets (1) words only (2) words and named entities (3) words and semantic information (4) words and NEs and semantic information Coarse-grained: 95% (depth 1) Fine-grained: 75% (depth 4)

Min-Yuh Day (SINICA; NTU) 22/ Comparison with related works Question classification in Chinese The accuracy of CQC SVM: 73.5% INFOMAP: 88% Hybrid Approach (SVM+INFOMAP): 92%

Min-Yuh Day (SINICA; NTU) 23/ Conclusions We have proposed a Hybrid approach to Chinese question classification (CQC) for NTCIR CLQA factoid question-answering Hierarchical coarse-grained and fine-grained question taxonomies 6 coarse-grained categories and 62 fine-grained categories for Chinese questions Mapping method for question type filtering to obtain expected answer types (EAT) The integrated knowledge-based and machine learning approach achieves significantly better accuracy rate than individual approaches

Min-Yuh Day (SINICA; NTU) 24/ Applications: ASQA (Academia Sinica Question Answering system) ASQA (IASL-IIS-SINICA-TAIWAN) First place in the Chinese-Chinese (C-C) subtask of the NTCIR5 Cross-Language Question Answering (CLQA 2005) task

Min-Yuh Day (SINICA; NTU) 25/

Min-Yuh Day (SINICA; NTU) 26/ Q & A An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification Min-Yuh Day 1,2, Cheng-Wei Lee 1, Shih-Hung Wu 3, Chorng-Shyong Ong 2, Wen-Lian Hsu 1 ( 戴敏育 1,2, 李政緯 1, 吳世弘 3, 翁崇雄 2, 許聞廉 1 ) 1 Institute of Information Science, Academia Sinica, Taipei 2 Department of Information Management, National Taiwan University, Taipei 3 Dept. of Computer Science and Information Engineering, Chaoyang Univ. of Technology, Taichung IEEE NLPKE 2005