1 Chinese Term Extraction Based on Delimiters Yuhang Yang, Qin Lu, Tiejun Zhao School of Computer Science and Technology, Harbin Institute of Technology.


1 Chinese Term Extraction Based on Delimiters
  Yuhang Yang, Qin Lu, Tiejun Zhao
  School of Computer Science and Technology, Harbin Institute of Technology
  Department of Computing, The Hong Kong Polytechnic University
  May 2008

2 Outline
  Introduction
  Related Work
  Methodology
  Experiment and Discussion
  Conclusion

3 Basic Concepts
  Terms (terminology): lexical units that carry the most fundamental knowledge of a domain
  Term extraction
    Term candidate extraction: unithood
    Terminology verification: termhood

4 Major Problems
  Term boundary identification based on term features
    Too few features are not enough
    More features lead to more conflicts
  Limitations in scope
    Low-frequency terms
    Long compound terms
    Dependency on Chinese segmentation

5 Main Idea
  Delimiter-based term candidate extraction: identify the relatively stable, domain-independent words immediately before and after terms
    扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜
    (A scanning tunneling microscope is a high-resolution microscope based on the quantum tunneling effect)
    社会主义制度是中华人民共和国的根本制度
    (The socialist system is the basic system of the People's Republic of China)
  Potential advantages of the proposed approach
    No strict limits on frequency or word length
    No need for full segmentation
    Relatively domain independent

6 Related Work: Statistics-Based Measures
  Internal measures (Schone and Jurafsky, 2001)
    Internal associative measures between the constituent characters of a candidate, such as:
      Frequency
      Mutual information
  Contextual measures
    Dependency of a candidate on its context:
      Left/right entropy (Sornlertlamvanich et al., 2000)
      Left/right context dependency (Chien, 1999)
      Accessor variety criteria (Feng et al., 2004)
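As an illustration of the contextual measures listed above, here is a minimal sketch of left/right entropy using the textbook Shannon entropy over the characters adjacent to a candidate. The function name `context_entropy` and its parameters are hypothetical, and the cited papers may define the measure with different normalisation or context units.

```python
import math
from collections import Counter

def context_entropy(text, candidate, side="left"):
    """Shannon entropy (in bits) of the characters adjacent to `candidate`.

    Illustrative only: character-level contexts are an assumption made here,
    not a detail taken from the slides.
    """
    neighbours = Counter()
    start = text.find(candidate)
    while start != -1:
        # Index of the character just before (left) or just after (right) the match.
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(text):
            neighbours[text[idx]] += 1
        start = text.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())
```

A candidate that appears after many different characters gets a high left entropy, which such measures read as evidence of a boundary on that side.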

7 Hybrid Approaches
  The UnitRate algorithm (Chen et al., 2006): occurrence probability + marginal variety probability
  The TCE_SEF&CV algorithm (Ji et al., 2007): significance estimation function + C-value measure
  Limitations
    Data sparseness for low-frequency terms and long terms
    Cascading errors from full segmentation

8 Observations
  Sentences consist of substantives and function words
  Domain-specific terms (terms for short) are more likely to be domain substantives
  Predecessors and successors of terms are more likely to be function words or general substantives that connect terms
  Predecessors and successors therefore act as markers of terms, referred to as term delimiters (or simply delimiters)

9 Delimiter-Based Term Extraction
  Characteristics of delimiters
    Mainly function words and general substantives
    Relatively stable
    Domain independent
    Can be extracted more easily
  Proposed model
    Identify the features of delimiters
    Identify terms by finding their predecessors and successors as their boundary words

10 Algorithm Design
  TCE_DI (Term Candidate Extraction – Delimiter Identification)
  Input: Corpus_extract (domain corpus), DList (delimiter list)
  (1) Partition Corpus_extract into character strings at punctuation marks.
  (2) Partition the character strings at delimiters to obtain term candidates. If a string contains no delimiter, the whole string is regarded as a term candidate.
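The two partitioning steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tce_di`, the `PUNCTUATION` set, and the longest-delimiter-first matching are all assumptions, since the slides do not specify them.

```python
import re

# Hypothetical punctuation set; the slides do not list the characters actually used.
PUNCTUATION = ",。、;:?!()" + ",.;:?!()"

def tce_di(corpus_text, delimiters):
    """Return term candidates: split at punctuation, then at delimiters."""
    punct_re = re.compile("[" + re.escape(PUNCTUATION) + r"\s]+")
    # Assumption: try longer delimiters first so they win over shorter ones.
    ordered = sorted((d for d in delimiters if d), key=len, reverse=True)
    delim_re = re.compile("|".join(re.escape(d) for d in ordered)) if ordered else None

    candidates = []
    for chunk in punct_re.split(corpus_text):       # step (1)
        if not chunk:
            continue
        # Step (2); a chunk without any delimiter survives whole as one candidate.
        pieces = delim_re.split(chunk) if delim_re else [chunk]
        candidates.extend(p for p in pieces if p)
    return candidates
```

With the delimiters 是 and 的, the example sentence 社会主义制度是中华人民共和国的根本制度 from slide 5 splits into 社会主义制度, 中华人民共和国 and 根本制度. No word segmentation of the candidates themselves is needed.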

11 Acquisition of DList
  From a given stop word list
    Produced by experts or from a general corpus
    No training is needed
  DList_Ext algorithm
    Given a training corpus Corpus_D_training and a domain lexicon Lexicon_Domain

12 The DList_Ext Algorithm
  S1: For each term T_i in Lexicon_Domain, mark T_i in Corpus_D_training as a lexical unit
  S2: Segment the remaining text
  S3: Extract the predecessors and successors of all T_i as delimiter candidates
  S4: Remove all T_i from the delimiter candidates
  S5: Rank the delimiter candidates by frequency
  Use a simple threshold N_DI to select the top-ranked candidates
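Steps S3 to S5 can be sketched as below. To stay self-contained, the sketch assumes S1 and S2 have already been done, so each sentence arrives as a list of tokens in which lexicon terms are single tokens. The name `dlist_ext` and the parameter `n_di` (standing in for the threshold N_DI, read here as "keep the top N candidates") are hypothetical.

```python
from collections import Counter

def dlist_ext(segmented_sentences, domain_lexicon, n_di):
    """Return the n_di most frequent words adjacent to lexicon terms."""
    counts = Counter()
    for tokens in segmented_sentences:
        for i, tok in enumerate(tokens):
            if tok not in domain_lexicon:
                continue
            # S3: the predecessor and successor of a term are delimiter candidates.
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1
    # S4: terms themselves are not delimiters.
    for term in domain_lexicon:
        counts.pop(term, None)
    # S5: rank by frequency and apply the threshold.
    return [word for word, _ in counts.most_common(n_di)]
```

The resulting list can be passed directly as the delimiter list of a TCE_DI-style splitter.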

13 Experiments: Data Preparation
  Delimiter lists
    DList_IT: extracted using Corpus_IT_Small and Lexicon_IT
    DList_Legal: extracted using Corpus_Legal_Small and Lexicon_Legal
    DList_SW: 494 general stop words

14 Performance Measurements
  Evaluation: precision (sampling) and rate of new term extraction (R_NTE)
  Reference algorithms
    SEF&C-value (Ji et al., 2007) for term candidate extraction
    TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification
    LA_TV (Link Analysis based – Terminology Verification) for fair comparison

15 Evaluation: DList_Ext algorithm: N_DI
  Coverage of Delimiters on Different Corpora

  Delimiter list          Corpus_Legal_Large      Corpus_IT_Large
                          (11,048 sentences)      (60,508 sentences)
  DList_IT (Top100)       77.6%                   89.1%
  DList_IT (Top300)       84.6%                   92.6%
  DList_IT (Top500)       90.3%                   93.4%
  DList_IT (Top700)       92.7%                   93.9%
  DList_Legal (Top100)    95.8%                   92.6%
  DList_Legal (Top300)    97.8%                   96.2%
  DList_Legal (Top500)    98.7%                   96.8%
  DList_Legal (Top700)    99.1%                   97.1%

  DList_SW: 98.1%

16 Evaluation: DList_Ext algorithm: N_DI
  Frequency of Delimiters on Domain Corpora

17 Evaluation: DList_Ext algorithm: N_DI
  Performance of DList_IT on Corpus_IT_Large
  Performance of DList_Legal on Corpus_IT_Large

18 N_DI = 500
  Performance of DList_IT on Corpus_Legal_Large
  Performance of DList_Legal on Corpus_Legal_Large

19 Evaluation on Term Extraction
  Performance of Different Algorithms on IT Domain and Legal Domain

20 Performance Analysis
  Domain-independent and stable delimiters: easy to extract and useful
  Larger granularity of domain-specific terms: keeps many noisy strings out
  Less sensitivity to frequency: the method concentrates on delimiters, regardless of the frequencies of the candidates

21 Evaluation on New Term Extraction: R_NTE
  Performance of Different Algorithms for New Term Extraction

22 Error Analysis
  Figure-of-speech phrases
    “不难看出” (it is not difficult to see that ...)
    “新方法中” (in the new methods)
  General words
    “思维状态” (mental state)
    “建筑” (architecture)
  Long strings that contain short terms
    “访问共享资源” (access shared resources)
    “再次遍历” (traverse again)

23 Conclusion
  A delimiter-based approach for term candidate extraction
  Advantages
    Less sensitivity to term frequency
    Requires little prior domain knowledge and relatively little adaptation for new domains
    Significant improvements for term extraction
    Much better performance for new term extraction
  Future work
    Improving overall term extraction algorithms
    Applying the approach to related NLP tasks such as NER
    Applying the approach to other languages

24 Thank You ! Q & A