1 Chinese Term Extraction Based on Delimiters Yuhang Yang, Qin Lu, Tiejun Zhao School of Computer Science and Technology, Harbin Institute of Technology Department of Computing, The Hong Kong Polytechnic University May, 2008
2 Outline Introduction Related Works Methodology Experiment and Discussion Conclusion
3 Basic Concepts Terms (terminology): lexical units that carry the most fundamental knowledge of a domain. Term extraction consists of two steps: term candidate extraction (unithood) and terminology verification (termhood).
4 Major Problems Term boundary identification based on term features Fewer features are not enough More features lead to more conflicts Limitation in scope low frequency terms long compound terms dependency on Chinese segmentation
5 Main Idea Delimiter-based term candidate extraction: identify the relatively stable, domain-independent words immediately before and after terms.
扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜 (A scanning tunneling microscope is a kind of high-resolution microscope based on the quantum tunneling effect.)
社会主义制度是中华人民共和国的根本制度 (The socialist system is the basic system of the People's Republic of China.)
Potential advantages of the proposed approach: no strict limits on frequency or word length; no need for full segmentation; relatively domain independent.
6 Related Works: Statistics-based Measures Internal measures (Schone and Jurafsky, 2001): associative measures between the constituent characters of a candidate, such as frequency and mutual information. Contextual measures: dependency of a candidate on its context, such as left/right entropy (Sornlertlamvanich et al., 2000), left/right context dependency (Chien, 1999), and the accessor variety criteria (Feng et al., 2004).
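As an illustration of a contextual measure, the sketch below computes a left/right entropy for a candidate string: the entropy of the distribution of characters that appear immediately before (or after) it. This is one common formulation and may differ in detail from the cited papers; the function name `context_entropy` and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def context_entropy(corpus, candidate, side="left"):
    """Entropy (in bits) of the characters adjacent to `candidate` in `corpus`.

    Hypothetical helper for illustration; not taken from the cited papers.
    """
    neighbors = Counter()
    start = corpus.find(candidate)
    while start != -1:
        # Index of the character just before or just after this occurrence.
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


# Toy corpus (an assumption): the left context of the candidate varies,
# while the right context is always the same character.
toy_corpus = "甲显微镜乙。丙显微镜乙。"
left = context_entropy(toy_corpus, "显微镜", side="left")
right = context_entropy(toy_corpus, "显微镜", side="right")
```

A high entropy on both sides suggests that the candidate occurs in varied contexts and is therefore more likely to be a complete unit; measures of this kind depend on having enough occurrences to estimate the distribution.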
7 Hybrid Approaches The UnitRate algorithm (Chen et al., 2006): occurrence probability + marginal variety probability. The TCE_SEF&CV algorithm (Ji et al., 2007): significance estimation function + C-value measure. Limitations: data sparseness for low-frequency terms and long terms; cascading errors caused by full segmentation.
8 Observations Sentences are constituted by substantives and functional words Domain specific terms (terms for short) are more likely to be domain substantives Predecessors and successors of terms are more likely to be functional words or general substantives connecting terms Predecessors and successors are markers of terms, referred to as term delimiters (or simply delimiters)
9 Delimiter Based Term Extraction Characteristics of delimiters Mainly functional words and general substantives Relatively stable Domain independent Can be extracted more easily Proposed model Identifying features of delimiters Identify terms by finding their predecessors and successors as their boundary words
10 Algorithm Design TCE_DI (Term Candidate Extraction – Delimiter Identification)
Input: Corpus_extract (domain corpus), DList (delimiter list)
(1) Partition Corpus_extract into character strings by punctuation.
(2) Partition the character strings by delimiters to obtain term candidates. If a string contains no delimiter, the whole string is regarded as a term candidate.
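The two steps of TCE_DI can be sketched as follows. This is a minimal sketch, not the authors' implementation: the punctuation set, the function name `tce_di`, and the small delimiter lists in the usage lines are assumptions chosen for illustration.

```python
import re

# Assumed punctuation set; the slides do not list the actual characters used.
PUNCTUATION = "，。、；：？！,.;:?!"


def tce_di(corpus_extract, dlist):
    """Return term candidates by splitting on punctuation, then on delimiters."""
    # Step 1: partition the corpus into character strings by punctuation.
    punct_pattern = "[" + re.escape(PUNCTUATION) + r"\s]+"
    strings = [s for s in re.split(punct_pattern, corpus_extract) if s]
    if not dlist:
        # With no delimiters, every string is a candidate as a whole.
        return strings
    # Try longer delimiters first so multi-character delimiters are not
    # shadowed by shorter ones that share a prefix.
    ordered = sorted(dlist, key=len, reverse=True)
    delim_pattern = re.compile("|".join(re.escape(d) for d in ordered))
    candidates = []
    for s in strings:
        # Step 2: partition each string by delimiters. A string without any
        # delimiter comes back from split() unchanged, as a single candidate.
        candidates.extend(p for p in delim_pattern.split(s) if p)
    return candidates


# Usage on the two example sentences from the Main Idea slide, with
# hand-picked delimiters (an assumption, not the learned DList).
example_1 = tce_di("社会主义制度是中华人民共和国的根本制度", ["是", "的"])
example_2 = tce_di(
    "扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜",
    ["是", "一种", "基于", "的"],
)
```

Because the text is split on delimiter strings directly, no full word segmentation of the corpus is needed at extraction time, which matches the advantage claimed for the approach.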
11 Acquisition of DList From a given stop word list: produced by experts or from a general corpus; no training is needed. DList_Ext algorithm: given a training corpus Corpus_D_training and a domain lexicon Lexicon_Domain.
12 The DList_Ext algorithm
S1: For each term T_i in Lexicon_Domain, mark T_i in Corpus_D_training as a lexical unit.
S2: Segment the remaining text.
S3: Extract the predecessors and successors of all T_i as delimiter candidates.
S4: Remove all T_i from the delimiter candidates.
S5: Rank the delimiter candidates by frequency and keep the top N_DI (a simple threshold).
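Steps S1 to S5 can be sketched as below. This is a sketch under stated assumptions rather than the authors' code: the function name `dlist_ext` is hypothetical, and the `segment` parameter stands in for a real Chinese word segmenter (the default simply splits the remaining text into single characters).

```python
import re
from collections import Counter


def dlist_ext(sentences, lexicon, n_di, segment=list):
    """Return the n_di most frequent words adjacent to known domain terms.

    `segment` is a placeholder for a word segmenter (assumption); the default
    `list` splits text into single characters.
    """
    terms = sorted(lexicon, key=len, reverse=True)  # prefer longer terms
    if not terms:
        return []
    # One capturing group, so re.split keeps matched terms at odd indices.
    term_pattern = re.compile("(" + "|".join(re.escape(t) for t in terms) + ")")
    counts = Counter()
    for sentence in sentences:
        tokens = []  # list of (word, is_term) pairs
        for i, piece in enumerate(term_pattern.split(sentence)):
            if not piece:
                continue
            if i % 2 == 1:
                tokens.append((piece, True))  # S1: term kept as one unit
            else:
                # S2: segment the text between terms.
                tokens.extend((w, False) for w in segment(piece))
        for i, (_, is_term) in enumerate(tokens):
            if not is_term:
                continue
            # S3: collect the predecessor and successor of each term.
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    word, neighbor_is_term = tokens[j]
                    # S4: terms themselves are not delimiter candidates.
                    if not neighbor_is_term and word not in lexicon:
                        counts[word] += 1
    # S5: rank by frequency and keep the top n_di candidates.
    return [word for word, _ in counts.most_common(n_di)]


delimiters = dlist_ext(
    ["社会主义制度是中华人民共和国的根本制度"],
    {"社会主义制度", "中华人民共和国", "根本制度"},
    n_di=2,
)
```

In this toy run the words found next to the three lexicon terms are 是 and 的, which is the kind of functional word the approach expects to act as a delimiter.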
13 Experiments: Data Preparation Delimiter lists: DList_IT, extracted using Corpus_IT_Small and Lexicon_IT; DList_Legal, extracted using Corpus_Legal_Small and Lexicon_Legal; DList_SW, 494 general stop words.
14 Performance Measurements Evaluation: precision (by sampling) and rate of NTE. Reference algorithms: SEF&C-value (Ji et al., 2007) for term candidate extraction; TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification; LA_TV (Link Analysis based – Terminology Verification) for fair comparison.
15 Evaluation: DList_Ext algorithm: N_DI
Coverage of Delimiters on Different Corpora
Delimiter list         Corpus_Legal_Large (11,048 sentences)   Corpus_IT_Large (60,508 sentences)
DList_IT (Top100)      77.6%                                   89.1%
DList_IT (Top300)      84.6%                                   92.6%
DList_IT (Top500)      90.3%                                   93.4%
DList_IT (Top700)      92.7%                                   93.9%
DList_Legal (Top100)   95.8%                                   92.6%
DList_Legal (Top300)   97.8%                                   96.2%
DList_Legal (Top500)   98.7%                                   96.8%
DList_Legal (Top700)   99.1%                                   97.1%
DList_SW: 98.1%
16 Evaluation: DList_Ext algorithm: N_DI [Figure: Frequency of Delimiters on Domain Corpora]
17 Evaluation: DList_Ext algorithm: N_DI [Figures: Performance of DList_IT on Corpus_IT_Large; Performance of DList_Legal on Corpus_IT_Large]
18 N_DI = 500 [Figures: Performance of DList_IT on Corpus_Legal_Large; Performance of DList_Legal on Corpus_Legal_Large]
19 Evaluation on Term Extraction [Figure: Performance of Different Algorithms on IT Domain and Legal Domain]
20 Performance Analysis Domain-independent and stable delimiters: easy to extract and useful. Larger granularity of domain-specific terms: keeps many noisy strings out. Less sensitivity to frequency: the method concentrates on delimiters without regard to the frequencies of the candidates.
21 Evaluation on New Term Extraction: R_NTE [Figure: Performance of Different Algorithms for New Term Extraction]
22 Error Analysis Figure-of-speech phrases: “不难看出” (it is not difficult to see that ...), “新方法中” (in the new methods). General words: “思维状态” (mental state), “建筑” (architecture). Long strings that contain short terms: “访问共享资源” (access shared resources), “再次遍历” (traverse again).
23 Conclusion A delimiter-based approach for term candidate extraction. Advantages: less sensitivity to term frequency; requires little prior domain knowledge and relatively little adaptation for new domains; significant improvements for term extraction; much better performance for new term extraction. Future work: improving the overall term extraction algorithms; applying the approach to related NLP tasks such as NER; applying it to other languages.
24 Thank You ! Q & A