Automatic Extraction of Hierarchical Relations from Text Ting Wang1, Yaoyong Li2, Kalina Bontcheva2, Hamish Cunningham2, Ji Wang1 Presented by Valentin Tablan2 1National University of Defense Technology, Changsha, China 2University of Sheffield, Sheffield, UK http://gate.ac.uk/ http://nlp.shef.ac.uk/
Outline Relation extraction and machine learning. ACE04 corpus for relation extraction. Experiments using the SVM and a variety of NLP features. Experimental results. ESWC 2006, Budva, Montenegro
Relation Extraction Two key stages for Information Extraction (IE): identify the instances of information entities; recognise the relations among the entity instances. Typical relations among named entities: Physical: located, near, part-whole. Personal or social: business, family. Membership: employer-staff, member-of-group. …
Machine Learning for Relation Extraction A number of learning algorithms have been applied to relation extraction, e.g. HMM, CRF, Maximum Entropy models, and SVM. Machine learning based systems can be ported to other domains relatively easily. SVM using rich features has achieved good results for relation extraction.
Main Contributions of This Paper Apply the SVM to extract relations organised in a hierarchy. Investigate a variety of general NLP features for relation extraction. Evaluate several SVM kernel types for relation extraction.
The ACE04 Corpus The ACE04 corpus was used in our experiments. It contains 452 documents annotated with named entity mentions and a set of relations among the entity mentions. There are 7 relation types and 23 subtypes. 5914 relation instances.
ACE04 Entity Hierarchy (type: subtypes)
Person: (none)
Organization: Government, Commercial, Educational, Non-profit, …
Facility: Plant, Building, Subarea_Building, Bounded_Area, …
Location: Address, Boundary, Celestial, Water_Body, Land_Region_Natural, …
Geo-Political Entity: Continent, Nation, State/Province, County/District, …
Vehicle: Air, Land, Water, Subarea_Vehicle, Other
Weapon: Blunt, Exploding, Sharp, Chemical, Biological, …
ACE04 Relation Hierarchy (type: subtypes)
Physical: Located, Near*, Part-Whole
Personal/Social: Business*, Family*, Other*
Employment/Membership/Subsidiary: Employ-Exec, Employ-Staff, Employ-Undetermined, Member-of-Group, Subsidiary, Partner*, Other*
Agent-Artifact: User/Owner, Inventor/Manufacturer, Other
PER/ORG Affiliation: Ethnic, Ideology, Other
GPE Affiliation: Citizen/Resident, Based-In, Other
Discourse: (none)
Three Levels in Relation Hierarchy
Relation Extraction Experiments Extract the pre-defined relations among the named entity mentions. Use the “true” named entity mentions and their types from the ACE04 corpus. Cast as a multi-class classification problem: decide whether any two entity mentions in a sentence are linked by one of the predefined relations.
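This framing can be sketched as follows: every pair of entity mentions within a sentence becomes one classification example, labelled either with its relation label or with a "NONE" class. The mention and relation representations here are hypothetical simplifications for illustration, not the actual ACE04 data format.

```python
from itertools import combinations

def classification_examples(sentence_mentions, gold_relations):
    """Frame relation extraction as multi-class classification:
    every pair of entity mentions in the same sentence becomes one
    example, labelled with its relation or with 'NONE'.
    gold_relations maps a mention-id pair to a relation label
    (a hypothetical stand-in for the ACE04 annotations)."""
    examples = []
    for m1, m2 in combinations(sentence_mentions, 2):
        label = gold_relations.get((m1, m2), "NONE")
        examples.append(((m1, m2), label))
    return examples
```

Pairs with no annotated relation form the (large) negative class, which is why precision and recall rather than plain accuracy are the natural evaluation measures.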
Using SVM for Relation Extraction SVM is a popular learning algorithm. We used the one-against-one method: learn k(k-1)/2 binary SVM classifiers for the k-class classification problem. The max-win (majority voting) method determines the class of a test example.
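A minimal sketch of one-against-one classification with max-win voting, using stub binary classifiers in place of trained SVMs (the decision rule inside each stub is purely illustrative):

```python
from itertools import combinations
from collections import Counter

def make_pairwise_classifier(a, b):
    """Hypothetical stand-in for a trained binary SVM that
    separates class a from class b. The even/odd rule below is
    an illustrative stub, not a real decision function."""
    def classify(x):
        return a if sum(x) % 2 == 0 else b
    return classify

# k = 3 classes -> k(k-1)/2 = 3 pairwise classifiers
classes = ["Located", "Near", "Part-Whole"]
classifiers = {(a, b): make_pairwise_classifier(a, b)
               for a, b in combinations(classes, 2)}

def max_win(x):
    """One-against-one prediction: every binary classifier casts a
    vote, and the class with the most votes wins."""
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]
```

With trained SVMs in place of the stubs, this is the same multi-class strategy LIBSVM applies internally.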
NLP Features for Relation Extraction For one example (i.e. a pair of entity mentions in the same sentence), the features were derived from the two mentions and the neighbouring words in the sentence. We used a variety of NLP features, produced by GATE and third-party software (GATE plugins). All the features used were independent of the data’s domain. A total of 94 features.
Simple Features Word features: tokens of the two mentions and those surrounding them. POS tag features: POS tags of the words used. Entity features provided in the ACE04 corpus, including the types and subtypes of the two mentions. Overlap features: the relative position of the two mentions.
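As an illustration, the simple feature groups above could be collected into a sparse feature map roughly like this (the mention representation and feature names are assumptions for the sketch, not the paper's actual encoding):

```python
def extract_simple_features(sent_tokens, sent_pos, m1, m2):
    """Sketch of the simple feature groups: words, POS tags, entity
    type/subtype, and overlap (relative position). m1/m2 are
    hypothetical mention dicts with 'start', 'end' (token offsets),
    'type', and 'subtype' keys."""
    feats = {}
    for label, m in (("m1", m1), ("m2", m2)):
        # word and POS features of the mention's tokens
        for i in range(m["start"], m["end"]):
            feats[f"{label}_word_{sent_tokens[i]}"] = 1
            feats[f"{label}_pos_{sent_pos[i]}"] = 1
        # entity features from the corpus annotations
        feats[f"{label}_type_{m['type']}"] = 1
        feats[f"{label}_subtype_{m['subtype']}"] = 1
    # overlap feature: relative position of the two mentions
    if m1["end"] <= m2["start"]:
        feats["m1_before_m2"] = 1
    elif m2["end"] <= m1["start"]:
        feats["m2_before_m1"] = 1
    else:
        feats["mentions_overlap"] = 1
    return feats
```

Such a sparse binary map converts directly into the vector format LIBSVM expects.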
Syntactic Features Chunk features: whether the two mentions are in the same NP or VP chunk. Dependency features obtained from MiniPar, capturing the dependency relationships between the two mentions. Parse tree features obtained from BuChart: the lowest phrase labels of the two mentions, and the phrase labels on the path connecting the two mentions in the parse tree.
Semantic Features From the SQLF produced by BuChart: paths of semantic labels from the predicate to the heads of the two mentions; semantic labels of some important predicates. From WordNet: the synsets of the two entity mentions and the words surrounding them; for a word, pick the most important synset whose POS is compatible with the word’s.
Experimental Settings Use the SVM package LIBSVM for training and testing. Do 5-fold cross-validation on the ACE04 data for each experiment. Adopt Precision, Recall and F1 as evaluation measures.
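The evaluation setup can be sketched as follows: a 5-fold partition of the examples, plus the standard measures Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1 = 2PR/(P+R) computed from true-positive, false-positive, and false-negative counts. The round-robin fold assignment below is an illustrative choice, not necessarily the split used in the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard evaluation measures computed from counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def five_fold_splits(n_examples, k=5):
    """Partition example indices into k folds; each fold serves once
    as the test set while the remaining folds form the training set."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Scores reported per fold are then averaged over the five runs.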
Contribution of Individual Features
Features      Precision (%)  Recall (%)  F1 (%)
Word          57.90          23.48       33.38
+POS Tag      57.63          26.91       36.66
+Entity       60.03          44.47       50.49
+Mention      61.03          45.60       52.03
+Overlap      60.51          48.01       53.52
+Chunk        61.46          48.46       54.19
+Dependency   63.07          48.26       54.67
+Parse Tree   63.57          48.58       55.06
+SQLF         63.74          48.92       55.34
+WordNet      67.53          48.98       56.78
Discussion Every feature we used made some contribution. Most features improved the recall of the results. Words alone achieve an F1 of 0.33. The entity mentions’ types and subtypes are the most useful features. Some complicated features, such as chunk, dependency, parse tree, and SQLF, were not as useful as we expected.
Results for Different Kernel Types
Kernel     Precision (%)  Recall (%)  F1 (%)
Linear     66.41          49.18       56.50
Quadratic  68.96          46.20       55.33
Cubic      71.31          42.39       53.17
The linear kernel had the best F1, though not significantly better than the quadratic kernel. The linear kernel is also the fastest.
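The three kernel types compared above differ only in how the similarity of two feature vectors is computed. A sketch (note: LIBSVM's polynomial kernel also scales the dot product by a gamma factor and defaults coef0 to 0; the coef0 = 1 below is an illustrative choice, not necessarily the setting used in the experiments):

```python
def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, degree, coef0=1.0):
    """Polynomial kernel (x . y + coef0)^degree. degree=2 gives the
    quadratic kernel and degree=3 the cubic kernel from the table."""
    return (linear_kernel(x, y) + coef0) ** degree
```

Higher-degree kernels implicitly model feature conjunctions, which here raised precision but cost recall, leaving the linear kernel with the best F1 overall.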
Impact of the Hierarchy
Hierarchical level                   Precision (%)  Recall (%)  F1 (%)
Relation detection (1 class)         73.87          69.50       71.59
Type classification (7 classes)      71.41          60.03       65.20
Subtype classification (23 classes)  67.53          48.98       56.78
Results on the Subtypes of EMP-ORG
Subtype              #examples  Precision (%)  Recall (%)  F1 (%)
Employ-Exec          630        71.37          63.90       67.16
Employ-Undetermined  129        68.76          43.23       51.20
Employ-Staff         694        64.39          60.97       62.25
Member-of-Group      225        62.16          38.55       46.85
Subsidiary           300        83.81          65.29       72.79
Partner              16         –              –           0.00
Other                90         33.33          5.89        9.90
Discussion Results differed across subtypes, with a strong connection to the number of training examples. Subtype recognition scored lower than type recognition: recognising subtypes is more difficult than types (subtypes are less distinct from one another than types are). Deeper features help more with fine-grained distinctions.
Conclusions Investigated SVM-based classification for relation extraction. Explored a diverse set of NLP features. The results obtained were encouraging. Future work: experiments extracting relations based on the named entities recognised automatically by the IE system; evaluating the system on large-scale ontologies and data.
Thank you! More information: http://gate.ac.uk http://nlp.shef.ac.uk