An Integrated Approach for Relation Extraction from Wikipedia Texts
Yulan Yan, Yutaka Matsuo, Mitsuru Ishizuka
The University of Tokyo
WWW 2009

2 Abstract
Relation extraction from Wikipedia texts
- A novel distance function
- A linear clustering algorithm
Wikipedia texts
- High-quality texts
- Heavily cross-linked articles
- Sentence -> Dependency tree
Web texts
- Frequency information
- Relation terms
- Sentence -> Surface pattern
Experiments on two different domains
- American chief executives
- Companies

3 Problem definition
Relation extraction between the article's entitled concept (ec) and one of its related concepts (rc)
There is a salient semantic relation r between p and p' ∈ l(p)

4 Problem definition
Concept pairs
- (Eric E. Schmidt, Google)
- (Eric E. Schmidt, Compiler)
- (Eric E. Schmidt, Atherton, California)
- …
- (Bill Gates, Microsoft)
- …
Clustering
Evaluation

5 Overview of the Approach
Text preprocessor
- Concept pair collection
- Sentence filtering
Web context collector
- A set of ranked relational terms
- A set of surface patterns
Dependency pattern modeling
- Linguistic information
Linear clustering algorithm
- Local clustering
- Global clustering

6 1. Text Preprocessor - Relation Candidate Generation
Process Wikipedia article texts to get relation candidates and their corresponding sentences.
- Treat all hyperlinked concepts in the article as related concepts, which may share a semantic relationship with the entitled concept
- Form concept pairs
- Apply a linguistic parser to split the article text into sentences for the dependency pattern modeling module
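The candidate generation step above can be sketched as follows. This is a minimal illustration, not the authors' preprocessor: it assumes the article is available as raw wikitext with `[[Target]]` / `[[Target|label]]` links, uses a simplified regular expression, and replaces the linguistic parser with a naive punctuation-based sentence split. The function names and the sample text are hypothetical.

```python
import re

# Simplified wiki-link matcher (assumption): [[Target]], [[Target#Section]], [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def concept_pairs(title, wikitext):
    """Pair the entitled concept with every distinct hyperlinked concept."""
    seen, pairs = set(), []
    for match in WIKILINK.finditer(wikitext):
        rc = match.group(1).strip()
        if rc and rc != title and rc not in seen:
            seen.add(rc)
            pairs.append((title, rc))
    return pairs

def sentences_mentioning(text, rc):
    """Naive split after ., ! or ? (a real parser would handle abbreviations)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in parts if rc in s]

# Toy article text, written only for this example.
article = ("He is the CEO of [[Google]]. "
           "He lives in [[Atherton, California]]. "
           "He mentions [[Google|the company]] often.")

pairs = concept_pairs("Eric E. Schmidt", article)
plain = WIKILINK.sub(lambda m: m.group(1), article)  # replace each link by its target
hits = sentences_mentioning(plain, "Atherton, California")
```

Repeated links to the same target yield a single concept pair, so each (ec, rc) candidate appears once per article.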

7 2. Web Context Collection
Querying the Web with a concept pair
Hypothesis
- The Web contains key terms and patterns that provide clues to the relation the concept pair holds
Two kinds of relational information
- A set of ranked relational terms as keywords
- A set of surface patterns

8 2. Web Context Collection - Relational Term Ranking (1/2)
Collect relational terms as indicators for each concept pair
- Verbs and nouns, such as "CEO", "founder"
Entropy-based feature ranking algorithm
- Chen et al., 2005 (IJCNLP)
After the ranking
- The relational term list T_cp is sorted by term rank
- A keyword k_cp is selected as a term that appears both in the term list T_cp and in the corresponding Wikipedia sentence

9 Entropy-based Feature Ranking
- J. Chen, D. Ji, C.L. Tan, and Z. Niu. 2005. Unsupervised Feature Selection for Relation Extraction. In Proceedings of IJCNLP-2005.
Local context vectors of co-occurrences of the entity pair E_1 and E_2: P = {p_1, p_2, …, p_N}
The words occurring in P: W = {w_1, w_2, …, w_M}
Goal: select a subset of important features from W
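The slide does not give the ranking formula, so the sketch below is an assumption: it uses a similarity-based entropy of the kind commonly used for unsupervised feature ranking, where the similarity of two context vectors decays exponentially with their distance, and a word counts as more important when removing it leaves the data more disordered. The function names, the toy vectors, and the choice of Euclidean distance are illustrative only. It requires Python 3.8+ for `math.dist`.

```python
import math

def dataset_entropy(vectors):
    """Similarity-based entropy of a set of context vectors (assumed formulation)."""
    n = len(vectors)
    dists = [math.dist(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists) if dists else 0.0
    if mean == 0.0:
        return 0.0  # all points identical: nothing to measure
    alpha = math.log(2) / mean  # similarity equals 0.5 at the mean distance
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        for p in (s, 1.0 - s):
            if 0.0 < p < 1.0:  # skip terms where p*log(p) is taken as 0
                total -= p * math.log(p)
    return total

def rank_features(vectors, words):
    """Rank words by the entropy left after removing each one (higher first)."""
    scores = {}
    for k, word in enumerate(words):
        reduced = [v[:k] + v[k + 1:] for v in vectors]  # drop column k
        scores[word] = dataset_entropy(reduced)
    return sorted(words, key=lambda w: scores[w], reverse=True)

# Toy data: W has M = 3 words, P has N = 4 binary context vectors.
W = ["ceo", "founder", "the"]
P = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (0, 1, 0)]
ranked = rank_features(P, W)
```

Each call to `dataset_entropy` is quadratic in N, so ranking all M words costs on the order of M·N² distance computations in this naive form.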

10 2. Web Context Collection - Surface Pattern Generation (2/2)
Content words (CWs)
- ec (entitled concept), rc (related concept), keyword k_cp
Function words
Bag of words: used to look for verbs, nouns, and coordinating conjunctions
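One way to picture a surface pattern built from these pieces is sketched below. The slide does not spell out the generation procedure, so everything here is an assumption: concept mentions are taken to be pre-merged into single tokens, the function-word list is a small hypothetical stand-in, and all other tokens are collapsed into a `*` wildcard.

```python
# Hypothetical, deliberately small function-word list.
FUNCTION_WORDS = {"of", "the", "a", "an", "is", "was", "at", "in", "by", "and"}

def surface_pattern(tokens, ec, rc, keyword):
    """Keep ec/rc placeholders, the keyword, and function words; wildcard the rest."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if tok == ec:
            item = "ec"
        elif tok == rc:
            item = "rc"
        elif low == keyword or low in FUNCTION_WORDS:
            item = low
        else:
            item = "*"
        if item == "*" and out and out[-1] == "*":
            continue  # merge runs of wildcards into one
        out.append(item)
    return " ".join(out)

# Toy sentence, written only for this example.
tokens = ["Eric_Schmidt", "is", "the", "current", "CEO", "of", "Google"]
pattern = surface_pattern(tokens, ec="Eric_Schmidt", rc="Google", keyword="ceo")
# pattern == "ec is the * ceo of rc"
```

Replacing the concrete names by `ec` and `rc` is what lets patterns from different concept pairs be compared with each other.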

11 3. Dependency Pattern Modeling
Dependency patterns for relation clustering
- Select sentences that contain the entitled concept and one of the related concepts
- Parse them into dependency structures
- R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.
- M. Zhang, J. Zhang, J. Su and G. Zhou. 2006. A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features. In Proceedings of ACL-2006.
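The shortest-path idea from the cited Bunescu and Mooney work can be illustrated with a small sketch. It is not the authors' pattern extractor: the dependency edges are written by hand instead of coming from a parser, the relation labels are assumed, and the arrow suffixes that mark edge direction are a convention invented for this example.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """BFS over an undirected view of (head, relation, dependent) edges."""
    graph = {}
    for head, rel, dep in edges:
        graph.setdefault(head, []).append((dep, rel + ">"))   # head to dependent
        graph.setdefault(dep, []).append((head, "<" + rel))   # dependent to head
    queue = deque([(source, [source])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, label in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None  # the two words are not connected

# Hand-written edges for the toy sentence "Schmidt is the CEO of Google".
edges = [
    ("CEO", "nsubj", "Schmidt"),
    ("CEO", "cop", "is"),
    ("CEO", "det", "the"),
    ("CEO", "prep_of", "Google"),
]
path = shortest_dependency_path(edges, "Schmidt", "Google")
# path == ["Schmidt", "<nsubj", "CEO", "prep_of>", "Google"]
```

The path skips "is" and "the", which is the appeal of dependency patterns over surface word order: only the words that link the two concepts remain.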

12 4. Linear Clustering Algorithm - Distance Function & Centroid Selection (1/2)
All concept pairs are grouped by their keywords t_cp
- Let G = {G_1, G_2, …, G_n}, where the pairs in G_i = {cp_i1, cp_i2, …} share the same keyword t_cp
- A centroid c_i is selected for each group G_i
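Grouping by keyword and picking one representative per group can be sketched as below. The slide does not state the centroid criterion, so this sketch assumes a medoid (the member with the smallest total distance to the other members) and a toy Jaccard distance over hypothetical pattern sets; the "Person" and "Company" names are placeholders.

```python
from collections import defaultdict

def group_by_keyword(keyword_of):
    """Map each keyword to the list of concept pairs that carry it."""
    groups = defaultdict(list)
    for cp, keyword in keyword_of.items():
        groups[keyword].append(cp)
    return dict(groups)

def select_centroid(group, distance):
    """Assumed criterion: the medoid, i.e. the smallest total distance to the group."""
    return min(group, key=lambda c: sum(distance(c, other) for other in group))

# Hypothetical pattern sets per concept pair.
patterns = {
    ("Person A", "Company X"): {"ec ceo of rc", "rc ceo ec"},
    ("Person B", "Company Y"): {"ec ceo of rc"},
    ("Person C", "Company Z"): {"ec ceo of rc", "ec * rc"},
    ("Bill Gates", "Microsoft"): {"ec founder of rc"},
}
keyword_of = {
    ("Person A", "Company X"): "ceo",
    ("Person B", "Company Y"): "ceo",
    ("Person C", "Company Z"): "ceo",
    ("Bill Gates", "Microsoft"): "founder",
}

def jaccard_distance(cp1, cp2):
    a, b = patterns[cp1], patterns[cp2]
    return 1.0 - len(a & b) / len(a | b)

groups = group_by_keyword(keyword_of)
centroids = {kw: select_centroid(g, jaccard_distance) for kw, g in groups.items()}
# centroids["ceo"] == ("Person B", "Company Y")
```

In the "ceo" group, the pair whose only pattern is shared by both other members has the smallest total distance, so it becomes the representative.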

13 4. Linear Clustering Algorithm - Distance Function & Centroid Selection (2/2)
Cost function cost(sp_1i, sp_2j)
- B. Rosenfeld and R. Feldman. 2006. URES: an Unsupervised Web Relation Extraction System. In Proceedings of COLING/ACL-2006.
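The slide names a cost function between two surface patterns but does not define it. As a stand-in, the sketch below assumes a token-level edit distance with unit costs, and assumes that the distance between two concept pairs is the cheapest alignment over all of their pattern pairs. Both choices are illustrative, not the definition used in the paper or in URES.

```python
def cost(sp1, sp2):
    """Token-level edit distance between two surface patterns (assumed unit costs)."""
    a, b = sp1.split(), sp2.split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                      # delete tok_a
                           cur[j - 1] + 1,                   # insert tok_b
                           prev[j - 1] + (tok_a != tok_b)))  # match or substitute
        prev = cur
    return prev[-1]

def pair_distance(patterns1, patterns2):
    """Assumed aggregation: cheapest pattern pair (both collections must be non-empty)."""
    return min(cost(p, q) for p in patterns1 for q in patterns2)

d = cost("ec is the ceo of rc", "ec , ceo of rc")  # one substitution plus one deletion
closest = pair_distance(["ec is the ceo of rc", "rc ceo ec"],
                        ["ec , ceo of rc"])
```

A weighted variant could charge less for mismatched function words than for mismatched content words; the unit costs here keep the example short.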

14 4. Linear Clustering Algorithm - Local Dependency Pattern Clustering

15 4. Linear Clustering Algorithm - Local Dependency Pattern Clustering

16 4. Linear Clustering Algorithm - Global Surface Pattern Clustering
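The clustering slides survive only as titles, so the following is a generic illustration of why assignment to fixed centroids is linear in the number of concept pairs, not a reconstruction of the local or global step. The threshold, the Jaccard distance, and the pattern sets are all assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Distance between two pattern sets (assumed measure)."""
    return 1.0 - len(a & b) / len(a | b)

def assign_to_centroids(items, centroids, distance, threshold):
    """Single pass: each item goes to its nearest centroid if it is close enough."""
    clusters = {name: [] for name in centroids}
    unassigned = []
    for item in items:
        name, best = min(((n, distance(item, c)) for n, c in centroids.items()),
                         key=lambda pair: pair[1])
        if best <= threshold:
            clusters[name].append(item)
        else:
            unassigned.append(item)
    return clusters, unassigned

# Hypothetical centroids and items, each represented by a set of patterns.
centroids = {
    "ceo": {"ec ceo of rc"},
    "founder": {"ec founder of rc", "rc founded by ec"},
}
items = [
    {"ec ceo of rc", "ec * rc"},
    {"rc founded by ec"},
    {"ec born in rc"},
]
clusters, unassigned = assign_to_centroids(items, centroids, jaccard_distance, 0.6)
```

With a fixed number of centroids, every item is compared against each centroid exactly once, so the total work grows linearly with the number of items.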

17 Experiments
Wikipedia dump of 03/12/2008
Two categories
- American chief executives: 526 articles, 7310 concept pairs; 1/3 and 1/3 for D_l and D_g; 18 groups
- Companies: 434 articles, 4935 concept pairs; 1/3 and 1/3 for D_l and D_g; 28 groups
Compared with
- B. Rosenfeld and R. Feldman. 2007. Clustering for Unsupervised Relation Identification. In Proceedings of CIKM-2007.
- Surface feature

18 Experiments

21 Experiments

22 Conclusions
- A novel distance function
- A linear clustering algorithm
- Combination of two kinds of patterns: dependency patterns and surface patterns
References
- J. Chen, D. Ji, C.L. Tan, and Z. Niu. 2005. Unsupervised Feature Selection for Relation Extraction. In Proceedings of IJCNLP-2005.
- R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.
- M. Zhang, J. Zhang, J. Su and G. Zhou. 2006. A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features. In Proceedings of ACL-2006.
- B. Rosenfeld and R. Feldman. 2006. URES: an Unsupervised Web Relation Extraction System. In Proceedings of COLING/ACL-2006.
