Impact of different relation extraction methods on network analysis results
Jana Diesner

Motivation
Need: scalable, reliable, robust methods & tools
Text Data
– Unstructured
– At any scale
Network Data
– Network analysis
Applications
– Answer substantive and graph-theoretic questions
– Develop and test hypotheses and theories
– Visualizations
– Populate databases
– Input to further computations, e.g. simulations, machine learning

Research Questions and Relevance
How do network data and analysis results obtained by using different relation extraction methods compare to each other?
Why does it matter?
– Increased comparability, generalizability, and transparency of methods and tools
– Increased control and power for developers and users
– Supports drawing reasonable and valid conclusions

Relation Extraction Methods
Methods compared:
– Text, manual (TextM)
– Text, automated (TextA)
– Meta-data (META)
– Subject Matter Experts (SME)
Steps named in the diagram:
– Codebook
– Proximity-based linkage of nodes
– Database query

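The slides name proximity-based linkage of nodes but do not spell out the procedure. A minimal sketch of one common reading follows: two entities are linked whenever they occur within a fixed token window of each other. The function name `proximity_links`, the `window` parameter, and the example sentence are all hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def proximity_links(tokens, entities, window=5):
    """Count links between distinct entities that are fewer than `window` tokens apart.

    Assumption: entities are single tokens that have already been identified;
    links are undirected and weighted by the number of co-occurrences.
    """
    # Positions of every entity mention in the token stream.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    links = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        # Skip self-links and pairs that are too far apart.
        if a != b and j - i < window:
            links[tuple(sorted((a, b)))] += 1
    return links


# Made-up sentence, used only to show the mechanics.
tokens = "Alice met Bob in Paris before Bob left for Rome".split()
entities = {"Alice", "Bob", "Paris", "Rome"}
links = proximity_links(tokens, entities, window=4)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

A larger window yields denser networks, so the window size is one of the choices that can change downstream analysis results.
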
Data

                      Sudan Corpus      Funding Corpus                  Enron Corpus
Genre                 Newswire          Scientific writing              Emails
Size                  80,000 articles   56,000 proposals                53,000 emails
Source                LexisNexis        Cordis                          FERC/SEC
Time span             8 years           22 years                        4 years
Text-based networks   Article bodies    Project description bodies
Meta-data network     Index terms       Index terms and collaborators   Email headers

Large-scale, over-time, open source data from different domains

Results I
1. Text automated vs. manual: the total number of nodes of sub-type “generic” is far higher than of sub-type “specific”
   – Rethink focus of network analysis: collectives vs. individuals
   – Importance of detecting unnamed entities
2. Ground truth data (SME) are hardly resembled by networks built from text bodies, and not at all by meta-data networks
   – In the best case, 50% of nodes and 20% of links match
3. Agreement in structure and key entities depends on the type of network

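The slides report node and link agreement with the SME ground truth without stating how it was computed. One plausible reading, sketched below, is the share of ground-truth nodes (or links) that also appear in the extracted network. The function names and the toy networks are hypothetical; the toy data were built by hand so that the arithmetic is easy to check and are not the study's data.

```python
def overlap(reference, extracted):
    """Fraction of reference items that also occur in the extracted set."""
    reference, extracted = set(reference), set(extracted)
    if not reference:
        return 0.0
    return len(reference & extracted) / len(reference)


def undirected(links):
    """Treat (a, b) and (b, a) as the same link."""
    return {frozenset(link) for link in links}


# Hand-made toy networks (assumption: undirected, unweighted).
sme_nodes = {"A", "B", "C", "D"}
sme_links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("A", "C")]
text_nodes = {"A", "B", "E"}
text_links = [("B", "A"), ("A", "E")]

node_overlap = overlap(sme_nodes, text_nodes)  # 2 of 4 SME nodes recovered
link_overlap = overlap(undirected(sme_links), undirected(text_links))  # 1 of 5 SME links recovered
print(node_overlap, link_overlap)
```

Measuring against the ground truth only (rather than the union of both networks) ignores extra nodes such as "E"; a Jaccard-style measure would penalize them.
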
Results II
3. Agreement among text-based networks, and between text-based and meta-data networks, depends on the type of network

Social networks
  Text-based networks:
  – Substantial overlap between manual and automated, esp. w.r.t. key players
  – Localized view on geo-political entities and culture
  Meta-data network:
  – Major international key players
  – Small overlap in key entities with text-based networks
Knowledge networks
  Text-based networks:
  – Gist of information in terms of common sense entities
  – Minimal overlap between manual and automated
  Meta-data network:
  – Seem more informative (mini-summaries)
  – Fewer coreference resolution issues
  – Minimal overlap with text-based networks

For a more complete view, combine automated text-based networks with the meta-data network
