Generating and Characterizing Gold-Standard Entity Summaries: A Study of DBpedia
Danyun Xu, Gong Cheng*, Yuzhong Qu
Outline: Introduction; Related Work; Data Set; Generating Gold-Standard Entity Summaries; Characterizing Gold-Standard Entity Summaries; Conclusion
Introduction. Why: entity-centric structured data (Google’s Knowledge Graph); entity summarization; evaluation lacks gold-standard entity summaries. What: present and evaluate several algorithms for automatically generating (near-)gold-standard summaries; characterize the generated gold-standard summaries.
Related Work. Algorithms: rank properties; rank features. Evaluation: intrinsic method; extrinsic method.
Data Set. DBpedia: English version of DBpedia 3.7 (wiki.dbpedia.org/Downloads37); 42.3 million RDF triples; 3.77 million entities. Classes: 10 classes, almost pairwise disjoint.
Generating Gold-Standard Entity Summaries. Basic idea (extended abstracts): automatically identify the features of an entity that are mentioned in its textual abstract. Parts: Algorithms; Evaluation.
Generating Gold-Standard Entity Summaries: Algorithms. Preprocess: remove characters such as “;” … and words such as “the” …; split phrases (PopulatedPlace → Populated Place); lowercase; optional stemming. Identify: SEQ (a sequence); SET_ALL (a set, all the words); SET_ANY (a set, any word).
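The preprocessing steps and the three matching modes can be illustrated with a minimal Python sketch. It rests on assumptions not stated on the slide: the function names `preprocess` and `matches` are hypothetical, the stop-word list is a placeholder (the slide names only “the”), SEQ is read as "the feature's words appear contiguously and in order in the abstract", SET_ALL as "all of the feature's words appear somewhere in the abstract", and SET_ANY as "at least one of them appears".

```python
import re

# Assumed stop-word list; the slide only names "the" explicitly.
STOP_WORDS = {"the", "a", "an", "of"}

def preprocess(text, stem=None):
    """Turn a feature label or an abstract into a list of normalized words."""
    # Split camel-case phrases: "PopulatedPlace" -> "Populated Place".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Replace punctuation such as ";" with spaces, then lowercase and tokenize.
    words = re.sub(r"[^\w\s]", " ", text).lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    # Optional stemming: pass any callable that maps a word to its stem.
    if stem is not None:
        words = [stem(w) for w in words]
    return words

def matches(feature_words, abstract_words, mode):
    """Decide whether a feature is mentioned in an abstract under one mode."""
    if not feature_words:
        return False
    if mode == "SEQ":
        # The feature words must occur as one contiguous, ordered run.
        n = len(feature_words)
        return any(abstract_words[i:i + n] == feature_words
                   for i in range(len(abstract_words) - n + 1))
    vocabulary = set(abstract_words)
    if mode == "SET_ALL":
        return all(w in vocabulary for w in feature_words)
    if mode == "SET_ANY":
        return any(w in vocabulary for w in feature_words)
    raise ValueError(f"unknown mode: {mode}")

abstract = preprocess("Nanjing is a populated place; the capital of Jiangsu.")
print(matches(preprocess("PopulatedPlace"), abstract, "SEQ"))     # True
print(matches(preprocess("capital place"), abstract, "SEQ"))      # False
print(matches(preprocess("capital place"), abstract, "SET_ALL"))  # True
print(matches(preprocess("capital city"), abstract, "SET_ANY"))   # True
```

Under this reading the modes grow progressively more permissive: any SEQ match is also a SET_ALL match, and any SET_ALL match is also a SET_ANY match.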
Generating Gold-Standard Entity Summaries: Evaluation. 10 entities from each class, each entity having more than 10 features; gold-standard entity summaries constructed manually.
Characterizing Gold-Standard Entity Summaries: Lengths; Preference for Properties; Preference for Diverse Properties; Preference for Property Pairs; Preference for Property Values.
Characterizing Gold-Standard Entity Summaries: Length. Set maximum length; length varies widely; ratio in a narrower range.
Characterizing Gold-Standard Entity Summaries: Preference for Properties. Name length; Popularity; Variety.
Characterizing Gold-Standard Entity Summaries: Preference for Properties. Name length: properties with short names are preferable.
Characterizing Gold-Standard Entity Summaries: Preference for Properties. Popularity, measured on the Web (Bing) and in the data set: properties frequently seen in the data set are considerably preferable, whereas Web-based popularity of a property does not seem to be a strong indicator of preference. [Figure panels: Data set; Web]
Characterizing Gold-Standard Entity Summaries: Preference for Properties. Variety (“familyName” vs. “gender”): the variety and popularity of a property in the data set are equally effective indicators of preference.
Characterizing Gold-Standard Entity Summaries: Preference for Diverse Properties. Diversify a summary. Two ratios: (1) the number of distinct properties in the summary / the number of distinct properties in the original description; (2) the number of distinct properties / the number of features in the summary. Gold-standard entity summaries are highly diversified.
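The two diversity ratios can be computed as in the following sketch. It assumes a feature is represented as a `(property, value)` pair and that both the summary and the description are non-empty. The function name `diversity_ratios` and the sample entity data are hypothetical and serve only as illustration.

```python
def diversity_ratios(summary, description):
    """Return the two diversity ratios for a summary of an entity description.

    Both arguments are non-empty lists of (property, value) features.
    """
    summary_props = {prop for prop, _ in summary}
    description_props = {prop for prop, _ in description}
    # Ratio 1: distinct properties in the summary over distinct properties
    # in the original description.
    coverage = len(summary_props) / len(description_props)
    # Ratio 2: distinct properties in the summary over the number of
    # features in the summary.
    distinctness = len(summary_props) / len(summary)
    return coverage, distinctness

# Made-up entity description with 5 features over 4 distinct properties.
description = [
    ("birthPlace", "Nanjing"),
    ("birthPlace", "China"),
    ("occupation", "Writer"),
    ("type", "Person"),
    ("familyName", "Xu"),
]
# Made-up summary with 3 features over 2 distinct properties.
summary = description[:3]

coverage, distinctness = diversity_ratios(summary, description)
print(coverage)      # 0.5
print(distinctness)  # 2 of 3 features carry a distinct property
```

A second ratio of 1.0 would mean that no property is repeated within the summary.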
Characterizing Gold-Standard Entity Summaries: Preference for Property Pairs. String similarity; Co-occurrence.
Characterizing Gold-Standard Entity Summaries: Preference for Property Pairs. String similarity: string similarity is not an effective indicator. Co-occurrence: a pair of properties that frequently co-occur in the data set also tend to be selected; the Web-based degree of co-occurrence is a notable indicator of preference.
Characterizing Gold-Standard Entity Summaries: Preference for Property Values. Informativeness: the results confirm the effectiveness of selecting rarely seen features into a summary.
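The slide does not define how informativeness is measured. One common choice, assumed here purely for illustration, is self-information: the negative logarithm of the fraction of entities that carry a feature, so that rarely seen features score higher. The function name `informativeness` and the counts below are hypothetical.

```python
import math
from collections import Counter

def informativeness(feature, feature_counts, num_entities):
    """Self-information of a feature (an assumed measure, not from the slides).

    feature_counts maps each (property, value) feature to the number of
    entities that have it; the feature must occur at least once.
    """
    count = feature_counts[feature]
    if count <= 0:
        raise ValueError("feature does not occur in the data set")
    return -math.log(count / num_entities)

# Made-up counts over a data set of 100 entities.
counts = Counter({
    ("type", "Person"): 50,        # common feature
    ("birthPlace", "Nanjing"): 2,  # rarely seen feature
})

common = informativeness(("type", "Person"), counts, 100)
rare = informativeness(("birthPlace", "Nanjing"), counts, 100)
print(rare > common)  # True: the rarer feature is more informative
```

Ranking an entity's features by such a score and taking the top ones is one way to favour rarely seen features when building a summary.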
Conclusion. Contribution. Shortage: the approach can hardly be applied to another data set that provides no textual abstract. Future work: optimize the algorithm to generate more natural summaries; explore other factors.
Thanks!