Bing-SF-IDF+: A Hybrid Semantics-Driven News Recommender

Michel Capelle (michelcapelle@gmail.com), Marnix Moerland (marnix.moerland@gmail.com), Frederik Hogenboom (fhogenboom@ese.eur.nl), Flavius Frasincar (frasincar@ese.eur.nl), Damir Vandic (vandic@ese.eur.nl)
Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

April 17, 2015
30th ACM/SIGAPP Symposium on Applied Computing (SAC 2015)
Introduction (1)

- Recommender systems provide users with items of interest from a potentially large set of items
- Three types of recommender systems: content-based, collaborative filtering, hybrid
- Content-based systems are often term-based
- Common measure: Term Frequency – Inverse Document Frequency (TF-IDF)

Speaker notes: In today's data-intensive world, most people experience (or suffer from) information overload. Recommender systems lend a hand in distinguishing between interesting and non-interesting products, news articles, etc. Recommendations can be made based on, for example, user preferences or characteristics, possibly captured in user profiles or derived from browsing behavior. There are three basic types of recommender systems: content-based recommenders, which recommend news items based on their content; collaborative filtering recommenders, which recommend news items by means of user similarity; and hybrid recommenders, which combine the previous two approaches. In this presentation, I focus on content-based recommender systems. Traditionally, these systems are term-based and hence operate on term frequencies. A commonly used measure is TF-IDF (Term Frequency – Inverse Document Frequency). User profiles describing users' interests, based on previously browsed items, can be translated into vectors of TF-IDF weights computed for every term in a document. With a measure like cosine similarity, one can then calculate how interesting a new item might be given a user profile.
Introduction (2)

- One could take semantics into account: Semantic Similarity (SS), Cosine Similarity (CS)
- SS recommenders are based on various similarity functions: Jiang & Conrath [1997], Leacock & Chodorow [1998], Lin [1998], Resnik [1995], Wu & Palmer [1994]

Speaker notes: However, TF-IDF was introduced in the 1980s, and since then people have come to believe that taking semantics into account is crucial for more accurate recommender systems. Semantics can be added in many ways, for instance through well-known similarity measures such as semantic similarity and cosine similarity.
Introduction (3)

- CS recommenders: concepts instead of terms → Concept Frequency – Inverse Document Frequency (CF-IDF):
  - Reduces noise caused by non-meaningful terms
  - Yields fewer terms to evaluate
  - Allows for semantic features, e.g., synonyms
  - Relies on a domain ontology
- Synsets instead of concepts → Synset Frequency – Inverse Document Frequency (SF-IDF):
  - Similar to CF-IDF
  - Relies on a semantic lexicon
  - Does not rely on a domain ontology

Speaker notes: Alternatively, one can use concepts instead of terms, which we tried a few years ago. This worked well, but it suffered from a drawback: dependence on a domain ontology. Therefore, we introduced a variant that does not rely on ontologies, but on a large semantic lexicon like WordNet.
Introduction (4)

- Current limitations w.r.t. named entities:
  - CF-IDF relies too much on domain ontologies
  - SF-IDF uses too generic semantic lexicons
- Hence, we introduce Bing-SF-IDF+:
  - Extends SF-IDF with semantic relations
  - Also accounts for named entities through Bing page counts
  - Domain-independent, as it does not rely on ontologies

Speaker notes: However, SF-IDF is not able to handle the often crucial named entities, as these are frequently absent from semantic lexicons. Therefore, we propose to take named entities into account, not through domain ontologies, but through Bing page counts. We additionally consider semantic relations to improve SF-IDF, and call our method (for obvious reasons) Bing-SF-IDF+.
Introduction (5)

- Implemented in Ceryx (a plug-in for Hermes [Frasincar et al., 2009], a news processing framework)
- What is the performance of semantic recommenders?
  - Bing-SF-IDF+ vs. SF-IDF
  - Bing-SF-IDF+ vs. TF-IDF
  - Bing-SF-IDF+ vs. SS

Speaker notes: We have implemented our new approach in Ceryx, a plug-in for the news processing framework we have been using for a couple of years now. This allowed us to compare the performance of Bing-SF-IDF+ against SF-IDF, but also against TF-IDF and the SS methods.
Framework: User Profile

- The user profile consists of all read news items
- Implicit preference for specific topics
Framework: Preprocessing

Before recommendations can be made, each news item is parsed by a pipeline of:
- Tokenizer
- Sentence splitter
- Lemmatizer
- Part-of-Speech tagger
- Named entity recognizer

A minimal sketch of such a pipeline appears below.
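To make the pipeline concrete, here is a minimal sketch using spaCy. This is an assumption-laden illustration only: the actual implementation (see the Ceryx slide) uses GATE components, the Stanford POS Tagger, the JAWS lemmatizer, and LingPipe for named entity recognition.

```python
import spacy

# A single spaCy model bundles the tokenizer, sentence splitter,
# lemmatizer, POS tagger, and named entity recognizer.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired Beats Electronics for 3 billion dollars.")

for sentence in doc.sents:                           # sentence splitting
    for token in sentence:                           # tokenization
        print(token.text, token.lemma_, token.pos_)  # lemma and POS tag

for entity in doc.ents:                              # named entities
    print(entity.text, entity.label_)
```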
Framework: Synsets

- We make use of the WordNet dictionary and word sense disambiguation (WSD)
- Each word has a set of senses, and each sense has a set of semantically equivalent synonyms (a synset):
  - Turkey: turkey, Meleagris gallopavo (animal); Turkey, Republic of Turkey (country); joker, turkey (annoying person); turkey, bomb, dud (failure)
  - Fly: fly, aviate, pilot (operate an airplane); flee, fly, take flight (run away)
- Synsets are linked using semantic pointers (relations): hypernym, hyponym, ...
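As an illustration, the "turkey" example can be reproduced with NLTK's WordNet interface (a sketch for exposition; Ceryx accesses WordNet through JAWS instead):

```python
from nltk.corpus import wordnet as wn

# Each sense of "turkey" is a synset: a set of synonymous lemmas.
for synset in wn.synsets('turkey'):
    print(synset.name(), synset.lemma_names(), '-', synset.definition())

# Synsets are linked by semantic pointers such as hypernymy.
bird = wn.synset('turkey.n.01')
print(bird.hypernyms())  # the more general synsets this sense points to
```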
Framework: TF-IDF

Term Frequency – Inverse Document Frequency:
- TF: the frequency of a term in a single document
- IDF: based on the number of documents in the collection that contain the term
- Score: TF × IDF
- Similarity: two vectors with TF-IDF scores for each term, one for a document and one for the user profile; the cosine of the angle between the vectors determines similarity
- Intuition: a term is characteristic when its TF is high and it occurs in few documents (i.e., its IDF is high)
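A minimal sketch of the weighting and the cosine similarity (illustrative Python only; Ceryx implements this in Java, and the sketch assumes each scored document is part of the corpus, so every document frequency is at least one):

```python
import math
from collections import Counter

def tf_idf(document, corpus):
    """Map each term in `document` (a list of terms) to its TF-IDF weight."""
    tf = Counter(document)
    n = len(corpus)
    return {
        term: count * math.log(n / sum(1 for doc in corpus if term in doc))
        for term, count in tf.items()
    }

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```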
Framework: SF-IDF

Synset Frequency – Inverse Document Frequency:
- SF: the frequency of a synset in a single document
- IDF: based on the number of documents in the collection that contain the synset
- Score: SF × IDF
- Similarity: two vectors with SF-IDF scores for each synset, one for a document and one for the user profile; the cosine of the angle between the vectors determines similarity
- Same intuition as TF-IDF: a synset is characteristic when its SF is high and it occurs in few documents
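SF-IDF applies the same weighting to disambiguated synsets instead of raw terms; a self-contained sketch, where `disambiguate` is a hypothetical stand-in for the WSD step (Ceryx uses Lesk WSD) mapping a text to a list of synset identifiers:

```python
import math
from collections import Counter

def sf_idf(text, corpus_texts, disambiguate):
    """SF-IDF weights: TF-IDF computed over disambiguated synsets."""
    synsets = disambiguate(text)                  # e.g. ['turkey.n.01', ...]
    corpus = [disambiguate(t) for t in corpus_texts]
    n = len(corpus)
    return {
        s: c * math.log(n / sum(1 for d in corpus if s in d))
        for s, c in Counter(synsets).items()
    }
```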
Framework: SF-IDF+

SF-IDF for synsets and their related synsets:
- SF: the frequency of a synset and its related synsets in a single document
- IDF: based on the number of documents in the collection that contain the synset and its related synsets
- Score: SF × IDF
- Similarity: two vectors with SF-IDF+ scores for each synset, one for a document and one for the user profile; the cosine of the angle between the vectors determines similarity
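The "+" is the expansion over semantic pointers: before the frequencies are counted, each disambiguated synset is extended with the synsets it is related to. A sketch with a few of WordNet's relation types (which relations to include is a design choice; see the paper for the set actually used):

```python
from nltk.corpus import wordnet as wn

def expand(synset):
    """A synset together with synsets reachable via a few semantic pointers."""
    return ([synset] + synset.hypernyms() + synset.hyponyms()
                     + synset.member_holonyms() + synset.part_meronyms())
```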
Framework: Bing

- Similarity: Point-Wise Mutual Information (PMI)
- Calculated for each pair of named entities in a document and in the user profile
- Based on: co-occurrences of the document and profile entities; occurrences of the document entity; occurrences of the profile entity
- Corrected for the number of indexed Web pages (~15bn)

Speaker notes: PMI(e_d, e_p) = log( (c(e_d, e_p) / N) / ((c(e_d) / N) × (c(e_p) / N)) ), where c(·) denotes the Bing page count of an entity (or entity pair), e_d is a document entity, e_p is a profile entity, and N is the number of indexed Web pages (approximately 15 billion).
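A sketch of the computation; `page_count` is a hypothetical wrapper around the Bing Search API that returns the number of hits for a query, and N approximates the index size:

```python
import math

N = 15e9  # approximate number of Web pages indexed by Bing

def pmi(document_entity, profile_entity, page_count):
    """Point-wise mutual information of two named entities from page counts."""
    joint = page_count(f'"{document_entity}" "{profile_entity}"') / N
    p_doc = page_count(f'"{document_entity}"') / N
    p_pro = page_count(f'"{profile_entity}"') / N
    if joint == 0 or p_doc == 0 or p_pro == 0:
        return 0.0  # no evidence of (co-)occurrence
    return math.log(joint / (p_doc * p_pro))
```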
Framework: Bing-SF-IDF+

- Similarity:
  - Bing: takes care of named entities
  - SF-IDF+: takes care of synsets
- Score: weighted average of the Bing and SF-IDF+ scores
- The weight is optimized later on (see the sketch below)
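The combined score is a convex combination of the two similarities. A sketch, with the weight placed on the Bing term, following the reading of the results slide where the optimized weight of 0.48 makes the Bing similarities more important at higher cut-offs:

```python
def bing_sf_idf_plus(sim_bing, sim_sf_idf_plus, alpha=0.48):
    """Weighted average of the named entity and synset similarities."""
    return alpha * sim_bing + (1 - alpha) * sim_sf_idf_plus
```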
Framework: SS

- Similarity: looks for commonalities in Part-of-Speech
- Calculated for each pair of synsets in a document and in the user profile
- Jiang & Conrath [1997], Resnik [1995], and Lin [1998]: based on the information content of synsets
- Leacock & Chodorow [1998] and Wu & Palmer [1994]: based on the path length between synsets

Speaker notes: While TF-IDF, SF-IDF, and Bing-SF-IDF+ use the standard cosine similarity, there are also other similarity measures. These take the semantic similarity of synsets into account and can be divided into two categories: those based on information content (the negative logarithm of the sum of the probabilities of all words in the synset), and those based on the path length between synsets. For individual details, I refer you to the paper.
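All five measures are available in NLTK, which makes the two families easy to try out (a sketch for illustration; the information content based measures require an IC corpus, here the Brown corpus counts shipped with NLTK):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
a, b = wn.synset('car.n.01'), wn.synset('truck.n.01')

print(a.res_similarity(b, brown_ic))  # Resnik: information content
print(a.jcn_similarity(b, brown_ic))  # Jiang & Conrath
print(a.lin_similarity(b, brown_ic))  # Lin
print(a.lch_similarity(b))            # Leacock & Chodorow: path length
print(a.wup_similarity(b))            # Wu & Palmer: path depth
```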
Implementation: Hermes

- The Hermes framework is utilized for building a news personalization service for RSS feeds
- Its implementation is the Hermes News Portal (HNP):
  - Programmed in Java
  - Uses OWL / SPARQL / Jena / GATE / WordNet

Speaker notes: The Hermes framework is utilized for building a news personalization service for RSS feeds. Its implementation is called the Hermes News Portal, or simply HNP. The Hermes news personalization framework provides a semantics-based approach for retrieving news items related, directly or indirectly, to the concepts of interest. HNP takes RSS feeds of news items as input. Hermes employs an advanced Natural Language Processing (NLP) engine that uses techniques like tokenization, part-of-speech tagging, word sense disambiguation, gazetteering, etc.
Implementation: Ceryx

- Ceryx is a plug-in for HNP
- Uses WordNet / Stanford POS Tagger / JAWS lemmatizer / Lesk WSD / Alias-i LingPipe 4.1.0 Named Entity Recognizer / Bing API 2.0
- Main focus is on recommendation support:
  - User profiles are constructed
  - Computes TF-IDF, SF-IDF, Bing-SF-IDF+, and SS

Speaker notes: Our Ceryx plug-in for HNP adds recommendation support to the news processing tool and implements TF-IDF, SF-IDF, Bing-SF-IDF+, and SS recommendation. Recommending news items starts with building a user profile, which amounts to keeping track of the articles the user has read so far; those articles provide information about the user's interests. The user profile could also be constructed in other ways, for instance using user preference elicitation interfaces. The TF-IDF recommender analyses every term in a news item except the stop words, so its user profile consists of a list of read news items that it can process directly. The (Bing-)SF-IDF(+) recommenders gather information from a user profile in the same way; the main difference is that they do not take all the text of a news item into account, but only the synsets (and, for Bing-SF-IDF+, the named entities) found in it. The same holds for the SS recommenders.
Evaluation (1)

Experiment:
- We let 19 participants evaluate 100 news items
- We use 8 different user profiles focusing on various topics
- Ceryx computes TF-IDF, SF-IDF, Bing-SF-IDF+, and SS similarities for various cut-off values
- F1 scores and Kappa statistics are evaluated
- The weight for Bing-SF-IDF+ is optimized using a genetic algorithm

Speaker notes: For the evaluation of Ceryx, we let 19 users browse 100 news articles, indicating for each article whether it is interesting with respect to 8 different topics. Then we let the TF-IDF, (Bing-)SF-IDF(+), and semantic similarity recommenders determine the similarity of each news item with the user profile. We measure the results using F1 scores for various cut-off values (the minimum similarity score that results in a recommendation), and we perform t-tests to assess significance. A sketch of this cut-off evaluation follows.
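A sketch of the cut-off evaluation (a hypothetical helper for illustration, not the actual evaluation code): an item is recommended when its similarity reaches the cut-off, and the recommendations are scored against the user's ratings with the F1 measure.

```python
def f1_at_cutoff(scores, relevant, cutoff):
    """`scores`: item -> similarity; `relevant`: set of items the user liked."""
    recommended = {item for item, s in scores.items() if s >= cutoff}
    tp = len(recommended & relevant)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(recommended)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)
```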
Evaluation (2)

Results:
- Bing-SF-IDF+ significantly outperforms all other methods
- Its weight is optimized to 0.48, but increases for higher cut-off values: Bing similarities become more important when high precision is required
Conclusions

- Common recommendation is performed using TF-IDF
- Semantics can be taken into account by using synsets, related synsets, and named entities
- Semantics-based recommendation outperforms classic term-based recommendation
- Future work:
  - Combine multiple semantic lexicons
  - Add the Bing component to other methods like SS
Questions