Bing-SF-IDF+: A Hybrid Semantics-Driven News Recommender

Presentation transcript:

Bing-SF-IDF+: A Hybrid Semantics-Driven News Recommender

Michel Capelle (michelcapelle@gmail.com)
Marnix Moerland (marnix.moerland@gmail.com)
Frederik Hogenboom (fhogenboom@ese.eur.nl)
Flavius Frasincar (frasincar@ese.eur.nl)
Damir Vandic (vandic@ese.eur.nl)

Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

April 17, 2015
30th ACM/SIGAPP Symposium on Applied Computing (SAC 2015)

Introduction (1)

- Recommender systems provide users with items of interest from a potentially large set of items
- Recommender systems:
  - Content-based
  - Collaborative filtering
  - Hybrid
- Content-based systems are often term-based
- Common measure: Term Frequency – Inverse Document Frequency (TF-IDF)

In today's data-intensive world, most people experience (or suffer from) information overload. Recommender systems help to distinguish between interesting and non-interesting products, news articles, and so on. Recommendations can be based on, for example, user preferences or characteristics, possibly captured in user profiles or derived from browsing behavior.

There are three basic types of recommender systems: content-based recommenders, which recommend news items based on their content; collaborative filtering recommenders, which recommend news items by means of user similarity; and hybrid recommenders, which combine the previous two approaches. In the work covered by this presentation, I'll focus on content-based recommender systems.

Traditionally, these recommender systems are term-based, and hence operate on term frequencies. A commonly used measure is TF-IDF, which stands for Term Frequency – Inverse Document Frequency. User profiles that describe a user's interests based on previously browsed items can be translated into vectors of TF-IDF weights. With a measure like cosine similarity, one can then calculate how interesting a new item might be given the user profile. For this, TF-IDF weights are computed for every term within a document.

Introduction (2)

- One could take into account semantics:
  - Semantic Similarity (SS)
  - Cosine Similarity (CS)
- SS recommenders are based on various similarity functions:
  - Jiang & Conrath [1997]
  - Leacock & Chodorow [1998]
  - Lin [1998]
  - Resnik [1995]
  - Wu & Palmer [1994]

However, TF-IDF was introduced in the 1980s, and since then people have come to believe that taking semantics into account is crucial for more accurate recommender systems. Semantics can be added in many ways, for instance through well-known similarity measures such as semantic similarity and cosine similarity.

Introduction (3)

- CS recommenders:
  - Concepts instead of terms → Concept Frequency – Inverse Document Frequency (CF-IDF):
    - Reduces noise caused by non-meaningful terms
    - Yields fewer terms to evaluate
    - Allows for semantic features, e.g., synonyms
    - Relies on a domain ontology
  - Synsets instead of concepts → Synset Frequency – Inverse Document Frequency (SF-IDF):
    - Similar to CF-IDF
    - Relies on a semantic lexicon
    - Does not rely on a domain ontology

Alternatively, you can make use of concepts instead of terms, which we tried a few years ago. This worked well, but it suffered from a drawback: dependence on a domain ontology. Therefore, we introduced a variant that does not rely on ontologies but on a large semantic lexicon like WordNet.

Introduction (4)

- Current limitations w.r.t. named entities:
  - CF-IDF relies too much on domain ontologies
  - SF-IDF uses too generic semantic lexicons
- Hence, we coin Bing-SF-IDF+:
  - Extends SF-IDF with semantic relations
  - Also accounts for named entities through Bing page counts
  - Domain independent, as it does not rely on ontologies

However, SF-IDF was not able to handle the often crucial named entities, as these are frequently not present in semantic lexicons. Therefore, we propose to take named entities into account, not through domain ontologies but through Bing page counts. In addition, we consider semantic relations to improve SF-IDF, and call our method (for obvious reasons) Bing-SF-IDF+.

Introduction (5)

- Implementations in Ceryx (a plug-in for Hermes [Frasincar et al., 2009], a news processing framework)
- What is the performance of semantic recommenders?
  - Bing-SF-IDF+ vs. SF-IDF
  - Bing-SF-IDF+ vs. TF-IDF
  - Bing-SF-IDF+ vs. SS

Of course, we have implemented our new approach in our plug-in for the news processing framework, which we have been using for a couple of years now. This allowed us to compare the performance of Bing-SF-IDF+ against SF-IDF, but also against TF-IDF and the SS methods.

Framework: User Profile

- User profile consists of all read news items
- Implicit preference for specific topics

Framework: Preprocessing

Before recommendations can be made, each news item is parsed:

- Tokenizer
- Sentence splitter
- Lemmatizer
- Part-of-Speech tagger
- Named entity recognizer

Framework: Synsets

- We make use of the WordNet dictionary and word sense disambiguation (WSD)
- Each word has a set of senses and each sense has a set of semantically equivalent synonyms (synsets):
  - Turkey:
    - turkey, Meleagris gallopavo (animal)
    - Turkey, Republic of Turkey (country)
    - joker, turkey (annoying person)
    - turkey, bomb, dud (failure)
  - Fly:
    - fly, aviate, pilot (operate airplane)
    - flee, fly, take flight (run away)
- Synsets are linked using semantic pointers (relations): hypernym, hyponym, …

Framework: TF-IDF

- Term Frequency – Inverse Document Frequency:
  - TF: the occurrence of a term in a single document
  - IDF: based on the occurrence of a term in a set of documents
  - Score: TF×IDF
- Similarity:
  - Two vectors with TF-IDF scores for each term in:
    - A document
    - The user profile
  - Cosine of the angle between the vectors determines similarity
- TF: as high as possible
- Document frequency: lower is better (a term that occurs in fewer documents receives a higher IDF)
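The TF-IDF recommender described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Ceryx implementation: it assumes raw counts for TF and log(N / df) for IDF (the slides do not state the exact variants), and the tiny corpus, the profile, and the two candidate items are made up.

```python
import math
from collections import Counter

def tf_idf_vector(tokens, corpus):
    """Return {term: tf * idf} for one tokenized document.

    Assumed variants: tf is the raw count of the term in `tokens`,
    idf is log(N / df), with df the number of corpus documents containing the term.
    """
    n_docs = len(corpus)
    vector = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        if df:  # terms that never occur in the corpus get no weight
            vector[term] = tf * math.log(n_docs / df)
    return vector

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up, already tokenized news items.
corpus = [
    ["bank", "raises", "interest", "rate"],
    ["bank", "merger", "announced"],
    ["football", "club", "wins", "cup"],
]
profile = corpus[0] + corpus[1]  # the user profile: all read news items
candidate_a = ["interest", "rate", "cut", "expected"]
candidate_b = ["football", "cup", "final"]

profile_vec = tf_idf_vector(profile, corpus)
sim_a = cosine(profile_vec, tf_idf_vector(candidate_a, corpus))
sim_b = cosine(profile_vec, tf_idf_vector(candidate_b, corpus))
```

Here `candidate_a` shares terms with the read items and receives a positive similarity, whereas `candidate_b` shares none and scores zero. A recommender would then compare each similarity against a cut-off value.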

Framework: SF-IDF

- Synset Frequency – Inverse Document Frequency:
  - SF: the occurrence of a synset in a single document
  - IDF: based on the occurrence of a synset in a set of documents
  - Score: SF×IDF
- Similarity:
  - Two vectors with SF-IDF scores for each synset in:
    - A document
    - The user profile
  - Cosine of the angle between the vectors determines similarity
- SF: as high as possible
- Document frequency: lower is better (a synset that occurs in fewer documents receives a higher IDF)
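The only change with respect to TF-IDF is what gets counted. The sketch below illustrates that step under strong simplifications: the `SENSES` table stands in for the output of word sense disambiguation, and its synset identifiers are invented for this example (they are not WordNet identifiers).

```python
from collections import Counter

# Hypothetical result of word sense disambiguation: (lemma, POS) -> synset id.
# The ids are made up for this sketch.
SENSES = {
    ("flop", "n"): "syn:failure",
    ("dud", "n"): "syn:failure",
    ("turkey", "n"): "syn:failure",  # the sense WSD is assumed to pick in this context
    ("film", "n"): "syn:movie",
    ("movie", "n"): "syn:movie",
}

def synset_frequencies(tagged_lemmas):
    """Count synsets instead of terms; lemmas without a known synset are dropped."""
    return Counter(SENSES[t] for t in tagged_lemmas if t in SENSES)

doc = [("film", "n"), ("flop", "n"), ("yesterday", "n")]
profile = [("movie", "n"), ("dud", "n"), ("movie", "n")]

shared_terms = {lemma for lemma, _ in doc} & {lemma for lemma, _ in profile}
sf_doc = synset_frequencies(doc)
sf_profile = synset_frequencies(profile)
shared_synsets = set(sf_doc) & set(sf_profile)
```

The document and the profile have no terms in common, so a term-based vector comparison would find nothing, yet they share both synsets. The synset counts can then be weighted with IDF and compared with cosine similarity exactly as for terms.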

Framework: SF-IDF+

- SF-IDF for synsets and their related synsets:
  - SF: the occurrence of a synset and its related synsets in a single document
  - IDF: based on the occurrence of a synset and its related synsets in a set of documents
  - Score: SF×IDF
- Similarity:
  - Two vectors with SF-IDF+ scores for each synset in:
    - A document
    - The user profile
  - Cosine of the angle between the vectors determines similarity
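One way to picture the extension with related synsets is shown below. The slide does not say how related synsets contribute, so this sketch assumes that each related synset is added with a per-relation weight; the relation table, the weights, and the synset ids are all hypothetical.

```python
from collections import defaultdict

# Hypothetical semantic pointers: synset -> [(relation, related synset)].
RELATIONS = {
    "syn:car": [("hypernym", "syn:motor_vehicle")],
    "syn:truck": [("hypernym", "syn:motor_vehicle")],
}

# Assumed per-relation weights; the values are illustrative only.
RELATION_WEIGHT = {"hypernym": 0.5, "hyponym": 0.3}

def extended_frequencies(synsets):
    """Count each synset once per occurrence and add its related synsets, down-weighted."""
    freq = defaultdict(float)
    for synset in synsets:
        freq[synset] += 1.0
        for relation, related in RELATIONS.get(synset, []):
            freq[related] += RELATION_WEIGHT.get(relation, 0.0)
    return dict(freq)

doc = extended_frequencies(["syn:car", "syn:car"])
profile = extended_frequencies(["syn:truck"])
overlap = set(doc) & set(profile)
```

Without the extension, a document about cars and a profile about trucks have no synset in common. With it, both vectors contain the shared hypernym, so their cosine similarity becomes non-zero.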

Framework: Bing

- Similarity: Point-Wise Mutual Information (PMI)
- Calculated for each pair of entities in:
  - A document
  - The user profile
- Based on:
  - Co-occurrences of document and profile entities
  - Occurrences of document entity
  - Occurrences of profile entity
- Corrected for the number of indexed Web pages (about 15 billion)

PMI = log( [# co-occurrences of document and profile entity] / ([# occurrences of document entity] × [# occurrences of profile entity]) ), with the counts corrected for the number of indexed Web pages (about 15 billion).
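The PMI computation can be sketched as follows. The sketch assumes that "corrected for the number of indexed Web pages" means dividing every page count by that number, which turns counts into probabilities; this is the usual reading of PMI but is not spelled out on the slide. Returning 0.0 for zero counts is a further assumption, and the page counts in the example are invented.

```python
import math

N_PAGES = 15e9  # approximate number of indexed Web pages, as stated on the slide

def pmi(co_count, doc_count, profile_count, n_pages=N_PAGES):
    """PMI of a document entity and a profile entity, computed from page counts.

    Each count is divided by n_pages to estimate a probability.
    Returns 0.0 when any count is zero (an assumption of this sketch).
    """
    if min(co_count, doc_count, profile_count) <= 0:
        return 0.0
    p_joint = co_count / n_pages
    p_doc = doc_count / n_pages
    p_profile = profile_count / n_pages
    return math.log(p_joint / (p_doc * p_profile))

# Invented page counts for two entities.
independent = pmi(co_count=1_000, doc_count=3_000_000, profile_count=5_000_000)
associated = pmi(co_count=2_000, doc_count=3_000_000, profile_count=5_000_000)
```

With these numbers, 1,000 co-occurrences is exactly what independence would predict, so the first PMI is approximately zero. Doubling the co-occurrences gives a PMI of about log 2, indicating that the two entities appear together more often than chance.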

Framework: Bing-SF-IDF+

- Similarity:
  - Bing: takes care of named entities
  - SF-IDF+: takes care of synsets
- Score: weighted average of Bing and SF-IDF+ score
- Weight is optimized later on
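A weighted average of two scores only makes sense when both are on a comparable scale. The sketch below therefore min-max normalizes each score list before mixing them. The normalization step, the function names, and the choice to let `alpha` weight the Bing component are assumptions of this sketch, not details given on the slide.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine(bing_scores, sf_idf_plus_scores, alpha):
    """Weighted average per news item: alpha * Bing + (1 - alpha) * SF-IDF+."""
    bing = min_max(bing_scores)
    sf = min_max(sf_idf_plus_scores)
    return [alpha * b + (1 - alpha) * s for b, s in zip(bing, sf)]

# Invented scores for three news items.
combined = combine([0.0, 2.0, 4.0], [1.0, 0.5, 0.0], alpha=0.48)
```

Setting `alpha` to 0 reduces the score to SF-IDF+ alone, and setting it to 1 uses only the Bing similarity.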

Framework: SS

- Similarity:
  - Looks for commonalities in Part-of-Speech
  - Calculated for each pair of synsets in:
    - A document
    - The user profile
- Jiang & Conrath [1997], Resnik [1995], and Lin [1998]: information content of synsets
- Leacock & Chodorow [1998] and Wu & Palmer [1994]: path length between synsets

While TF-IDF, SF-IDF, and Bing-SF-IDF+ make use of the standard cosine similarity, there are also other similarity measures. These take into account the semantic similarity of synsets and can be divided into two categories:

- the ones that are based on information content (the negative logarithm of the sum of all probabilities of all the words in the synset)
- the ones that are based on the path length between synsets

For individual details, I would like to point you to the paper.
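To give a feel for the path-based family, here is a small sketch of the Wu & Palmer measure on a toy taxonomy. It uses the common formulation 2 × depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest shared ancestor and the root has depth 1. The hierarchy is invented and far simpler than WordNet, where a synset can have several hypernyms.

```python
# Toy is-a hierarchy: node -> parent (None marks the root). Invented for this sketch.
PARENT = {
    "entity": None,
    "animal": "entity",
    "bird": "animal",
    "turkey": "bird",
    "eagle": "bird",
    "dog": "animal",
}

def path_to_root(node):
    """Return the chain of nodes from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Number of nodes on the path to the root; the root itself has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node that is also an ancestor of b is the deepest shared one.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

bird_pair = wu_palmer("turkey", "eagle")
mixed_pair = wu_palmer("turkey", "dog")
```

The two birds share a deep ancestor and score 0.75, while the turkey and the dog only meet at "animal" and score lower. The information-content measures replace depth with corpus-based probabilities of the synsets.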

Implementation: Hermes

- The Hermes framework is utilized for building a news personalization service for RSS feeds
- Its implementation is the Hermes News Portal (HNP):
  - Programmed in Java
  - Uses OWL / SPARQL / Jena / GATE / WordNet

The Hermes framework is utilized for building a news personalization service for RSS feeds. Its implementation is called the Hermes News Portal, or simply HNP. The Hermes news personalization framework provides a semantics-based approach for retrieving news items related, directly or indirectly, to the concepts of interest. HNP takes RSS feeds of news items as input. Hermes employs an advanced Natural Language Processing (NLP) engine that uses techniques like tokenization, part-of-speech tagging, word sense disambiguation, gazetteering, etc.

Implementation: Ceryx

- Ceryx is a plug-in for HNP
- Uses WordNet / Stanford POS Tagger / JAWS lemmatizer / Lesk WSD / Alias-I LingPipe 4.1.0 Named Entity Recognizer / Bing API 2.0
- Main focus is on recommendation support
- User profiles are constructed
- Computes TF-IDF, SF-IDF, Bing-SF-IDF+, and SS

Our Ceryx plug-in for the HNP adds recommendation support to the news processing tool and implements TF-IDF, SF-IDF, Bing-SF-IDF+, and SS recommendation.

Recommending news items starts with building a user profile, which amounts to keeping track of which articles the user has read so far. Those articles provide us with information about the user's interests. The user profile can also be constructed in different ways, for instance using user preference elicitation interfaces.

The TF-IDF recommender analyses every term (except stop words) in a news item, so the profile of the TF-IDF recommender consists of a list of news items that it can process. The (Bing-)SF-IDF(+) recommender gathers information from a user profile in the same way. The main difference is that this recommender does not take all the text in a news item into account, but only the synsets (and possibly the named entities) found in it. The same holds for the SS recommenders.

Evaluation (1)

- Experiment:
  - We let 19 participants evaluate 100 news items
  - We use 8 different user profiles focusing on various topics
  - Ceryx computes TF-IDF, SF-IDF, Bing-SF-IDF+, and SS for various cut-off values
  - F1 scores and Kappa statistics are evaluated
  - Weight for Bing-SF-IDF+ is optimized using a genetic algorithm

For the evaluation of Ceryx, we let 19 users browse 100 news articles and indicate how interesting each article is, keeping in mind the preference for 8 different topics. Then, we let the TF-IDF, (Bing-)SF-IDF(+), and semantic similarity recommenders determine the similarity with the user profile for each news item. We measure the results using F1 scores for various cut-off values (the minimum score resulting in a recommendation), and we perform t-tests to assess significance.
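The role of the cut-off value can be illustrated with a small F1 computation. In this sketch an item is recommended when its similarity score is at least the cut-off, and F1 is the harmonic mean of precision and recall against the items a user marked as interesting. The item ids, scores, and relevance judgments are invented.

```python
def f1_at_cutoff(scores, relevant, cutoff):
    """F1 of the recommendations obtained by keeping items with score >= cutoff."""
    recommended = {item for item, score in scores.items() if score >= cutoff}
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        return 0.0  # also covers the case where nothing is recommended
    precision = true_positives / len(recommended)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Invented similarity scores and user judgments.
scores = {"n1": 0.9, "n2": 0.7, "n3": 0.4, "n4": 0.1}
relevant = {"n1", "n3"}

f1_by_cutoff = {c: f1_at_cutoff(scores, relevant, c) for c in (0.3, 0.5, 0.8, 0.95)}
```

Raising the cut-off shrinks the set of recommended items, which tends to trade recall for precision. Evaluating over several cut-off values shows how a recommender behaves across that trade-off.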

Evaluation (2)

- Results:
  - Bing-SF-IDF+ significantly outperforms all other methods
  - Its weight is optimized to 0.48, but increases for higher cut-off values: Bing similarities become more important when a high precision is required

Conclusions

- Common recommendation is performed using TF-IDF
- Semantics could be considered by using synsets, related synsets, and named entities
- Semantics-based recommendation outperforms the classic term-based recommendation
- Future work:
  - Combine multiple semantic lexicons
  - Add Bing to other methods like SS

Questions