A Comparison Study for Novelty Control Mechanisms Applied to Web News Stories 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2012)

Presentation transcript:

A Comparison Study for Novelty Control Mechanisms Applied to Web News Stories
2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2012)
December 4-7, 2012
Arnout Verheij, Allard Kleijn, Flavius Frasincar, Frederik Hogenboom
Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

Introduction (1)
- The Web is a great source of information
- However, there is an information overload
- Solution: information filtering and structuring based on user interests (commonly derived from queries)
- Example: the Hermes news personalization framework retrieves relevant news based on preferences expressed using concepts from an ontology

Introduction (2)
- Most news filtering systems select the articles that are relevant with respect to user interests, topics, etc.
- However, within this subset of relevant articles, not all information is new
- Only articles with high novelty should be retrieved
[Figure: diagram with the labels "news", "relevant", and "novel"]

Introduction (3)
- Key solution: story-based news representation combined with novelty control mechanisms
- Novelty control:
  - Sorting news items based on novelty compared to the seed item and previously browsed items
  - Based on distance measures for similarity
  - News items that are dissimilar to other news items but belong to the same topic indicate that the storyline is developing and should receive a high novelty score

Introduction (4)
- Most novelty control methods use all words from documents in a vector-based news representation
- Considering all words generates noise
- Named entities (persons, companies, products, etc.) could carry a large part of the story information contained in a news item
- Hence, we explore several frequently used novelty control mechanisms and compare word-based and named entity-based vector representations

Novelty Control (1)
- News is sorted based on novelty scores of new items compared to already browsed items
- Novelty score is calculated using a distance metric:
  - Based on a certain document representation
  - Can be used pairwise:
    - Compare a document to all previously browsed documents
    - Determine novelty scores by computing the similarity (distance) to the most similar document
  - Can also be used non-pairwise (aggregated):
    - Aggregate document representations from all previous documents
    - Determine novelty scores by computing the document distance to the aggregated documents
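The difference between pairwise and aggregated usage can be illustrated with a minimal Python sketch. It is not the Hermes implementation: the function names (`cosine_distance`, `pairwise_novelty`, `aggregate_novelty`, `rank_by_novelty`) are hypothetical, documents are assumed to be plain term-count vectors, and returning a novelty of 1.0 when nothing has been browsed yet is an assumption made here for illustration.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector shares nothing with any other
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_novelty(candidate, browsed, distance=cosine_distance):
    """Novelty = distance to the most similar (closest) browsed document."""
    if not browsed:
        return 1.0  # assumption: everything is novel before anything is read
    return min(distance(candidate, doc) for doc in browsed)


def aggregate_novelty(candidate, browsed, distance=cosine_distance):
    """Novelty = distance to the sum of all browsed document vectors."""
    merged = Counter()
    for doc in browsed:
        merged.update(doc)  # Counter.update adds counts term by term
    return distance(candidate, merged)


def rank_by_novelty(candidates, browsed, score=pairwise_novelty):
    """Sort candidate documents from most novel to least novel."""
    return sorted(candidates, key=lambda doc: score(doc, browsed), reverse=True)
```

Passing `score=aggregate_novelty` to `rank_by_novelty` switches the ranking to the aggregated variant without changing the distance measure.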

Novelty Control (2)
- Document representation:
  - Language models:
    - A model is a vector of probabilities P(T_i | D) of picking term T_i randomly from document D, for i = 1 ... n
    - Kullback-Leibler divergence and Jensen-Shannon divergence
  - Vector space models:
    - A model is a vector of weights for terms T_1 ... T_n in document D based on, e.g., term presence, term counts, term frequency-inverse document frequency (TF-IDF), etc.
    - Cosine similarity
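As a rough illustration of these representations and measures, the Python sketch below builds a unigram language model and computes the standard Kullback-Leibler divergence, Jensen-Shannon divergence, and cosine similarity. The add-alpha smoothing in `language_model` is an assumption introduced here so that the KL divergence stays finite; the slides do not say how zero probabilities were handled, and all function names are hypothetical.

```python
import math
from collections import Counter


def language_model(tokens, vocabulary, alpha=1.0):
    """Estimate P(T_i | D) over a fixed vocabulary with add-alpha smoothing.

    The smoothing is an assumption of this sketch, not taken from the slides.
    Tokens outside the vocabulary are ignored.
    """
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return [(counts[t] + alpha) / total for t in vocabulary]


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm else 0.0
```

The two divergences grow as documents differ, so they can serve directly as novelty scores, whereas cosine similarity has to be inverted (for example as 1 minus the similarity) before it can play that role.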

Implementation
- Hermes News Portal:
  - Java-based news personalization framework
  - News is scraped from RSS feeds
  - Named entity recognition using the Stanford Named Entity Recognition and Information Extraction Package
  - Information is stored in a domain ontology:
    - Title
    - Body
    - Date
    - Publisher
    - Named entities (+ frequencies)
  - Domain ontology is queried using Jena/ARQ

Evaluation (1)
- Evaluated methods:
  - Frequently used novelty control mechanisms:
    - Cosine similarity
    - Kullback-Leibler divergence
    - Jensen-Shannon divergence
  - Distance measure usage:
    - Pairwise
    - Aggregate
  - Vector-based news representations:
    - All words
    - Named entities
  - This yields 12 configurations, which are compared with the baseline: ordered by time
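The count of 12 follows from combining 3 measures, 2 usages, and 2 representations. A small Python sketch, with labels chosen here purely for illustration, enumerates the grid:

```python
from itertools import product

# Labels are illustrative; they mirror the options listed on the slide.
MEASURES = ["COS", "KL", "JS"]
USAGES = ["pairwise", "aggregate"]
REPRESENTATIONS = ["all words", "named entities"]

# Every combination of measure, usage, and representation: 3 * 2 * 2 = 12.
configurations = list(product(MEASURES, USAGES, REPRESENTATIONS))

for measure, usage, representation in configurations:
    print(f"{measure:3} | {usage:9} | {representation}")
```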

Evaluation (2)
- Data set:
  - [...]
  - 8,097 news items from [...] to [...]
  - 10 storylines (small storylines of less than 4 items are omitted):
    - Debt crisis in Portugal
    - Oil and gas prices rise by trouble in Middle East
    - Detroit musicians on strike
    - Pennsylvania judge corruption case
    - …
  - 9 news items per storyline (average)
  - Storylines span 2 to 38 days

Evaluation (3)
- Gold standard:
  - 3 annotators
  - Rankings from 0 to 3
  - Based on novelty with respect to:
    - Seed item
    - List of previously read items
- Measures:
  - Kendall's τ:
    - -1: rankings are reversed
    - 1: rankings are the same
  - Discounted Cumulative Gain:
    - 0: bad correlation between rankings
    - 1: perfect correlation between rankings
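Both measures can be sketched in a few lines of Python. The slides do not state which variants were used, so the choices below are assumptions: `kendall_tau` implements the tau-a variant, in which tied pairs contribute nothing, and `normalized_dcg` divides the commonly used log2-discounted DCG by its ideal value, which is one way to obtain the 0 to 1 range mentioned on the slide. The function names are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        score += (direction > 0) - (direction < 0)  # tied pairs add 0
    return score / (len(x) * (len(x) - 1) / 2)


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def normalized_dcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Here `relevances` would hold the annotators' 0 to 3 grades in the order produced by a novelty control mechanism, while `x` and `y` would hold the system scores and the gold-standard scores for the same items.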

Evaluation (4)

Kendall's τ
Method   Overall   P        A        WORDS    NE
COS      0,183     0,165    0,201    0,116    0,250
KL       0,216     0,257    0,175    0,179    0,254
JS       0,173     0,252    0,095    0,156    0,191
Time     -0,192

Discounted Cumulative Gain
Method   Overall   P        A        WORDS    NE
COS      0,755                       0,751    0,760
KL       0,789     0,760    0,818    0,785    0,794
JS       0,753     0,769    0,737    0,752    0,754
Time     0,

P = pairwise, A = aggregate, WORDS = all words, NE = named entities

Conclusions
- We have evaluated several novelty control mechanisms for ranking Web news articles depicting a storyline
- Overall, pairwise KL divergence performs best when considering named entities instead of all words from news items
- Future work:
  - Story detection techniques that take into account the age difference between news items
  - Semantic approaches for novelty control using concepts from a domain ontology

Questions