WWW 2014, Seoul, April 8th: SNOW 2014 Data Challenge
Two-level message clustering for topic detection in Twitter
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH)

Overview
Applied approach:
– Pre-processing
– Topic detection
– Ranking
– Title extraction
– Keyword extraction
– Representative tweets selection
– Relevant image extraction
Evaluation
Conclusions
#2

Pre-processing
Duplicate tweet aggregation:
– Performed via simple hashing (very fast, but does not capture near-duplicates such as some retweets)
– Counts are kept for subsequent processing
Language-based filtering:
– Only content in English is kept
– A publicly available Java implementation is used
Significant computational benefit for the subsequent steps, e.g., for the first timeslot:
– Originally: 15,090 tweets
– After duplicate aggregation: 7,546 unique tweets
– After language filtering: 6,359 unique tweets written in English
#3
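Below is a minimal sketch of the duplicate-aggregation idea: exact duplicates are collapsed by hashing the (lightly normalized) tweet text, and a count per unique tweet is kept for later steps. The normalization and the class/method names are illustrative assumptions, not taken from the actual system; near-duplicates are intentionally not captured, mirroring the limitation noted above.

```java
import java.util.*;

// Sketch only: collapse exact-duplicate tweets via their (normalized) text
// and keep a count per unique tweet for later processing steps.
public class DuplicateAggregation {

    public static Map<String, Integer> aggregate(List<String> tweets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tweets) {
            // Simple normalization; the real system may normalize differently.
            String key = t.trim().toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Breaking: election results announced",
                "Breaking: election results announced",
                "A completely different tweet");
        Map<String, Integer> unique = aggregate(tweets);
        System.out.println(unique.size() + " unique tweets");   // 2
        unique.forEach((text, n) -> System.out.println(n + "x " + text));
    }
}
```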

Topic detection (1/2)
There are different types of topic detection algorithms:
– Feature-pivot
– Document-pivot
– Probabilistic
We opt for a document-pivot approach because we recognize that:
– A reply tweet typically refers to the same topic as the tweet to which it replies.
– Tweets that include the same URL typically refer to the same topic.
Such information is readily available, yet it cannot be easily taken into account by the other types of topic detection algorithms.
We generate first-level clusters by grouping tweets together based on the above relationships (using a Union-Find algorithm).
#4
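A small sketch of this first-level grouping under stated assumptions: tweets connected by a reply relation or by sharing the same URL end up in the same cluster via a standard Union-Find (disjoint-set) structure. The integer-id representation of tweets and the toy relations are purely illustrative.

```java
import java.util.*;

// Sketch: group tweets linked by reply or shared-URL relations with Union-Find.
public class FirstLevelClustering {

    static int[] parent;

    static int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    static void union(int a, int b) { parent[find(a)] = find(b); }

    public static void main(String[] args) {
        int n = 5;                       // five toy tweets, ids 0..4
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        union(1, 0);                     // tweet 1 is a reply to tweet 0
        union(3, 2);                     // tweets 2 and 3 share the same URL

        Map<Integer, List<Integer>> clusters = new TreeMap<>();
        for (int i = 0; i < n; i++)
            clusters.computeIfAbsent(find(i), k -> new ArrayList<>()).add(i);
        System.out.println(clusters);    // e.g. {0=[0, 1], 2=[2, 3], 4=[4]}
    }
}
```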

Topic detection (2/2)
Not all tweets belong to a first-level cluster; thus, we perform a second-level clustering.
We apply an incremental, threshold-based clustering procedure that uses LSH:
– For each tweet, find the best matching item among those already examined. If its similarity to it (using tf-idf weights and cosine similarity) is above a threshold, assign it to the same cluster; otherwise, create a new cluster.
– If the examined tweet belongs to a first-level cluster, assign the other tweets of that first-level cluster to the same second-level cluster (either existing or new) and do not consider those tweets further.
Additionally, in order to reduce fragmentation:
– We use the lemmatized form of terms (Stanford NLP), instead of their raw form.
– We boost entities and hashtags by a constant factor (1.5).
Each second-level cluster is treated as a (fine-grained) topic.
#5
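The following is a simplified sketch of the second-level clustering loop: each incoming tweet is compared against previously seen tweets with cosine similarity over bag-of-words vectors; above a threshold it joins the best match's cluster, otherwise it starts a new one. The actual system uses tf-idf weights, LSH for fast nearest-neighbour lookup, lemmatization and term boosting; this sketch substitutes an exhaustive search over raw term counts for brevity.

```java
import java.util.*;

// Simplified sketch of incremental, threshold-based clustering of tweets.
public class IncrementalClustering {

    static Map<String, Integer> bow(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\W+"))
            if (!tok.isEmpty()) v.merge(tok, 1, Integer::sum);
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double threshold = 0.5;
        List<String> tweets = Arrays.asList(
                "ukraine crisis talks in geneva",
                "crisis talks on ukraine held in geneva",
                "new smartphone released today");
        List<Map<String, Integer>> seen = new ArrayList<>();
        int[] clusterOf = new int[tweets.size()];
        int nextCluster = 0;
        for (int i = 0; i < tweets.size(); i++) {
            Map<String, Integer> v = bow(tweets.get(i));
            int best = -1; double bestSim = 0;
            for (int j = 0; j < seen.size(); j++) {            // exhaustive search; the real system uses LSH
                double s = cosine(v, seen.get(j));
                if (s > bestSim) { bestSim = s; best = j; }
            }
            clusterOf[i] = (best >= 0 && bestSim >= threshold) ? clusterOf[best] : nextCluster++;
            seen.add(v);
        }
        System.out.println(Arrays.toString(clusterOf));        // [0, 0, 1]
    }
}
```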

Ranking
A very large number of topics is produced per timeslot (e.g., 2,669 for the first timeslot), but only 10 per timeslot need to be returned.
We recognize that topic granularity and hierarchy matter for ranking: fine-grain subtopics of popular coarse-grain topics should be ranked higher than fine-grain topics that are not subtopics of a popular coarse-grain topic.
To cater for this, we:
– Detect coarse-grain topics by running the document-pivot procedure again (i.e., a third clustering pass), this time boosting entities and hashtags further (not by a constant factor, but by a factor linear in their frequency)
– Map each fine-grain topic to a coarse-grain topic to obtain a two-level hierarchy
– Rank the coarse-grain topics by the number of tweets they contain
– Rank the fine-grain topics within each coarse-grain topic, again by the number of tweets they contain
Finally, we apply a simple heuristic procedure to select the first few fine-grain topics from the first few coarse-grain topics.
#6
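A rough sketch of one plausible reading of this ranking step: fine-grain topics are grouped under coarse-grain topics, coarse-grain topics are ordered by total tweet count, ties are broken by the fine topic's own count, and the top items are taken. The grouping itself (the third clustering run) is assumed given, and the authors' exact selection heuristic is not specified on the slide, so this is an approximation.

```java
import java.util.*;

// Sketch: rank fine-grain topics by the size of their coarse-grain parent,
// then by their own size, and keep the top k.
public class TopicRanking {

    record FineTopic(String label, int tweetCount, int coarseId) {}

    static List<FineTopic> selectTop(List<FineTopic> fine, int k) {
        // total number of tweets under each coarse-grain topic
        Map<Integer, Integer> coarseSize = new HashMap<>();
        for (FineTopic f : fine) coarseSize.merge(f.coarseId(), f.tweetCount(), Integer::sum);

        // bigger coarse topic first, then bigger fine topic within it
        Comparator<FineTopic> byCoarse =
                Comparator.comparingInt((FineTopic f) -> coarseSize.get(f.coarseId())).reversed();
        Comparator<FineTopic> byCoarseThenFine =
                byCoarse.thenComparing(Comparator.comparingInt(FineTopic::tweetCount).reversed());

        return fine.stream().sorted(byCoarseThenFine).limit(k).toList();
    }

    public static void main(String[] args) {
        List<FineTopic> fine = List.of(
                new FineTopic("Geneva talks", 40, 1),
                new FineTopic("Sanctions vote", 25, 1),
                new FineTopic("Celebrity gossip", 30, 2));
        // picks the two subtopics of the larger coarse-grain topic
        System.out.println(selectTop(fine, 2));
    }
}
```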

Title extraction
For each topic, we obtain a set of candidate titles by splitting the assigned tweets into sentences (using the Stanford NLP library).
Each candidate title gets a score depending on its frequency and the average likelihood of appearance of its words in an independent corpus.
The candidate titles are ranked and the one with the highest score is returned.
#7
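A sketch of the title-selection logic under an explicit assumption: the slide does not give the formula combining sentence frequency with average background-corpus likelihood, so the example simply divides frequency by the average corpus probability of the words, so that frequent sentences made of distinctive words win. All names and probabilities below are made up.

```java
import java.util.*;

// Sketch: score candidate titles from frequency and background-corpus word likelihood.
public class TitleExtraction {

    static double score(String sentence, int freq, Map<String, Double> corpusProb) {
        String[] words = sentence.toLowerCase(Locale.ROOT).split("\\W+");
        double avg = 0;
        for (String w : words) avg += corpusProb.getOrDefault(w, 1e-6);
        avg /= Math.max(1, words.length);
        return freq / avg;   // assumed combination: frequent + distinctive wins
    }

    public static void main(String[] args) {
        Map<String, Double> corpusProb = Map.of(
                "the", 0.05, "in", 0.03, "talks", 0.001,
                "geneva", 0.0002, "nice", 0.002, "day", 0.004);
        Map<String, Integer> candidates = Map.of(
                "Talks in Geneva", 12,
                "Nice day", 3);
        String best = candidates.entrySet().stream()
                .max(Comparator.comparingDouble(e -> score(e.getKey(), e.getValue(), corpusProb)))
                .map(Map.Entry::getKey).orElse("");
        System.out.println(best);   // "Talks in Geneva"
    }
}
```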

Keyword extraction
We opt for phrases rather than unigrams, because phrases are more descriptive and less ambiguous.
For each topic, we obtain a set of candidate keywords by detecting the noun phrases and verb phrases in the assigned tweets.
As with titles, each candidate keyword gets a score depending on its frequency and the average likelihood of appearance of its words in an independent corpus.
The candidate keywords are ranked; we then find the position in the ranked list with the largest score gap and select the keywords up to that point.
#8
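The largest-gap cut-off can be illustrated with a short sketch: candidates are assumed already ranked by score, and the list is cut at the position with the biggest drop between consecutive scores. Candidate generation (noun/verb phrase detection) and the scoring itself are assumed done elsewhere; the scores below are made up.

```java
import java.util.*;

// Sketch: cut a ranked keyword list at the largest gap between consecutive scores.
public class KeywordCutoff {

    static List<String> selectByLargestGap(LinkedHashMap<String, Double> ranked) {
        List<Map.Entry<String, Double>> e = new ArrayList<>(ranked.entrySet());
        int cut = e.size();
        double biggestGap = -1;
        for (int i = 0; i + 1 < e.size(); i++) {
            double gap = e.get(i).getValue() - e.get(i + 1).getValue();
            if (gap > biggestGap) { biggestGap = gap; cut = i + 1; }
        }
        List<String> keep = new ArrayList<>();
        for (int i = 0; i < cut; i++) keep.add(e.get(i).getKey());
        return keep;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Double> ranked = new LinkedHashMap<>();
        ranked.put("Geneva talks", 0.92);        // already sorted by score
        ranked.put("Ukraine crisis", 0.88);
        ranked.put("foreign ministers", 0.80);
        ranked.put("nice weather", 0.21);        // large drop: cut before this
        ranked.put("my breakfast", 0.15);
        System.out.println(selectByLargestGap(ranked));
        // [Geneva talks, Ukraine crisis, foreign ministers]
    }
}
```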

Representative tweets selection
Related tweets for each topic are readily available, since we apply a document-pivot approach.
Satisfactory diversity is achieved by not considering duplicates (removed during pre-processing) and by considering replies (as part of the core topic detection procedure).
Selection: first the most popular tweet, then all replies, and then further tweets by popularity, until 10 tweets are gathered.
#9
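One possible reading of the selection rule, sketched below: take the most popular tweet first, then its replies, then keep adding remaining tweets in order of popularity until 10 are collected. The Tweet fields are hypothetical; "popularity" stands in for the duplicate counts kept during pre-processing.

```java
import java.util.*;

// Sketch: pick representative tweets (most popular, then replies, then by popularity).
public class RepresentativeTweets {

    record Tweet(String text, int popularity, boolean isReplyToTop) {}

    static List<Tweet> select(List<Tweet> topicTweets, int limit) {
        List<Tweet> byPopularity = new ArrayList<>(topicTweets);
        byPopularity.sort(Comparator.comparingInt(Tweet::popularity).reversed());

        LinkedHashSet<Tweet> picked = new LinkedHashSet<>();
        if (!byPopularity.isEmpty()) picked.add(byPopularity.get(0));        // most popular tweet
        for (Tweet t : topicTweets) if (t.isReplyToTop()) picked.add(t);     // its replies
        for (Tweet t : byPopularity) {                                       // fill up by popularity
            if (picked.size() >= limit) break;
            picked.add(t);
        }
        List<Tweet> result = new ArrayList<>(picked);
        return result.subList(0, Math.min(limit, result.size()));
    }

    public static void main(String[] args) {
        List<Tweet> tweets = List.of(
                new Tweet("Main report on the talks", 120, false),
                new Tweet("@reporter is this confirmed?", 3, true),
                new Tweet("Another take on the talks", 40, false));
        select(tweets, 10).forEach(t -> System.out.println(t.text()));
    }
}
```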

Relevant image extraction
Three cases:
– If there are images in the tweets assigned to the topic, return the most frequent image.
– If not, query the Google search API with the title and return the first image in the results.
– If no result is fetched (possibly because the title is too restrictive), query the Google search API again, this time with the most popular keyword.
#10
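The three-step fallback can be sketched as follows. The searchImage helper is a hypothetical stand-in for a call to an external image search service (the slide mentions the Google search API); it is not a real API signature.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: pick a topic image with a three-step fallback.
public class TopicImageSelection {

    static Optional<String> mostFrequentImage(List<String> imageUrlsInTweets) {
        return imageUrlsInTweets.stream()
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    // Hypothetical stand-in for an external image search call; not a real API.
    static Optional<String> searchImage(String query) { return Optional.empty(); }

    static Optional<String> pickImage(List<String> tweetImages, String title, String topKeyword) {
        Optional<String> img = mostFrequentImage(tweetImages);   // 1) most frequent image in the tweets
        if (img.isEmpty()) img = searchImage(title);             // 2) otherwise search with the topic title
        if (img.isEmpty()) img = searchImage(topKeyword);        // 3) otherwise search with the top keyword
        return img;
    }

    public static void main(String[] args) {
        System.out.println(pickImage(
                List.of("http://a/1.jpg", "http://a/1.jpg", "http://b/2.jpg"),
                "Talks in Geneva", "Geneva talks"));             // Optional[http://a/1.jpg]
    }
}
```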

Evaluation (1/2)
– Significant computational benefit from the pre-processing steps
– Typically, a few hundred first-level clusters per timeslot
#11

Evaluation (2/2)
[Example detected topics with annotated issues: a missing keyword ("Hague"), irrelevant multimedia, and a topic that is only marginally newsworthy.]
#12