SNOW Workshop, 8th April 2014 Real-time topic detection with bursty ngrams: RGU participation in SNOW 2014 challenge Carlos Martin and Ayse Goker (Robert Gordon University)

Presentation transcript:

SNOW Workshop, 8th April 2014 Real-time topic detection with bursty ngrams: RGU participation in SNOW 2014 challenge Carlos Martin and Ayse Goker (Robert Gordon University)

Outline #2
– Architecture diagram
– Results
– Future work

Architecture diagram #3
[Diagram] Components: Crawler, Entities Extractor, Solr. Data labels: Tweets (English), Tweets (with Entities).

Architecture diagram #4
[Diagram] Components: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner, Query Builder, Topic Labeller. Data labels: Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ label), Topics (+ tweets).

Entities Extractor #5
– Extracts entities per tweet using Stanford NER (3-class model), which identifies Person, Location and Organization.
– Efficient enough for a real-time system.

Architecture diagram #6
[Diagram] Components: Crawler, Entities Extractor, BNgram, Solr. Data labels: Tweets (English), Tweets (with Entities), Ranked topics.

BNgram approach #7
– Detects bursty ngrams based on their df-idf score. Bursty entities, hashtags and URLs are also included in the approach.
– Regarding ngrams, only 2-grams and 3-grams are considered (unigrams are no longer used).
– df-idf is a variant of tf-idf that penalizes terms that were frequent in previous timeslots.
– Terms containing hashtags, entities or URLs are boosted.
– Two previous timeslots (s = 2) were considered in our experiments.
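
The scoring idea on this slide can be sketched as follows. The slides do not state the df-idf formula, so the one below (current document frequency divided by the log of the average frequency over the previous s timeslots) is an assumption based on a formulation commonly given for BNgram; the function name `df_idf` and the `boost` parameter are hypothetical.

```python
import math

def df_idf(df_current, df_previous, boost=1.0):
    """Burstiness score for one term (assumed formulation, not from the slides).

    df_current  -- number of tweets containing the term in the current timeslot
    df_previous -- list of the term's tweet counts in the s previous timeslots
    boost       -- hypothetical multiplier for hashtags, entities and URLs
    """
    s = len(df_previous)
    avg_previous = sum(df_previous) / s if s else 0.0
    # Terms that were already frequent earlier get a larger denominator.
    score = (df_current + 1) / (math.log(avg_previous + 1) + 1)
    return boost * score

# Same current frequency, different history (s = 2 previous timeslots).
new_term = df_idf(9, [0, 0])      # unseen before: high score
steady_term = df_idf(9, [9, 9])   # equally frequent before: penalized
```

A term that appears suddenly scores higher than one that was just as frequent in the two preceding timeslots, which is the penalization the slide describes.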

BNgram approach #8
– A “partial” membership clustering approach is an interesting alternative, as one term can belong to several clusters (for example, the entity “Obama” in the stories “Obama wins in Ohio” and “Obama wins in Illinois”).
– The Apriori algorithm was used for clustering in the SNOW challenge experiments.
– It explores maximal associations between terms based on the number of tweets they share.
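
As an illustration of partial membership, here is a minimal Apriori sketch that treats each tweet as a set of bursty terms and returns the maximal term sets shared by at least `min_support` tweets. The function names, the support threshold and the toy data are assumptions for illustration, not the configuration used in the challenge.

```python
from itertools import combinations

def frequent_termsets(tweets, min_support):
    """Apriori over tweets (each a set of terms); returns {termset: support}."""
    def support(termset):
        return sum(1 for tweet in tweets if termset <= tweet)

    terms = {term for tweet in tweets for term in tweet}
    current = {frozenset([t]) for t in terms
               if support(frozenset([t])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for termset in current:
            frequent[termset] = support(termset)
        k += 1
        # Join step: unions of frequent (k-1)-sets that have exactly k terms.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def maximal_termsets(frequent):
    """Keep only term sets that are not contained in a larger frequent set."""
    return [s for s in frequent if not any(s < other for other in frequent)]

tweets = [
    {"obama", "wins", "ohio"}, {"obama", "wins", "ohio"},
    {"obama", "wins", "illinois"}, {"obama", "wins", "illinois"},
]
clusters = maximal_termsets(frequent_termsets(tweets, min_support=2))
```

With this toy input, "obama" ends up in both maximal term sets, one per story, rather than being forced into a single cluster.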

BNgram approach #9
– Output: clusters of trending terms, each with its associated tweets from the last timeslot.
– A tweet must contain a minimum number of cluster terms to be included.
– Clusters are ranked by their bursty score (the maximum df-idf value among the topic's terms).
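
The output step can be sketched as below, assuming each tweet carries the set of bursty terms it contains. The data layout, the function name `rank_topics` and the default `min_terms=2` are hypothetical; the slides do not give the actual minimum.

```python
def rank_topics(clusters, scores, tweets, min_terms=2):
    """Attach tweets to clusters and rank clusters by their best df-idf score.

    clusters -- iterable of frozensets of terms
    scores   -- dict mapping term -> df-idf score
    tweets   -- list of dicts with a "terms" set (last timeslot only)
    """
    topics = []
    for terms in clusters:
        # A tweet joins the topic only if it shares enough cluster terms.
        members = [tw for tw in tweets if len(terms & tw["terms"]) >= min_terms]
        topics.append({
            "terms": terms,
            "score": max(scores[t] for t in terms),
            "tweets": members,
        })
    return sorted(topics, key=lambda topic: topic["score"], reverse=True)

clusters = [frozenset({"obama wins", "wins ohio"}),
            frozenset({"heavy snow", "snow boston"})]
scores = {"obama wins": 5.0, "wins ohio": 3.0,
          "heavy snow": 8.0, "snow boston": 2.0}
tweets = [
    {"id": 1, "terms": {"obama wins", "wins ohio"}},
    {"id": 2, "terms": {"obama wins"}},
    {"id": 3, "terms": {"heavy snow", "snow boston"}},
]
ranked = rank_topics(clusters, scores, tweets)
```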

Architecture diagram #10
[Diagram] Components: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr. Data labels: Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls).

Keyword Extractor and Topic Aggregator modules #11
Topic Aggregator module:
– Aggregates entities, hashtags and URLs per topic (taken from the topic's tweets in the corresponding timeslot), keeping their frequencies.
– Keeps only those whose frequency is higher than a threshold.
Keyword Extractor module:
– Extracts the main keywords (including ngrams) per topic (those not extracted by the Topic Aggregator) using the bursty terms from the clusters.
– Removes URLs, hashtags, user mentions, entities and acronyms.
– Overlaps are also removed.
– Keeps the df-idf scores as keyword weights.
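
A minimal sketch of the Topic Aggregator step, assuming each tweet is a dict with optional `entities`, `hashtags` and `urls` lists. The field names and the function name `aggregate` are hypothetical.

```python
from collections import Counter

def aggregate(topic_tweets, field, threshold):
    """Count the values of one field across a topic's tweets and keep the
    ones whose frequency is strictly higher than the threshold."""
    counts = Counter(value
                     for tweet in topic_tweets
                     for value in tweet.get(field, []))
    return {value: n for value, n in counts.items() if n > threshold}

topic_tweets = [
    {"hashtags": ["#snow", "#boston"]},
    {"hashtags": ["#snow"]},
    {"hashtags": ["#snow"], "urls": []},
]
top_hashtags = aggregate(topic_tweets, "hashtags", threshold=1)
```

The same function can be called once per field (entities, hashtags, URLs), so the frequencies stay available for later modules.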

Architecture diagram #12
[Diagram] Components: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner. Data labels: Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics.

Topic Combiner module #13
– Merges similar topics from the same timeslot.
– Similarity is based on the co-occurrence of keywords (unigrams), entities, hashtags and URLs across the compared topics.
– According to preliminary results, the Apriori algorithm makes this module more accurate, as one term can belong to different topics.
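
One way to picture the merge is a single greedy pass that joins topics sharing enough features (keywords, entities, hashtags, URLs). This is only a sketch: the overlap criterion, the `min_shared` value and the function name are assumptions, since the slides do not specify how co-occurrence is measured.

```python
def combine_topics(topics, min_shared=2):
    """Greedy single-pass merge of topics given as sets of features.

    A new topic absorbs every existing group with which it shares at least
    min_shared features. Groups are not re-compared after they grow.
    """
    merged = []
    for topic in topics:
        features = set(topic)
        kept = []
        for group in merged:
            if len(group & features) >= min_shared:
                features |= group      # similar enough: fold the group in
            else:
                kept.append(group)
        kept.append(features)
        merged = kept
    return merged

topics = [
    {"obama", "ohio", "#election"},
    {"obama", "ohio", "wins"},
    {"snow", "storm"},
]
merged_topics = combine_topics(topics)
```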

Architecture diagram #14
[Diagram] Components: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner, Query Builder. Data labels: Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets).

Query Builder module #15
– Creates the final Solr queries that retrieve all tweets related to the topic, filtered by time to simulate a real-time scenario.
– Three types of query: keywords; entities and hashtags; URLs.
– If a topic has both keywords and entities, the keywords closest to the entities are selected.
– Image population: if tweets contain links to images (metadata), the images are added to the topic.
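
A time-filtered query of the kind described above can be sketched as a plain string in Solr's standard query syntax. The field names `text` and `created_at`, the function name and the example timestamps are assumptions; the slides do not show the actual schema.

```python
def build_query(field, values, start, end, time_field="created_at"):
    """Build a Solr query string: any of the values in `field`, restricted
    to the [start, end] time range (ISO 8601 timestamps)."""
    quoted = " OR ".join('"{}"'.format(v.replace('"', '\\"')) for v in values)
    return "{}:({}) AND {}:[{} TO {}]".format(field, quoted, time_field, start, end)

keyword_query = build_query(
    "text", ["obama wins", "ohio"],
    "2014-02-25T18:00:00Z", "2014-02-25T18:15:00Z",
)
```

The three query types on the slide would each call this with a different field and value list; restricting the range to the timeslot keeps later tweets out, which is what simulating a real-time scenario requires.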

Query Builder module #16
– Replies are also considered; beware of spam replies.
– Replies do not depend on the text query. More diversity?
– Sentiment analysis, extraction of relevant keywords.

Query Builder module #17
– Diverse tweets are computed based on cosine similarity.
– This approach can be more or less strict depending on the selected threshold.
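
A minimal sketch of threshold-based diversity selection, assuming plain bag-of-words vectors and a greedy pass in input order. The tokenization, the default threshold of 0.5 and the function names are assumptions, not the settings used in the system.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def diverse_tweets(tweets, threshold=0.5):
    """Keep a tweet only if it is less similar than `threshold` to every
    tweet already kept. A lower threshold is stricter."""
    kept = []
    for text in tweets:
        vector = Counter(text.lower().split())
        if all(cosine(vector, other) < threshold for _, other in kept):
            kept.append((text, vector))
    return [text for text, _ in kept]

sample = ["obama wins ohio", "obama wins ohio tonight", "heavy snow in boston"]
selected = diverse_tweets(sample, threshold=0.5)
```

Here the second tweet has a cosine similarity of about 0.87 with the first, so it is dropped at a threshold of 0.5 but would be kept at 0.9.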

Architecture diagram #18
[Diagram] Components: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner, Query Builder, Topic Labeller. Data labels: Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ label), Topics (+ tweets).

Topic Labeller module #19
– BuzzFeed editor-in-chief Ben Smith: “Headlines sure look a lot like tweets these days.”
– For each topic tweet, a score is computed from a formula (not reproduced in this transcript) in which α = 0.8.
– The tweet with the highest score, after cleaning, is selected as the topic label.

Topic Labeller module #20
– Example of tweets after cleaning them.
– Granularity is still an issue: some topic labels are too general or too specific.

Results - Examples of topics #21

Future work #22
– Improve the Topic Combiner module: use of similarity measures.
– Further research on the use of replies and diverse tweets per topic.
– Improve the Topic Labeller module: granularity issue.
– Modifications to the Query Builder module: use of term weights (Solr).

Thank you!