SocialStories: Segmenting Stories within Trending Twitter Topics


SocialStories: Segmenting Stories within Trending Twitter Topics Kokil Jaidka, Prakhar Gupta, Sajal Rustagi, Kaushik Ramachandran | Big Data Experience Lab, Adobe Research

Motivation: This work is conducted in the context of Topic Detection and Tracking in social media. The volume and velocity of streaming Twitter content cause filter failure (Shirky, 2008). The variety of topical data makes it difficult to drill in and out of related stories. Stories decay after some time and are replaced by new stories. Research objectives: (1) to identify individual stories in topical discussions on social media; (2) to detect new stories as they develop.

Prior Art: Story segmentation (identifying “bursty” topics on Twitter): hashtag-based LDA modeling (supervised and unsupervised); clustering with dictionary learning and cosine similarity. However, these studies do not leverage the temporal aspect of Twitter topics. Story detection (detecting new emerging stories on Twitter): hashtag correlation to identify new Twitter trends; frequency analysis of co-occurring words; content ageing with exponential decay. However, these studies do not consider recurring stories.

Methodology: An unsupervised, single-pass, incremental online clustering algorithm. Incoming tweets are modeled as feature vectors of entities. Every cluster represents a story. The features of every incoming tweet are compared against the features of existing clusters. New clusters are created when the maximum similarity falls below the empirically set threshold.

Step 1: Model tweets as feature vectors. Syntactic and semantic parsing is used to obtain entities. Term frequencies and inverse cluster frequencies are calculated. Initial weights are assigned to entities. Words identified as keywords or hashtags are boosted.
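A minimal sketch of how Step 1 could look in practice. The slides do not give the weighting formula, so the smoothed inverse cluster frequency (modeled on IDF), the `boost` factor, and the function name `entity_weights` are all assumptions made for illustration; entity extraction itself is assumed to have already happened.

```python
import math
from collections import Counter


def entity_weights(tweet_entities, clusters, boosted=frozenset(), boost=2.0):
    """Weight each entity by term frequency times inverse cluster frequency.

    tweet_entities: list of entity strings extracted from one tweet.
    clusters: list of sets, each holding the entities seen in one cluster.
    boosted: hypothetical set of keywords that get extra weight.
    boost: hypothetical multiplier for keywords and hashtags.
    """
    tf = Counter(tweet_entities)
    n_clusters = len(clusters)
    weights = {}
    for entity, count in tf.items():
        # Number of existing clusters that already contain this entity.
        cf = sum(1 for cluster in clusters if entity in cluster)
        # Smoothed inverse cluster frequency (assumed form, analogous to IDF).
        icf = math.log((1 + n_clusters) / (1 + cf)) + 1.0
        weight = count * icf
        # Keywords and hashtags are boosted, as the slide describes.
        if entity in boosted or entity.startswith("#"):
            weight *= boost
        weights[entity] = weight
    return weights


clusters = [{"adobe", "photoshop"}, {"adobe", "#max"}]
print(entity_weights(["adobe", "#creativecloud", "adobe"], clusters))
```

In this sketch an entity that appears in every cluster ("adobe") gets the minimum inverse cluster frequency, while an unseen hashtag is weighted up twice: once for rarity and once by the boost.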

Step 2: Calculating feature weights. Containment: the importance of an entity in the tweet. The more entities a tweet contains, the less relevant the entities common to the cluster and the tweet become. Resemblance: assesses the similarity between the tweet and a cluster.
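The slides name containment and resemblance without giving formulas. The sketch below assumes the common set-based definitions (shared entities divided by the tweet's entities for containment, and by the union for resemblance); the authors may have used a weighted variant, so treat this as one plausible reading rather than their exact method.

```python
def containment(tweet_entities, cluster_entities):
    """Fraction of the tweet's entities that also occur in the cluster.

    With a fixed number of shared entities, a tweet with more entities
    yields a lower value, matching the behaviour described on the slide.
    """
    tweet, cluster = set(tweet_entities), set(cluster_entities)
    if not tweet:
        return 0.0
    return len(tweet & cluster) / len(tweet)


def resemblance(tweet_entities, cluster_entities):
    """Shared entities divided by all distinct entities (Jaccard-style)."""
    tweet, cluster = set(tweet_entities), set(cluster_entities)
    union = tweet | cluster
    if not union:
        return 0.0
    return len(tweet & cluster) / len(union)


cluster = {"adobe", "flash", "update"}
print(containment({"adobe", "flash"}, cluster))
print(containment({"adobe", "flash", "java", "oracle"}, cluster))
print(resemblance({"adobe", "flash"}, cluster))
```

The two `containment` calls share the same two entities with the cluster, but the longer tweet scores lower, which illustrates why common entities matter less as a tweet grows.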

Step 3: Incremental clustering. Incremental clustering is defined as: “for an update sequence of n points in M, maintain a collection of k clusters such that as each one is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one.” Goals: minimize the maximum cluster diameter; minimize the number of clusters given a fixed diameter. For each incoming data point (tweet), find the cluster that gives the maximum similarity. The similarity used is term-statistical similarity (cosine similarity).
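The single-pass procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: tweets are assumed to arrive as dictionaries of entity weights, a cluster is represented by the running sum of its members' vectors, the `threshold` value of 0.3 is a placeholder for the empirically set threshold, and the merge step from the quoted definition is left out for brevity.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_stream(tweet_vectors, threshold=0.3):
    """Assign each tweet to its most similar cluster, or open a new one.

    Returns the cluster index chosen for each tweet and the cluster
    vectors (accumulated entity weights) after the single pass.
    """
    clusters = []
    assignments = []
    for vec in tweet_vectors:
        best_idx, best_sim = -1, 0.0
        for idx, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx == -1 or best_sim < threshold:
            # Maximum similarity is below the threshold: start a new story.
            clusters.append(Counter(vec))
            assignments.append(len(clusters) - 1)
        else:
            # Fold the tweet's entity weights into the matching story.
            clusters[best_idx].update(vec)
            assignments.append(best_idx)
    return assignments, clusters


stream = [
    {"adobe": 1.0, "flash": 1.0},
    {"adobe": 1.0, "flash": 1.0, "update": 1.0},
    {"photoshop": 1.0, "release": 1.0},
]
assignments, clusters = cluster_stream(stream)
print(assignments)
```

Because each tweet is compared only against the current cluster vectors and never revisited, the cost per tweet grows with the number of live clusters rather than with the number of tweets seen so far.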

Step 4: Timely update and decay. Exponential decay.
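The slide gives no detail on what is decayed or at what rate. The sketch below assumes that the entity weights of a cluster decay with a fixed half-life and that entities falling below a small floor are dropped; `half_life_hours`, `floor`, and the function name are hypothetical.

```python
import math


def decay_weights(weights, elapsed_hours, half_life_hours=24.0, floor=1e-3):
    """Apply exponential decay to a cluster's entity weights.

    weights: dict mapping entity to its current weight.
    elapsed_hours: time since the weights were last updated.
    half_life_hours: assumed time for a weight to halve.
    floor: assumed cut-off below which an entity is dropped.
    """
    rate = math.log(2.0) / half_life_hours
    factor = math.exp(-rate * elapsed_hours)
    decayed = {entity: w * factor for entity, w in weights.items()}
    # Drop entities whose weight has decayed to a negligible value.
    return {entity: w for entity, w in decayed.items() if w >= floor}


print(decay_weights({"adobe": 4.0, "flash": 0.001}, elapsed_hours=24.0))
```

Under this assumption a story that stops receiving tweets loses weight steadily, so a later burst on the same entities can either revive it or, once it has faded, start a new cluster.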

Evaluation: Dataset: 0.19 million tweets collected under the keyword “Adobe” in June 2013. The 4-day period of June 16-20, which had the maximum volume, was selected for evaluation. Gold standard: hand-curated daily reports.

Comparing Social Stories clusters with Gold Standard stories: 24/31 stories found

Evaluation baselines: (1) LDA with Gibbs sampling. LDA analyzes the words of the original texts to discover the topics (sets of words that tend to co-occur) that run through them. It does not require any prior annotation (Blei, Ng & Jordan, 2003; Mehrotra, Sanner, Buntine & Xie, 2013). (2) Wavelet approach: wavelet transformations for event detection in social media. It builds wavelet signals for individual words based on their frequencies and filters away trivial words by looking at their signal auto-correlations. The remaining words are then clustered to form events with a modularity-based graph partitioning technique (Weng & Lee, 2011; Guille, 2014).

Comparing baseline results with Gold Standard stories: 2/7 stories found

Other results for the Social Stories system: Purity: the proportion of tweets in a cluster that actually belong together in that cluster (avg 85%). Rand Index: pairwise agreement; the fraction of correct pairwise decisions made by the clustering algorithm for recurring tweets (avg 0.8).
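For reference, the two measures can be computed as in the sketch below, which uses the standard textbook definitions of purity and the Rand index. The label lists are made-up toy data, not the evaluation data from the slides, and the functions assume at least two labelled tweets.

```python
from collections import Counter
from itertools import combinations


def purity(predicted, gold):
    """Share of tweets that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(predicted, gold):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(predicted)


def rand_index(predicted, gold):
    """Fraction of tweet pairs on which the clustering and gold labels agree.

    A pair counts as correct when both place the two tweets together,
    or both place them apart.
    """
    pairs = list(combinations(range(len(predicted)), 2))
    agree = sum(
        (predicted[i] == predicted[j]) == (gold[i] == gold[j])
        for i, j in pairs
    )
    return agree / len(pairs)


predicted = [0, 0, 0, 1, 1]
gold = ["a", "a", "b", "b", "b"]
print(purity(predicted, gold))      # 0.8
print(rand_index(predicted, gold))  # 0.6
```

Purity rewards clusters dominated by one gold story but ignores a story split across clusters; the Rand index penalizes both kinds of error, which is why the two are reported together.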

References
[1] S. Ahmed and M. M. Skoric. My name is Khan: The use of Twitter in the campaign for 2013 Pakistan general election. In System Sciences (HICSS), 2014 47th Hawaii International Conference on, pages 2242–2251. IEEE, 2014.
[2] J. Allan. Topic detection and tracking: event-based information organization, volume 12. Springer Science & Business Media, 2002.
[3] F. Alvanaki, S. Michel, K. Ramamritham, and G. Weikum. See what’s enblogue: real-time emergent topic identification in social media. In Proceedings of the 15th International Conference on Extending Database Technology, pages 336–347. ACM, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, page 4. ACM, 2010.
[6] C. de Mazancourt and U. Dieckmann. Trade-off geometries and frequency-dependent selection. The American Naturalist, 164(6):765–778, 2004.
[7] A. Guille. Diffusion de l’information dans les médias sociaux: modélisation et analyse. PhD thesis, Université Lumière Lyon 2, 2014.
[8] S. P. Kasiviswanathan, P. Melville, A. Banerjee, and V. Sindhwani. Emerging topic detection using dictionary learning. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 745–754. ACM, 2011.
[9] H. Koga and T. Taniguchi. Developing a user recommendation engine on Twitter using estimated latent topics. In Human-Computer Interaction. Design and Development Approaches, pages 461–470. Springer, 2011.
[10] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 1155–1158. ACM, 2010.
[12] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 889–892. ACM, 2013.
[13] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, pages 91–100. ACM, 2008.
[14] H. Sayyadi, M. Hurst, and A. Maykov. Event detection and tracking in social streams. In ICWSM, 2009.
[15] C. Shirky. It is not information overload. It is filter failure. 2008.
[16] J. Weng and B.-S. Lee. Event detection in Twitter. ICWSM, 11:401–408, 2011.
[17] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349. Springer, 2011.

Thank you!