A Nonparametric Method for Early Detection of Trending Topics Zhang Advisor: Prof. Aravind Srinivasan.



Presentation: Motivation; Data Model; Research on Improving the Model; Implementation; Validation of Implementation; Project Schedule.

Motivation: The global human population acts as a set of geographically distributed, multimodal sensors. Traditionally, reports came from journalists, explorers, and 007. Nowadays they come from blogs, forums, product reviews, and social networking. Twitter: 200 million active users around the globe. Tweets are limited to 140 characters. Short delays in reflecting what its users perceive.

Empirical Observation: Trending topics can typically be detected by a sudden high-magnitude spike in activity over some baseline of activity. The sudden spike is often preceded by lower-magnitude activity that is indicative of the topic's imminent popularity. Goal: predict whether a topic will become trending.

Data Model: Two classes of topics: topics that were trending at some point during the period of interest, and topics that were not trending at any point during the period of interest.

Data Model

Implementation – Data Collection: Twitter API. During a sample window, we collect N examples of topics that trended at least once and N examples of topics that never trended. We then sample tweets from the sample window and label each tweet according to the topics it mentions. Finally, we construct a reference signal for each topic based on the tweet activity corresponding to that topic.
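The labeling and signal-construction steps can be illustrated with a minimal sketch. It assumes tweets are already available as (timestamp in seconds, text) pairs, that a tweet "mentions" a topic when the topic appears as a case-insensitive substring, and that activity is measured as a tweet count per fixed-width time bin. The function names and the bin width are hypothetical and are not taken from the slides or from the Twitter API.

```python
def label_tweet(text, topics):
    """Return the topics mentioned in a tweet (case-insensitive substring match)."""
    lowered = text.lower()
    return [t for t in topics if t.lower() in lowered]


def activity_signal(tweets, topic, start, end, bin_seconds):
    """Count tweets mentioning `topic` in consecutive time bins over [start, end)."""
    n_bins = (end - start + bin_seconds - 1) // bin_seconds  # ceiling division
    counts = [0] * n_bins
    needle = topic.lower()
    for timestamp, text in tweets:
        if start <= timestamp < end and needle in text.lower():
            counts[(timestamp - start) // bin_seconds] += 1
    return counts


# Hypothetical sample: (seconds since window start, tweet text)
tweets = [
    (0, "Big game tonight"),
    (30, "game on!"),
    (70, "Nothing here"),
    (130, "What a GAME"),
]
signal = activity_signal(tweets, "game", start=0, end=180, bin_seconds=60)
print(signal)
```

A real pipeline would likely need tokenization rather than substring matching, since a substring match also fires on words that merely contain the topic.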

Implementation – Topics. Filter out trending topics: (a) whose rank was never better than or equal to 3; (b) that did not trend for long enough; (c) that reappear multiple times during the sample window. Collect non-trending topics: (a) sample a list of phrases consisting of n-grams, and filter out the n-grams that contain any topic trending during the sample window; (b) remove the n-grams shorter than three characters.
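The filters above can be sketched as two small functions. This is only an illustration: the slides do not say how long "long enough" is, so `min_duration` is an assumed placeholder, and the function and parameter names are hypothetical.

```python
def keep_trending_topic(best_rank, trend_duration, n_appearances,
                        max_rank=3, min_duration=1800):
    """Apply the three trending-topic filters.

    best_rank: best (lowest) rank the topic reached; 1 is the top rank.
    trend_duration: seconds the topic stayed trending.
    min_duration is an assumed value; the slides give no number.
    """
    if best_rank > max_rank:           # rank never better than or equal to 3
        return False
    if trend_duration < min_duration:  # did not trend for long enough
        return False
    if n_appearances > 1:              # reappeared during the sample window
        return False
    return True


def non_trending_candidates(ngrams, trending_topics, min_chars=3):
    """Keep n-grams that contain no trending topic and have at least min_chars characters."""
    trending = [t.lower() for t in trending_topics]
    kept = []
    for ngram in ngrams:
        lowered = ngram.lower()
        if len(lowered) < min_chars:
            continue
        if any(t in lowered for t in trending):
            continue
        kept.append(ngram)
    return kept


candidates = non_trending_candidates(
    ["ok", "coffee break", "world cup final"], ["World Cup"]
)
print(candidates)
```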

Implementation – Construct Activity Signals

Implementation – Construct Reference Signals. Trending reference signal: we select a small slice of the long signal that terminates at the first onset of the trend. Non-trending reference signal: we assume that the rate signal is largely stationary and select a slice with random start and end times.
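A minimal sketch of the two slicing rules follows. It assumes the activity signal is a list of per-bin counts and that the trend onset is known as an index into that list. One simplification is an assumption of this sketch rather than something stated on the slide: the non-trending slice uses a random start with a fixed length, so that all reference signals are directly comparable. The function names are hypothetical.

```python
import random


def trending_reference(signal, onset_index, length):
    """Slice of up to `length` samples that ends at the first onset of trending."""
    start = max(0, onset_index - length)
    return signal[start:onset_index]


def non_trending_reference(signal, length, rng=random):
    """Slice of `length` samples with a random start.

    Relies on the assumption that the signal is largely stationary,
    so any window is as representative as any other.
    """
    if len(signal) < length:
        raise ValueError("signal is shorter than the requested slice")
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]


long_signal = list(range(10))  # stand-in for a per-bin activity signal
trend_ref = trending_reference(long_signal, onset_index=7, length=3)
other_ref = non_trending_reference(long_signal, length=4, rng=random.Random(0))
print(trend_ref, other_ref)
```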

Strategy to Find Parameters

Simulate the ROC curve: vary one parameter while the others are held fixed. Compare how early detection occurs with the position on the ROC curve. Observe each parameter's effect on moving the ROC curve.
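As an illustration of varying one parameter with the others fixed, the sketch below sweeps only a detection threshold and records one ROC point per value. It assumes each topic already has a scalar score and a ground-truth label, and that a topic is declared trending when its score exceeds the threshold. The scores, labels, and names are made up for the example.

```python
def roc_points(scores, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for theta in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s > theta)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s > theta)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and ground truth (True = the topic trended)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False]
points = roc_points(scores, labels, thresholds=[0.1, 0.35, 0.85])
for theta, (fpr, tpr) in zip([0.1, 0.35, 0.85], points):
    print(f"theta={theta}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Repeating the sweep for each parameter in turn, and recording the detection time alongside each point, would give the comparison described on the slide.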

Implementation – Algorithm 1 written by Zhang Zhang

Implementation – Algorithm 2 written by Zhang Zhang

Implementation – Algorithm 3 written by Zhang Zhang

Implementation – Algorithm 4 written by Zhang Zhang

Implementation – Parallel Computing. Parallelize the score computation across topics. Parallelize the reference-signal distance computations for each topic. I have not yet figured out the algorithm for this parallel part.
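One possible starting point, offered only as a sketch, is to treat each topic as an independent task and hand the tasks to a worker pool from Python's standard `concurrent.futures` module. The squared Euclidean distance and all function names below are assumptions for illustration; the slides do not specify the distance or the parallel scheme.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length signals (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distances_to_references(observation, references):
    """Distance from one observed signal to every reference signal."""
    return [squared_distance(observation, r) for r in references]


def all_topic_distances(observations, references, max_workers=4):
    """Compute distances for every topic concurrently; topics do not depend on each other."""
    topics = list(observations)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(
            lambda topic: distances_to_references(observations[topic], references),
            topics,
        ))
    return dict(zip(topics, results))


observations = {"a": [0, 0], "b": [1, 1]}
references = [[0, 0], [1, 2]]
distances = all_topic_distances(observations, references)
print(distances)
```

Threads keep the sketch simple, but pure-Python arithmetic will not speed up under CPython's global interpreter lock. For CPU-bound work, `ProcessPoolExecutor` with a module-level function in place of the lambda, or a vectorized array library, is the more likely route.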

Validation of Implementation: Use one of the reference signals as the observation signal; the probability that it belongs to its own class should be nearly 1. If the time step is increased, the test result should converge to 1 with a smaller error.
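The self-classification check can be sketched as follows. The slides do not give the formula for the class probability, so this example assumes a distance-weighted vote in which each reference signal contributes a weight of exp(-gamma * distance), with squared Euclidean distance. The weighting, the parameter `gamma`, and the toy signals are all assumptions made for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length signals (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trending_probability(observation, trending_refs, other_refs, gamma=1.0):
    """Share of the total exp(-gamma * distance) weight held by the trending class."""
    w_pos = sum(math.exp(-gamma * squared_distance(observation, r))
                for r in trending_refs)
    w_neg = sum(math.exp(-gamma * squared_distance(observation, r))
                for r in other_refs)
    return w_pos / (w_pos + w_neg)


# Toy reference signals, made up for the check
trending_refs = [[1, 2, 4, 8], [1, 3, 5, 9]]
other_refs = [[1, 1, 1, 1], [2, 1, 2, 1]]

# Feed each reference signal back in as the observation
p_trending = trending_probability(trending_refs[0], trending_refs, other_refs)
p_other = 1.0 - trending_probability(other_refs[0], trending_refs, other_refs)
print(p_trending, p_other)
```

Under this assumed weighting, a reference signal has zero distance to itself and therefore contributes the maximum weight of 1 to its own class, which is why the probability should come out close to 1 when the classes are well separated.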

Deliverables: documentation; source code of the software; large data sets; testing results; final report.

Schedule: 10/30 Learn programming language: Python. 11/30 Write code to classify data by topic. 12/30 Write code for Algorithms 1-4. 1/30 Figure out parallel algorithms. 2/30 Implement parallel algorithms. 3/30 Test. 4/30 Write documents.


Questions? Thank you!