Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.


Supervised Learning Algorithms - Process
1. Problem
2. Identification of required data
3. Data pre-processing
4. Definition of training set
5. Algorithm selection
6. Training
7. Evaluation with test set
8. Classifier ok? If yes, the classifier is ready; if no, tune parameters and repeat the training step.

Applying SML on our Problem
The same process, instantiated for our problem:
1. Problem: event detection.
2. Identification of required data: data from social networks (e.g. Twitter).
3. Data pre-processing: select the most informative attributes (features).
4. Definition of training set: e.g. 2/3 for training, 1/3 for estimating performance.
5. Algorithm selection: ???
6. Training, evaluation with the test set, and parameter tuning until the classifier is acceptable.
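The split-and-evaluate step above can be sketched in plain Python. The data, the 2/3-1/3 split, and the majority-class "classifier" are purely illustrative stand-ins for a real feature extractor and learning algorithm:

```python
import random

def train_test_split(data, train_frac=2/3, seed=0):
    """Shuffle and split labelled examples into train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def majority_class(train):
    """Placeholder 'classifier': always predict the most common label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(test, predict):
    """Fraction of test examples the predictor labels correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)

# Hypothetical labelled examples: (feature vector, label).
data = [((i, i % 3), i % 2) for i in range(30)]
train, test = train_test_split(data)
label = majority_class(train)
acc = accuracy(test, lambda x: label)
print(len(train), len(test))  # 20 10
```

If the measured accuracy is not acceptable, one would tune parameters (or pick another algorithm) and retrain, exactly as in the loop above.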

Algorithm Selection
- Logic-based algorithms: decision trees, learning sets of rules
- Perceptron-based algorithms: single-/multi-layer perceptrons, radial basis function (RBF) networks
- Statistical learning algorithms: Naive Bayes classifier, Bayesian networks
- Instance-based learning algorithms: k-nearest neighbours (k-NN)
- Support vector machines (SVM)
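As a concrete instance of the instance-based family above, a minimal k-NN classifier fits in a few lines (the 2-D points and labels are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance) -- instance-based learning in its simplest form."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two hypothetical 2-D classes.
train = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
         ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
print(knn_predict(train, (0.5, 0.5)))  # neg
print(knn_predict(train, (5.5, 5.5)))  # pos
```

k-NN needs no training phase at all, which is why it is called "instance-based": the training set itself is the model.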

Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
Detection pipeline: query the Twitter API (Q = "earthquake, shaking"), extract three feature groups from each tweet, and classify with an SVM.
- Feature A: the number of words and the position of the query word within the tweet.
- Feature B: the words in the tweet.
- Feature C: the words before and after the query word.
Pre-processing: separate sentences into sets of words, then apply stemming and stop-word elimination (morphological analysis), and extract features A, B, C.
Definition of training set: 592 positive examples.
Classification: an SVM with a linear kernel; the trained model classifies tweets automatically into positive and negative categories.
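The three feature groups can be sketched as a small extraction function. The function name and the assumption that the query word occurs exactly once are illustrative; the real system works on stemmed, stop-word-filtered text:

```python
def extract_features(tweet, query_word):
    """Features A, B, C in the style of Sakaki et al. (names hypothetical)."""
    words = tweet.lower().split()          # real pre-processing would also
    pos = words.index(query_word)          # stem and remove stop words
    feature_a = (len(words), pos)          # A: word count + query-word position
    feature_b = words                      # B: the words in the tweet
    feature_c = (words[pos - 1] if pos > 0 else None,               # C: words
                 words[pos + 1] if pos + 1 < len(words) else None)  # around it
    return feature_a, feature_b, feature_c

a, b, c = extract_features("wow big earthquake right now", "earthquake")
print(a)  # (5, 2)
print(c)  # ('big', 'right')
```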

Evaluation by Semantic Analysis
- Features B and C do not contribute much to classification performance: users become surprised and produce very short tweets.
- The low recall is due to the difficulty, even for humans, of deciding whether a tweet is actually reporting an earthquake.

Event Detection & Location Estimation Algorithm
If a tweet is classified as positive, compute the temporal and spatial models; if P_occur > P_thres, an event is detected (query the map and send an alert).
Temporal model: each tweet has its post time, and the inter-arrival times follow an exponential distribution with PDF f(t; λ) = λ·e^(−λt), where λ is the fixed probability of posting a tweet between t and t + Δt. If each positive tweet is a false alarm with probability p_f, then n tweets are all false alarms with probability p_f^n, so the probability of event occurrence is P_occur = 1 − p_f^n. The paper uses λ = 0.34 and p_f = 0.35.
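The temporal model is small enough to compute directly; this sketch just plugs the slide's λ and p_f values into the two formulas above:

```python
import math

LAM = 0.34   # λ: per-unit-time tweet probability (value from the slide)
P_F = 0.35   # probability a single "sensor" (tweet) is a false alarm

def pdf(t, lam=LAM):
    """Exponential PDF f(t; λ) = λ·e^(−λt) of tweet inter-arrival times."""
    return lam * math.exp(-lam * t)

def p_occur(n, p_f=P_F):
    """Probability an event occurred given n positive tweets: no event only
    if all n are false alarms, so P_occur = 1 − p_f^n."""
    return 1.0 - p_f ** n

print(round(p_occur(1), 2))   # 0.65
print(round(p_occur(10), 5))  # 0.99997
```

Even with a fairly unreliable individual "sensor" (p_f = 0.35), a handful of agreeing tweets pushes P_occur above any reasonable alert threshold.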

Spatial Model
- Each tweet is associated with a location.
- Kalman filters and particle filters are used for location estimation.

Streaming FSD with Application to Twitter
Problem: solve the first story detection (FSD) problem with a system that works in the streaming model, taking constant time to process each new document and using constant space.

Locality Sensitive Hashing (LSH)
- Solves the approximate nearest-neighbour problem in sublinear time; introduced by Indyk & Motwani (1998).
- Each point is hashed into buckets in such a way that the probability of collision is much higher for points that are near each other.
- When a new point arrives, it is hashed into a bucket; the points already in that bucket are inspected and the nearest one is returned.
- The number of hash tables L is chosen from the probability of two points x, y colliding and the acceptable probability δ of missing a true nearest neighbour.
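A common LSH family for cosine similarity (the similarity used for text vectors here) hashes each point by which side of k random hyperplanes it falls on. This is a minimal sketch; the bit-width k, table count, and vectors are illustrative, not the paper's settings:

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH: nearby vectors (small angle) are likely to
    land in the same bucket of at least one of the L hash tables."""
    def __init__(self, dim, k=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(k)] for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, planes, v):
        # One bit per hyperplane: which side of it the vector falls on.
        return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0
                     for p in planes)

    def add(self, v, doc_id):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(doc_id)

    def candidates(self, v):
        """All documents colliding with v in any table."""
        out = set()
        for planes, table in zip(self.planes, self.tables):
            out.update(table[self._key(planes, v)])
        return out

lsh = HyperplaneLSH(dim=3)
lsh.add((1.0, 0.0, 0.1), "d1")
lsh.add((0.9, 0.1, 0.0), "d2")   # nearly parallel to d1
lsh.add((-1.0, 0.0, 0.0), "d3")  # points the opposite way
print("d1" in lsh.candidates((1.0, 0.0, 0.1)))  # True: identical vector
```

Only the candidate set is then compared exactly, which is what makes the lookup sublinear.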

First Story Detection (FSD)
Each document is compared with the previous ones; if its similarity to the closest earlier document is below a certain threshold, the new document is declared a first story.
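The FSD rule itself can be shown exactly, before any LSH speed-up, using bag-of-words vectors and cosine distance (the documents and threshold are made up for illustration):

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 − cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def first_story_scores(docs, threshold=0.8):
    """Exact (non-streaming) FSD: a document whose distance to its nearest
    earlier document is at least the threshold is flagged as a first story."""
    seen, results = [], []
    for text in docs:
        vec = Counter(text.lower().split())
        dis_min = min((cosine_distance(vec, s) for s in seen), default=1.0)
        results.append((text, dis_min, dis_min >= threshold))
        seen.append(vec)
    return results

docs = ["earthquake in tokyo", "big earthquake in tokyo now",
        "new phone released today"]
for text, score, first in first_story_scores(docs):
    print(first, round(score, 2), text)
```

Comparing every document with all previous ones is quadratic; the streaming system replaces the `min` over all of `seen` with the LSH candidate set.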

Variance Reduction Strategy
LSH returns the true nearest neighbour only with a certain probability, which introduces variance into the novelty scores. To overcome the problem, each query is also compared with a fixed number of the most recent documents.
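That backoff is simple to express: take the minimum distance over both the LSH candidates and a fixed-size window of recent documents. The window size and the 1-D toy "distance" below are illustrative stand-ins for the paper's settings:

```python
from collections import deque

RECENT = deque(maxlen=5)  # fixed-size window of most recent documents

def min_distance(doc_vec, lsh_candidates, dist):
    """Variance reduction: minimum distance over the LSH candidate set AND
    a constant number of most recent documents, then record this document."""
    pool = list(lsh_candidates) + list(RECENT)
    score = min((dist(doc_vec, other) for other in pool), default=1.0)
    RECENT.append(doc_vec)
    return score

# Toy 1-D "vectors" with absolute difference standing in for cosine distance.
d = lambda a, b: abs(a - b)
print(min_distance(5.0, [], d))            # 1.0 (nothing seen yet)
print(round(min_distance(5.2, [], d), 2))  # 0.2 (caught via the recent window)
```

Note the second document found its near neighbour through the recent-documents window even though LSH returned no candidates, which is exactly the failure mode this strategy covers.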

Streaming FSD Algorithm
1. Get the next document d.
2. Apply LSH; let S be the set of points that collide with d.
3. Apply FSD: compute dis_min(d) over S, then compare d to a fixed number of the most recent documents and update the distance.
4. If dis_min(d) >= t, declare d a first story.
5. Add d to the inverted index; repeat while more documents remain.

A Constant Space & Time Approach
- Limit the number of documents inside a single bucket to a constant; if the bucket is full, the oldest document is removed.
- Limit the number of comparisons to a constant: compare each new document with at most 3L documents it collided with, taking the 3L documents that collide most frequently.
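The constant-space bucket is exactly a bounded FIFO, which Python's `deque(maxlen=...)` gives for free. The capacity value here is illustrative:

```python
from collections import deque, defaultdict

BUCKET_CAP = 8  # constant per-bucket capacity (illustrative value)

class BoundedTable:
    """LSH hash table whose buckets hold at most BUCKET_CAP documents;
    when a bucket is full, appending evicts the oldest document."""
    def __init__(self):
        self.buckets = defaultdict(lambda: deque(maxlen=BUCKET_CAP))

    def add(self, key, doc_id):
        self.buckets[key].append(doc_id)  # deque drops the oldest if full

    def get(self, key):
        return list(self.buckets[key])

table = BoundedTable()
for i in range(12):
    table.add("k", f"doc{i}")
print(table.get("k"))  # only the 8 most recent: doc4 .. doc11
```

Since old documents are unlikely first-story neighbours anyway, capping buckets this way trades little accuracy for a hard space bound.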

Detecting Events in Twitter Posts
- Threading: subsets of tweets about the same topic. Run streaming FSD, assign a novelty score to each tweet, and output which earlier tweet it is most similar to.
- Link relation: tweet a links to tweet b if b is the nearest neighbour of a and 1 − cos(a, b) < thresh.
- If the nearest neighbour of a is within distance thresh, a is assigned to that neighbour's existing thread; otherwise a new thread is created.
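The threading rule can be sketched on a toy stream where 1-D distance stands in for 1 − cosine; the values and threshold are made up for illustration:

```python
def assign_threads(stream, thresh=0.5):
    """Each item links to its nearest earlier item; a link shorter than the
    threshold joins that item's thread, otherwise a new thread starts."""
    threads, thread_of, seen = [], {}, []
    for i, t in enumerate(stream):
        nearest = min(seen, key=lambda j: abs(stream[j] - t), default=None)
        if nearest is not None and abs(stream[nearest] - t) < thresh:
            tid = thread_of[nearest]       # join the neighbour's thread
        else:
            threads.append([])             # novel tweet: start a new thread
            tid = len(threads) - 1
        thread_of[i] = tid
        threads[tid].append(t)
        seen.append(i)
    return threads

print(assign_threads([0.0, 0.1, 5.0, 5.2, 0.2]))
# [[0.0, 0.1, 0.2], [5.0, 5.2]]
```

In the real system the nearest neighbour comes from the streaming FSD stage itself, so threading adds no extra search cost.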

Twitter Experiments
-  million time-stamped tweets.
- The first tweet of each thread was manually labelled as Event, Neutral, or Spam.
- Gold standard: the 820 tweets on which both annotators agreed.

Twitter Results
Ways of ranking the threads:
- Baseline: random ordering of tweets.
- Size of thread: threads ranked by number of tweets.
- Number of users: threads ranked by the number of unique users posting in the thread.
- Entropy + users: threads ranked using word entropy, H = −Σ_i (n_i / n)·log(n_i / n), where n_i is the number of times word i appears in the thread and n is the total number of words in the thread.
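The entropy formula above is a few lines of Python; the example threads are made up, but they show why low entropy is a spam signal (repetitive threads concentrate their word mass):

```python
import math
from collections import Counter

def thread_entropy(tweets):
    """Word entropy of a thread: H = −Σ (n_i/n)·log(n_i/n), where n_i counts
    word i across the thread and n is the total word count."""
    counts = Counter(w for t in tweets for w in t.lower().split())
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

spam = ["buy now", "buy now", "buy now"]
event = ["earthquake hits city", "strong quake reported",
         "buildings shaking downtown"]
print(thread_entropy(spam) < thread_entropy(event))  # True
```

Combining entropy with the unique-user count then demotes threads that are both repetitive and dominated by few posters.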

Twitter Results

References
- S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques.
- Takeshi Sakaki, Makoto Okazaki, Yutaka Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors.
- Saša Petrović, Miles Osborne, Victor Lavrenko. Streaming First Story Detection with Application to Twitter.