1
Summary Presented by : Aishwarya Deep Shukla
You Are Where You Tweet: A Content-Based Approach to Geo-Locating Twitter Users
2
Problem Definition
Only 26% of Twitter users list their location at the city level; the rest do not.
Just 0.42% of tweets were geo-tagged as of 2009 (the timeframe of this study).
Can we accurately guess a user's location from the tweet text alone?
3
Introduction
Twitter users have been slow to adopt per-tweet geo-tagging.
A user's geo-location is important for understanding trends.
The sparsity of location data limits geo-location applications.
Can we predict a user’s location based purely on the content of the user’s tweets?
Key intuition: specific keywords are more likely to be associated with a particular city or location.
4
Challenges in locating the user with just tweet content
Twitter status updates are noisy!
Twitter users rely on shorthand and SMS-style language, which is inconsistent and hard to text-mine.
Even if location-sensitive words are isolated from a user’s tweets, they can be misleading, because the user may be interested in locations other than their current one.
A user may be associated with more than one location, e.g. travellers.
5
Related Work
Probabilistic language models based on Flickr photo tags (Serdyukov et al.)
Probabilistic model combining text and visual features (Crandall et al.)
Applications:
Detecting earthquakes with real-time Twitter data, where each user is treated as a sensor
Detecting the origin location of news and its diffusion pattern
6
Data Collection
Crawl the Twitter public timeline API (random sampling)
Breadth-first search: crawl social edges
Data:
29 million status updates from over 1 million users
72% of the profiles have no location information listed
7% have bad location information (e.g. Wonderland)
21% have a city name listed in their profile
The data distribution is representative of the population distribution
7
Evaluation Setup
Location estimation problem:
$S_{tweets}(u)$: the set of tweets by user $u$
Estimate the probability of user $u$ being in city $i$, $p(i \mid S_{tweets}(u))$, such that the most probable city, $l_{est}(u)$, is the user's actual location $l_{act}(u)$
Test data: extract the tweets of all users who list their actual location as coordinates, and use them to check the algorithm's accuracy
Metrics:
Error distance: $ErrDist(u) = d(l_{act}(u), l_{est}(u))$
Average error distance: $AvgErrDist(U) = \frac{1}{|U|} \sum_{u \in U} ErrDist(u)$
Accuracy (share of users located within 100 miles): $Accuracy(U) = \frac{|\{u \in U : ErrDist(u) \le 100\}|}{|U|}$
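The evaluation metrics can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: it assumes d(·,·) is the great-circle (haversine) distance in miles, and it uses the 100-mile accuracy threshold that appears in the baseline test result. The function names and the toy coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant

def distance_miles(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def err_dist(l_act, l_est):
    """ErrDist(u): distance between actual and estimated location."""
    return distance_miles(l_act, l_est)

def avg_err_dist(pairs):
    """AvgErrDist(U): mean error distance over (l_act, l_est) pairs."""
    return sum(err_dist(a, e) for a, e in pairs) / len(pairs)

def accuracy(pairs, threshold_miles=100.0):
    """Accuracy(U): share of users whose error distance is within the threshold."""
    hits = sum(1 for a, e in pairs if err_dist(a, e) <= threshold_miles)
    return hits / len(pairs)

# Toy (actual, estimated) pairs with approximate city coordinates.
pairs = [
    ((29.76, -95.37), (29.76, -95.37)),   # exact match
    ((29.76, -95.37), (30.27, -97.74)),   # Houston estimated as Austin
    ((40.71, -74.01), (34.05, -118.24)),  # New York estimated as Los Angeles
]
print(avg_err_dist(pairs), accuracy(pairs))
```

Only the first toy user falls within 100 miles, so a single bad estimate such as the coast-to-coast one dominates the average error distance while changing accuracy by just one user.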
8
Location Determination Algorithm
Step 1: Baseline
  Select a training dataset
  Associate each real location with the frequency of the words used at that location
  Run the model on the test data to predict location
  Accuracy: 10.12%
Step 2: Identifying local words
  Determine words with a high spatial focus
  These words are typically very specific to a place
  Accuracy: 49.8%
Step 3: Optimization
  Smoothing
  Accuracy: 51%
9
ESTIMATION ALGORITHM: BASELINE PROBABILITY
Baseline Location Estimation
Training data: 130,689 users
Plot their tweets
Calculate the estimated probability of each city $i$ with the formula $p(i \mid S_{words}(u)) = \sum_{w \in S_{words}(u)} p(i \mid w)\, p(w)$
Test result: only 10.12% of the 5,119 test users are within 100 miles of the location estimated this way
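A rough Python sketch of the baseline estimator follows. It is an illustration under stated assumptions rather than the study's implementation: p(i|w) is taken as the share of a word's training occurrences that come from city i, p(w) as the word's share of all training words, and tokenization is a plain whitespace split. The function names and the toy tweets are hypothetical.

```python
from collections import Counter, defaultdict

def train(tweets_by_city):
    """Count word occurrences per city from {city: [tweet text, ...]}."""
    word_city = defaultdict(Counter)  # word -> Counter of cities
    total_words = 0
    for city, tweets in tweets_by_city.items():
        for tweet in tweets:
            for word in tweet.lower().split():
                word_city[word][city] += 1
                total_words += 1
    return word_city, total_words

def city_scores(user_tweets, word_city, total_words):
    """Sum p(i|w) * p(w) over the distinct words in a user's tweets."""
    user_words = {w for t in user_tweets for w in t.lower().split()}
    scores = Counter()
    for word in user_words:
        if word not in word_city:
            continue  # words unseen in training contribute nothing
        word_count = sum(word_city[word].values())
        p_w = word_count / total_words          # assumed definition of p(w)
        for city, count in word_city[word].items():
            scores[city] += (count / word_count) * p_w  # p(i|w) * p(w)
    return scores

def estimate_city(user_tweets, word_city, total_words):
    """Return the highest-scoring city, or None if no word is known."""
    scores = city_scores(user_tweets, word_city, total_words)
    return max(scores, key=scores.get) if scores else None

training = {
    "Houston": ["rockets game tonight", "howdy from houston"],
    "New York": ["broadway show tonight", "subway delays again"],
}
word_city, total_words = train(training)
print(estimate_city(["watching the rockets", "howdy yall"], word_city, total_words))
```

Because every known word votes, common words that appear everywhere (such as "tonight" in the toy data) dilute the signal, which is consistent with the low baseline accuracy reported on the slide.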
10
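OPTIMIZATION: Identify Local Words in Tweets
Local words have a more compact geographic scope than other words
Determining spatial focus: model a word's probability at distance $d$ from its geographic center as $C\,d^{-\alpha}$, where $C$ is the focus and $\alpha$ is the dispersion
Determine the focus and dispersion of each word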
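To make the focus and dispersion parameters concrete, the sketch below evaluates the C·d^(−α) model and recovers C and α from sample points. The fit is a swapped-in simplification (ordinary least squares in log-log space), not the fitting procedure used in the study; the function names and sample values are hypothetical.

```python
import math

def word_probability(C, alpha, d):
    """Model value C * d**(-alpha) at distance d (d >= 1) from the word's center."""
    return C * d ** (-alpha)

def fit_focus_dispersion(samples):
    """Least-squares fit of log p = log C - alpha * log d.

    samples: list of (distance, observed probability) pairs, all positive.
    Returns (C, alpha).
    """
    xs = [math.log(d) for d, _ in samples]
    ys = [math.log(p) for _, p in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    alpha = -slope                          # dispersion: how fast usage decays
    C = math.exp(mean_y - slope * mean_x)   # focus: value at distance 1
    return C, alpha

# Synthetic samples generated from a known focus and dispersion.
samples = [(d, word_probability(0.5, 1.2, d)) for d in (1, 10, 100)]
C, alpha = fit_focus_dispersion(samples)
print(C, alpha)
```

A word with a high focus and a high dispersion is heavily used near its center and drops off quickly with distance, which is the profile of a local word.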
11
Tweet Sparsity
Large number of “tiny” word distributions: words issued sparingly and from only a few cities
Smoothing approaches:
State level: aggregate the probability of a word by state
Lattice based: aggregate by 1 × 1 square-degree cells
Model based: use the spatial focus word model
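As an illustration of the lattice-based idea, the sketch below pools a word's per-city counts into 1 × 1 degree cells so that sparse counts from neighbouring cities reinforce each other. It covers only the aggregation step; how the cell-level probability is blended back into per-city estimates is not described on the slide and is left out. The function names, counts, and approximate coordinates are hypothetical.

```python
import math
from collections import Counter

def lattice_cell(lat, lon):
    """Map a coordinate to its 1 x 1 degree cell, keyed by the cell's south-west corner."""
    return (math.floor(lat), math.floor(lon))

def word_prob_by_cell(city_counts, city_coords):
    """Aggregate one word's per-city counts into per-cell probabilities.

    city_counts: {city: occurrences of the word in that city}
    city_coords: {city: (lat, lon)}
    """
    cell_counts = Counter()
    for city, count in city_counts.items():
        cell_counts[lattice_cell(*city_coords[city])] += count
    total = sum(cell_counts.values())
    if total == 0:
        return {}
    return {cell: count / total for cell, count in cell_counts.items()}

city_coords = {
    "Houston": (29.76, -95.37),
    "Pasadena TX": (29.69, -95.21),
    "Austin": (30.27, -97.74),
}
city_counts = {"Houston": 3, "Pasadena TX": 1, "Austin": 1}
print(word_prob_by_cell(city_counts, city_coords))
```

Here Houston and Pasadena fall into the same cell, so a word seen only a handful of times in each still produces a usable signal for that area.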
12
Experimental Results
Goals of the test experiments:
Does classification on spatial distribution help? Yes.
How much do different smoothing techniques help?
What is the impact of the amount of information available about a particular user (the number of tweets)?
13
Estimation Quality: Number of Tweets
14
Comments
Novel approach to locating users based on their tweeted text
The share of users with geo information has steadily increased to 45% now, more than double what it was when this paper was authored.
An algorithm that locates users by tweet text this way could also work for other social media networks and even blogs.
Isn't there a self-selection bias in who chooses to share location … Is it okay to assume overall algorithm accuracy based on the test results?
Suggestions
Why not use hashtags too, and develop a geo-locator based on hashtags?