2014 How Do Celebrities Tweet? A Data Science Case Study Christina Zou Data Scientist, Twitter October 2014 #GHC14 2014.


1 2014 How Do Celebrities Tweet? A Data Science Case Study Christina Zou Data Scientist, Twitter (@christinazou) October 2014 #GHC14 2014

2 The Problem: Come up with a method to meaningfully and efficiently classify VITs (Very Important Tweeters) by how they use Twitter.

3 2014

4 Data Scientist Lesson # 1 Build intuition before building models.

5 2014 K-Means Clustering
• For a given k (the # of clusters), iteratively find the set of cluster centers {u_i} and partitions {S_i} that minimize the objective function, a sum of squared errors over data points x:
  $$ J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - u_i \rVert^2 $$
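The iterative procedure described on this slide can be sketched in plain Python. This is a minimal illustration of the standard alternating scheme (assign each point to its nearest center, then recompute each center as the mean of its points), not the code used in the talk. The function names, the random initialization, and the `max_iters` and `seed` parameters are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, max_iters=100, seed=0):
    """Return (centers, assignment, sse) for a list of points.

    Assumption: initial centers are k distinct input points chosen at random.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # partitions stopped changing, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = mean_point(members)
    # Objective: sum of squared errors between points and their centers.
    sse = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, sse
```

The returned `sse` is the value of the objective for the final partition. The procedure only finds a local minimum, so results can depend on the initial centers.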

6 2014 Why Use K-Means Clustering?
• An unsupervised learning method: no training labels or assumptions about use case categories needed
• Simple, intuitive model
• Relatively efficient: O(KNTD), where K = # clusters, N = # points, T = # dimensions per point, D = # iterations, and K, T, D are generally much smaller than N

7 2014 Feature Extraction
• Ad hoc jobs to pull raw features for testing and initial model training
  − Twitter uses Pig/Scalding for Hadoop jobs
  − Feature values pulled from a variety of HDFS tables, smartly joined
  − Writing and running these scripts can take days to weeks!
• Production jobs to regularly collect features (and eventually compute and store use case classifications)
  − Unit testing
  − Configure, optimize, and deploy onto a cluster (Apache Mesos)
  − Scheduling, alerting, and monitoring (Apache Aurora)

8 2014 Data Scientist Lesson # 2 Data scientists spend a lot of time on tasks that are not sexy.

9 2014 Feature Engineering: Not So Fast…
• If we don’t normalize by # of tweets, the effects of other signals will be drowned out: # tweets with ‘you’ => # tweets with ‘you’ / # tweets
• We need to account for varying distributions between different features: feature => (feature - mean(feature)) / sd(feature) (i.e. z_score(feature))
• Outliers need to be dealt with: replace anything outside the 1.5 × IQR range with the upper/lower bound of that range
• Missing data needs to be dealt with: replace missing data with the mean
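The four cleaning steps on this slide can be sketched with Python's standard `statistics` module. This is an illustrative sketch only. The helper names are hypothetical, the fences are assumed to sit 1.5 × IQR beyond the first and third quartiles, and the z-score is assumed to divide by the population standard deviation. The slide does not state the order in which the steps were applied.

```python
import statistics


def rate(count, n_tweets):
    """Normalize a raw count by tweet volume; None if the user has no tweets."""
    return count / n_tweets if n_tweets else None


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed reading of the slide)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def z_scores(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd else 0.0 for v in values]
```

One possible ordering is `z_scores(clip_outliers(impute_mean(column)))`. Imputing before clipping lets extreme values pull the imputed mean, so clipping the observed values first may be preferable for heavy-tailed features.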

10 2014 Data Scientist Lesson # 3 Know the importance of cleaning your data.

11 2014 Feature Engineering: Curse of Dimensionality
• Clustering high-dimension feature sets is dangerous even if you clean your data. This is the curse of dimensionality: higher dimensionality means...
  − You need more data to avoid issues of sparsity
  − In high dimensionality, the distance between points becomes more uniform
  − Harder to interpret an n-dimensional user feature vector if n is large
• Solutions:
  − Manual feature selection
  − Dimensionality reduction: Principal Components Analysis, Singular Value Decomposition
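The claim that distances become more uniform in high dimensions can be checked with a small experiment. The sketch below is an illustration on synthetic data, not an analysis from the talk. It assumes points drawn uniformly from the unit hypercube, and the function name and parameters are hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast shrinks as `dim` grows, which is why nearest-center assignments in k-means become less meaningful on wide feature sets.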

12 2014 Data Scientist Lesson # 4 Working with big data is fundamentally different from working with ‘little’ data (i.e. it requires different techniques).

13 2014 Feature Engineering: The Feature Set

14 2014 The Model
• Fitting the model is straightforward in R (in this case):
• And we get results of the form:
  − Cluster 1 Center: {# tweets = 0.8, # engagements received/tweet = 0.2, time spent in app = 0.5,...}
  − Cluster 2 Center: {# tweets = 0.2, # engagements received/tweet = -0.4, time spent in app = 0.1,...}
  − …

15 2014 The Model
• Q: what ‘k’ do we choose for k-means? I.e., how many clusters do we want?
• A: There’s no right answer.
• We use the elbow plot method and common sense:
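An elbow plot shows the k-means objective (sum of squared errors) against k. The "elbow" is the point after which adding clusters stops buying much improvement. The sketch below shows one hypothetical way to locate that bend numerically, by picking the k with the largest second difference in SSE. The heuristic, the function name, and the sample numbers are assumptions for illustration. The talk itself pairs the plot with judgment rather than a fixed rule.

```python
def elbow_k(sse_by_k):
    """Pick the k where the SSE curve bends most sharply.

    sse_by_k maps each candidate k to its SSE; the k values are assumed
    to be evenly spaced (e.g. 1, 2, 3, ...). The first and last k cannot
    be chosen because the bend needs a neighbor on each side.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of k")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Second difference: large when the drop before k is much
        # bigger than the drop after k.
        bend = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


if __name__ == "__main__":
    # Made-up SSE values for illustration only.
    example = {1: 100.0, 2: 60.0, 3: 20.0, 4: 17.0, 5: 15.0, 6: 14.0}
    print(elbow_k(example))
```

A number like this is best treated as a starting point. Whether the resulting clusters are interpretable still decides the final k.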

16 2014 Data Scientist Lesson # 5 Data science is both a science and an art.

17 2014 Results and Interpretation

18 2014 Results
• We have two healthy VIT use cases:
  − Super Sharers: Produce a high volume of personal, stream-of-consciousness tweets. Mostly young, in Music, News. AFGR = 2.0x median
  − Partnerships Influencers: Networking, career-oriented, media-savvy older tweeters in News, Music, Sports, TV. AFGR = 1.3x median
• Behaviors we want to encourage out of VITs include:
  − High-volume tweeting of personal content
  − Outbound social engagements (giving faves, RTs, follows, etc.)
  − Media-laden tweets

19 2014

20 Data Scientist Lesson # 6 Interpretation is an underrated but critical skill for data scientists.

21 2014 Questions?
1. Build intuition before building models.
2. Don’t underestimate the data extraction step.
3. Clean your data!
4. Big data can break ‘regular-sized’ analytical techniques.
5. Data science is both a science and an art.
6. Interpretation is an underrated but critical skill for data scientists.
@christinazou

22 2014 Got Feedback? Rate and review the session using the GHC Mobile App. To download, visit www.gracehopper.org

