How Do Celebrities Tweet? A Data Science Case Study
Christina Zou, Data Scientist, Twitter
October 2014 #GHC14

The Problem
Come up with a method to meaningfully and efficiently classify VITs (Very Important Tweeters) by how they use Twitter.


Data Scientist Lesson #1: Build intuition before building models.

K-Means Clustering
For a given k (the number of clusters), iteratively find the set of cluster centers {μ_i} and partitions {S_i} that minimize the objective function, a sum of squared errors over the data points x:

  J = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
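To make the objective concrete, a small illustrative sketch in R (toy data, not the talk's features; R is the language the talk itself uses for modeling). kmeans() reports exactly this sum of squared errors as tot.withinss:

  set.seed(1)
  # Toy 2-D data: two well-separated blobs of 25 points each
  x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))

  # Iteratively fit centers and partitions with k = 2
  fit <- kmeans(x, centers = 2)

  # The objective J above: the sum over clusters of squared
  # distances from each point to its assigned cluster center
  fit$tot.withinss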

Why Use K-Means Clustering?
- An unsupervised learning method: no training labels or assumptions about use-case categories needed
- A simple, intuitive model
- Relatively efficient: O(KNTD), where K = # clusters, N = # points, T = # dimensions per point, D = # iterations, and K, T, D are generally much smaller than N

Feature Extraction
- Ad hoc jobs to pull raw features for testing and initial model training
  - Twitter uses Pig/Scalding for Hadoop jobs
  - Feature values are pulled from a variety of HDFS tables and carefully joined
  - Timeframe: it can take days to weeks to write and run these scripts!
- Production jobs to regularly collect features (and eventually compute and store use-case classifications)
  - Unit testing
  - Configure, optimize, and deploy onto a cluster (Apache Mesos)
  - Scheduling, alerting, and monitoring (Apache Aurora)

Data Scientist Lesson #2: Data scientists spend a lot of time on tasks that are not sexy.

Feature Engineering: Not So Fast…
- If we don't normalize by the number of tweets, the effects of other signals will be drowned out: # tweets with 'you' => # tweets with 'you' / # tweets
- We need to account for varying distributions across features, so we standardize: feature => (feature - mean(feature)) / sd(feature), i.e. z_score(feature)
- Outliers need to be dealt with: cap anything beyond 1.5 * IQR outside the quartiles at the corresponding bound
- Missing data needs to be dealt with: replace missing values with the feature mean
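A minimal sketch of these cleaning steps in R, assuming x holds one raw feature column (the variable name is illustrative, not from the talk):

  # Standardize: center by the mean, scale by the standard deviation
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

  # Cap outliers at the Tukey fences, 1.5 * IQR beyond the quartiles
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  rng <- 1.5 * IQR(x, na.rm = TRUE)
  x   <- pmin(pmax(x, q[1] - rng), q[2] + rng)

  # Impute missing values with the feature mean
  x[is.na(x)] <- mean(x, na.rm = TRUE)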

Data Scientist Lesson #3: Know the importance of cleaning your data.

Feature Engineering: Curse of Dimensionality
- Clustering high-dimensional feature sets is dangerous even if you clean your data. This is the curse of dimensionality: higher dimensionality means...
  - You need more data to avoid sparsity issues
  - Distances between points become more uniform, so distance-based clustering discriminates less
  - An n-dimensional user feature vector is harder to interpret when n is large
- Solutions:
  - Manual feature selection
  - Dimensionality reduction: Principal Component Analysis, Singular Value Decomposition
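As a sketch of the dimensionality-reduction route in R, assuming the cleaned features sit in a numeric data frame called features (a hypothetical name), PCA via the built-in prcomp might look like:

  # PCA on the cleaned feature matrix
  pc <- prcomp(features, center = TRUE, scale. = TRUE)

  # How much variance each principal component explains
  summary(pc)$importance["Proportion of Variance", ]

  # Keep the first few components as a lower-dimensional
  # representation to cluster on
  reduced <- pc$x[, 1:3]

prcomp computes the principal components through an SVD of the centered (and here also scaled) data, so this one call touches both techniques named on the slide.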

Data Scientist Lesson #4: Working with big data is fundamentally different from working with 'little' data; it requires different techniques.

Feature Engineering: The Feature Set

The Model
- Fitting the model is straightforward in R (in this case); a sketch follows this list
- We get results of the form:
  - Cluster 1 center: {# tweets = 0.8, # engagements received/tweet = 0.2, time spent in app = 0.5, ...}
  - Cluster 2 center: {# tweets = 0.2, # engagements received/tweet = -0.4, time spent in app = 0.1, ...}
  - ...
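A minimal sketch of what that fit might look like, assuming the cleaned, z-scored features live in a data frame called features (a hypothetical name) and k has already been chosen:

  # nstart runs several random initializations and keeps the best,
  # guarding against poor local minima; k = 5 is purely illustrative
  set.seed(42)
  fit <- kmeans(features, centers = 5, nstart = 25)

  # Cluster centers, one row per cluster, in z-scored feature units
  fit$centers

  # Per-user cluster assignments
  fit$cluster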

The Model
- Q: What k do we choose for k-means? That is, how many clusters do we want?
- A: There is no right answer. We use the elbow-plot method and common sense; a sketch of the elbow computation follows:
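One common way to produce the elbow plot in R, under the same assumed features data frame:

  # Total within-cluster SSE for a range of candidate k
  ks  <- 1:10
  wss <- sapply(ks, function(k)
    kmeans(features, centers = k, nstart = 25)$tot.withinss)

  # Look for the 'elbow' where adding clusters stops buying
  # much reduction in SSE
  plot(ks, wss, type = "b",
       xlab = "k (number of clusters)",
       ylab = "Total within-cluster SSE")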

Data Scientist Lesson #5: Data science is both a science and an art.

Results and Interpretation

Results
- We found two healthy VIT use cases:
  - Super Sharers: produce a high volume of personal, stream-of-consciousness tweets. Mostly young; in Music and News. AFGR = 2.0x median
  - Partnerships Influencers: networking, career-oriented, media-savvy older tweeters in News, Music, Sports, and TV. AFGR = 1.3x median
- Behaviors we want to encourage in VITs include:
  - High-volume tweeting of personal content
  - Outbound social engagements (giving faves, RTs, follows, etc.)
  - Media-laden tweets


Data Scientist Lesson #6: Interpretation is an underrated but critical skill for data scientists.

Questions?
1. Build intuition before building models.
2. Don't underestimate the data extraction step.
3. Clean your data!
4. Big data can break 'regular-sized' analytical techniques.
5. Data science is both a science and an art.
6. Interpretation is an underrated but critical skill for data scientists.

Got Feedback? Rate and review the session using the GHC Mobile App.