Data Science W205 Project Presentation
  Building a Subreddit Profiler
  Jason Goodman
Overview
  Project Goals
  Process
  Results
  Reflection
  Future Work
  Feel free to contact me:
The Project!
  - Or -
  Open RStudio, then run:
    install.packages("shiny"); require(shiny)
    runGitHub('subreddit_profiler', 'NosajGithub')
Project Goals
  Original Research Questions
    Which phrases are most associated with upvotes/downvotes?
    Which subreddits are the most positive/negative per sentiment analysis?
    Which subreddits use the most sophisticated language?
  Final Research Questions
    How do different subreddits vary from one another?
      –What do people in each subreddit like?
      –What don’t they like?
      –What do they tend to talk about?
      –When do they use Reddit?
Process, Step 1: Extract
  Scrape top 5,000 safe-for-work subreddits from redditlist.com using Beautiful Soup
  Scrape Reddit with PRAW (the Python Reddit API Wrapper):
    –Read in a subreddit
    –Call up the top 1000 submissions
    –Grab the best 200 comments
    –Store them on S3 in raw JSON with Boto
  Run for ~2 weeks from a Screen session on an EC2 instance
  30,984,017 comments from 275 subreddits
    –9+ GB of text
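The PRAW steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual scraper: it assumes the PRAW 4+ style API (`reddit.subreddit(...).top(...)`, `submission.comments.replace_more`, `submission.comments.list`), which differs from older PRAW releases. The function names, the record field names, and the reading of "best 200 comments" as the first 200 comments of the flattened "best"-sorted tree are all assumptions made for illustration. The upload to S3 with Boto is left out.

```python
import json


def collect_top_comments(reddit, subreddit_name, max_submissions=1000, max_comments=200):
    """Collect JSON-ready comment records for one subreddit.

    `reddit` is assumed to be a praw.Reddit client (PRAW 4+ style API).
    Field names in the returned dicts are hypothetical.
    """
    records = []
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.top(time_filter="all", limit=max_submissions):
        # The sort order has to be chosen before the comments are fetched.
        submission.comment_sort = "best"
        # Drop the "load more comments" placeholders instead of expanding them.
        submission.comments.replace_more(limit=0)
        # list() flattens the comment tree; keep only the first max_comments.
        for comment in submission.comments.list()[:max_comments]:
            records.append({
                "subreddit": subreddit_name,
                "submission_id": submission.id,
                "comment_id": comment.id,
                # Deleted accounts come back with no author object.
                "author": comment.author.name if comment.author else None,
                "score": comment.score,
                # Assumption: gilding info may be absent, so default to 0.
                "gilded": getattr(comment, "gilded", 0),
                "created_utc": comment.created_utc,
                "body": comment.body,
            })
    return records


def to_json_line(record):
    """Serialize one record as raw JSON, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)
```

In real use, `reddit` would be a `praw.Reddit` instance created with your own API credentials and user agent; passing the client in as a parameter also makes the function easy to try out against a stub.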
Process, Step 2: Transform and Load
  JSON to an easier format for EMR
  Single, space-delimited line
  Unicode and weird characters
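One way to picture this step is a small function that turns a raw JSON comment into a single space-delimited line. This is only a sketch under assumptions: the field names (`subreddit`, `score`, `body`), the field order, and the choice to replace control characters with spaces are hypothetical, not the project's actual format.

```python
import json
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def flatten_comment(raw_json):
    """Turn one raw JSON comment into a single space-delimited line.

    Assumed layout: subreddit, score, then the comment text.
    """
    record = json.loads(raw_json)
    # Normalize compatibility forms (for example, non-breaking spaces).
    body = unicodedata.normalize("NFKC", record["body"])
    # Replace control characters (carriage returns, newlines, tabs) with spaces
    # so that one comment can never spill across several lines.
    body = "".join(" " if unicodedata.category(ch).startswith("C") else ch
                   for ch in body)
    # Collapse runs of whitespace into single spaces.
    body = _WHITESPACE.sub(" ", body).strip()
    return "{} {} {}".format(record["subreddit"], record["score"], body)
```

Keeping each comment on exactly one line matters because line-oriented MapReduce input treats every newline as a record boundary.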
Process, Step 3: Analyze
  Ran 16 different MRJob modules to calculate data for all the metrics and n-grams
    –Some locally
    –Some with EMR (with 3 m1.large clusters, 10 tasks at a time)
  Looped through comments to find contents for the best/worst comments in ~O(n) time
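The n-gram calculation can be illustrated with the map and reduce steps written as plain Python functions. This is not the project's MRJob code and does not use the mrjob library; the function names are hypothetical, and scoring an n-gram by the mean score of the comments containing it is an assumption, since the slides do not say how "highest scoring" was defined.

```python
from collections import defaultdict


def map_ngrams(score, text, n):
    """Map step: emit (n-gram, comment score) for every n-gram in a comment."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), score


def reduce_ngram(ngram, scores):
    """Reduce step: count occurrences and average the scores for one n-gram."""
    scores = list(scores)
    return ngram, len(scores), sum(scores) / len(scores)


def run_job(comments, n):
    """Tiny in-memory stand-in for the shuffle phase: group by key, then reduce.

    `comments` is an iterable of (score, text) pairs.
    """
    grouped = defaultdict(list)
    for score, text in comments:
        for ngram, value in map_ngrams(score, text, n):
            grouped[ngram].append(value)
    return [reduce_ngram(key, values) for key, values in sorted(grouped.items())]
```

For example, `run_job([(10, "I still use Winamp"), (4, "i still use winamp daily")], 4)` groups the shared 4-gram from both comments and reports its count and mean score alongside the 4-gram that appears only once.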
Protip: Don’t run mrjob locally on large amounts of data
Process, Step 4: Visualize Results
  Cleaned and processed final results in R
  Built interface to results with Shiny, a product from RStudio
Results: Calculated Results
  N-Grams
    1-Grams
    2-Grams
    3-Grams
    4-Grams
  Metrics
    Unique Authors
    Average Score per Submission
    Words per Comment
    Word Length
    Comments per Submission
    Gilded
  Time
    Comments per Day of Week
    Comments per Hour
    Comments per Week
  Tables
    Highest Voted Comments
    Lowest Voted Comments
    Most Gilded Comments
    Most Common Words
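The time-based results (comments per hour and per day of week) amount to bucketing comment timestamps. A minimal sketch, assuming each comment carries a Unix timestamp in seconds and that buckets are computed in UTC; the function name is hypothetical, and converting to another time zone such as EST is not shown.

```python
from collections import Counter
from datetime import datetime, timezone

# datetime.weekday() numbers days from Monday = 0.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def comments_per_hour_and_weekday(timestamps):
    """Count comments per hour of day and per weekday (both in UTC)."""
    per_hour = Counter()
    per_weekday = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        per_hour[moment.hour] += 1
        per_weekday[DAY_NAMES[moment.weekday()]] += 1
    return per_hour, per_weekday
```

Both counters can then be written out as small tables for the visualization step.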
Results: General Reddit Findings
  Reddit is growing fast
  Reddit is mostly US-based
    –Best time is 10am EST
  Reddit scoring roughly follows a power law
  AskReddit is special
  People Reddit least on Friday
  Lots of beautiful undiscovered inside jokes
  Tons of incredible material that never makes it big
  People like Unicode
  People like dialogue
  People don’t like racism or sexism
Results: Example Specific Findings
  The subreddit with the most Reddit Gold is r/AskReddit with 7,127. Second? r/IAMA with 2,504
  Top comment in r/DoesAnyoneElse: “This may be the first DAE where no one else does”
  The second most common word in r/NBA is “LeBron.” Third? “Player”
  r/Philosophy is in the 99th percentile in both word length and number of words per comment, but the 8th percentile in average score per comment
  The 7th highest scoring 4-gram in r/Apple is “Steve Jobs would have”
  r/Bitcoin is in the 95th percentile for Reddit gold, but the 27th in average score (You can buy Reddit gold with bitcoin.)
  r/Math really likes ‘walks into a bar’ jokes (‘the bartender says’, ‘orders a beer’, ‘mathematicians walk into’)
  The 9th highest scoring 4-gram in r/nostalgia is “I still use Winamp”
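The percentile figures above compare one subreddit's metric against all the others. One common way to compute such a rank is sketched below. The definition used here (percentage of subreddits with a strictly lower value) is an assumption, since the slides do not say which definition was used, and the function name is hypothetical.

```python
def percentile_rank(metric_by_subreddit, target):
    """Percentage of subreddits whose metric is strictly below the target's.

    `metric_by_subreddit` maps subreddit name to a numeric metric value.
    """
    target_value = metric_by_subreddit[target]
    below = sum(1 for value in metric_by_subreddit.values() if value < target_value)
    return 100.0 * below / len(metric_by_subreddit)
```

With this definition, a subreddit with the highest value among 275 would land just above the 99th percentile.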
Reflection
  What Worked Well
    Final product
    AWS
    PRAW
    Power of simple analyses
  What Didn’t Work Well
    Anything with downvotes
    Sentiment analysis / reading level
    Carriage returns!
    Shiny
Future Work
  Blog post
  Hosting somewhere / redoing interface
  Sharing on Reddit
  Deeper text interaction
  ElasticSearch
  Contact: