Published by Dylan Phelps. Modified over 8 years ago.
Data Science W205 Project Presentation
Building a Subreddit Profiler
Jason Goodman
Overview
Project Goals
Process
Results
Reflection
Future Work
Feel free to contact me: Jason.Ko.Goodman@gmail.com
The Project!
https://nosajshiny.shinyapps.io/subreddit_profiler/
- Or -
Open RStudio, then run:
install.packages("shiny"); require(shiny)
runGitHub('subreddit_profiler', 'NosajGithub')
Project Goals
Original Research Questions:
Which phrases are most associated with upvotes/downvotes?
Which subreddits are the most positive/negative per sentiment analysis?
Which subreddits use the most sophisticated language?
Final Research Questions:
How do different subreddits vary from one another?
–What do people in each subreddit like?
–What don’t they like?
–What do they tend to talk about?
–When do they use Reddit?
Process, Step 1: Extract
Scrape top 5,000 safe-for-work subreddits from redditlist.com using Beautiful Soup
Scrape Reddit with PRAW (the Python Reddit API Wrapper):
–Read in a subreddit
–Call up the top 1,000 submissions
–Grab the best 200 comments
–Store them on S3 in raw JSON with Boto
Run for ~2 weeks from a Screen session on an EC2 instance
30,984,017 comments from 275 subreddits
–9+ GB of text
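The "store them in raw JSON" step can be sketched as follows. This is a minimal illustration, not the author's scraper: it assumes comment objects expose attributes named like those on PRAW's comment model (`id`, `body`, `score`, `gilded`, `created_utc`, `author`, `subreddit`), and the record layout, the key layout, and the helper names `comment_to_record`, `s3_key_for`, and `serialize_submission` are all hypothetical. The PRAW and Boto network calls are left out so the sketch runs offline.

```python
import json
from types import SimpleNamespace


def comment_to_record(comment):
    """Flatten a comment-like object into a JSON-serializable dict."""
    author = getattr(comment, "author", None)
    return {
        "id": comment.id,
        "subreddit": str(comment.subreddit),
        # Assumption: a missing author (e.g. a deleted account) is stored as null.
        "author": str(author) if author is not None else None,
        "score": int(comment.score),
        "gilded": int(getattr(comment, "gilded", 0)),
        "created_utc": float(comment.created_utc),
        "body": comment.body,
    }


def s3_key_for(subreddit, submission_id):
    """Hypothetical key layout: one JSON document per submission."""
    return f"raw/{subreddit}/{submission_id}.json"


def serialize_submission(comments, limit=200):
    """Serialize at most `limit` comments as one raw JSON document."""
    records = [comment_to_record(c) for c in list(comments)[:limit]]
    return json.dumps(records, ensure_ascii=False)


# Stand-in for a comment fetched through an API wrapper (made-up values).
example = SimpleNamespace(
    id="c1", subreddit="nba", author=None, score=12,
    gilded=0, created_utc=1400000000.0, body="Caf\u00e9 talk",
)
payload = serialize_submission([example])
```

The resulting string is what would be uploaded under the key returned by `s3_key_for`; keeping the raw JSON untouched leaves room to change the downstream format later.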
Process, Step 2: Transform and Load
Converted JSON to an easier format for EMR
Single, space-delimited line
Dealt with Unicode and weird characters
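A rough sketch of this transform, under assumptions: the slides do not give the exact line format, so the field order used here (subreddit, score, then cleaned text) and the function names `clean_text` and `to_line` are hypothetical. The idea is to normalize Unicode, turn control characters such as carriage returns into spaces, and collapse each comment onto one line.

```python
import json
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Normalize Unicode, blank out control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Characters in Unicode categories starting with "C" (control, format, ...)
    # include \r, \n, \t and zero-width characters; replace them with spaces.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    return _WHITESPACE.sub(" ", text).strip()


def to_line(raw_json):
    """Turn one raw JSON comment into a single space-delimited line."""
    record = json.loads(raw_json)
    # Assumed field order: subreddit, score, then the cleaned comment text.
    fields = [str(record["subreddit"]), str(record["score"]), clean_text(record["body"])]
    return " ".join(fields)


raw = json.dumps({"subreddit": "nba", "score": 12, "body": "Great\r\n\tgame tonight"})
line = to_line(raw)
```

Because the first two fields never contain spaces, a downstream job can recover them with `line.split(" ", 2)`.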
Process, Step 3: Analyze
Ran 16 different MRJob modules to calculate data for all the metrics and n-grams
–Some locally
–Some with EMR (with 3 m1.large clusters, 10 tasks at a time)
Looped through comments to find contents for the best/worst comments in ~O(n) time
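The MRJob modules themselves are not shown in the slides. The sketch below imitates the mapper/reducer shape in plain Python so that it runs without mrjob or EMR. Treating a "high-scoring" n-gram as one with a high mean comment score is an assumption, and all function names are hypothetical. It also shows a single-pass scan for the best and worst comments.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def mapper(score, text, n):
    """Emit (ngram, score) once per distinct n-gram in a comment."""
    tokens = text.lower().split()
    for gram in set(ngrams(tokens, n)):
        yield gram, score


def reducer(pairs):
    """Aggregate (ngram, score) pairs into {ngram: (count, mean_score)}."""
    totals = defaultdict(lambda: [0, 0])
    for gram, score in pairs:
        totals[gram][0] += 1
        totals[gram][1] += score
    return {gram: (count, total / count) for gram, (count, total) in totals.items()}


def best_and_worst(comments):
    """Single pass over (score, text) pairs: O(n) time, O(1) extra space."""
    best = worst = None
    for score, text in comments:
        if best is None or score > best[0]:
            best = (score, text)
        if worst is None or score < worst[0]:
            worst = (score, text)
    return best, worst


# Made-up comments for illustration only.
comments = [
    (10, "I still use Winamp"),
    (4, "I still use Winamp every day"),
    (-3, "nobody uses that anymore"),
]
pairs = [pair for score, text in comments for pair in mapper(score, text, 4)]
stats = reducer(pairs)
best, worst = best_and_worst(comments)
```

In a real MapReduce job the framework groups the mapper output by key before the reduce step; here the reducer does that grouping itself with a dictionary.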
Protip: Don’t run mrjob locally on large amounts of data
Process, Step 4: Visualize Results
Cleaned and processed final results in R
Built interface to results with Shiny, a product from RStudio
Results: Calculated Results
N-Grams:
–1-Grams
–2-Grams
–3-Grams
–4-Grams
Metrics:
–Unique Authors
–Average Score per Submission
–Words per Comment
–Word Length
–Comments per Submission
–Gilded
Tables:
–Highest Voted Comments
–Lowest Voted Comments
–Most Gilded Comments
–Most Common Words
Time:
–Comments per Day of Week
–Comments per Hour
–Comments per Week
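The time-based counts could be computed along these lines. This is a sketch under the assumption that each comment carries a Unix timestamp in UTC (for example a `created_utc` field); the function name `time_tables` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def time_tables(timestamps):
    """Count comments per hour of day and per day of week, both in UTC."""
    per_hour = Counter()
    per_day = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        per_hour[moment.hour] += 1
        per_day[DAYS[moment.weekday()]] += 1
    return per_hour, per_day


# Made-up timestamps: midnight and 15:00 on 1970-01-01, midnight on 1970-01-02.
per_hour, per_day = time_tables([0, 54000, 86400])
```

Counts are kept in UTC here; reporting a local "best time" would need a time-zone conversion before bucketing.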
Results: General Reddit Findings
Reddit is growing fast
Reddit is mostly US-based
–Best time is 10am EST
Reddit scoring vaguely operates by the power law
AskReddit is special
People Reddit least on Friday
Lots of beautiful undiscovered inside jokes
Tons of incredible material that never makes it big
People like Unicode
People like dialogue
People don’t like racism or sexism
Results: Example Specific Findings
The subreddit with the most Reddit Gold is r/AskReddit with 7,127. Second? r/IAMA with 2,504
Top comment in r/DoesAnyoneElse: “This may be the first DAE where no one else does”
The second most common word in r/NBA is “LeBron.” Third? “Player”
r/Philosophy is in the 99th percentile in both word length and number of words per comment, but the 8th percentile in average score per comment
The 7th highest scoring 4-gram in r/Apple is “Steve Jobs would have”
r/Bitcoin is in the 95th percentile for Reddit Gold, but the 27th in average score (You can buy Reddit Gold with bitcoin.)
r/Math really likes ‘walks into a bar’ jokes (‘the bartender says’, ‘orders a beer’, ‘mathematicians walk into’)
The 9th highest scoring 4-gram in r/nostalgia is “I still use Winamp”
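The slides do not say how the percentiles were computed. One plausible reading, sketched here as an assumption, is a percentile rank: the share of other subreddits with a strictly lower value for a metric. The function name `percentile_ranks` and the sample values are made up for illustration and are not results from the project.

```python
def percentile_ranks(values):
    """Map each key to the percent of other entries with a strictly lower value."""
    n = len(values)
    ranks = {}
    for key, value in values.items():
        lower = sum(1 for other in values.values() if other < value)
        ranks[key] = 100.0 * lower / (n - 1) if n > 1 else 0.0
    return ranks


# Placeholder metric values (e.g. average word length), not real data.
word_length = {"sub_a": 3.9, "sub_b": 4.2, "sub_c": 4.4, "sub_d": 5.1, "sub_e": 4.0}
ranks = percentile_ranks(word_length)
```

The nested loop is quadratic in the number of subreddits, which is fine for a few hundred entries; sorting once would scale better for larger sets.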
Reflection
What Worked Well:
–Final product
–AWS
–PRAW
–Power of simple analyses
What Didn’t Work Well:
–Anything with downvotes
–Sentiment analysis / reading level
–Carriage returns!
–Shiny
Future Work
Blog post
Hosting somewhere / redoing interface
Sharing on Reddit
Deeper text interaction
ElasticSearch
Contact: Jason.Ko.Goodman@gmail.com