Data Science W205 Project Presentation Building a Subreddit Profiler Jason Goodman.

Presentation transcript:

Data Science W205 Project Presentation Building a Subreddit Profiler Jason Goodman

1 Overview
Project Goals
Process
Results
Reflection
Future Work
Feel free to contact me:

2 The Project!
- Or -
Open RStudio, then run:
install.packages("shiny"); require(shiny)
runGitHub('subreddit_profiler', 'NosajGithub')

3 Project Goals
Original Research Questions:
Which phrases are most associated with upvotes/downvotes?
Which subreddits are the most positive/negative per sentiment analysis?
Which subreddits use the most sophisticated language?
Final Research Questions:
How do different subreddits vary from one another?
– What do people in each subreddit like?
– What don’t they like?
– What do they tend to talk about?
– When do they use Reddit?

4 Process, Step 1: Extract
Scrape top 5,000 safe-for-work subreddits from redditlist.com using Beautiful Soup
Scrape Reddit with PRAW (the Python Reddit API Wrapper):
– Read in a subreddit
– Call up the top 1,000 submissions
– Grab the best 200 comments
– Store them on S3 in raw JSON with Boto
Run for ~2 weeks from a Screen session on an EC2 instance
30,984,017 comments from 275 subreddits
– 9+ GB of text
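The "grab the best comments, store them as raw JSON" step can be sketched roughly as follows. This is a minimal stand-alone illustration, not the project's code: PRAW and Boto are replaced by plain dicts so the block runs without credentials, "best" is approximated by sorting on score, and the field names and the key layout `raw/<subreddit>/<submission_id>.json` are hypothetical.

```python
import json


def best_comments(comments, limit=200):
    """Return up to `limit` comments, highest score first.

    Assumption: "best" is approximated here by score; Reddit's own
    "best" ordering may differ.
    """
    return sorted(comments, key=lambda c: c["score"], reverse=True)[:limit]


def to_raw_json_object(subreddit, submission_id, comments):
    """Build a (key, body) pair for one submission's comments.

    The key layout is hypothetical, chosen only for illustration.
    """
    key = f"raw/{subreddit}/{submission_id}.json"
    body = json.dumps(comments, ensure_ascii=False)
    return key, body


# Stand-in for an S3 bucket; real code would upload `body` with Boto.
fake_bucket = {}

# Made-up sample comments standing in for what PRAW would return.
sample = [
    {"id": "c1", "author": "alice", "score": 12, "body": "First!"},
    {"id": "c2", "author": "bob", "score": 340, "body": "Great point."},
    {"id": "c3", "author": "carol", "score": -4, "body": "Disagree."},
]

key, body = to_raw_json_object("AskReddit", "abc123", best_comments(sample, limit=2))
fake_bucket[key] = body
```

Keeping the stored object as untouched JSON means later transform steps can be rerun without scraping again.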

5 Process, Step 2: Transform and Load
JSON to an easier format for EMR
Single, space-delimited line
Unicode and weird characters
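One way to turn a JSON comment into a single space-delimited line is sketched below. The field order and field names are assumptions made for illustration; the presentation does not give the actual layout. The free-text body goes last so it may contain spaces, and every run of whitespace (including carriage returns and tabs) is collapsed so that one comment always occupies exactly one line.

```python
import re

# Hypothetical field order; the body must come last because it contains spaces.
FIELDS = ("subreddit", "author", "score", "created_utc", "body")


def comment_to_line(comment):
    """Flatten one comment dict into a single space-delimited line."""
    meta = [str(comment[name]) for name in FIELDS[:-1]]
    # Collapse all whitespace runs (\r, \n, \t, ...) into single spaces.
    body = re.sub(r"\s+", " ", comment["body"]).strip()
    return " ".join(meta + [body])


def line_to_comment(line):
    """Inverse of comment_to_line: split off the four metadata fields."""
    subreddit, author, score, created_utc, body = line.split(" ", 4)
    return {
        "subreddit": subreddit,
        "author": author,
        "score": int(score),
        "created_utc": int(created_utc),
        "body": body,
    }


# Made-up sample comment with a carriage return, a tab and a non-ASCII character.
sample = {
    "subreddit": "NBA",
    "author": "fan42",
    "score": 17,
    "created_utc": 1388534400,
    "body": "LeBron\r\nis a great\tplayer \u2603",
}
line = comment_to_line(sample)
parsed = line_to_comment(line)
```

Non-ASCII characters are left intact here; only whitespace is normalized.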

6 Process, Step 3: Analyze
Ran 16 different MRJob modules to calculate data for all the metrics and n-grams
– Some locally
– Some with EMR (with 3 m1.large clusters, 10 tasks at a time)
Looped through comments to find contents for the best/worst comments in ~O(n) time
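The shape of an n-gram job and of the single-pass best/worst search can be sketched in plain Python. This is not the mrjob code from the project: the mapper and reducer below are ordinary functions driven by a small local loop, the input format (`subreddit author score created_utc body`, one comment per line) is an assumption, and aggregating each n-gram as an occurrence count plus a total score is only one plausible choice.

```python
from collections import defaultdict


def ngrams(words, n):
    """Return all runs of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def mapper(line, n=2):
    """Emit ((subreddit, ngram), (1, score)) for each n-gram in one comment."""
    subreddit, _author, score, _created, body = line.split(" ", 4)
    for gram in ngrams(body.lower().split(), n):
        yield (subreddit, gram), (1, int(score))


def reducer(key, values):
    """Sum occurrence counts and scores for one (subreddit, ngram) key."""
    count = total = 0
    for c, s in values:
        count += c
        total += s
    return key, (count, total)


def run_local(lines, n=2):
    """Tiny local stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line, n):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())


def best_and_worst(lines):
    """Single pass: track the highest- and lowest-scoring comment per subreddit."""
    extremes = {}
    for line in lines:
        subreddit, _author, score, _created, body = line.split(" ", 4)
        score = int(score)
        best, worst = extremes.get(subreddit, ((score, body), (score, body)))
        if score > best[0]:
            best = (score, body)
        if score < worst[0]:
            worst = (score, body)
        extremes[subreddit] = (best, worst)
    return extremes


# Made-up sample lines in the assumed format.
lines = [
    "Math alice 10 1388534400 a mathematician walks into a bar",
    "Math bob 5 1388538000 an engineer walks into a bar",
]
bigrams = run_local(lines, n=2)
extremes = best_and_worst(lines)
```

Because `best_and_worst` keeps only two candidates per subreddit, it reads each comment once and never sorts the full data set.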

7 Protip: Don’t run mrjob locally on large amounts of data

8 Process, Step 4: Visualize Results
Cleaned and processed final results in R
Built interface to results with Shiny, a product from RStudio

9 Results: Calculated Results
N-Grams: 1-Grams, 2-Grams, 3-Grams, 4-Grams
Metrics: Unique Authors, Average Score per Submission, Words per Comment, Word Length, Comments per Submission, Gilded
Time: Comments per Day of Week, Comments per Hour, Comments per Week
Tables: Highest Voted Comments, Lowest Voted Comments, Most Gilded Comments, Most Common Words
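A few of these metrics are simple enough to sketch directly. The block below computes words per comment, average word length, and comment counts per hour and per day of week for a handful of made-up comments. Field names are assumptions, timestamps are interpreted as UTC (the presentation does not say which time zone was used for bucketing), and the sample data is invented for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def text_metrics(comments):
    """Words per comment and average word length over all comments."""
    word_lists = [c["body"].split() for c in comments]
    total_words = sum(len(words) for words in word_lists)
    total_chars = sum(len(w) for words in word_lists for w in words)
    return {
        "words_per_comment": total_words / len(comments),
        "word_length": total_chars / total_words,
    }


def time_tables(comments):
    """Count comments per hour of day and per day of week (UTC, an assumption)."""
    per_hour = Counter()
    per_day = Counter()
    for c in comments:
        when = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc)
        per_hour[when.hour] += 1
        per_day[DAY_NAMES[when.weekday()]] += 1
    return per_hour, per_day


# Made-up sample comments; 1388534400 is 2014-01-01 00:00:00 UTC.
comments = [
    {"created_utc": 1388534400, "body": "I still use Winamp"},
    {"created_utc": 1388570400, "body": "Steve Jobs would have"},
    {"created_utc": 1388743200, "body": "orders a beer"},
]
metrics = text_metrics(comments)
per_hour, per_day = time_tables(comments)
```

Splitting on whitespace is a crude tokenizer; punctuation stays attached to words, which slightly inflates word length.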

10 Results: General Reddit Findings
Reddit is growing fast
Reddit is mostly US-based
– Best time is 10am EST
Reddit scoring vaguely operates by the power law
AskReddit is special
People Reddit least on Friday
Lots of beautiful undiscovered inside jokes
Tons of incredible material that never makes it big
People like unicode
People like dialogue
People don’t like racism or sexism

11 Results: Example Specific Findings
The subreddit with the most Reddit Gold is r/AskReddit with 7,127. Second? r/IAMA with 2,504
Top comment in r/DoesAnyoneElse: “This may be the first DAE where no one else does”
The second most common word in r/NBA is “LeBron.” Third? “Player”
r/Philosophy is in the 99th percentile in both word length and number of words per comment, but the 8th percentile in average score per comment
The 7th highest scoring 4-gram in r/Apple is “Steve Jobs would have”
r/Bitcoin is in the 95th percentile for Reddit gold, but the 27th in average score (You can buy Reddit gold with bitcoin.)
r/Math really likes ‘walks into a bar’ jokes (‘the bartender says’, ‘orders a beer’, ‘mathematicians walk into’)
The 9th highest scoring 4-gram in r/nostalgia is “I still use Winamp”
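The percentile figures above compare one subreddit against all the others on a single metric. The presentation does not say how the percentiles were computed, so the sketch below uses one common definition as an assumption: the share of subreddits whose value is strictly lower. The subreddit names and values are placeholders, not project results.

```python
def percentile_rank(values, name):
    """Percent of entries whose value is strictly below that of `name`."""
    target = values[name]
    below = sum(1 for v in values.values() if v < target)
    return 100.0 * below / len(values)


# Placeholder average word lengths for five hypothetical subreddits.
word_length = {
    "sub_a": 4.1,
    "sub_b": 4.6,
    "sub_c": 5.3,
    "sub_d": 4.4,
    "sub_e": 4.9,
}

top = percentile_rank(word_length, "sub_c")
bottom = percentile_rank(word_length, "sub_a")
```

Under this definition the highest-ranked entry never reaches 100, and ties share the same rank.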

12 Reflection
What Worked Well:
Final product
AWS
PRAW
Power of simple analyses
What Didn’t Work Well:
Anything with downvotes
Sentiment analysis / reading level
Carriage returns!
Shiny

13 Future Work
Blog post
Hosting somewhere / redoing interface
Sharing on Reddit
Deeper text interaction
ElasticSearch
Contact: