CS 765 – Fall 2014 Paulo Alexandre Regis Reddit analysis
Outline: REVIEW, REDDIT API, DATA COLLECTION / CLEANING, NETWORK CREATION, TOOLS, CONCLUSION, Q&A
What is reddit? Reddit is an open-source platform that supports the interaction of communities. It has been used as a news hub, a Q&A platform, and a channel for propagating internet hoaxes and memes. Its main features include voting, posting, and commenting. It has a public API that allows data crawling. It has not been studied in depth.
The API: Limit of 30 requests per minute, with a maximum of 100 results per request, so 30 × 100 = 3000 results per minute. PRAW is an API wrapper that takes care of the API limits. The comment tree can be flattened by PRAW (not as described in the report).
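The throughput figure above can be checked with a short sketch. Only the two limits (30 requests per minute, 100 results per request) come from the slide; the helper name `crawl_minutes` and the item count in the demo are hypothetical, and the estimate assumes every request returns a full page, which is a best case.

```python
import math

REQUESTS_PER_MINUTE = 30   # API limit quoted on the slide
RESULTS_PER_REQUEST = 100  # maximum results per request quoted on the slide


def results_per_minute() -> int:
    """Upper bound on the number of results retrievable per minute."""
    return REQUESTS_PER_MINUTE * RESULTS_PER_REQUEST


def crawl_minutes(total_items: int) -> int:
    """Minimum whole minutes needed to fetch total_items (hypothetical helper).

    Assumes every request returns a full page of results (best case).
    """
    requests_needed = math.ceil(total_items / RESULTS_PER_REQUEST)
    return math.ceil(requests_needed / REQUESTS_PER_MINUTE)


if __name__ == "__main__":
    print(results_per_minute())
    # Illustrative item count only, not a figure from the collected data.
    print(crawl_minutes(1_000_000))
```

A lower bound like this is useful mainly for deciding early whether a full crawl fits the available time.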
Comment tree
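To illustrate what flattening a comment tree means, here is a minimal sketch that uses plain nested dictionaries rather than PRAW objects. The field names `id` and `replies` and the function `flatten_comments` are assumptions made for this example; PRAW's own helper is not reproduced here.

```python
from collections import deque
from typing import Dict, List


def flatten_comments(roots: List[Dict]) -> List[Dict]:
    """Return every comment in the forest as a flat list, breadth-first."""
    flat = []
    queue = deque(roots)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies are visited after all comments already waiting in the queue.
        queue.extend(comment.get("replies", []))
    return flat


if __name__ == "__main__":
    # Hypothetical thread: c1 has replies c2 and c3, c2 has reply c4.
    tree = [
        {"id": "c1", "replies": [
            {"id": "c2", "replies": [{"id": "c4", "replies": []}]},
            {"id": "c3", "replies": []},
        ]},
        {"id": "c5", "replies": []},
    ]
    print([c["id"] for c in flatten_comments(tree)])
```

Once flattened, the set of authors on a post can be read in a single pass, which is all the network construction needs.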
Data collection: Total subreddits (17 MB). Filtered subreddits. Posts (721 MB). Comments* (> 2 GB). Users (unknown). * Estimated, in progress.
Reddit stats
Data cleaning: Keep only subreddits with at least 300 subscribers. The crawl is not a snapshot (Reddit doesn’t stop!), which leads to repeated results. Anonymize users before the data is made available.
Data cleaning
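The cleaning steps can be sketched as follows. Only the 300-subscriber threshold comes from the slides. The record fields (`subscribers`, `id`), the choice to keep the first copy of a repeated result, and the salted SHA-256 anonymizer are all assumptions made for illustration, not the method actually used.

```python
import hashlib
from typing import Dict, Iterable, List

MIN_SUBSCRIBERS = 300  # threshold from the slide


def filter_subreddits(subreddits: Iterable[Dict]) -> List[Dict]:
    """Keep subreddits with at least MIN_SUBSCRIBERS subscribers."""
    return [s for s in subreddits if s["subscribers"] >= MIN_SUBSCRIBERS]


def deduplicate(items: Iterable[Dict]) -> List[Dict]:
    """Drop repeated results by id, keeping the first occurrence (assumption)."""
    seen = set()
    unique = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique


def anonymize(username: str, salt: str) -> str:
    """Map a username to a stable pseudonym (assumed scheme: salted SHA-256)."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return digest[:16]


if __name__ == "__main__":
    subs = [{"name": "a", "subscribers": 299}, {"name": "b", "subscribers": 300}]
    print([s["name"] for s in filter_subreddits(subs)])
    posts = [{"id": "p1"}, {"id": "p2"}, {"id": "p1"}]
    print([p["id"] for p in deduplicate(posts)])
    print(anonymize("some_user", salt="secret"))
```

Because the same salt always yields the same pseudonym, edges between users survive anonymization.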
Network creation: Nodes are users. An edge is created between two users when they comment on the same post. Examine how the network changes when a threshold is applied.
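A minimal sketch of this construction, assuming the input is a mapping from post id to the users who commented on it. The slide does not define the threshold; here it is interpreted as the minimum number of posts two users must share, which is an assumption. The name `build_edges` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple


def build_edges(
    post_commenters: Dict[str, Iterable[str]], threshold: int = 1
) -> Dict[Tuple[str, str], int]:
    """Return undirected edges weighted by the number of shared posts.

    An edge (u, v) is kept only if u and v commented on at least
    `threshold` common posts (assumed meaning of the threshold).
    """
    weights = Counter()
    for users in post_commenters.values():
        # set() ignores multiple comments by one user on the same post;
        # sorted() makes (u, v) and (v, u) the same key.
        for u, v in combinations(sorted(set(users)), 2):
            weights[(u, v)] += 1
    return {edge: w for edge, w in weights.items() if w >= threshold}


if __name__ == "__main__":
    # Hypothetical data: post id -> commenting users.
    posts = {"p1": ["a", "b", "c"], "p2": ["a", "b", "a"], "p3": ["c"]}
    print(build_edges(posts))
    print(build_edges(posts, threshold=2))
```

A post with k distinct commenters contributes k(k-1)/2 pairs, so very popular posts dominate the edge count; that is one reason to examine the network under a threshold.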
Tools: PRAW (did I mention this library is important?). Graph visualizers (Pajek, Gephi, igraph).
Analysis proposal: Calculate the node degree (number of links) in different scenarios. Compare the calculated degree with each user's “karma”. Compare the network with previous studies of other social networks. Is it power-law? Small-world?
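The proposed measurements can be sketched in plain Python. The slide does not say how degree and karma would be compared; a Pearson correlation is used here purely as an assumption, and all function names and sample values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of links per user in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return dict(degrees)


def degree_distribution(degrees: Dict[str, int]) -> Dict[int, int]:
    """Map each degree value to the number of users having it."""
    return dict(Counter(degrees.values()))


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
    degrees = node_degrees(edges)
    print(degrees)
    print(degree_distribution(degrees))
    karma = {"a": 900, "b": 300, "c": 250, "d": 10}  # made-up karma values
    users = sorted(degrees)
    print(pearson([degrees[u] for u in users], [karma[u] for u in users]))
```

For the power-law question, the degree distribution would typically be plotted on log-log axes, where a power law appears as an approximately straight line.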
Conclusion: Time constraint. Expected crawling time? More than 2 weeks just for comments. Plan B: analyze the data collected so far.
Questions?