Big Data: GitHub & Spark
Nathan H • Michael Y • Gabriel M • James W
A Brief History
2003 - Google File System paper released
2004 - MapReduce: Simplified Data Processing on Large Clusters
2006 - Hadoop is born
2014 - Spark becomes a top-level Apache project (started at UC Berkeley in 2009)
All About Git
Git: A distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.
GitHub
What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, Email, Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
(User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id=18297319))
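The repo-suggestion goal above can be illustrated with a small, self-contained sketch. It is not the project's Spark pipeline. It assumes a plain item-based collaborative filter over star events, scoring unseen repos by the Jaccard similarity between the sets of users who starred them. The `User` and `Repo` tuples mirror the sample record; the functions `build_index`, `jaccard`, and `recommend`, and every event except the first, are hypothetical.

```python
from collections import defaultdict, namedtuple

# Record shapes mirror the sample (User, Repo) pair shown on the slide.
User = namedtuple("User", ["name", "id"])
Repo = namedtuple("Repo", ["name", "id"])

def build_index(events):
    """Map each repo id to the set of user ids that starred it."""
    starred_by = defaultdict(set)
    for user, repo in events:
        starred_by[repo.id].add(user.id)
    return starred_by

def jaccard(a, b):
    """Similarity of two user sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user_id, events, top_n=3):
    """Rank repos the user has not starred by similarity to ones they have."""
    starred_by = build_index(events)
    seen = {rid for rid, users in starred_by.items() if user_id in users}
    scores = {}
    for candidate, users in starred_by.items():
        if candidate in seen:
            continue
        score = sum(jaccard(users, starred_by[s]) for s in seen)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by repo id for a stable order.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))[:top_n]

# Hypothetical star events; only the first pair comes from the slide.
jchristi = User("jchristi", 642929)
user_b = User("user_b", 2)
user_c = User("user_c", 3)
lsb = Repo("LinuxStandardBase/lsb", 18297319)
repo_x = Repo("example/x", 101)
repo_y = Repo("example/y", 102)
events = [
    (jchristi, lsb),
    (user_b, lsb),
    (user_b, repo_x),
    (user_c, repo_x),
    (user_c, repo_y),
]

print(recommend(jchristi.id, events))
```

At GitHub Archive scale, a per-user loop like this would not be practical; a distributed matrix-factorization approach such as the ALS implementation in Spark MLlib is the usual choice, though the slides do not say which algorithm was used.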
Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
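The difference between the two feedback types can be sketched in a few lines. Explicit feedback arrives as a rating the user chose. Implicit feedback has to be inferred from behavior, here by summing per-event weights into a score for each (user, repo) pair. The weights in `EVENT_WEIGHTS`, the function `implicit_scores`, and all sample data are assumptions for illustration, not values from the project.

```python
from collections import defaultdict

# Explicit feedback: the user states a rating directly (e.g. 1-5 stars).
explicit_ratings = {("user_a", "movie_1"): 4.0}

# Implicit feedback: no rating is given, so interactions stand in for one.
# Hypothetical weights; a real system would tune these.
EVENT_WEIGHTS = {"star": 1.0, "fork": 2.0, "commit": 3.0}

def implicit_scores(events):
    """Sum event weights per (user, repo) pair; unknown events count as 0."""
    scores = defaultdict(float)
    for user, repo, event_type in events:
        scores[(user, repo)] += EVENT_WEIGHTS.get(event_type, 0.0)
    return dict(scores)

events = [
    ("user_a", "repo_1", "star"),
    ("user_a", "repo_1", "fork"),
    ("user_b", "repo_1", "commit"),
    ("user_b", "repo_2", "watch"),  # not in EVENT_WEIGHTS, contributes 0
]
scores = implicit_scores(events)
print(scores)
```

GitHub activity is implicit feedback in this sense: a star or fork signals interest, but no user ever assigns a repo a numeric rating.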
Results http://github.com/nathanph/gh-recommender
Future Work
Improve Recommendation System: Commits, Tickets, Comments, Forks, Followings, etc.
Projects: Beehive Big Data Analytics
Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu
References
https://www.githubarchive.org
http://spark.apache.org
https://github.com/jupyter/docker-stacks