1
Big Data: GitHub & Spark
Nathan H • Michael Y • Gabriel M • James W
2
A Brief History
2003 - Google File System paper released
MapReduce: Simplified Data Processing on Large Clusters
Hadoop is born
Spark is born
3
All About Git
Git: Distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.
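A minimal sketch of turning recorded events into (user, repo) pairs. It assumes each record is one JSON object per line with `type`, `actor`, and `repo` fields, and that a star is reported as a `WatchEvent`. The sample lines and the `star_pairs` helper are hypothetical, not taken from the slides.

```python
import json

# Hypothetical sample records, shaped like newline-delimited event JSON.
SAMPLE_LINES = [
    '{"type": "WatchEvent", "actor": {"id": 1, "login": "alice"}, "repo": {"id": 10, "name": "octo/demo"}}',
    '{"type": "PushEvent", "actor": {"id": 2, "login": "bob"}, "repo": {"id": 10, "name": "octo/demo"}}',
    '{"type": "WatchEvent", "actor": {"id": 2, "login": "bob"}, "repo": {"id": 11, "name": "octo/tools"}}',
]


def star_pairs(lines):
    """Yield (user_login, repo_name) for every star (WatchEvent) record."""
    for line in lines:
        event = json.loads(line)
        # Assumption: stars appear with type "WatchEvent"; other types are skipped.
        if event.get("type") == "WatchEvent":
            yield event["actor"]["login"], event["repo"]["name"]


pairs = list(star_pairs(SAMPLE_LINES))
print(pairs)
```

The same loop could read a decompressed archive file line by line instead of the in-memory sample.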
4
GitHub
5
What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
(User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id= ))
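The slides name collaborative filtering and Spark but not the specific algorithm used. As an illustration of the idea only, here is a plain-Python sketch of user-based collaborative filtering over star data: users with overlapping stars are treated as similar (Jaccard similarity), and repos starred by similar users are suggested. This neighborhood method is swapped in for readability; it is not the team's Spark pipeline. The `STARS` data and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical star data: user -> set of starred repos.
STARS = {
    "alice": {"apache/spark", "jupyter/notebook", "docker/docker"},
    "bob": {"apache/spark", "jupyter/notebook"},
    "carol": {"apache/spark", "hashicorp/vagrant"},
    "dave": {"jupyter/notebook", "docker/docker"},
}


def jaccard(a, b):
    """Overlap between two sets of repos, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, stars, top_n=3):
    """Suggest repos the user has not starred, weighted by neighbor similarity."""
    mine = stars[user]
    scores = defaultdict(float)
    for other, theirs in stars.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim == 0.0:
            continue  # no shared stars, so this neighbor contributes nothing
        for repo in theirs - mine:
            scores[repo] += sim
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda r: (-scores[r], r))[:top_n]


print(recommend("bob", STARS))
```

On data at GitHub scale, a matrix-factorization approach such as the ALS implementation in Spark MLlib would be the more usual choice; the sketch above only shows the "similar users star similar repos" intuition.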
6
Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
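GitHub events are implicit feedback: nobody rates a repo, but starring or forking it signals interest. One common way to handle this, sketched below, is to collapse event counts into a single strength value, then derive a binary preference and a confidence that grows with that strength (confidence = 1 + alpha × strength). The event weights, the `alpha` default, and the function names are hypothetical choices for illustration, not values from the slides.

```python
# Hypothetical weights: how strongly each event type signals interest.
EVENT_WEIGHTS = {"star": 1.0, "fork": 2.0, "commit": 0.5}


def implicit_strength(event_counts, weights=EVENT_WEIGHTS):
    """Collapse per-type event counts into one interaction strength."""
    return sum(weights.get(kind, 0.0) * n for kind, n in event_counts.items())


def preference_and_confidence(strength, alpha=40.0):
    """Binary preference plus a confidence that grows with observed activity."""
    preference = 1 if strength > 0 else 0
    confidence = 1.0 + alpha * strength
    return preference, confidence


strength = implicit_strength({"star": 1, "commit": 4})
print(strength, preference_and_confidence(strength))
```

With explicit feedback, such as a star rating on a film, the rating itself would be used directly and no such conversion is needed.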
7
Results
8
Future Work
Improve Recommendation System: Commits, Tickets, Comments, Forks, Followings, etc.
Projects: Beehive Big Data Analytics
9
Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu
10
References
https://www.githubarchive.org
http://spark.apache.org