Big Data: GitHub & Spark

Nathan H • Michael Y • Gabriel M • James W

A Brief History
2003 - Google File System paper released
MapReduce: Simplified Data Processing on Large Clusters
Hadoop is born
Spark is born

All About Git
Git: A distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.

GitHub

What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
(User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id= ))
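To make the collaborative filtering idea concrete, here is a minimal sketch that suggests repos from (User, Repo) star pairs shaped like the record above. It is plain Python rather than Spark, and it uses a simple item-to-item cosine similarity, which is one common form of collaborative filtering and not necessarily the method used in this project. The users, repos, IDs, and the `recommend` function are all made up for illustration.

```python
from collections import defaultdict, namedtuple
from math import sqrt

# Same record shape as the (User, Repo) pair shown in the slides.
User = namedtuple("User", ["name", "id"])
Repo = namedtuple("Repo", ["name", "id"])


def recommend(stars, target, top_n=3):
    """Suggest repos for `target` from an iterable of (User, Repo) star pairs."""
    repos_by_user = defaultdict(set)
    users_by_repo = defaultdict(set)
    for user, repo in stars:
        repos_by_user[user].add(repo)
        users_by_repo[repo].add(user)

    seen = repos_by_user.get(target, set())
    scores = defaultdict(float)
    for candidate, fans in users_by_repo.items():
        if candidate in seen:
            continue  # never suggest a repo the user already starred
        for liked in seen:
            liked_fans = users_by_repo[liked]
            overlap = len(fans & liked_fans)
            if overlap:
                # Cosine similarity between two binary "who starred it" vectors.
                scores[candidate] += overlap / sqrt(len(fans) * len(liked_fans))

    # Highest score first; ties broken by repo name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0].name))
    return [repo for repo, _ in ranked[:top_n]]


# Hypothetical toy data: every name and ID below is invented.
alice, bob = User("alice", 1), User("bob", 2)
carol, dave = User("carol", 3), User("dave", 4)
spark, jupyter = Repo("example/spark-demo", 10), Repo("example/jupyter-demo", 11)
docker, vagrant = Repo("example/docker-demo", 12), Repo("example/vagrant-demo", 13)

stars = [
    (alice, spark), (alice, jupyter),
    (bob, spark), (bob, jupyter), (bob, docker),
    (carol, spark), (carol, vagrant),
    (dave, jupyter), (dave, docker),
]

for repo in recommend(stars, alice):
    print(repo.name)
```

In this toy data the docker repo ranks first for alice because it shares fans with both repos she already starred, while the vagrant repo shares a fan with only one. At GitHub Archive scale the same user-repo pairs would more likely feed a distributed matrix factorization model than a loop like this.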

Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
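Explicit feedback is a rating the user states directly, while implicit feedback has to be inferred from behavior. GitHub events such as stars and forks are of the implicit kind, so one plausible first step is to collapse raw events into a single preference score per user and repo. The sketch below does that; the event names, the weights in `EVENT_WEIGHTS`, and the sample events are assumptions chosen for illustration, not values from this project.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each event type hints at interest.
EVENT_WEIGHTS = {"star": 1.0, "fork": 2.0, "commit": 3.0}


def implicit_scores(events):
    """Sum event weights per (user, repo) pair, skipping unknown event types."""
    scores = defaultdict(float)
    for user, repo, event_type in events:
        weight = EVENT_WEIGHTS.get(event_type)
        if weight is None:
            continue  # an event type with no assigned weight carries no signal here
        scores[(user, repo)] += weight
    return dict(scores)


# Made-up events as (user, repo, event_type) triples.
events = [
    ("alice", "example/repo-a", "star"),
    ("alice", "example/repo-a", "fork"),
    ("bob", "example/repo-a", "star"),
    ("bob", "example/repo-b", "commit"),
    ("bob", "example/repo-b", "commit"),
    ("bob", "example/repo-b", "page_view"),
]

for (user, repo), score in sorted(implicit_scores(events).items()):
    print(user, repo, score)
```

The resulting scores can stand in for the ratings that an explicit-feedback system would collect directly, with the caveat that a missing score means "no observed activity" rather than "dislike".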

Results

Future Work
Improve Recommendation System: Commits, Tickets, Comments, Forks, Followings, etc.
Projects: Beehive Big Data Analytics

Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu

References
https://www.githubarchive.org
http://spark.apache.org
