Big Data: GitHub & Spark


Presentation on theme: "Big Data: GitHub & Spark"— Presentation transcript:

1 Big Data: GitHub & Spark
Nathan H • Michael Y • Gabriel M • James W

2 A Brief History
2003 - Google File System paper released
MapReduce: Simplified Data Processing on Large Clusters
Hadoop is born
Spark is born

3 All About Git
Git: Distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.
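As a rough illustration of working with recorded GitHub events, the sketch below tallies events by type from newline-delimited JSON. The sample records and the field layout ("type", "actor.login", "repo.name") are assumptions made for this example; real archive files may be laid out differently depending on the period they cover.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON object per line.
# The field names used here are assumed for illustration only.
SAMPLE_LINES = [
    '{"type": "WatchEvent", "actor": {"login": "alice"}, "repo": {"name": "apache/spark"}}',
    '{"type": "ForkEvent", "actor": {"login": "bob"}, "repo": {"name": "apache/spark"}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}, "repo": {"name": "jupyter/notebook"}}',
    'not valid json',
]


def count_event_types(lines):
    """Count events by their "type" field, skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines rather than abort the whole run
        if isinstance(event, dict) and "type" in event:
            counts[event["type"]] += 1
    return counts


event_counts = count_event_types(SAMPLE_LINES)
print(event_counts.most_common())
```

At larger scale the same per-line counting would likely be expressed as a distributed map and reduce step; the plain-Python version is only meant to show the shape of the data.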

4 GitHub

5 What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, , Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
(User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id= ))
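To make the idea of suggesting repos through collaborative filtering concrete, here is a minimal sketch of user-based filtering with overlap-weighted voting over (User, Repo) pairs like the one shown on the slide. It is not necessarily the method used in the presentation, and every user, repo, and id below is invented for illustration.

```python
from collections import Counter, defaultdict, namedtuple

# Mirrors the (User, Repo) pair shape from the slide; all values are hypothetical.
User = namedtuple("User", ["name", "id"])
Repo = namedtuple("Repo", ["name", "id"])

alice, bob, carol = User("alice", 1), User("bob", 2), User("carol", 3)
spark, hadoop = Repo("spark", 10), Repo("hadoop", 11)
jupyter, docker = Repo("jupyter", 12), Repo("docker", 13)

# Each pair means "this user starred this repo".
STARS = [
    (alice, spark), (alice, hadoop),
    (bob, spark), (bob, hadoop), (bob, jupyter),
    (carol, spark), (carol, docker),
]


def recommend(stars, target, top_n=3):
    """Suggest repos the target has not starred, scored by similar users.

    Every other user votes for the repos they starred, and each vote is
    weighted by how many starred repos that user shares with the target.
    """
    repos_by_user = defaultdict(set)
    for user, repo in stars:
        repos_by_user[user].add(repo)

    seen = repos_by_user.get(target, set())
    scores = Counter()
    for user, repos in repos_by_user.items():
        if user == target:
            continue
        overlap = len(seen & repos)
        if overlap == 0:
            continue  # users with nothing in common get no vote
        for repo in repos - seen:
            scores[repo] += overlap
    return [repo for repo, _ in scores.most_common(top_n)]


print([repo.name for repo in recommend(STARS, alice)])
```

In this toy data bob shares two repos with alice and carol shares one, so bob's extra repo outranks carol's. A user with no recorded stars gets an empty list, which is the usual cold-start limitation of this approach.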

6 Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
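The difference between explicit and implicit feedback can be sketched in a few lines. Explicit feedback is a rating the user states directly, while implicit feedback has to be inferred from behavior. One commonly used formulation for implicit data turns each interaction count r into a binary preference (1 if r > 0, else 0) and a confidence weight 1 + alpha * r. The data, the item names, and the alpha value below are assumptions for illustration, not figures from the presentation.

```python
def implicit_to_preference(interaction_counts, alpha=40.0):
    """Map raw interaction counts to (preference, confidence) pairs.

    preference is 1 if the user interacted with the item at all, else 0.
    confidence is 1 + alpha * count, so repeated interactions are trusted more.
    alpha is a tuning parameter; its default here is only a placeholder.
    """
    result = {}
    for key, count in interaction_counts.items():
        preference = 1 if count > 0 else 0
        confidence = 1.0 + alpha * count
        result[key] = (preference, confidence)
    return result


# Explicit feedback: the user states a rating directly (hypothetical 1-5 scale).
explicit_ratings = {("alice", "movie_a"): 5, ("bob", "movie_a"): 2}

# Implicit feedback: only counts of observed interactions (hypothetical).
implicit_counts = {("alice", "repo_x"): 3, ("bob", "repo_x"): 0}

weighted = implicit_to_preference(implicit_counts, alpha=2.0)
print(weighted)
```

Note that a count of zero is treated as a weak negative signal rather than a dislike, since the user may simply never have seen the item.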

7 Results

8 Future Work
Improve Recommendation System: Commits, Tickets, Comments, Forks, Followings, etc.
Projects: Beehive Big Data Analytics

9 Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu

10 References
https://www.githubarchive.org
http://spark.apache.org

