Big Data: GitHub & Spark


Big Data: GitHub & Spark
Nathan H • Michael Y • Gabriel M • James W

A Brief History
2003 - Google File System paper released
2004 - MapReduce: Simplified Data Processing on Large Clusters
2006 - Hadoop is born
2014 - Spark 1.0 released (the project began at UC Berkeley in 2009)

All About Git
Git: A distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.
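As a rough illustration of what working with recorded GitHub events can look like, the sketch below pulls (user, repo) star pairs out of newline-delimited JSON. It assumes the records follow the public GitHub event layout (a `type` field, plus `actor` and `repo` objects, with stars reported as `WatchEvent`). The sample records, the `star_pairs` helper, and the `User`/`Repo` tuples are hypothetical and are not taken from the presenters' code.

```python
import json
from collections import namedtuple

User = namedtuple("User", ["name", "id"])
Repo = namedtuple("Repo", ["name", "id"])

# Hypothetical sample events, one JSON object per line.
SAMPLE = "\n".join([
    json.dumps({"type": "WatchEvent",
                "actor": {"id": 1, "login": "alice"},
                "repo": {"id": 10, "name": "octo/demo"}}),
    json.dumps({"type": "PushEvent",
                "actor": {"id": 2, "login": "bob"},
                "repo": {"id": 10, "name": "octo/demo"}}),
    json.dumps({"type": "WatchEvent",
                "actor": {"id": 2, "login": "bob"},
                "repo": {"id": 11, "name": "octo/other"}}),
])


def star_pairs(lines):
    """Yield a (User, Repo) pair for every star event in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") != "WatchEvent":
            continue  # keep only star events
        actor, repo = event["actor"], event["repo"]
        yield User(actor["login"], actor["id"]), Repo(repo["name"], repo["id"])


pairs = list(star_pairs(SAMPLE.splitlines()))
for user, repo in pairs:
    print(user, repo)
```

The same generator could be fed lines from a decompressed archive file instead of the in-memory sample.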

GitHub

What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, Email, Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
(User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id=18297319))
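The slides do not say which collaborative filtering algorithm was used, so the following is only a minimal pure-Python sketch of one common variant: item-based filtering with cosine similarity over star data. It is not the Spark pipeline from the project. The `stars` data and the function names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical star data: user -> set of starred repos.
stars = {
    "alice": {"spark", "hadoop", "jupyter"},
    "bob": {"spark", "hadoop"},
    "carol": {"spark", "docker"},
    "dave": {"docker", "vagrant"},
}


def item_similarities(stars):
    """Cosine similarity between repos, each seen as a binary vector over users."""
    fans = defaultdict(set)  # repo -> users who starred it
    for user, repos in stars.items():
        for repo in repos:
            fans[repo].add(user)
    sims = defaultdict(dict)
    for a in fans:
        for b in fans:
            if a == b:
                continue
            shared = len(fans[a] & fans[b])
            if shared:
                sims[a][b] = shared / sqrt(len(fans[a]) * len(fans[b]))
    return sims


def recommend(user, stars, top_n=3):
    """Rank unseen repos by summed similarity to the repos the user starred."""
    sims = item_similarities(stars)
    seen = stars[user]
    scores = defaultdict(float)
    for repo in seen:
        for other, sim in sims[repo].items():
            if other not in seen:
                scores[other] += sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda r: (-scores[r], r))[:top_n]


print(recommend("bob", stars))
```

Here "bob" is steered toward "jupyter" first because it is starred by a user whose other stars overlap heavily with his.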

Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
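GitHub events are implicit feedback: nobody rates a repository directly, so interest has to be inferred from behavior. One widely used way to handle this is to turn raw interaction strength r into a binary preference (1 if r > 0) and a confidence of 1 + alpha * r. The sketch below applies that idea to a hypothetical event log. The event weights and the value of `ALPHA` are arbitrary assumptions chosen for illustration, not values from the project.

```python
from collections import Counter

# Hypothetical event log: (user, repo, event type).
events = [
    ("alice", "spark", "star"),
    ("alice", "spark", "commit"),
    ("alice", "spark", "commit"),
    ("bob", "spark", "star"),
    ("bob", "docker", "fork"),
]

# Illustrative weights only: how strongly each event type signals interest.
WEIGHTS = {"star": 1.0, "fork": 2.0, "commit": 3.0}
ALPHA = 0.5  # assumed scaling factor for confidence


def implicit_feedback(events, weights=WEIGHTS, alpha=ALPHA):
    """Map each (user, repo) pair to a binary preference and a confidence."""
    raw = Counter()
    for user, repo, kind in events:
        raw[(user, repo)] += weights.get(kind, 0.0)  # unknown events add nothing
    return {
        pair: {"preference": 1 if r > 0 else 0, "confidence": 1 + alpha * r}
        for pair, r in raw.items()
    }


feedback = implicit_feedback(events)
for pair, values in sorted(feedback.items()):
    print(pair, values)
```

With explicit feedback, such as star ratings, the rating itself would be used as the target and no confidence transform would be needed.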

Results http://github.com/nathanph/gh-recommender

Future Work
Improve Recommendation System: Commits, Tickets, Comments, Forks, Followings, etc.
Projects: Beehive Big Data Analytics

Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu

References
https://www.githubarchive.org
http://spark.apache.org
https://github.com/jupyter/docker-stacks