Big Data: GitHub & Spark

Big Data: GitHub & Spark
Nathan H • Michael Y • Gabriel M • James W

A Brief History
2003 - Google File System paper released
2004 - MapReduce: Simplified Data Processing on Large Clusters
2006 - Hadoop is born
2014 - Spark 1.0 is released (the project began at UC Berkeley in 2009)

All About Git
Git: Distributed version control system for managing source code and working with teams.
GitHub: A platform built around Git for sharing open source projects.
GitHub Archive: An open source project that aims to record GitHub events through time.

GitHub

What insights can we glean?
Data Events: Stars, Commits, Comments, Tickets, Forks, etc.
User Data: Repos, Name, ID, Email, Bio, Orgs, Followers, etc.
Analysis Goal: Can we suggest repos?
How: Machine Learning - Collaborative Filtering
Tooling: Vagrant, Docker, Spark, Jupyter
Sample record: (User(name='jchristi', id=642929), Repo(name='LinuxStandardBase/lsb', id=18297319))
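The sample record on this slide pairs a user with a repository. As a minimal sketch of how such pairs might be pulled out of GitHub Archive data, the Python below filters star events from JSON lines. It assumes each line follows the GitHub Events API layout (`type`, `actor.login`, `actor.id`, `repo.name`, `repo.id`), where a star is recorded as a `WatchEvent`. The `star_pairs` helper and the sample lines are hypothetical and are not taken from the presenters' project.

```python
import json
from collections import namedtuple

# Record shapes matching the sample shown on the slide.
User = namedtuple("User", ["name", "id"])
Repo = namedtuple("Repo", ["name", "id"])


def star_pairs(lines):
    """Yield a (User, Repo) pair for every star event in an iterable of JSON lines.

    Assumption: events use the GitHub Events API field names, and a star
    appears with type "WatchEvent".
    """
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "WatchEvent":
            continue  # skip pushes, forks, comments, and other event types
        actor, repo = event["actor"], event["repo"]
        yield (User(actor["login"], actor["id"]), Repo(repo["name"], repo["id"]))


# Hypothetical input: one star event built from the slide's sample values,
# plus one unrelated event that should be filtered out.
sample_lines = [
    json.dumps({
        "type": "WatchEvent",
        "actor": {"login": "jchristi", "id": 642929},
        "repo": {"name": "LinuxStandardBase/lsb", "id": 18297319},
    }),
    json.dumps({
        "type": "PushEvent",
        "actor": {"login": "example-user", "id": 1},
        "repo": {"name": "example/repo", "id": 2},
    }),
]

pairs = list(star_pairs(sample_lines))
print(pairs)
```

Only the star event survives the filter, so `pairs` holds a single (User, Repo) tuple in the same form as the slide's sample.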

Recommendation Engines
“A recommendation engine is a feature that filters items by predicting how a user might rate them.”
Explicit Feedback: Netflix
Implicit Feedback: Amazon, Facebook
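Stars are implicit feedback: a user never rates a repository, but starring it signals interest. The sketch below illustrates collaborative filtering on that kind of signal using simple item co-occurrence counts in plain Python. This is a stand-in chosen to keep the example self-contained; it is not the Spark-based pipeline the presenters built, and the transcript does not say which algorithm they used. The `recommend` function, the user names, and the star sets are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def recommend(stars, user, top_n=3):
    """Suggest repos for `user` from co-occurrence counts of starred repos.

    stars: dict mapping a user name to the set of repos that user starred.
    Repos starred together by many users are treated as related.
    """
    # co[a][b] = number of users who starred both repo a and repo b
    co = defaultdict(Counter)
    for repos in stars.values():
        for a, b in permutations(sorted(repos), 2):
            co[a][b] += 1

    seen = stars.get(user, set())
    scores = Counter()
    for repo in seen:
        for other, count in co[repo].items():
            if other not in seen:  # never recommend what is already starred
                scores[other] += count

    # Highest score first; ties broken by name so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [repo for repo, _ in ranked[:top_n]]


# Hypothetical star data for illustration only.
stars = {
    "alice": {"apache/spark", "jupyter/notebook"},
    "bob": {"apache/spark", "jupyter/notebook", "docker/compose"},
    "carol": {"apache/spark", "hashicorp/vagrant"},
    "dave": {"apache/spark"},
}

print(recommend(stars, "dave"))
```

For data at GitHub Archive scale, a distributed matrix factorization method would be the more usual choice; the counting approach here only shows the underlying idea that users with overlapping stars inform each other's suggestions.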

Results
http://github.com/nathanph/gh-recommender

Future Work
Improve Recommendation System: incorporate more signals (Commits, Tickets, Comments, Forks, Followings, etc.)
Projects: Beehive Big Data Analytics

Acknowledgements
Team
NSF S-STEM
Dr. Tashakkori
GitHub - Vagrant - Docker - Spark - Jupyter - Ubuntu

References
https://www.githubarchive.org
http://spark.apache.org
https://github.com/jupyter/docker-stacks