The Student’s Guide to Apache Spark

Presentation transcript:

The Student’s Guide to Apache Spark
Xiurong Lin, Ryan Borowicz, Jayanti Trivedi, Abhishek Devarakonda
12/14/16

Agenda
  Project Background
  Spark ML
  SparkR
  Spark Plotly Visualization
  Lessons Learned

Project Background
Initial Plan
  Spark Streaming via meetup API
Modified Plan
  Overall Spark tutorial with focus on modules not extensively covered in class
  Utilizing different datasets depending on the task (meetup included)
Final Product
  Tutorial covering all of the different Spark modules
  Working implementations of Spark SQL, Spark ML, and SparkR
  Also tested Spark Streaming

Spark ML

Spark ML Overview
Benefits
  Ability to use a single platform for big data problems
  Growing user community and documentation
Limitations
  Limited set of algorithms
  Lacking in certain features
    No cost-sensitive modeling
    Lack of Python support for dimension reduction
Comparison to RapidMiner and scikit-learn
  Spark ML has a familiar interface for users of scikit-learn
  Found the pipeline structure to be more intuitive in Spark ML
  Spark ML does not offer all of the functionality of scikit-learn

Spark ML Pipeline
Stages: Load, Transformer, Estimator, Pipeline, Evaluator
Load
  Load Data
  Convert to DataFrame
  Normalize
Transformers
  Feature Selection
  Dimension Reduction (PCA)
  Vector Assembler
  Text Processing (Tokenizer, StopWordsRemover)
Estimator
  Classification
  Regression
  Clustering
  Collaborative Filtering
  Tree Ensembles
Pipeline
  Parameter Grid Tuning
  Cross-Validation
Evaluator
  Metrics
  Visuals
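The stage order on this slide (transformers feeding an estimator, chained into a pipeline and scored by an evaluator) can be sketched in plain Python. This only illustrates the pattern and is not Spark's API: the classes Scale, ThresholdEstimator, ThresholdModel, Pipeline and PipelineModel, the accuracy function, and the four-row dataset are all hypothetical stand-ins.

```python
from statistics import mean


class Scale:
    """Transformer: multiplies one column by a fixed factor."""
    def __init__(self, col, factor):
        self.col, self.factor = col, factor

    def transform(self, rows):
        return [{**r, self.col: r[self.col] * self.factor} for r in rows]


class ThresholdModel:
    """Fitted model; also a transformer that adds a 'prediction' column."""
    def __init__(self, col, threshold):
        self.col, self.threshold = col, threshold

    def transform(self, rows):
        return [{**r, "prediction": 1 if r[self.col] >= self.threshold else 0}
                for r in rows]


class ThresholdEstimator:
    """Estimator: learns a cut-off halfway between the two class means."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        pos = mean(r[self.col] for r in rows if r["label"] == 1)
        neg = mean(r[self.col] for r in rows if r["label"] == 0)
        return ThresholdModel(self.col, (pos + neg) / 2)


class PipelineModel:
    """Result of fitting a pipeline: every stage is now a transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; estimators are fitted, then applied."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # pass transformed rows onward
            fitted.append(stage)
        return PipelineModel(fitted)


def accuracy(rows):
    """Evaluator: fraction of rows where prediction equals label."""
    return mean(1 if r["prediction"] == r["label"] else 0 for r in rows)


# Made-up data: one feature "x" and a binary label.
data = [{"x": 1.0, "label": 0}, {"x": 2.0, "label": 0},
        {"x": 8.0, "label": 1}, {"x": 9.0, "label": 1}]

model = Pipeline([Scale("x", 0.1), ThresholdEstimator("x")]).fit(data)
scored = model.transform(data)
score = accuracy(scored)
print(score)
```

The point of the design is that fitting the pipeline returns an object that only transforms, so the same fitted stages can be reapplied to new rows.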

Patient Classification Demo – Logistic Regression
  Load and Convert
  Transform

Patient Classification Demo – Logistic Regression
  Estimate
  Pipeline
  Evaluate
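The demo code itself is not part of this transcript. As a rough idea of what the estimate and evaluate steps involve, here is a minimal plain-Python logistic regression trained by gradient descent. It is a sketch under stated assumptions, not the authors' Spark ML code: the single feature, the made-up patient values, the learning rate and the function names fit_logistic and predict are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=200):
    """Estimate: fit weight w and bias b by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current predicted probability for each row.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict(w, b, x):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Made-up, already-normalized feature values with binary labels.
train_x = [-2.0, -1.0, 1.0, 2.0]
train_y = [0, 0, 1, 1]
w, b = fit_logistic(train_x, train_y)

# Evaluate on held-out rows (also made up).
test_x = [-1.5, 1.5]
test_y = [0, 1]
preds = [predict(w, b, x) for x in test_x]
acc = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(preds, acc)
```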

Meetup Topic Model
  Load and Convert
  Transform

Meetup Topic Model
  Transform
  Write File
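The transform step for text typically means tokenizing, dropping stop words and turning each document into term counts before a topic model sees it. The sketch below shows those three steps in plain Python. It does not use Spark's Tokenizer or StopWordsRemover, and the stop-word list, the two sample descriptions and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"and", "the", "for", "a", "of"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def count_vectors(docs):
    """Build a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


# Made-up group descriptions standing in for the meetup data.
descriptions = ["Learn Spark and Python",
                "The Spark meetup for Python fans"]

docs = [remove_stop_words(tokenize(d)) for d in descriptions]
vocab, vectors = count_vectors(docs)
print(vocab)
print(vectors)
```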

SparkR

SparkR Overview
Benefits
  Performance improvements
  Familiarity for R users
Limitations
  Integration with Spark ML is still in progress
  Currently includes a small subset of overall R functionality and libraries
Comparison to R
  Dramatic speed improvements on large datasets
  Similar interface working off Spark DataFrames

SparkR Meetup Demo – Load & Visualize

SparkR Meetup Demo – Clustering

SparkR Meetup Demo – Regression
  Train & Fit
  Evaluate
  Predict
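The three steps named on this slide follow the usual fit, score, predict flow. The transcript does not say which regression model the SparkR demo used, so the sketch below assumes a simple least-squares line and is written in plain Python rather than R. The data and the function names fit_line, predict and rmse are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Train & fit: closed-form least squares for y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def rmse(model, xs, ys):
    """Evaluate: root mean squared error on the given rows."""
    squared = [(predict(model, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up training and held-out data.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
test_x, test_y = [5.0, 6.0], [11.0, 13.0]

model = fit_line(train_x, train_y)      # train & fit
error = rmse(model, test_x, test_y)     # evaluate
forecast = predict(model, 7.0)          # predict
print(model, error, forecast)
```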

Spark Plotly Visualization

Plotly Visualization
Benefits
  Interactive graphs can be created inside an IPython notebook
  Plots can be hosted and shared easily
Sign up on the Plotly website
  An API key will be generated
  Connect with pyspark
Plotted a histogram using the Wisconsin Breast Cancer dataset from the UCI public datasets

Line and Scatter Plots
  Scatter Plots
  Line Graphs

Shareable and editable from anywhere
  Graphs can be saved on the Plotly website and edited directly from there
  Graphs can also be shared to multiple platforms
  Plots can be edited collaboratively

Lessons Learned
  Spark can address a variety of use cases
  Increasingly integrated with existing products (Python, R, etc.)
  Web resources are limited because Spark is a new product, which is an opportunity for students
  Broad topics with lack of clear definition
  Ensure technical infrastructure is in place prior to project start
  Merging code from different sources without proper tracking

Questions?