Working with Spark With Focus on Lab3

Objectives
Learn to:
- Explore various features of the Spark distribution
- Apply the Spark APIs to solve problems
- Run simple examples provided in the Spark distribution
- Use the Spark ML libraries
- Organize your toolbox for data analytics

Introduction
Apache Spark™ is a unified analytics engine for large-scale data processing. Its hallmarks:
- Speed
- Ease of use
- Generality
- Runs anywhere
- Community contribution

References
https://spark.apache.org/
https://spark.apache.org/examples.html
https://databricks.com/spark/about

Explore the Spark Distribution
- Look at the various features
- Explore the various folders (let's do that): README.md, data, examples
- Find out how to submit/run the examples and provision data for the execution
- Run a couple of simple examples

Spark Features and Folders

Scala API
spark-shell --master local[*]
sc
sc.{tab}  // tab-completion lists the available methods
val rdd = sc.parallelize(Array(1,2,2,4), 4)  // creates an RDD with 4 partitions
rdd.count()            // action
rdd.collect()          // action
rdd.saveAsTextFile(…)  // action
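For lab work in Python, roughly the same session in the pyspark shell looks like this (a sketch; the map step is added here to show a genuine lazy transformation and is not on the slide):

# pyspark --master local[*]   (sc is predefined in the shell)
rdd = sc.parallelize([1, 2, 2, 4], 4)   # creates an RDD with 4 partitions
doubled = rdd.map(lambda x: x * 2)      # transformation (lazy; nothing runs yet)
doubled.count()                         # action: runs the job, returns 4
doubled.collect()                       # action: returns [2, 4, 4, 8]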

Running the examples
./bin/run-example <name of the example>
./bin/run-example SparkPi
./bin/run-example ml.KMeansExample

Submitting Examples
./bin/spark-submit lab3/src/wordcount.py LICENSE
./bin/spark-submit /home/hadoop/spark/examples/src/main/python/wordcount.py LICENSE
(Thought process: Hmm... I liked wordcount.py; let me move it to my workspace.)
- I created a folder lab3
- In it, data and src folders; I moved wordcount.py into the src folder (like cloning off git)
- Now run the same program from your workspace and make sure there are no dependencies on the old location:
./bin/spark-submit lab3/src/wordcount.py LICENSE
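For orientation, here is a minimal PySpark word count in the spirit of the distribution's wordcount.py (a sketch, not a verbatim copy of that file):

import sys
from operator import add
from pyspark.sql import SparkSession

# Usage: spark-submit wordcount.py <file>
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(add)          # sum the 1s per distinct word

    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    spark.stop()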

Collecting programs for your toolbox
- We have gathered wordcount.py
- You may want to explore and move some other ml and mllib programs into your src folder
- Any data needed can be moved into the lab3/data folder

Program with Parameters
./bin/spark-submit examples/src/main/python/pagerank.py data/mllib/pagerank_data.txt 10
Let's run the pagerank example from the given setting of the distribution:
- Move pagerank.py to the lab3/src folder
- Move pagerank_data.txt to the lab3/data folder
- Now execute the same submit command in your workspace (Thought process: make sure it is repeatable in your environment!):
./bin/spark-submit lab3/src/pagerank.py lab3/data/pagerank_data.txt 10
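The heart of that program is a small iterative RDD pipeline. The sketch below follows the general shape of the distribution's pagerank.py, but it is a sketch rather than a verbatim copy:

import sys
from operator import add
from pyspark.sql import SparkSession

# Usage: spark-submit pagerank.py <links-file> <iterations>
if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPageRank").getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    # "src dst" pairs -> adjacency lists; distinct() drops duplicate links
    links = lines.map(lambda l: tuple(l.split()[:2])).distinct().groupByKey().cache()
    ranks = links.map(lambda pair: (pair[0], 1.0))   # every node starts at rank 1

    for _ in range(int(sys.argv[2])):
        # each node splits its rank evenly among its outgoing links
        contribs = links.join(ranks).flatMap(
            lambda row: [(dst, row[1][1] / len(row[1][0])) for dst in row[1][0]])
        # damped update: 15% baseline, 85% from incoming contributions
        ranks = contribs.reduceByKey(add).mapValues(lambda s: 0.15 + 0.85 * s)

    for node, rank in ranks.collect():
        print("%s has rank: %s." % (node, rank))
    spark.stop()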

Summarizing: Examples
The Spark distribution provides a number of examples, along with data files, that can be readily used. This supports the functions of a data scientist and allows her/him to focus on problem solving. You can use these solutions, update them to fit your own names and paths, and "repurpose" them for your own analyses.
1. Understand the data representation and the algorithm
2. Execute the program (a solved spark/ml/mllib program in Java, Python, Scala, or R), supplying the program name, data files, and parameters
3. Interpret the output, apply it, visualize it
4. Document & communicate

Step 2 Done; How About Steps 1 and 3?
- Step 1: How do we create input data? In what format? Transform real-world objects into data in the format needed by the standard ml/mllib programs.
- Step 3: Interpret the output data from the ml/mllib programs.
Let's explore these steps with the pagerank program given in the examples of the Spark distribution.

Real-world Object: a Network of Links
(Figure: a graph of nodes 1, 2, 3, 4 and the links among them.)

How to represent the data
Look at pagerank_data.txt (each line is one link, "source destination"):
1 2
1 2
1 3
1 4
2 1
3 1
4 1
One more piece of info: each node is assigned an initial weight/rank of 1.
Which node do you think is authoritative? (Thought process: have an expected answer!)

Analyze the outcome
4 has rank: 0.753997565294. Is that what you expected?
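To check that number, here is a small pure-Python hand trace of the damped update the example applies (damping factor 0.85, 10 iterations; the duplicate "1 2" link is dropped, as the example's distinct() call would do). A sketch for intuition, not the Spark code:

links = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}   # adjacency from pagerank_data.txt
ranks = {n: 1.0 for n in links}                  # every node starts with rank 1

for _ in range(10):                              # the "10" passed to spark-submit
    contribs = {n: 0.0 for n in links}
    for src, targets in links.items():
        for t in targets:                        # a node splits its rank evenly
            contribs[t] += ranks[src] / len(targets)
    ranks = {n: 0.15 + 0.85 * c for n, c in contribs.items()}

print(ranks[4])                                  # ≈ 0.753997565294, as on the slide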

From the last lectures: recall, we hand-traced it!

One more example: String tokenizer
./bin/spark-submit examples/src/main/python/ml/tokenizer_example.py
(Thought process: Before you run it, find out: what is your input, and what is your expected output?)
Input: a hard-coded DataFrame:
[(0, "Hi I heard about Spark"),
 (1, "I wish Java could use case classes"),
 (2, "Logistic,regression,models,are,neat")], ["id", "sentence"]
Question: Can you read your own input files into the DataFrame and reuse the code? (See the sketch below.)
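One way to do that, sketched under the assumption of a hypothetical input file lab3/data/sentences.txt with one sentence per line:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("TokenizerFromFile").getOrCreate()

# Build the same (id, sentence) DataFrame from a file instead of a hard-coded list.
# "lab3/data/sentences.txt" is a hypothetical path.
lines = spark.sparkContext.textFile("lab3/data/sentences.txt")
sentences = lines.zipWithIndex().map(lambda pair: (pair[1], pair[0])) \
                 .toDF(["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer.transform(sentences).show(truncate=False)
spark.stop()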

Tokenizer: Output
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+
(Thought process: I like the tokenizer; I can use it for my lab3. Let me copy it into the lab3 workspace.)

Summary
- I could go on with many more examples…
- I have shown you how to use the ml, mllib, and other Spark code.
- What is the advantage over simple libraries / single-node / non-Spark implementations? You can instantly scale your application to larger data sets and to a large Spark cluster!
- Make sure you attribute all the programs you used!
- Don't miss this fantastic opportunity to learn!