Running Apache Spark on HPC clusters

Presentation transcript:

Running Apache Spark on HPC clusters

Shameema Oottikkal, Data Application Engineer
Ohio Supercomputer Center, email: soottikkal@osc.edu
11/10/2016, TDA Data Analytics Month

Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it is maintained today. In contrast to Hadoop's disk-based MapReduce paradigm, Spark performs multi-stage analytics in memory.
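As a minimal illustration of the in-memory model (not from the original slides; the input path is hypothetical and a SparkContext sc is assumed, e.g. from the PySpark shell shown later):

# Cache a filtered RDD so repeated actions are served from executor memory
# instead of re-reading and re-filtering the input from disk each time.
logs = sc.textFile("hdfs:///data/logs.txt")                 # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line).cache()
print(errors.count())   # first action reads from disk and populates the cache
print(errors.count())   # second action reuses the cached partitions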

Spark ecosystem

Spark workflow

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Spark requires a cluster manager, which allocates resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run.
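The sketch below is not from the original slides; it is a minimal driver program illustrating the workflow just described. The application name and standalone master URL are placeholders (in the PBS example later in this presentation, pbs-spark-submit starts the master and sets SPARK_MASTER for you).

from pyspark import SparkConf, SparkContext

# Driver program: creating the SparkContext connects to the cluster manager
# named in the master URL and acquires executors for this application.
conf = (SparkConf()
        .setAppName("WorkflowDemo")                  # placeholder application name
        .setMaster("spark://master-host:7077"))      # placeholder standalone master URL
sc = SparkContext(conf=conf)

# The SparkContext ships tasks derived from this code to the executors.
print(sc.parallelize(range(1000)).sum())

sc.stop()   # release the executors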

RDD - Resilient Distributed Datasets

An RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects: each RDD is divided into multiple partitions, and each partition can reside in memory or on disk on different machines in a cluster. RDDs are immutable (read-only) data structures; you cannot change the original RDD, but you can always transform it into a different RDD with the changes you want. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.

RDD - Transformations and Actions

Transformations are executed on demand, i.e. they are computed lazily. E.g. filter, join, sort. Actions return the final results of RDD computations: an action triggers execution using the lineage graph to load the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system. E.g. collect(), count(), take().
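A short sketch of this lazy-evaluation behavior (not from the original slides; it assumes a SparkContext sc, e.g. from the PySpark shell shown on the next slide):

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations only record lineage; nothing is computed yet.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action walks the lineage graph, carries out the intermediate
# transformations, and returns the result to the driver program.
print(evens.collect())   # [4, 16]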

Interactive Analysis with the Spark Shell

./bin/pyspark   # opens the interactive PySpark shell with a SparkContext available as sc

1. Create an RDD
>>> textFile = sc.textFile("README.md")

2. Transformation of an RDD
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

3. Action on an RDD
>>> linesWithSpark.count()   # number of items in this RDD
126

4. Combining transformations and actions
>>> textFile.filter(lambda line: "Spark" in line).count()   # how many lines contain "Spark"?
15

Word count example
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> wordCounts.collect()
[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
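As a follow-up sketch (not on the original slide), the word counts can be ranked by swapping key and value and sorting on the count:

>>> swapped = wordCounts.map(lambda kv: (kv[1], kv[0]))   # (count, word) pairs
>>> swapped.sortByKey(ascending=False).take(5)            # five most frequent words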

RDD Basics

import urllib
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
normal_count = normal_raw_data.count()
print("There are %d 'normal' interactions" % normal_count)

# predefined map function: key each interaction by its tag (column 41)
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)
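A short sketch of using the keyed RDD above (not on the original slide): countByKey() is an action that returns the number of interactions per tag to the driver.

tag_counts = key_csv_data.countByKey()
for tag, count in sorted(tag_counts.items()):
    print("%s: %d" % (tag, count))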

Running Spark using a PBS script

1. Create an app in Python: stati.py

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics
from math import sqrt
import urllib
import numpy as np

f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")
data_file = "./kddcup.data.gz"

sc = SparkContext(appName="Stati")
raw_data = sc.textFile(data_file)

def parse_interaction(line):
    line_split = line.split(",")
    symbolic_indexes = [1, 2, 3, 41]   # drop the non-numeric columns
    clean_line_split = [item for i, item in enumerate(line_split) if i not in symbolic_indexes]
    return np.array([float(x) for x in clean_line_split])

vector_data = raw_data.map(parse_interaction)

summary = Statistics.colStats(vector_data)
print("Duration Statistics:")
print(" Mean: %f" % round(summary.mean()[0], 3))
print(" St. deviation: %f" % round(sqrt(summary.variance()[0]), 3))
print(" Max value: %f" % round(summary.max()[0], 3))
print(" Min value: %f" % round(summary.min()[0], 3))
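One might extend stati.py with a column-wise correlation, also available from pyspark.mllib.stat (a sketch, not on the original slide; which pair of columns is interesting depends on the analysis):

# Correlation matrix over the same numeric columns used for colStats.
corr_matrix = Statistics.corr(vector_data, method="pearson")
print(corr_matrix[0][1])   # correlation between the first two numeric columns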

2. Create a PBS script: stati.pbs

#PBS -N spark-statistics
#PBS -l nodes=18:ppn=28
#PBS -l walltime=00:10:00

module load spark/2.0.0
cd $TMPDIR
pbs-spark-submit stati.py > stati.log

3. Run the Spark job

qsub stati.pbs

4. Output: stati.log

sync from spark://n0381.ten.osc.edu:7077
starting org.apache.spark.deploy.master.Master, logging to /nfs/15/soottikkal/spark/kdd/spark-soottikkal-org.apache.spark.deploy.master.Master-1-n0381.ten.osc.edu.out
failed to launch org.apache.spark.deploy.master.Master: full log in /nfs/15/soottikkal/spark/kdd/spark-soottikkal-org.apache.spark.deploy.master.Master-1-n0381.ten.osc.edu.out

Duration Statistics:
 Mean 48.342000
 St. deviation : 723.330000
 Max value: 58329.000000
 Min value: 0.000000
 Total value count: 4898431.000000
 Number of non-zero values: 118939.000000

SPARK_MASTER=spark://n0381.ten.osc.edu:7077

SparkSQL

1. Create an app in Python: sql.py

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
import urllib

f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")
data_file = "./kddcup.data.gz"

sc = SparkContext(appName="SparkSQL")
raw_data = sc.textFile(data_file).cache()

# getting a data frame
sqlContext = SQLContext(sc)
csv_data = raw_data.map(lambda l: l.split(","))
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
))

interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")

# Select tcp network interactions with more than 1 second duration and no transfer from destination
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()

tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: %f, Dest. bytes: %f" % (p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
    print(ti_out)

interactions_df.printSchema()

interactions_df.select("protocol_type", "duration", "dst_bytes").groupBy("protocol_type").count().show()
interactions_df.select("protocol_type", "duration", "dst_bytes").filter(interactions_df.duration > 1000).filter(interactions_df.dst_bytes == 0).groupBy("protocol_type").count().show()
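A short sketch of persisting such a DataFrame as a Parquet file (not on the original slide; the output path is illustrative). This is the same pattern used in the case study below:

# Write the DataFrame out as Parquet and read it back.
interactions_df.write.parquet("/tmp/interactions.parquet")   # illustrative path
interactions_parquet = sqlContext.read.parquet("/tmp/interactions.parquet")
interactions_parquet.count()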

Output: sql.log

CASE STUDY: Data mining of historical job records of OSC's clusters

Aim: To understand client utilization of OSC resources.
Data: Historical records of every job that ran on any OSC cluster, including information such as number of nodes, software, CPU time, and timestamp.

Workflow (the data lives in a MySQL DB), as sketched below:
- Data up to 2016: import into Spark, save as a Parquet file, then analyze.
- Newer data: reload into Spark and append to the Parquet file.
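The following is a rough sketch (not from the original slides) of that import/append workflow. The JDBC URL, table name, credentials, and date column are placeholders; it assumes a SQLContext sqlContext as in the SparkSQL example, and it requires the MySQL JDBC driver on Spark's classpath.

# Initial import: read the historical job records from MySQL ...
jobs_df = (sqlContext.read.format("jdbc")
           .option("url", "jdbc:mysql://dbhost:3306/pbsacct")   # placeholder URL
           .option("dbtable", "Jobs")                           # placeholder table
           .option("user", "username")
           .option("password", "password")
           .load())

# ... and save them as a Parquet file for fast repeated analysis.
jobs_df.write.parquet("/fs/scratch/pbsacct/Jobs.parquet")

# Later runs: append only the newer records to the same Parquet file.
new_jobs_df = jobs_df.filter(jobs_df.start_date >= "2016-01-01")   # hypothetical date column
new_jobs_df.write.mode("append").parquet("/fs/scratch/pbsacct/Jobs.parquet")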

Pyspark code

# importing data
df = sqlContext.read.parquet("/fs/scratch/pbsacct/Jobs.parquet")

# Which type of queue is used most?
df.select("jobid", "queue").groupBy("queue").count().show()

# Which software is used most?
from pyspark.sql.functions import col
df.select("jobid", "sw_app").groupBy("sw_app").count().sort(col("count").desc()).show()

# Who uses the gaussian software most?
df.registerTempTable("mysql")
sqlContext.sql("SELECT username FROM mysql WHERE sw_app = 'gaussian'").show()
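A possible refinement of the last query (not on the original slide) is to aggregate by user so the heaviest gaussian users appear first:

from pyspark.sql.functions import col

(df.filter(df.sw_app == "gaussian")
   .groupBy("username")
   .count()
   .sort(col("count").desc())
   .show())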

Results

Query time, MySQL vs Spark:

Statistics          MYSQL       SPARK
Job vs CPU          1 hour      5 sec
CPU vs Account
Walltime vs user    1.40 hour

References

1. Spark Programming Guide (programming with Scala, Java and Python): https://spark.apache.org/docs/2.0.0/programming-guide.html
2. Data Exploration with Spark: http://www.cs.berkeley.edu/~rxin/ampcamp-ecnu/data-exploration-using-spark.html
3. Interactive Data Analytics in SparkR: http://ampcamp.berkeley.edu/5/exercises/sparkr.html
4. OSC Spark documentation: https://www.osc.edu/documentation/software_list/spark_documentation