Neelesh Srinivas Salian

Slides:



Advertisements
Similar presentations
© 2013 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Jim Donahue | Principal Scientist Adobe Systems Technology Lab Flint: Making.
Advertisements

MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 11 Managing and Monitoring a Windows Server 2008 Network.
Patrick Wendell Databricks Deploying and Administering Spark.
Using the Windows Event Viewer and Task Scheduler Chapter 5.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Hadoop Ecosystem Overview
SQL Server memory architecture and debugging memory Issues
HOW-TO | CIF CONSULT | Redouane BELBAHRI Informatica Create a crash dump using “Debug Diag”
Codeigniter is an open source web application. It occupies a very small amount of space in the memory and is most useful for developers who aim to develop.
Our Experience Running YARN at Scale Bobby Evans.
Troubleshooting Tips and Tricks Derick Larson Kinetic Data.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
Matthew Winter and Ned Shawa
Copyright 2007, Information Builders. Slide 1 Machine Sizing and Scalability Mark Nesson, Vashti Ragoonath June 2008.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules Apache Spark Osman AIDEL.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
Apache Hadoop on Windows Azure Avkash Chauhan
ORNL is managed by UT-Battelle for the US Department of Energy Spark On Demand Deploying on Rhea Dale Stansberry John Harney Advanced Data and Workflows.
CIT 140: Introduction to ITSlide #1 CSC 140: Introduction to IT Operating Systems.
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
LARGE-SCALE DATA ANALYSIS WITH APACHE SPARK ALEXEY SVYATKOVSKIY.
- Inter-departmental Lab
Big Data is a Big Deal!.
PROTECT | OPTIMIZE | TRANSFORM
Daniel Templeton, Cloudera, Inc.
How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Xueyan Li (Qunar) & Chunming Li (Garena)
Hadoop Architecture Mr. Sriram
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
Machine Learning Library for Apache Ignite
Introduction to Spark Streaming for Real Time data analysis
Data Analytics and CERN IT Hadoop Service
Introduction to Distributed Platforms
Hadoop and Analytics at CERN IT
ITCS-3190.
Spark and YARN: Better Together
How to download, configure and run a mapReduce program In a cloudera VM Presented By: Mehakdeep Singh Amrit Singh Chaggar Ranjodh Singh.
MIDAS- Molecular Dynamics Analysis Tutorial February 2017
Hadoop Tutorials Spark
Spark Presentation.
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
Data Platform and Analytics Foundational Training
Introduction Enosis Learning.
Software Engineering Introduction to Apache Hadoop Map Reduce
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Latest June Cloudera CCA175 Exam Questions - CCA175 Exam Dumps PDF
2018 Valid Cloudera CCA175 Dumps Questions - CCA175 Braindumps Dumpsprofessor.com
Introduction to Spark.
Distributed Systems CS
Introduction Enosis Learning.
CS110: Discussion about Spark
Hadoop for SQL Server Pros
Introduction to Apache
Execution Framework: Hadoop 2.x
Spark and Scala.
TIM TAYLOR AND JOSH NEEDHAM
Lecture 16 (Intro to MapReduce and Hadoop)
Distributed Systems CS
Charles Tappert Seidenberg School of CSIS, Pace University
Containerized Spark at RBC
IBM C IBM Big Data Engineer. You want to train yourself to do better in exam or you want to test your preparation in either situation Dumpspedia’s.
Big-Data Analytics with Azure HDInsight
Presentation transcript:

Neelesh Srinivas Salian Breaking Spark Top 5 Mistakes to avoid when using Apache Spark in Production Neelesh Srinivas Salian

Outline Introduction to Spark Information gathering Classification of the Issues Description Remedies Q&A

So just to begin with a brief introduction: Spark is an in-memory framework that enables a user to process data and perform analytics using the different components within it. This includes Core (Which is the main crux or the engine of Spark) Streaming (which allows you to process data that comes to the user in the form a stream of bits) MLLib ( the machine learning library that empowers the users to explore the usage of Machine learning algorithms to enable better insights) SQL ( the SQL engine in the Spark framework ) Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage. Having been in the environment that includes a lot of other components that form the CDH ecosystem, Spark has increased a lot in terms of Adoption.

Information Gathering Source: Cloudera Customer Opened Cases Posts in Cloudera Community Forums Spark use cases for the following: Spark Core Spark Streaming Spark on YARN Regular Spark Applications and memory configuration problems Problems in the Ingest pipelines Problems in interacting with Kafka Usually Resource Management issues

Classification Scaling of the Architecture Memory Configurations End user Code  Incompatible Dependencies  Administration/Operation related issues

1. Scaling of the Architecture 1. Increase in the cluster Size 2. Utilization of the Cluster

1. Scale of the Architecture: Larger Datasets 1. Understand the initial setup and current one 2. Look at logs / console to find the error 3. Determine parameters 4. Tune away

2. Memory Configurations YARN Remember: Container Memory yarn.nodemanager.resource.memory-mb   Container Memory Maximum yarn.scheduler.maximum-allocation-mb Container Memory Minimum yarn.scheduler.minimum-allocation-mb Standalone Java Heap Size of Worker in Bytes worker_max_heapsize Java Heap Size of Master in Bytes master_max_heapsize

2. Memory Configurations: Insufficient Memory Symptoms: Executor error: ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.OutOfMemoryError: Java heap space Driver error: ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-60] shutting down ActorSystem [sparkDriver] Cause: The driver or executor is configured with a memory limit that is too low for the application being run.

2. Memory Configurations: Insufficient Memory Understand Spark Configuration Order of Precedence Spark configuration file: Parameters are passed to SparkConf. Command-line arguments are passed to spark-submit, spark-shell, or pyspark. Cloudera Manager UI properties set in the spark-defaults.conf configuration file. Verify the Driver and Executor Memory Used Environment tab in History Server Check driver and executor memory usage values Increase the Heap Caveat: do not set the spark.driver.memory config using SparkConf , use - - driver-memory spark.driver.memory spark.executor.memory

3. End User Code Why chose Spark code over MapReduce? Expensive Operations Syntax / Command Errors

4. Incompatible Dependencies 1. Incorrectly built jars 2. Version mismatch

4. Incompatible Dependencies: Classpath Issues Symptoms Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream ImportError: No module named qt_returns_summary_wrapper

4. Incompatible Dependencies: Classpath Issues What to Do? Verify if an application is using classes that are not part of the framework (Hadoop/Spark) Verify that an application is built using correct artifacts.

5. Administration/Operation Related issues 1. Management of Cluster 2. User-related issues 3. Security/Permissions 4. No Configurations setup

Summary Learn from the Mistakes Be prepared References: Tuning Spark Blog Cloudera Tuning YARN Cloudera Spark Documentation

Credits Wilfred Paula Bjorn Mark Anand Sean Attila

Thank you Neelesh Srinivas Salian LinkedIn Email: neeleshssalian@gmail.com Twitter: NeelS7