Apache Hadoop and Spark Instructor: Bei Kang
Huge Amount of Data every Minute
Hadoop Hadoop is an open source, Scalable, and Fault tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but is a platform for large data storage as well as processing. Hadoop consists of three key parts – Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop. Map-Reduce – It is the data processing layer of Hadoop. YARN – It is the resource management layer of Hadoop.
Hadoop Hadoop Architecture – courtesy of Data Flair
Hadoop – HDFS System Hadoop Distributed File System (HDFS)
Hadoop – HDFS System NameNode and DataNodes HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Hadoop – HDFS System The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
Hadoop – HDFS Stsyem Goals of HDFS Fault detection and recovery : Since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery. Huge datasets : HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets. Hardware at data : A requested task can be done efficiently, when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.
Hadoop – Map Reduce
Hadoop – Map Reduce Bear, Deer, River and Car Example – word count
Hadoop – Map Reduce Another example: The chart is electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years. Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg 1979 23 2 43 24 25 26 1980 27 28 30 31 29 1981 32 33 34 35 36 1984 39 38 41 42 40 1985 00 45
Hadoop – Map Reduce The above data is saved as sample.txtand given as input. The input file looks as shown below. 1979 23 23 2 43 24 25 26 26 26 26 25 26 25 1980 26 27 28 28 28 30 31 31 31 30 30 30 29 1981 31 32 32 32 33 34 35 36 36 34 34 34 34 1984 39 38 39 39 39 41 42 43 40 39 38 38 40 1985 38 39 39 39 39 41 41 41 00 40 39 39 45 What we can do to find the max or min values for the given data?
Apache Spark Apache Spark is an open-source, lightning fast big data framework which is designed to enhance the computational speed. Hadoop MapReduce, read and write from the disk as a result it slows down the computation. While Spark can run on top of Hadoop and provides a better computational speed solution. Apache Spark – Spark is lightning fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Because of reducing the number of read/write cycle to disk and storing intermediate data in-memory Spark makes it possible. On the other hand, Hadoop MapReduce – MapReduce reads and writes from disk, as a result, it slows down the processing speed. Supports Java, Scala and Python
Relation and Difference between Spark and Hadoop 1. Both Hadoop and Spark are tools or frameworks for big data handling. 2. Hadoop focuses on HDFS, data storage and map reduce, whereas Spark focus on in-memory calculation. 3. Hadoop processes the static data sets, whereas Spark can process either static or real time/streaming data. 4. Spark uses Hadoop as the file/data storage. 5. Disaster recovery – Hadoop is better than Spark because of disk operations vs memory operations.
Spark – RDD System partitioned collection of records Resilient Distributed Datasets partitioned collection of records spread across the cluster read-only caching dataset in memory different storage levels available fallback to disk possible
Spark – RDD Operations transformations to build RDDs through deterministic operations on other RDDs transformations include map, filter, join Lazy evaluation: Nothing computed until an action requires it actions to return value or export data actions include count, collect, save triggers execution
Spark Example (Python) # Estimate π (compute-intensive task). # Pick random points in the unit square ((-1, -1) to (1,1)), # See how many fall in the unit circle. The fraction should be π / 4 # Note that “parallelize” method creates an RDD def sample(p): x, y = random(), random() return 1 if x*x + y*y < 1 else 0 count = spark.parallelize(range(0, NUM_SAMPLES), partitions).map(sample) \ .reduce(add) print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Pi calculation
Spark Example – Word Count lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0]) counts = lines.flatMap(lambda x: x.split(' ')) \ .map(lambda x: (x, 1)) \ .reduceByKey(add) output = counts.collect() for (word, count) in output: print("%s: %i" % (word, count))
Spark Example – Word Count .flatMap: we take the RDD of lines and transform it to an RDD of words. map: we transform RDD of words into RDD of tuples (word, 1). It’s also called key-value RDD.reduceByKey: for each key (word) we reduce all the values by summing all the values together.