Presentation on theme: "Hadoop — Javad Azimi, May 2015" — Presentation transcript:

1 Hadoop Javad Azimi May 2015

2 What is Hadoop? A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop Distributed File System
Yahoo! is the biggest contributor. Here's what makes it especially useful:
Scalable: it can reliably store and process petabytes.
Economical: it distributes the data and processing across clusters of commonly available computers (numbering in the thousands).
Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
Sathya Sai University, Prashanti Nilayam

3 HDFS The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant:
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
It provides high-throughput access to application data and is suitable for applications that have large data sets.
It is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
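The block-and-replica design can be illustrated with a toy sketch. This is pure Python, not the real HDFS API; the block size and replication factor below are common defaults, and real HDFS uses rack-aware placement rather than simple round-robin:

```python
# Toy illustration of how HDFS splits a file into fixed-size blocks
# and assigns each block's replicas to DataNodes. NOT the HDFS API.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Round-robin each block's replicas over the available DataNodes.
    (Real HDFS is rack-aware and never puts two replicas on one node.)"""
    nodes = itertools.cycle(datanodes)
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [next(nodes) for _ in range(replication)]
    return placement

file_size = 300 * 1024 * 1024                 # a 300 MB file
blocks = split_into_blocks(file_size)         # 3 blocks: 128 + 128 + 44 MB
nodes = ["node1", "node2", "node3", "node4", "node5"]
print(place_replicas(blocks, nodes))
```

Losing one node then costs at most one replica per block, which the NameNode can re-replicate from the surviving copies.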

4 Hadoop: Assumptions HDFS is written with large clusters of computers in mind and is built around the following assumptions:
Hardware will fail.
Processing will be run in batches; the emphasis is on high throughput rather than low latency.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
It should support tens of millions of files in a single instance.
Moving computation is cheaper than moving data.

5 MapReduce in a nutshell [Figure: input data is split across Task 1, Task 2, and Task 3; each task's output is combined into an aggregated result. © Sven Schlarb] This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
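The split-process-aggregate flow in the figure is the classic map / shuffle / reduce pattern. A minimal in-process sketch (word count, everything in one Python process, whereas real Hadoop distributes each phase across machines):

```python
# Minimal sketch of the MapReduce word-count pattern:
# map emits (key, value) pairs, the shuffle groups them by key,
# and reduce aggregates each group.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real job, each mapper runs on the node holding its data block ("moving computation, not data"), and the shuffle happens over the network.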

6 Services on Top of Hadoop
SQL-oriented languages: Hive, Pig
Machine-learning tool: Mahout – clustering, classification, batch-based collaborative filtering, …

7 Next Generation (Spark) Apache Spark (started 2010); see "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI 2012.
Easy to develop: rich Java APIs, an interactive shell, 2–5 times less code.
Fast to run: in-memory storage, up to 100 times faster than Hadoop MapReduce.
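What "in-memory storage" buys can be shown with a toy RDD-like class. This is NOT the Spark API (real Spark RDDs are distributed, fault-tolerant, and cache lazily on the first action; this eager toy keeps the sketch short), but it shows why caching a parsed dataset makes repeated jobs cheap:

```python
# Toy illustration of the Spark idea: transformations are lazy, and a
# cached dataset stays in memory, so later jobs reuse it instead of
# re-reading and re-parsing the input. NOT the Spark API.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # zero-arg function producing the data
        self._cache = None

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Real Spark caches on the first action; this eager version
        # materializes immediately for brevity.
        self._cache = self._materialize()
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

    def collect(self):
        return self._materialize()

lines = ToyRDD(lambda: ["1", "2", "3", "4"])   # stand-in for reading a file
numbers = lines.map(int).cache()               # parsed once, kept in RAM
evens = numbers.filter(lambda n: n % 2 == 0)
squares = numbers.map(lambda n: n * n)         # reuses the cached parse
print(evens.collect(), squares.collect())      # [2, 4] [1, 4, 9, 16]
```

Iterative workloads (machine learning, graph algorithms) repeat this reuse many times per job, which is where the large speedups over disk-based MapReduce come from.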

8 Next Generation (GraphLab) Yucheng Low et al. "GraphLab: A New Parallel Framework for Machine Learning." UAI 2010. Yucheng Low et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." PVLDB 2012. http://graphlab.org/projects/index.html http://graphlab.org/resources/publications.html Key concepts: the data graph, and update functions with their scope.
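The data-graph/update-function model can be sketched conceptually: data lives on the vertices of a graph, and an update function runs on one vertex at a time, touching only its scope (the vertex and its neighbors). The PageRank-style example below is a pure-Python illustration, not the GraphLab API, and the tiny graph and damping value are made up for the sketch:

```python
# Conceptual sketch of the GraphLab model: an update function
# recomputes one vertex's data from its scope (its in-neighbors).
# PageRank-style update on a toy 3-vertex graph; NOT the GraphLab API.
graph = {                      # vertex -> list of in-neighbors
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
out_degree = {"a": 1, "b": 1, "c": 2}   # consistent with the edges above
rank = {v: 1.0 for v in graph}
DAMPING = 0.85

def update(vertex):
    """Recompute one vertex's rank from its scope (in-neighbors only)."""
    incoming = sum(rank[n] / out_degree[n] for n in graph[vertex])
    rank[vertex] = (1 - DAMPING) + DAMPING * incoming

for _ in range(20):            # GraphLab's scheduler picks vertices
    for v in graph:            # adaptively; we just sweep in order
        update(v)
print(rank)
```

Because each update reads only a local scope, GraphLab can run many non-overlapping updates in parallel while preserving sequential consistency.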

9 Graphics Processing Unit (GPU) What is a GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing. It serves as both a programmable graphics processor and a scalable parallel computing platform. Heterogeneous systems combine a GPU with a CPU.

10 CUDA CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce. CUDA gives developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.
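The CUDA execution model can be mimicked in plain Python: a kernel is a function executed once per thread, and each thread uses its grid coordinates to pick one output element. A real kernel would be written in CUDA C and launched across thousands of GPU threads; this sketch only shows the decomposition, with the two loops standing in for the thread grid:

```python
# Mimic of the CUDA decomposition for matrix multiply: one "thread"
# per output element C[row][col]. On a GPU these run in parallel;
# here we loop over the grid just to show the indexing.
def matmul_kernel(A, B, C, row, col):
    """What a single CUDA thread would compute for C[row][col]."""
    n = len(A[0])
    C[row][col] = sum(A[row][k] * B[k][col] for k in range(n))

def launch(A, B):
    """Stand-in for a kernel launch over a (rows x cols) thread grid."""
    rows, cols = len(A), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for row in range(rows):          # on the GPU, these two loops are
        for col in range(cols):      # replaced by the thread grid
            matmul_kernel(A, B, C, row, col)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(launch(A, B))  # [[19, 22], [43, 50]]
```

Because every output element is independent, the GPU can assign each to its own thread, which is exactly why the matrix-multiplication benchmark on the next slide parallelizes so well.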

11 Testing – Matrices Test the multiplication of two matrices: create two matrices with random floating-point values. We tested with matrices of various dimensions:

Dim        CUDA          CPU
64x64      0.417465 ms   18.0876 ms
128x128    0.41691 ms    18.3007 ms
256x256    2.146367 ms   145.6302 ms
512x512    8.093004 ms   1494.7275 ms
768x768    25.97624 ms   4866.3246 ms
1024x1024  52.42811 ms   66097.1688 ms
2048x2048  407.648 ms    Didn't finish
4096x4096  3.1 seconds   Didn't finish
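The speedup implied by these measurements is just CPU time divided by GPU time for each size (numbers taken directly from the table above):

```python
# (CUDA ms, CPU ms) pairs from the benchmark table; the speedup is
# simply CPU time divided by GPU time for each matrix size.
times = {
    "64x64":     (0.417465, 18.0876),
    "128x128":   (0.41691, 18.3007),
    "256x256":   (2.146367, 145.6302),
    "512x512":   (8.093004, 1494.7275),
    "768x768":   (25.97624, 4866.3246),
    "1024x1024": (52.42811, 66097.1688),
}
for dim, (gpu, cpu) in times.items():
    print(f"{dim}: {cpu / gpu:.0f}x faster on GPU")
```

The speedup grows with matrix size, from roughly 43x at 64x64 to over 1200x at 1024x1024, as the GPU's parallelism is increasingly saturated.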

12 Result: SVM classification on GPU (Speedup over LibSVM)

