
1 Big-Data – Big Challenge

2 Myself
- Name: Yuvraj J Bhosale
- 3.5 years of IT experience in PL/SQL development and performance tuning.
- Experienced in creating logical and physical data models and generating database scripts using Erwin.
- Hands-on with SQL, Oracle PL/SQL, UNIX, Informatica, and data modeling.
- Rich knowledge of Big Data, MongoDB, Hadoop, MapReduce, Hive, HBase, data warehousing, Core Java, and DB2.
- Working at Mphasis, an HP company, as an Oracle PL/SQL Developer.
- Member of the BigAnalytics group (a research group for Big Data) at Mphasis.

3 You will learn the following topics:
- Understand Big Data and the Hadoop ecosystem
- Hadoop Distributed File System (HDFS)
- Installing Hadoop on a single-node cluster
- Running Pig and Hive scripts
- Running distributed programs on a single-node cluster
- Understand NoSQL databases and the different types of NoSQL databases
- Hands-on with Hadoop, Hive and MongoDB
- Understand Sqoop and Oozie

4 Content
- What is Big Data?
- Why Big Data?
- Introduction to the Hadoop ecosystem
- The Hadoop approach and MapReduce
- The Hadoop Distributed File System (HDFS)
- Introduction to NoSQL
- Types of NoSQL
- Writing Hive, HBase and MongoDB scripts
- Pentaho and Hadoop

5

6 What is Big Data?
"Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization." - Wikipedia
or
"Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." - the 3 Vs model
- There were about 7 billion mobile phones in use in the world as of 2012.
- 30 billion pieces of content are shared on Facebook each month.
- 15 out of 17 major business sectors in the United States have more data stored per company than the US Library of Congress.

7 Why Big Data?
- "Big Data" is used to represent massive amounts of unstructured data that cannot traditionally be stored in relational form in enterprise databases.
- Data storage is defined on the order of petabytes, exabytes and higher, well beyond current enterprise storage limits, which are on the order of terabytes.
- It is generally unstructured data that does not fit the relational database designs enterprises have been used to.
- Data is generated using unconventional methods outside of data entry, such as RFID, sensor networks, etc.
- Data is time sensitive and is collected with relevance to time zones.

8 TDA vs. BDA
Traditional Data Warehouse Analytics vs. Big Data Analytics:
- Traditional: Analyzes known data that is well understood, cleansed, and in line with the business metadata. Big Data: Targets unstructured data, with no guarantee that the incoming data is well formed, clean, and free of errors.
- Traditional: Built on top of the relational data model; relationships between the subjects of interest are created inside the system and the analysis is done based on them. Big Data: It is difficult to establish relationships across all the information (unstructured data in images, videos, mobile-generated information, RFID, etc.), yet all of it has to be considered in big data analytics.
- Traditional: Batch oriented; we need to wait for nightly ETL and transformation jobs to complete before the required insight is obtained. Big Data: Aimed at real-time analysis of the data using software designed for it.
- Traditional: Parallelism is achieved through costly hardware such as MPP (Massively Parallel Processing) or SMP systems. Big Data: Parallelism can be achieved through commodity hardware and a new generation of analytical software such as Hadoop or other analytical databases.

9 Use Cases for Big Data Analytics
Enterprises can understand the value of Big Data Analytics from use cases showing how traditional problems can be solved with its help:
– Customer Satisfaction and Warranty Analysis
– Competitor Market Penetration Analysis
– Healthcare / Epidemic Research & Control
– Product Feature and Usage Analysis
– Future Direction Analysis

10 Hadoop Ecosystem

11 The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

12 Data Distribution in the Hadoop Framework
The Hadoop Approach
Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. The theoretical 1000-CPU machine described earlier would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core machines. Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster.

13 MapReduce Algorithm
"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia
A MapReduce program comprises a Map() procedure that performs filtering and sorting, and a Reduce() procedure that performs a summary. The MapReduce system orchestrates processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing for redundancy and failures, and managing the whole process overall.
E.g., if you need to count the number of students in a class and you apply the MapReduce algorithm, then Map sorts students by first name into queues, one queue for each name, and Reduce counts the number of students in each queue, yielding name frequencies.
Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
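As a rough single-machine analogy of this map/shuffle/reduce flow (not Hadoop itself, and the file name input.txt is just a placeholder), a UNIX pipeline can count name or word frequencies the same way:
# Map: emit one key (word) per line; Shuffle: sort brings identical keys together; Reduce: count each group
$ tr -s ' ' '\n' < input.txt | sort | uniq -c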

14 MapReduce Model – 5 Steps
1. Prepare the Map() input – the MapReduce system designates Map processors, assigns the K1 input key value each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
A minimal way to see "user-provided Map() and Reduce() code" in action on the cluster is sketched below.
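Hadoop Streaming lets any executable act as the mapper and reducer, so a trivial job can be submitted without writing Java. This is only a sketch: the install path and HDFS directories follow this deck's later single-node setup, and /bin/cat and /usr/bin/wc are placeholder programs, not a real analysis.
# Submit a trivial streaming job; the framework handles splitting, shuffle, and output collection
$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
    -input /user/hduser/input -output /user/hduser/streaming-out \
    -mapper /bin/cat -reducer /usr/bin/wc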

15 MapReduce Model – 5 Steps

16 Hadoop Distributed File System (HDFS)
"HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework."
HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as DataNodes. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication. Each Hadoop instance typically has a single NameNode, and a cluster of DataNodes forms the HDFS cluster. Clients use remote procedure calls (RPC) to communicate with these nodes.
A file can be made of several blocks, and they are not necessarily stored on the same machine; the target machines that hold each block are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but HDFS supports file sizes far larger than a single-machine DFS; individual files can require more space than a single hard drive could hold.
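To see this block layout on a running cluster, Hadoop 1.x ships an fsck tool that reports the blocks of a file and the DataNodes holding them. The file path below is only an illustrative placeholder, and the install path follows the later slides.
# List the blocks of a file and which DataNodes store them
$ /usr/local/hadoop/bin/hadoop fsck /user/hduser/input/file.txt -files -blocks -locations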

17 The Design of HDFS
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Very large files: "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. A dataset is typically generated or copied from a source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without noticeable interruption to the user in the face of such failure.

18 NameNodes and DataNodes
An HDFS cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers).
NameNode: The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The NameNode also knows the DataNodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the DataNodes when the system starts. Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode were obliterated, all the files on the file system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
DataNode: DataNodes are the workhorses of the file system. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks that they are storing.
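Once the single-node cluster from the following slides is running, a quick way to see the NameNode's view of its DataNodes is the Hadoop 1.x dfsadmin report (install path follows the later slides):
# Ask the NameNode for capacity figures and the list of live DataNodes
$ /usr/local/hadoop/bin/hadoop dfsadmin -report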

19 Configuring HDFS
Installing Hadoop on a Single-Node Cluster
Setup:
1. Ubuntu 12.04 (VirtualBox, 32 or 64 bit)
2. RAM – 1 GB
3. HDD – 40 GB
4. OpenJDK 1.6
1. Java Installation
Hadoop requires a working Java 1.5+ (aka Java 5) installation.
# Update the source list
$ sudo apt-get update
2.1 Add the following PPA and install the latest Oracle Java (JDK) 6 in Ubuntu:
# Install JDK 1.6
$ sudo apt-get update && sudo apt-get install oracle-jdk6-installer
2.2 Check whether JDK 1.6 is installed or not:
$ java -version
The full JDK will be placed in /usr/lib/jvm/ (this directory is actually a symlink on Ubuntu). After installation, make a quick check whether Sun's JDK is correctly set up.
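A minimal sketch of that verification step, assuming the JDK ended up under /usr/lib/jvm/ as the slide describes (the exact directory name varies by package and is an assumption here):
# Confirm the Java version and locate the JDK directory for later use as JAVA_HOME
$ java -version
$ ls -l /usr/lib/jvm/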

20 Configuring HDFS
2. Configuring SSH
1) Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
$ ssh-keygen -t rsa -P ""

21 Configuring HDFS
2) The command above creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes). Second, you have to enable SSH access to your local machine with this newly created key:
$ cat $HOME/Hadoop/id_rsa.pub >> /home/Hadoop/authorized_keys
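As a sanity check, and assuming the more common default key location ~/.ssh/ (which differs from the paths shown above and should be adjusted to your own setup), the passwordless connection can be tested like this:
# Conventional location for the public key and authorized_keys file (assumption; adjust as needed)
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
# Verify that SSH to localhost works without a password prompt
$ ssh localhost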

22 Hadoop Installation
Download Hadoop from the Apache download mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.4.tar.gz
$ sudo mv hadoop-1.0.4 hadoop
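If Hadoop is to run as a dedicated local user, a common convention is to hand ownership of the extracted directory to that user; the hduser/hadoop user and group names below are assumptions, not part of the original slides.
# Give the Hadoop system user ownership of the installation directory (user/group names are assumed)
$ sudo chown -R hduser:hadoop /usr/local/hadoop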

23 Hadoop Installation
Update $HOME/.bashrc: add the following lines to the end of the $HOME/.bashrc file of the local user (a sketch follows below).
Update hadoop-env.sh: open conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
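The exact lines from the original slide were not captured in this transcript; a typical sketch, assuming the /usr/local/hadoop install location from the previous slide and a Sun JDK 6 under /usr/lib/jvm/ (the JDK directory name itself is an assumption), looks like this:
# $HOME/.bashrc – make Hadoop and Java reachable from the shell
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun    # adjust to your actual JDK directory
export PATH=$PATH:$HADOOP_HOME/bin
# conf/hadoop-env.sh – Hadoop's own scripts read JAVA_HOME from here
export JAVA_HOME=/usr/lib/jvm/java-6-sun    # adjust to your actual JDK directory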

24 Configuration of XML Files
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
You can leave the settings below as is, with the exception of the hadoop.tmp.dir parameter, which you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop's default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don't be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions (see the sketch after this slide).
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
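The directory-creation commands were not captured in the transcript; a common sketch, reusing the assumed hduser/hadoop user and group names from the installation step, is:
# Create hadoop.tmp.dir and restrict it to the Hadoop user (user/group names are assumed)
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp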

25 Configuration of XML Files
In file conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

26 Configuration of XML Files
In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

27 Formatting the HDFS File System via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file system of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
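The formatting command itself was not captured in the transcript; for the Hadoop 1.0.4 layout used in this deck it is typically run as shown below (install path follows the earlier slides).
# Format the HDFS namespace – run only once, before the first start; it erases any existing HDFS data
$ /usr/local/hadoop/bin/hadoop namenode -format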

28 Control Single-Node Cluster
1. Starting your single-node cluster
Run the command:
$ /usr/local/hadoop/bin/start-all.sh
2. Stopping your single-node cluster
Run the command:
root@oracle:~$ /usr/local/hadoop/bin/stop-all.sh
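To confirm the daemons actually came up after start-all.sh, the JDK's jps tool lists the running Java processes; on a Hadoop 1.x single-node setup you would expect to see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.
# List the Java daemons started by start-all.sh
$ jps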

29 Control Single-Node Cluster
3. Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information on what happens behind the scenes is available on the Hadoop wiki.
4. Copy local example data to HDFS (see the sketch below):
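The copy command was not captured in the transcript; a typical sketch for the Hadoop 1.x shell, with /tmp/example-data and the hduser home directory used purely as placeholder paths, is:
# Copy a local directory of text files into HDFS (paths are placeholders)
$ /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /tmp/example-data /user/hduser/example-data
# Check that the files arrived
$ /usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/example-data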

30 Control Single-Node Cluster
5. Run the MapReduce job
Now we actually run the WordCount example job (see the sketch below).
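The job submission command is missing from the transcript; for the examples jar that ships with the Hadoop 1.0.4 tarball, a sketch using the same assumed placeholder paths as above is:
# Run the bundled WordCount example and then inspect the result (paths are placeholders)
$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /user/hduser/example-data /user/hduser/example-output
$ /usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/example-output/part-r-00000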

