Hadoop Introduction
Audience
Introduction of students
– Name
– Years of experience
– Background
– Do you know Java?
– Do you know Linux?
– Any exposure to data warehousing?
Big Data History
Google
– Started with MySQL for the search engine (scalability was a major issue)
– Developed a solution from scratch
  Distributed file system – GFS
  Distributed processing – MapReduce
  Bigtable
Big Data
Characteristics – Volume, Variety and Velocity
Batch – Hadoop with MapReduce
– GFS -> HDFS
– MapReduce -> Hadoop MapReduce
Operational but not transactional – NoSQL
– Google Bigtable -> HBase
Characteristics
Volume – Huge amounts of data
Variety – Structured, semi-structured and unstructured data
Velocity – Speed at which data needs to be processed
Hadoop core components
HDFS – Storage
YARN/MapReduce – Processing
Oracle Architecture
[Diagram: storage, network switches (interconnect) and database servers]
Hadoop Architecture
[Diagram labels: Metadata, Helper, Storage, Processing, Processing Master]
Hadoop Architecture
– HDFS
– MapReduce
HDFS
– NameNode
– Secondary NameNode
– DataNode
MapReduce
Processing
– MRv1/Classic
– MRv2/YARN
* We will look into the details later
Typical Hadoop Cluster
[Diagram: network switch(es) connecting the cluster nodes, each node running both HDFS and YARN]
Typical Hadoop Cluster
[Diagram: network switch(es); master nodes running NN (NameNode), RM (ResourceManager) and SNN (Secondary NameNode); worker nodes each running DN (DataNode) and NM (NodeManager)]
Hadoop ecosystem
Hadoop core components
– Distributed File System (HDFS)
– MapReduce
Hadoop ecosystem components
– Hive
– Pig
– Flume
– Sqoop
– Oozie
– Mahout
Non-MapReduce
– Impala
– Presto
– HBase
Difference between Oracle and Hadoop
– Oracle architecture
– Hadoop architecture
– Theoretical differences
Oracle Architecture
Servers
– Cluster of servers with the same binaries and configuration
– All of them run the same backend processes
Storage
– NFS (Network File System)
– Mounted on all the database servers
Software
– Binaries are installed on all the servers
Parameter files
– init.ora, pfile, spfile, etc.
Backend processes (same on all servers)
– SMON
– PMON
– etc.
Memory
– Same amount of memory on all the nodes in the cluster
Memory structures
– PGA
– SGA
  Shared pool (cache of code such as SQL plans)
  Database buffer cache
Network
– Typically 3 network switches: one for the interconnect between nodes, one to connect to storage, one for public connectivity
Hadoop Architecture
Storage – Hadoop Distributed File System (HDFS)
Processing
– MapReduce (the majority of Hadoop ecosystem tools use MapReduce for processing)
– Non-MapReduce
Hadoop ecosystem
Hadoop core components
– Distributed File System (HDFS)
– MapReduce
Hadoop ecosystem components
– Hive
– Pig
– Flume
– Sqoop
– Oozie
– Mahout
Non-MapReduce
– Impala
– Presto
– HBase
– Spark
Hadoop Distributions and Hadoop Appliances
[Diagram: Hadoop, Hive, Sqoop, monitoring and many more components; distributions from Hortonworks, Cloudera, MapR and many more; Oracle Big Data Appliance]
Hadoop Distributions and Hadoop Appliances
– Hadoop and the ecosystem tools are Apache open source projects.
– Cloudera, Hortonworks and other leading Hadoop-based technology companies contribute to these open source projects.
– They provide training, support and services.
– Cloudera has a proprietary monitoring tool developed for large Hadoop clusters. It is free for up to 50 nodes, after which a license fee must be paid.
– Hortonworks uses Ambari, an open source monitoring tool with no license fee.
HDFS
– How to copy files to and from HDFS?
– What is HDFS?
– What are the HDFS daemons? Explain the NameNode, Secondary NameNode and DataNode.
– What are the different parameter files? Explain the importance of the parameters. What does “final” mean for a parameter?
– What is a gateway node? What is its role with respect to HDFS?
– What is a block, what is block size and how is data distributed?
– What is fault tolerance? What is the role of the replication factor?
– What are the default block size and the default replication factor?
– Given a scenario, explain how files are stored in HDFS. Understand the size of each block and the replication factor.
– How to override parameters such as block size and replication factor while copying files?
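The block size and replication questions above can be worked through with a little arithmetic. The Python sketch below is only an illustration: it assumes the common defaults of a 128 MB block size and a replication factor of 3 (Hadoop 1.x defaulted to 64 MB), the function name `hdfs_layout` is hypothetical, and nothing here talks to a real cluster.

```python
MB = 1024 * 1024

def hdfs_layout(file_size, block_size=128 * MB, replication=3):
    """Return (block_sizes, raw_bytes) for a file of file_size bytes.

    block_sizes lists the size of each block of ONE copy of the file.
    raw_bytes is the total space used once every block is replicated.
    """
    if file_size < 0 or block_size <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    full_blocks, remainder = divmod(file_size, block_size)
    block_sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        block_sizes.append(remainder)
    raw_bytes = file_size * replication
    return block_sizes, raw_bytes

blocks, raw = hdfs_layout(300 * MB)
print([b // MB for b in blocks])  # [128, 128, 44]
print(raw // MB)                  # 900
```

A 300 MB file therefore occupies three blocks, the last holding only 44 MB, and with three replicas of each block the cluster stores 900 MB in total.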
MapReduce
– How to run a MapReduce job?
– What is the difference between classic and YARN?
– What are a map task and a reduce task?
– What are the MapReduce daemons in classic?
– What are the MapReduce daemons in YARN?
– What is the application master?
– What is the job history server?
– How to troubleshoot using logs?
– How does fault tolerance work in MapReduce?
– What is split size?
– How to override parameters while running programs?
– What is the role of the gateway node with respect to running MapReduce jobs?
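To make the split size question concrete, here is a hedged Python sketch of how a file-based input format typically derives the split size and the number of splits (and hence map tasks) for one splittable file. It mirrors the rule used by Hadoop's `FileInputFormat`, where the split size is `max(minSize, min(maxSize, blockSize))` and a final remainder of up to 10% over the split size is folded into the last split. The function names are hypothetical; on a real job the minimum and maximum are usually set through `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))

def split_lengths(file_size, split_size):
    """Return the length in bytes of each input split for one file."""
    lengths = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining:
        lengths.append(remaining)
    return lengths

size = compute_split_size(128 * MB)
print([s // MB for s in split_lengths(300 * MB, size)])  # [128, 128, 44]
print([s // MB for s in split_lengths(135 * MB, size)])  # [135]
```

With default settings the split size equals the block size, so a 300 MB file yields three map tasks, while a 135 MB file stays in a single split because the 7 MB overhang is within the 10% allowance.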
MapReduce - Programming
– If we develop a MapReduce program, what are the criteria for developing the map function and the reduce function?
– What are the steps involved in the development life cycle?
– What is shuffle and sort?
– What are the different input and output formats?
– What are the different key and value classes?
– Why do we have a new set of classes such as IntWritable and Text instead of standard Java classes such as Integer and String?
– What are the steps involved in developing custom key and value classes?
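The division of work between a map function and a reduce function can be illustrated with a word count, simulated here in plain Python rather than with the Hadoop Java API. This is a minimal sketch under stated assumptions: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, the record key is a line index instead of the byte offset a real text input format would supply, and everything runs in one process.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Called once per input record; emits (word, 1) for every word."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Called once per distinct key with all values for that key."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase: one map_fn call per input line.
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    # Shuffle and sort is reduced here to grouping by key and sorting keys.
    output = {}
    for key in sorted(grouped):
        for out_key, out_value in reduce_fn(key, grouped[key]):
            output[out_key] = out_value
    return output

result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The map function only looks at one record at a time and the reduce function only at one key at a time, which is what lets the framework run many copies of each in parallel.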
MapReduce – shuffle and sort
– For each input data set, how many times will the map function be invoked?
– What will the map output look like?
– How will the data be partitioned and sorted?
– What is a spill? How many times does it happen?
– What will the reducer input look like?
– For each input data set, how many times will the reducer be invoked?
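The partition, sort and group steps asked about above can be sketched in Python as follows. This is an illustration only: `zlib.crc32` is used as a deterministic stand-in for the key hash that Hadoop's default hash partitioner takes modulo the number of reducers, the function names are hypothetical, and spills to disk are not modeled.

```python
import zlib
from itertools import groupby

def partition(key, num_reducers):
    """Pick the reducer for a key: hash(key) modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(map_output, num_reducers):
    """Turn (key, value) pairs from the mappers into per-reducer input.

    Returns one list per reducer; each list holds (key, [values]) entries
    sorted by key, which is the shape of input a reducer receives.
    """
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))

    reducer_inputs = []
    for pairs in partitions:
        pairs.sort(key=lambda kv: kv[0])  # sort by key within the partition
        grouped = [
            (key, [value for _, value in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        ]
        reducer_inputs.append(grouped)
    return reducer_inputs

map_output = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
print(shuffle_and_sort(map_output, 1))  # [[('a', [1, 1]), ('b', [1]), ('c', [1])]]
```

All values for a given key land in the same partition, so the reduce function is invoked once per distinct key, while the map function is invoked once per input record.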
Apache Hive
– What is Hive?
– How is data stored and processed in Hive?
– Where is the metadata stored?
– How are tables created in Hive?
– What is the difference between a managed table and an external table?
– How to specify delimiters?
– What are the different file formats supported by Hive?
Apache Sqoop
Apache Spark