Getting Data into Hadoop
September 18, 2017
Kyung Eun Park, D.Sc.
kpark@towson.edu
Contents
Data Lake from Data Store or Data Warehouse
Overview of the main tools for data ingestion into Hadoop
  1.1 Spark
  1.2 Sqoop
  1.3 Flume
Basic methods for importing CSV data into HDFS and Hive tables
Hadoop: Setting up a Single Node Cluster
Set up and configure a single-node Hadoop installation
Required software for Linux (Ubuntu 16.04.1 x64 LTS):
  Java
  ssh: $ sudo apt-get install ssh
Installing: download Hadoop, then edit etc/hadoop/hadoop-env.sh
  # set to the root of your Java installation
  export JAVA_HOME=/usr/lib/jvm/default-java
Set JAVA_HOME in your .bashrc shell file:
  JAVA_HOME=/usr/lib/jvm/default-java
  export JAVA_HOME
  PATH=$PATH:$JAVA_HOME/bin
  export PATH
  export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
  $ source .bashrc
  $ echo $JAVA_HOME
Try the following command:
  $ bin/hadoop
http://hadoop.apache.org/docs/r2.7.4/hadoop-project-dist/hadoop-common/SingleCluster.html
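The single-node guide linked above also expects passphraseless ssh to localhost; a minimal check-and-setup sketch, following the standard instructions:
  $ ssh localhost
  # if that prompts for a passphrase, set up key-based login:
  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys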
Hadoop as a Data Lake
With a traditional database or data warehouse approach
  Adding data to the database requires ETL (extract, transform, and load)
  Data are transformed into a pre-determined schema before loading
  Data usage must be decided during the ETL step; later changes are costly
  Data are discarded in the ETL step when they do not match the schema or exceed capacity constraints (only the data needed are kept!)
Hadoop approach: a central storage space for all data in HDFS
  Inexpensive and redundant storage of large datasets
  Lower cost than traditional systems
Standalone Operation
Copy the unpacked conf directory to use as input
Find and display every match of the given regular expression
Output is written to the given output directory
  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep input output 'dfs[a-z.]+'
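In standalone mode the output directory is written to the local file system, so the matches can be checked directly:
  $ cat output/*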
MapReduce
MapReduce application
  Software framework for writing applications that process vast amounts of data in parallel on large clusters of commodity hardware
  The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks
Schema on read
  Programmers and users enforce a structure to suit their needs when they access the data
  c.f.) schema on write of the traditional data warehouse approach, which requires upfront design and assumptions about how the data will be used
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
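The examples jar used above also bundles the classic word-count MapReduce application; a minimal sketch for running it, assuming the input directory created earlier and a hypothetical wc-output directory that does not yet exist:
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount input wc-output
  $ cat wc-output/*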
Why Raw Format?
For data science purposes, keeping all data in raw format is beneficial
  It is not clear in advance which data items will be valuable to a given data science goal
  A Hadoop application applies a schema to the data as it reads them from the lake
Advantages of the data lake approach over a traditional approach
  All data are available: no need for assumptions about future data use
  All data are sharable: no technical hurdles to data sharing
  All access methods are available: any processing engine (MapReduce, Spark, etc.) or application (Hive, Spark SQL, Pig) can be used to examine and process the data
Data Warehouses vs. Hadoop Data Lake
Hadoop as a complement to data warehouses
New data from disparate sources quickly fills the data lake
  Social media
  Click streams
  Sensor data, moving objects, etc.
Traditional ETL stages may not keep up with the rate at which data enter the lake
Both support access to data; in the Hadoop case, however, access can happen as soon as the data are available in the lake
ETL Process vs. Data Lake
[Figure: data from Sources A, B, and C either enter an ETL process (data usage decided up front, schema on write, mismatched data discarded) feeding a data warehouse / relational database, or enter the data lake (Hadoop) in raw format, where the user applies a schema on read.]
The Hadoop Distributed File System (HDFS)
All Hadoop applications operate on data stored in HDFS
HDFS is not a general-purpose file system, but a specialized streaming file system
  An explicit copy to and from the HDFS file system is needed
  Optimized for reading and writing large files
Writing data to HDFS
  Files are sliced into many small sub-units (blocks, shards)
  Blocks are replicated across the servers in a Hadoop cluster for reliability (to avoid data loss)
  Transparently written to the cluster nodes
Processing
  Slices are processed in parallel at the same time
Exporting: transferring files out of HDFS
  Slices are assembled and written as one file on the host file system
A single-node instance of HDFS
  No file slicing or replication!
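To see how HDFS has sliced and replicated a given file, the fsck tool reports its blocks and their locations; a minimal sketch, assuming a file already stored at the hypothetical HDFS path /user/hdfs/test:
  $ hdfs fsck /user/hdfs/test -files -blocks -locations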
Direct File Transfer to Hadoop HDFS
Using native HDFS commands
Copy a file (test) to HDFS: use the put command
  $ hdfs dfs -put test
View files in HDFS: use the ls command (cf. ls -la)
  $ hdfs dfs -ls
Copy a file from HDFS to the local file system: use the get command
  $ hdfs dfs -get another-test
More commands: refer to Appendix B
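A slightly fuller session with a hypothetical HDFS directory (data), using cat to verify the contents after the transfer:
  $ hdfs dfs -mkdir -p data
  $ hdfs dfs -put test data
  $ hdfs dfs -ls data
  $ hdfs dfs -cat data/test
  $ hdfs dfs -get data/test copied-test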
Importing Data from Files into Hive Tables
Hive: an SQL-like tool for analyzing data in HDFS
  Useful for feature generation
Importing data into Hive tables
  Existing text-based files exported from spreadsheets or databases
  Tab-separated values (TSV), comma-separated values (CSV), raw text, JSON, etc.
Two types of Hive table (see the DDL sketch below)
  Internal table: fully managed by Hive, stored in an optimized format (ORC)
  External table: not managed by Hive; uses only a metadata description to access the data in its raw form; dropping it deletes only the definition (metadata about the table) in Hive
After importing, process the data using a variety of tools including Hive's SQL query processing, Pig, or Spark
Hive tables as virtual tables: used when the data resides outside of Hive
https://hive.apache.org/
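A minimal HiveQL sketch of the two table types, with hypothetical table names, columns, and HDFS location: the external table only records metadata over files already sitting in HDFS, while the internal table is stored by Hive itself in ORC format.
  hive> CREATE EXTERNAL TABLE events_raw (id INT, name STRING)
      >   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      >   LOCATION '/user/hdfs/game';
  hive> CREATE TABLE events_orc (id INT, name STRING)
      >   STORED AS ORC;
  hive> INSERT OVERWRITE TABLE events_orc SELECT * FROM events_raw;
  hive> -- dropping the external table removes only its definition; the files stay in HDFS
  hive> DROP TABLE events_raw;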
CSV Files into Hive Tables
A comma-delimited text file (CSV file) imported into a Hive table
Hive installation and configuration
  Install Hive 1.2.2
  $ tar -xzvf apache-hive-1.2.2-bin.tar.gz
Create a directory in HDFS to hold the file
  $ bin/hdfs dfs -mkdir game
Put the file in the directory
  $ bin/hdfs dfs -put 4days*.csv game
First load the data as an external Hive table
  Start a Hive shell
  $ hive
  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events (ID INT, NAME STRING, …)
      > …
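A completed version of the truncated statement, sketched with hypothetical column names and HDFS path: pointing LOCATION at the game directory makes every CSV file placed there queryable, and the skip.header.line.count property skips a header row if one is present.
  hive> -- hypothetical columns; adjust to the actual layout of the CSV file
  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, name STRING, event_date STRING, score INT)
      >   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      >   LOCATION '/user/hdfs/game'
      >   TBLPROPERTIES ("skip.header.line.count"="1");
  hive> SELECT COUNT(*) FROM events;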
Hive Interactive Shell Commands
All commands end with ;
quit or exit: leaves the interactive shell
add FILE/JAR/ARCHIVE <resource>: adds a resource to the session
list FILE/JAR/ARCHIVE: lists the resources already added
delete FILE/JAR/ARCHIVE <resource>: removes a resource from the session
!<cmd>: executes a shell command from the Hive shell
<query>: executes a Hive query and prints results to standard output
source FILE <file>: executes a script file inside the CLI
set: sets or displays Hive configuration variables
http://hadooptutorial.info/hive-interactive-shell-commands/
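A minimal sketch combining several of these: put a few statements in a script file (hypothetical path /tmp/session.hql) and run it from the shell with source; inside a script, lines starting with -- are comments.
  -- print column headers with query results
  set hive.cli.print.header=true;
  -- run a shell command from inside Hive
  !pwd;
  -- run a query; results are printed to standard output
  SELECT * FROM events LIMIT 5;

  hive> source /tmp/session.hql;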
Importing Data into Hive Tables Using Spark
Apache Spark: a modern processing engine focused on in-memory processing
Resilient distributed dataset (RDD): Spark's abstraction, an immutable distributed collection of items
  RDDs are created from Hadoop data (e.g. HDFS files) or by transforming other RDDs
  Each dataset in an RDD is divided into logical partitions and computed on different nodes of the cluster transparently
Spark's DataFrame: built on top of an RDD, but data are organized into named columns like an RDBMS table, similar to a data frame in R
  Can be created from different data sources: existing RDDs, structured data files, JSON datasets, Hive tables, external databases
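A minimal PySpark sketch of the DataFrame route, assuming Spark 2.x built with Hive support and hypothetical CSV path and table names:
  from pyspark.sql import SparkSession

  # start a session with Hive support enabled
  spark = SparkSession.builder \
      .appName("csv-to-hive") \
      .enableHiveSupport() \
      .getOrCreate()

  # read a CSV file from HDFS into a DataFrame (hypothetical path; infer column types from the data)
  df = spark.read.csv("hdfs:///user/hdfs/game/4days.csv", header=True, inferSchema=True)

  # inspect the inferred schema and a few rows
  df.printSchema()
  df.show(5)

  # save the DataFrame as a Hive table (hypothetical table name)
  df.write.mode("overwrite").saveAsTable("events_spark")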
Next Class: Hadoop Tutorial
Please try to install Hadoop, Hive, and Spark
Next week's lab: importing data into HDFS and Hive, and processing the data with the MapReduce and Spark engines