Getting Data into Hadoop


Getting Data into Hadoop
September 18, 2017
Kyung Eun Park, D.Sc. (kpark@towson.edu)

Contents
- Data Lake from Data Store or Data Warehouse
- Overview of the main tools for data ingestion into Hadoop
  1.1 Spark
  1.2 Sqoop
  1.3 Flume
- Basic methods for importing CSV data into HDFS and Hive tables

Hadoop: Setting up a Single Node Cluster
- Set up and configure a single-node Hadoop installation
- Required software for Linux (Ubuntu 16.04.1 x64 LTS):
  - Java
  - ssh: $ sudo apt-get install ssh
- Installing
  - Download and unpack Hadoop
  - Edit etc/hadoop/hadoop-env.sh:
      # set to the root of your Java installation
      export JAVA_HOME=/usr/lib/jvm/default-java
  - Set JAVA_HOME in your .bashrc shell file:
      JAVA_HOME=/usr/lib/jvm/default-java
      export JAVA_HOME
      PATH=$PATH:$JAVA_HOME/bin
      export PATH
      export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
      $ source .bashrc
      $ echo $JAVA_HOME
  - Try the following command: $ bin/hadoop
http://hadoop.apache.org/docs/r2.7.4/hadoop-project-dist/hadoop-common/SingleCluster.html
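For a concrete starting point, here is a minimal sketch of the download-and-verify steps; the mirror URL and working directory are assumptions (any Hadoop 2.7.x mirror works):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
$ tar -xzvf hadoop-2.7.4.tar.gz
$ cd hadoop-2.7.4
$ echo $JAVA_HOME       # should print the root of the Java installation
$ bin/hadoop version    # prints the Hadoop version if the setup is correct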

Hadoop as a Data Lake
- With a traditional database or data warehouse approach:
  - Adding data to the database requires ETL (extract, transform, and load)
  - Data are transformed into a pre-determined schema before loading
  - Data usage must be decided during the ETL step; later changes are costly
  - Data that do not fit the schema, or that exceed capacity constraints, are discarded in the ETL step (only the data deemed necessary are kept)
- Hadoop approach: a central storage space for all data in HDFS
  - Inexpensive and redundant storage of large datasets
  - Lower cost than traditional systems

Standalone Operation
- Copy the unpacked conf directory to use as input
- Find and display every match of the given regular expression
- Output is written to the given output directory
  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep input output 'dfs[a-z.]+'
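In standalone mode the results land in the local output directory; a quick way to inspect them (the matches depend on your configuration files):

$ cat output/*    # prints each match of 'dfs[a-z.]+' with its count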

MapReduce
- MapReduce: a software framework for writing applications that process vast amounts of data in parallel on large clusters of commodity hardware
  - The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks
- Schema on read: programmers and users enforce a structure to suit their needs when they access the data
  - c.f. the schema-on-write of the traditional data warehouse approach, which requires upfront design and assumptions about how the data will be used
- MapReduce application tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
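To see the framework in action without writing any Java, the examples jar bundled with Hadoop also includes a WordCount application. A minimal sketch, reusing the local input directory from the standalone example above (the output directory name is an assumption):

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount input wc-output
$ head wc-output/part-r-00000    # each line: a word from the input files and its count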

Why Raw Format?
- For data science purposes, keeping all data in raw format is beneficial, because it is not clear in advance which data items may be valuable to a given data science goal
- A Hadoop application applies a schema to the data as it reads them from the lake
- Advantages of the data lake approach over a traditional approach:
  - All data are available: no need for assumptions about future data use
  - All data are sharable: no technical hurdle to data sharing
  - All access methods are available: any processing engine (MapReduce, Spark, etc.) or application (Hive, Spark SQL, Pig) can be used to examine and process the data

Data Warehouses vs. Hadoop Data Lake
- Hadoop acts as a complement to data warehouses
- The growth of new data from disparate sources quickly fills the data lake:
  - Social media
  - Click streams
  - Sensor data, moving objects, etc.
- Traditional ETL stages may not keep up with the rate at which data enter the lake
- Both support access to data; in the Hadoop case, however, access can happen as soon as the data arrive in the lake

ETL Process vs. Data Lake
[Diagram] Sources A, B, and C either enter an ETL process (data usage decided up front, schema on write, some data discarded, relational data warehouse) or enter the data lake (raw-format data stored in Hadoop, schema on read applied when the user accesses the data).

The Hadoop Distributed File System (HDFS)
- All Hadoop applications operate on data stored in HDFS
- HDFS is not a general file system, but a specialized streaming file system
  - Files must be explicitly copied to and from HDFS
  - Optimized for reading and writing large files
- Writing data to HDFS
  - Files are sliced into many small sub-units (blocks, shards)
  - Blocks are replicated across the servers in a Hadoop cluster to avoid data loss (reliability)
  - Transparently written to the cluster nodes
- Processing
  - Slices are processed in parallel at the same time
- Exporting (transferring files out of HDFS)
  - Slices are assembled and written as one file on the host file system
- Single-instance (single-node) HDFS: no file slicing or replication!
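One way to see the slicing and replication described above is hdfs fsck, which reports the blocks and replicas backing a file. A sketch, assuming a running (pseudo-)distributed cluster and a hypothetical path:

$ hdfs fsck /user/hadoop/mydata.csv -files -blocks -locations
# lists each block of the file, its size, and the DataNodes holding its replicas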

Direct File Transfer to Hadoop HDFS: using native HDFS commands
- Copy a file (test) to HDFS: use the put command
  $ hdfs dfs -put test
- View files in HDFS: use the ls command (cf. ls -la)
  $ hdfs dfs -ls
- Copy a file from HDFS to the local file system: use the get command
  $ hdfs dfs -get another-test
- More commands: refer to Appendix B
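Put together, a small round trip between the local file system and HDFS (file and directory names are assumptions):

$ echo "hello hdfs" > test
$ hdfs dfs -mkdir -p data            # create a directory under the user's HDFS home
$ hdfs dfs -put test data/           # local file system -> HDFS
$ hdfs dfs -ls data                  # verify the file landed in HDFS
$ hdfs dfs -get data/test test-copy  # HDFS -> local file system, under a new name
$ diff test test-copy                # no output means the round trip preserved the contents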

Importing Data from Files into Hive Tables
- Hive: an SQL-like tool for analyzing data in HDFS; useful for feature generation
- Importing data into Hive tables
  - Existing text-based files exported from spreadsheets or databases:
    - Tab-separated values (TSV)
    - Comma-separated values (CSV)
    - Raw text
    - JSON, etc.
- Two types of Hive table
  - Internal table: fully managed by Hive and stored in an optimized format (ORC)
  - External table: not managed by Hive; only a metadata description is used to access the data in its raw form, and dropping the table deletes only its definition (the metadata) in Hive
    - Hive tables as virtual tables: used when the data reside outside of Hive
- After importing, process the data using a variety of tools, including Hive's SQL query processing, Pig, or Spark
https://hive.apache.org/

CSV Files into Hive Tables: a comma-delimited text file (CSV file) imported into a Hive table
- Hive installation and configuration: install Hive 1.2.2
  $ tar -xzvf apache-hive-1.2.2-bin.tar.gz
- Create a directory in HDFS to hold the file
  $ bin/hdfs dfs -mkdir game
- Put the file in the directory
  $ bin/hdfs dfs -put 4days*.csv game
- First load the data as an external Hive table; start a Hive shell (a fuller hypothetical example follows this slide)
  $ hive
  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events (ID INT, NAME STRING ,…)
      > …
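A fuller sketch of the external-then-internal pattern, run non-interactively with hive -f. The extra columns, the delimiter, and the HDFS location are hypothetical; only ID and NAME appear on the slide above, and the location assumes the game directory sits under the hadoop user's HDFS home:

$ cat > /tmp/csv_to_hive.hql <<'EOF'
-- External table: metadata only; the CSV files stay where they are in HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS events_ext (ID INT, NAME STRING, EVENT_DATE STRING, SCORE INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/hadoop/game';

-- Internal table: fully managed by Hive, stored in the optimized ORC format
CREATE TABLE IF NOT EXISTS events (ID INT, NAME STRING, EVENT_DATE STRING, SCORE INT)
  STORED AS ORC;

-- Copy the raw rows into the managed ORC table
INSERT OVERWRITE TABLE events SELECT * FROM events_ext;
EOF
$ hive -f /tmp/csv_to_hive.hql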

Hive Interactive Shell Commands
- All commands end with ;
- quit or exit: leave the interactive shell
- add / list / delete: manage files, jars, and archives available to the session
- !<cmd>: execute a shell command from the Hive shell
- <query>: executes a Hive query and prints the results to standard out
- source FILE <file>: execute a script file inside the CLI
- set: view or change Hive configuration variables
http://hadooptutorial.info/hive-interactive-shell-commands/
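For scripted use, the same work can be done without the interactive prompt (a sketch; the script is the one written in the previous slide's example, and the query assumes the events table it created):

$ hive -f /tmp/csv_to_hive.hql            # run a script file, the non-interactive counterpart of source
$ hive -e 'SELECT COUNT(*) FROM events;'  # run a single query and print the result to standard out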

Importing Data into Hive Tables Using Spark
- Apache Spark: a modern processing engine focused on in-memory processing
  - Data are abstracted as an immutable distributed collection of items called a resilient distributed dataset (RDD)
  - RDDs are created from Hadoop data (e.g. HDFS files) or by transforming other RDDs
  - Each dataset in an RDD is divided into logical partitions and computed on different nodes of the cluster transparently
- Spark's DataFrame: built on top of an RDD, but data are organized into named columns like an RDBMS table, similar to a data frame in R
  - Can be created from different data sources: existing RDDs, structured data files, JSON datasets, Hive tables, external databases
(a short import sketch follows this slide)
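As an illustration of the DataFrame route into Hive, a minimal sketch wrapped in a shell script. It assumes Spark 2.x built with Hive support, reuses the hypothetical game directory from the earlier slides, and infers column names from the CSV header row:

$ cat > /tmp/csv_to_hive.py <<'PYEOF'
from pyspark.sql import SparkSession

# SparkSession with Hive support, so saveAsTable writes a real Hive table
spark = (SparkSession.builder
         .appName("csv-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read the CSV files from HDFS into a DataFrame, inferring column names and types
df = spark.read.csv("/user/hadoop/game", header=True, inferSchema=True)

# Write the DataFrame out as a Hive table
df.write.mode("overwrite").saveAsTable("events_spark")
PYEOF
$ spark-submit /tmp/csv_to_hive.py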

Next Class: Hadoop Tutorial
- Please try to install Hadoop, Hive, and Spark
- Next week's lab: importing data into HDFS and Hive, and processing the data using the MapReduce and Spark engines