CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook


Moving Data CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Agenda
Sqoop
Flume
NiFi

Apache Sqoop

Sqoop - SQL to Hadoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases
Top-level Apache project developed by Cloudera
Use Sqoop to move data between an RDBMS and HDFS
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported
Sqoop uses MapReduce to import and export the data
Uses a JDBC or custom interface

What Can You Do With Sqoop . . .
The input to the import process is a database table
Sqoop will read the table row-by-row into HDFS
The import process is performed in parallel
The output of this process is a set of files containing a copy of the imported table
  These files may be delimited text files, Avro files, or binary SequenceFiles
Sqoop also generates a Java class that encapsulates one row of the imported table and can be reused in subsequent MapReduce processing of the data
  Used to serialize / deserialize the SequenceFile format
  Used to parse the delimited-text form of a record
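That row class can also be produced on its own with Sqoop's codegen tool; a hedged example, reusing the placeholder connection details from the examples later in the deck:

$ sqoop codegen --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees

This writes an employees.java source file (and a compiled jar) that a later MapReduce job can reuse.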

What Else Can You Do With Sqoop . . .
You can export an HDFS file to an RDBMS
Sqoop's export process reads a set of delimited text files from HDFS in parallel
  Parses them into records
  Inserts them as rows into an RDBMS table
Incremental imports are supported
Sqoop includes commands which allow you to inspect the RDBMS you are connected to

Sqoop Is Flexible
Most aspects of the import, code generation, and export processes can be customized
You can control the specific row range or columns imported
You can specify particular delimiters and escape characters for the file-based representation of the data
You can specify the file format used
Sqoop provides connectors for MySQL, PostgreSQL, Netezza, Oracle, SQL Server, and DB2
There is also a generic JDBC connector
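For example, a hedged import that picks specific columns, sets delimiters, and chooses a target directory (the column names are assumptions about the table layout; the connection details are the placeholders used below):

$ sqoop import --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees \
    --columns "id,name,salary" \
    --fields-terminated-by '\t' --escaped-by '\\' \
    --target-dir /data/hr/employees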

To Use Sqoop
To use Sqoop, specify the tool you want to use and the arguments that control the tool
Standard syntax:
  sqoop tool-name [tool-arguments]
Help is available:
  sqoop help [tool-name]
  sqoop import --help

Sqoop Tools
Tools to import / export data:
  sqoop import
  sqoop import-all-tables
  sqoop create-hive-table
  sqoop export
Tools to inspect a database:
  sqoop list-databases
  sqoop list-tables
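For example, a quick hedged check of what the (placeholder) HR database contains:

$ sqoop list-tables --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123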

Sqoop Arguments
Common tool arguments:
  --connect      JDBC connect string
  --username     username for authentication
  --password     password for authentication
Import control arguments:
  --append       Append data to an existing dataset in HDFS
  --as-textfile  Imports data as plain text (default)
  --table        Table to read
  --target-dir   HDFS target directory
  --where        WHERE clause used for filtering
Sqoop also provides arguments and options for output line formatting, input parsing, Hive, code generation, HBase, and many others

Sqoop Examples
Import an employees table from the HR database:
  $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees
Import the employees table from the HR database, but only employees whose salary exceeds $70000:
  $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees \
      --where "salary > 70000"
Export new employee data from HDFS into the employees table in the HR database:
  $ sqoop export --connect jdbc:mysql://database.example.com/hr \
      --table employees --export-dir /new_employees

Apache Flume

What Is Flume?
Apache Flume is a distributed, reliable system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into HDFS
Supports complex multi-hop flows where events may travel through multiple agents before reaching HDFS
Allows fan-in and fan-out flows
Supports contextual routing and backup routes (fail-over)
Events are staged in a channel on each agent and are delivered to the next agent or terminal repository (like HDFS) in the flow
Events are removed from a channel only after they are stored in the channel of the next agent or in HDFS

Flume Components
Event – data being collected
Flume Agent – source, channel, sink
Source – where the data comes from
Channel – repository for the data
Sink – next destination for the data

How Does It Work?
A Flume event is a unit of data flow
A Flume agent is a (JVM) process that hosts the components (source, channel, sink) through which events flow from an external source to the next destination (hop)
A Flume source receives events sent to it by an external source like a web server
  The external source sends events in a format recognized by the target Flume source
When a Flume source receives an event, it stores it into one or more channels
  A channel is a passive store
  Can be a memory channel
  Can be a durable file-backed channel
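Switching to the durable option is only a configuration change; a hedged sketch for an agent named a1 (the directory paths are assumptions about your layout):

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

Events written to a file channel survive an agent restart, at the cost of disk I/O compared to the memory channel used in the example later in the deck.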

How Does It Work?
The sink removes the event from the channel and puts it into an external repository (like HDFS) or forwards it to the source of the next Flume agent (next hop) in the flow
The source and sink within a given agent run asynchronously with the events staged in the channel

Single Agent

Multi-Agent

Multiplexing Channels

Consolidation

Configuration
To define the flow within a single agent:
  List the sources / channels / sinks
  Point the source and sink to a channel
Basic syntax:
  # list the sources, sinks and channels for the agent
  <Agent>.sources = <Source>
  <Agent>.sinks = <Sink>
  <Agent>.channels = <Channel1> <Channel2>

  # set channel for source
  <Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

  # set channel for sink
  <Agent>.sinks.<Sink>.channel = <Channel1>

Flume Example
This agent lets a user generate events and display them to the console
Defines a single agent named a1:
  a1 has a source that listens for data on port 44444
  a1 has a channel that buffers event data in memory
  a1 has a sink that logs event data to the console

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

Flume Example . . .
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
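To try it out, start the agent and poke the netcat source (these are the standard Flume quickstart commands; the conf directory path is an assumption about your install layout):

$ bin/flume-ng agent --conf conf --conf-file example.conf \
    --name a1 -Dflume.root.logger=INFO,console

# from another terminal, send events to port 44444
$ telnet localhost 44444
Hello Flume

Each line typed into telnet should show up as a logged event in the agent's console.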

Built-In Flume Sources: Avro, Thrift, Exec, JMS, Spooling Directory, NetCat, Sequence Generator, Syslog, HTTP

Built-In Flume Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Solr, Elastic Search

Built-In Flume Channels: Memory, JDBC, File, Pseudo Transaction

Flume Interceptors
Attach functions to sources for some type of transformation
  Convert an event to a new format
  Add a timestamp
  Change your car's oil
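As a concrete sketch of the first two bullets, here is a minimal custom interceptor that stamps every event with an ingest-time header. This is illustrative only: the class name, package, and the "ingest.time" header key are my own choices, and Flume already ships a built-in timestamp interceptor that does essentially this.

package com.example.flume;                      // hypothetical package

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class IngestTimeInterceptor implements Interceptor {

  @Override
  public void initialize() { }                  // nothing to set up

  @Override
  public Event intercept(Event event) {
    // add a header recording when the agent saw this event
    Map<String, String> headers = event.getHeaders();
    headers.put("ingest.time", Long.toString(System.currentTimeMillis()));
    return event;                               // returning null would drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() { }                       // nothing to tear down

  // Flume creates interceptors through a Builder named in the agent config
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new IngestTimeInterceptor();
    }

    @Override
    public void configure(Context context) { }  // no properties for this sketch
  }
}

Wiring it to a source is just more properties, e.g. a1.sources.r1.interceptors = i1 and a1.sources.r1.interceptors.i1.type = com.example.flume.IngestTimeInterceptor$Builder (the package is again an assumption).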

Code Examples
Let's take a look at some custom Flume code!
  Twitter Source
  PostgreSQL Sink
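The Twitter source and PostgreSQL sink themselves are walked through live, so here is only a hedged skeleton of what a JDBC-backed custom sink generally looks like. The class name, the flume_events table, and the configuration property names are assumptions for illustration, not the demo code.

package com.example.flume;                      // hypothetical package

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class SimpleJdbcSink extends AbstractSink implements Configurable {

  private String jdbcUrl;
  private String user;
  private String password;
  private Connection conn;

  @Override
  public void configure(Context context) {
    // property names here are hypothetical; they would come from the agent's .conf
    jdbcUrl = context.getString("jdbcUrl");
    user = context.getString("user");
    password = context.getString("password");
  }

  @Override
  public synchronized void start() {
    try {
      conn = DriverManager.getConnection(jdbcUrl, user, password);
    } catch (SQLException e) {
      throw new RuntimeException("Could not open JDBC connection", e);
    }
    super.start();
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {                      // channel empty: tell Flume to back off
        txn.commit();
        return Status.BACKOFF;
      }
      try (PreparedStatement ps =
               conn.prepareStatement("INSERT INTO flume_events (body) VALUES (?)")) {
        ps.setString(1, new String(event.getBody(), StandardCharsets.UTF_8));
        ps.executeUpdate();
      }
      txn.commit();                             // event leaves the channel only now
      return Status.READY;
    } catch (Exception e) {
      txn.rollback();                           // leave the event in the channel for retry
      throw new EventDeliveryException("Failed to deliver event", e);
    } finally {
      txn.close();
    }
  }

  @Override
  public synchronized void stop() {
    try {
      if (conn != null) {
        conn.close();
      }
    } catch (SQLException ignored) {
      // shutting down; nothing useful to do here
    }
    super.stop();
  }
}

The transaction around channel.take() is what gives Flume its delivery guarantee: the event is removed from the channel only if the commit succeeds, so a failed insert is retried (at-least-once semantics).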

Apache NiFi

Moving Data
So, what do you use?
  Bash, Python, Perl, PHP
  Do you have a project folder called "Loaders"?
  Database replication
  Apache Falcon
  Sqoop (relational database systems)
  Flume (web server log ingestion)

NiFi: History Photo Cred: wikipedia.org

NiFi: Why? Photo Cred: www.niagarafallslive.com

What does this NiFi look like?

Features!
This thing moves files!
Visual representation of data flow and ETL processing
Guaranteed delivery of data
Manages data delivery, flow, age-off, etc.
Prioritization on multiple levels
Extendable
Tracking and logs

Overall Architecture
Java, Java, and more Java
Inside the JVM:
  Web server
  Flow Controller
  FlowFile Repository
  Content Repository
  Provenance Repository
There is also a notion of a NiFi Cluster

How do I install?
Requirements: Java 7+
Download: http://nifi.incubator.apache.org/downloads/
Create a user (optional): useradd nifi
Move to the destination directory and extract the tar (I like /opt/nifi)
Edit configs (optional)
Start:
  Linux: <nifi-home>/bin/nifi.sh start
  Windows: the equivalent batch script under bin/
Logs are written under logs/
You can install NiFi as a service as well
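Put together as a hedged command sequence (the archive name assumes the 0.0.1-incubating build referenced later in these slides; substitute whatever you actually downloaded):

$ sudo useradd nifi                                  # optional dedicated user
$ sudo mkdir -p /opt/nifi
$ sudo tar -xzf nifi-0.0.1-incubating-bin.tar.gz -C /opt/nifi
$ cd /opt/nifi/nifi-0.0.1-incubating
$ ./bin/nifi.sh start                                # ./bin/nifi.sh stop shuts it down
$ tail -f logs/nifi-app.log                          # watch the application log come up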

And Demo Time

NiFi User Interface: Terminology and Overview
FlowFile
Components:
  Processors (this is where we will spend a bunch of our time today)
  Processor Groups
  Remote Processor Groups
  Input Ports
  Output Ports
  Funnels
  Templates
  Labels
Relationships
Actions
Management
Navigation
Status

NiFi User Interface: Summary Page
Extremely useful when diagnosing a problem or when a flow has a large number of processors

Data Provenance
Ability to dive down into FlowFile details
  Useful for searches, troubleshooting, and optimization
Ability to search for specific events
Ability to replay a FlowFile
Graph of FlowFile lineage

NiFi Extensions
Developers who have a basic knowledge of Java can extend components of NiFi
Able to extend:
  Processors
  Reporting Tasks
  Controller Services
  FlowFile Prioritizers
  Authority Providers

OMG, what is a NAR?
NiFi Archive
Allows for dependency separation from other components
Defeats the dreaded "NoClassDefFoundError" (and hours of trying to figure out which library is causing the problem) via ClassLoader isolation
Use the nifi-nar-maven-plugin
Great instructions at the end of the NiFi Developer's Guide

Extending Processors
Best place to look is at already-written processors in the source code
Example GetFile.java:
  nifi-0.0.1-incubating-source-release.zip\nifi-0.0.1-incubating\nifi-nar-bundles\nifi-standard-bundle\nifi-standard-processors\src\main\java\org\apache\nifi\processors\standard\GetFile.java
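Before reading GetFile.java, it can help to see the bare minimum shape of a processor. This is a hedged, do-nothing sketch (class, package, and relationship names are mine), not code from the NiFi source tree:

package com.example.nifi;                         // hypothetical package

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class PassThroughProcessor extends AbstractProcessor {

  // every FlowFile this processor touches is routed to "success"
  public static final Relationship SUCCESS = new Relationship.Builder()
      .name("success")
      .description("FlowFiles are passed through untouched")
      .build();

  @Override
  public Set<Relationship> getRelationships() {
    return Collections.singleton(SUCCESS);
  }

  @Override
  public void onTrigger(ProcessContext context, ProcessSession session)
      throws ProcessException {
    FlowFile flowFile = session.get();            // pull one FlowFile from the input queue
    if (flowFile == null) {
      return;                                     // nothing queued on this invocation
    }
    // a real processor would read or rewrite the content here
    session.transfer(flowFile, SUCCESS);
  }
}

GetFile.java follows the same skeleton, with property descriptors, scheduling annotations, and the actual file I/O layered on top.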

Resources
http://sqoop.apache.org
http://flume.apache.org
http://nifi.incubator.apache.org