
1 Distributed Systems Fall 2014 Zubair Amjad

2 Outline
Motivation
What is Sqoop?
How Sqoop works?
Sqoop Architecture
Import
Export
Sqoop Connectors
Sqoop Commands

3 Motivation
SQL servers are already deployed abundantly worldwide
Nightly processing has been done on SQL servers for years
As more organizations deploy Hadoop to analyze vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases
Loading bulk data into Hadoop, or accessing it from MapReduce applications, is a challenging task
Transferring data using scripts is inefficient and time-consuming
Traditional databases already back enterprise applications for reporting, data visualization, etc.
Bringing processed data from Hadoop into these applications is what is needed

4 What is Sqoop?
A tool to automate data transfer between Hadoop and relational databases
Transform data in Hadoop with MapReduce or Hive (a Hive import is sketched below)
Export data back into the relational database
Allows easy import and export of data from structured datastores: relational databases, enterprise data warehouses, and NoSQL systems
Provisions data from external systems onto HDFS
Sqoop integrates with Oozie to allow scheduling and automation of import and export tasks
Oozie is a workflow scheduler system to manage Apache Hadoop jobs
Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems
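
Since Sqoop can load imported data directly into Hive, a minimal sketch of such an import is shown below (assuming Sqoop 1.x with a MySQL source; the database, table, and user names are placeholders):

$ sqoop import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --hive-import

The --hive-import flag copies the data into HDFS and then loads it into a Hive table of the same name, creating the table if it does not already exist.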

5 How Sqoop Works
Sqoop runs on a Hadoop cluster and has access to the Hadoop core
Sqoop uses mappers to slice the incoming data
Data is placed into HDFS
The dataset being transferred is sliced up into different partitions
A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset (example below)
Each record of the data is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types
Many data transfer formats are supported
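
The degree of slicing can be controlled from the command line; a sketch (the --split-by column "id" is an assumed primary key, not part of the original slides):

$ sqoop import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --num-mappers 4 \
    --split-by id

Here --num-mappers sets the number of map tasks (and hence slices), and --split-by names the column whose value range is partitioned across the mappers.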

6 Sqoop Architecture

7 Sqoop Import Command
$ sqoop import --connect jdbc:mysql://localhost/DB_NAME --table TABLE_NAME --username USER_NAME --password PASSWORD
Import arguments:
--connect, --username & --password: parts of the connection string
--table: database table name
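
Note that --password places the password on the command line in plain text; a hedged alternative sketch (both options exist in Sqoop 1.x; the password-file path is a hypothetical location):

$ sqoop import --connect jdbc:mysql://localhost/DB_NAME --table TABLE_NAME --username USER_NAME -P
$ sqoop import --connect jdbc:mysql://localhost/DB_NAME --table TABLE_NAME --username USER_NAME --password-file /user/USER_NAME/sqoop.password

-P prompts for the password on the console, while --password-file reads it from a protected file.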

8 Sqoop Import
Step 1: Sqoop examines the database to gather the necessary metadata for the data being imported
Step 2: a map-only Hadoop job is submitted to the cluster by Sqoop; the map-only job performs the data transfer using the metadata captured in step 1
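
The import does not have to copy a whole table; a sketch of a subset import (the column names and the date filter are illustrative assumptions):

$ sqoop import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --columns "id,name,created_at" \
    --where "created_at >= '2014-01-01'"

--columns restricts which columns are imported, and --where pushes a row filter down to the source database.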

9 Sqoop Import
The imported data is saved in a directory on HDFS based on the table being imported
The user can specify an alternative directory where the files should be placed
By default these files contain comma-delimited fields, with new lines separating different records
The user can override the format in which data is copied over by explicitly specifying the field separator and record terminator characters (example below)
Sqoop also supports different data formats for importing data
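
A sketch of an import that overrides the target directory and the default comma-delimited text format (the path and delimiters are illustrative assumptions):

$ sqoop import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --target-dir /data/TABLE_NAME \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'

For binary formats, --as-sequencefile or --as-avrodatafile can be used instead of the default text files.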

10 Sqoop Export Command
$ sqoop export --connect jdbc:mysql://SERVER/DB_NAME --table TARGET_TABLE_NAME --username USER_NAME --password PASSWORD --export-dir EXPORT_DIR
Export arguments:
--connect, --username & --password: parts of the connection string
--table: target database table name
--export-dir: HDFS directory from which the data will be exported
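
If the HDFS files were written with non-default delimiters, or if existing rows should be updated rather than inserted, the export can be adjusted; a sketch (the tab delimiter and the "id" key column are assumptions):

$ sqoop export \
    --connect jdbc:mysql://SERVER/DB_NAME \
    --username USER_NAME -P \
    --table TARGET_TABLE_NAME \
    --export-dir EXPORT_DIR \
    --input-fields-terminated-by '\t' \
    --update-key id

--input-fields-terminated-by describes how the input files are delimited, and --update-key turns the export into UPDATE statements keyed on the given column.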

11 Sqoop Export
Step 1: Sqoop examines the database for metadata, followed by the second step of transferring the data
Step 2: data transfer; Sqoop divides the input dataset into splits, then uses individual map tasks to push the splits to the database
Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization
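
Because each map task commits over multiple transactions, a failed export can leave the target table partially populated; a staging table can be used so that data only reaches the target once all map tasks succeed. A sketch (STAGING_TABLE_NAME is a placeholder for a table with the same schema as the target):

$ sqoop export \
    --connect jdbc:mysql://SERVER/DB_NAME \
    --username USER_NAME -P \
    --table TARGET_TABLE_NAME \
    --staging-table STAGING_TABLE_NAME \
    --clear-staging-table \
    --export-dir EXPORT_DIR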

12 Sqoop Connectors
Generic JDBC connector: can be used to connect to any database that is accessible via JDBC
Default Sqoop connectors: designed for specific databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2
Fast-path connectors: specialized connectors that use database-specific batch tools to transfer data with high throughput; available for MySQL and PostgreSQL databases (example below)
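
A fast-path transfer is requested with the --direct flag; a sketch for MySQL (this assumes the mysqldump client is available on the cluster nodes):

$ sqoop import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --direct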

13 Sqoop Commands
Available commands:
codegen: generate code to interact with database records
create-hive-table: import a table definition into Hive
eval: evaluate a SQL statement and display the results
export: export an HDFS directory to a database table
help: list available commands
import: import a table from a database to HDFS
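
For example, eval can be used to check connectivity and preview data before running a full import; a sketch (the query is illustrative):

$ sqoop eval \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --query "SELECT COUNT(*) FROM TABLE_NAME"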

14 Sqoop Commands (continued)
import-all-tables: import tables from a database to HDFS
job: work with saved jobs
list-databases: list available databases on a server
list-tables: list available tables in a database
merge: merge results of incremental imports
metastore: run a standalone Sqoop metastore
version: display version information
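
A sketch of two of these commands (the job name "nightly_import" and the check column "id" are illustrative assumptions):

$ sqoop list-tables \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P

$ sqoop job --create nightly_import -- import \
    --connect jdbc:mysql://localhost/DB_NAME \
    --username USER_NAME -P \
    --table TABLE_NAME \
    --incremental append --check-column id --last-value 0

$ sqoop job --exec nightly_import

The saved job remembers its parameters, including the last imported value of the check column, so re-running it with job --exec picks up only new rows; merge can then combine the results of incremental imports.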

15 Thank you

