SQL on Hadoop

Today's agenda:
Introduction
Hive – the first SQL approach
Data ingestion and data formats
Impala – MPP SQL

Why SQL?
Data warehousing…
Structured data
  – organization of the data
  – optimized data access
Declarative data processing
  – no need to have developer skills, but…
  – portable
  – universal language
  – we are lazy
SQL drivers supported
  – no need to install a Hadoop client
  – easier integration with current systems
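To make the "declarative data processing" point concrete, here is a minimal sketch that computes the same aggregate twice: once with hand-written code and once with a SQL statement. The sample rows, table name and function names are hypothetical. Python's built-in sqlite3 is used only as a stand-in engine so the example runs anywhere; on a cluster the same statement would be sent to Hive or Impala through a SQL driver.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (country, transferred size)
rows = [("CH", 120), ("FR", 80), ("CH", 30), ("PL", 50), ("FR", 20)]


def total_by_country_imperative(records):
    """Hand-written aggregation: the developer spells out HOW to compute."""
    totals = defaultdict(int)
    for country, size in records:
        totals[country] += size
    return sorted(totals.items())


def total_by_country_sql(records):
    """Declarative version: the query only states WHAT is wanted."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive or Impala
    conn.execute("CREATE TABLE transfers (country TEXT, size INTEGER)")
    conn.executemany("INSERT INTO transfers VALUES (?, ?)", records)
    result = conn.execute(
        "SELECT country, SUM(size) FROM transfers "
        "GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(total_by_country_imperative(rows))
    print(total_by_country_sql(rows))
```

Both functions return the same list of (country, total) pairs; the SQL version leaves the execution strategy to the engine.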

Why not SQL?
It is not an RDBMS!
  – joins of big tables should be avoided
  – no indexes by default
  – no primary keys or constraints
Not suited for OLTP
  – no locks
  – no transactions
  – write once, read many
Additional data structuring during data shipping (ETL) is needed
Not all problems can be solved with SQL
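One consequence of having no primary keys or constraints is that nothing stops duplicate records from landing in a table, so uniqueness has to be handled while the data is shipped. The following is a minimal sketch of such an ETL step, under the assumption that each record carries a key and a version field; the function name, field names and sample records are hypothetical.

```python
def deduplicate(records, key_field, version_field):
    """Keep only the newest record per key.

    This emulates, during ETL, the uniqueness that a primary key
    would enforce in an RDBMS.
    """
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[version_field] > latest[key][version_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical input with a repeated id
incoming = [
    {"id": 1, "version": 1, "name": "alice"},
    {"id": 2, "version": 1, "name": "bob"},
    {"id": 1, "version": 2, "name": "alice-updated"},
]

if __name__ == "__main__":
    for rec in deduplicate(incoming, "id", "version"):
        print(rec)
```

Because the stored data is write once, read many, a cleanup like this would typically run before the files are written rather than as an in-place update afterwards.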

SQL on Hadoop (ecosystem components)
HDFS: Hadoop Distributed File System
HBase: NoSQL columnar store
YARN: cluster resource manager
MapReduce
Hive: SQL
Pig: scripting
Flume: log data collector
Sqoop: data exchange with RDBMS
Oozie: workflow manager
Mahout: machine learning
ZooKeeper: coordination
Impala: SQL
Spark: large-scale data processing

SQL on Hadoop
[Architecture diagram. Labels: Client; SQL master node (JDBC/ODBC server, SQL engine); Metadata (table definition lookup); YARN; several cluster nodes, each with HDFS and an Executor. Arrows labeled SQL and Data.]
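One way to read the architecture above is as a scatter/gather pattern: the master looks up the table definition, each executor works on the rows stored on its own node, and the master merges the partial results. The sketch below is a toy, single-process illustration of that idea, not how Hive or Impala are implemented. The metadata dictionary, node names, sample rows and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical metadata store: table name -> ordered column names
METADATA = {"logs": ["host", "status"]}

# Hypothetical data placement: rows of the table held by each cluster node
NODE_BLOCKS = {
    "node1": [("a", 200), ("b", 500)],
    "node2": [("a", 200), ("c", 404)],
    "node3": [("b", 200)],
}


def executor_count_by(local_rows, col_idx):
    """Runs on one node: partial COUNT(*) ... GROUP BY over local rows only."""
    return Counter(row[col_idx] for row in local_rows)


def master_count_by(table, column):
    """Master: look up the table definition, fan out, merge partial counts."""
    col_idx = METADATA[table].index(column)  # table definition lookup
    merged = Counter()
    for local_rows in NODE_BLOCKS.values():  # one executor call per node
        merged.update(executor_count_by(local_rows, col_idx))
    return dict(merged)


if __name__ == "__main__":
    # Roughly: SELECT status, COUNT(*) FROM logs GROUP BY status
    print(master_count_by("logs", "status"))
```

Only the small partial counts travel back to the master; the raw rows stay on the nodes that store them.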

There are other exotic animals…
Purely on Hadoop
  – Stinger.next/Hive on Tez (improved MR executions, ACID, etc.)
  – Presto (graph-based processing, multiple data sources)
  – SparkSQL (Spark-based)
See Greg Rahn's slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop

Summary
SQL on Hadoop is not for OLTP!
  …but for data warehousing workloads
  …and ad-hoc queries
Enforces semi-structuring of the data
Does not enforce a particular data format on HDFS
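The last two summary points can be illustrated together: a table definition gives the data its structure at read time, while the files underneath may be stored in different formats. This is a minimal sketch assuming a two-column table and two text formats (CSV and JSON lines); the schema, sample data and reader functions are hypothetical and only mimic what a real engine does with its own format readers.

```python
import csv
import io
import json

# Hypothetical table definition: ordered (column name, type) pairs
SCHEMA = [("id", int), ("name", str)]

# The same two records stored in two different file formats
CSV_TEXT = "1,alice\n2,bob\n"
JSON_TEXT = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}\n'


def read_csv(text):
    """Apply the schema to delimited text: columns are matched by position."""
    for fields in csv.reader(io.StringIO(text)):
        yield tuple(cast(value) for (_, cast), value in zip(SCHEMA, fields))


def read_json_lines(text):
    """Apply the schema to JSON lines: columns are matched by name."""
    for line in text.splitlines():
        obj = json.loads(line)
        yield tuple(cast(obj[name]) for name, cast in SCHEMA)


if __name__ == "__main__":
    print(list(read_csv(CSV_TEXT)))
    print(list(read_json_lines(JSON_TEXT)))
```

Both readers produce the same typed rows, so a query written against the table definition does not need to know which format the files use.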
