Published by Alaina Stanley. Modified over 9 years ago.
1
SQL on Hadoop
2
Today's agenda
Introduction
Hive – the first SQL approach
Data ingestion and data formats
Impala – MPP SQL
3
Why SQL?
Data warehousing…
Structured data
– organization of the data
– optimized data access
Declarative data processing
– No need to have developer skills, but…
– Portable – universal language
– We are lazy
SQL drivers supported
– No need for a Hadoop client installation
– Easier integration with the current systems
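The "declarative" point can be illustrated with a small sketch. It computes the same per-user total twice: once with hand-written map/reduce-style Python, and once as a SQL statement. SQLite (from the Python standard library) is used here only as a stand-in for a SQL-on-Hadoop engine. The table name `transfers`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (username, bytes transferred)
ROWS = [("alice", 120), ("bob", 80), ("alice", 30), ("carol", 50), ("bob", 20)]

def totals_imperative(rows):
    """Hand-written map/shuffle/reduce: the developer spells out every step."""
    groups = defaultdict(list)
    for username, size in rows:      # map + shuffle: group values by key
        groups[username].append(size)
    # reduce: collapse each group to a single value
    return {username: sum(sizes) for username, sizes in groups.items()}

def totals_declarative(rows):
    """The same result stated as SQL; the engine decides how to execute it."""
    # Assumption: SQLite stands in for Hive/Impala so the sketch runs anywhere.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transfers (username TEXT, size_bytes INTEGER)")
    conn.executemany("INSERT INTO transfers VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT username, SUM(size_bytes) FROM transfers GROUP BY username"
    )
    result = dict(cursor)            # each row is a (username, total) pair
    conn.close()
    return result
```

Both functions return the same mapping. The SQL version states only what is wanted, which is why it stays portable across engines.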
4
Why not SQL?
It is not an RDBMS!
– joins of big tables should be avoided
– no indexes by default
– no primary keys and constraints
Not suited for OLTP
– no locks
– no transactions
Write once, read many
Additional data structuring during data shipping (ETL) needed
Not all problems can be solved with SQL
5
SQL on Hadoop
HDFS: Hadoop Distributed File System
HBase: NoSQL columnar store
YARN: cluster resource manager
MapReduce
Hive: SQL
Pig: scripting
Flume: log data collector
Sqoop: data exchange with RDBMS
Oozie: workflow manager
Mahout: machine learning
Zookeeper: coordination
Impala: SQL
Spark: large-scale data processing
6
SQL on Hadoop
[Architecture diagram. Components: Client; Metadata; SQL master node (JDBC/ODBC server, SQL engine); four Cluster Nodes, each with an Executor and HDFS; YARN. Arrow labels: SQL; Data; Tables definition lookup.]
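The roles in this architecture can be sketched as a toy scatter-gather in plain Python. This is a simplified illustration, not how Hive or Impala are implemented: threads stand in for cluster nodes, and the metadata store, the table `visits`, and the data blocks are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata store: table name -> column names
METADATA = {"visits": ["country", "hits"]}

# Hypothetical data blocks, one list of rows per cluster node
NODE_BLOCKS = [
    [("CH", 3), ("FR", 1)],
    [("CH", 2), ("PL", 4)],
    [("FR", 5)],
]

def executor_partial_sum(block, key_idx, val_idx):
    """Runs on one node: aggregate only the rows stored locally."""
    partial = {}
    for row in block:
        partial[row[key_idx]] = partial.get(row[key_idx], 0) + row[val_idx]
    return partial

def master_sum_by(table, key_col, val_col):
    """Master node: look up the table definition, fan out, merge partials."""
    columns = METADATA[table]                    # table definition lookup
    key_idx, val_idx = columns.index(key_col), columns.index(val_col)
    merged = {}
    with ThreadPoolExecutor() as pool:           # threads stand in for nodes
        partials = pool.map(
            lambda block: executor_partial_sum(block, key_idx, val_idx),
            NODE_BLOCKS,
        )
        for partial in partials:                 # final merge on the master
            for key, value in partial.items():
                merged[key] = merged.get(key, 0) + value
    return merged
```

A call such as `master_sum_by("visits", "country", "hits")` plays the part of `SELECT country, SUM(hits) FROM visits GROUP BY country`. Each executor touches only its local block, and only the small partial results travel to the master.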
7
There are other exotic animals…
Purely on Hadoop
– Stinger.next/Hive on Tez (improved MR executions, ACID, etc.)
– Presto (graph-based processing, multiple data sources)
– SparkSQL (Spark-based)
See Greg Rahn's slides: https://speakerdeck.com/grahn/an-independent-comparison-of-open-source-sql-on-hadoop
8
Summary
SQL on Hadoop is not for OLTP!
…but for data warehousing workloads
… ad-hoc queries
Enforces semi-structuring of the data
Does not enforce using a certain data format on HDFS
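The last two points (structure is required, but the file format is not prescribed) can be sketched as a table definition applied at read time to files in different formats. This is a minimal illustration using only the Python standard library. The schema, the reader functions, and the sample file contents are hypothetical and do not correspond to any Hive or Impala API.

```python
import csv
import io
import json

# Hypothetical table definition: (column name, type applied at read time)
SCHEMA = [("id", int), ("name", str), ("score", float)]

def read_csv(text):
    """Reader for comma-separated files: yields one list of fields per row."""
    for fields in csv.reader(io.StringIO(text)):
        yield fields

def read_jsonl(text):
    """Reader for JSON-lines files: picks the schema's columns in order."""
    for line in text.splitlines():
        record = json.loads(line)
        yield [record[name] for name, _ in SCHEMA]

def scan(text, reader):
    """Apply the table definition to raw records, whatever the file format."""
    return [
        tuple(cast(value) for (_, cast), value in zip(SCHEMA, fields))
        for fields in reader(text)
    ]

# The same two rows stored in two different formats
CSV_FILE = "1,ann,9.5\n2,bob,7.0\n"
JSONL_FILE = (
    '{"id": 1, "name": "ann", "score": 9.5}\n'
    '{"id": 2, "name": "bob", "score": 7.0}\n'
)
```

`scan(CSV_FILE, read_csv)` and `scan(JSONL_FILE, read_jsonl)` produce identical typed rows. The table definition stays the same and only the reader changes with the format.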