Transactions + Interactions + Observations = BIG DATA
Axes: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) against Increasing Data Variety and Complexity.
ERP: Purchase detail, Purchase record, Payment record
CRM: Offer details, Support Contacts, Customer Touches, Segmentation
WEB: Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels
BIG DATA: User Generated Content, Mobile Web, SMS/MMS, Sentiment, External Demographics, HD Video, Audio, Images, Speech to Text, Product/Service Logs, Social Interactions & Feeds, Business Data Feeds, User Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
Source: IDC 2.8 ZB in % from New Data Types 15x Machine Data by ZB by 2020
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM: Governance & Integration, Security, Operations, Data Access, Data Management; REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: OLTP, ERP, CRM Systems; Unstructured documents, emails; Web Logs, Click Streams; Server logs; Sentiment, Web Data; Social Networks; Machine Generated; Sensor Data; Geolocation Data
SCALE SCOPE New Analytic Apps New types of data LOB-driven
A Modern Data Architecture/Data Lake
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale.
Diagram labels: SCALE; SCOPE; New Analytic Apps, New types of data, LOB-driven; RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management.
Hortonworks Data Platform (HDP): The Only Completely Open Distribution for Apache Hadoop
Fundamentally Versatile and Comprehensive enterprise capabilities. Wholly Integrated for deep ecosystem interoperability.
HDP 2.1 Hortonworks Data Platform:
GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, WebHDFS)
DATA ACCESS: Batch (Map Reduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)
DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
OPERATIONS: Provision, Manage & Monitor (Ambari (SCOM), Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premise, Cloud
HDP certifies most recent & stable community innovation
Releases: HDP 1.3 (May), HDP 2.0 (October 2013), HDP 2.1 (April 2014)
Hortonworks Data Platform components: Hadoop & YARN, Pig, Tez, Hive & HCatalog, HBase, Phoenix, Solr, Storm, Sqoop, Flume, Falcon, Oozie, Zookeeper, Mahout, Ambari, Knox
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Traditional Database vs. Hadoop Platform
                Traditional Database                Hadoop Platform
schema          Required on write                   Required on read
speed           Reads are fast                      Writes are fast
governance      Standards and structured            Loosely structured
processing      Limited, no data processing         Processing coupled with data
data types      Structured                          Multi and unstructured
best fit use    Interactive OLAP Analytics;         Data Discovery;
                Complex ACID Transactions;          Processing unstructured data;
                Operational Data Store              Massive Storage/Processing
SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform
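The "required on write" versus "required on read" distinction from this slide can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `PURCHASE_SCHEMA` fields, the function names, and the sample records are hypothetical and do not come from the slides or from any Hadoop API.

```python
import json

# Hypothetical schema: field name -> required Python type.
PURCHASE_SCHEMA = {"customer_id": int, "amount": float}


def write_with_schema(table, record, schema=PURCHASE_SCHEMA):
    """Schema on write: validate before storing and reject anything that does not fit."""
    for field, expected in schema.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    table.append({field: record[field] for field in schema})


def read_with_schema(raw_lines, schema=PURCHASE_SCHEMA):
    """Schema on read: raw text is stored untouched and interpreted only when queried."""
    for line in raw_lines:
        try:
            doc = json.loads(line)
            yield {field: cast(doc[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # lines that do not fit this particular reading are skipped


# Schema on write: the table only ever contains validated rows.
table = []
write_with_schema(table, {"customer_id": 7, "amount": 19.99})

# Schema on read: anything can be stored; structure is applied at query time.
raw_store = ['{"customer_id": "7", "amount": "19.99", "note": "gift"}', "not json at all"]
rows = list(read_with_schema(raw_store))
```

In the first path the cost is paid at load time, which keeps reads simple; in the second path loading is a plain append and each reader decides how to interpret the raw data.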
All offerings co-engineered by Hortonworks and Microsoft. Enjoy seamless interoperability across on-premises and cloud.
DATA ACCESS: Batch (Map Reduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)
DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System), nodes 1 to N
1st Gen of Hadoop: Single Use System (Batch Apps)
  HDFS (redundant, reliable storage)
  MapReduce (cluster resource management & data processing)
  Classic Hadoop Apps
2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
  Redundant, Reliable Storage (HDFS)
  Efficient Cluster Resource Management & Shared Services (YARN)
  Flexible Data Processing (Hive, Pig, others…): Batch (MapReduce); Batch & Interactive (Tez); Online Data Processing (HBase, Accumulo); Stream Processing (Storm); others …
[Diagram: a YARN ResourceManager (Scheduler) and NodeManagers running mixed workloads side by side: Batch (map 1.1, map 1.2, reduce 1.1), Interactive SQL (vertex tasks), and Real-Time (nimbus 0, nimbus 1, nimbus 2)]
Apache Hive Contribution… an Open Community at its finest
13 Months; 1,672 Jira Tickets Closed; 145 Developers; 44 Companies; ~390,000 Lines Of Code Added… (2x)
Stack: Business Analytics, Custom Apps; SQL (Apache Hive); Apache Tez, Apache MapReduce; Apache YARN; HDFS (Hadoop Distributed File System), nodes 1 to N
Tez replaces MapReduce as the primitive for Hive, Pig, etc. A Tez Task is a task with a pluggable Input, Processor and Output.
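The idea of a task assembled from a pluggable Input, Processor and Output can be sketched conceptually in Python. This is not the Tez API (Tez is a Java framework); the `Task` class and the helper functions below are hypothetical names chosen only to show how the three parts can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """A task assembled from three swappable parts (illustrative names, not Tez classes)."""
    read_input: Callable[[], Iterable[str]]
    process: Callable[[Iterable[str]], Iterable[str]]
    write_output: Callable[[Iterable[str]], None]

    def run(self) -> None:
        # Input feeds the processor, whose results are handed to the output.
        self.write_output(self.process(self.read_input()))


def lines_input():
    """Hypothetical input: a fixed set of records."""
    return ["ca 10", "ny 5", "ca 7"]


def keep_california(records):
    """One possible processor: a filter."""
    return (r for r in records if r.startswith("ca"))


def to_upper(records):
    """Another processor that can be plugged in without touching input or output."""
    return (r.upper() for r in records)


collected: List[str] = []

# Same input and output, two different processors.
Task(lines_input, keep_california, collected.extend).run()
Task(lines_input, to_upper, collected.extend).run()
```

Because each part is passed in rather than hard-coded, an engine built this way can reuse the same processor with a different data source or sink, which is the property the slide highlights.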
Hive – MR vs. Hive – Tez
Query: SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON ( = ) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state
Hive – MR stages (map/reduce jobs with writes to HDFS between them): SELECT a.state, JOIN (a, c), SELECT c.price; SELECT, JOIN(a, b), GROUP BY a.state, COUNT(*), AVERAGE(c.price)
Hive – Tez stages (a single graph of map and reduce tasks): SELECT a.state, c.itemId; JOIN (a, c); JOIN(a, b), GROUP BY a.state, COUNT(*), AVERAGE(c.price)
Tez avoids unneeded writes to HDFS.
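The difference between chaining separate jobs and running one graph of stages can be shown with a toy Python sketch. Everything here is an assumption for illustration: the stage functions, the sample rows, and the `storage` dictionary (which stands in for HDFS) are hypothetical, and the stage count does not reproduce the plans drawn on the slide.

```python
def scan(_):
    """Hypothetical joined rows of (state, price)."""
    return [("ca", 10.0), ("ny", 5.0), ("ca", 30.0)]


def group_by_state(rows):
    groups = {}
    for state, price in rows:
        groups.setdefault(state, []).append(price)
    return groups


def aggregate(groups):
    """COUNT(*) and average price per state."""
    return {state: (len(p), sum(p) / len(p)) for state, p in groups.items()}


STAGES = [scan, group_by_state, aggregate]


def run_as_separate_jobs(stages, storage):
    """Each stage is its own job: its output is persisted, then read back by the next job."""
    data = None
    for i, stage in enumerate(stages):
        storage[f"job-{i}"] = stage(data)  # stands in for a write to HDFS
        data = storage[f"job-{i}"]         # the next job reads it back
    return data


def run_as_single_dag(stages, storage):
    """All stages run in one graph: only the final result is persisted."""
    data = None
    for stage in stages:
        data = stage(data)
    storage["result"] = data
    return data


mr_store, dag_store = {}, {}
mr_result = run_as_separate_jobs(STAGES, mr_store)
dag_result = run_as_single_dag(STAGES, dag_store)
```

Both runs produce the same answer, but the job-per-stage run leaves three persisted outputs while the single-graph run leaves one, which is the kind of saving the slide attributes to Tez.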
SQL Compliance
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR
Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc.); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries for IN/NOT IN, HAVING; Expanded JOIN Syntax; INTERSECT / EXCEPT
Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1)
Site to Site, Site to Cloud:
Disaster Recovery and Backup between environments.
Publishing data between environments for Discovery.
Define sophisticated retention policies. Simplify data retention for audit, compliance, or for data re-processing.
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
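The retention rules listed on this slide can be expressed as a small policy table. The Python sketch below is only an illustration of the logic; it is not Falcon configuration or a Falcon API, and the function names and the sample dates are hypothetical.

```python
from datetime import date

# Retention in years per stage, as listed on the slide.
# None stands for "Retain Last Copy Only".
RETENTION_YEARS = {"staged": 5, "cleansed": 3, "conformed": 3, "presented": None}


def years_before(day, years):
    """Return the same calendar day `years` earlier, handling 29 February."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # 29 February in a non-leap target year
        return day.replace(year=day.year - years, day=28)


def instances_to_delete(stage, instance_dates, today):
    """Return the dataset instances that fall outside the stage's retention policy."""
    years = RETENTION_YEARS[stage]
    ordered = sorted(instance_dates)
    if years is None:
        return ordered[:-1]  # everything except the most recent copy
    cutoff = years_before(today, years)
    return [d for d in ordered if d < cutoff]


# Hypothetical instance dates for one dataset.
instances = [date(2008, 1, 1), date(2010, 1, 1), date(2014, 1, 1)]
today = date(2014, 6, 1)
expired_staged = instances_to_delete("staged", instances, today)
expired_cleansed = instances_to_delete("cleansed", instances, today)
expired_presented = instances_to_delete("presented", instances, today)
```

Keeping the policy in one table means the eviction logic stays the same when a retention period changes for audit or compliance reasons.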
[Diagram: a MapReduce Indexing Job running over HDFS (Hadoop Distributed File System)]
Knox Gateway (GW): a stateless reverse proxy instance deployed in the DMZ.
Clients: REST Client, JDBC Client, Browser
Identity Providers: Enterprise Identity Provider (LDAP/AD)
Firewalls sit on both sides of the DMZ.
- Requests streamed through GW to Hadoop services after auth.
- URLs rewritten to refer to gateway.
HDP Cluster 1 and HDP Hadoop Cluster 2, each with: Masters (JT, NN), WebHCat, Oozie, YARN, HBase, Hive, DN, TT
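The "URLs rewritten to refer to gateway" step can be sketched in Python. This is a simplified illustration, not Knox's implementation: the host names, the `cluster1` name, and the `INTERNAL_SERVICES` mapping are hypothetical, and the `/gateway/<cluster>/<service>` path layout is assumed here for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical gateway address and cluster name, for illustration only.
GATEWAY = "https://knox.example.com:8443/gateway/cluster1"

# Hypothetical mapping from internal host:port to the service name exposed by the gateway.
INTERNAL_SERVICES = {
    "namenode.internal:50070": "webhdfs",
}


def rewrite_to_gateway(internal_url):
    """Map an internal service URL to the equivalent URL behind the gateway."""
    parts = urlsplit(internal_url)
    service = INTERNAL_SERVICES.get(parts.netloc)
    if service is None:
        raise ValueError(f"no gateway mapping for {parts.netloc}")
    gw = urlsplit(GATEWAY)
    path = parts.path
    prefix = "/" + service
    if path.startswith(prefix + "/"):
        path = path[len(prefix):]  # avoid repeating the service name in the path
    new_path = f"{gw.path}/{service}{path}"
    return urlunsplit((gw.scheme, gw.netloc, new_path, parts.query, ""))


public_url = rewrite_to_gateway(
    "http://namenode.internal:50070/webhdfs/v1/tmp/data.csv?op=OPEN"
)
```

With rewriting of this kind, clients only ever see the gateway's address, so internal host names and ports stay behind the firewall.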
Ambari SCOM Mgmt Pack
HADOOP: Storage & Process at Scale
Ambari SCOM Server aggregates + exposes Hadoop metrics.
Ambari SCOM monitors health + alerts in case of problems.