Introduction to building DataFlows with Apache NiFi

Slides:



Advertisements
Similar presentations
Welcome to Middleware Joseph Amrithraj
Advertisements

HyperContent 2.0 JA-SIG Winter Conference December 5, 2005 Alex Vigdor, Columbia University.
EASY LOGISTICS CENTER - the TURNTABLE for information, documents and processes EASY LOGISTICS CENTER DOCUMENTS SHOP CONTENT COMMUNITY MODULES EASY ENTERPRISE.
Report Distribution Report Distribution in PeopleTools 8.4 Doug Ostler & Eric Knapp 7264.
How Clients and Servers Work Together. Objectives Learn about the interaction of clients and servers Explore the features and functions of Web servers.
Chapter 11 ASP.NET JavaScript, Third Edition. 2 Objectives Learn about client/server architecture Study server-side scripting Create ASP.NET applications.
Slide 1 of 9 Presenting 24x7 Scheduler The art of computer automation Press PageDown key or click to advance.
Understanding and Managing WebSphere V5
“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Web Servers Web server software is a product that works with the operating system The server computer can run more than one software product such as .
LAYING OUT THE FOUNDATIONS. OUTLINE Analyze the project from a technical point of view Analyze and choose the architecture for your application Decide.
Data Management Kelly Clynes Caitlin Minteer. Agenda Globus Toolkit Basic Data Management Systems Overview of Data Management Data Movement Grid FTP Reliable.
Microsoft Active Directory(AD) A presentation by Robert, Jasmine, Val and Scott IMT546 December 11, 2004.
SUSE Linux Enterprise Desktop Administration Chapter 12 Administer Printing.
MCSE Guide to Microsoft Exchange Server 2003 Administration Chapter Two Installing and Configuring Exchange Server 2003.
HyperContent 2.0 Common Solutions Group September 21, 2005 Alex Vigdor, Columbia University.
TCP/IP (Transmission Control Protocol / Internet Protocol)
I.R.I.S. © 2006, All rights reserved 1 GENERALI Belgium, a global Documentum Content Management Solution since 2004.
Module 1: Overview of Microsoft Office SharePoint Server 2007.
MCSE Guide to Microsoft Exchange Server 2003 Administration Chapter One Introduction to Exchange Server 2003.
IPS Infrastructure Technological Overview of Work Done.
Microsoft Partner since 2011
12. DISTRIBUTED WEB-BASED SYSTEMS Nov SUSMITHA KOTA KRANTHI KOYA LIANG YI.
October 2014 HYBRIS ARCHITECTURE & TECHNOLOGY 01 OVERVIEW.
E-Business Infrastructure PRESENTED BY IKA NOVITA DEWI, MCS.
Hossein Haghighat Sabet CRM Consultant |
PARTNER UPDATE V9 MILESTONE 1 & 2.
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
J2EE Platform Overview (Application Architecture)
What is BizTalk ?
Protecting a Tsunami of Data in Hadoop
Connected Infrastructure
TV Broadcasting What to look for Architecture TV Broadcasting Solution
Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.
DocFusion 365 Intelligent Template Designer and Document Generation Engine on Azure Enables Your Team to Increase Productivity MICROSOFT AZURE APP BUILDER.
Introduction to Distributed Platforms
Module Overview Installing and Configuring a Network Policy Server
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
Chapter 10 Data Analytics for IoT
Open Source distributed document DB for an enterprise
Consulting Services JobScheduler Architecture Decision Template
Get the Most Out of GoAnywhere: Agents
The Client/Server Database Environment
Connected Infrastructure
SUBMITTED BY: NAIMISHYA ATRI(7TH SEM) IT BRANCH
CHAPTER 3 Architectures for Distributed Systems
Data Networking Fundamentals
Veeam Backup Repository
#01 Client/Server Computing
CompTIA Server+ Certification (Exam SK0-004)
Ministry of Higher Education
Capital One Architecture Team and DataTorrent
Johannes Peter MediaMarktSaturn Retail Group
DeFacto Planning on the Powerful Microsoft Azure Platform Puts the Power of Intelligent and Timely Planning at Any Business Manager’s Fingertips Partner.
Accelerate Your Self-Service Data Analytics
MARMIND’s New Service Delivers a Single Centralized Marketing Plan That Connects Teams, Campaigns and Outcomes by Using the Power of the Azure Platform.
Lecture 1: Multi-tier Architecture Overview
Design engineer deliver.
Collaborative Business Solutions
Introduction to Apache
Media365 Portal by Ctrl365 is Powered by Azure and Enables Easy and Seamless Dissemination of Video for Enhanced B2C and B2B Communication MICROSOFT AZURE.
An Introduction to Software Architecture
Introduction to Cyberspace
APACHE WEB SERVER.
Azure Container Service
Features Overview.
Microsoft Virtual Academy
#01 Client/Server Computing
Iserve – Bulk Cash Deposit Kiosk
Presentation transcript:

Introduction to building DataFlows with Apache NiFi Andrew Psaltis HDF / IoT / Cybersecurity Architect apsaltis@hortonworks.com @itmdata https://www.linkedin.com/in/andrewpsaltis

Hortonworks Company Profile 100 ~1000 800+ Founded in 2011 % subscription customers 1 ONLY employees across ST HADOOP 1,600+ 16 open source Apache Hadoop data platform provider to go public countries TM IPO 4Q14 (NASDAQ: HDP) technology partners

Participating with a Growing and Thriving Ecosystem 1600+ Partners 3000+ members 15,000+ Weekly visitors

Customer Growth in 2015 Key Highlights 800+ Doubled the customer base in 2015 800+

Hortonworks Momentum in Every Industry Financial Services Customers 5 5 of the top 1 0 0 Retail Customers 7 5 of the top 1 0 0 Telco Customers 8 of the top 9 in North America Automotive Customers 8 of the world’s top 2 0 Watch and Read about our customers at: www.hortonworks.com/customers

and Apache is the center of gravity INTERNET OF ANYTHING AGE OF DATA Open source is the norm, and Apache is the center of gravity Founded: 2011 IPO: 2014

Simplistic View of Enterprise Data Flow Acquire Data Process and Analyze Data Data Flow Store Data

Realistic View of Enterprise Data Flow Different organizations/business units across different geographic locations…

Realistic View of Enterprise Data Flow Interacting with different business partners and customers

Apache NiFi Created to address the challenges of global enterprise dataflow Key features: Visual Command and Control Data Lineage (Provenance) Data Prioritization Data Buffering/Back-Pressure Control Latency vs. Throughput Secure Control Plane / Data Plane Scale Out Clustering Extensibility

Apache NiFi What is Apache NiFi used for? Reliable and secure transfer of data between systems Delivery of data from sources to analytic platforms Enrichment and preparation of data: Conversion between formats Extraction/Parsing Routing decisions What is Apache NiFi NOT used for? Distributed Computation Complex Event Processing Joins / Complex Rolling Window Operations

Apache NiFi Deep Dive

Terminology FlowFile Unit of data moving through the system Content + Attributes (key/value pairs) Processor Performs the work, can access FlowFiles Connection Links between processors Queues that can be dynamically prioritized Process Group Set of processors and their connections Receive data via input ports, send data via output ports

Visual Command & Control Drag and drop processors to build a flow Start, stop, and configure components in real time View errors and corresponding error messages View statistics and health of data flow Create templates of common processor & connections

Provenance/Lineage Tracks data at each point as it flows through the system Records, indexes, and makes events available for display Handles fan-in/fan-out, i.e. merging and splitting data View attributes and content at given points in time

Prioritization Configure a prioritizer per connection Determine what is important for your data – time based, arrival order, importance of a data set Funnel many connections down to a single connection to prioritize across data sets Develop your own prioritizer if needed

Back-Pressure Configure back-pressure per connection Based on number of FlowFiles or total size of FlowFiles Upstream processor no longer scheduled to run until below threshold

Latency vs. Throughput Choose between lower latency, or higher throughput on each processor Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance Processor developer determines whether to support this by using @SupportsBatching annotation

Security Control Plane Data Plane Pluggable authentication 2-Way SSL, LDAP, Kerberos Pluggable authorization File-based authority provider out of the box Multiple roles to defines access controls Audit trail of all user actions Data Plane Optional 2-Way SSL between cluster nodes Optional 2-Way SSL on Site-To-Site connections (NiFi-to-NiFi) Encryption/Decryption of data through processors Provenance for audit trail of data

Extensibility Built from the ground up with extensions in mind Service-loader pattern for… Processors Controller Services Reporting Tasks Prioritizers Extensions packaged as NiFi Archives (NARs) Deploy NiFi lib directory and restart Provides ClassLoader isolation Same model as standard components

Rapid Ecosystem Adoption: 130+ Processors NEW Rapid Ecosystem Adoption: 130+ Processors HL7 FTP UDP XML SFTP Hash Encrypt GeoEnrich Merge Tail Scan Extract Evaluate Replace Duplicate Execute Translate Split Convert HTTP Route Content Route Context Route Text Control Rate Distribute Load Email HTML Image Syslog AMQP

Architecture Master NiFi Cluster Manager (NCM) Slaves NiFi Nodes OS/Host JVM NiFi Cluster Manager – Request Replicator Web Server OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage Master NiFi Cluster Manager (NCM) Slaves NiFi Nodes OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage

Apache NiFi Site-To-Site

Site-To-Site Direct communication between two NiFi instances Push to Input Port on receiver, or Pull from Output Port on source Communicate between clusters, standalone instances, or both Handles load balancing and reliable delivery Secure connections using certificates (optional)

Site-To-Site Push Source connects Remote Process Group to Input Port on destination Site-To-Site takes care of load balancing across the nodes in the cluster NCM Node 1 Input Port Standalone NiFi RPG Node 2 Input Port

Site-To-Site Pull Destination connects Remote Process Group to Output Port on the source If source was a cluster, each node would pull from each node in cluster NCM Node 1 RPG Standalone NiFi Output Port Node 2 RPG

Site-To-Site Client Code for Site-To-Site broken out into reusable module https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client Foundation for integration with stream processing platforms NCM Node 1 Output Port Java Program Site-To-Site Client Node 2 Output Port

Current Stream Processing Integrations Spark Streaming - NiFi Spark Receiver https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver Storm – NiFi Spout & Bolt https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout Flink – NiFi Source & Sink https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi Apex - NiFi Input Operators & Output Operators https://github.com/apache/incubator-apex- malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi

Bi-Directional Data Flows

Drive Data to Core for Analysis Drive data from sources to central data center for analysis Tiered collection approach at various locations, think regional data centers Edge NiFi Batch Analytics Core NiFi Edge Stream Processing NiFi

Dynamically Adjusting Data Flows Push analytic results back to core NiFi Push results back to edge locations/devices to change behavior Edge NiFi Batch Analytics Core NiFi Edge Stream Processing NiFi

Future Work

Apache NiFi Roadmap HA Control Plane HA Data Plane Multi-Tenancy Zero Master cluster, Web UI accessible from any node Auto-Election of “Cluster Coordinator” and “Primary Node” through ZooKeeper HA Data Plane Ability to replicate data across nodes in a cluster Multi-Tenancy Restrict Access to portions of a flow Allow people/groups with in an organization to only access their portions of the flow Extension Registry Create a central repository of NARs and Templates Move most NARs out of Apache NiFi distribution, ship with a minimal set

Apache NiFi Roadmap Variable Registry Redesign of User Interface Define environment specific variables through the UI, reference through EL Make templates more portable across environments/instances Redesign of User Interface Modernize look & feel, improve usability, support multi-tenancy Continued Development of Integration Points New processors added continuously! MiNiFi Complimentary data collection agent to NiFi’s current approach Small, lightweight, centrally managed agent that integrates with NiFi for follow-on dataflow management

Thanks! Resources Apache NiFi Mailing Lists https://nifi.apache.org/mailing_lists.html Apache NiFi Documentation https://nifi.apache.org/docs.html Getting started developing extensions https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions https://nifi.apache.org/developer-guide.html Contact Info: Email: apsaltis@hortonworks.com Twitter: @itmdata LinkedIn: https://www.linkedin.com/in/andrewpsaltis

Thank you