Architecting an Edge to Core to Cloud Data Pipeline


1 Architecting an Edge to Core to Cloud Data Pipeline
Unify Analytics Engines with an In-Place Data Pipeline
Santosh Rao, Senior Technical Director, NetApp, Mar

2 Early Generation Big Data Analytics Platform
Big data analytics software designed to deliver initial analytics solutions.
Primary considerations were cost and agility; scalability, availability, and governance were afterthoughts.
Typical approach: cloud or commodity infrastructure.
Leads to unpredictable ROI as copies manifest: 3-5 replicas copied across lines of business and functions.
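As a rough illustration of how those replicas compound, here is a back-of-the-envelope sketch; the data set size, replication factor, and number of LoB copies are hypothetical numbers, not figures from the slide.

```python
# Hypothetical illustration of copy sprawl: a data set kept at HDFS-style
# 3x replication and then duplicated in full for several lines of business.
usable_tb = 500            # logical data set size in TB (hypothetical)
hdfs_replication = 3       # default HDFS replication factor
lob_copies = 4             # full copies spread across LoBs / functions (hypothetical)

raw_tb = usable_tb * hdfs_replication * lob_copies
print(f"{usable_tb} TB of data -> {raw_tb} TB of raw capacity "
      f"({raw_tb / usable_tb:.0f}x overhead)")
```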

3 Early Generation Analytics Platform Challenges
Unpredictable performance
Inefficient storage utilization (file copies, file storage replication)
Media and node failures
Not enterprise ready
High total cost of ownership
Storage and compute tied together (creates imbalance)

4 Evolving To A Data Pipeline

5 Extending the Data Pipeline from Edge to Core to Cloud
Data lifecycle challenges across the pipeline:
Edge: initial point of data collection and aggregation
Core: dedicated hardware, private cloud deployments
Public cloud / multicloud: hosted as-a-service solutions; long-term data archival

6 Traditional Data Architecture
Issues with traditional on-premises data workflows and commodity infrastructure: siloed architecture; a traditional Lambda architecture as the data processing pipeline.
Notes: The limits of traditional on-premises workflow and architecture. By "traditional" we mean non-services-oriented architectures that may have siloed applications and modern applications built on virtualized infrastructure. In these architectures, data is not always free to move to where it is needed. Challenges with traditional architectures include:
Inflexible: fixed ratio of compute to storage; difficult to change database schemas
Costly: hardware licensing and underutilization; IT support
Slow: takes months to upgrade systems and to deploy new applications
Creates data management challenges: data copies, disk failures, I/O bandwidth, archival and disaster recovery
Speaker notes: determine the corollary in hybrid cloud web services; start with shared infrastructure and ask the audience whether they are on DAS, with a statement to speak to that.
(Diagram: users, analytics services, reports, web tier/app server, stream analytics, traditional data warehouse, Hadoop data lake, NoSQL, IoT data. Callouts: inflexible, months to upgrade systems; poor utilization, high TCO, fixed compute-to-storage ratio; inability to access the cloud; data management challenges.)
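For concreteness, here is a minimal sketch of the Lambda-style split described above: a batch layer recomputed over the data lake and a speed layer over a stream, merged at query time. It is written in PySpark under stated assumptions; the data-lake paths, Kafka broker, and topic name are hypothetical.

```python
# Minimal Lambda-style sketch: batch layer over the data lake plus a speed
# layer over a Kafka stream. Paths, broker, and topic are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: recompute aggregates over the full data lake (e.g. nightly).
batch = (spark.read.parquet("/datalake/events/")            # hypothetical path
         .groupBy("device_id")
         .agg(F.count("*").alias("event_count")))
batch.write.mode("overwrite").parquet("/serving/batch_views/")

# Speed layer: incremental aggregates over the most recent events.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "iot-events")                 # hypothetical topic
          .load()
          .groupBy(F.col("key").alias("device_id"))
          .agg(F.count("*").alias("event_count")))

query = (stream.writeStream.outputMode("complete")
         .format("memory").queryName("speed_view").start())
```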

7 Next Generation Data Pipeline
Unified insights across lines of business and functions
Unified enterprise data lake: federate data sources across the 2nd and 3rd platform
In-place access to the data pipeline (copy avoidance)
Future-proof to allow shifts in architecture and deployment: PoC at the LoB, scale for production use
Scale edge, core, and cloud as a single pipeline
Governance, data protection, and security on the data pipeline
Lowest TCO over the life of the solution

8 Meet Diverse Needs across Enterprise Functions
Data scientists: real-world data for app dev. Data architects: future-proof architecture. Data/IT admins: lowest TCO in the face of shrinking budgets. (We typically sell to the IT admins, on the right side of the slide.)
Data architects seek an extensible architecture: spans edge, core, and cloud; future-proof to allow shifts in deployment; tiering is the new scaling; no data and compute sprawl; architect for overall TCO.
Data/IT admins balance cost and time to market: reduce IT/licensing costs; automated data lifecycle management; meet SLAs; non-disruptive operations.
Data scientists need agile infrastructure: refreshed access to production data; enable DevOps and data scientists; enable multi-tenant data science; data and AI/ML models as code; API based.
A balanced architecture delivers for all stakeholders.

9 Data Pipeline Requirements
Consider the needs of each stage of the typical data pipeline: edge, core, and cloud.
Edge: data ingest, collection, and transport to the core. Massive data volumes (a few TB per device per day) growing exponentially; may require intelligence and real-time analytics/AI at the edge; ultra-low latency, network bandwidth, and smart data movement.
Core: data prep in the data lake, training clusters, deployment, and transport to the cloud and back to the edge. Demands raw I/O bandwidth for parallel operations to accelerate the training phase and ultra-low latency for deployed inference models; as the amount of data grows exponentially, the cost of storing it also grows. Requirements: ultra-high I/O bandwidth (GBps), ultra-low latency (micro- to nanoseconds), linear scale (1-128 node AI), overall TCO at PB scale.
Cloud: data lake, training, deployment, GPU as a service, and archive. The data lake requires performance for data ingest and for streaming data into the training cluster; the training cluster requires parallel processing; deployed inference models require ultra-low-latency access; the archive demands cost-effective storage forever. A multi-cloud strategy requires data movement among data lake, AI model training, deployment, and serving. Cloud analytics and AI/DL/ML should be consumed, not operated; weigh the cloud vendor stack against an on-prem stack, keep the archive cost-effective, and avoid cloud lock-in.
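To make the edge-ingest stage concrete, here is a minimal sketch of buffering device readings and shipping compressed batches toward the core over an S3-compatible endpoint. The endpoint, bucket, and the read_sensor() helper are hypothetical placeholders, not part of the slide's architecture.

```python
# Edge-side ingest sketch: buffer readings, roll them into compressed
# batches, and ship each batch to a core landing zone. All names hypothetical.
import gzip
import json
import time
import uuid

import boto3

s3 = boto3.client("s3", endpoint_url="https://core-objectstore.example")  # hypothetical endpoint

BATCH_SIZE = 10_000

def read_sensor():
    # Placeholder for a real device read.
    return {"ts": time.time(), "device": "sensor-01", "value": 42.0}

def ship_batch(records):
    """Compress a batch of JSON records and upload it to the core data lake."""
    key = f"ingest/raw/{uuid.uuid4()}.json.gz"
    payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode())
    s3.put_object(Bucket="edge-landing", Key=key, Body=payload)  # hypothetical bucket

buffer = []
while True:
    buffer.append(read_sensor())
    if len(buffer) >= BATCH_SIZE:
        ship_batch(buffer)
        buffer.clear()
    time.sleep(0.01)
```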

10 AI Infrastructure Guiding Principles: 5 Factors
Smoothing the flow of the data pipeline:
Choice of filesystem (Lustre, HDFS, GPFS, NFS)
Ability to federate diverse data sources, both structured and unstructured
Smart data movement
Data as a service
Leading-edge performance (NVMe, NVMe-oF, NVDIMM, 3D XPoint)
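One way to illustrate the federation idea is to put structured and unstructured sources behind a single access layer. The sketch below uses fsspec as one possible approach (it is not a NetApp API); the mount path and bucket are hypothetical, and the S3 backend assumes the s3fs package is installed.

```python
# Federating structured and unstructured sources behind one interface
# using fsspec. Paths and bucket names are hypothetical.
import fsspec
import pandas as pd

# Structured data sitting on an NFS-mounted data lake path.
with fsspec.open("file:///mnt/datalake/tables/sales.parquet", "rb") as f:  # hypothetical mount
    sales = pd.read_parquet(f)

# Unstructured data in an object store, accessed through the same interface.
s3 = fsspec.filesystem("s3")                 # requires s3fs
image_keys = s3.ls("training-data/images/")  # hypothetical bucket

print(len(sales), "rows of structured data,", len(image_keys), "unstructured objects")
```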

11 Tech Topics: 10 Dimensions of AI Infrastructure
Smoothing the flow of the data pipeline:
Both random and sequential I/O
Ultra-low latency for inference
Ultra-high bandwidth/parallelism for training
Linear scale
Single namespace and metadata
Client-side caching for iterative scans
Copy avoidance: in-place access to data (HPC, analytics, AI/DL)
Availability of smart data movement
Ease of management at scale
Service levels for model serving
(Speaker note: need to uplevel the story.)
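To see why the random-vs-sequential distinction matters, here is a minimal sketch that times sequential reads against random-offset reads on one file. The target path and sizes are hypothetical, it assumes a Unix host, and real storage benchmarking would use purpose-built tools such as fio.

```python
# Compare sequential vs. random 1 MiB reads on a test file (Unix only).
import os
import random
import time

PATH = "/mnt/datalake/sample.bin"   # hypothetical test file
BLOCK = 1 << 20                     # 1 MiB reads
COUNT = 256

def timed_reads(offsets):
    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, BLOCK, off)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return (BLOCK * len(offsets) / (1 << 20)) / elapsed  # MiB/s

size = os.path.getsize(PATH)
sequential = [i * BLOCK for i in range(COUNT)]
random_offsets = [random.randrange(0, size - BLOCK) for _ in range(COUNT)]

print("sequential: %.0f MiB/s" % timed_reads(sequential))
print("random:     %.0f MiB/s" % timed_reads(random_offsets))
```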

12 Our Solution: NetApp Data Pipeline
Extends from edge to core to cloud. Federates data sources, compute engines, and clouds.
Notes (workflow diagram): begins with data collection (metadata); messages are generated and transmitted to NetApp; data is ingested; data is processed and aggregated (prepared); data is archived and secured; data analytics is performed (machine learning); data services are delivered. Data management: copy management, data storage efficiencies, data movement, data availability, data security (this is unique to NetApp).
Details: AutoSupport v2 on-prem analytics architecture. Replaced HDFS with AFF for the real-time cluster. Reduced our storage usage from 12 PB on DAS to 1.3 PB on AFF. Reduced compute nodes from 120 to 40.
Additional info: Not only are we using our own data to drive a digital transformation through predictive analytics, we are delivering it through the NetApp Data Fabric. We have over 20 years of experience managing IoT data through our history of AutoSupport. We have used Data Fabric innovations to securely take advantage of the scale of the public cloud to deliver machine learning and AI analytics through NFS as a service.
(Diagram: BD cluster, AI/DL cluster, IoT data, HDInsight, Databricks/EMR, in-place analytics and in-place AI/DL via the NetApp HDFS-NFS Connector, HDFS/NFS, ExpressRoute / Direct Connect, archive, unified data lake, Data Fabric, NetApp Cloud Volumes, secure data.)
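As a minimal sketch of the "in-place access" idea, the PySpark snippet below points the analytics engine at data that already lives on an NFS-mounted volume rather than copying it into HDFS first. It uses plain file:// URIs over a hypothetical mount point; the NetApp HDFS-NFS Connector named on the slide has its own configuration, which is not shown here.

```python
# In-place analytics sketch: read and write data on an NFS mount directly,
# avoiding a copy into HDFS. Mount point and dataset paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

# Read directly from the NFS mount exported by the storage system.
events = spark.read.json("file:///mnt/ontap_nfs/autosupport/events/")  # hypothetical mount

daily = events.groupBy("cluster_id", "date").count()

# Write results back to the same shared namespace so other engines
# (AI/DL jobs, BI tools) can consume them without another copy.
daily.write.mode("overwrite").parquet("file:///mnt/ontap_nfs/curated/daily_counts/")
```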

13 Data Pipeline for Artificial Intelligence / Deep Learning
(Pipeline diagram spanning edge, core, and cloud, tied together by the Data Fabric.)
Edge: data collection, edge-level AI (TensorRT), data aggregation, normalization.
Core: ingest into the data lake, data prep, high-performance training cluster (training sets 1-3 plus a test set), deployment from the model repo; demands raw I/O bandwidth and parallel streams.
Cloud: public or private cloud for DevOps, cloud AI/DL, and archive; deployed inference models (IM1-IM3) require ultra-low latency for model serving.
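To ground the core-stage flow, here is a minimal PyTorch sketch: load prepared training data from the data lake, train a small model, and export it to the model repo for the serving tier to pick up. The paths, feature layout, and tiny network are hypothetical.

```python
# Core-stage sketch: data prep -> training -> export for deployment.
# Paths and the model architecture are hypothetical.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Data prep: stand-in for streaming prepared data out of the data lake.
features, labels = torch.load("/datalake/prepared/training_set_1.pt")   # hypothetical file
loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(features.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        optimizer.step()

# Deployment hand-off: export the trained model to the shared repo so the
# serving tier (e.g. TensorRT at the edge) can pick it up.
torch.save(model.state_dict(), "/datalake/repo/model_v1.pt")             # hypothetical path
```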

14 Hybrid Cloud Data Pipeline
(Hybrid cloud pipeline diagram.)
Edge: ingest with a small footprint.
Near the cloud: smart data movement; compute and data separation; data as a service, near-cloud data, data tiering, cloud AI, GPU as a service.
Cloud: public or private cloud deployment; data lake, training sets 1-3 plus a test set, model repo (IM1-IM3); AI / deep learning training in the cloud over ExpressRoute / Direct Connect; archive of cold data, backup, and clones.
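The sketch below illustrates the data-tiering idea at the script level: sweep an on-prem data lake directory and push files that have gone cold to a cheaper cloud archive tier. The paths, bucket, and age threshold are hypothetical, and ONTAP's own tiering (FabricPool) operates below the filesystem rather than through a script like this.

```python
# Illustrative cold-data tiering: move files unused for 90 days to an
# archive storage class. All names and thresholds are hypothetical.
import os
import time
from pathlib import Path

import boto3

s3 = boto3.client("s3")
COLD_AFTER = 90 * 24 * 3600          # 90 days
LAKE = Path("/mnt/datalake")         # hypothetical mount
BUCKET = "archive-tier"              # hypothetical bucket

now = time.time()
for path in LAKE.rglob("*"):
    if path.is_file() and now - path.stat().st_atime > COLD_AFTER:
        key = str(path.relative_to(LAKE))
        s3.upload_file(str(path), BUCKET, key,
                       ExtraArgs={"StorageClass": "GLACIER"})
        os.remove(path)              # leave retrieval to a separate restore flow
```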

15 Key Takeaways
NetApp Data Pipeline: building blocks for your entire data flow, from edge to core to cloud.
Edge: ONTAP Select. Core: ONTAP 9 on AFF, with ultra-high bandwidth and ultra-low latency. Cloud: ONTAP Cloud. Data Fabric for AI with the Smart Data Mover. Future proof; accelerates DL.

16 Thank You

17 Backup

18 Solution Architecture for the Data Pipeline
(Architecture diagram spanning edge, core, and cloud.)
On-premises deep learning pipeline: ONTAP Select at the edge; AFF and FlexPod for DGX/Plexistor at the core; data prep, data lake, training cluster with training sets 1-3, deployment, and model repo (IM1-IM3).
Cloud-based deep learning pipeline: the same flow with ONTAP Select feeding the cloud.

19 IDC Study: NetApp vs. Commodity TCO

