Capital One Architecture Team and DataTorrent

Technical Deep Dive
Feb 4, 2016
© 2015 DataTorrent Confidential – Do Not Distribute

Agenda
Meet and greet (9:00 AM - 9:30 AM)
Capital One led (9:30 AM - 12:00 PM)
  Fast Data Reference Architecture (Marty Dewey)
  Discussion on other use cases (Matt Beamon)
  Open for discussion
Lunch (12:00 - 12:30)
DataTorrent led (12:30 - 4:30)
  DataTorrent with high-level business model overview
  Apex architecture
  Apex Dev Ops model
  Roadmap / our community engagement plans
  Share some customer learnings, including use patterns
  Insight-to-execution problem: whiteboarding exercise
  Security

Core Features
Apache Apex Architecture
Streaming Analytics Platform
Real-time and Batch Processing
Scalable and Highly Available
Managed State
Library of pre-built operators
Code Reuse

Apex Platform Overview

Native Hadoop
YARN is the resource manager
HDFS is used for storing any persistent state

Apache Malhar Library

Application Programming Model
Directed Acyclic Graph (DAG)
[Diagram: operators connected by streams of tuples (filtered stream, enriched stream, output stream)]
A Stream is a sequence of data tuples.
An Operator takes one or more input streams, performs computations, and emits one or more output streams.
Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library.
An Operator has many instances that run in parallel, and each instance is single-threaded.
A Directed Acyclic Graph (DAG) is made up of operators and streams.
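The operator-and-stream model above can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Apex API: the `Operator` interface, the `run` helper, and the `FILTER` and `ENRICH` operators are hypothetical names invented for this illustration, and streams are modeled as in-memory lists rather than distributed tuple flows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of operators connected by streams; NOT the Apex API. */
final class DagSketch {

    /** Hypothetical operator: consumes one input tuple, emits zero or more output tuples. */
    interface Operator<I, O> {
        List<O> process(I tuple);
    }

    /** Pushes every tuple of an input stream through one operator, in order. */
    static <I, O> List<O> run(List<I> input, Operator<I, O> op) {
        List<O> output = new ArrayList<>();
        for (I tuple : input) {
            output.addAll(op.process(tuple));
        }
        return output;
    }

    /** Hypothetical filter operator: keeps only non-negative readings. */
    static final Operator<Integer, Integer> FILTER = t -> {
        List<Integer> out = new ArrayList<>();
        if (t >= 0) {
            out.add(t);
        }
        return out;
    };

    /** Hypothetical enrich operator: attaches a label to each reading. */
    static final Operator<Integer, String> ENRICH = t -> {
        List<String> out = new ArrayList<>();
        out.add("reading=" + t);
        return out;
    };

    public static void main(String[] args) {
        List<Integer> inputStream = Arrays.asList(5, -1, 7);
        List<Integer> filteredStream = run(inputStream, FILTER);
        List<String> enrichedStream = run(filteredStream, ENRICH);
        System.out.println(enrichedStream); // prints [reading=5, reading=7]
    }
}
```

Each stage only sees tuples from the stream feeding it, which is the property that lets a platform run the stages as separate single-threaded instances.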

Application Specification

Advanced Windowing Support
Application window
Sliding window and tumbling window
Checkpoint window
No artificial latency

Guarantees and Performance
Stateful Fault Tolerance
  Supported out of the box
  Application state
  Application master state
  No data loss
  Automatic recovery
  Lunch test
  Buffer server
Processing Semantics
  At least once
  At most once
  Exactly once
Data Locality
  Stream locality for placement of operators
  Rack local: distributed deployment
  Node local: data does not traverse the NIC
  Container local: data doesn't need to be serialized
  Thread local: operators run in the same thread

Partitioning
[Diagrams: Logical DAG (Input Operator, Compute Operator, Output Operator) and Physical DAG (Input Operator; Compute Operator partitions 1, 2, 3; Unifier Operator; Output Operator), showing Stream Split and Unifier]

Unifier
Combines outputs of multiple partitions
Runs as an operator
Logic depends on the operator functionality
Example: if the operator is computing an average, the unifier computes the final average from the individual averages and counts
A default unifier is used if none is specified
Cascading unification is possible if unification needs to be done in multiple stages
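The averaging example on this slide can be sketched in plain Java. This is an illustration of the arithmetic only, not the Apex unifier interface: the `Partial` class and the `unify` method are hypothetical names, and each partition is assumed to report its local average together with the number of tuples it saw.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of a unifier that merges per-partition averages; NOT the Apex API. */
final class AverageUnifierSketch {

    /** Hypothetical partial result from one partition: local average and tuple count. */
    static final class Partial {
        final double average;
        final long count;

        Partial(double average, long count) {
            this.average = average;
            this.count = count;
        }
    }

    /** Combines partial averages, weighting each one by its tuple count. */
    static double unify(List<Partial> partials) {
        double weightedSum = 0.0;
        long total = 0;
        for (Partial p : partials) {
            weightedSum += p.average * p.count;
            total += p.count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no tuples to average");
        }
        return weightedSum / total;
    }

    public static void main(String[] args) {
        List<Partial> partials = Arrays.asList(
                new Partial(10.0, 2),   // partition 1: 2 tuples averaging 10
                new Partial(20.0, 6));  // partition 2: 6 tuples averaging 20
        // (10 * 2 + 20 * 6) / 8 = 17.5, not the naive (10 + 20) / 2 = 15
        System.out.println(unify(partials)); // prints 17.5
    }
}
```

The counts matter: averaging the two partition averages directly would give 15, which is wrong whenever partitions see different numbers of tuples.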

Stream Split
Utilize hashcode and mask to determine the partition
Mask picks the last n bits of the hashcode of the tuple
StreamCodec can be used to specify a custom hashcode
Custom partitioner can be used to change the default map

Mask (0x11)   Partition
00            1
01            2
10            3
11            4

tuple: { Sensor, , 34, GOOD }
Hashcode:
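A minimal plain-Java sketch of the hashcode-and-mask lookup follows. It assumes the slide's "0x11" is meant as the two low bits (binary 11), since the table lists the bit patterns 00 through 11. The class and method names are hypothetical and this is not the Apex StreamCodec or partitioner API. The tuple fields passed to `Objects.hash` are also illustrative, because the slide's tuple and hashcode are incomplete.

```java
import java.util.Objects;

/** Sketch of mask-based stream splitting; plain Java, NOT the Apex API. */
final class MaskPartitionSketch {

    /** Builds a mask that keeps the last n bits of a hashcode (0 <= n <= 30). */
    static int maskForBits(int n) {
        if (n < 0 || n > 30) {
            throw new IllegalArgumentException("n must be between 0 and 30");
        }
        return (1 << n) - 1;
    }

    /**
     * Returns the 1-based partition number from the slide's table:
     * 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 4 (for a 2-bit mask).
     */
    static int partitionFor(int hashCode, int mask) {
        // Bitwise AND keeps only the masked bits, so the result is never negative.
        return (hashCode & mask) + 1;
    }

    public static void main(String[] args) {
        int mask = maskForBits(2); // binary 11

        // A hashcode ending in binary 10 maps to partition 3.
        System.out.println(partitionFor(0b0110, mask)); // prints 3

        // Illustrative tuple fields; the resulting partition depends on the hash.
        int hash = Objects.hash("Sensor", 34, "GOOD");
        System.out.println("partition = " + partitionFor(hash, mask));
    }
}
```

Because only the low bits are inspected, tuples whose hashcodes share those bits always land on the same partition, which is what makes a custom hashcode a useful control over the distribution.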

MxN Partitioning
[Diagram: DAG of Input Operator, Compute Operator and Output Operator partitions]
Default mechanism: Stateless Partitioner

<property>
  <name>dt.application.<streamingApp>.operator.<name>.attr.PARTITIONER</name>
  <value>com.datatorrent.common.partitioner.StatelessPartitioner:4</value>
</property>

Mask and partitions

Parallel Partitioning
[Diagram: DAG with parallel Input Operator, Compute Operator, Output Operator pipelines]

<property>
  <name>dt.application.<streamApp>.operator.<name>.port.input.attr.PARTITION_PARALLEL</name>
  <value>true</value>
</property>

This can be combined with MxN partitioning.

Custom partitioning
Custom stream splitting
Distribution of state during initial or dynamic partitioning
Kafka operators scale according to the number of Kafka partitions
Re-distribution of state during dynamic partitioning

Mask (0x00)   Partition
00            1, 2, 3, 4

tuple: { Sensor, , 34, GOOD }
Hashcode:

Data Processing Pipeline Example
App Builder

Logical Plan

Physical Plan

Real Time Visualization

Dynamic Updates
Dynamic topology updates
Properties of operators can be changed
New operators can be added

Key Differences between Storm, Spark and Apex
Capability (Storm, Spark, Apex)
Hadoop Native: No, Yes
Millions of events/second ingestion
Billions of events/second processing
Sub-second processing latency
Application & system stateful fault tolerance
Number of operators: 6, 7, 450+
Lights out, GUI management console
Application builder for non-developers
Application dashboard framework and builder

Apache Apex Roadmap Core Features

Apex Roadmap
High-level APIs
Integration with Samoa, H2O, Dato
Iterative processing support
Dynamic application property changes
Ability to add new processing code to the DAG
Native support for batch processing
Encrypted streams
Enable non-affinity of operators
Scalable join operators
Integration with broader open source ecosystem

DataTorrent Roadmap Core Features

Roadmap
Focus on Solutions
  Ingestion
  Analytics
  ………
Ease of Use
  Make it more drag-and-drop oriented
Configuration vs Coding

Apache Apex Dev Ops Core Features

Apex Dev Ops Model
Leverage investment in Hadoop / native Hadoop
Strong benchmarking and certification support
Rich RESTful web services
Visualization: for application monitoring and application data
Security including Kerberos, LDAP, RBAC
Rolling upgrades
Lunch test
Backward compatibility

Our Learnings from Other Customers
Most large enterprise customers are transitioning to modern architectures to get faster insight
  Batch processes → Real time
  Proprietary → Open source
  Scale up → Scale out
  Costly hardware → Commodity
Big data projects often fail due to
  Required skills being in short supply
  Operational shortcomings of the existing platforms
Operations costs far outweigh the functional costs
  Operational costs = 80%
  Functional costs = 20%

Cloud Learnings
Pros
  Quick time to market
  Easy to deploy
  Amortization of cost
  Expand as needed
  Outsource operational and security expertise
Cons
  Shared hardware
  Cannot guarantee SLA
  At a certain scale, $$$$
  Cost of migration

Big Data Use Cases: Streaming and Faster Batch
Core Features

Back Up
Thank You
