1
Capital One Architecture Team and DataTorrent
Technical Deep Dive
Feb 4, 2016
© 2015 DataTorrent Confidential – Do Not Distribute
2
Agenda
- Meet and greet (9:00 AM - 9:30 AM)
- Capital One led (9:30 AM - 12:00 PM)
  - Fast Data Reference Architecture (Marty Dewey)
  - Discussion on other use cases (Matt Beamon)
  - Open for discussion
- Lunch (12:00 PM - 12:30 PM)
- DataTorrent led (12:30 PM - 4:30 PM)
  - DataTorrent with high-level business model overview
  - Apex architecture
  - Apex DevOps model
  - Roadmap / our community engagement plans
  - Share some customer learnings, including use patterns
  - Insight-to-execution problem: whiteboarding exercise
  - Security
4
Core Features Apache Apex Architecture Streaming Analytics Platform
- Real-time and batch processing
- Scalable and highly available
- Managed state
- Library of pre-built operators
- Code reuse
5
Apex Platform Overview
6
Native Hadoop
- YARN is the resource manager
- HDFS is used for storing any persistent state
7
Apache Malhar Library
8
Application Programming Model
Directed Acyclic Graph (DAG)
[Diagram: operators connected by streams of tuples, labeled Filtered Stream, Enriched Stream, and Output Stream]
- A stream is a sequence of data tuples.
- An operator takes one or more input streams, performs computations, and emits one or more output streams.
- Each operator is your custom business logic in Java, or a built-in operator from our open source library.
- An operator has many instances that run in parallel, and each instance is single-threaded.
- A Directed Acyclic Graph (DAG) is made up of operators and streams.
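The operator-and-stream model described on this slide can be sketched in plain Java. This is only an illustration of the idea, not the Apex API: the `Operator` interface, the `DagSketch` class, and the filter and enrich rules (drop negative readings, tag readings above 30 as hot) are all hypothetical names and rules invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; these are NOT Apex classes.
class DagSketch {
    /** An operator receives one tuple at a time and may emit tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    /** Wires operators in sequence: filter -> enrich -> collected output stream. */
    static List<String> run(List<Integer> readings) {
        List<String> output = new ArrayList<>();

        // Filter operator: forwards only non-negative readings (assumed rule).
        Operator<Integer, Integer> filter = (t, emit) -> {
            if (t >= 0) {
                emit.accept(t);
            }
        };

        // Enrich operator: attaches a status label to each reading (assumed rule).
        Operator<Integer, String> enrich =
            (t, emit) -> emit.accept(t + (t > 30 ? ":HOT" : ":OK"));

        // Each stream is modeled as a callback from one operator to the next.
        for (Integer t : readings) {
            filter.process(t, filtered -> enrich.process(filtered, output::add));
        }
        return output;
    }
}
```

Calling `DagSketch.run` with the readings -1, 20, and 34 drops the first tuple in the filter stage and passes the other two through the enrich stage. Because each stage only sees one tuple at a time, a stage like this can be replicated and run in parallel, which is the property the slide attributes to operator instances.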
9
Application Specification
10
Advanced Windowing Support
- Application window
- Sliding window and tumbling window
- Checkpoint window
- No artificial latency
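The difference between tumbling and sliding windows can be shown with a small plain-Java sketch. It illustrates the general concepts only and makes no claim about how Apex implements them; the `WindowSketch` class and its method names are hypothetical, and each array element is assumed to be a value already aggregated for one small time slice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of window semantics; not Apex code.
class WindowSketch {
    /** Sums every group of `size` consecutive values, advancing by `step` each time. */
    private static List<Integer> windowSums(int[] values, int size, int step) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= values.length; start += step) {
            int sum = 0;
            for (int i = start; i < start + size; i++) {
                sum += values[i];
            }
            sums.add(sum);
        }
        return sums;
    }

    /** Tumbling windows do not overlap: the window advances by its full size. */
    static List<Integer> tumblingSums(int[] values, int size) {
        return windowSums(values, size, size);
    }

    /** Sliding windows overlap: the window advances by one slice at a time. */
    static List<Integer> slidingSums(int[] values, int size) {
        return windowSums(values, size, 1);
    }
}
```

For the slices 1, 2, 3, 4 and a window size of 2, the tumbling variant yields two sums (3 and 7) while the sliding variant yields three (3, 5, and 7), since each slice after the first contributes to two windows.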
11
Guarantees and Performance
Stateful Fault Tolerance
- Supported out of the box
- Application state
- Application master state
- No data loss
- Automatic recovery
- Lunch test
- Buffer server

Processing Semantics
- At least once
- At most once
- Exactly once

Data Locality
- Stream locality for placement of operators
- Rack local: distributed deployment
- Node local: data does not traverse the NIC
- Container local: data doesn't need to be serialized
- Thread local: operators run in the same thread
12
Partitioning
[Diagram: Logical DAG (Input Operator, Compute Operator, Output Operator) next to its Physical DAG, in which the Compute Operator runs as partitions 1, 2, and 3 and a Unifier Operator sits before the Output Operator. Labels: Stream Split and Unifier]
13
Unifier
- Combines outputs of multiple partitions
- Runs as an operator
- Logic depends on the operator functionality
  - Example: if the operator is computing an average, the unifier computes the final average from the individual averages and counts
- A default unifier is used if none is specified
- Cascading unification is possible if unification needs to be done in multiple stages
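The averaging example on this slide can be sketched as follows. This is a plain-Java illustration of the arithmetic a unifier would need, not Apex's unifier interface; `AverageUnifierSketch`, `Partial`, and `unify` are hypothetical names. The point is that each partition must report its count along with its average, because the final result is a count-weighted average rather than a mean of the averages.

```java
import java.util.List;

// Hypothetical illustration of unifying partial averages; not Apex code.
class AverageUnifierSketch {
    /** What one partition emits: its local average and how many tuples it saw. */
    static final class Partial {
        final double average;
        final long count;

        Partial(double average, long count) {
            this.average = average;
            this.count = count;
        }
    }

    /** Combines partial results into the overall average, weighting by count. */
    static double unify(List<Partial> partials) {
        double weightedSum = 0.0;
        long total = 0;
        for (Partial p : partials) {
            weightedSum += p.average * p.count;
            total += p.count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no tuples to average");
        }
        return weightedSum / total;
    }
}
```

With one partition reporting an average of 10.0 over 2 tuples and another reporting 20.0 over 6 tuples, the unified average is 17.5, whereas simply averaging the two averages would give 15.0.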
14
Stream Split
- Utilizes the hashcode and a mask to determine the partition
- The mask picks the last n bits of the hashcode of the tuple
- StreamCodec can be used to specify a custom hashcode
- A custom partitioner can be used to change the default map

Mask (0x11)    Partition
00             1
01             2
10             3
11             4

tuple: { Sensor, , 34, GOOD }
Hashcode:
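The mask-based routing in the table above can be sketched in a few lines of plain Java. This is an illustration, not Apex code: `MaskPartitionSketch` and `partitionFor` are hypothetical names, and the sketch assumes the slide's mask denotes the two lowest bits (binary 11), with partitions numbered 1 to 4 as in the table.

```java
// Hypothetical illustration of hashcode-and-mask routing; not Apex code.
class MaskPartitionSketch {
    /** Assumed mask: keeps the last two bits, giving four possible values. */
    static final int TWO_BIT_MASK = 0b11;

    /**
     * Keeps only the masked low bits of the hashcode and maps them to a
     * 1-based partition number, matching the table (00 -> 1, ..., 11 -> 4).
     */
    static int partitionFor(int hashCode, int mask) {
        return (hashCode & mask) + 1;
    }
}
```

A hashcode ending in binary 10 goes to partition 3, and one ending in 11 goes to partition 4. Unlike a remainder operation, the bitwise AND never produces a negative result, so negative hashcodes are routed the same way as positive ones.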
15
MxN Partitioning
[Diagram: DAG with three Input Operator, four Compute Operator, and three Output Operator partitions]
Default Mechanism: Stateless Partitioner
    <property>
      <name>dt.application.<streamingApp>.operator.<name>.attr.PARTITIONER</name>
      <value>com.datatorrent.common.partitioner.StatelessPartitioner:4</value>
    </property>
Mask and partitions
16
Parallel Partitioning
[Diagram: DAG with three parallel pipelines, each Input Operator, Compute Operator, Output Operator]
    <property>
      <name>dt.application.<streamApp>.operator.<name>.port.input.attr.PARTITION_PARALLEL</name>
      <value>true</value>
    </property>
This can be combined with MxN partitioning.
17
Custom partitioning
- Custom stream splitting
- Distribution of state during initial or dynamic partitioning
- Kafka operators scale according to the number of Kafka partitions
- Re-distribution of state during dynamic partitioning

Mask (0x00)    Partition
00             1 2 3 4

tuple: { Sensor, , 34, GOOD }
Hashcode:
18
Data Processing Pipeline Example
App Builder
19
Data Processing Pipeline Example
Logical Plan
20
Data Processing Pipeline Example
Physical Plan
21
Data Processing Pipeline Example
Real Time Visualization
22
Dynamic Updates
- Dynamic topology updates
- Properties of operators can be changed
- New operators can be added
23
Key Differences between Storm, Spark and Apex
Capability (columns: Storm, Spark, Apex)
- Hadoop Native: No Yes
- Millions of events/second ingestion
- Billions of events/second processing
- Sub-second processing latency
- Application & system stateful fault tolerance
- Number of operators: 6 (Storm), 7 (Spark), 450+ (Apex)
- Lights out, GUI management console
- Application builder for non-developers
- Application dashboard framework and builder
24
Apache Apex Roadmap Core Features
25
Apex Roadmap
- High-level APIs
- Integration with Samoa, H2O, Dato
- Iterative processing support
- Dynamic application property changes
- Ability to add new processing code to the DAG
- Native support for batch processing
- Encrypted streams
- Enable non-affinity of operators
- Scalable join operators
- Integration with the broader open source ecosystem
26
DataTorrent Roadmap Core Features
27
Roadmap
- Focus on solutions
  - Ingestion
  - Analytics
  - ………
- Ease of use
  - Make it more drag-and-drop oriented
  - Configuration vs. coding
28
Apache Apex Dev Ops Core Features
29
Apex DevOps Model
- Leverage investment in Hadoop / native Hadoop
- Strong benchmarking and certification support
- Rich RESTful web services
- Visualization: for application monitoring and application data
- Security, including Kerberos, LDAP, RBAC
- Rolling upgrades
- Lunch test
- Backward compatibility
30
Our Learnings from Other Customers
- Most large enterprise customers are transitioning to modern architectures to get faster insight
  - Batch processes to real time
  - Proprietary to open source
  - Scale up to scale out
  - Costly hardware to commodity
- Big data projects often fail due to
  - Skill requirements that are in short supply
  - Operational shortcomings of the existing platforms
- Operations costs far outweigh the functional costs
  - Operational costs = 80%
  - Functional costs = 20%
31
Cloud Learnings
Pros
- Quick time to market
- Easy to deploy
- Amortization of cost
- Expand as needed
- Outsource operational and security expertise
Cons
- Shared hardware
- Cannot guarantee SLA
- At a certain scale, $$$$
- Cost of migration
32
Big Data Use Cases: Streaming and Faster Batch
Core Features
33
Back Up Thank You