Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise
OCJUG, 2014
Hadoop Ecosystem
Real-Time Insight with In-Memory ETL
[Diagram: batch versus stream processing, plotted against application complexity. Batch (Map-Reduce) labels: intermediate files, RDBMS, EDW, MPP; data accumulation 24 hours; data processing 8 hours. Stream (in-memory, real-time, event-driven) labels: complex processing in seconds.]
Scalable Ingestion
[Diagram labels, grouped by stage:]
- Source data: Feed 1, Feed 2, Feed…., Feed 400, XML files, sensor data, social data, MS Queues, databases
- Scalable ingestion (scale out): extract, transform, load
- Enterprise repositories: RDBMS, EDW, NoSQL, Hive, HDFS raw archive
- Outputs: analytics, alerts, events, ad hoc query, access service, feed consumers/applications
- Visualize: business analytics, business intelligence tools, visualization tools
Stream Processing
- A Stream is a sequence of data events with a schema
- An Operator takes input streams and computes output streams
- Each Operator is YOUR business logic in Java, or comes from our library
- An Application is a Directed Acyclic Graph (DAG)
[Diagram: a DAG of six operators, numbered 1 to 6.]
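The operator-and-stream model on this slide can be sketched in plain Java. This is a minimal illustration, not the DataTorrent API: `MiniDag`, `Operator`, `addOperator`, `addStream`, and `inject` are hypothetical names invented for this example, events are plain integers rather than schema-bearing tuples, and everything runs in one thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mini-framework for illustration only; NOT the DataTorrent API.
final class MiniDag {
    /** An operator turns one input event into zero or more output events. */
    interface Operator {
        List<Integer> process(int event);
    }

    private final Map<String, Operator> operators = new LinkedHashMap<>();
    private final Map<String, List<String>> downstream = new HashMap<>();
    private final Map<String, List<Integer>> emitted = new HashMap<>();

    MiniDag addOperator(String name, Operator op) {
        operators.put(name, op);
        downstream.put(name, new ArrayList<>());
        emitted.put(name, new ArrayList<>());
        return this;
    }

    /** Connects the output of one operator to the input of another. */
    MiniDag addStream(String from, String to) {
        if (!operators.containsKey(from) || !operators.containsKey(to)) {
            throw new IllegalArgumentException("unknown operator");
        }
        // Reject any edge that would close a loop, so the graph stays acyclic.
        if (reaches(to, from)) {
            throw new IllegalArgumentException("stream would create a cycle");
        }
        downstream.get(from).add(to);
        return this;
    }

    private boolean reaches(String start, String target) {
        if (start.equals(target)) {
            return true;
        }
        for (String next : downstream.get(start)) {
            if (reaches(next, target)) {
                return true;
            }
        }
        return false;
    }

    /** Feeds one event to an operator and pushes its outputs downstream. */
    void inject(String name, int event) {
        for (int out : operators.get(name).process(event)) {
            emitted.get(name).add(out);
            for (String next : downstream.get(name)) {
                inject(next, out);
            }
        }
    }

    List<Integer> outputOf(String name) {
        return emitted.get(name);
    }

    /** Builds input -> evens -> squares and feeds it the events 1..6. */
    static MiniDag demo() {
        MiniDag dag = new MiniDag()
            .addOperator("input", e -> Collections.singletonList(e))
            .addOperator("evens", e -> {
                List<Integer> out = new ArrayList<>();
                if (e % 2 == 0) {
                    out.add(e);
                }
                return out;
            })
            .addOperator("squares", e -> Collections.singletonList(e * e))
            .addStream("input", "evens")
            .addStream("evens", "squares");
        for (int i = 1; i <= 6; i++) {
            dag.inject("input", i);
        }
        return dag;
    }

    public static void main(String[] args) {
        System.out.println(demo().outputOf("squares")); // prints [4, 16, 36]
    }
}
```

Each lambda stands in for a piece of business logic; wiring operators by name keeps the graph definition separate from the logic, which is what lets a platform decide later where each operator runs.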
DataTorrent Hadoop GRID
[Diagram labels: DT Console, dtCLI, DT Gateway, Resource Manager, four Node Managers (NM), StrAM, operators 1 to 6 placed in containers alongside MapReduce jobs.]
DataTorrent Platform
- High Performance
  - Real-time data ingestion
  - In-memory processing
  - Billions of operations per second
- Extreme Scalability
  - DataTorrent automatically scales out/in to changing loads
  - Sub-second latency with linear scalability
  - Complex big data applications
- Mission Critical
  - Built-in fault tolerance
  - 24/7 uptime guaranteed
  - Update your application while it's running!
- Hadoop 2.0 Native
  - Runs on your existing Apache Hadoop cluster
  - Develop faster and support any business logic with our open-source framework
  - Integrate seamlessly with your existing data flow
DataTorrent YARN Interaction
- DataTorrent is a Java interface-based API
  - Default Implementation – Platform
  - Custom Implementation – Application Development
- Platform components have various configuration properties
  - Container Size (Hadoop Dependent)
  - Operator Memory
  - Max Number of Containers
  - Locality of the Operators and Streams
  - C-Group (Coming soon – Hadoop Dependent)
  - Static and Dynamic Partitioning
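The slide does not say how partitioning is implemented, so the following is only a sketch of one common scheme: routing each event to a partition by the hash of its key. `KeyPartitioner`, `partitionFor`, and `route` are hypothetical names, and hash-based routing is an assumption rather than something taken from the deck. With a fixed partition count this corresponds to static partitioning; calling `route` again with a different count hints at what changes when the count is adjusted at run time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; hash routing is an assumed scheme.
final class KeyPartitioner {
    /** Maps a key to a partition index in [0, partitionCount). */
    static int partitionFor(Object key, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Groups keys by the partition that would receive them. */
    static Map<Integer, List<String>> route(List<String> keys, int partitionCount) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, partitionCount), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "d");
        // The same key always lands on the same partition for a given count.
        System.out.println("2 partitions: " + route(keys, 2));
        // Changing the count remaps keys, so any per-key state must move too.
        System.out.println("4 partitions: " + route(keys, 4));
    }
}
```

The point of keyed routing is that all events for one key reach the same operator instance, so per-key state never has to be shared across containers.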
Checkpointing
- Transparent, distributed, and asynchronous
- Resource requirements are directly proportional to
  - Size of the state
  - Frequency of checkpointing
- Most operators have a small (a few KB) state footprint
- Techniques to lower the cost
  - Identify the state with minimum footprint
  - Use external storage
  - Incremental checkpoints
  - Faster media
  - Stateless operators
  - Less frequent checkpoints
  - Disable checkpointing
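To make the cost relationship concrete, here is a minimal sketch of snapshotting and restoring operator state with standard Java serialization. It is not the platform's mechanism: `CheckpointSketch` and `CountingOperator` are hypothetical names, the snapshot is held in a byte array instead of distributed storage, and the checkpoint is taken synchronously, whereas the slide describes asynchronous checkpointing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical sketch for illustration only; not the platform's checkpointing code.
final class CheckpointSketch {
    /** A stateful operator: its entire state is one map of per-key counts. */
    static final class CountingOperator implements Serializable {
        private static final long serialVersionUID = 1L;
        final HashMap<String, Long> counts = new HashMap<>();

        void process(String key) {
            counts.merge(key, 1L, Long::sum);
        }
    }

    /** Serializes the operator; the byte count is the cost of one checkpoint. */
    static byte[] checkpoint(CountingOperator op) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(op);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds an operator from a snapshot, as recovery after a failure would. */
    static CountingOperator restore(byte[] snapshot) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (CountingOperator) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Processes two events, checkpoints, processes a third, then recovers. */
    static long demo() {
        CountingOperator op = new CountingOperator();
        op.process("a");
        op.process("a");
        byte[] snapshot = checkpoint(op);
        op.process("a"); // happens after the checkpoint, so it is not in the snapshot
        CountingOperator recovered = restore(snapshot);
        return recovered.counts.get("a");
    }

    public static void main(String[] args) {
        System.out.println("recovered count: " + demo()); // prints recovered count: 2
        System.out.println("snapshot bytes: " + checkpoint(new CountingOperator()).length);
    }
}
```

The bytes written per unit of time are roughly the snapshot size multiplied by the checkpoint frequency, which is why shrinking the state or checkpointing less often both reduce the cost. Events processed after the last checkpoint are absent from the snapshot, so a real system has to replay them on recovery.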
DataTorrent vs Alternatives
- Developed from the ground up to do streaming natively in Hadoop
- Relieves application developers of fault-tolerance concerns
- High performance yet resource friendly
- Linearly scalable
- Hadoop native; co-exists with other Hadoop applications
- 500+ open-source operators
- UI dashboard widgets
- Preferred by enterprises after trying alternatives
- Enterprise-grade support
Real-Time Fault Tolerant Use Cases
- Big Data ETL Offload
- Predictive Analytics
- Scalable Ingestion
- Operational Monitoring and Alerts
- Real-Time Business Actions
- Internet of Things
- Security
Demos
malhar-users@googlegroups.com
https://www.datatorrent.com/developers/