Real-Time Processing with Apache Flume, Kafka, and Storm
Kamlesh Dhawale
Ankalytics
www.ankalytics.com
Topics
- Flume
- Kafka
- Storm
- Demo
Flume
- Used for creating streaming data flows
- Distributed
- Reliable
- Support for many inbound ingest protocols
- Suited to real-time streaming as well as offline/batch processing
Flume components
- Source
- Channel
- Sink
Events flow from an origin (web server, files, …) through Source -> Channel -> Sink into a destination (HDFS, NoSQL, …).
Source
Common source types: HTTP, Spool Directory, Exec.

HTTP
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler

Spool Directory
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true

Exec
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
Channel
- Memory – high throughput, not reliable
- JDBC – durable, slower
- File – good throughput, supports recovery
A file channel sketch follows this list.
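A minimal file channel sketch (the directory paths are illustrative, not from the slides):

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data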
Sink
Common sink types: HDFS, Hive, Kafka.

HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-

Hive
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs

Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
Flow
Flows can be composed in several ways:
- Chaining agents
- Multiplexing
- Fan-in (many sources feeding one destination)
- Fan-out (one source feeding multiple channels)
A fan-out sketch follows this list.
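A fan-out sketch using the replicating channel selector (Flume's default); the agent and component names are illustrative. Every event from source r1 is copied into both channels:

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2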
Config file
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting flume
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
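With the agent running, the netcat source from example.conf can be smoke-tested from a second terminal (assuming telnet is installed):

telnet localhost 44444
Hello world!

Each line typed should appear as an event on the console via the logger sink.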
Kafka
Yet Another Messaging System?
Design considerations:
- Log aggregator
- Distributed
- Batch messages to reduce the number of connections
- Offline/periodic consumption
- Pull model
Architecture
Reliability
- Uses ZooKeeper for node and consumer status.
- At-least-once delivery (by tracking offsets yourself you can get exactly-once processing).
- Built-in data loss auditing.
Topic
- A producer writes to a topic and a consumer reads from a topic.
- A topic is divided into an ordered set of partitions.
- Each partition is consumed by one consumer at a time.
- An offset is maintained per consumer per partition.
More on Topic…
- Partition count determines the maximum consumer parallelism.
- Each partition can have multiple replicas. This provides failover.
- A broker can host multiple partitions, but each partition has exactly one broker as its leader.
- The leader receives messages and replicates them to the other brokers.
Configuration
server.properties file:
- Host name, port
- ZooKeeper connection
A sketch follows.
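A minimal server.properties sketch for the Kafka versions of this era (0.8/0.9; newer releases replace port/host.name with listeners). All values are illustrative:

broker.id=0
port=9092
host.name=localhost
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181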
Start Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
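Once the broker is up, an end-to-end check looks like this (flags as in Kafka 0.8/0.9, matching the slides; newer Kafka uses --bootstrap-server instead of --zookeeper/--broker-list):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mytopic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning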
Storm
Key concepts
- Topologies
- Streams
- Spouts
- Bolts
- Stream groupings
- Reliability
- Tasks
- Workers
Streams
- Unbounded sequence of tuples
- A tuple is a list of values
Spouts
- Generate streams.
- Can be reliable or unreliable.
- Reliable spouts implement ack() and fail(), so tuples can be replayed.
A reliable spout sketch follows this list.
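A minimal reliable spout sketch in Java (org.apache.storm API as in Storm 1.x and later; the class name and emitted sentence are illustrative):

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Emits a fixed sentence; tagging each tuple with a message id is what
// makes the spout reliable: Storm will call ack()/fail() for that id.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long msgId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        // The second argument is the message id; omit it for an unreliable spout.
        collector.emit(new Values("the quick brown fox"), msgId++);
    }

    @Override
    public void ack(Object id) { /* tuple tree fully processed */ }

    @Override
    public void fail(Object id) { /* look up the id and re-emit to replay */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}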
Bolts
- Used for filtering, functions, aggregations, joins, talking to databases, and more.
- Complex processing is achieved by chaining multiple bolts.
Types of bolt interfaces:
- IRichBolt: the general interface for bolts; tuples must be acked manually.
- IBasicBolt: a convenience interface for bolts that do filtering or simple functions; tuples are acked automatically.
A basic bolt sketch follows this list.
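A basic bolt sketch (same assumed org.apache.storm API; BaseBasicBolt provides the auto-ack behavior of IBasicBolt):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Splits each incoming sentence into words; each input tuple is
// acked automatically once execute() returns.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getString(0).split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}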
Topology
Stream grouping
Tells Storm how to distribute tuples among the available tasks.
- Shuffle grouping – tuples are randomly sent to tasks.
- Fields grouping – tuples are partitioned by field value, so all tuples with the same value of the grouped field go to the same task.
(Diagram: shuffle grouping spreading tuples across tasks vs. fields grouping routing equal values to one task.)
A topology sketch wiring both groupings follows.
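Putting the pieces together: a hypothetical topology wiring the spout and bolt sketched above. WordCountBolt is assumed rather than defined here:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        // Shuffle grouping: sentences are spread randomly over 4 splitter tasks.
        builder.setBolt("split", new SplitSentenceBolt(), 4).shuffleGrouping("sentences");
        // Fields grouping: every occurrence of the same word goes to the same task.
        builder.setBolt("count", new WordCountBolt(), 4).fieldsGrouping("split", new Fields("word"));

        // Run in-process for a demo; a real deployment uses StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
    }
}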
Storm Architecture
- Nimbus – master node
- Zookeeper – cluster coordination
- Supervisor – worker processes
Storm cluster
- Nimbus: master node. There can be only one master node in a cluster. Reassigns tasks in case of worker node failure.
- Zookeeper: communication backbone of the cluster. Maintains state to aid failover/recovery.
- Supervisor: worker node. Governs worker processes.
The standard launch commands follow.
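For a single-machine sketch, the standard commands to launch the daemons (each normally runs on its own node):

bin/storm nimbus
bin/storm supervisor
bin/storm ui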
Storm Cluster – Runtime components
(Diagram: a Nimbus node and a ZooKeeper node coordinate worker nodes; each worker node runs a Supervisor plus worker processes, each worker process runs executors, and each executor runs one or more tasks.)
Code / Demo