Real-Time Processing with Apache Flume, Kafka, and Storm
Kamlesh Dhawale, Ankalytics
Topics
Flume, Kafka, Storm, Demo
Flume
Used for creating streaming data flows. Distributed and reliable.
Supports many inbound ingest protocols, real-time streaming, and offline/batch processing.
Flume components
Web/File … → Source → Channel → Sink → HDFS/NoSQL …
Source
Examples: HTTP, Spool Directory, Exec.

HTTP:
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler

Spool Directory:
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true

Exec:
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
Channel
Memory – high throughput, not reliable
JDBC – durable, but slower
File – good throughput, supports recovery
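A file channel, for example, needs checkpoint and data directories for its recovery log. A minimal sketch in the same config style as the other slides (the directory paths here are illustrative assumptions, not values from the deck):

```
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
```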
Sink
Examples: HDFS, Hive, Kafka.

HDFS:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-

Hive:
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift:// :9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs

Kafka:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
Flow
Agents can be chained, and flows can multiplex, fan in, and fan out.
Config file

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port =

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting Flume
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Kafka
Yet another messaging system?
Design considerations: log aggregation, distributed operation, batching messages to reduce the number of connections, offline/periodic consumption, and a pull model.
Architecture
Reliability
Uses ZooKeeper for node and consumer status.
At-least-once delivery (by managing offsets yourself you can get exactly-once processing).
Built-in data-loss auditing.
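The offset mechanics behind this can be sketched in plain Python (this mimics the semantics, it is not the Kafka API): committing the offset only after processing gives at-least-once delivery, and idempotent processing turns redelivery into effectively exactly-once results.

```python
# Sketch: per-consumer offsets on a single partition's append-only log.
log = ["a", "b", "c", "d"]      # messages in one partition
committed_offset = 0            # last offset the consumer committed
seen = set()                    # processed-message ids (for idempotence)

def consume(crash_before_commit=False):
    """Read from the last committed offset; commit only after processing."""
    global committed_offset
    for offset in range(committed_offset, len(log)):
        msg = log[offset]
        if msg not in seen:     # idempotent: skip already-processed messages
            seen.add(msg)
        if crash_before_commit and offset == 1:
            return              # crash: offset 1 was processed but not
                                # committed, so it will be redelivered
        committed_offset = offset + 1

consume(crash_before_commit=True)  # processes "a", "b"; commits only "a"
consume()                          # redelivers "b" (at-least-once), then c, d
print(committed_offset, sorted(seen))  # -> 4 ['a', 'b', 'c', 'd']
```

The redelivered "b" is processed twice at the delivery level but only once at the result level, which is the distinction the slide draws.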
Topic
A producer writes to a topic and a consumer reads from it.
A topic is divided into an ordered set of partitions. Each partition is consumed by one consumer at a time, and an offset is maintained per consumer per partition.
More on topics…
The partition count determines the maximum consumer parallelism. Each partition can have multiple replicas, which provides failover. A broker can host multiple partitions; each partition has exactly one leader broker, which receives messages and replicates them to the other replicas.
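Why the partition count caps consumer parallelism can be shown with a toy assignment function (a simplification of Kafka's real assignment strategies): partitions are divided among the consumers in a group, one consumer per partition, so consumers beyond the partition count sit idle.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly
    one consumer, so at most len(partitions) consumers do any work."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: one consumer is necessarily idle.
a = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(a)  # -> {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```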
Configuration
server.properties file: host name, port, and ZooKeeper connection.
Start Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Storm
Key concepts
Topologies, streams, spouts, bolts, stream groupings, reliability, tasks, and workers
Streams
An unbounded sequence of tuples; a tuple is a list of values.
Spouts
Generate streams. Can be reliable or unreliable.
Reliable spouts use ack() and fail(), so failed tuples can be replayed.
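The ack()/fail() contract can be sketched in plain Python (this mimics the semantics only; a real reliable spout implements Storm's spout interface on the JVM):

```python
class ReliableSpoutSketch:
    """Mimics a reliable spout: emitted tuples stay pending until acked;
    fail() puts the tuple back on the queue so it can be replayed."""
    def __init__(self, messages):
        self.queue = list(messages)  # tuples waiting to be emitted
        self.pending = {}            # msg_id -> tuple, awaiting ack

    def next_tuple(self):
        if not self.queue:
            return None
        tup = self.queue.pop(0)
        msg_id = id(tup)             # illustrative message id
        self.pending[msg_id] = tup
        return msg_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)  # fully processed: forget it

    def fail(self, msg_id):
        self.queue.append(self.pending.pop(msg_id))  # re-queue for replay

spout = ReliableSpoutSketch(["e1", "e2"])
mid1, _ = spout.next_tuple()
spout.fail(mid1)                # e1 goes back on the queue for replay
mid2, _ = spout.next_tuple()    # emits e2
spout.ack(mid2)
print(spout.queue)              # -> ['e1']
```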
Bolts
Used for filtering, functions, aggregations, joins, talking to databases, and more. Complex processing is achieved by chaining multiple bolts.
Bolt interfaces:
IRichBolt – the general interface for bolts; tuples must be acked manually.
IBasicBolt – a convenience interface for bolts that do filtering or simple functions; acking is automatic.
Topology
Stream grouping
Tells Storm how to distribute tuples across the available tasks.
Shuffle grouping – tuples are sent to tasks at random.
Fields grouping – tuples are partitioned by field value, so all tuples with a given field value are processed by the same task.
[Diagram: shuffle vs. fields grouping, routing values A/B/X/Y/… to tasks]
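The two groupings can be sketched as routing functions (a simplification of Storm's internals; the field name "user" and the event data are made up for illustration). Fields grouping hashes the grouping field, which is what guarantees that equal field values always reach the same task:

```python
import random

def shuffle_grouping(tuples, num_tasks, seed=0):
    """Shuffle grouping: each tuple goes to a random task."""
    rng = random.Random(seed)
    return [(rng.randrange(num_tasks), t) for t in tuples]

def fields_grouping(tuples, key, num_tasks):
    """Fields grouping: hash the grouping field so every tuple with the
    same field value lands on the same task."""
    return [(hash(t[key]) % num_tasks, t) for t in tuples]

events = [{"user": "A", "n": 1}, {"user": "B", "n": 2}, {"user": "A", "n": 3}]
routed = fields_grouping(events, "user", num_tasks=4)
# Both "A" tuples are guaranteed to map to the same task index:
print(routed[0][0] == routed[2][0])  # -> True
```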
Storm architecture
Nimbus – master node
ZooKeeper – cluster coordination
Supervisor – worker processes
Storm cluster
Nimbus: the master node; there is only one per cluster. Reassigns tasks when a worker node fails.
ZooKeeper: the communication backbone of the cluster; maintains state to aid failover and recovery.
Supervisor: runs on a worker node and governs its worker processes.
Storm cluster – runtime components
[Diagram: Nimbus and a ZooKeeper node coordinating a worker node; on the worker node, a Supervisor manages worker processes, each worker process runs executors, and each executor runs one or more tasks]
Code / Demo