Real-Time Processing with Apache Flume, Kafka, and Storm
Kamlesh Dhawale, Ankalytics
Topics
Flume, Kafka, Storm, Demo
Flume
Used for creating streaming data flows. Distributed and reliable.
Supports many inbound ingest protocols, real-time streaming, and offline/batch processing.
Flume components
Web/File … → Source → Channel → Sink → HDFS/NoSQL …
Source
Examples: HTTP, Spool Directory, Exec.

HTTP:
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler

Spool Directory:
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true

Exec:
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
Channel
Memory – high throughput, not reliable
JDBC – durable, but slower
File – good throughput, supports recovery
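A file channel, for example, needs checkpoint and data directories for its recovery log. A minimal sketch in the same config style as the other slides (the directory paths here are illustrative assumptions, not values from the deck):

```
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
```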
Sink
Examples: HDFS, Hive, Kafka.

HDFS:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-

Hive:
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift:// :9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs

Kafka:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
Flow
Agents can be chained, and flows can multiplex, fan in, and fan out.
Config file

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port =

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting Flume
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Kafka
Yet another messaging system?
Design considerations: log aggregation, distributed operation, batching messages to reduce the number of connections, offline/periodic consumption, and a pull model.
Architecture
Reliability
Uses ZooKeeper for node and consumer status.
At-least-once delivery (by managing offsets yourself you can get exactly-once processing).
Built-in data-loss auditing.
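The offset mechanics behind this can be sketched in plain Python (this mimics the semantics, it is not the Kafka API): committing the offset only after processing gives at-least-once delivery, and idempotent processing turns redelivery into effectively exactly-once results.

```python
# Sketch: per-consumer offsets on a single partition's append-only log.
log = ["a", "b", "c", "d"]      # messages in one partition
committed_offset = 0            # last offset the consumer committed
seen = set()                    # processed-message ids (for idempotence)

def consume(crash_before_commit=False):
    """Read from the last committed offset; commit only after processing."""
    global committed_offset
    for offset in range(committed_offset, len(log)):
        msg = log[offset]
        if msg not in seen:     # idempotent: skip already-processed messages
            seen.add(msg)
        if crash_before_commit and offset == 1:
            return              # crash: offset 1 was processed but not
                                # committed, so it will be redelivered
        committed_offset = offset + 1

consume(crash_before_commit=True)  # processes "a", "b"; commits only "a"
consume()                          # redelivers "b" (at-least-once), then c, d
print(committed_offset, sorted(seen))  # -> 4 ['a', 'b', 'c', 'd']
```

The redelivered "b" is processed twice at the delivery level but only once at the result level, which is the distinction the slide draws.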
Topic
A producer writes to a topic and a consumer reads from it.
A topic is divided into an ordered set of partitions. Each partition is consumed by one consumer at a time, and an offset is maintained per consumer per partition.
More on topics…
The partition count determines the maximum consumer parallelism. Each partition can have multiple replicas, which provides failover. A broker can host multiple partitions; each partition has exactly one leader broker, which receives messages and replicates them to the other replicas.
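Why the partition count caps consumer parallelism can be shown with a toy assignment function (a simplification of Kafka's real assignment strategies): partitions are divided among the consumers in a group, one consumer per partition, so consumers beyond the partition count sit idle.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly
    one consumer, so at most len(partitions) consumers do any work."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: one consumer is necessarily idle.
a = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(a)  # -> {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```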
Configuration
server.properties file: host name, port, and ZooKeeper connection.
Start Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Storm
Key concepts
Topologies, streams, spouts, bolts, stream groupings, reliability, tasks, and workers
Streams
An unbounded sequence of tuples; a tuple is a list of values.
Spouts
Generate streams. Can be reliable or unreliable.
Reliable spouts use ack() and fail(), so failed tuples can be replayed.
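The ack()/fail() contract can be sketched in plain Python (this mimics the semantics only; a real reliable spout implements Storm's spout interface on the JVM):

```python
class ReliableSpoutSketch:
    """Mimics a reliable spout: emitted tuples stay pending until acked;
    fail() puts the tuple back on the queue so it can be replayed."""
    def __init__(self, messages):
        self.queue = list(messages)  # tuples waiting to be emitted
        self.pending = {}            # msg_id -> tuple, awaiting ack

    def next_tuple(self):
        if not self.queue:
            return None
        tup = self.queue.pop(0)
        msg_id = id(tup)             # illustrative message id
        self.pending[msg_id] = tup
        return msg_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)  # fully processed: forget it

    def fail(self, msg_id):
        self.queue.append(self.pending.pop(msg_id))  # re-queue for replay

spout = ReliableSpoutSketch(["e1", "e2"])
mid1, _ = spout.next_tuple()
spout.fail(mid1)                # e1 goes back on the queue for replay
mid2, _ = spout.next_tuple()    # emits e2
spout.ack(mid2)
print(spout.queue)              # -> ['e1']
```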
Bolts
Used for filtering, functions, aggregations, joins, talking to databases, and more. Complex processing is achieved by chaining multiple bolts.
Bolt interfaces:
IRichBolt – the general interface for bolts; tuples must be acked manually.
IBasicBolt – a convenience interface for bolts that do filtering or simple functions; acking is automatic.
Topology
Stream grouping
Tells Storm how to distribute tuples across the available tasks.
Shuffle grouping – tuples are sent to tasks at random.
Fields grouping – tuples are partitioned by field value, so all tuples with a given field value are processed by the same task.
[Diagram: shuffle vs. fields grouping, routing values A/B/X/Y/… to tasks]
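The two groupings can be sketched as routing functions (a simplification of Storm's internals; the field name "user" and the event data are made up for illustration). Fields grouping hashes the grouping field, which is what guarantees that equal field values always reach the same task:

```python
import random

def shuffle_grouping(tuples, num_tasks, seed=0):
    """Shuffle grouping: each tuple goes to a random task."""
    rng = random.Random(seed)
    return [(rng.randrange(num_tasks), t) for t in tuples]

def fields_grouping(tuples, key, num_tasks):
    """Fields grouping: hash the grouping field so every tuple with the
    same field value lands on the same task."""
    return [(hash(t[key]) % num_tasks, t) for t in tuples]

events = [{"user": "A", "n": 1}, {"user": "B", "n": 2}, {"user": "A", "n": 3}]
routed = fields_grouping(events, "user", num_tasks=4)
# Both "A" tuples are guaranteed to map to the same task index:
print(routed[0][0] == routed[2][0])  # -> True
```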
Storm architecture
Nimbus – master node
ZooKeeper – cluster coordination
Supervisor – worker processes
Storm cluster
Nimbus: the master node; there is only one per cluster. Reassigns tasks when a worker node fails.
ZooKeeper: the communication backbone of the cluster; maintains state to aid failover and recovery.
Supervisor: runs on a worker node and governs its worker processes.
Storm cluster – runtime components
[Diagram: Nimbus and a ZooKeeper node coordinating a worker node; on the worker node, a Supervisor manages worker processes, each worker process runs executors, and each executor runs one or more tasks]
Code / Demo