Flume is a system used to capture streaming data at a high rate and deliver it into a target system. The target system can be Hadoop (HDFS) or JMS; in the latest Flume, it can also write into HBase. Flume has two products:
Flume (Old Generation)
Flume – NG (Next Generation)
Flume (Old Generation):- Old-generation Flume runs as a separate cluster, external to Hadoop (it is not part of Hadoop). Flume has two types of nodes:
(Figure: single-flow architecture of old Flume)
Physical node (a JVM process on a machine) - there are two types of physical nodes
Logical node (a configuration file that contains information about the source and sink)
(a) Flume master:-
Its fault tolerance is poor (the master is a single point of failure).
(b) Flume slaves:-
There are two types of flume slaves:
Flume agent - The agent is responsible for collecting streams: whenever an event is generated, the agent immediately collects it and writes it into a flume collector.
Flume collector: - It collects events from the agents, buffers them until a threshold is reached, and then writes them into the given target. The threshold can be volume-based (size in bytes) or count-based (number of events). When the collector's threshold is reached, it informs the flume master; the master then requests ZooKeeper to create a connection to Hadoop. Once the connection is created, the flume collector writes the buffered events into HDFS.
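The buffer-and-flush cycle just described can be sketched in Python (an illustrative sketch, not Flume's actual code; `write_to_target` and both threshold parameters are hypothetical names):

```python
class CollectorBuffer:
    """Buffers events until a count- or volume-based threshold, then flushes."""

    def __init__(self, write_to_target, max_events=100, max_bytes=64 * 1024):
        self.write_to_target = write_to_target  # e.g. a function that writes one batch to HDFS
        self.max_events = max_events            # count-based threshold (number of events)
        self.max_bytes = max_bytes              # volume-based threshold (bytes)
        self.events, self.size = [], 0

    def collect(self, event: bytes):
        self.events.append(event)
        self.size += len(event)
        # Flush when either threshold is reached
        if len(self.events) >= self.max_events or self.size >= self.max_bytes:
            self.write_to_target(self.events)
            self.events, self.size = [], 0  # buffer is cleared; the next batch starts empty
```

Note that while `write_to_target` runs, no new events can be collected; that synchronous flush is the bottleneck behind the disadvantages listed below.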
In the above model, the flume collector performs 3 actions:
-Collecting events from the agent
-Buffering (preparing a batch of a given volume)
-Writing into the target
Disadvantages of old generation Flume:-
While the source application generates events at a high frequency, the agent keeps collecting them and writing them into the flume collector. Once the threshold is reached and the collector writes into Hadoop, the previously buffered data is flushed out (cleared); only then can the next batch be prepared, and until then the agent has to wait.
Problem-1: During this process, the source application generates many more events; i.e., the rate of Flume's collection is lower than the rate of event generation, which leads to a huge delay.
Problem-2: The flume collector's data can be sent to only one target, either HDFS or JMS. We cannot send the same data to multiple targets (JMS and HDFS) at once.
Fault tolerance is very poor. During flow execution, if an agent machine goes down, the master assigns that agent's work to another slave machine.
Problem-3: If a flume collector goes down, the master promotes another slave to act as the collector, but the previously buffered events are lost, which leads to loss of data.
To overcome these three problems, a new model called Flume-NG was introduced.
Flume – NG (Next Generation):- In Flume-NG the flume collector is removed from the architecture; only the flume agent exists. In Flume-NG the agent is a logical component, not a node.
(Figure: Flume-NG architecture)
Agent has 3 components.
(a) Source:- It is used to collect events from the streaming source and write them into the channel.
Source Behavior:- Whenever the source collects an event from a streaming source, it places a checkpoint on the current event and immediately attempts to write it to the channel. If the channel is busy, the source waits 3 seconds and reattempts the write. If the event is written successfully, the channel sends an acknowledgment to the source, and the source collects the next event from the streaming app, starting from the checkpointed position. If the write fails, the source gets a "channel write exception"; in that case it makes 3 attempts to write the same event to the channel. After 3 failed attempts, the source flushes the event and recollects the same event from the streaming app. This is called one cycle of reattempts, and up to 3 such cycles are attempted. If the error still repeats, the source leaves that event, collects the next one, and the checkpoint is moved forward accordingly.
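The retry discipline just described (3 write attempts per cycle, up to 3 cycles, then skip the event and advance the checkpoint) can be sketched roughly as follows; the channel object and the exception name are illustrative, not Flume's real API:

```python
import time

class ChannelWriteException(Exception):
    """Raised when a write attempt to the channel fails (illustrative)."""

def deliver_event(channel, event, wait_secs=3, attempts_per_cycle=3, cycles=3):
    """Try to deliver one event; return True if written, False if skipped.

    Mirrors the behavior described above: 3 write attempts per cycle,
    the event is re-collected between cycles, and after 3 failed cycles
    the event is left behind and the checkpoint moves to the next event.
    """
    for cycle in range(cycles):
        for attempt in range(attempts_per_cycle):
            try:
                channel.write(event)   # a successful write is acknowledged
                return True
            except ChannelWriteException:
                time.sleep(wait_secs)  # wait before the next write attempt
        # 3 failed attempts: flush and re-collect the same event (one cycle done)
    return False  # 3 cycles failed: skip the event; checkpoint advances
```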
(b) Channel: - It collects events from the source, filters them, and buffers them for the sink. Buffering continues until the given batch size or threshold is reached; the threshold can be count-based (number of events) or volume-based (volume must be given in bytes). Buffering can happen in RAM or on disk; the default is RAM. Once the channel threshold is reached, the channel informs the sink; the sink then collects the events in the form of transactions and writes them into the target.
(c) Sink:- It collects events from the channel and writes them into the output, i.e., JMS/Hadoop. If the sink successfully writes the buffered data into the target, it receives an acknowledgment for each transaction from the target; each acknowledgment completes one transaction. If a transaction fails, the sink makes 3 reattempts; if it still fails after the 9th attempt (3 cycles of 3 attempts), the sink flushes the current transaction's events and collects the events of the next transaction. Because events can be dropped this way, this process should be used only for non-sensitive data.
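In Flume-NG, these three components are wired together in a single properties file per agent. The fragment below is an illustrative sketch (the agent name `a1`, the netcat source, and the HDFS path are placeholders), showing where the channel's count-based and byte-based thresholds are configured:

```properties
# Name the components of agent "a1"
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffering (the default style); capacity is the
# count-based threshold, byteCapacity the volume-based one (in bytes)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
a1.channels.c1.byteCapacity = 67108864

# Sink: writes batches of events (transactions) to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.batchSize = 100
```

Such an agent is started with the `flume-ng agent` command, naming the agent and pointing at this file.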