Sunday, May 8, 2011

Installing Flume in the cluster - A complete step by step tutorial

Flume Cluster Setup :




In the previous post, you used Flume on a single machine with the default configuration settings. With the default settings, nodes automatically search for a Master on localhost on a standard port. In order for the Flume nodes to find the Master in a fully distributed setup, you must specify site-specific static configuration settings.

Before we start:-
Before configuring Flume, you need a running Hadoop cluster, which will serve as the centralized storage for Flume. Please refer to the "Installing Hadoop in the cluster - A complete step-by-step tutorial" post before continuing.

Installation steps:-

Perform the following steps on the master machine.
1. Download flume-0.9.1.tar.gz from https://github.com/cloudera/flume/downloads and extract it to a directory on your computer. The rest of this tutorial refers to the Flume installation root as $FLUME_INSTALL_DIR.

2. Edit the file /etc/hosts on the master machine (and also on the agent and collector machines) and add the following lines.

192.168.41.67 flume-master
192.168.41.53 flume-collector
hadoop-namenode-machine-IP hadoop-namenode

3. Open the file $FLUME_INSTALL_DIR/conf/flume-site.xml and edit the following properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>flume.master.servers</name>
<value>flume-master</value>
</property>
<property>
<name>flume.collector.event.host</name>
<value>flume-collector</value>
<description>This is the host name of the default "remote" collector.
</description>
</property>
<property>
<name>flume.collector.port</name>
<value>35853</value>
<description>This is the default TCP port that the collector listens on to receive the events it is collecting.
</description>
</property>
</configuration>

4. Repeat steps 1 to 3 on the collector and agent machines.
Note: - The Agent Flume nodes are co-located on machines with the service that is producing logs.
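Since the same flume-site.xml is needed on every machine, it can help to generate the file from one place instead of editing it by hand each time. The following is only a minimal sketch of that idea: the function name build_flume_site and the SITE_PROPERTIES dictionary are hypothetical names introduced here, and the property values are the ones from step 3. It omits the optional stylesheet line and the descriptions.

```python
import xml.etree.ElementTree as ET

# Property values copied from step 3 of this tutorial.
SITE_PROPERTIES = {
    "flume.master.servers": "flume-master",
    "flume.collector.event.host": "flume-collector",
    "flume.collector.port": "35853",
}


def build_flume_site(properties):
    """Return the text of a flume-site.xml file for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0"?>\n' + body + "\n"


if __name__ == "__main__":
    # Hypothetical usage: redirect this output into
    # $FLUME_INSTALL_DIR/conf/flume-site.xml on each machine.
    print(build_flume_site(SITE_PROPERTIES), end="")
```

The output is written on a single line, which is still valid XML. You would still need to copy the generated file to the master, collector, and agent machines yourself.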

Start flume processes:-

1. Start Flume master:- The Master can be started manually by executing the following command on the master machine.
        1.1 $FLUME_INSTALL_DIR/bin/flume master
1.2 After the Master is started, you can access it by pointing a web browser to http://flume-master:35871/. This web page displays the status of all Flume nodes that have contacted the Master, and shows each node’s currently assigned configuration. If you start the Master while no Flume nodes are running, the status and configuration tables will be empty.

2. Start Flume collector:- The Collector can be started manually by executing the following command on the collector machine.
         2.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-collector

2.2 To check whether a Flume node (collector) is up, point your browser to the Flume node status page at http://flume-collector:35862/. Each node displays its own data in a single table that includes diagnostics and metrics data about the node, its data flows, and system metrics about the machine it is running on. If multiple instances of the Flume node program are running on one machine, each new instance automatically increments the port number, attempts to bind to the next port (35863, 35864, etc.), and logs the port it eventually selected.

2.3 If the node is up, you should also refresh the Master’s status page (http://flume-master:35871) to make sure that the node has contacted the Master. You brought up one node whose name is flume-collector, so you should have one node listed in the Master’s node status table.

3. Start Flume agent:- The Agent can be started manually by executing the following command on the agent machine (agent Flume nodes are co-located on machines with the service that is producing logs).
      3.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-agent

      3.2 Perform step 2.3 again.

Note: - Similarly, you can start other Flume agents by executing the following commands:-
Start second agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent1
Start third agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent2
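The port behaviour described in step 2.2 (try 35862, then 35863, 35864, and so on) can be illustrated with a few lines of Python. This is not Flume's own code, only a sketch of the general "bind to the next free port" technique. The function name bind_first_free and the limit of 10 attempts are assumptions made for this example.

```python
import socket


def bind_first_free(base_port, host="127.0.0.1", max_tries=10):
    """Bind a TCP socket to base_port, or to the next free port above it.

    Returns (socket, port). Raises OSError if no port in the range is free.
    """
    for port in range(base_port, base_port + max_tries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # bind failed (port most likely in use), try the next one
            continue
        return sock, port
    raise OSError(f"no free port in {base_port}..{base_port + max_tries - 1}")


if __name__ == "__main__":
    # Two "nodes" on the same machine: the second one cannot reuse the
    # port held by the first, so it ends up on a different port.
    first, first_port = bind_first_free(35862)
    second, second_port = bind_first_free(35862)
    print("first node port:", first_port)
    print("second node port:", second_port)
    first.close()
    second.close()
```

This is why the status page of a second node on the same machine may be found on 35863 rather than 35862; the node's log tells you which port was actually chosen.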

Configuring Flume nodes via master:-

1. Configuration of Flume Collector: - On the Master’s web page click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-collector
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-namenode:9000/user/flume/logs/%H00","%{host}-")
Note: - The collector writes to an HDFS cluster (assuming the HDFS namenode machine is called hadoop-namenode).

2. Configuration of Flume Agent:- On the Master’s web page, click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-agent
Source: tail("path/to/logfile")
Ex:- tail("/home/$USER/logAnalytics/dot.log")
Sink: agentSink("flume-collector",35853)

Note: - Use the same source and sink configuration for each Flume agent, entering that agent's own node name.
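When you have several agents, typing the same values into the "Configure a node" form again and again is error-prone. The sketch below only builds the three form values (node name, source, sink) as strings, in the same syntax used above, so you can copy them into the form. The function names collector_config and agent_config are hypothetical, and the log file paths in the usage part are made-up examples.

```python
def collector_config(node, listen_port, hdfs_dir, file_prefix):
    """Form values for a collector node that writes to HDFS."""
    return {
        "node": node,
        "source": f"collectorSource({listen_port})",
        "sink": f'collectorSink("{hdfs_dir}","{file_prefix}")',
    }


def agent_config(node, logfile, collector_host, collector_port):
    """Form values for an agent node that tails a log file."""
    return {
        "node": node,
        "source": f'tail("{logfile}")',
        "sink": f'agentSink("{collector_host}",{collector_port})',
    }


if __name__ == "__main__":
    configs = [
        collector_config(
            "flume-collector",
            35853,
            "hdfs://hadoop-namenode:9000/user/flume/logs/",
            "%{host}-",
        ),
        # Hypothetical log file paths, one agent per line.
        agent_config("flume-agent", "/var/log/app/a.log", "flume-collector", 35853),
        agent_config("flume-agent1", "/var/log/app/b.log", "flume-collector", 35853),
    ]
    for cfg in configs:
        print("Node name:", cfg["node"])
        print("Source:", cfg["source"])
        print("Sink:", cfg["sink"])
        print()
```

Only the node name and the tailed file change between agents; the sink always points at the same collector host and port.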

16 comments:

Anonymous said...

I'm having trouble figuring out how to change these settings if I have multiple collectors, particularly the values in flume.collector.event.host. Would I list them all like in the flume.master.servers field? Thanks!

Ankit Jain said...

Hi,

Yes we can list them all like in the flume.master.servers field.


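<property>
<name>flume.collector.event.host</name>
<value>192.168.41.53,192.168.41.85</value>
<description>This is the host name of the default "remote" collector.
</description>
</property>
<property>
<name>flume.collector.port</name>
<value>35853</value>
<description>This default tcp port that the collector listens to in order to receive events it is collecting.
</description>
</property>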



Thanks,
Ankit

Jay said...

Interesting. I kept flume.collector.event.host "localhost" on all of the agents and collectors in my development setup, and didn't have any problems. The setting is the same in production, where we have three flows going from each of the 60 physical agents, two physical collectors, and 1 master. The agents all have autoE2EAgent sinks; some tailing, some listening on ports for sources. The collectors have autoCollector sources and S3 HDFS sinks.

Now, we haven't quite diagnosed why this is happening, but multiple agents' logical nodes either ERROR or are LOST every day. The errors are often hidden in the logs, but mainly have to do with communication problems with the collector.

I wonder whether the problems would ease if I changed this field to list the hostnames of the collectors.

Just out of curiosity, how stable is your flume set up? Do you have any monitoring on it, or run any scripts to auto-restart nodes, etc?

Ankit Jain said...

My cluster is quite stable. I don't have any monitoring on it.

Unknown said...

1. Failed to open thrift event sink to flume-collector:35853 : java.net.ConnectException: Connection refused
It failed even after I used the IP address instead of the host name.

How do I get started?

2. Where in Hadoop is the data pushed? Does the directory /zapads/flume/logs/ need to be created, or is it the Hadoop data directory?

I'm confused.

collectorSink( "hdfs://192.168.1.98:54310/zapads/flume /logs/ ", "%{host}-" )

RajeevGupta said...

Hi Ankit,
My setup is something like that ...
2 node cluster of hdfs one is master and other machines slave and master are slave node.

hadoop-master : flume-master as well as flume-agent
hadoop-slave : flume-collector
While writing the file from the collector, I get an exception like "failed on local exception java.io.ioexception broken pipe".

Could you help me with this? Also, the automatic port increment you mentioned for the agent node should work, but it is not working. One more thing: in a cluster setup, do we need flume-conf.xml? Even though I set the collector port in flume-site.xml, the collector opens on the default port from flume-conf.xml.

Version :flume-distribution-0.9.4 and hadoop-0.20.2

Anonymous said...

Is it recommended to have the Flume master and the Hadoop namenode on the same node?

Ankit Jain said...

Nope, you can deploy the Flume master and the Hadoop namenode on different machines.

chhaya vishwakarma said...

Hi Ankit,
Can I have agents and collectors on Windows machines and the master on a Unix machine? Will this create any problems?

Anonymous said...

How can I get data from a website like the Yahoo stock market? Can you suggest a method? Thank you.

Ankit Jain said...

You would have to write a custom Flume source that internally uses the Yahoo stock API to fetch the real-time stream.

Please let me know if more details are required.

Thanks,
Ankit

Anonymous said...

Hi Ankit ,

os : RHEL
CDH4
I followed your Flume document, but it's not working.

Anonymous said...

I got the error below. Can you please post a solution?

2013-04-23 06:26:25,285 [logicalNode flume-collector3-40] ERROR connector.DirectDriver: Driving src/sink failed! LazyOpenSource | LazyOpenDecorator because Server IPC version 7 cannot communicate with client version 3
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 3
at org.apache.hadoop.ipc.Client.call(Client.java:739)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at sun.proxy.$Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:168)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1420)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1448)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1436)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:197)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at com.cloudera.flume.handlers.hdfs.CustomDfsSink.open(CustomDfsSink.java:117)
at com.cloudera.flume.handlers.hdfs.EscapedCustomDfsSink.openWriter(EscapedCustomDfsSink.java:102)
at com.cloudera.flume.handlers.hdfs.EscapedCustomDfsSink.append(EscapedCustomDfsSink.java:117)
at com.cloudera.flume.core.EventSinkDecorator.append(EventSinkDecorator.java:58)
at com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:166)
at com.cloudera.flume.core.EventSinkDecorator.append(EventSinkDecorator.java:58)
at com.cloudera.flume.handlers.endtoend.AckChecksumChecker.append(AckChecksumChecker.java:178)
at com.cloudera.flume.collector.CollectorSink.append(CollectorSink.java:151)
at com.cloudera.flume.core.EventSinkDecorator.append(EventSinkDecorator.java:58)
at com.cloudera.flume.handlers.debug.LazyOpenDecorator.append(LazyOpenDecorator.java:69)
at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:92)

usha Punia said...

Is it compulsory to install Hadoop to get Flume working, or is there another way, so that we can skip configuring a Hadoop cluster?

Ankit Jain said...

Hi Usha,
It's not mandatory to install a Hadoop cluster for Flume if you don't want to store your tailed log data in HDFS.

Bhagyashree said...

Hi Ankit,
I am getting an address bind exception while running the Flume agent for the Avro source.
Can you give me some pointers on this?
