Tuesday 11 October 2011

Facebook Scribe + Hadoop

Hi
So last Thursday I got my MS in CS, which means I now have some free time to play with stuff. And the stuff I like to play with is software :P

Anyway, I'm back to Scribe, and this time I want to configure it so my logs get stored in HDFS.

So the architecture idea is a distributed logger saving its data into a distributed filesystem. Easy, right? Well... kind of. It turns out that I have Boost 1.47 (see my previous post), and you have to make some changes to build against that version. I wanted to use Hadoop 0.21.0, but it looks like Scribe's HDFS support was written against Hadoop 0.20, and it also looks like the HDFS team does not like backward-compatible interfaces :(

So let's go step by step.
In order to compile and install Scribe you will need libhdfs, and to build it you have two options:
Option 1: use the Ant build (it pulls its dependencies through Ivy) in $HADOOP_HOME/hdfs:
$ ant clean compile-c++-libhdfs

I don't know if that works for you; in my case it didn't. It requires hadoop-common-0.21.0 from Maven, and even after installing that it still failed, so I decided to go with the easier option:
Option 2: build it directly from $HADOOP_HOME/hdfs/src/c++/libhdfs:
$ make
$ make install

Done. It will put the libs in $HADOOP_HOME/dist/lib/
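Before moving on, it's worth checking that the build actually landed there; you should see libhdfs.so (plus libtool companions such as libhdfs.la; exact names may vary by platform):

$ ls $HADOOP_HOME/dist/lib/ | grep hdfs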

So now it's time to configure Scribe.
$ ./configure --enable-hdfs CPPFLAGS="-I$HADOOP_HOME/hdfs/src/c++/libhdfs/ -I/usr/lib/jvm/java-6-sun-1.6.0.26/include/ -I/usr/lib/jvm/java-6-sun-1.6.0.26/include/linux" LDFLAGS="-L/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/i386/client/ -L$HADOOP_HOME/dist/lib/" --with-hadooppath=$HADOOP_HOME

(Note the double quotes around CPPFLAGS and LDFLAGS, so the shell actually expands $HADOOP_HOME.)

Once again (as in my previous post), you will need to fix config.status by adding -DBOOST_FILESYSTEM_VERSION=2 to S["DEFS"].
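For reference, after the edit the relevant line in my config.status looks roughly like this (the -DHAVE_CONFIG_H part is just what my system had there already; yours may differ), and then you re-run config.status so the Makefiles pick up the change:

S["DEFS"]="-DHAVE_CONFIG_H -DBOOST_FILESYSTEM_VERSION=2"
$ ./config.status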

Now we are ready to build and install Scribe:
$ make

... fails ... remember that incompatibility I talked about a couple of minutes ago? :D

OK... this part is your call, but I chose to change HdfsFile.cpp and pass the extra parameter that hdfsDelete now takes, which means "delete recursively":

hdfsDelete(fileSys, filename.c_str(), 1);
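For context, this is the API change that breaks the build. As far as I can tell from the headers, the two signatures are:

/* Hadoop 0.20.x hdfs.h: int hdfsDelete(hdfsFS fs, const char* path); */
/* Hadoop 0.21.0 hdfs.h: int hdfsDelete(hdfsFS fs, const char* path, int recursive); */

Passing 1 as the recursive flag should preserve the old behaviour, where deleting a directory removed its contents too.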

Now you can build (hopefully ...) without issues:
$ make
$ sudo make install

One last thing: if scribed fails to find libhdfs when you run it, you can set LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH=~/dev/hadoop-0.21.0/dist/lib:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/i386/server/
Hopefully it works for you too :D
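If you want to double-check which shared libraries scribed actually resolves at runtime, ldd is handy (assuming scribed ended up on your PATH):

$ ldd $(which scribed) | grep -E 'hdfs|jvm'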

If you need a sample configuration for HDFS, you can get it from the examples directory in your Scribe dist.
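In case you don't have the examples handy, here is a minimal sketch of what mine looks like; yourserver, yourport and the paths are placeholders you will need to adapt:

port=1463
check_interval=3

<store>
category=default
type=buffer
retry_interval=30

<primary>
type=file
fs_type=hdfs
file_path=hdfs://yourserver:yourport/scribedata
use_hostname_sub_directory=yes
base_filename=thisisoverwritten
max_size=40000000
rotate_period=daily
add_newlines=1
</primary>

<secondary>
type=file
fs_type=std
file_path=/tmp/scribe_hdfs_buffer
base_filename=thisisoverwritten
max_size=4000000
</secondary>
</store>

The buffer store writes to HDFS as its primary target and falls back to a local file store if HDFS is unreachable.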

Possible problems you might find when running Scribe:

1- When trying to store, it will complain about not finding org.apache.hadoop....Configuration.
Solution: add hadoop-common to your classpath. Also add commons-logging if you don't have it already:

export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-0.21.0.jar:$HADOOP_HOME/hadoop-hdfs-0.21.0.jar:$HADOOP_HOME/lib/commons-logging-api-1.1.jar:$HADOOP_HOME/lib/commons-logging-1.1.1.jar
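A quick way to verify that a class is actually reachable through your classpath is javap; here I'm assuming the class the error complains about is org.apache.hadoop.conf.Configuration:

$ javap -classpath $CLASSPATH org.apache.hadoop.conf.Configuration

If javap prints the class outline, the classpath is fine; if it says the class was not found, keep adding jars.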

2- An error saying "HDFS is not configured for file hdfs://yourserver:yourport/filePath".
You will have to create filePath manually:
$HADOOP_HOME/bin/hadoop fs -mkdir filePath

After that, run Scribe again and see if it still fails.
3- If it fails again, this time with the message "Exception: hdfsListDirectory call failed", just copy a file into your path:
$HADOOP_HOME/bin/hadoop fs -copyFromLocal someFile filePath

Run Scribe again; hopefully it will work now, and after that you can delete your file "someFile".
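Deleting the dummy file is just another fs command:

$ $HADOOP_HOME/bin/hadoop fs -rm filePath/someFile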

Yes, it's ugly; I haven't found a better way yet :S


Cheers,
Fernando
