Wednesday 31 August 2011

Hadoop with Scala WordCount Example

I had some "free" time in the office, so I decided to start researching Hadoop. I had a general idea of its concepts and had read a couple of articles, but I had never worked with it (although I have worked with similar technologies).
I also wanted to learn Scala; I think it's interesting to be able to use functional programming alongside Java.

So I thought: mmmm, if the map part of a MapReduce job is basically applying a function (map) to some data... then functional code should be better suited to that task in certain situations and, more importantly, it should be more readable.
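To make that intuition concrete, here's a tiny stand-alone sketch in plain Scala (no Hadoop involved) of the same map/group/reduce shape, using made-up in-memory input:

val lines = List("hello world", "hello scala")  // toy "input"
val counts = lines
  .flatMap(_.split(" "))                        // map: lines -> words
  .groupBy(identity)                            // shuffle: group equal words
  .map { case (w, ws) => (w, ws.size) }         // reduce: count each group
// counts == Map("hello" -> 2, "world" -> 1, "scala" -> 1)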
=======================================================================
Update: Shadoop is basically a Scala object that wraps your Int, Boolean, and String values in the corresponding Hadoop Writable wrappers. This way you can write
val one = 1 instead of val one = new IntWritable(1)
=======================================================================
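I haven't dug into Shadoop's source here, but the trick it relies on is roughly this kind of implicit conversion (a minimal sketch of the idea, not Shadoop's actual API):

import org.apache.hadoop.io.{IntWritable, Text}

object WritableConversions {
  // Hypothetical conversions in the spirit of Shadoop, not its real API
  implicit def intToWritable(i: Int): IntWritable = new IntWritable(i)
  implicit def stringToText(s: String): Text = new Text(s)
}

// With `import WritableConversions._` in scope you can write:
//   val one: IntWritable = 1       // instead of new IntWritable(1)
//   val word: Text = "hello"       // instead of new Text("hello")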
So, without further introduction, let's get to the example.
1- Get the Hadoop WordCount example source code


You'll see that they define an inner class:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>

That's the map implementation I want to get rid of. To do that I wrote a Scala class (I could have used a Scala object and kept the Map name; I'm not sure which is best in terms of performance).

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapred.{MapReduceBase, Mapper, OutputCollector, Reporter}

class WordCountScala extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {
  // Reused across map() calls to avoid allocating new Writables per word
  val one = new IntWritable(1)
  val word = new Text()

  def map(key: LongWritable, value: Text, output: OutputCollector[Text, IntWritable], reporter: Reporter): Unit = {
    val line = value.toString
    // Emit (word, 1) for every blank-separated token in the line
    line.split(" ").foreach { w =>
      word.set(w)
      output.collect(word, one)
    }
  }
}
Ok, it's not the cleanest example of the advantages of functional programming, but the idea is to show that you can use Scala with Hadoop ;)

The code is quite simple: for every word in the value string (a word is defined here the same way as in the sample code from Apache: any string of chars without blanks) we emit a count of 1.

You can go back to the WordCount.java from the example, remove the Map inner class, and set the mapper class to WordCountScala.

The code will look like this:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Sums the 1s emitted by the mapper for each word
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The Scala mapper replaces the original Map inner class
        conf.setMapperClass(WordCountScala.class);
        // Reduce doubles as a combiner: summing is associative and commutative
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}



Time to compile this thing. I use the Scala IDE for Eclipse, but if you want to build it from the console, here are the steps, assuming you have the code for both classes in the $HADOOP_HOME/testScala folder:

1- compile scala:
scalac -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar WordCountScala.scala

2- compile java:
javac -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar:$SCALA_HOME/lib/scala-library.jar:. WordCount.java

3- run (with no cluster configuration on the classpath, this executes the job in Hadoop's local standalone mode):
java -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar:../lib/jackson-core-asl-1.4.2.jar:../lib/jackson-mapper-asl-1.4.2.jar:../lib/commons-httpclient-3.1.jar:$SCALA_HOME/lib/scala-library.jar:. WordCount ../input/ ../output/

4- check the output:
cat ../output/*
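With TextOutputFormat, each output line is a key and its count separated by a tab, so for a made-up input file containing just "hello world hello" you'd see something like:

hello	2
world	1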

1 comment:

  1. I have added a new project on GitHub that has an example of WordCount in Scala using the new Hadoop APIs: https://github.com/smishra/scala-hadoop
