Friday 16 December 2011

R: First Impressions (and a bit of Apache Mahout)

I've been in Madrid for the last two months, so I haven't had much time to write here, but I want to take some time now to write about R.

R is a programming language for statistics and data mining. For the project I'm working on we wanted to analyse similarities in the economic behaviour of different zones.
I did it with Apache Mahout and it worked great, but along the way someone suggested using R instead (because I don't have a huge data set). I explained that even though the data set is small now, I'm only doing POCs and it will grow, so I needed Mahout.

The thing is that afterwards I realised I had to analyse other things with less data (e.g. one day of activity for elderly people in a small area), and I thought: "mmm... that R language might be good for me".

So I downloaded it, "installed" it and quickly got what I needed: a beautiful (?) ggplot graph.
It shows expenses at 9:50 am, all aggregated data. I then wrote a simple R script, pulled the data for a whole month and, combined with ffmpeg, turned it into a video (sorry, can't show it yet).
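I can't share the real script or data, but a minimal sketch of that kind of plot (file name and columns are made up for illustration) looks roughly like this:

# A minimal sketch, not the real script: plot aggregated expenses per zone
# for one time slot and save the frame to disk.
library(ggplot2)

expenses <- read.csv("expenses_0950.csv")  # made-up file; columns: zone, amount

p <- ggplot(expenses, aes(x = zone, y = amount)) +
  geom_bar(stat = "identity") +
  xlab("Zone") + ylab("Expenses at 9:50 am")

ggsave("frame_0950.png", plot = p)

Run that once per time slot and you get the frames that ffmpeg stitches into the video.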

Let me show you some simple R code (most likely there are better ways to do this; I'm an R newbie).
I try to follow the Google R style guide.

The language has something beautiful: once you load a data set you can easily get statistical information about it. For example, I will create a vector with people's ages:
> myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43, 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)

> summary(myData.ages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.00   34.00   32.31   53.00   90.00 
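Individual statistics are just as easy to get:

mean(myData.ages)      # average age
sd(myData.ages)        # standard deviation
quantile(myData.ages)  # the quartiles behind summary()
hist(myData.ages)      # quick histogram of the age distribution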

Now let's do something a little more interesting. Suppose we have the ages of 10 people:
testData.age1 <- c(age=2);
testData.age2 <- c(age=10);
testData.age3 <- c(age=17);
testData.age4 <- c(age=19);
testData.age5 <- c(age=30);
testData.age6 <- c(age=27);
testData.age7 <- c(age=40);
testData.age8 <- c(age=48);
testData.age9 <- c(age=66);
testData.age10 <- c(age=97);
We can create a 10x1 matrix from them:


> data.myData <- matrix(c(testData.age1, testData.age2, testData.age3, testData.age4, testData.age5, testData.age6, testData.age7, testData.age8, testData.age9, testData.age10), ncol=1, dimnames=list(c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")))
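As a side note, the same matrix can be built a bit more concisely (just an alternative sketch, the result is effectively the same):

# Same 10x1 matrix, built from a single vector with generated row names.
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)
data.myData <- matrix(ages, ncol = 1)
rownames(data.myData) <- paste("test", 1:10, sep = "")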



Now we can run a basic clustering algorithm, kmeans, asking for 4 clusters:

kmeans(x=data.myData, centers=4)$cluster
 test1  test2  test3  test4  test5  test6  test7  test8  test9 test10 
     3      3      1      1      1      1      2      2      2      4

So what does this mean?
Basically, people 1 and 2 (ages 2 and 10) are grouped together (kids), 3, 4, 5 and 6 (ages 17 to 30) are grouped (young people), 7, 8 and 9 (ages 40 to 66) are closely related (adults?) and 97 is alone (elderly?).
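If you want to sanity-check that interpretation, the kmeans result also exposes the cluster centres and sizes (keep in mind kmeans starts from random centres, so the cluster numbering can change between runs):

fit <- kmeans(x = data.myData, centers = 4)
fit$centers  # mean age of each cluster
fit$size     # how many people fell into each cluster
fit$cluster  # the same assignment shown above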

Mahout vs R 
Why should I use Mahout when R has more algorithms implemented and you can pull the data out of HDFS through Hive? I would use Mahout when the calculation itself has to run distributed; if the distributed part is just extracting a "small" subset of the data to process, then I would go with R + HDFS, basically because R is more mature than Mahout and, as I said, it already has lots of algorithms implemented.

These are my first impressions; I might change my mind after reading a little bit more.




Tuesday 11 October 2011

Facebook Scribe + Hadoop

Hi
Last Thursday I got my MS in CS, so now I have some free time to play with stuff. And the stuff I like to play with is software :P

Anyway, I'm back to Scribe, and now I want to configure it so my logs get stored in HDFS.

So the architectural idea is a distributed logger saving data into a distributed filesystem. Easy, right? Well... kind of. It turns out I have Boost 1.47 (see my previous post) and you have to make some changes to use that version of Boost. I also wanted to use Hadoop 0.21.0, but it looks like Scribe's HDFS support was written against Hadoop 0.20, and it also looks like the HDFS team doesn't like keeping interfaces backward compatible :(

So let's go step by step.
To compile and install Scribe you will need libhdfs, and to build it you have two options.
Option 1: use the ivy script in $HADOOP_HOME/hdfs:
$ ant clean compile-c++-libhdfs

I don't know if that works for you; in my case it didn't. It requires hadoop-common-0.21.0 from Maven, which I installed, but it still failed, so I decided to go with the easier option.
Option 2: build it from $HADOOP_HOME/hdfs/src/c++/libhdfs:
$ make
$ make install

Done. This puts the libs in $HADOOP_HOME/dist/lib/.

So now it's time to configure Scribe.
$ ./configure --enable-hdfs CPPFLAGS='-I$HADOOP_HOME/hdfs/src/c++/libhdfs/ -I/usr/lib/jvm/java-6-sun-1.6.0.26/include/ -I/usr/lib/jvm/java-6-sun-1.6.0.26/include/linux' LDFLAGS='-L/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/i386/client/ -L$HADOOP_HOME/dist/lib/' --with-hadooppath=$HADOOP_HOME

Once again you will need to fix config.status and add -DBOOST_FILESYSTEM_VERSION=2 to S["DEFS"].

Now we are ready to build and install Scribe:
$ make

... fails ... remember that incompatibility I talked about a couple of minutes ago? :D

OK... this is your call, but I chose to change HdfsFile.cpp and add the extra parameter that hdfsDelete now requires (it means recursive delete):

hdfsDelete(fileSys, filename.c_str(), 1);

Now you can build (hopefully ...) without issues:
$ make
$ sudo make install

One last thing: if scribe fails to find libhdfs at runtime, you can set LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH=~/dev/hadoop-0.21.0/dist/lib:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/i386/server/
Hopefully it works for you too :D

If you need a sample configuration for HDFS, you can get one from the examples directory in your Scribe distribution.
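Roughly, a trimmed-down HDFS store configuration looks like this (server, port and paths are placeholders; check the example file for the full set of options):

port=1463
check_interval=3

<store>
category=default
type=buffer
retry_interval=30

<primary>
type=file
fs_type=hdfs
file_path=hdfs://yourserver:yourport/scribe_logs
base_filename=log
rotate_period=daily
</primary>

<secondary>
type=file
fs_type=std
file_path=/tmp/scribe_buffer
base_filename=log
</secondary>
</store>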

Possible problems you might find when running scribe:

1 - When trying to store, Scribe complains about not finding org.apache.hadoop....Configuration.
Solution: add hadoop-common to your classpath, plus commons-logging if you don't have it already:

export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-0.21.0.jar:$HADOOP_HOME/hadoop-hdfs-0.21.0.jar:$HADOOP_HOME/lib/commons-logging-api-1.1.jar:$HADOOP_HOME/lib/commons-logging-1.1.1.jar

2 - An error saying "HDFS is not configured for file hdfs://yourserver:yourport/filePath".
You will have to create the filePath manually:
$HADOOP_HOME/bin/hadoop fs -mkdir filePath

After that, try again and see if it still fails.
3 - If it fails again with the message "Exception: hdfsListDirectory call failed", just copy a file to your path:
$HADOOP_HOME/bin/hadoop fs -copyFromLocal someFile filePath

Run scribe again; hopefully it will work, and after that you can delete your file "someFile".

Yes, it's ugly; I haven't found a better way yet :S


Cheers,
Fernando

Thursday 15 September 2011

Google+ Developers API

So Google finally decided to make the Google+ API public. Nothing very interesting: it looks like there's no search functionality, nor a stream one.

This initial release is focused on public data only.
If you want to give it a try, go to your Google API console and activate the Google+ API.

Then go to "API Access", get your API key and test it:
curl https://www.googleapis.com/plus/v1/people/109813896768294978296?key=[yourKey]

That query returns the JSON-formatted information for the user.

A more complex example: if you want to see Sergey's last 3 updates:
https://www.googleapis.com/plus/v1/people/109813896768294978296/activities/public?alt=json&maxResults=3&pp=1&fields=id%2Ctitle%2Citems(title%2Cgeocode%2Ckind%2Cobject%2Freplies%2Curl)&key=[yourKey]

For the list of fields and how to query them, you can take a look at the Activities List or the People List.

The API is quite simple: you can get people or activity information and filter the response with the fields parameter.
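For example, to get just the display name and profile URL for the same profile ID used above (field names taken from the People documentation):

curl "https://www.googleapis.com/plus/v1/people/109813896768294978296?fields=displayName%2Curl&key=[yourKey]"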

Long story short: a pretty useless API. You need the person's ID, and since you can't get their friends' IDs, I don't see any practical use right now.

I would love to see v2... soon


Thursday 8 September 2011

Distributed Logging with Facebook Scribe + Status MBean

Most likely this will be my last post for about a month: my thesis director just sent me an email and I have to go back to working on my thesis (hopefully finishing it by the end of the month and getting my MS :D).

So this time I had to take a look at what Facebook uses for logging: Scribe.

"Scribe is a server for aggregating log data that's streamed in real
time from clients. It is designed to be scalable and reliable.
 "

The way it works (or at least the way I configured it) is quite simple: you have a central logging server and a local logging instance on each node. Roughly, the flow is the following:
  1. You log against your local instance, which acts as a proxy delivering log messages to the central server.
  2. If the central server goes down, your local instance keeps accepting messages and saves them to the filesystem (configurable).
  3. Your local server retries sending logs to the central server from time to time (also configurable).
  4. When the central server comes back up, the local server sends all pending log messages and cleans your FS up.
Logs are organized by category, which helps with filtering.
Note that, as described, the central server is again a bottleneck and will not scale. That's why you should think about hitting a load balancer instead of a single central server.
There's no admin console, no alarms for the server status, nothing. If the central server goes down, you won't notice it. If the local server goes down... well, most likely your box went down, so you wouldn't be logging to that server anyway.
In any case, you'll most likely want to write something to check the status, and for that the class com.facebook.fb303.FacebookService.Client provides a couple of methods to query the server. It's a generic interface for any Thrift service, but it gives you the basic information.
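Just to give an idea, here is a bare-bones sketch of such a check (not the MBean itself; host and port are placeholders, and the Thrift/fb303 classes come from the generated client libs):

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

import com.facebook.fb303.FacebookService;
import com.facebook.fb303.fb_status;

public class ScribeStatusCheck {
    public static void main(String[] args) throws Exception {
        // Scribe speaks framed Thrift; 1463 is the default port.
        TSocket socket = new TSocket("localhost", 1463);
        TFramedTransport transport = new TFramedTransport(socket);
        FacebookService.Client client =
                new FacebookService.Client(new TBinaryProtocol(transport));

        transport.open();
        fb_status status = client.getStatus();  // DEAD, STARTING, ALIVE, STOPPING, ...
        System.out.println("Status: " + status
                + ", details: " + client.getStatusDetails()
                + ", alive since: " + client.aliveSince());
        transport.close();
    }
}

These are the same calls you would wrap in an MBean and expose through JMX.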
Scribe also comes with 2 sample scripts for controlling the servers; you can find them in $SCRIBE_SRC_HOME/examples.

If you plan to compile Scribe, please use a Boost version older than 1.46. In newer Boost versions the default filesystem library changed to v3, and Scribe uses v2. Workaround: after running ./bootstrap.sh, edit config.status and add -DBOOST_FILESYSTEM_VERSION=2 where all the other parameters are set (i.e. search for -DHAVE_BOOST). I currently have Scribe working with Boost 1.47, and I built the client libs with Thrift 0.7.0.

I wrote a simple MBean to check the status of Scribe servers, and modified/cleaned up the log4j appender described in this article. You can take a look at the code here.



Tuesday 6 September 2011

First steps with Rails: Inheritance

So I went back to my Rails code, trying to figure out how to do inheritance with scaffold.

Turns out... you can't :S (or at least I couldn't find an article on how to do it, and all the examples take a different path). You can specify the --parent option, but it won't create the necessary migration, and the auto-generated CRUD pages will only contain the new fields.
So what I did was this. I ran:

rails g scaffold TwitterUser --parent=User

That creates the class TwitterUser < User and the HTML views for it. After that I had to create a migration manually. Rails supports STI (single table inheritance) out of the box, so I used that :P

My new migration had to add a column named type, of type string; you have to use that exact name so Rails sets the subclass name automatically for you.

After that you can add your custom columns.

class AddAndRemoveTwitterUserColumnsToUser < ActiveRecord::Migration
  def up
    add_column :users, :twitterName, :string
    add_column :users, :twitter_Id, :integer
    add_column :users, :type, :string
  end
  def down
    remove_column :users, :twitterName
    remove_column :users, :twitter_Id
    remove_column :users, :type
  end
end

After that you can go to app/views, change the HTML and add support for the new fields. It's not that hard, but I thought the scaffold tool could have done that work for me.
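For reference, this is roughly how the classes end up looking once the type column is in place (a sketch using the names from the example above):

class User < ActiveRecord::Base
end

# Generated by the scaffold; no extra code is needed for STI to kick in.
class TwitterUser < User
end

# TwitterUser.create(:twitterName => "foo") stores a row in the users table
# with type = "TwitterUser", and loading that row gives you back a
# TwitterUser instance.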


Wednesday 31 August 2011

Hadoop with Scala WordCount Example

I had some "free" time in the office so I decided to start researching Hadoop. I had ideas on Hadoop concepts, read a couple of articles, but never worked with it (although I worked with similar technologies).
I also wanted to learn Scala, I think it's interesting to be able to use functional programming from Java.

So I thought: mmmm, if the map part of a MapReduce job is basically applying a function (map) to some data... then functional code should be more suitable for that task in certain situations, and more importantly, it should be more readable.
=======================================================================
Update: Shadoop is basically a Scala object that converts your Int, Boolean and String values to and from the Hadoop Writable wrappers. This way you can write
val one = 1 instead of new IntWritable(1).
=======================================================================
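To illustrate the idea behind that update (this is not Shadoop's actual source, just a sketch of the kind of implicit conversions it relies on):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}

// Implicit conversions from plain Scala values to Hadoop's Writable wrappers,
// so an Int can be used wherever an IntWritable is expected
// (e.g. val one: IntWritable = 1).
object WritableConversions {
  implicit def intToWritable(i: Int): IntWritable = new IntWritable(i)
  implicit def longToWritable(l: Long): LongWritable = new LongWritable(l)
  implicit def stringToText(s: String): Text = new Text(s)
}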
So without further introduction, let's go to the example.
1 - Get the Hadoop WordCount example source code.


You'll see that they define an inner class:
public static class Map extends MapReduceBase implements Mapper

That's the Map class I want to get rid of. To do that I wrote a Scala class (I could have used a Scala object and kept the Map class; I'm not sure which is better in terms of performance).

class WordCountScala extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {
  val one = new IntWritable(1);
  val word = new Text();


  def map(key: LongWritable, value: Text, output: OutputCollector[Text, IntWritable], reporter: Reporter): Unit = {
    val line = value.toString()
    line.split(" ").foreach(a => { word.set(a); output.collect(word, one) })
  }


}
OK, it's not the cleanest example of the advantages of functional programming, but the idea is to show that you can use Scala with Hadoop ;)

The code is quite simple: for every word in the value string (a word is defined here the same way as in the Apache sample code: any string of characters without blanks) we emit a count of 1.

You can go back to the WordCount.java from the example, remove the Map inner class and set the mapper class to WordCountScala.

The code will look like this:
import java.io.IOException;
import java.util.Iterator;


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;


public class WordCount {
   
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }


    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");


        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);


        conf.setMapperClass(WordCountScala.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);


        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);


        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));


        JobClient.runJob(conf);
    }
}



Time to compile this thing. I use the Scala IDE for Eclipse, but if you want to build from the console, assuming you have the code for both classes in the $HADOOP_HOME/testScala folder:

1 - Compile the Scala class:
scalac -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar WordCountScala.scala

2 - Compile the Java class:
javac -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar:$SCALA_HOME/lib/scala-library.jar:. WordCount.java

3 - Run it:
java -classpath ../lib/commons-logging-1.1.1.jar:../hadoop-common-0.21.0.jar:../hadoop-mapred-0.21.0.jar:../lib/avro-1.3.2.jar:../lib/log4j-1.2.15.jar:../lib/jackson-core-asl-1.4.2.jar:../lib/jackson-mapper-asl-1.4.2.jar:../lib/commons-httpclient-3.1.jar:$SCALA_HOME/lib/scala-library.jar:. WordCount ../input/ ../output/

4 - Check the output:
cat ../output/*

Monday 29 August 2011

First impressions of Rails (and first entry)

I've been a Java developer for the last 8+ years. I've worked with some other languages (C++, Fortran (yes... I know...), JavaScript), but most of my work has been done in Java.

I recently decided that I want to try Ruby. I have a new project in mind, and everybody talks about how fast developing with Rails/Grails is, so I wanted to give it a chance.

What do you do when you change from one language to another? Get a reference book? Well... not what I did :D It happens that my brother has been working with Rails for more than a year, so instead of getting a book I'm doing it the hard way: reading blogs and asking my bro whenever I find something weird.

OK, so time to say what I think of Rails right now. Remember, I've been working with it for just one day (4-5 hours actually):

I REALLY like Cucumber. It's great that you can specify your tests in a human-readable way. I don't have a functional analyst on my project, but I think you could use Cucumber to let non-technical people write some test cases for you. Someone told me today that there are Java frameworks for the same thing, like JBehave (looks like I haven't built a full app in a while :D). Anyway, an important thing to notice: these are integration tests; you still need to write your unit tests.

I like how Rails organizes your app; it basically forces you to use good programming practices (thumbs up). The generator scripts are great: you can generate your model classes, the DB schema and simple forms for all your CRUD operations.

What I DON'T like is that you have to think in terms of the DB first. I like to think in terms of objects; I don't want to pay attention to the DB schema when I'm writing my model.

Here's an example: suppose you want to create a class SpecialUser with some attributes, let's say name and twitterId.

rails g scaffold SpecialUser name:string twitterId:string

This works great; to make it work you run:
rake db:migrate

and you are ready to go.

Now let's say we already have a class User with the attribute name and we want SpecialUser to be a subclass of it. If you read the docs you'll find that you can specify the parent class with --parent, so you run the script:

rails g scaffold SpecialUser twitterId:string --parent=User

You go and check the code and find:

class SpecialUser < User
end

So you're like: "mmmmmmm, this is weird... where's my twitterId attribute?" Moreover, you check db/migrate and there's no migration for this new class... but if you grep for twitterId you'll find that the HTML pages were created.

So: it looks like IF you want inheritance, Rails will do PART of the job, but you'll have to write the DB migration and model yourself. I wonder how hard it would have been to make it work out of the box.

Next step: find out whether there's a simple way to do inheritance or whether I have to write the whole thing myself.