Developer's Nightmare: R: First Impressions (and a bit of Apache Mahout)

I've been in Madrid for the last two months, so I haven't had much time to write here, but I want to take some time now to write about R.

R is a programming language for statistics and data mining. For the project I'm working on, we wanted to analyse similarities in the economic behaviour of different zones.
I did it with Apache Mahout and it worked great, but along the way someone suggested using R instead, because I don't have a huge data set. I explained that although I don't have one now, I'm only doing POCs and I will have one later, so I needed to use Mahout.

After that, though, I realised that I also had to analyse other things that might involve less data, e.g. one day of activity for old people in a small area, and I thought "mmm... that R language might be good for me".

So I downloaded it, "installed" it and quickly got what I needed, a beautiful (?) ggplot graph:

These are expenses at 9:50 am, all aggregated data. So I created a simple R script, got the data for a month and, by combining the resulting plots with ffmpeg, created a video (sorry, can't show it yet).

Let me show you some simple R code (most likely there are better ways to do this, I'm a newbie in R).

I try to follow the Google R style guide.

The language has something beautiful: if you load a set of data, you can easily get statistical information about it. For example:

I will create a vector with the ages of some people:

> myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43, 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)

> summary(myData.ages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.00   34.00   32.31   53.00   90.00
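The figures that summary() reports can also be obtained one at a time with base R functions. A minimal sketch using the same ages as above (the variable name `ages` is mine, chosen only for this example; everything else is base R):

```r
# Same ages as in the example above
ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43, 34,
          3, 54, 2, 43, 65, 2, 34, 5, 34)

length(ages)                    # number of observations
range(ages)                     # minimum and maximum
median(ages)                    # middle value
mean(ages)                      # arithmetic mean
quantile(ages, c(0.25, 0.75))   # 1st and 3rd quartiles
sd(ages)                        # standard deviation, which summary() does not show
```

This is handy when you only need one of the values, for instance to store the median in a variable and use it later.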

Now let's do something a little more interesting. Suppose we have the ages of 10 people:

testData.age1 <- c(age=2)
testData.age2 <- c(age=10)
testData.age3 <- c(age=17)
testData.age4 <- c(age=19)
testData.age5 <- c(age=30)
testData.age6 <- c(age=27)
testData.age7 <- c(age=40)
testData.age8 <- c(age=48)
testData.age9 <- c(age=66)
testData.age10 <- c(age=97)

We can put them in a matrix, in this case a 10x1 matrix (10 rows, 1 column):

> data.myData <- matrix(c(testData.age1, testData.age2, testData.age3, testData.age4, testData.age5, testData.age6, testData.age7, testData.age8, testData.age9, testData.age10), ncol=1, dimnames=list(c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")))
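The same matrix can probably be built with less typing by starting from a single vector. A small sketch, assuming the ten ages listed above; the helper variable `ages` and the use of paste0() to generate the row names are my own choices, not part of the original script:

```r
# Ages of the ten test subjects, in the same order as above
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)

# One column, one row per person; paste0() generates "test1" ... "test10"
data.myData <- matrix(ages, ncol = 1,
                      dimnames = list(paste0("test", seq_along(ages))))

dim(data.myData)        # number of rows and columns
rownames(data.myData)   # the generated row names
```

The result should hold the same values and row names as the longer version, so the kmeans call that follows works on either.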

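Now we can run a basic clustering algorithm, kmeans, and ask for 4 clusters. kmeans starts from randomly chosen centres, so the cluster numbers (and sometimes the grouping itself) can change from run to run: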

kmeans(x=data.myData, centers=4)$cluster

 test1  test2  test3  test4  test5  test6  test7  test8  test9 test10
     3      3      1      1      1      1      2      2      2      4

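So what does this mean?

Basically, test1 and test2 are related (they are kids); test3, test4, test5 and test6 are related (young people); test7, test8 and test9 are closely related (adults?); and test10, aged 97, is alone (elder?). The cluster numbers themselves are only labels and carry no order.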
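Since kmeans picks its starting centres at random, it can help to fix the random seed and to look at the cluster centres as well as the assignments. The following is only a sketch: the seed value, the `nstart = 25` setting and the variable names `ages` and `fit` are my assumptions, and the grouping it finds is not guaranteed to match the one printed above.

```r
# Rebuild the 10x1 matrix of ages so the example is self-contained
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)
data.myData <- matrix(ages, ncol = 1,
                      dimnames = list(paste0("test", seq_along(ages))))

set.seed(42)   # arbitrary seed, only so that repeated runs give the same answer

# nstart = 25 tries 25 random starts and keeps the best one
fit <- kmeans(x = data.myData, centers = 4, nstart = 25)

fit$centers                                  # mean age of each cluster
split(rownames(data.myData), fit$cluster)    # which rows fell into which cluster
```

Looking at `fit$centers` makes the interpretation easier, because each cluster is described by an average age instead of an arbitrary number.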

Mahout vs R

Why should I use Mahout if R has more algorithms implemented and can get the data from HDFS through Hive? I would use Mahout when the calculation itself can be distributed. If the distributed part is just extracting a "small" subset of the data to process, then I would use R + HDFS, basically because R is more mature than Mahout and, as I said, it has lots of algorithms already implemented.

These are my first impressions; I might change my mind after reading a little bit more.


Friday, 16 December 2011
