Friday, 16 December 2011

R: First Impressions (and a bit of Apache Mahout)

I've been in Madrid for the last 2 months, so I haven't had much time to write here, but I want to take some time now to write about R.

R is a programming language for statistics and data mining. For the project I'm working on, we wanted to analyse similarities in the economic behaviour of different zones.
I did it with Apache Mahout and it worked great, but along the way someone suggested using R instead, because I don't have a huge data set. I explained that although the data set is small now (I'm just doing POCs), it will grow, so I needed to use Mahout.

The thing is that afterwards I realised I also had to analyse other things that might have less data, e.g. activity in 1 day for old people in a small area, and I thought: "mmm... that R language might be good for me".

So I downloaded it, "installed" it and quickly got what I needed, a beautiful (?) ggplot graph:
These are expenses at 9:50 am, all aggregated data. So I created a simple R script, got the data for a month and, by combining the generated images with ffmpeg, created a video (sorry, can't show it yet).
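
The script itself isn't shown here, but the general idea (one image per time slot, then ffmpeg to join them) can be sketched roughly like this. Everything in it is an assumption for illustration: the `expenses` data frame is made up from random numbers, and the column names, output folder and frame rate are hypothetical. It also assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up data: 3 time slots with 20 random "expenses" each (NOT the real data).
set.seed(1)
expenses <- data.frame(slot   = rep(1:3, each = 20),
                       x      = runif(60),
                       y      = runif(60),
                       amount = rexp(60))

# Hypothetical output folder for the frames.
frames.dir <- file.path(tempdir(), "frames")
dir.create(frames.dir, showWarnings = FALSE)

# Draw one plot per time slot and save it as frame_001.png, frame_002.png, ...
for (s in unique(expenses$slot)) {
  p <- ggplot(expenses[expenses$slot == s, ],
              aes(x = x, y = y, size = amount)) +
    geom_point() +
    ggtitle(sprintf("Time slot %d", s))
  ggsave(file.path(frames.dir, sprintf("frame_%03d.png", s)),
         plot = p, width = 6, height = 4, dpi = 100)
}

# Then, from a shell inside that folder, something like:
#   ffmpeg -framerate 5 -i frame_%03d.png -pix_fmt yuv420p expenses.mp4
```

The zero-padded file names matter: ffmpeg's `%03d` pattern expects the frames to be numbered consecutively.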

Let me show you some simple R code (most likely there are better ways to do this, I'm a newbie in R).
I try to follow the Google R style guide.

The language has something beautiful: if you load a set of data, you can easily get statistical information about it. For example,
I will create a vector with the ages of some people:
> myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43, 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)

> summary(myData.ages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.00   34.00   32.31   53.00   90.00 
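
If you only need one of those numbers, each of them has its own function too. A small sketch using the same vector (it uses base R only; `quantile` with its default settings is what produces the quartiles shown by `summary`):

```r
myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65,
                 43, 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)

ages.mean      <- mean(myData.ages)                       # arithmetic mean
ages.median    <- median(myData.ages)                     # middle value
ages.quartiles <- quantile(myData.ages, c(0.25, 0.75))    # 1st and 3rd quartile
ages.range     <- range(myData.ages)                      # min and max

print(round(ages.mean, 2))   # 32.31, as in summary()
print(ages.median)           # 34
print(ages.quartiles)        # 3 and 53
print(ages.range)            # 1 and 90
```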

Now let's do something a little more interesting. Suppose we have the ages of 10 people:
testData.age1 <- c(age=2)
testData.age2 <- c(age=10)
testData.age3 <- c(age=17)
testData.age4 <- c(age=19)
testData.age5 <- c(age=30)
testData.age6 <- c(age=27)
testData.age7 <- c(age=40)
testData.age8 <- c(age=48)
testData.age9 <- c(age=66)
testData.age10 <- c(age=97)
We can put them in a matrix; it is a 10x1 matrix (10 rows, 1 column):


> data.myData <- matrix(c(testData.age1, testData.age2, testData.age3, testData.age4, testData.age5, testData.age6, testData.age7, testData.age8, testData.age9, testData.age10), ncol=1, dimnames=list(c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")))
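
As I said, there are probably better ways. One possible shortcut, sketched here with base R only, is to start from a single vector and generate the row names instead of typing them. The `ages` variable and the "age" column name are my own additions for the example:

```r
# All ten ages in one vector instead of ten separate variables.
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)

# paste0 builds "test1" ... "test10"; the second dimnames entry names the column.
data.myData <- matrix(ages, ncol = 1,
                      dimnames = list(paste0("test", seq_along(ages)), "age"))

print(dim(data.myData))   # 10 rows, 1 column
print(data.myData)
```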



Now we can run a basic clustering algorithm, k-means, asking for 4 clusters:

kmeans(x=data.myData, centers=4)$cluster
 test1  test2  test3  test4  test5  test6  test7  test8  test9 test10 
     3      3      1      1      1      1      2      2      2      4

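So what does this mean?
Basically, test1 and test2 are related (they are kids); test3, test4, test5 and test6 are related (young people); test7, test8 and test9 are closely related (adults?); and test10, aged 97, is alone (elderly?). The cluster numbers themselves are just labels, and since kmeans starts from randomly chosen centres, another run can number (or even group) the ages differently.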
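
Because of that random start, it helps to fix the seed and to look at the cluster centres as well. A minimal sketch, assuming the same 10 ages; the seed value and `nstart = 25` are arbitrary choices of mine, and the groups it finds are not guaranteed to match the run above:

```r
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)
data.ages <- matrix(ages, ncol = 1,
                    dimnames = list(paste0("test", seq_along(ages)), "age"))

set.seed(42)                       # makes the random start repeatable
fit <- kmeans(x = data.ages, centers = 4, nstart = 25)  # keep the best of 25 starts

print(fit$cluster)                 # cluster label for each person
print(fit$centers)                 # mean age of each cluster
print(split(ages, fit$cluster))    # the ages grouped by cluster
print(fit$tot.withinss)            # lower means tighter clusters
```

Looking at `fit$centers` makes the labels easier to read than the bare numbers: each cluster is described by its mean age.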

Mahout vs R 
Why should I use Mahout if R has more algorithms implemented and you can get the data from HDFS through Hive? I would use Mahout when the calculation itself can be distributed. If the distributed part is just fetching a "small" subset of the data to process, then I would use R + HDFS, basically because R is more mature than Mahout and, as I said, it has lots of algorithms already implemented.

These are my first impressions; I might change my mind after reading a little bit more.



