R is a programming language for statistics and data mining. For the project I'm working on, we wanted to analyse similarities in the economic behaviour of different zones.
I did it with Apache Mahout and it worked great, but along the way someone suggested using R, because I don't have a huge data set. I explained that although I don't have one now, I'm only doing POCs and I will have one later, so I needed to use Mahout.
After that, though, I realised that I also had to analyse other things that might involve less data, e.g. one day of activity for elderly people in a small area, and I thought "mmm... that R language might be good for me".
So I downloaded it, "installed" it and quickly got what I needed, a beautiful (?) ggplot graph:
These are expenses at 9:50 am, all aggregated data. So I created a simple script with R, got the data for a month, and combined the output with ffmpeg to create a video (sorry, can't show it yet).
Let me show you some simple R code (most likely there are better ways to do this, I'm a newbie in R).
I try to follow the Google R style guide.
The language has something beautiful: if you load a set of data, you can easily get statistical information about it. For example:
I will create a vector with the ages of some people:
> myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43, 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)
> summary(myData.ages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.00   34.00   32.31   53.00   90.00
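Each figure in that summary is also available from its own function, which is handy when you only need one value inside a script. A minimal sketch reusing the same vector (the `ages.*` variable names are mine, just for illustration):

```r
# Same ages as in the summary() example.
myData.ages <- c(1, 1, 1, 2, 3, 20, 23, 40, 40, 50, 55, 60, 70, 90, 65, 43,
                 34, 3, 54, 2, 43, 65, 2, 34, 5, 34)

ages.mean <- mean(myData.ages)      # arithmetic mean
ages.median <- median(myData.ages)  # middle value of the sorted ages
# First and third quartiles; quantile() defaults to the same method
# that summary() uses.
ages.quartiles <- quantile(myData.ages, probs = c(0.25, 0.75))
ages.range <- range(myData.ages)    # c(minimum, maximum)

print(round(ages.mean, 2))
print(ages.median)
print(ages.quartiles)
print(ages.range)
```

The printed values should line up with the Min., 1st Qu., Median, Mean, 3rd Qu. and Max. columns of the summary.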
Now let's do something a little more interesting. Suppose we have the ages of 10 people:
testData.age1 <- c(age=2);
testData.age2 <- c(age=10);
testData.age3 <- c(age=17);
testData.age4 <- c(age=19);
testData.age5 <- c(age=30);
testData.age6 <- c(age=27);
testData.age7 <- c(age=40);
testData.age8 <- c(age=48);
testData.age9 <- c(age=66);
testData.age10 <- c(age=97);
We can create a matrix from them. It's a 10x1 matrix:
> data.myData <- matrix(c(testData.age1, testData.age2, testData.age3, testData.age4, testData.age5, testData.age6, testData.age7, testData.age8, testData.age9, testData.age10), ncol=1, dimnames=list(c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")))
Now we can run a basic clustering algorithm, kmeans, and ask for 4 clusters:
kmeans(x=data.myData, centers=4)$cluster
 test1  test2  test3  test4  test5  test6  test7  test8  test9 test10
     3      3      1      1      1      1      2      2      2      4
So what does this mean?
Basically, tests 1 and 2 are related (they are kids); 3, 4, 5 and 6 are related (young people); 7, 8 and 9 are closely related (adults?); and test 10, aged 97, is alone (elder?).
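One caveat: kmeans picks its starting centres at random, so the cluster numbers, and sometimes the grouping itself, can change between runs and need not match the output above. A minimal sketch of a reproducible run, assuming a seed of 42 and `nstart = 25` (both arbitrary choices of mine, not what I used for the run above) and building the same matrix from a single vector:

```r
# Ages of the ten test people, in the same order as test1..test10.
ages <- c(2, 10, 17, 19, 30, 27, 40, 48, 66, 97)
data.myData <- matrix(ages, ncol = 1,
                      dimnames = list(paste0("test", 1:10), "age"))

set.seed(42)  # arbitrary seed, only so that repeated runs agree
# nstart = 25 tries 25 random starts and keeps the best result.
fit <- kmeans(x = data.myData, centers = 4, nstart = 25)

# Group the ages by the cluster each one was assigned to.
groups <- split(ages, fit$cluster)
print(groups)
print(fit$centers)  # mean age of each cluster
```

Looking at `fit$centers` next to the groups makes it easier to put a label (kids, young people, and so on) on each cluster than reading the raw cluster numbers.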
Mahout vs R
Why should I use Mahout if R has more algorithms implemented and can get the data from HDFS through Hive? I would use Mahout when the calculation itself can be distributed. If the distributed part is just getting a "small" subset of the data to process, then I would use R + HDFS, basically because R is more mature than Mahout and, as I said, it has lots of algorithms already implemented.
These are my first impressions; I might change my mind after reading a little bit more.