Sunday, 22 September 2013

Kaggle Digit recognizer

This blogpost is an attempt at learning to recognize handwritten digits using the Kaggle Digit Recognizer dataset.

The dataset has 42000 training examples.

Each training example is a 28x28 grid of pixels representing one digit.

The process of creating the training labels will be detailed in a subsequent blogpost.

Each row of the dataset has a label for the digit followed by 28x28 = 784 columns, one per pixel of the grid, each holding the value of that pixel.
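
To make the row layout concrete, here is a minimal sketch that turns one row back into a 28x28 grid. The row is synthetic (made-up values, nothing is read from train.csv), and the column names pixel0..pixel783 as well as the row-by-row pixel order are assumptions about the file, so check them against the actual data before relying on this.

```r
# Synthetic stand-in for one row of train.csv: a label plus 784 pixel values.
# The values are random; the names pixel0..pixel783 are an assumption.
set.seed(1)
row <- c(label = 7,
         setNames(sample(0:255, 784, replace = TRUE), paste0("pixel", 0:783)))

pixels <- row[-1]   # drop the label, keep the 784 pixel values

# Rebuild the grid, assuming the pixels are stored row by row
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)

dim(img)
```

With a real row, something like image(t(img[28:1, ])) should draw the digit the right way up.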


This problem can be solved in a variety of ways, but for this post I will be learning from bdewilde (a GitHub user) and training KNN models.


A KNN model works in a simple way.

1. Each example can be represented as a vector of its column values.
2. Hence the distance between any two examples can be calculated from their vectors.
3. The idea is that examples with the same label tend to lie close to each other and cluster together, hence the name nearest neighbours.
4. The k in KNN is the number of nearest vectors that are looked up for an unlabeled example.
5. A majority vote is taken among those k vectors, and the winning label is the prediction.
6. The value of k can be optimised to improve accuracy.
7. The Euclidean distance metric can be swapped for another metric, or the votes can be weighted with a window (kernel).
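
The steps above can be sketched in a few lines of base R. This is only a toy illustration on made-up 2-D points, not the kknn implementation used later in the post; the function name knn_predict is my own.

```r
# Classify one query point by majority vote among its k nearest training rows.
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(train_x, 2, query)        # subtract the query from each row
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]      # indices of the k closest rows
  votes <- table(train_y[nearest])         # count labels among the neighbours
  names(votes)[which.max(votes)]           # most common label (first one on ties)
}

# Toy data: two well-separated clusters in 2-D
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("a", "a", "a", "b", "b", "b")

knn_predict(train_x, train_y, query = c(0.5, 0.5), k = 3)   # "a"
knn_predict(train_x, train_y, query = c(5.5, 5.5), k = 3)   # "b"
```

For the digit data the vectors would simply be 784 values long instead of 2.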


The idea behind weighting by distance is that points closer to an unlabeled point get a higher weight in the vote than points further away from it.
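
Here is a rough sketch of such a distance-weighted vote with a triangular window. It is a simplification for illustration: scaling the distances by the (k+1)-th neighbour is my assumption about how to normalise them, and the kknn package may differ in the details. The function name weighted_knn_predict is hypothetical.

```r
# kNN where each neighbour's vote is weighted by a triangular kernel.
# Requires k < nrow(train_x) so that a (k+1)-th neighbour exists.
weighted_knn_predict <- function(train_x, train_y, query, k = 3) {
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  ord <- order(dists)
  nearest <- ord[seq_len(k)]
  # Scale by the (k+1)-th neighbour's distance so the weights stay within [0, 1]
  scale <- dists[ord[k + 1]]
  weights <- 1 - dists[nearest] / scale    # triangular: closer means heavier
  scores <- tapply(weights, train_y[nearest], sum)
  names(scores)[which.max(scores)]
}

# One "a" very close to the query, three "b" points further away
train_x <- rbind(c(0, 0), c(2, 0), c(2.2, 0), c(2.3, 0))
train_y <- c("a", "b", "b", "b")

# Two of the three nearest neighbours are "b", but the close "a" outweighs them
weighted_knn_predict(train_x, train_y, query = c(0.1, 0), k = 3)   # "a"
```

A plain majority vote over the same three neighbours would have picked "b", which is exactly the difference the weighting makes.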

Here's the procedure that does the above for the digit dataset:


1. Divide the training set into a training set and a validation set (the code below uses leave-one-out cross-validation rather than a fixed split).
2. Reduce the dimensionality of the dataset by removing the columns with the least variance.
3. Choose the candidate numbers of nearest neighbours (k) to be considered.
4. Decide upon the candidate nearest-neighbour window (kernel) functions.
5. Train the model for each combination of k and kernel.
6. Find the combination of k and kernel that gives the least error on the validation set.
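
Before the real code, here is a small base R sketch of that procedure on synthetic data. Several things are simplifying assumptions: the data is made up, a plain variance threshold stands in for caret's nearZeroVar (which uses different criteria), a fixed validation split is used, and only k is tuned (no kernels). All names are my own.

```r
set.seed(42)

# Synthetic stand-in for the digit data: 2 informative columns, 1 constant one
n <- 200
label <- rep(c("0", "1"), each = n / 2)
x <- cbind(
  f1 = rnorm(n, mean = ifelse(label == "0", 0, 3)),
  f2 = rnorm(n, mean = ifelse(label == "0", 0, 3)),
  const = 0                          # zero variance, carries no information
)

# Drop columns whose variance is (near) zero
keep <- apply(x, 2, var) > 1e-8
x <- x[, keep, drop = FALSE]

# Split into training and validation rows
val_idx <- sample(n, size = n / 4)
train_x <- x[-val_idx, , drop = FALSE]
train_y <- label[-val_idx]
val_x <- x[val_idx, , drop = FALSE]
val_y <- label[val_idx]

# Plain majority-vote kNN for a single query row
knn_one <- function(query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  votes <- table(train_y[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Validation error for each candidate k, then pick the k with the smallest error
ks <- 1:15
val_error <- sapply(ks, function(k) {
  pred <- apply(val_x, 1, knn_one, k = k)
  mean(pred != val_y)
})
best_k <- ks[which.min(val_error)]
```

Tuning the kernel as well just means repeating the same loop once per kernel and keeping the best (k, kernel) pair.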




library(kknn)   # train.kknn
library(caret)  # nearZeroVar
# load the training data
rawTrainData <- read.csv("C:/Users/Kishore/Desktop/kaggle/tutorials/digit recognizer/train.csv", header=TRUE)
# randomly shuffle the rows of the training data and keep the first 10000
train <- rawTrainData[sample(nrow(rawTrainData)), ]
train <- train[1:10000,]
# find the near-zero-variance pixel columns (column 1 is the label, so it is left out of the check)
badCols <- nearZeroVar(train[,-1])
print(paste("Fraction of nearZeroVar columns:", round(length(badCols)/(ncol(train)-1),4)))
# badCols indexes train[,-1], so shift by 1 to index train; skip if nothing was flagged
if (length(badCols) > 0) train <- train[, -(badCols+1)]
# optimize knn for k and kernel
# using leave-one-out cross-validation
kMax <- 15
kernels <- c("triangular","rectangular","gaussian")
model_2 <- train.kknn(as.factor(label) ~ ., train, kmax=kMax, kernel=kernels)
# empty plot frame; the points for each kernel are added in the loop below
plot(1:nrow(model_2$MISCLASS), model_2$MISCLASS[,1], type='n', ylim=c(0.0,0.105),
     xlab="Number of Nearest Neighbors", ylab="Fractional Error Rate", main="kNN performance by k and kernel")
colors <- rainbow(length(kernels))
for(kern in kernels) {
    color <- colors[match(kern, kernels)]
    points(1:nrow(model_2$MISCLASS), model_2$MISCLASS[,kern], type='p', pch=17, col=color)
    # dotted loess curve to smooth the error rate across k
    lines(predict(loess(model_2$MISCLASS[,kern] ~ c(1:kMax))), col=color, lwd=2, lty="dotted")
}
# mark the best (k, kernel) combination in black
model_2_best <- model_2$MISCLASS[model_2$best.parameters$k, model_2$best.parameters$kernel]
points(model_2$best.parameters$k, model_2_best, pch=17, col="black")
# one legend entry per kernel, matching the plotted symbol and line style
legend("bottomright", ncol=2, legend=kernels, col=colors, pch=17, lwd=2, lty="dotted", bty="n", y.intersp=1.5, inset=0.01, cex=0.8)
As can be seen from the chart, the triangular kernel with k=9 gives the least error for recognizing digits.

Friday, 20 September 2013

Calling R from Excel VBA

R is a very good tool for data visualization and for generating reports, apart from the large number of statistical packages it has to offer.

In legacy systems, however, data manipulation is often done in Excel.

This post describes how to make these two systems talk, by calling R code from Excel.


This is useful when a legacy system is difficult to change because of its many upstream or downstream dependencies, but one needs a capability that R is very strong at, be it statistics or data visualization. It also helps when the R code has to run only after certain steps are finished in Excel.




Here are the steps to follow:

1. Have the R code that you want to execute at a known location, let's say
C:/Users/kishorea/Desktop/summary_variance_code.txt
Here, summary_variance_code.txt is the text file which contains the R code.

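2. Execute the shell command from an Excel macro (Developer -> Create macro). Each path is wrapped in doubled quotes so that the space in "Program Files" does not split the command:

Call Shell("""C:/Program Files/R/R-2.12.2/bin/Rscript.exe"" ""C:/Users/kishorea/Desktop/summary_variance_code.txt""")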

Step 2 opens up the command window and executes the R script.
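
For illustration, here is a guess at what a script like summary_variance_code.txt could contain. The contents are entirely hypothetical (the column names, the data and the output file are made up), and temporary files are used so the sketch runs on its own; in a real setup input_path would point at a CSV saved from Excel.

```r
# Hypothetical stand-in for summary_variance_code.txt.
# Temporary files replace the real Excel export and the real output location.
input_path  <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(sales = c(10, 12, 9, 15), cost = c(4, 5, 3, 6)),
          input_path, row.names = FALSE)

# Read the data that Excel would have saved
data <- read.csv(input_path, header = TRUE)

# Per-column mean and variance, one row per column
result <- data.frame(
  column   = names(data),
  mean     = sapply(data, mean),
  variance = sapply(data, var),
  row.names = NULL
)

# Write the summary where Excel (or a later macro step) can pick it up
write.csv(result, output_path, row.names = FALSE)
```

Writing the result back to a CSV is just one option for getting the output into Excel again; the macro could open that file once Rscript has finished.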


There you go, R running from Excel.