Sunday, 22 September 2013

Kaggle Digit Recognizer

This blog post is an attempt at learning to recognize handwritten digits using the Kaggle Digit Recognizer dataset.

The dataset has 42000 training examples.

Each training example describes one digit as the pixels of a 28x28 grid.

The process of creating the training labels will be detailed in a subsequent blogpost.

Each row of the dataset has a label for the digit and 28x28 = 784 columns, one for the value of each pixel in the grid.
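To make the layout concrete, here is a small sketch of turning one row back into a 28x28 grid. The row below is synthetic (random values, not read from train.csv), and the row-by-row pixel order is my assumption about how the columns are laid out.

```r
# Hypothetical example row: a label followed by 784 pixel values.
# A real row would come from train.csv; this one is random filler.
set.seed(1)
exampleRow <- c(label = 7, sample(0:255, 784, replace = TRUE))

digitLabel <- unname(exampleRow[1])
pixels <- unname(exampleRow[-1])

# Reshape the 784 values into a 28x28 grid,
# assuming the pixels are stored row by row
grid <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)
dim(grid)
```

With `byrow = TRUE`, the first 28 values fill the top row of the grid, the next 28 fill the second row, and so on.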


This problem can be solved in a variety of ways, but for this post I will be learning from bdewilde (a GitHub user) and training KNN (k-nearest neighbours) models.


A KNN model works in a simple way.

1. Each example (one digit image) can be represented as a vector of its pixel columns.
2. Hence the distance between any two examples can be calculated.
3. The idea is that examples with the same label tend to lie close to each other and cluster together, hence the name nearest neighbours.
4. The k in KNN is the number of nearest vectors to look at.
5. A majority vote is taken among those k vectors, and the prediction is the label that wins the vote.
6. The value of k can be tuned to improve accuracy.
7. The Euclidean distance metric can be swapped for another metric, or the votes can be weighted with a window (kernel) function.
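The steps above can be sketched in a few lines of base R. This is only an illustration on toy 2-D points, not the kknn implementation; `knnPredict` is a helper name I made up for the example.

```r
# Classify one query point by majority vote among its k nearest
# training points. knnPredict is a hypothetical helper for illustration.
knnPredict <- function(trainX, trainY, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(trainX, 2, query)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among their labels (ties go to the first label in sort order)
  votes <- table(trainY[nearest])
  names(votes)[which.max(votes)]
}

# Toy data: three points near the origin (class "a"), three near (5, 5) (class "b")
trainX <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
trainY <- c("a", "a", "a", "b", "b", "b")

knnPredict(trainX, trainY, query = c(0.5, 0.5), k = 3)
knnPredict(trainX, trainY, query = c(5.5, 5.5), k = 3)
```

For the digit data, each row of `trainX` would hold the 784 pixel values instead of two coordinates, but the logic stays the same.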


The idea behind distance weighting is that points closer to an unlabeled point carry more weight in the vote than points further away from it.
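Here is a rough sketch of what a weighted vote could look like with a triangular window. It is a simplified illustration, not necessarily what kknn does internally; `weightedVote` is a made-up helper, and scaling by the distance to the (k+1)-th neighbour is an assumption I chose so the weights fall between 0 and 1.

```r
# Weighted vote among the k nearest neighbours using a triangular window.
# weightedVote is a hypothetical helper; it assumes length(dists) > k
# and that the (k+1)-th nearest distance is greater than zero.
weightedVote <- function(dists, labels, k = 3) {
  ord <- order(dists)
  # Scale by the distance to the (k+1)-th neighbour so weights lie in [0, 1]
  scale <- dists[ord[k + 1]]
  nearest <- ord[seq_len(k)]
  # Triangular window: weight falls linearly with distance
  w <- pmax(0, 1 - dists[nearest] / scale)
  # Sum the weights per label and pick the label with the largest total
  totals <- tapply(w, labels[nearest], sum)
  names(totals)[which.max(totals)]
}

# One very close "3" and two more distant "8"s among the 3 nearest points
dists <- c(0.1, 2.0, 2.1, 3.0)
labels <- c("3", "8", "8", "3")
weightedVote(dists, labels, k = 3)
```

In this toy case a plain majority vote among the three nearest points would say "8" (two votes to one), while the weighted vote picks "3" because the single "3" is much closer.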

Here is the procedure that does the above for the digit dataset:


1. Set aside data for validation (the code below samples 10,000 rows and relies on leave-one-out cross-validation rather than a fixed train/validation split).
2. Reduce the dimensionality of the dataset by removing the columns with near-zero variance.
3. Choose the range of nearest-neighbour counts (k) to consider.
4. Decide upon the candidate window (kernel) functions for weighting the neighbours.
5. Train the model for each combination of k and kernel.
6. Find the combination of k and kernel that gives the least validation error.
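Before the real code, here is a toy version of the same procedure in base R, so the steps are easier to follow. Everything in it is an assumption for illustration: the data is synthetic, `predictOne` is a made-up helper, it uses a plain hold-out split (simpler than the leave-one-out scheme mentioned in the code comments further down), and it tries only unweighted votes rather than several kernels.

```r
set.seed(42)
# Synthetic stand-in for the digit data: two classes, three columns,
# one of which is constant and so carries no information.
n <- 200
label <- rep(c(0, 1), each = n / 2)
dat <- data.frame(label = label,
                  x1 = rnorm(n, mean = label * 3),
                  x2 = rnorm(n, mean = label * 3),
                  x3 = 0)

# Drop columns whose variance is (near) zero
vars <- sapply(dat[, -1], var)
keep <- names(vars)[vars > 1e-8]
dat <- dat[, c("label", keep)]

# Hold out 25% of the rows for validation
idx <- sample(nrow(dat), size = 0.75 * nrow(dat))
trainSet <- dat[idx, ]
validSet <- dat[-idx, ]

# Plain majority-vote kNN prediction for one row (hypothetical helper)
predictOne <- function(query, k) {
  d <- sqrt(rowSums(sweep(as.matrix(trainSet[, -1]), 2, query)^2))
  votes <- table(trainSet$label[order(d)[seq_len(k)]])
  as.numeric(names(votes)[which.max(votes)])
}

# Validation error for each candidate k, then pick the best one
ks <- c(1, 3, 5, 7)
errors <- sapply(ks, function(k) {
  preds <- apply(as.matrix(validSet[, -1]), 1, predictOne, k = k)
  mean(preds != validSet$label)
})
bestK <- ks[which.min(errors)]
```

The real code does the same search but lets `train.kknn` handle the cross-validation and the kernels.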




library(kknn)
# load the training data
rawTrainData <- read.csv("C:/Users/Kishore/Desktop/kaggle/tutorials/digit recognizer/train.csv", header=TRUE)
# shuffle the rows and keep a random subset of 10,000
# (no seed is set, so the sample and the results vary from run to run)
train <- rawTrainData[sample(nrow(rawTrainData)), ]
train <- train[1:10000,]
# optimize knn for k and kernel
# using leave-one-out cross-validation
kMax <- 15
kernels <- c("triangular","rectangular","gaussian")
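library(caret)
# find pixel columns with near-zero variance (column 1 is the label, so it is excluded)
badCols <- nearZeroVar(train[,-1])
# report the fraction relative to the 784 pixel columns, not counting the label
print(paste("Fraction of nearZeroVar columns:", round(length(badCols)/(ncol(train)-1),4)))
# drop them; +1 shifts the indices past the label column, and the guard
# avoids dropping every column when badCols is empty
if (length(badCols) > 0) train <- train[, -(badCols+1)]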
model_2 <- train.kknn(as.factor(label) ~ ., train, kmax=kMax, kernel=kernels)
plot(1:nrow(model_2$MISCLASS), model_2$MISCLASS[,1], type='n', col='blue', ylim=c(0.0,0.105),
     xlab="Number of Nearest Neighbors", ylab="Fractional Error Rate", main="kNN performance by k and kernel")
for(kern in kernels) {
    color <- rainbow(length(kernels))[match(kern, kernels)]
    points(1:nrow(model_2$MISCLASS), model_2$MISCLASS[,kern], type='p', pch=17, col=color)
    lines(predict(loess(model_2$MISCLASS[,kern] ~ c(1:kMax))), col=color, lwd=2, lty="dotted")
}
model_2_best <- model_2$MISCLASS[model_2$best.parameters$k, model_2$best.parameters$kernel]
points(model_2$best.parameters$k, model_2_best, pch=17, col="black")
# one legend entry per kernel, using the same symbol (pch=17) as the plotted points
legend("bottomright", legend=kernels, col=rainbow(length(kernels)), pch=17, lwd=2, lty="dotted", bty="n", y.intersp=1.5, inset=0.01, cex=0.8)

As can be seen from the chart, the triangular kernel with k=9 gave the least error for recognizing digits in this run.
