Sunday 13 October 2013

Getting into Kaggle Top 20% in one evening

Kaggle.com is one of the top platforms for data scientists to test their skills against the best, and ranking near the top of its competitions is one of the hardest tasks to accomplish.

However, the gap between the top performers is very small, and improvements in score seem to follow an 80-20 distribution: one can reach the top 20% of scores in a reasonably short time, while reaching the very top takes far more effort.

The top scores also face another problem: implementation in production. Netflix paid $1Mn in its competition, yet the winning models were so complicated that Netflix found it hard to put them into production.


This post describes my results in achieving among the top 20% scores with

1. Feature engineering
2. Single GBM model

Feature engineering means modifying variables, either by reshaping them or by deriving new variables from them.
GBM is quite possibly the best-performing individual model among the many machine learning techniques available.

Competition 1: Carvana

Carvana had ~20 variables, 15 categorical and 5 continuous, with the dependent variable being whether the car was a bad buy or not.

Among the 15 categorical variables, 10 had no more than 5 levels each, while the other 5 had a lot of levels (~1,000).

Given these were factors, they could not be fed directly into the model.

So they were reshaped: each factor level was transformed into a dummy variable taking a binary value of 1 or 0.
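A minimal sketch of this dummy-variable encoding using pandas; the column names below are illustrative stand-ins, not the actual Carvana schema:

```python
import pandas as pd

# Hypothetical sample of Carvana-style data: column names are
# illustrative, not the real competition fields.
df = pd.DataFrame({
    "Transmission": ["AUTO", "MANUAL", "AUTO"],
    "WheelType": ["Alloy", "Covers", "Alloy"],
    "VehOdo": [75000, 62000, 89000],
})

# Expand each categorical column into one 0/1 dummy column per level.
dummies = pd.get_dummies(df, columns=["Transmission", "WheelType"])
print(dummies.columns.tolist())
```

Applied to the real data, a handful of high-cardinality factors like this is exactly how ~20 raw variables balloon into ~100 columns.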

That was pretty much it: we now have ~100 variables and ~70K rows.

On top of this dataset, I applied a GBM model with 1,000 trees and, voila, I was in the top 20% of ~600 participants after one evening's effort.
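A rough sketch of that final step with scikit-learn's gradient boosting classifier; the synthetic data stands in for the encoded Carvana matrix, and apart from the 1,000 trees the hyperparameters are assumptions left at their defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot encoded Carvana matrix.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1,000 trees as in the post; other hyperparameters left at defaults.
model = GradientBoostingClassifier(n_estimators=1000, random_state=0)
model.fit(X_tr, y_tr)

# Score on held-out data, as the leaderboard would.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

In practice you would tune the learning rate, tree depth, and number of trees together rather than fixing the tree count up front.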





Competition 2: Detecting influencers in social network

This competition was about identifying whether user A has more influence in a social network than user B.
The dataset provided features like # of followers, mentions, retweets, posts, and other network features for the two users, with each (A, B) pair on a new row.

Feature engineering for this dataset consisted of taking the difference between A's and B's # of followers, mentions, posts, etc., and also taking their ratio.
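The difference and ratio features can be sketched as below; the `A_*`/`B_*` column names are hypothetical, and the +1 in the denominator is my assumption to guard against division by zero:

```python
import pandas as pd

# Hypothetical pair rows: A_* / B_* column names are illustrative.
pairs = pd.DataFrame({
    "A_followers": [1200, 300], "B_followers": [800, 900],
    "A_retweets": [50, 5], "B_retweets": [40, 20],
})

for feat in ["followers", "retweets"]:
    # Difference feature: how much more of this A has than B.
    pairs[f"diff_{feat}"] = pairs[f"A_{feat}"] - pairs[f"B_{feat}"]
    # Ratio feature; the +1 avoids dividing by zero for inactive users.
    pairs[f"ratio_{feat}"] = pairs[f"A_{feat}"] / (pairs[f"B_{feat}"] + 1)
```

Differences capture absolute gaps while ratios capture relative scale, so the two together give the model complementary views of each pair.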

And voila, this resulted in 12th position in the competition!



