Kaggle: The Data Behind the Data Science Competitions

Since 2010, Kaggle has hosted (or is currently hosting) 225 public data science competitions. Companies that have posted their data for data miners and statisticians to dig into range from the Otto Group to Santander to Home Depot.

76% of these public competitions have featured cash prizes, ranging from $100 to $500,000. The top end belongs to the Heritage Health Prize, which subsequently featured a $3M prize for the competition's top eligible finishers.

Kaggle has since expanded to include a datasets product and Kernels, a data science collaboration platform it hopes to commercialize. It is still predominantly associated with its public and private data science competitions, which have propelled it into a broader community of 600K+ users.

All of Kaggle's past and current public competitions (along with each competition's prize amount, number of teams and deadline) are available publicly, so I thought it would be interesting to look at some data behind the data science competitions. 

Here's some of what I found:

  • Kaggle public competitions featuring cash prizes peaked in 2013 and have fallen below 30 in each of the past two years.
  • Despite fewer public competitions, more teams than ever are participating. The number of teams per public cash prize competition per year rose 4x between 2013 and 2015.
  • ~17% of all teams participating in a Kaggle public cash prize competition partook in an insurance-specific competition.

Competition frequency

The number of Kaggle public competitions featuring a cash prize peaked in 2013 and has fallen below 30 in each of the past two years. Through early September, there have been just over 20 Kaggle cash prize competitions in 2016.

Here's what the frequency of public cash prize competitions looks like by month of competition deadline. The highest number of competition deadlines in a single month to date fell in September 2012.
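A tally like this takes only a few lines once the public competition list is in hand. The sketch below assumes each competition is a record with a deadline date, a prize amount, and a team count. The `Competition` class, its field names, and the sample rows are hypothetical placeholders, not Kaggle's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Competition:
    # Hypothetical record layout; field names are assumptions, not Kaggle's schema.
    name: str
    deadline: date
    prize_usd: int  # 0 means no cash prize
    teams: int


def cash_prize_counts(competitions, by="year"):
    """Count cash prize competitions per deadline year, or per (year, month)."""
    counts = Counter()
    for comp in competitions:
        if comp.prize_usd <= 0:
            continue  # skip competitions without a cash prize
        if by == "year":
            key = comp.deadline.year
        else:
            key = (comp.deadline.year, comp.deadline.month)
        counts[key] += 1
    return counts


# Made-up placeholder rows, for illustration only.
sample = [
    Competition("example-a", date(2012, 9, 3), 10_000, 120),
    Competition("example-b", date(2012, 9, 28), 5_000, 80),
    Competition("example-c", date(2013, 1, 15), 0, 40),
    Competition("example-d", date(2013, 6, 30), 25_000, 300),
]

by_year = cash_prize_counts(sample, by="year")
by_month = cash_prize_counts(sample, by="month")
busiest_month, busiest_count = by_month.most_common(1)[0]
```

Grouping on the deadline date rather than the launch date mirrors the convention used in this post. A competition that starts in one year and closes in the next counts toward the later year.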

Prize money

Despite fewer competitions, 2016 has featured more aggregate prize dollars deployed in Kaggle public competitions than any prior year: $1.1M, with nine competitions featuring $50K+ in prizes and three at $100K or more.

Teams

The number of teams actively participating in Kaggle public cash prize competitions has risen significantly over the past five years. Already in 2016 YTD, over 29K teams have participated in a cash prize competition. In 2013, fewer than 10K teams participated in one. Note: this data includes competitions that are limited or invitation-only.

The upshot of this is that while more data scientists are keen to showcase their talent publicly on Kaggle, the number of opportunities to do so hasn't kept up. In aggregate, the number of teams per competition by year has risen to over 1300 in 2016 YTD from under 250 in 2013. This fact is not lost on Kaggle. As CEO Anthony Goldbloom said in a recent interview,

Companies sign up much slower than users, it’s a challenge. We have a mismatch in the velocity of the two sides of our market. To the extent that companies aren’t willing to put their data up online, we are not really a great solution...I think that, having worked with a lot of companies, there’s very little that hasn’t already sort of bled out from one company to the next. The benefits of getting outside ideas [to solve] your problems actually outweighs this mythical idea of protecting IP.
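The teams-per-competition figures quoted above come down to a grouped average, which can be sketched as follows. The function name and the sample rows are hypothetical. The final two lines use only the rounded figures from this post, so treat the results as rough consistency checks rather than exact values.

```python
from collections import defaultdict


def teams_per_competition(rows):
    """Return {year: total teams / number of competitions} for (year, teams) rows."""
    totals = defaultdict(lambda: [0, 0])  # year -> [team_sum, competition_count]
    for year, teams in rows:
        totals[year][0] += teams
        totals[year][1] += 1
    return {year: team_sum / n for year, (team_sum, n) in totals.items()}


# Made-up placeholder rows (deadline year, teams entered), for illustration only.
rows = [(2013, 200), (2013, 300), (2016, 1000), (2016, 1500)]
ratios = teams_per_competition(rows)

# Rough check with the post's rounded figures: about 29K teams at about
# 1,300 teams per competition implies roughly 22 competitions in 2016 YTD,
# in line with "just over 20".
implied_competitions_2016 = 29_000 / 1_300

# Growth multiple implied by the rounded bounds (1,300 vs. 250 teams per competition).
growth_multiple = 1_300 / 250
```

Because the 2016 figure is a lower bound and the 2013 figure an upper bound, the multiple computed here is a floor on the actual increase.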

Insurance & data science talent on Kaggle

Lastly, I was also interested to see how Kaggle public competitions have attracted data science talent to insurance-specific challenges and what those competitions have looked like over time. Notably, Allstate's 2011 competition to predict Bodily Injury Liability Insurance claim payments made headlines in the WSJ, US News & World Report, and Gigaom. Some startups have also achieved quite a bit of success in insurance-related competitions. One is DataRobot, which says it employs over a dozen data scientists who've made Kaggle's top 100 rankings and which was itself co-founded by Travelers alums who participated in competitions.

Eight different insurers have hosted cash prize competitions on Kaggle since 2010. In total, these competitions have attracted 14,866 teams and featured disclosed prizes of $355,000 in aggregate. Interestingly, of the nearly 90K teams that have participated in any Kaggle cash prize competition, nearly 17% partook in an insurance-specific challenge.
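The share quoted above is simple division. Here is a quick sketch using the figures from this post. The 90,000 denominator is an assumed rounding of "nearly 90K", so the true share sits slightly above the value computed here.

```python
insurance_teams = 14_866        # teams in insurer-hosted cash prize competitions (from the post)
all_cash_prize_teams = 90_000   # assumption: "nearly 90K" rounded up to 90,000

# Fraction of all cash prize teams that entered an insurance competition.
insurance_share = insurance_teams / all_cash_prize_teams
insurance_share_pct = round(insurance_share * 100, 1)
```

With these inputs the share works out to about 16.5%, consistent with "nearly 17%" once the smaller true denominator is taken into account.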

As the table below highlights, 2016 YTD has featured the highest number of insurers launching Kaggle competitions (though still fewer than five in total). These competitions also shine a light on the areas insurers were interested in, from a data science standpoint, at the time of each competition. Those range from more automated risk models in life insurance (Prudential) to evaluating the effectiveness of computer vision in spotting distracted drivers (State Farm).