A Plotly Theme Party 🎉

Like many people, I have a “healthy OCD” about making my Data Science presentations look aesthetically pleasing. Your audience will usually appreciate this too, because a polished, consistent presentation conveys your message more clearly. One pattern that emerged while improving presentations was the amount of mundane work involved: aligning elements across all slides, applying the same formatting and colors to every title and chart, adjusting font sizes, et cetera.

An introduction to Support Vector Machines

As data scientists, we design and develop machine learning models, and some of them may appear magical. In practice though, when we explain the internals of these algorithms to end users, they turn out to be not that magical at all, but rather quite simple and intuitive. In this post, I’d therefore like to show you that SVMs are not as complicated as they may seem.

The log10 of 0 is over 9000… right?

Recently, one of our models was taking a very long time to fit after we added a new feature. This was unexpected, as fitting had previously been quite fast. The model in question was a linear SVM, and we could see that it was no longer converging. When this happens with linear SVMs, the first thing to check for is numerical or scaling issues in the training dataset, as libSVM is in theory guaranteed to converge (source). Adding the new feature had indeed introduced numerical issues: a few values greater than 100,000 remained in the dataset after scaling. These values were dominating all the other, much smaller, numerical values.
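
A quick check like the one described can be sketched in a few lines of Python. This is only an illustration, not the code from the post: the helper name `find_badly_scaled_columns`, the threshold of 100, and the toy data are all assumptions made for the example.

```python
import numpy as np


def find_badly_scaled_columns(X, threshold=100.0):
    """Return indices of columns whose largest absolute value exceeds threshold.

    Hypothetical helper: the threshold is an arbitrary choice for illustration.
    """
    X = np.asarray(X, dtype=float)
    # Largest magnitude per column, ignoring NaNs.
    max_abs = np.nanmax(np.abs(X), axis=0)
    return np.flatnonzero(max_abs > threshold).tolist()


# Toy data standing in for a dataset after scaling: column 1 still
# contains one extreme value that dwarfs everything else.
X_scaled = np.array([
    [0.1, 0.5],
    [-0.3, 250000.0],
    [0.7, -0.2],
])

bad_columns = find_badly_scaled_columns(X_scaled)
print(bad_columns)
```

Running a check like this on the scaled training data before fitting makes it easy to spot which feature is responsible when a model stops converging.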

Using AWS Lambda and Slack to have fun while saving on EMR costs

We all have those moments when we hack a piece of code together in 5 minutes. Usually, these pieces of code are not hidden gems; they tend to do simple stuff. Every once in a while though, you will find yourself writing a simple script that leaves you with a big smile afterwards. In this post, I will discuss one such script, which I wrote quite quickly but which still gives the entire team a good laugh from time to time. It also helps us save on AWS EMR costs and keeps the minds within the team sharp. A win-win!

Helping our new Data Scientists start in Python: A guide to learning by doing

The Data Science team at Greenhouse Group is steadily growing and continuously changing, which means new Data Scientists and interns start regularly. Each new Data Scientist we hire is unique and has a different set of skills. What they all have in common though is a strong analytical background and the practical ability to apply it to real business cases. Most of our team, for example, studied Econometrics, a field that provides a strong foundation in probability theory and statistics.