I have recently started a collaboration with artist Svend Sømod, who for a while has been working on a machine that can generate and print cocktails based on math and an existing data-set of cocktail recipes and ingredient combinations. This obviously caught my attention and I became curious about the underlying data for the project and what else could be done with it if one threw some machine learning models into the mix. So I got in touch with Svend to get a glimpse into the data used for the cocktail printer.
The data his machine bases its cocktails on comes from a data-set of cocktail recipes, which internally is reformatted into a table that shows how many times each cocktail ingredient has appeared together with every other cocktail ingredient in the recipes in the data-set. Thus we have an all-to-all table of cocktail ingredient combinations! The data-set contains combinations of 122 cocktail ingredients, so the table has size 122 x 122:
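As a rough idea of what that reformatting step could look like, here is a minimal sketch using pandas. The toy recipes and the function name `cooccurrence_table` are my own assumptions for illustration, not Svend's actual code or data, and leaving the diagonal as unknown is likewise just one possible choice.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy recipes; the real data-set is not reproduced here.
recipes = [
    ["gin", "lemon", "simple"],
    ["rum", "lime", "simple"],
    ["gin", "lime", "simple"],
]


def cooccurrence_table(recipes):
    """Count how often each pair of ingredients shares a recipe."""
    ingredients = sorted({name for recipe in recipes for name in recipe})
    table = pd.DataFrame(0.0, index=ingredients, columns=ingredients)
    for recipe in recipes:
        # Each unordered pair within a recipe counts once, in both directions
        for a, b in combinations(sorted(set(recipe)), 2):
            table.loc[a, b] += 1
            table.loc[b, a] += 1
    # Pairs never seen together (and the diagonal) become NaN, i.e. unknown
    return table.where(table > 0)


table = cooccurrence_table(recipes)
print(table)
```

On the toy recipes, gin and simple (syrup) share two recipes, while gin and rum never meet and therefore end up as NaN.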
From the above table we see that many ingredients aren’t used together in existing cocktails (all the NaN values), so there are a lot of potentially new cocktail combinations to investigate! This seems like an interesting opportunity to apply a recommender system to predict all the unknown relationships and see what fascinating new cocktails would emerge. First however, let’s dive deeper into the data:
The first thing to investigate is the ratio of known to unknown combinations (values):
Known Combinations | 5314 |
Unknown Combinations | 9570 |
Ratio of Known | 35.7 % |
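For reference, these numbers can be obtained by counting the non-NaN cells of the table. The sketch below assumes the table is a pandas DataFrame with NaN marking unknown combinations; the helper name `known_stats` and the toy table are made up for illustration.

```python
import numpy as np
import pandas as pd


def known_stats(table):
    """Return (known, unknown, ratio of known) for a table where NaN = unknown."""
    known = int(table.notna().to_numpy().sum())
    unknown = table.size - known
    return known, unknown, known / table.size


# Hypothetical 3 x 3 table standing in for the real 122 x 122 one
names = ["gin", "lime", "rum"]
toy = pd.DataFrame(
    [[np.nan, 2.0, np.nan],
     [2.0, np.nan, 1.0],
     [np.nan, 1.0, np.nan]],
    index=names,
    columns=names,
)
known, unknown, ratio = known_stats(toy)

# Sanity check of the figures reported above: the cells add up to 122 * 122
reported_ratio = 5314 / (5314 + 9570)
print(f"{reported_ratio:.1%}")  # prints 35.7%
```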
Only 35.7% known combinations is not much but we will have to work with what we have! We can plot these with a heat map to get an overview:
Note that the above heat map only displays the relationship of known vs unknown combinations. As anticipated there are many unknown combinations, however the heat map shows that the known values tend to focus around a subset of ingredients, while other ingredients will have very few known combinations. To further investigate, let’s plot the raw combination values:
Man that’s dark! It would appear that a few ingredients are used together to a much higher extent than the average population. Because of this and the linear scale used in this heat map, everything else gets very dark. If we clip (disallow) combinations larger than 10, we can create the following heat map:
Much better! Now we can get a better idea of the relationships between the different ingredients. Much like in the first heat map, we can see that some ingredients have a significantly higher number of combinations than others and that a dozen or so have combinations with almost all ingredients. However, the above image also shows a fairly nuanced image with many ingredients having a fair amount of known combinations with significant values.
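The clipping step can be read in two ways, and the sketch below shows both on a made-up table: capping every value above 10 at 10, or masking such values out entirely. Which of the two matches the heat map above is an assumption on the reader's part; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one dominant pair (gin + simple)
names = ["gin", "simple", "absinthe"]
toy = pd.DataFrame(
    [[np.nan, 48.0, 3.0],
     [48.0, np.nan, np.nan],
     [3.0, np.nan, np.nan]],
    index=names,
    columns=names,
)

# Reading 1: cap values at 10 so the color scale is not dominated by outliers
capped = toy.clip(upper=10)

# Reading 2: treat values above 10 as missing for plotting purposes
masked = toy.where(toy <= 10)
```

In both cases the unknown combinations stay NaN, so the heat map still distinguishes them from small known values.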
Let’s go further by looking into combinations per ingredient. Summing the combinations per ingredient produces the following histogram:
This shows a clear long-tailed distribution, with a few ingredients having a high combination sum (simple (syrup), lemon, lime, gin, vodka & rum), a fairly large midsection and a preponderance of ingredients with a low number of combinations. The high combination segment is a mix of balancing ingredients (simple (syrup), lemon, lime) and hard alcohol (rum, vodka, gin). This makes sense, given that all cocktails need one or more sources of alcohol as well as a way of balancing out the flavor. Interestingly enough, water also has a high number of combinations. This however is a “feature” of the data-set, where water is counted as an ingredient if the cocktail is shaken over ice, which many cocktails are.
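The per-ingredient sums behind such a histogram can be sketched as a row-wise sum that skips the unknown (NaN) cells. The table below is a made-up stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; NaN marks unknown combinations
names = ["gin", "simple", "absinthe"]
table = pd.DataFrame(
    [[np.nan, 48.0, 3.0],
     [48.0, np.nan, np.nan],
     [3.0, np.nan, np.nan]],
    index=names,
    columns=names,
)

# Sum each row, ignoring NaN (pandas skips NaN by default), highest first
sums = table.sum(axis=1).sort_values(ascending=False)
print(sums)
```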
Having examined the combinations per ingredient, we can now look one step further at combinations per ingredient pair:
The above plot shows the top 150 combinations (with 122 ingredients there are 122 x 121 / 2 = 7381 possible pairs, so plotting all of them isn’t feasible). Again we see a long-tailed distribution forming. Interestingly, the top pair combinations are almost exclusively ingredients that also had high individual combination sums. This would indicate that the most popular ingredients most often appear together.
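To rank pairs, each unordered pair should be counted once, which means reading only the upper triangle of the symmetric table. Here is a minimal sketch under that assumption; `top_pairs` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def top_pairs(table, n=150):
    """Return the n most frequent ingredient pairs from a symmetric table."""
    values = table.to_numpy()
    # Indices strictly above the diagonal: every unordered pair exactly once
    rows, cols = np.triu_indices(len(table), k=1)
    pairs = pd.Series(
        values[rows, cols],
        index=pd.MultiIndex.from_arrays([table.index[rows], table.columns[cols]]),
    )
    return pairs.dropna().sort_values(ascending=False).head(n)


# Hypothetical toy table; NaN marks unknown combinations
names = ["gin", "lemon", "simple", "absinthe"]
table = pd.DataFrame(
    [[np.nan, 12.0, 30.0, 2.0],
     [12.0, np.nan, 25.0, np.nan],
     [30.0, 25.0, np.nan, np.nan],
     [2.0, np.nan, np.nan, np.nan]],
    index=names,
    columns=names,
)
top = top_pairs(table, n=3)

# Number of distinct pairs among 122 ingredients
n_pairs = 122 * 121 // 2
```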
Alcohol, Sugar & Category
Some additional data I received from Svend is the alcohol & sugar levels of each ingredient, as well as category (booze, liqueur, juice, etc.). Let’s dive into that data!
The above sample only shows booze but trust me, there is more. We can plot the distribution between ingredient types to get an overview:
The largest category of ingredients is liqueurs, with booze (hard alcohol) in second place. The Dashes category covers ingredients added in small dashes (mainly bitters such as Angostura). Let’s see how the categories of ingredients are distributed according to alcohol and sweetness:
To no one’s surprise, all booze ingredients have a high level of alcohol. Liqueur has a few values over 40%, but the majority lie in the range 20% – 40%. Wine and sherry are even lower, in the range 10% – 20%, while juice, syrup & other are non-alcoholic (the one exception being beer, which is categorized as “other”). Dashes score higher given that they are bitters.
Looking at the sugar content, the syrups show an interesting pattern with two subgroups. While the majority of syrups are sweet, the data reveals that a select few are in fact tart (such as lemon syrup). Liqueurs have the highest sweetness while maintaining a uniform distribution, followed by wine and juice.
Let’s compare sugar to alcohol:
The above plot shows a fairly good separation between the various ingredients.
Measuring Ingredient Category Combinations
Let’s dig into the ingredient combinations per category to find the most popular categories of cocktail ingredients. We will start by investigating the overall stats for categories:
As mentioned earlier, the ingredients with the most combinations are either hard alcohol or balancing agents. Knowing this, it comes as no surprise that booze has the highest number of combinations, followed by juice (due to lime juice and lemon juice). I find it a bit surprising that syrup doesn’t score higher, given that simple (syrup) is the top used ingredient in the data-set!
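Category statistics like these can be sketched with a pandas group-by that reports the size, total and mean of each category side by side. The column names and all the numbers below are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical ingredients with made-up combination sums
ingredients = pd.DataFrame({
    "ingredient": ["gin", "rum", "mezcal", "cachaca", "bitters",
                   "curacao", "amaretto", "chartreuse", "maraschino"],
    "category": ["booze", "booze", "booze", "booze", "dashes",
                 "liqueur", "liqueur", "liqueur", "liqueur"],
    "combination_sum": [300, 250, 20, 10, 150, 20, 30, 10, 40],
})

stats = (
    ingredients.groupby("category")["combination_sum"]
    .agg(["count", "sum", "mean"])
    .sort_values("sum", ascending=False)
)
print(stats)
```

Keeping the count next to the sum and the mean matters: a category with a single ingredient has a mean equal to that one value, so a small category can rank high on average while ranking low in total.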
Looking at the averages we see a different picture. Dashes have jumped to the top (most likely due to it being by far the smallest category), while booze and liqueur have decreased significantly. It appears there is more to uncover from this data, so let’s get an overview of the categorical distributions with a Kernel Density Estimation (KDE) plot:
Again the overall shape is that of a long-tailed distribution. The above plot really emphasizes that a large portion of the ingredients are liqueurs, yet also shows that most liqueurs lie in the mid-low section (combination wise).
To get a better idea of what is happening further out in the distribution tail, we can plot the same data but as a percentage wise filled plot:
What a difference! This time liqueurs are confined to the lower end (much like before), however booze, juice and syrup (to an extent) can be clearly observed to be the dominant players in the high combination area. Dashes appear to exist only in the midsection. This, combined with the small sample size, is most likely what drives the high average.
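The idea behind a percentage-wise filled plot can be approximated without any density estimation: bin the combination sums and compute each category's share within every bin. This is a binned stand-in for the filled KDE above, not the method used for the plot, and the data and bin edges are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ingredients with made-up combination sums
ingredients = pd.DataFrame({
    "category": ["booze", "booze", "booze", "booze", "dashes",
                 "liqueur", "liqueur", "liqueur", "liqueur"],
    "combination_sum": [300, 250, 20, 10, 150, 20, 30, 10, 40],
})

# Arbitrary illustrative bins: low, mid and high combination sums
bins = pd.cut(ingredients["combination_sum"], bins=[0, 50, 150, 400])

# normalize="index" turns counts into per-bin shares, so each row sums to 1
share = pd.crosstab(bins, ingredients["category"], normalize="index")
print(share)
```

Because every row is normalized to 1, sparsely populated bins in the tail become just as visible as the crowded low end, which is what makes the dominant categories in the high combination area stand out.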
Putting everything together
Let’s go back to the cocktail combinations again and try a new approach to visualizing their relationships using dimensionality reduction! Using dimensionality reduction, we can project high-dimensional data (in this case ingredient combinations with 122 dimensions) down to a lower-dimensional representation (a table with only 2 dimensions per ingredient) that is possible to visualize in a meaningful way. For this purpose I will be using the t-SNE algorithm. After projecting the cocktail ingredient combinations using t-SNE we can visualize the results as a scatter plot:
Interestingly, t-SNE has chosen to group the ingredients with the highest combination sums together. From this group we can observe a trace moving away where ingredient sums slowly diminish. This further supports the observation that the most combined ingredients are combined with each other. We can however also spot a similar cluster among the midsection of combined ingredients.
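A projection like this can be sketched with scikit-learn's `TSNE`. The matrix below is random synthetic data of the same shape as the real table, and the parameter values are illustrative assumptions rather than the settings used for the plot above. Since scikit-learn's t-SNE does not accept NaN, the real table would first need its unknown cells filled, for example with 0, which is also an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic symmetric count matrix standing in for the 122 x 122 table.
# With the real data one would use something like table.fillna(0).to_numpy().
rng = np.random.default_rng(0)
n_ingredients = 122
upper = np.triu(rng.poisson(1.0, size=(n_ingredients, n_ingredients)), k=1)
counts = (upper + upper.T).astype(float)

# Each row (one ingredient, 122 dimensions) is mapped to a 2-D point
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
embedding = tsne.fit_transform(counts)

print(embedding.shape)
```

The resulting array has one 2-D coordinate per ingredient, which can be drawn as a scatter plot and colored by combination sum.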
One last visualization for the road! Let’s plot all cocktail ingredient combinations as a network graph:
Here the width of the connections (combinations) indicates the number of combinations of ingredients. This visualization is created such that the nodes (ingredients) with the most connections are placed closer to the center. Once again we see that the top ingredients are most frequently used together and that a large proportion of ingredients are much further out and are only used in a few combinations. The peripheral ingredients also mainly have connections going towards the central nodes and less frequently between each other.
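A graph of this kind can be sketched with NetworkX, storing the number of combinations as an edge weight. The pair counts are made up, and the force-directed `spring_layout` is only one layout that tends to pull well-connected nodes toward the middle; whether the visualization above used it is an assumption.

```python
import networkx as nx

# Hypothetical pair counts; the real data-set is not reproduced here
pair_counts = {
    ("gin", "simple"): 30,
    ("lemon", "simple"): 25,
    ("gin", "lemon"): 12,
    ("gin", "absinthe"): 2,
}

graph = nx.Graph()
for (a, b), count in pair_counts.items():
    # The weight can later be mapped to the drawn width of the connection
    graph.add_edge(a, b, weight=count)

# Number of distinct partners per ingredient, and summed combination counts
connections = dict(graph.degree())
strength = dict(graph.degree(weight="weight"))

# Force-directed 2-D positions; seed fixed for reproducibility
positions = nx.spring_layout(graph, weight="weight", seed=0)
```

In the toy graph, absinthe has a single connection and it goes to gin, mirroring the pattern of peripheral ingredients connecting mainly to central ones.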
To Sum Up!
Through this analysis we have discovered the following about cocktail ingredient combinations:
- Most combinations are unknown (around 65 %)
- The known combination values tend to focus on a subset of ingredients
- A large midsection of ingredients have a moderate amount of known combinations
- The most combined ingredients are either hard alcohol or balancing agents
- The ingredients with most combinations appear frequently together
- The less frequent ingredients rarely appear together
- Liqueur is the highest represented category
- The top combination categories are booze, juice & syrup
Stay tuned for the next entry, where I will add recommender systems to the mix!