PyData London 2022

Clusterf*ck: A practical guide to Bayesian hierarchical modeling in Pymc3
06-19, 11:00–11:45 (Europe/London), Tower Suite 3

At Apollo Agriculture, a Kenya-based agro-tech startup, one of the challenging problems we face is predicting the yields of Kenyan maize farmers. Like almost all data-sets, this data-set has a hierarchical structure: farmers within the same region aren’t independent. If we ignore this fact, a model may predict yields entirely from the farmer’s region, fail to find any other meaningful insights, and we may not even realize it. However, if we “overcorrected,” treating each region as completely separate, each individual analysis could be underpowered. Enter the hero of our story: Bayesian hierarchical modeling. Using a practical example in Pymc3, we’ll follow this hero as they identify and tame clustered data-sets.


Given the lack of practical guides to Bayesian hierarchical modeling (BHM) in Python, many data scientists have shied away from using it. Using Pymc3, this talk will step through the process of applying BHM to a real-world hierarchical data-set. The walkthrough is aimed at all (data) scientists and researchers who want to learn 1) how to recognize hierarchy in their data, 2) whether it matters, and 3) how to address it.

Every data-set has some degree of hierarchical structure, meaning that observations fall into clusters that are not completely independent of one another.

At Apollo Agriculture, a Kenya-based agro-tech startup, one of the challenging problems we face is predicting the yields of Kenyan maize farmers. The data we work with has a hierarchical structure: farmers within the same region aren’t independent. We tried ignoring this hierarchical structure and trained a (black-box) machine learning model that predicted yields using all the data at once (i.e., complete pooling). However, even when we excluded region as a variable, the model made predictions based on variables that were proxies for region, without us even knowing it. The model did not learn any other meaningful relationships, because it spent all its power on detecting the hierarchy we already knew about.

To prevent this from happening, this talk will help you:
* Recognize hierarchy in your data-set
* Understand when this hierarchy might be important or worth addressing
* Address hierarchy in your data in a way that helps you get the most out of it

To address hierarchy in our data, we are looking for a technique that lets us fit model parameters on the whole dataset at once while sharing information across clusters, thereby reducing model uncertainty. To further increase our statistical power, we want to incorporate prior scientific knowledge about the model parameters. Bayesian hierarchical modeling helps us do exactly that. The brilliance of the approach is that the model itself learns the degree to which different clusters should be treated as separate versus pooled, helping you extract maximum value from the data.
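As a small teaser, here is a minimal sketch of what such a partially pooled model might look like in Pymc3. The data, variable names, and priors below are hypothetical stand-ins for illustration, not Apollo Agriculture’s actual model:

```python
import numpy as np
import pymc3 as pm

# Hypothetical example data: yields (tonnes/acre) for farmers grouped by region.
rng = np.random.default_rng(42)
n_regions = 5
region_idx = rng.integers(0, n_regions, size=200)  # region index for each farmer
yields = rng.normal(1.5 + 0.3 * region_idx, 0.5)   # synthetic yield observations

with pm.Model() as hierarchical_model:
    # Hyperpriors: all regions share a common mean and spread.
    mu_region = pm.Normal("mu_region", mu=1.5, sigma=1.0)     # prior belief about average yield
    sigma_region = pm.HalfNormal("sigma_region", sigma=1.0)   # how much regions differ

    # Region-level intercepts: partially pooled toward the shared mean.
    region_effect = pm.Normal("region_effect", mu=mu_region,
                              sigma=sigma_region, shape=n_regions)

    # Farmer-level observation noise.
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=1.0)
    pm.Normal("obs", mu=region_effect[region_idx], sigma=sigma_obs,
              observed=yields)

    trace = pm.sample(1000, tune=1000, return_inferencedata=True)
```

The hyperprior `sigma_region` is what learns the degree of pooling: if it is inferred to be small, region effects are pulled toward the shared mean (close to complete pooling); if it is large, regions are effectively treated as separate (no pooling).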


Prior Knowledge Expected

No previous knowledge expected

Hanna is a creative and passionate data scientist with experience in energy, agriculture, and credit risk. She has 3+ years of experience in data science and machine learning, and proven skills in MLOps. She currently works at Apollo Agriculture, helping Kenyan smallholder farmers run more profitable businesses.