PyData London 2022

Data Validation for Data Science
06-17, 09:00–10:30 (Europe/London), Tower Suite 2

Have you ever worked really hard on choosing the best algorithm, tuned the parameters to perfection, and built awesome feature engineering methods, only to have everything break because of a null value? Then this tutorial is for you! Data validation is often neglected in data science projects. In this tutorial, we will demonstrate the importance of implementing data validation for data science in commercial, open-source, and even hobby projects. We will then dive into some of the open-source tools available for validating data in Python and learn how to use them so that edge cases never break our models. The open-source Python community will come to our aid as we explore wonderful packages such as Pydantic for defining data models, Pandera for complementing the use of pandas, and Great Expectations for diving deep into the data.

This tutorial will benefit anyone working on data projects in Python who wants to learn about data validation. Some Python programming experience and an understanding of data science are required. The examples and discussion are set in a data science context, but the knowledge can be applied in any Python-oriented project.
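As a small taste of the "define data models" idea mentioned above, here is a minimal Pydantic sketch; the record and field names are hypothetical, chosen only to illustrate how a null value is rejected at ingestion instead of breaking a model downstream:

```python
# Minimal sketch of a Pydantic data model; HouseRecord and its fields
# are hypothetical, for illustration only.
from pydantic import BaseModel, ValidationError

class HouseRecord(BaseModel):
    sqft: float      # a string like "850" is coerced to 850.0
    bedrooms: int
    price: float     # a null here raises ValidationError up front,
                     # instead of silently breaking training later

try:
    HouseRecord(sqft=900.0, bedrooms=2, price=None)
except ValidationError as err:
    print("bad record rejected:", err.errors()[0]["loc"])
```

Because validation happens at object construction, bad records fail loudly at the ingestion boundary rather than deep inside a pipeline.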


For this tutorial, you will need a working Python environment with Jupyter installed, or just a web browser and a Google account for using Google Colab. We will work through the hands-on exercises together in Jupyter notebooks. The context of the tutorial is a standard data science project with the common architecture of data ingestion, feature engineering, model training, model serving, etc. In the first part of the tutorial, we will go through the common pitfalls where unexpected data values can degrade model performance or, even worse, break the run altogether. In light of these potential consequences, we will discuss the importance of data validation. In the second part of the tutorial, we will dive into some of the open-source tools in the Python community that can help us with the validation task:
Pydantic - for defining data models, types, and simple checks.
Pandera - for schema validation on top of pandas DataFrames.
Great Expectations - a framework for data testing, quality checks, and profiling.
All the materials and notebooks needed can be found in this repository: https://github.com/NatanMish/data_validation


Prior Knowledge Expected

Previous knowledge expected: some Python programming experience and an understanding of data science.

Senior Machine Learning Engineer at Zimmer Biomet and a London School of Economics graduate with an MSc in Applied Social Data Science. Passionate about using machine learning to solve complicated problems. I have experience analysing and researching data in the financial, real estate, transportation, and healthcare industries. Curious about (almost) everything and always happy to take on new experiences and challenges. I love finding bugs, especially if they're of my own making!