PyData London 2022

Test your data like you test your code
06-18, 10:15–11:00 (Europe/London), Tower Suite 1

I will introduce the concept of data unit tests and explain why they matter in the workflow of data scientists building data products. In this talk, you will learn about a new tool you can use to ensure the quality of the products you build.


When data scientists build data products, they usually need to combine multiple data sources to train their models and then serve predictions. Making sure that both the code and the data behave as expected throughout the full lifetime of the project is complex. To ensure the quality of the code, it is a best practice in software engineering to use automated testing, which is supported by a large body of material. However, ensuring the quality of the input and output data holistically is not yet as well covered. In this talk, I will explain the concept of data unit tests and why they are important. Then I will present an overview of the current libraries that help build data unit tests. Finally, I will explain how we integrated them into our workflow at GetYourGuide.
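
To make the idea concrete, here is a minimal sketch of what a data unit test could look like, written in plain Python rather than with any of the libraries covered in the talk. The rows, column names, and helper functions (`no_missing`, `is_unique`, `non_negative`, `run_checks`) are all hypothetical and chosen only for illustration; this is not the workflow used at GetYourGuide.

```python
# Hypothetical rows standing in for a table loaded from a data source.
ROWS = [
    {"activity_id": 1, "price": 25.0, "city": "London"},
    {"activity_id": 2, "price": 40.0, "city": "Berlin"},
    {"activity_id": 3, "price": -5.0, "city": None},
]


def no_missing(column):
    """Build a check that passes when no row has None in `column`."""
    def check(rows):
        return all(row.get(column) is not None for row in rows)
    return check


def is_unique(column):
    """Build a check that passes when all values in `column` are distinct."""
    def check(rows):
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check


def non_negative(column):
    """Build a check that passes when every value in `column` is >= 0."""
    def check(rows):
        return all(
            row.get(column) is not None and row[column] >= 0 for row in rows
        )
    return check


def run_checks(rows, checks):
    """Run every named check and return the names of the ones that fail."""
    return [name for name, check in checks.items() if not check(rows)]


# Each entry is one "data unit test": a named expectation about the data.
CHECKS = {
    "activity_id is unique": is_unique("activity_id"),
    "city has no missing values": no_missing("city"),
    "price is non-negative": non_negative("price"),
}

failed = run_checks(ROWS, CHECKS)
for name in failed:
    print(f"FAILED: {name}")
```

As with code unit tests, each check states one expectation and reports a clear failure, so a pipeline could run such checks on every new batch of data and stop before training or serving on bad input.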


Prior Knowledge Expected

No previous knowledge expected

Theodore Meynard is a senior data scientist at GetYourGuide. He works on the recommender system that helps customers find the best activities to book and locations to explore. Before GetYourGuide, he built the recommendation system at plista to help online newspapers monetize their content. When he is not programming, he is involved in PyData Berlin, helping to organize monthly meetups. Finally, he loves to ride his bike looking for the best bakery-patisserie in town.