PyData London 2022

Python-centric Feature Stores
06-18, 11:45–12:30 (Europe/London), Tower Suite 3

Most enterprise data used by Data Scientists to train machine learning models is tabular data that comes from data warehouses and data lakes. Recent growth in the popularity of the modern data stack, based on data warehouses and lakehouses like Snowflake, Delta Lake, BigQuery, and Redshift, has led to growth in the use of SQL-centric tools for data engineers, such as dbt. However, Data Scientists' language of choice is Python. How do we square this circle?


In this talk, we investigate the role of the Feature Store for machine learning in enabling Python-native access to enterprise data, both for creating training data and for serving features to models. In particular, we will describe the problem of creating point-in-time consistent training data from Python when the features are spread over many tables in a SQL backend. We will look at how some tools provide Python ORM-style support for generating SQL, while others make it easier for Data Scientists to embed SQL in their Python pipelines. We will then introduce a third way: a domain-specific language in Python that transparently generates SQL that runs on backend platforms. We will work with a motivating example: the (important) problem of predicting the height of surf at a beach.
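
To make "point-in-time consistent" concrete, here is a minimal sketch in pandas, used as a local stand-in for a SQL backend. It is not the API of any feature store discussed in the talk. The beach names, column names, and values are all hypothetical. For each label row, the join picks the most recent feature reading taken at or before the label's timestamp, so no future information leaks into the training data.

```python
import pandas as pd

# Hypothetical label table: surf height observed at a beach at a given time.
labels = pd.DataFrame({
    "beach": ["lahinch", "lahinch", "bundoran"],
    "event_time": pd.to_datetime(
        ["2022-06-01 12:00", "2022-06-02 12:00", "2022-06-01 12:00"]
    ),
    "surf_height_m": [1.2, 2.0, 0.8],
})

# Hypothetical feature table: offshore swell readings with their own timestamps.
features = pd.DataFrame({
    "beach": ["lahinch", "lahinch", "lahinch", "bundoran"],
    "feature_time": pd.to_datetime(
        ["2022-06-01 06:00", "2022-06-01 18:00",
         "2022-06-02 09:00", "2022-06-01 13:00"]
    ),
    "swell_height_m": [1.0, 1.8, 2.1, 0.9],
})

# merge_asof needs both frames sorted by their time keys.
# direction="backward" keeps only readings at or before each label's time.
training = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="beach",
    direction="backward",
)

print(training)
```

In this toy data the only reading for the second beach arrives after its label, so that row gets a missing feature value rather than a reading from the future. A plain equality or nearest-timestamp join would not give that guarantee.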


Prior Knowledge Expected

No previous knowledge expected

Jim is the Co-founder and CEO of Hopsworks