PyData London 2022

Data Pipelining for Real-time ML Models
06-18, 15:00–15:45 (Europe/London), Tower Suite 1

Reinventing the wheel is usually not something we should strive for, so why did we build our data pipeline from scratch? The numerous design choices involved can greatly affect the potential use cases. When building a custom pipeline, you can make your own trade-offs between speed, throughput, simplicity and consistency of code/logic/data.

Market makers like Optiver are usually associated with ultra-low-latency infrastructure; however, there are plenty of use cases where human latency (seconds) is acceptable. Computing derived metrics, training models and making predictions as new data arrives are just a few such applications, and they are what we will focus on in this presentation.

We will tackle some of the questions we asked ourselves when designing our data pipeline:
* Should the live and historical pipelines share the same code?
* How can we shorten the research-to-production cycle?
* How do we ensure that real-time and backtest results match?
* How can we improve development speed?
* What trade-offs should we make when inputs/data arrive asynchronously?
* How can we improve performance and reduce resource usage?
* How can we speed up day-to-day research?
* How should we handle stateful nodes?
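The first and third questions suggest a pattern worth sketching: if the live feed and the historical replay drive the very same feature code, backtest and production results agree by construction. A minimal Python sketch of this idea (all names here are hypothetical illustrations, not Optiver's actual code):

```python
# Sketch: one shared feature function consumed by both a live feed and a
# historical replay, so the two modes cannot drift apart.
# All names (Quote, mid_price, run_pipeline) are illustrative only.
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Quote:
    timestamp: float
    bid: float
    ask: float


def mid_price(quote: Quote) -> float:
    """Single definition of the derived metric, used in both modes."""
    return (quote.bid + quote.ask) / 2


def run_pipeline(quotes: Iterable[Quote]) -> Iterator[Tuple[float, float]]:
    """Accepts any quote source: a live stream or a historical replay."""
    for q in quotes:
        yield q.timestamp, mid_price(q)


# Historical mode: replay stored quotes through the identical pipeline.
historical = [Quote(1.0, 99.0, 101.0), Quote(2.0, 99.5, 100.5)]
results = list(run_pipeline(historical))
```

Because `run_pipeline` only depends on an iterable of quotes, swapping the stored list for a live socket feed changes the data source but not the logic being backtested.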

Basic knowledge of finance and data pipelining might be beneficial, but no specific knowledge is required to follow the presentation.


Predicting financial time-series is a challenge in itself, and doing it in real time further increases the complexity. Ensuring that the data matches between live (real-time) and historical (backtesting) applications is key for us. If they were to mismatch, the model would be less reliable for making trading decisions, which could have serious consequences. We will describe the trade-offs we made while designing our data pipeline through the example of an ML model.


Prior Knowledge Expected

No previous knowledge expected

Gabor works in the Statistical Arbitrage team at Optiver. He is responsible for building systematic trading strategies and designing the data pipelines. Prior to joining the team, he worked at a systematic hedge fund for 3.5 years. He holds an MSc degree in Mathematical and Computational Finance from the University of Oxford.