PyData London 2022

Data Science at Scale with Dask
06-17, 13:30–15:00 (Europe/London), Tower Suite 2

This tutorial is an introduction to Dask, an open-source Python library for distributed computing. We will walk through the ways you can apply Dask to scale your Python code to larger datasets and to get past compute-bound bottlenecks.

The tutorial will cover:
- how to scale pandas with Dask
- how to scale NumPy with Dask
- how to parallelise your existing Python code with Dask
- how to scale to the cloud with Dask and Coiled

The tutorial assumes no prior knowledge of Dask.


An introduction to distributed computing:
- When, why and how should you leverage distributed computing?
- Introduction to Dask, an OSS Python library for distributed computing

How to parallelise your Python code with Dask:
- Why parallelise your code?
- Using dask.delayed() to parallelise custom code

Scaling your NumPy and pandas workflows:
- How to scale your NumPy and pandas to larger-than-memory datasets?
- Dask Collections: Bags, Arrays and DataFrames

Distributed Machine Learning with Dask:
- How to build distributed ML models

Bursting to the cloud to transcend local compute resources
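
As a taste of the dask.delayed() part of the outline, here is a minimal sketch of parallelising custom code. It is an illustration only, not material from the tutorial itself: the `load` and `process` functions are hypothetical stand-ins for real work, and the threaded scheduler is chosen just to keep the example local.

```python
import dask


def load(i):
    # Hypothetical stand-in for reading one chunk of data
    return list(range(i * 3, i * 3 + 3))


def process(chunk):
    # Hypothetical stand-in for an expensive per-chunk computation
    return sum(x * x for x in chunk)


# Wrapping the functions with dask.delayed builds a task graph lazily;
# nothing is executed at this point.
tasks = [dask.delayed(process)(dask.delayed(load)(i)) for i in range(4)]
total = dask.delayed(sum)(tasks)

# compute() runs the graph; independent chunks can run in parallel.
result = total.compute(scheduler="threads")
print(result)
```

Each chunk is loaded and processed independently, so Dask is free to schedule those tasks in parallel before the final sum combines them.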


Prior Knowledge Expected

No previous knowledge expected

Richard Pelgrim is a data scientist with a passion for communicating technical content in creative and compelling ways. Currently he does so as Developer Advocate at Coiled.io, the leading company built around the open-source Dask library for distributed computing in Python. Richard is regularly invited to give Dask tutorials at meet-ups and conferences and has a treasure chest of expert tips to support anyone looking to take their distributed computing to the next level.