PyData London 2022

08:00
08:00
60min
Registration & Breakfast
Tower Suite 1
09:00
09:00
90min
Data Validation for Data Science
Natan Mish

Have you ever worked really hard on choosing the best algorithm, tuned the parameters to perfection, and built awesome feature engineering methods only to have everything break because of a null value? Then this tutorial is for you! Data validation is often neglected in the process of working on data science projects. In this tutorial, we will demonstrate the importance of implementing data validation for data science in commercial, open-source, and even hobby projects. We will then dive into some of the open-source tools available for validating data in Python and learn how to use them so that edge cases will never break our models. The open-source Python community will come to our aid, and we will explore wonderful packages such as Pydantic for defining data models, Pandera for complementing the use of Pandas, and Great Expectations for diving deep into the data. This tutorial will benefit anyone working on data projects in Python who wants to learn about data validation. Some Python programming experience and an understanding of data science are required. The examples used and the context of the discussion are around data science, but the knowledge can be applied to any Python-oriented project.
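To give a flavour of the kind of validation the tutorial covers, here is a minimal sketch (not taken from the tutorial materials) using Pydantic to reject a null value before it reaches a model; the `Observation` model and its fields are illustrative assumptions.

```python
# Hypothetical data model: a null or missing required field fails fast.
from typing import Optional
from pydantic import BaseModel, ValidationError

class Observation(BaseModel):
    age: int                      # required; None or missing raises an error
    income: float
    region: Optional[str] = None  # explicitly allowed to be absent

try:
    Observation(age=None, income=52000.0)
except ValidationError as err:
    print(err)  # reports that 'age' is not a valid integer
```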

Tower Suite 2
09:00
90min
SQLAlchemy and you - making SQL the best thing since sliced bread
Anders Bogsnes

Are you writing SQL strings in your code? Have you only used ORMs and want to start getting more control over your SQL?

SQLAlchemy is the gold standard for working with SQL in Python, and this tutorial will get you comfortable working with it so you can take advantage of its power. We will go through the Core and ORM abstractions so you'll be comfortable navigating the different layers and be able to fully use the power of Python when writing your SQL.
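As an illustration of the Core layer mentioned above, here is a small hedged sketch (table and columns invented for the example, using the 1.4-style select()) that builds SQL as composable Python expressions rather than strings.

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# Hypothetical table definition for the example.
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)
metadata.create_all(engine)

# Core: compose the statement in Python instead of writing a SQL string.
stmt = select(users.c.name).where(users.c.id == 1)
with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()
```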

Tower Suite 3
09:00
90min
sktime - python toolbox for time series: how to implement your own estimator
Franz Kiraly

sktime is a widely used scikit-learn-compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack. This tutorial explains how to write your own sktime estimator, e.g., forecaster, classifier, transformer, by using sktime’s extension templates and testing framework. A custom estimator can live in any local code base, and will be compatible with sktime pipelines and scikit-learn. A continuation of the sktime introductory tutorial at pydata [link].

Tower Suite 1
10:30
10:30
30min
Break & Snacks
Tower Suite 1
10:30
30min
Break & Snacks
Tower Suite 2
10:30
30min
Break & Snacks
Tower Suite 3
11:00
11:00
90min
Feature Engineering Made Simple
Kajanan Sangaralingam, Anindya Datta

Of all the choices made by data scientists in the course of building and operating models, feature engineering & selection is one of the most critical. Features have a substantive impact on a model’s quality, including its predictive accuracy and resilience. Unfortunately, as most ML scientists and practitioners are aware, feature engineering is more art than science. It is ad-hoc, messy, error-prone and ends up consuming 70-80% of the time and effort when building models, often resulting in sub-optimal feature selection and low-quality models. In this tutorial, we will introduce new ways of performing feature engineering, turning it into a systematic, procedural and scalable process that is substantially more efficient than current practice. Participants will perform a hands-on, end-to-end feature building exercise, with particular emphasis on feature engineering using Anovos (https://anovos.ai/ or https://github.com/anovos/anovos).

Tower Suite 2
11:00
90min
Probabilistic Python: An Introduction to Bayesian Modeling with PyMC
Chris Fonnesbeck

Bayesian statistical methods offer a powerful set of tools to tackle a wide variety of data science problems. In addition, the Bayesian approach generates results that are easy to interpret and automatically account for uncertainty in quantities that we wish to estimate and predict. Historically, computational challenges have been a barrier, particularly to new users, but there now exists a mature set of probabilistic programming tools that are both capable and easy to learn. We will use the newest release of PyMC (version 4) in this tutorial, but the concepts and approaches that will be taught are portable to any probabilistic programming framework.

This tutorial is intended for practicing and aspiring data scientists and analysts looking to learn how to apply Bayesian statistics and probabilistic programming to their work. It will provide learners with a high-level understanding of Bayesian statistical methods and their potential for use in a variety of applications. They will also gain hands-on experience with applying these methods using PyMC, specifically including the specification, fitting and checking of models applied to a couple of real-world datasets.

As this is an introductory tutorial, no direct experience with PyMC or Bayesian statistics will be required. However, to benefit maximally from the tutorial, learners should have some familiarity with basic statistics (things like regression and estimation) and with core components of the scientific Python stack (e.g. NumPy, pandas and Jupyter).
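For readers who want a concrete picture of the specify/fit/check workflow mentioned above, here is a minimal hedged sketch using the PyMC v4 API; the data and priors are invented for illustration.

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(42)
observed = rng.normal(loc=2.0, scale=1.0, size=100)  # stand-in dataset

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)        # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)       # prior on the spread
    pm.Normal("y", mu=mu, sigma=sigma, observed=observed)
    idata = pm.sample()                             # MCMC fitting

az.summary(idata)  # posterior checking / diagnostics
```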

Tower Suite 3
11:00
90min
Train Object Detection with small Datasets
vincenzo crescimanna

Object detection, the task of localising and classifying objects in a scene, is one of the most popular tasks in computer vision, but it has a major drawback: a large annotated dataset is necessary to train the model. Annotating a dataset is expensive, and the freely available datasets are often not enough, as they do not contain all the classes we are interested in. Thus, the goal of the tutorial is to introduce the main techniques to train a good object detector utilising the minimum amount of annotated data.

Tower Suite 1
12:30
12:30
60min
Lunch
Tower Suite 1
12:30
60min
Lunch
Tower Suite 2
12:30
60min
Lunch
Tower Suite 3
13:30
13:30
90min
Data Science at Scale with Dask
Richard Pelgrim

This tutorial is an introduction to Dask, an OSS Python library for distributed computing. We will walk through the many ways you can apply Dask to scale your Python code to work with larger datasets and/or transcend other compute-bound limitations.

The tutorial will cover:
- how to scale pandas with Dask
- how to scale NumPy with Dask
- how to parallelise your existing Python code with Dask
- how to scale to the cloud with Dask and Coiled

The tutorial assumes no prior knowledge of Dask.
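A hedged sketch of the kinds of APIs listed above; the toy data, column names, and function are placeholders rather than tutorial material.

```python
import pandas as pd
import dask.dataframe as dd
import dask.array as da
from dask import delayed, compute

# Scale pandas: same DataFrame API, evaluated lazily across partitions.
pdf = pd.DataFrame({"day": ["mon", "mon", "tue", "tue"], "value": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=2)      # in practice, e.g. dd.read_csv("data/*.csv")
daily_mean = ddf.groupby("day")["value"].mean().compute()

# Scale NumPy: chunked arrays with the familiar array API.
x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
total = x.sum().compute()

# Parallelise existing Python code with dask.delayed.
@delayed
def slow_square(n):
    return n * n

squares = compute(*[slow_square(i) for i in range(10)])
```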

Tower Suite 2
13:30
90min
Introducing more of the standard library
Simon Ward-Jones

This tutorial is for novice Python users who want to learn about some of the helpful modules that come in the Python standard library. In particular we will talk about pathlib, datetime, collections, itertools, and functools! Please come with Jupyter Notebook installed.
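As a small taste of those modules (not the tutorial notebook itself), here is a hedged sketch touching each one; the paths and values are made up.

```python
from pathlib import Path
from datetime import date
from collections import Counter
from itertools import islice
from functools import lru_cache

print(Path.cwd() / "notebooks")               # pathlib: build paths with /
print(date.today().isoformat())               # datetime: today's date as ISO text
print(Counter("mississippi").most_common(2))  # collections: count letters
print(list(islice(range(100), 5)))            # itertools: take the first five items lazily

@lru_cache(maxsize=None)
def fib(n):                                   # functools: memoise a recursive function
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))
```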

Tower Suite 1
13:30
90min
Picking What to Watch Next - build a recommendation system
Cheuk Ting Ho

Recommendation algorithms are the driving force of many businesses: e-commerce, personalized advertisement, on-demand entertainment. Computer algorithms know what you like and present you with things that are customized for you. Here we will explore how to do that by building a system ourselves.

Tower Suite 3
15:00
15:00
30min
Break & Snacks
Tower Suite 1
15:00
30min
Break & Snacks
Tower Suite 2
15:00
30min
Break & Snacks
Tower Suite 3
15:30
15:30
90min
Document/sentence similarity solution using open source NLP libraries, frameworks and datasets
Ade Idowu

The need to develop robust document/text similarity measures is an essential step in building applications such as recommendation systems, search engines, and information retrieval systems, as well as other ML/AI applications such as news aggregators or automated recruitment systems used to match CVs to job specifications. In general, text similarity is the measure of how lexically and semantically close words/tokens, tweets, phrases, sentences, paragraphs and entire documents are to each other. Texts/words are lexically similar if they have a similar character sequence or structure, and semantically similar if they have the same meaning, describe similar concepts, and are used in the same context.

This tutorial will demonstrate a number of strategies for feature extraction, i.e., transforming documents to numeric feature vectors. This transformation step is a prerequisite for computing the similarity between documents. Typically, each strategy involves 4 steps, namely: 1) the use of standard natural language pre-processing techniques to prepare/clean the documents, 2) the transformation of the document text into numeric vectors/embeddings, 3) calculation of document similarity using metrics such as cosine, Euclidean and Jaccard, and 4) validation of the findings.
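For steps 2 and 3 of the strategies above, a minimal hedged sketch using scikit-learn; the toy documents are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Data scientist with experience in Python and machine learning",
    "Looking for a machine learning engineer proficient in Python",
    "Head chef required for a busy London restaurant",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)  # step 2: numeric vectors
similarity = cosine_similarity(vectors)                              # step 3: cosine similarity
print(similarity.round(2))  # the first two documents score much closer than the third
```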

Tower Suite 3
15:30
90min
How to Stack Neural Networks Together? Ideas and Applications
Pranjal Biyani

Exploring the process, implementation, practical applications, and advantages of stacking neural networks together. The tutorial focuses on building tunable, high-performance, multi-data-type feature models seamlessly using network concatenations in TensorFlow. We implement 3 examples and also derive explainability for a stacked neural network.
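A hedged sketch of the concatenation idea using the Keras functional API in TensorFlow; the input sizes and layer widths are arbitrary choices for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

numeric_in = tf.keras.Input(shape=(10,), name="numeric")   # e.g. tabular features
text_in = tf.keras.Input(shape=(300,), name="text")        # e.g. text embeddings

numeric_branch = layers.Dense(32, activation="relu")(numeric_in)
text_branch = layers.Dense(64, activation="relu")(text_in)

merged = layers.concatenate([numeric_branch, text_branch])  # stack the branches
output = layers.Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[numeric_in, text_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```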

Tower Suite 1
15:30
90min
Parallelism the Old Way: Using MPI in Python with mpi4py
Nick Radcliffe

MPI is one of the oldest, best-established, and best-tested approaches to parallel computing, with bindings for most languages and availability on most systems. MPI uses explicit message passing and can be used on "shared-nothing" systems (in which each process/processor has its own memory, unavailable to other processors) as well as shared-memory systems (uniform and non-uniform).
This tutorial will provide a gentle introduction to parallel computing with MPI, specifically using the Python mpi4py library.
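A minimal hedged mpi4py sketch of the explicit message passing described above: rank 0 sends a Python object to rank 1 (run with something like `mpiexec -n 2 python demo.py`).

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=11)  # explicit send from rank 0
elif rank == 1:
    data = comm.recv(source=0, tag=11)                 # blocking receive on rank 1
    print(f"rank 1 received {data}")
```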

Tower Suite 2
08:00
08:00
60min
Registration & Breakfast
Tower Suite 1
09:00
09:00
15min
Opening Notes
Tower Suite 1
09:15
09:15
45min
Possible Futures for Jupyter
Sylvain Corlay

Jupyter has changed the way we think about interactive computing, scientific communication, and science education as it has been adopted globally, both in academia and industry.

Tower Suite 1
10:00
10:00
15min
Break & Snacks
Tower Suite 1
10:15
10:15
45min
Fuzzy Matching at Scale
Thusal

Fuzzy matching is a useful, well-discussed tool. However, the popular methods based on edit distances like Levenshtein or Jaro-Winkler have failed to keep up with increasing data sizes. This talk will walk you through modern methods based on character-based n-grams, vector space models, and approximate nearest neighbours for fuzzy matching at scale.
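A hedged sketch of the character n-gram and nearest-neighbour idea; scikit-learn's exact NearestNeighbors stands in here for the approximate index used at real scale, and the names are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = ["PyData London", "Py Data Londin", "NumFOCUS", "PyData Berlin"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
matrix = vectorizer.fit_transform(names)   # vector space of character n-grams

index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, neighbours = index.kneighbors(vectorizer.transform(["pydata londonn"]))
print(neighbours)  # indices of the closest rows in `names`, despite the typos
```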

Tower Suite 3
10:15
45min
Making fake data generators for open source healthcare data science projects
Matthew Cooper, Jennifer Hall
Tower Suite 2
10:15
45min
Test your data like you test your code
Theodore Meynard

I will introduce the concept of data unit tests and why they are important in the workflow of data scientists when building data products. In this talk, you will learn a new tool you can use to ensure the quality of the products you build.

Tower Suite 1
11:00
11:00
45min
How Pyodide and a new open-source community are improving children’s social work.
Tambe Tabitha Achere

Social care workers support the most disadvantaged children in the UK, and we help improve the sector with data and digital tools. Due to the extremely sensitive nature of the data in this context and long bureaucratic processes, data tools could neither be created to function on the internet nor be installed by the users. This is a talk about how we coached social care workers to build a data cleaning tool and how Pyodide enabled it to scale. This talk is for people intrigued by complex problems. No previous knowledge is required.

Tower Suite 2
11:00
45min
Running the first automatic speech recognition (ASR) model with HuggingFace
Mia Chang

Come and build your first audio machine learning model with an automatic speech recognition (ASR) use case! ASR powers popular applications such as voice-controlled assistants and voice-to-text/speech-to-text tools. These applications take audio clips as input and convert speech signals to text.
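A hedged sketch of what running a first ASR model can look like with the transformers pipeline API; the audio file path is a placeholder, and the model used is whichever default transformers selects for the task.

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")   # downloads a default ASR model
result = asr("meeting_recording.wav")            # placeholder path to an audio clip
print(result["text"])                            # the transcribed speech
```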

Tower Suite 1
11:00
45min
“Off with their I/Os!” - or how to contain madness by isolating your code
Sarah Diot-Girard

Engulfed in a tedious refactoring of your code, you’re adding the 7th layer of mocks to a test when you realise something must have gone wrong somewhere, but what? You’ve written readable code, split into functions and classes to avoid long chunks of code, and yet, every time, you end up with hardly testable code, a test suite that runs for hours, functions with seventeen arguments, and you wonder if it’s you mocking the code or the code mocking you.

Follow the white rabbit with me to learn about usual problems of code organization and I/O architecture, and some tricks on how to handle I/Os and dependencies isolation. We might encounter a bit of SOLID advice, and maybe even a nice hat!
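One hedged illustration of the idea (not the speaker's example): keep the business logic pure and push the I/O to a thin shell, so the core can be unit-tested without any mocks.

```python
import json

def summarise_orders(orders):
    """Pure function: no I/O, trivially testable with plain dicts."""
    total = sum(order["amount"] for order in orders)
    return {"count": len(orders), "total": total}

def main(path):
    """Thin I/O shell around the pure core."""
    with open(path) as f:
        orders = json.load(f)
    print(summarise_orders(orders))
```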

Tower Suite 3
11:45
11:45
45min
Audio Neural Networks without Ground Truth: How to avoid humans in the loop at all costs
Orian Sharoni

Training audio neural networks requires creating or using pre-existing manually tagged data. In this talk we will review the state of the art algorithms that automate this process and show how they can help in real-world use-cases.

Tower Suite 1
11:45
45min
Beyond pandas: The great Python dataframe showdown
Juan Luis Cano Rodríguez

The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays there are several open-source projects that claim to improve pandas in various ways, either by bringing it to a distributed computing setting (Dask), accelerating its performance with minimal changes (Modin), or offering a slightly different API that solves some of its shortcomings (Polars).

In this talk we will go over some of the most widely used dataframe Python libraries beyond pandas, clarify the relationship between them, compare them in terms of project scope and proximity to the original pandas API, and offer advice on when to use each of them.

If you are a seasoned pandas user willing to explore alternatives, or a beginner wondering what all the fuss is about with these new dataframe libraries, this talk is for you!
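A hedged side-by-side sketch of the same aggregation in pandas and Polars; the toy data is invented, and the Polars call shown uses its expression-based API.

```python
import pandas as pd
import polars as pl

data = {"city": ["London", "London", "Leeds"], "sales": [10, 20, 5]}

# pandas
pdf = pd.DataFrame(data)
print(pdf.groupby("city")["sales"].sum())

# Polars: similar idea, built around explicit column expressions
# (the method is named .group_by in newer Polars versions)
pldf = pl.DataFrame(data)
print(pldf.groupby("city").agg(pl.col("sales").sum()))
```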

Tower Suite 2
11:45
45min
Executives at PyData
Ian Ozsvald

Executives at PyData is a facilitated discussion session for executives and leaders to discuss challenges around designing and delivering successful data projects, organizational communication, product management and design, hiring, and team growth.

We'll announce the agenda at the start of the session; you can ask questions or raise issues to get feedback from other leaders in the room, NumFOCUS board members, and Ian and James.

Organized by Ian Ozsvald (London) and James Powell (New York)

Beaufort
11:45
45min
Python-centric Feature Stores
Jim Dowling

Most enterprise data used by Data Scientists to train machine learning models is tabular data that comes from data warehouses and data lakes. Recent growth in the popularity of the modern data stack, based on lakehouses like Snowflake, Delta Lake, Big Query, and Redshift, has led to growth in the use of SQL-centric tools for data engineers, such as DBT. However, Data Scientists' language of choice is Python. How do we square this circle?

Tower Suite 3
12:30
12:30
60min
Lunch
Tower Suite 1
12:30
60min
Lunch
Tower Suite 2
12:30
60min
Lunch
Tower Suite 3
13:30
13:30
45min
Machine Learning 2.0 with Hugging Face
Julien Simon

In this session, we’ll introduce you to Transformer models and what business problems you can solve with them. Then, we’ll show you how you can simplify and accelerate your machine learning projects end-to-end: experimenting, training, optimizing, and deploying. Along the way, we’ll run some demos to keep things concrete and exciting!

Tower Suite 2
13:30
120min
Make your first Jupyter open-source contribution
Afshin T. Darian

In this sprint, we will guide you step-by-step to show you how to make an open-source contribution to JupyterLab and other parts of Project Jupyter. We will start with a short tutorial on how to use Git and GitHub, how to author a change to a project, and how to open a pull request. We will also provide a list of curated issues that are straightforward to resolve to offer a good place to start. Make your first (or nth) Jupyter open-source contribution by opening a pull request today.

Beaufort
13:30
45min
Measurement and Fairness: Questions and Practices to Make Algorithmic Decision Making more Fair
Adrin Jalali

Machine learning is almost always used in systems which automate or semi-automate decision making processes. These decisions appear in recommender systems, fraud detection, healthcare recommendation systems, etc. Many systems, if not most, can induce harm by giving a less desirable outcome for cases where they should in fact give a more desired outcome, e.g. reporting an insurance claim as fraud when indeed it is not.

In this talk we first go through different sources of harm which can creep into a system based on machine learning [1], and the types of harm an ML based system can induce [2].

Taking lessons from social sciences, one can see input and output values of automated systems as measurements of constructs or a proxy measurement of those constructs. In this talk we go through a set of questions one should ask before and while working on such systems. Some of these questions can be answered quantitatively, and others qualitatively [3].

[1] Suresh, H., Guttag, J., Kaiser, D., & Shah, J. (2021). Understanding Potential Sources of Harm throughout the Machine Learning Life Cycle. MIT Case Studies in Social and Ethical Responsibilities of Computing, (Summer 2021). https://doi.org/10.21428/2c646de5.c16a07bb
[2] The Trouble with Bias - NeurIPS 2017 Keynote - Kate Crawford, https://www.youtube.com/watch?v=fMym_BKWQzk
[3] Jacobs, Abigail Z., and Hanna Wallach. "Measurement and fairness." Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021.

Tower Suite 1
13:30
45min
Testing, testing: On experimental drift and data driven product design
Yizhar (Izzy) Toren

A/B testing is (and should be) the gold standard for making data driven decisions. However, basing your decisions solely on tests can lead to very bad product decisions, primarily because of different types of hard-to-track changes to your environment (aka "experimental drift"). In this talk I will explain what experimental drift is and how it can affect your product design and A/B testing choices. I will also review a few strategies for handling drift as a data scientist working in a product team and show examples.

Tower Suite 3
14:15
14:15
45min
Can you Read This? (Or: how I Improved Text Readability on the Web for the Visually Impaired)
Asya Frumkin

This talk will describe how I used deep learning to identify texts on a background image that are illegible for people with vision impairments. I will explain the challenges I encountered when using different OCR architectures for this task and talk about the original solution I came up with.

Tower Suite 1
14:15
45min
Models schm-odels: why you should care about Data-Centric AI
Marysia Winkels

Data-Centric AI is the term coined by AI pioneer Andrew Ng for the movement that argues we should shift our focus towards iterating on our data instead of our models to improve machine learning predictions. But isn't this what we have always done? Why is this trend relevant now? Has something really changed, and if so, how does that change your work as a data scientist?

This talk will feature anecdotes and real-world examples of 'model-itis' that serve as an argument for data-centric AI, our lessons learned from winning the Data Centric AI competition, and practical tips on how you can integrate data-centric principles in your daily work.

Tower Suite 3
14:15
45min
Understanding your bank statement in 100ms
Chady Dimachkie, Robin Kahlow, Dr. Jonathan Kernes, Dr. Ilia Zintchenko

In the last year, the global number of fintech companies has nearly doubled. Yet, despite the rapid growth, there is one area of banking that has been notoriously difficult to modernize: financial transactions. More than 1 billion transactions occur every day around the world. Transactions are different in every country and language, require knowledge of every merchant and location, depend on the context of the surrounding parties involved and are specific for each use case. At Ntropy, we enable developers to parse financial transactions in under 100ms with super-human accuracy, unlocking the path to a new generation of autonomous finance, powering products and services that have never before been possible. We will for the first time discuss the key parts of our pipeline, made possible by the latest advancements in natural language understanding and unsupervised learning.

Tower Suite 2
15:00
15:00
45min
Data Pipelining for Real-time ML Models
Gabor Bakos

Reinventing the wheel is usually not something we should be striving for, so why did we build our data pipeline from scratch? There are numerous design choices people make, and they can strongly affect the potential use cases. When making a custom pipeline you can make your own trade-offs between speed, throughput, simplicity and consistency of code/logic/data.

Market makers like Optiver are usually associated with ultra-low latency infrastructure, however there are plenty of use cases where human latency (seconds) is acceptable. Computing derived metrics, training models and making predictions as new data arrives are just a few such applications and what we will focus on in this presentation.

We will tackle some of the questions we asked ourselves on the design choices for our data pipeline.
* Should you write code that is used by both the live and historical pipelines?
* How to improve the research-to-production cycle?
* How do we ensure that real-time and backtest results match?
* How to improve development speed?
* What trade-offs to make if inputs/data arrive asynchronously?
* How to improve performance and reduce resource usage?
* How can we speed up day-to-day research?
* What to do with stateful nodes?

Basic knowledge of finance and data pipelining might be beneficial, but no specific knowledge is required to follow the presentation.

Tower Suite 1
15:00
45min
Large Language Models for Real-World Applications - A Gentle Intro
Jay Alammar

Machine language understanding and generation have been undergoing rapid improvements due to recent breakthroughs in machine learning (e.g. large language models like GPT and BERT). And while big tech and NLP engineers were quick to capitalize on these models, the broader developer community lags in adopting them and realizing their potential in their business domains.

This talk provides a gentle and highly visual overview of some of the main intuitions and real-world applications of large language models. It assumes no prior knowledge of language processing and aims to bring attendees up to date with the fundamental intuitions and applications of large language models.

Tower Suite 2
15:00
45min
Unlocking the power of gradient-boosted trees (using LightGBM)
Pedro Tabacof

Gradient-boosted trees (XGBoost, LightGBM, Catboost) have become the staple of machine learning for tabular datasets. While most data scientists have made use of them at some point, many don’t know the true power those Python libraries provide. I will take LightGBM as an example and show in practice how it handles missing value imputation and categorical encoding natively, the different loss functions it provides for different problems (including the creation of your own loss function!), and how to interpret the resulting models. My aim is to show how LightGBM is like a Swiss army knife for machine learning and why it is the most pragmatic choice for tabular problems.
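A hedged sketch of two of the conveniences mentioned: LightGBM consuming missing values and pandas categorical columns without manual imputation or encoding (toy data, with parameters chosen only so the example runs).

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

X = pd.DataFrame({
    "colour": pd.Categorical(["red", "blue", None, "red", "blue", "red"]),  # native categorical
    "size": [1.0, np.nan, 3.0, 2.0, 5.0, 4.0],                              # missing value left as NaN
})
y = np.array([0, 1, 1, 0, 1, 0])

model = lgb.LGBMClassifier(n_estimators=10, min_child_samples=1)
model.fit(X, y)            # no manual imputation or encoding required
print(model.predict(X))
```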

Tower Suite 3
15:45
15:45
15min
Break & Snack
Tower Suite 1
16:00
16:00
60min
Lightning Talks

Lightning talks will take place at the end of the day on Saturday and Sunday in the plenary room. They will be 5-minute talks on any topic of interest for the PyData community. Sign-ups will be at the registration table. If we get more sign-ups than allotted spots we will randomly select talks out of a hat. We encourage anyone interested to sign up for a 5-minute slot.

Tower Suite 1
17:00
17:00
60min
Social Event - Hosted by Hopsworks

Join us following the closing notes and lightning talks for an evening social hosted by Hopsworks. Canapés and the first round of drinks are sponsored by Hopsworks, with a cash bar available thereafter. Make sure to also enjoy one of the Hopsworks beers available to you at the bar. The social will be in the foyer (expo space) at the conference venue.

Tower Suite 1
18:00
18:00
60min
Pub Quiz - hosted by quizmaster James Powell

During the social event, we will host our traditional pub quiz, hosted by quiz master James Powell, back in the main talk room. Come ready to meet your fellow community members, show off your knowledge and hopefully learn something new!

Tower Suite 1
08:00
08:00
60min
Breakfast
Tower Suite 1
09:00
09:00
45min
Keynote: Key Challenges in the PyData Ecosystem and How We Can All Make a Difference
Tania Allard

The PyData - and more broadly the scientific computing - ecosystem has seen massive growth both in adoption and complexity over the last few years, maybe decades. As for many other open-source ecosystems, this growth has also opened the door to complex socio-technical challenges. Many of which can directly impact the long-term sustainability of the ecosystem and its community.

This talk will dive into some of these current challenges and opportunities for us, the users, contributors, maintainers, activists, sponsors, and insert many other hats to help overcome those hurdles.

All while being intentional about the core tenets of collaboration, transparency, and openness that fuel our ecosystem.

Tower Suite 1
09:45
09:45
30min
Break & Snacks
Tower Suite 1
10:15
10:15
45min
A Hitchhiker’s Guide to MLOps
Davide Frazzetto

Bringing Machine Learning (ML) applications to a live production phase comes with all the same challenges of traditional software development, and more. Examples include large datasets, tracking data and model quality, experiment reproducibility, and monitoring a live application. This talk is a grounded introduction to monitoring the ML lifecycle with only open source software.

Tower Suite 2
10:15
45min
Solving Real-World Business Problems with Bayesian Modeling
Thomas Wiecki

Digital marketing is chief among the early adopters of Bayesian methods. While many industries are embracing Bayesian modeling as a tool to solve some of the most advanced data science problems, marketing faces unique challenges for which this approach provides elegant solutions. Among these challenges are a decrease in quality data, driven by an increased demand for online privacy and the imminent "death of the cookie", which prohibits online tracking. In addition, as more companies build internal data science teams, there is an increased demand for in-house solutions.

Tower Suite 1
10:15
330min
Unconference Track

Informal sessions led by attendees will be running all day Sunday, 19 June in the Beaufort Room. Unconference sessions are facilitated discussions on topics of interest, impromptu hacking sessions on a specific topic/library, anything that is on topic for the conference but falls outside the formal talks/tutorials tracks -- if you'd like to lead a session, propose your idea to the organisers at the registration desk and reserve time on the schedule.

Beaufort
10:15
45min
Using graph neural networks to embrace the dependency within your data
Usman Zafar

Many machine learning models we use today have the core assumption that our data needs to be tabular, but how often is this truly the case? What if our data points are not independent? By ignoring the potential interrelatedness of our data, do we lose meaningful information that our models cannot leverage? In this talk, we shall explore graph neural networks and highlight how they can solve interesting problems in a way that is intractable when limiting ourselves to using tabular data.

We will look at the limitations of common algorithms and highlight how some clever linear algebra enables us to incorporate more meaningful information into our models. Social network data is a popular example of where relationships are relevant but relationships exist in many types of data where it may not be so obvious. Whether it's e-commerce, logistics or molecular data, relationships within your data likely exist and making use of them can be incredibly powerful.

This talk will hopefully spark your curiosity and provide you with a way of looking at problems from a new angle. It is intended for anyone with an interest in machine learning and will only lightly touch on some technical details.
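A toy, hedged illustration of the linear-algebra idea referred to above: one round of neighbourhood averaging (the core step of a graph convolution) done directly in NumPy on a three-node graph.

```python
import numpy as np

# Adjacency matrix for a tiny 3-node graph (edges 0-1 and 1-2).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # one feature row per node

A_hat = A + np.eye(3)                   # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))
propagated = D_inv @ A_hat @ X          # each node now mixes in its neighbours' features
print(propagated)
```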

Tower Suite 3
11:00
11:00
45min
Beyond medical image segmentation. The road towards clinical insights.
Tomasz Bartczak, Adam Klimont

Recent progress in deep learning for medical imaging has led to impressive results. Among them is a fully automatic human organ segmentation from Computed Tomography (CT) scans. Organ segmentation can be the end goal in itself, e.g. when it is directly viewed by clinical teams. It can also serve as an input to diagnostic aid tools. Moreover, specific knowledge can be extracted out of segmentations to build databases. These databases can then be used for reasoning about the anatomy or planning treatment.

In this talk, we will describe a multi-stage pipeline for processing CT scans for abdominal aortic aneurysm (AAA) treatment planning. We will share our experience in sub-organ multilabel segmentation. We will discuss the challenges with common loss functions, and with metrics not being well aligned with clinical significance. We will show how enhanced segmentation can be used to represent patient anatomy in an accessible way for end-users who plan treatment for new patients.

Tower Suite 1
11:00
45min
Clusterf*ck: A practical guide to Bayesian hierarchical modeling in Pymc3
Hanna van der Vlis

At Apollo Agriculture, a Kenya-based agro-tech startup, one of the challenging problems we face is predicting the yields of Kenyan maize farmers. Like almost all datasets, this dataset has a hierarchical structure: farmers within the same region aren’t independent. By ignoring this fact, a model could predict yields entirely from the region of the farmer but fail to find any other meaningful insights, and we may not even realize it. However, if we “overcorrected,” treating each region as completely separate, each individual analysis could be underpowered. Enter the hero of our story: Bayesian hierarchical modeling. Using a practical example in Pymc3, we’ll follow this hero as they identify and overcome the challenges of clustered datasets.
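A hedged sketch of what such a partially pooled (hierarchical) model can look like in PyMC3; the regions, yields, and priors below are entirely invented for illustration.

```python
import numpy as np
import pymc3 as pm

region_idx = np.array([0, 0, 1, 1, 2, 2])          # which region each farmer belongs to
yields = np.array([2.1, 1.9, 3.0, 3.3, 1.2, 1.5])  # observed yields (toy numbers)

with pm.Model():
    mu_global = pm.Normal("mu_global", mu=2.0, sigma=2.0)       # shared across regions
    sigma_region = pm.HalfNormal("sigma_region", sigma=1.0)
    mu_region = pm.Normal("mu_region", mu=mu_global,
                          sigma=sigma_region, shape=3)           # one mean per region, partially pooled
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=1.0)
    pm.Normal("obs", mu=mu_region[region_idx], sigma=sigma_obs, observed=yields)
    trace = pm.sample()
```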

Tower Suite 3
11:00
45min
Notebooker: Production and Scheduling for your Jupyter Notebooks
Jon Bannister

Notebooker is an open-source web-based mongo-backed application which can help you turn your Jupyter Notebooks into reports which can be parametrised, scheduled, and shared in a few clicks. In this talk, I introduce Notebooker, how it works, and how it can help you.

Tower Suite 2
11:45
11:45
45min
Clean Architecture: How to structure your ML projects to reduce technical debt
Laszlo Sragner

Software engineering principles are frequently mentioned as a solution to data science's productivity problem. Unfortunately, they are rarely presented in a comprehensive enough format to be actionable or adoptable for data-intensive use.

In this talk, I will present a framework that enables practitioners to structure their projects and manage changes throughout the product lifecycle at low effort.

The audience will also learn about a minimum set of programming concepts to make this a reality.

The key takeaway for any Data Scientist is that you don't need to be a master programmer to start taking care of your own codebase.

Tower Suite 2
11:45
45min
Don't Stop 'til You Get Enough - Hypothesis Testing Stop Criterion with “Precision Is The Goal”
Eyal Kazin איל קאזין

In hypothesis testing the stopping criterion for data collection is a non-trivial question that puzzles many analysts. This is especially true with sequential testing, where demands for quick results may lead to biased ones. I show how the belief that Bayesian approaches magically resolve this issue is misleading and how to obtain reliable outcomes by focusing on sample precision as a goal.

Tower Suite 3
11:45
45min
Extreme Multilabel Classification in the Biomedical NLP domain
Nick Sorros

Extreme multilabel classification refers to cases where the prediction space of a multilabel classifier is in the thousands to millions of labels, an order of magnitude more than in typical problems. The scale of such problems brings unique challenges that one has to work around, such as memory, model size, and training and inference time. This talk will discuss 1) how you can overcome those challenges, 2) relevant state-of-the-art architectures for this problem, and 3) learnings from the development of a transformer-based NLP model to tag biomedical grants with 29K MeSH tags.

Tower Suite 1
12:30
12:30
60min
Lunch
Tower Suite 1
12:30
60min
Lunch
Tower Suite 2
12:30
60min
Lunch
Tower Suite 3
13:30
13:30
45min
AUC is worthless: Lessons in transitioning from academic to business data science
Dillon Gardner

New data scientists often struggle to make major impacts on solving business problems despite impressive technical skills. A core challenge is the gap between how academics think about the performance of models and what matters for a company. As an example, academic work summarizes a model’s receiver operating characteristic (ROC) curve with the area under the curve (AUC). This summary statistic is useless for business applications, which will always have unique trade-offs and constraints. Effective approaches to optimizing model performance require understanding the specific business requirements and how to map them to a well-framed data science problem.

In this talk, I will go through a framework of how to think effectively about model trade-offs in terms of maximizing business utility. Through this exercise, we will build intuition for what is required for a model in production to be a success and how to collaborate more effectively with non-technical co-workers.
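A toy, hedged illustration of the argument: rather than summarising the ROC curve with AUC, score candidate thresholds with a business utility that prices each outcome (the labels, scores, and costs below are invented).

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])

# Hypothetical business assumptions: a caught positive is worth 100, a false alarm costs 20.
value_tp, cost_fp = 100.0, 20.0

for threshold in (0.3, 0.5, 0.7):
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    utility = tp * value_tp - fp * cost_fp
    print(threshold, utility)   # pick the threshold that maximises business utility
```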

Tower Suite 3
13:30
45min
Accelerating High-Performance Machine Learning with HuggingFace, Optimum & Seldon
Alejandro Saucedo

Identifying the right tools for high-performance production machine learning can be overwhelming as the ecosystem continues to grow at break-neck speed. In this session we showcase how practitioners can productionise ML models in scalable ecosystems in an optimised way without having to deal with the underlying infrastructure challenges. We will be taking a GPT-2 HuggingFace model, optimizing it with ONNX and deploying it to MLServer at scale using Seldon.

Tower Suite 1
13:30
45min
Lessons Learned About Data & AI at Enterprises and SMEs
Alexander Hendorf

All one needs is strategy, skill and resources to make digitalization and AI happen. So why is everything taking so long? Shouldn’t you all be finished yesterday already? An honest talk about how to address the complexity of making data and AI happen in enterprises.

Tower Suite 2
14:15
14:15
45min
Building Successful Data Science Projects
Ian Ozsvald

Your data science projects haven't worked out so well - maybe you didn't have a plan, you suffered from surprising unknowns or you couldn't deliver what someone else promised. I'll share both some painful past experiences and explain choices that will increase your success. I'll root this in a recently shipped solution worth $1 million for a client.

Tower Suite 3
14:15
45min
Feature engineering for time series forecasting
Kishan Manani

To use our favourite supervised learning models for time series forecasting we first have to convert time series data into a tabular dataset of features and a target variable. In this talk we’ll discuss all the tips, tricks, and pitfalls in transforming time series data into tabular data for forecasting.
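A hedged sketch of the basic transformation described: turning a series into a tabular (features, target) dataset with lag and rolling features, shifted so they only use past values (the series itself is invented).

```python
import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148],
                   index=pd.date_range("2022-01-01", periods=8, freq="D"),
                   name="sales")

tabular = pd.DataFrame({
    "lag_1": series.shift(1),                             # yesterday's value
    "lag_7": series.shift(7),                             # value one week ago
    "rolling_mean_3": series.shift(1).rolling(3).mean(),  # shifted to avoid target leakage
    "target": series,
}).dropna()
print(tabular)
```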

Tower Suite 2
14:15
45min
What is X up to? - NER and Relationship Extraction for Information Extraction
Ahmet Melek

Dealing with unstructured text to obtain information is one of the biggest aims in the field of natural language processing. In this talk, we will be demoing a solution where we have unstructured text on a particular topic, and we apply named entity recognition, together with relationship extraction, to extract structured data. We will be introducing our data source, the models that we use, and will be inspecting the end results, viewing particular statistics, and hovering over a graph, extracted from the raw text.
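A hedged sketch of the named-entity-recognition step using spaCy (the relationship-extraction step is model-specific and not shown here); it assumes the small English model has been downloaded with `python -m spacy download en_core_web_sm`, and the sentence is invented.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp acquired Widget Ltd for $2 billion in London last year.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. organisations, money amounts, locations, dates
```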

Tower Suite 1
15:00
15:00
45min
Rethinking Data Visualisation with PyScript
Valerio Maggio

PyScript leverages the web browser to act as a ubiquitous virtual machine to deliver unprecedented Data Science use cases. Data Visualisation is the first and perhaps the most straightforward context in which PyScript can have its say. In this talk, we will present how PyScript can change the way data visualisation apps are designed and delivered for complex data science use cases.

Tower Suite 1
15:00
45min
Signature methods for time series data
Sam Morley

Signatures are a mathematical tool that arise in the study of paths. Roughly speaking, they capture the fine structure of a path. It turns out that signatures are extremely useful for analysing time series data in a data science context. This is partly because they can take irregularly sampled, highly oscillatory data and produce a single array of values of fixed size, which can then be used as features in predictors etc. In this talk I will give a brief introduction to signatures and a brief demonstration of how you can use them to analyse time series data. No mathematical background will be assumed.

Tower Suite 2
15:00
45min
Why do I need to know Python? I'm a pandas user…
James Powell

You use pandas every day. You know every keyword argument on every function, even .melt! You even know whether it's .rename, .rename_axis, or .set_axis that you want—and you get it right on the first try! So why bother learning Python? Sure, pandas is written in it, but outside of assembling parts of the pandas API, what's there that has any value in your life?

Tower Suite 3
15:45
15:45
15min
Break & Snacks
Tower Suite 1
16:00
16:00
45min
Keynote: Congrats on making it through the conference! And other lighthearted thoughts on conversing about data
Dr. Susan Mulcahy

Congrats on making it to the end of the 3-day conference! Let's wrap up the wealth of knowledge you’ve gained with thoughts on the importance of sharing your technical knowledge outside this community as well. By bringing more of us into the conversation on data, we can represent an array of colleagues and specialisms, not all of which are technical. We’ll look at some examples of breaking down your message to its essential components in order to bridge the gap of differing specialisms. And of course, with some lighthearted points woven into the talk for good measure.

Tower Suite 1
16:45
16:45
60min
Lightning Talks

Lightning talks will take place at the end of the day on Saturday and Sunday in the plenary room. They will be 5-minute talks on any topic of interest for the PyData community. Sign-ups will be at the registration table. If we get more sign-ups than allotted spots we will randomly select talks out of a hat. We encourage anyone interested to sign up for a 5-minute slot.

Tower Suite 1
17:45
17:45
15min
Closing Notes
Tower Suite 1