PyData London 2022

Feature Engineering Made Simple
06-17, 11:00–12:30 (Europe/London), Tower Suite 2

Of all the choices made by data scientists in the course of building and operating models, feature engineering and selection are among the most critical. Features have a substantive impact on a model’s quality, including its predictive accuracy and resilience. Unfortunately, as most ML scientists and practitioners are aware, feature engineering is more art than science. It is ad-hoc, messy, error-prone and ends up consuming 70-80% of the time and effort when building models, often resulting in sub-optimal feature selection and, in turn, low-quality models. In this tutorial, we will introduce new ways of performing feature engineering, turning it into a systematic, procedural and scalable process, which is substantively more efficient than how it occurs currently. Participants will perform a hands-on, end-to-end, feature building exercise, with particular emphasis on feature engineering using Anovos (https://anovos.ai/ or https://github.com/anovos/anovos).


Of all the choices made by data scientists in the course of building and operating models, feature selection is one of the most critical. Features have a substantive impact on a model’s quality, including its predictive accuracy and resilience. Therefore, feature engineering is one of the most important components of the Machine Learning workflow.

Unfortunately, as most ML scientists and practitioners are aware, feature engineering is more art than science. It is ad-hoc, messy, terribly error-prone and ends up consuming 70-80% of the effort and time when building models, often resulting in sub-optimal feature selection and, in turn, low-quality models.

While there are a host of tools, mostly open-source, that help with parts of the feature engineering process, in particular exploratory data analysis (EDA), their impact is modest for three reasons:

1. The biggest problem in feature engineering is task orchestration: methodically performing a sequence of steps that leads to a set of “good”, model-ready features. Existing tools, such as pandas-based packages, support individual tasks (e.g., outlier detection), but systematic orchestration is still left entirely to the modeller and usually results in an ad-hoc, trial-and-error workflow.
2. A few key problems in feature engineering have no packaged solutions at all. One such problem is “cold-start”: when starting to select candidate features, what should the modeller do? The space of possible features for a given problem is usually very large, so a small subset needs to be identified for investigation, and a suboptimal choice of candidates is usually very detrimental. This is one of the hardest issues in feature engineering.
3. Finally, virtually every open-source library is scale-challenged, performing its computation in memory on a single thread. When the base data is large, these libraries are simply impractical to use.
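
To make the orchestration problem concrete, here is a minimal sketch of what a fixed, ordered sequence of feature-preparation steps might look like in plain pandas. It is an illustration only: the step functions, thresholds, and sample data are hypothetical and are not taken from Anovos or from the tutorial material.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing (assumed threshold)."""
    return df.loc[:, df.isna().mean() <= max_missing]


def impute_median(df):
    """Fill missing numeric values with each column's median."""
    numeric = df.select_dtypes("number").columns
    return df.assign(**{c: df[c].fillna(df[c].median()) for c in numeric})


def clip_outliers_iqr(df, k=1.5):
    """Clip numeric values to [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    out = df.copy()
    for c in out.select_dtypes("number").columns:
        q1, q3 = out[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[c] = out[c].clip(q1 - k * iqr, q3 + k * iqr)
    return out


# The order of steps is explicit and lives in one place, instead of being
# scattered across notebook cells.
PIPELINE = [drop_sparse_columns, impute_median, clip_outliers_iqr]


def run_pipeline(df, steps=PIPELINE):
    """Apply each step in order and record the resulting shape for auditing."""
    log = []
    for step in steps:
        df = step(df)
        log.append((step.__name__, df.shape))
    return df, log


# Hypothetical sample data: one extreme value and one mostly empty column.
raw = pd.DataFrame({
    "age": [25, 31, None, 40, 38, 29, 33, 400],
    "income": [30, 42, 38, 55, 47, 41, None, 39],
    "mostly_empty": [None, None, None, None, None, 1.0, None, None],
})

features, log = run_pipeline(raw)
for name, shape in log:
    print(name, shape)
```

Even in this small form, the sketch still runs in memory on a single machine, so it illustrates the first limitation above without addressing the third.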

In this tutorial, we will introduce new ways of performing feature engineering, turning it into a systematic, procedural and scalable process, which is substantively more efficient than how it occurs currently. Participants will perform a hands-on, end-to-end, model building exercise, with particular emphasis on feature engineering using Anovos (https://anovos.ai/ or https://github.com/anovos/anovos).

Anovos is a fast-growing open-source library built by data scientists at Mobilewalla with years of experience in applying Machine Learning techniques to some of the most extensive consumer data sets available. By rethinking ingestion and transformation, and including deeper analytics, drift identification, and stability analysis, Anovos aims to improve productivity and help data scientists build more resilient, higher-performing models. In addition, it automatically produces easily interpretable, professional data reports that help users understand the nature of the data at first sight and further enable data scientists to identify and engineer features.


Prior Knowledge Expected

Previous knowledge expected

Head of Data Science, Mobilewalla

Kajanan Sangaralingam manages the Data Science and AI function at Mobilewalla. He is passionate about solving real business problems using innovative AI/machine learning approaches. Prior to Mobilewalla, Kajanan worked as a Senior Data Scientist at Singapore Telecommunications where he honed his skills processing and analyzing large volumes of structured and unstructured data. He earned his Ph.D. at the National University of Singapore and his Bachelor of Science in Information Technology degree at the University of Moratuwa, Sri Lanka. His early work experience included many roles as a Senior Software Engineer and Software Engineer at companies in various industries.

Founder & CEO, Mobilewalla
Anindya Datta is a leading technologist and innovator with core contributions in best-in-class large-scale data management solutions, artificial intelligence, and internet technologies. As Founder, CEO, and Chairman of Mobilewalla, Anindya has combined the industry’s most robust data set with deep artificial intelligence and data science expertise to help enterprises build high-performing, resilient predictive models.

Prior to Mobilewalla, Anindya founded Chutney Technologies, which was acquired by Cisco Systems in 2005. He has been on the faculties of major research universities and institutes in the United States and abroad, including the Georgia Institute of Technology, University of Arizona, National University of Singapore, and Bell Laboratories. Anindya obtained his undergraduate degree from the Indian Institute of Technology (IIT) Kharagpur, and his MS and Ph.D. degrees from the University of Maryland, College Park, USA.