PyData London 2022

Extreme Multilabel Classification in the Biomedical NLP domain
06-19, 11:45–12:30 (Europe/London), Tower Suite 1

Extreme multilabel classification refers to cases where the prediction space of a multilabel classifier spans thousands to millions of labels, orders of magnitude more than typical problems. The scale of such problems brings unique challenges that one has to work around, such as memory, model size, and training and inference time. This talk will discuss 1) how you can overcome those challenges, 2) relevant state-of-the-art architectures for this problem, and 3) learnings from developing a transformer-based NLP model to tag biomedical grants with 29K MeSH tags


Extreme multilabel classification refers to cases where the prediction space of a multilabel classifier spans thousands to millions of labels, orders of magnitude more than typical problems. For example, each Wikipedia article is tagged with more than one label (hence "multilabel"), and there are millions of potential labels (hence "extreme"). Another notable example is Amazon products. In our case, we were developing a classification scheme for biomedical grants for a large biomedical funder, based on the Medical Subject Headings (MeSH), which consist of around 29K tags.

The scale of such problems brings unique challenges that one has to work around. The first challenge is memory, since the size needed to represent the data often surpasses even large instances in the cloud. Then comes model size, which tends to be quite large, mainly due to large vocabulary sizes but also the capacity needed to perform well in such large output spaces. Lastly, training and inference tend to require multi-CPU or multi-GPU instances depending on the model. In particular, inference time can be difficult to reduce to near real time because of the large number of labels the models need to consider.
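One generic way to cope with larger-than-memory data (a minimal sketch, not necessarily the exact approach covered in the talk) is to stream records in fixed-size batches with a generator, so that only one batch is resident in memory at a time:

```python
import itertools

def iter_batches(records, batch_size):
    """Yield lists of at most batch_size items from any iterable,
    keeping only one batch in memory at a time."""
    it = iter(records)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: stream five records in batches of two.
stream = (f"grant-{i}" for i in range(5))
batches = list(iter_batches(stream, 2))
```

The same pattern works whether the records come from a file, a database cursor, or a cloud storage stream, which is what makes it useful when the full dataset does not fit on a single instance.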

Over the course of two years, we experimented with a number of different approaches to the problem. We developed a custom neural network architecture, inspired by BERT, spaCy and prior work, which performed really well for a subset of the MeSH hierarchy. We also scaled to all 29K tags using both an extremely fast linear model from Amazon called XLinear and a transformer-based architecture inspired by research, called BertMesh (paper), which has close to state-of-the-art performance. The linear model is currently in production tagging grants, while the latter is uploaded to the Hugging Face Hub and is free for everyone to use.
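To illustrate why linear models can stay fast even with tens of thousands of labels (a toy sketch using hypothetical dict-based sparse vectors, not the actual XLinear API), inference reduces to sparse dot products over the label weights followed by a top-k selection on the scores:

```python
import heapq

def predict_top_k(doc_features, label_weights, k=2):
    """Score each label with a sparse dot product and keep the k best.
    doc_features: {feature: value}; label_weights: {label: {feature: weight}}."""
    scores = {
        label: sum(doc_features.get(f, 0.0) * w for f, w in weights.items())
        for label, weights in label_weights.items()
    }
    return [label for label, _ in
            heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])]

# Toy example with three MeSH-like tags (made up for illustration).
weights = {
    "Neoplasms": {"tumour": 2.0, "cell": 0.5},
    "Genomics": {"genome": 2.0, "cell": 0.3},
    "Malaria": {"parasite": 2.0},
}
tags = predict_top_k({"tumour": 1.0, "cell": 1.0}, weights, k=2)
```

Real implementations additionally organise the labels into a tree so that only a shortlist of labels is ever scored, which is a large part of how they achieve near real-time inference at this scale.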

In this talk you will learn:

  • Ways to work with larger-than-memory data with certain characteristics,
  • An easy approach to reduce model size without hurting performance,
  • Some techniques to speed up training and inference time,
  • State-of-the-art architectures in the area of extreme multilabel classification,
  • The learnings from working on this problem for the last two years.
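On the model-size point above, one common and easy technique (an illustrative assumption, not necessarily the one presented in the talk) is pruning the vocabulary: dropping tokens that appear in very few documents shrinks the model's feature or embedding matrices while typically having little effect on performance:

```python
from collections import Counter

def prune_vocab(docs, min_df=2):
    """Keep only tokens whose document frequency is at least min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each token once per doc
    return {tok for tok, count in df.items() if count >= min_df}

# Toy corpus: only "genome" appears in at least two documents.
docs = [
    "malaria parasite genome",
    "tumour cell genome",
    "rare token appears once",
]
vocab = prune_vocab(docs, min_df=2)
```

Since model size in this setting is often dominated by vocabulary-sized weight matrices, cutting rare tokens translates almost directly into a smaller model.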

Prior Knowledge Expected

Previous knowledge expected

Nick has been working as a data scientist for the last 10 years. Before setting up MantisNLP, he worked for the Wellcome Trust, where he set up and led the data science team. Prior to that, he worked for a couple of startups at different stages of maturity, from a few to dozens of employees, in sectors such as fintech and social networks. Before data science, Nick studied and did research at Imperial College.

During his years in industry, Nick found himself working more and more on NLP problems, from detecting the language of tweets and identifying which entrepreneur statements were factual to tagging grants with thousands of labels and finding references in policy documents. This led him to create MantisNLP, a data science consultancy focused on NLP, with a remote-first culture and clients worldwide.