PyData London 2022

Fuzzy Matching at Scale
06-18, 10:15–11:00 (Europe/London), Tower Suite 3

Fuzzy Matching is a useful tool that has been well discussed. However, popular methods based on edit distances, such as Levenshtein or Jaro-Winkler, have failed to keep up with increasing data sizes. This talk will walk you through modern methods based on character-based n-grams, vector space models, and approximate nearest neighbours for Fuzzy Matching at Scale.


Have you ever used fuzzywuzzy and waited forever for your results? This talk will propose an alternate implementation of Fuzzy Matching based on the following methods:

  • character-based n-grams that break up a search term into tokens of length n
  • vector space models like TF-IDF, word embeddings like GloVe, or contextual embeddings like BERT
  • approximate nearest neighbours to speed up nearest-neighbour search
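
To make the three ingredients above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation presented in the talk: the function names (char_ngrams, fit_idf, tfidf_vector, cosine, best_match) are hypothetical, the IDF uses a common smoothed formula, and the last step is an exact brute-force scan standing in for an approximate nearest-neighbour index.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (padded with spaces)."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute a smoothed inverse document frequency for every n-gram."""
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(char_ngrams(text, n)))
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build an L2-normalised sparse TF-IDF vector (n-gram -> weight)."""
    counts = Counter(char_ngrams(text, n))
    # n-grams never seen in the corpus are ignored
    vec = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two normalised sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def best_match(query, corpus, n=3):
    """Return (score, candidate) for the most similar corpus entry.

    Exact brute-force scan: this is the step an approximate
    nearest-neighbour index would replace on large datasets.
    """
    idf = fit_idf(corpus, n)
    q = tfidf_vector(query, idf, n)
    scored = [(cosine(q, tfidf_vector(c, idf, n)), c) for c in corpus]
    return max(scored)


companies = ["Apple Inc.", "Microsoft Corporation", "Alphabet Inc.", "Amazon.com, Inc."]
score, match = best_match("Microsft Corp", companies)
```

Unlike an edit-distance comparison, each string is vectorised once and every comparison is a sparse dot product. The scan in best_match still touches every candidate, which is why the third ingredient, an approximate index over the same vectors, matters at scale.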

This talk is designed for an audience with intermediate knowledge of string algorithms and concepts in NLP like word embeddings. If that sounds like you, and you are tired of waiting for fuzzywuzzy, this is the talk for you!


Prior Knowledge Expected

Previous knowledge expected