PyData London 2022

Document/sentence similarity solution using open source NLP libraries, frameworks and datasets
06-17, 15:30–17:00 (Europe/London), Tower Suite 3

The need to develop robust document/text similarity measures is an essential step in building applications such as Recommendation Systems, Search Engines and Information Retrieval Systems, as well as other ML/AI applications such as News Aggregators or Automated Recruitment systems used to match CVs to job specifications. In general, text similarity is the measure of how lexically and semantically close words/tokens, tweets, phrases, sentences, paragraphs and entire documents are to each other. Texts/words are lexically similar if they have similar character sequences or structure, and semantically similar if they have the same meaning, describe similar concepts and are used in the same context.

This tutorial will demonstrate a number of strategies for feature extraction, i.e. transforming documents into numeric feature vectors. This transformation step is a prerequisite for computing the similarity between documents. Typically, each strategy involves four steps: 1) the use of standard natural language pre-processing techniques to prepare/clean the documents, 2) the transformation of the document text into numeric vectors/embeddings, 3) calculation of document similarity using metrics such as Cosine, Euclidean and Jaccard, and 4) validation of the findings. A minimal end-to-end sketch of this recipe follows.
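
By way of orientation, here is a minimal sketch of the four-step recipe, using plain lower-casing for step 1, TF-IDF for step 2 and cosine similarity for step 3. The tutorial itself explores richer options for every step, so treat this as an outline rather than the tutorial's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",
    "Stock markets rallied sharply today.",
]

# Step 1: light cleaning (the tutorial uses NLTK for a fuller treatment)
cleaned = [d.lower() for d in docs]

# Step 2: transform documents into numeric feature vectors
vectors = TfidfVectorizer(stop_words="english").fit_transform(cleaned)

# Step 3: pairwise similarity matrix, one row/column per document
similarity = cosine_similarity(vectors)

# Step 4: eyeball the result -- docs 0 and 1 share no content words, so
# lexical TF-IDF scores them near zero even though they are semantically close
print(similarity.round(2))
```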


Strategies and associated ML/NLP libraries that will be presented during the tutorial include:

1) Text/document pre-processing:
- Document pre-processing using NLTK library
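
As a taste of this step, a minimal NLTK pre-processing sketch (lower-casing, tokenization, stop-word and punctuation removal); the exact cleaning pipeline in the tutorial may differ:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-off downloads of the NLTK resources used below
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess(document):
    """Lower-case, tokenize and strip stop words and punctuation."""
    tokens = word_tokenize(document.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]

print(preprocess("The quick brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```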

2) Feature extraction – word/sentence embedding:
- Term frequency – Inverse Document Frequency (TF-IDF) with the aid of the Scikit-Learn library
- Pre-trained GloVe embeddings with the aid of the FSE/Gensim libraries
- A pre-trained GloVe embedding re-weighted with Smooth Inverse Frequency (SIF), trained from scratch with the aid of the FSE library
- Sentence embedding using Google’s pre-trained Universal Sentence Encoder (USE)
- Sentence embedding using BERT via the sentence-transformers library (see the sketch after this list)
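
As an illustration of the last strategy (the others follow a similar encode-then-compare pattern), a minimal sentence-transformers sketch; the model name `all-MiniLM-L6-v2` is an illustrative choice, not necessarily the one used in the tutorial:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",
]

# Downloads the model on first use; illustrative model choice
model = SentenceTransformer("all-MiniLM-L6-v2")

# One dense, fixed-size vector per sentence (384 dimensions for this model)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
```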

3) Similarity computations will be done using the Scikit-Learn pairwise metrics module (a minimal example follows the link below):
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise
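
A minimal sketch of the pairwise computations, using toy 3-dimensional "embeddings" in place of the real vectors produced in the feature-extraction step:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Toy embeddings; in the tutorial these come from the feature-extraction step
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

# Both calls return a (3, 3) matrix with one entry per document pair
print(cosine_similarity(vectors))    # higher = more similar
print(euclidean_distances(vectors))  # lower = more similar
```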

4) Visualization/Validation of findings:
- Dimensionality reduction using techniques such as PCA, t-SNE and MDS with the aid of the Scikit-Learn library, and UMAP via the umap-learn library
- Visualization via scatter plots and heatmaps with the aid of the Matplotlib and Seaborn libraries (sketched below)
- Validation/comparison of the findings
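
A minimal sketch of this step, using random vectors as a stand-in for the real document embeddings produced earlier:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 384))  # stand-in for document embeddings

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of the embeddings reduced to 2-D with PCA
points_2d = PCA(n_components=2).fit_transform(vectors)
ax1.scatter(points_2d[:, 0], points_2d[:, 1])
ax1.set_title("PCA projection")

# Heatmap of the pairwise cosine similarities
sns.heatmap(cosine_similarity(vectors), ax=ax2, cmap="viridis")
ax2.set_title("Cosine similarity")

plt.tight_layout()
plt.show()
```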

5) Datasets used include:
- A simple dataset of book titles sourced from: https://raw.githubusercontent.com/noahjett/Movie-Goodreads-Analysis/master/books.csv
- The classic 20 Newsgroups data sourced from the Scikit-Learn dataset module (loaded as sketched after this list): https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
- The STS benchmark dataset located here: http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz; further details on this benchmark can be found here: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
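
For example, the 20 Newsgroups corpus can be pulled straight from Scikit-Learn; the `remove` argument strips metadata so that similarity is computed over the message bodies only:

```python
from sklearn.datasets import fetch_20newsgroups

# Downloads and caches the corpus on first use
newsgroups = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes")
)
print(len(newsgroups.data))         # 11314 documents in the training split
print(newsgroups.target_names[:3])  # first few of the 20 category names
```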

Work-in-progress code of the tutorial can be sourced here: https://github.com/aidowu1/Ades-NLP-Recepies


Prior Knowledge Expected

Previous knowledge expected