Understanding and Comparing Latent Space Characteristics of Multi-Modal Models

Exploring the Latent Space of CLIP-like Models Using Inter-Modal Pairs

Under Review: This paper is under review on the experimental track of the Journal of Visualization and Interaction. See the reviews and issues for this paper.

Abstract

Introduction
Multi-modal contrastive learning models are trained to map data from two or more modalities to a shared embedding space. This latent data representation can then be used for zero- or few-shot classification, cross-modal data retrieval, or generation tasks. Although remarkable results have been reported when testing multi-modal models on these tasks, understanding the latent representations remains challenging. In particular, many multi-modal models exhibit a phenomenon called the “modality gap”, leading to a latent space that cleanly separates the modalities.

Conclusion
This article introduces and compares three models trained on image-text pairs. We use these models and interactive visualizations to explain where the modality gap arises from, how it can be closed, and why closing it is important. In the second part, we introduce “Amumo”, a framework we implemented for analyzing multi-modal models. We describe various analysis tasks that can be performed with Amumo. In particular, Amumo can be used for (i) analyzing models, (ii) comparing models with each other, and (iii) analyzing multi-modal datasets. We demonstrate Amumo’s capabilities and generalizability using image, text, audio, and molecule data in combination with several different models.

Implementation
For smooth integration into research workflows, we implemented Amumo as a Python package with Jupyter widgets. We implemented the interactive visualizations in this article with JavaScript and plotly.js.

Demonstration & Materials
A minimal usage demonstration of Amumo is deployed with MyBinder. We also provide a demonstration of analyzing CLOOME with Amumo. The code for the Amumo Python package and guidelines on how to use it can be found in the GitHub repository or as an archived version on OSF.

Introduction

Contrastive Language Image Pre-training (CLIP) and variations of this approach, like CyCLIP or CLOOB, are trained on image-text pairs with a contrastive objective. The goal of contrastive loss objectives is to minimize latent-space distances of data points that have the same underlying meaning. We refer to the particular cases of contrastive learning that CLIP-like models perform as multi-modal contrastive learning because they use two (or more) modes of data (e.g., images and texts) where each mode uses their own encoder to generate a latent embedding space. More specifically, the objective that CLIP is optimized for minimizes the distances between image-text embeddings of pairs that have the same semantic meaning while maximizing the distances to all other combinations of text and image embeddings. We would expect that such a shared latent space places similar concepts of images and texts close to each other, as demonstrated in the following sketch. However, the reality is a bit more complicated.

Example of how we imagined a two-dimensional projection of CLIP's image and text embeddings. Image and text points are shown in one scatter plot, and instances that are semantically similar are plotted close together.

CLIP and its Modality Gap

Despite the clear objective that is supposed to bring texts and images into a shared embedding space, there is a phenomenon called the "modality gap": embeddings of different modalities lie in their own subspaces of the embedding space. The example below visualizes the modality gap between the image and text embeddings of CLIP (we use the official CLIP implementation by OpenAI, https://github.com/openai/CLIP, with the RN50 image encoder) for a subset of 100 randomly selected images from the MSCOCO validation set. The image and text embeddings are projected to a 2-dimensional space and visualized in a scatter plot. We use a gray line to connect image-text pairs that belong together.

The use of dimensionality reduction methods to compute a 2-dimensional view of the data clearly shows the separation between the two modalities. However, dimensionality reduction comes hand in hand with a loss of information and possible distortion of data. We propose a different way to visualize the modality gap. Similarity heatmaps are a simple yet effective way of visualizing latent space embeddings that also helped us to better understand the modality gap. Using this visualization also allowed us to gain interesting insights unrelated to the modality gap, which we could not have found with the scatter plot visualization alone.

The similarity heatmap below shows the same subset of 100 image-text pairs previously shown in a scatter plot. However, in this case, we show the cosine similarities (note that the cosine similarity yields values in the interval [-1; 1]) calculated between all image and text embeddings. This results in a matrix with four quadrants:

top-left: in-modal similarities of the 100 image embeddings,
bottom-right: in-modal similarities of the 100 text embeddings,
top-right: cross-modal similarities between the 100 image embeddings and the 100 text embeddings,
bottom-left: a transposed version of the cross-modal image-text similarities.

The diagonal axis of each quadrant represents the matching data points (i.e., for the in-modal similarities, the diagonal shows the similarities of the image or text embedding to itself, while for the cross-modal similarities, the diagonal shows the matching image-text pairs). The modality gap is immediately visible: the in-modal similarities are much higher overall compared to the cross-modal similarities.
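The quadrant layout described above can be reproduced in a few lines. The following is a minimal sketch, not Amumo's implementation: it uses NumPy with random placeholder vectors in place of real CLIP embeddings, and the function name `joint_similarity_matrix` is hypothetical.

```python
import numpy as np

def joint_similarity_matrix(image_emb, text_emb):
    """Return the (2n x 2n) cosine-similarity matrix of stacked image and text embeddings."""
    # L2-normalize each row so that dot products equal cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    stacked = np.concatenate([image_emb, text_emb], axis=0)
    return stacked @ stacked.T

# Placeholder data (assumption): 100 random "image" and "text" vectors.
rng = np.random.default_rng(0)
n, d = 100, 16
images = rng.normal(size=(n, d))
texts = rng.normal(size=(n, d))
sim = joint_similarity_matrix(images, texts)

image_image = sim[:n, :n]  # top-left: in-modal image similarities
image_text = sim[:n, n:]   # top-right: cross-modal similarities
text_image = sim[n:, :n]   # bottom-left: transpose of the top-right quadrant
text_text = sim[n:, n:]    # bottom-right: in-modal text similarities
```

With real embeddings, a modality gap would show up as `image_image` and `text_text` having clearly higher average values than `image_text`.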

Now, where does this modality gap even come from?

As analyzed by Liang et al., the modality gap already appears before models are trained, possibly caused by random weight initialization and different model architectures. The authors also highlight that the gap between modalities persists throughout training, which means that CLIP's objective function cannot overcome this phenomenon. A reason for that could be that the objective function only trains on the alignment of (non-)matching image-text combinations but does not contain any regularization terms with regard to the overall layout of the embedding spaces and in-modality alignment. Liang et al. also formally define the modality gap as the Euclidean difference between the centers of each modality:

\delta_{gap} = \frac{1}{n} \sum_{i=1}^n{x_i} - \frac{1}{n} \sum_{i=1}^n{y_i}

where x_i and y_i are the normalized image and text embedding vectors.
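As a minimal sketch of this definition, the gap vector can be computed with NumPy as the difference of the two modality centers after normalization. The function names and the toy vectors are assumptions made for illustration only.

```python
import numpy as np

def normalize(emb):
    """Scale every row to unit length."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def modality_gap(image_emb, text_emb):
    """Difference between the centers of the normalized image and text embeddings."""
    x = normalize(image_emb)
    y = normalize(text_emb)
    return x.mean(axis=0) - y.mean(axis=0)

# Toy data (assumption): images point along the positive axes, texts along the negative axes.
image_emb = np.array([[2.0, 0.0], [0.0, 3.0]])
text_emb = np.array([[-1.0, 0.0], [0.0, -4.0]])

gap_vector = modality_gap(image_emb, text_emb)
gap_distance = np.linalg.norm(gap_vector)  # Euclidean length of the gap vector
```

The length of the gap vector is the "modality distance" between the two embedding centers.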

They also experimented with manually reducing the modality gap by moving the embeddings closer together along the gap vector.

However, since the modality subspaces trained by CLIP are not symmetric to each other (Goel et al. argue that the CLIP objective does, in fact, symmetrize the spaces in its optimal solution; in practice, however, this ideal scenario does not happen), a modification of the embeddings destroys the complex relationship between images and texts that was derived during training. This naturally results in an increasing loss when changing the distance between modalities that was originally trained, as shown in the visualization below. The x-axis shows the Euclidean distance between the two embedding centers (the black dashed line indicates the original distance). The y-axis shows the contrastive loss that results when moving the two embeddings closer together or further away from each other. We calculated these values with the entire MSCOCO validation set of 5000 samples. The loss reaches its global minimum at the original (trained) modality gap.
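The intervention can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the plot: embeddings are shifted along the gap vector by a factor `lam` and re-normalized, and the loss is a symmetric InfoNCE-style contrastive loss with an assumed `logit_scale` of 100. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def normalize(emb):
    """Scale every row to unit length."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def shift_along_gap(image_emb, text_emb, lam):
    """Move both modalities towards each other along the gap vector, then re-normalize."""
    x, y = normalize(image_emb), normalize(text_emb)
    gap = x.mean(axis=0) - y.mean(axis=0)
    return normalize(x - lam * gap), normalize(y + lam * gap)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(x, y, logit_scale=100.0):
    """Symmetric InfoNCE-style loss; matching pairs share the same row index."""
    logits = logit_scale * x @ y.T
    idx = np.arange(len(x))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

# Synthetic pairs (assumption): shared content plus an opposite offset per modality.
rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 8))
offset = np.zeros(8)
offset[0] = 2.0
image_emb = shared + offset
text_emb = shared + 0.1 * rng.normal(size=(50, 8)) - offset

for lam in (0.0, 0.25, 0.5):
    x, y = shift_along_gap(image_emb, text_emb, lam)
    distance = np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))
    print(f"lam={lam:.2f}  distance={distance:.3f}  loss={contrastive_loss(x, y):.3f}")
```

With `lam = 0.5`, both centers meet halfway before re-normalization; because the shifted vectors are projected back onto the unit sphere, the remaining distance is small but generally not exactly zero.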

Now, what do the similarity heatmap and scatter plot look like when we manually close the gap? As shown in the visualization below, the similarity matrix seems more homogeneous (i.e., in-modal similarities and cross-modal similarities are on a similar level), and points in the scatter plot are closer together. However, the scatter plot also shows that the edges between image-text pairs are still long, and the text embeddings concentrate more on the center compared to the image embeddings.

What About Other CLIP-Like Models?

As mentioned previously, the two modalities (i.e., texts and images) live on different embedding spaces, and the two embeddings vary in structure (i.e., they are not symmetric to each other). Recently published papers propose different versions of CLIP, where the objective function has been adjusted to regularize the trained embedding space. We look closer into two promising approaches, namely, CyCLIP and CLOOB.

CyCLIP

According to Goel et al., using CLIP-generated image-text embeddings interchangeably (e.g., for language-guided image generation) is suboptimal because the embeddings are, in practice, not aligned. They propose an augmentation of the InfoNCE loss used to optimize CLIP that adds two regularization terms that enforce the two embedding spaces to be symmetric: one for in-modal symmetry (L_I) and one for cross-modal symmetry (L_C) of the similarities.

L_{CyCLIP} = L_{InfoNCE} + L_I + L_C
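To make the two regularizers concrete, here is a minimal NumPy sketch of how such symmetry terms can be computed from a batch of embeddings. It reflects our reading of the description above (squared differences of similarity matrices); the normalization by the batch size, the function name, and the toy data are assumptions, and any weighting factors between the terms are omitted.

```python
import numpy as np

def normalize(emb):
    """Scale every row to unit length."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def symmetry_terms(image_emb, text_emb):
    """Return (in-modal term L_I, cross-modal term L_C) for a batch of pairs."""
    x, y = normalize(image_emb), normalize(text_emb)
    sim_ii = x @ x.T  # image-to-image similarities
    sim_tt = y @ y.T  # text-to-text similarities
    sim_it = x @ y.T  # image-to-text similarities
    n = len(x)
    # L_I: image-image similarities should match the corresponding text-text similarities.
    in_modal = np.sum((sim_ii - sim_tt) ** 2) / n
    # L_C: sim(image_j, text_k) should match sim(image_k, text_j).
    cross_modal = np.sum((sim_it - sim_it.T) ** 2) / n
    return in_modal, cross_modal

# Toy data (assumption): the text space is a rotated copy of the image space.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(6, 4))
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
text_emb = image_emb @ rotation

in_modal, cross_modal = symmetry_terms(image_emb, text_emb)
```

In this toy case the in-modal term vanishes (up to floating-point error) because a rotation preserves all pairwise similarities within a modality, even though the two modalities occupy different regions of the space.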

The visualizations below already hint that the embeddings are symmetric: The in-modal similarity heatmaps of the image and text quadrants look similar, and the modalities in the image-text points in the PCA projection seem almost parallel. However, it is also clear that the modality gap is still present.

We can confirm the symmetry with a slight modification of the similarity heatmap. For this, we calculate the difference between the in-modal similarities of images and texts, which results in a matrix where values approach zero (i.e., the matrix is a blue rectangle).

When looking back to Liang et al.’s experiments that tried to align the embeddings by moving them along the modality gap vector, we saw that this does not work well for CLIP embeddings because the space is not symmetrized. For CyCLIP, however, the symmetrization is enforced with the objective function, and we could see in the previous plots that the spaces are indeed mostly symmetric. We can now compare the loss landscapes between CyCLIP and CLIP embeddings for varying modality distances. In the visualization below, we can see that CyCLIP’s loss landscape looks very different from CLIP’s landscape: in the distance interval of [-1.5; 1.5] the loss does not change much, which is another indicator that CyCLIP indeed learns nearly symmetric embedding spaces. (Note that the loss outside the [-1.5; 1.5] interval increases. This could be due to the fact that the spaces are not perfectly symmetric. Another explanation could be that the embeddings are normalized to the unit sphere, while the method proposed for closing the gap operates in Euclidean space.)

Again, when closing the modality gap, the similarity heatmap becomes more homogeneous (i.e., the in-modal similarities and cross-modal similarities are similarly strong). The image-text pairs in the scatter plot are closer, but still scattered.

Alternatively, we can use a UMAP or t-SNE projection to better utilize the neighborhoods of similar embeddings instead of linearly determining the axes with the highest variance, as done in PCA. In the case of CyCLIP (scatter plot on the right), the use of a neighborhood-based dimensionality reduction technique results in shorter edges and better clustering of similar embeddings. For CLIP (scatter plot on the left), edges remain long, and similar images and texts are not clustered together. See the Appendix for examples with 5000 samples.

Let us summarize how the loss changes for CLIP and CyCLIP when moving the embeddings together along the modality gap vector. For that, we again use the entire MSCOCO validation dataset of 5000 samples.

Model	Original Distance	Original Loss	Closed Distance	Closed Loss	Loss Difference
CLIP	0.818611	0.355370	0.035077	1.124780	0.769410
CyCLIP	0.873026	0.763433	0.001218	0.848867	0.085434

The numbers confirm that CyCLIP's loss changes far less than CLIP's loss when manually closing the gap – it neither gets better nor significantly worse. However, the question arises:

Why Do We Even Want to Close the Modality Gap, if We Do Not Gain Performance?

From a performance optimization point of view, this is a valid question. What's the point of interfering in a well-performing system? On the other hand, the alignment of embedding spaces can become important for other downstream tasks. For example, using image and text embeddings interchangeably, as done in language-guided image generation, relies on the fact that image and text embeddings are aligned with each other and live in the same space. Another aspect is that an aligned embedding space is closer to how humans expect multi-modal models to see the data. Furthermore, closing the modality gap allows us to actually visualize texts and images in the same space, and develop interactive exploration tools that help to understand multi-modal data (e.g., analyzing pairs of human written captions and machine-generated images to find insights about text-to-image generation models like StableDiffusion).

These example use cases illustrate why the pure "performance optimization" point of view is not the only one. In fact, if we can close the modality gap without significantly losing performance, we have a win-win situation!

CLOOB

In the previous section, we established a way to manually close the modality gap of CyCLIP embeddings. However, wouldn’t it be better to already close the gap during training and not rely on post-hoc manipulations? While we did not experiment with further modifying (Cy)CLIP’s objective to close the gap during training, we stumbled upon a different learning approach that naturally closes the gap.

Contrastive Leave One Out Boost (short: CLOOB) is a variation of CLIP that proposes an alternative objective together with an associative memory to train the model. The two main components of their method are (i) modern Hopfield networks (see this blog post for a detailed explanation of modern Hopfield networks: https://ml-jku.github.io/hopfield-layers/) and (ii) the InfoLOOB loss instead of the InfoNCE loss used by CLIP. The authors argue that their modifications solve CLIP's "explaining away" problem (i.e., focusing on a small subset of features while ignoring other relevant features) and InfoNCE's saturation problem (see Fürst et al. for more information).
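The difference between the two losses can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the Hopfield retrieval step of CLOOB is omitted, the inverse temperature `inv_tau` is an assumed value, and both functions sum the image-to-text and text-to-image directions. The only difference is that InfoLOOB leaves the matching pair out of the denominator.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def info_nce(x, y, inv_tau=30.0):
    """InfoNCE: the positive pair is part of the denominator."""
    logits = inv_tau * x @ y.T
    pos = np.diag(logits)
    return np.mean(logsumexp(logits, 1) - pos) + np.mean(logsumexp(logits, 0) - pos)

def info_loob(x, y, inv_tau=30.0):
    """InfoLOOB: the positive pair is left out of the denominator."""
    logits = inv_tau * x @ y.T
    pos = np.diag(logits)
    # Setting the diagonal to -inf removes the positive pair from the log-sum-exp.
    masked = np.where(np.eye(len(x), dtype=bool), -np.inf, logits)
    return np.mean(logsumexp(masked, 1) - pos) + np.mean(logsumexp(masked, 0) - pos)

# Toy data (assumption): four perfectly matched, mutually orthogonal unit vectors.
x = np.eye(4)
y = np.eye(4)

nce_value = info_nce(x, y)
loob_value = info_loob(x, y)
```

For these perfectly separated pairs, the InfoNCE value is already essentially zero, so it provides almost no further learning signal (the saturation effect), whereas the InfoLOOB value is still far from any lower bound and keeps rewarding a larger margin between matching and non-matching pairs.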

However, we also observe something else: These modifications seem to aid the closure of the modality gap!

In the following, we investigate which of the major components of CLOOB are leading to this lower modality gap (i.e., the InfoLOOB loss or the use of modern Hopfield networks). For that, we look into models from an ablation study provided by the authors of CLOOB. For their study, they trained models with four different configurations: (i) InfoNCE (i.e., the CLIP objective); (ii) InfoLOOB; (iii) InfoNCE in combination with modern Hopfield networks; and (iv) InfoLOOB in combination with modern Hopfield networks (i.e., the CLOOB objective).

They trained each model on two different datasets: the Conceptual Captions (CC) dataset and the YFCC dataset. The CC dataset contains 2.9 million images with high-quality captions and the YFCC dataset contains a subset of 15 million samples from YFCC100M. Furthermore, for the CC dataset, they provide model parameters from two different epochs (31 and 128).

In the visualizations below, we can see that the modality distance for models trained with the InfoLOOB objective is lower than for models trained with the InfoNCE objective. This leads to the assumption that InfoLOOB is the main driver for closing the gap. However, the authors of CLOOB also mention in their paper that very high similarity values, for both matching and non-matching pairs, are considered a sign of overfitting. This overfitting effect can also be reflected by a lower performance in downstream tasks. We see a similar effect when comparing modality distances between models trained on CC for 31 and 128 epochs. While the model trained for 128 epochs has better downstream performance, the modality gap is similar or even slightly larger for this model.

The higher modality distance of the ‘InfoNCE - Hopfield’ models can be explained by the saturation effect of the InfoNCE objective when samples become too similar. Modern Hopfield networks induce higher similarity of retrieved samples, which in turn leads to stronger saturation of the InfoNCE objective and hampers learning.

Finally, when comparing the datasets, the modality distance for models trained on CC is very low compared to that of models trained on the YFCC dataset. This could be explained by the higher quality of the CC dataset. While captions in the CC dataset are almost always related to the corresponding image, captions in the YFCC dataset are noisier and less likely to contain relevant information about the image. Instead, they might only contain the camera settings used to take the image or just text that is close to the image on a website.

Summary of Modality Gap Analysis

The following visualizations give an overview of the similarity heatmap for CLIP and CyCLIP before (left) and after (right) manually removing the modality gap, as well as the similarity heatmap for CLOOB with the official checkpoints from the paper and a CLOOB version that was trained on the LAION 400M dataset and used a ViT instead of a CNN to encode images. Note how the overall distribution of similarity values differs between the various embedding spaces. For example, the two CLOOB models seem to discriminate more strictly between matching and non-matching image-text pairs.

The previous sections taught us about the modality gap, where it comes from, and ways to close it. We also gave reasons for why closing the gap might be beneficial and now demonstrate how closing the modality gap can help with analyzing multi-modal data. We also introduced a handy new way of visualizing latent space embeddings and utilize this again in the following analyses. Finally, we also want to mention that there are ways to visually close the modality gap (i.e., without actually closing the gap in the embedding space), for example using UMAP’s out-of-sample extension (see Appendix).

Converting the Technique into a Tool

Using the previously introduced techniques, we implemented an interactive prototype called “Amumo” (Analyze Multi-Modal Models) (archived version). Users can switch between models, explore the similarity heatmap and scatter plot visualizations, manually close the modality gap, and try various projection methods.

Identifying Data Subsets

We can look into semantic subsets of data by filtering instances based on their captions. The following example shows the visualizations for the subset that contains the substring "dog". We notice that some lines in the similarity matrix have a darker color. When hovering over those darker lines, we can see that most of these instances correspond to images and texts about "hot dogs" or other images that do not show a dog or where a dog is in an uncommon setting. To make this even more obvious, we can use the "Cluster matrix by similarity" function that reorders the similarity heatmap such that similar lines are grouped together. One cluster that stands out in all three CLIP-like models is the "hot dog" cluster. However, we can also see clusters for "dog and frisbee", "dog and bed", or "dog and car".
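One way to obtain such a reordering is hierarchical clustering of the heatmap rows. The sketch below uses SciPy's `linkage` and `leaves_list` for this purpose; whether Amumo's "Cluster matrix by similarity" function works exactly like this is an assumption, and the function name `cluster_order` and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

def cluster_order(sim):
    """Return an ordering that places rows with similar similarity profiles next to each other."""
    # Each row of the similarity matrix is treated as a feature vector.
    tree = linkage(sim, method="average", metric="euclidean")
    return leaves_list(tree)

# Toy data (assumption): two interleaved groups of embeddings, e.g. "dog" vs. "hot dog".
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 5)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
emb = centers[labels] + 0.05 * rng.normal(size=(10, 3))
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T

order = cluster_order(sim)
# Apply the same permutation to rows and columns so the diagonal stays aligned.
reordered = sim[np.ix_(order, order)]
```

After reordering, the members of each group are adjacent, so the groups appear as contiguous blocks along the diagonal of `reordered`.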

Analyzing the DiffusionDB Dataset

We would like to see what the models’ latent-space embeddings look like for a dataset that is not (entirely) procured by humans. To this end, we use DiffusionDB, a collection of human-written captions and images generated from these captions by Stable Diffusion. We use a subset of 100 randomly selected samples to qualitatively explore the embedding spaces created by the CLIP models we previously introduced. You can use the instance of Amumo below to follow along with the analysis described.

With the default settings, we randomly explore the dataset and get a feeling for the data contained in this subset. We can investigate instances that are outliers in the similarity heatmap by hovering rows or cells that have particularly large or low similarity values. For example, there are some particularly bright cells scattered in the image in-modal similarity heatmap. Upon hovering, we see that all of these images are blurry. We know that DiffusionDB added blur filters for images that were detected to show inappropriate content. Interestingly, CLIP seems to create similar latent embeddings for blurry items, causing them to show high similarity in the similarity heatmap.

For further analyses, we choose "Cluster matrix by similarity" to order the matrix in a way that groups similar rows in the heatmap and investigate the clusters that are emerging. We can see a cluster for "impressionism and crystal" that seems to have homogeneous similarities over all images (i.e., there is a distinct purple line along all images of this cluster). Upon further investigation, we see that the captions in this cluster are mostly vague texts or single words (e.g., "crystal", "impressionism") that can apply to a lot of images. The same cluster becomes apparent in the text in-modal similarity heatmap, where all captions within the cluster seem to have high similarity.

Let’s close the modality gap to investigate clusters in a 2-dimensional scatter plot. We can either do this by switching to the CLOOB model or using CyCLIP in combination with the "Close modality gap" option. We see that the embeddings are aligned and can use the interactive scatter plot to investigate clusters. For example, we can try to find the cluster of blurry images, or we can try to find the cluster with instances of "impressionism". Of course, this would be much more fun on a larger scale :)

Analyzing ImageBind Using Text-Image-Audio Data

So far, we have looked into models trained to map two modalities to a shared space. ImageBind is a model trained to embed six different modalities: images, text, audio, depth, thermal, and inertial sensor data. The approach “binds” the modalities using image-data pairs. More specifically, each modality was only paired with image (and video) data, but not with any of the other modalities. Although there is no explicit matching between the other modalities, ImageBind succeeded in zero-shot evaluations between them.

Going beyond the quantitative verification used by the authors of ImageBind, we are interested in how well the embedding spaces align between explicitly paired modalities and implicit pairing.

Text + Image

Let us first look into the explicit pairing of image-text modalities using our MSCOCO dataset. Similarly to CLIP, ImageBind exhibits a modality gap between images and texts, which makes sense because ImageBind also uses the InfoNCE loss.

Text + Image + Audio

For analyzing the implicit pairing of image, text, and audio modalities, we use a subset of the AudioSet dataset that contains animal sounds of birds, cats, dogs, and horses. The AudioSet dataset only contains text and audio pairs. However, each instance also contains the YouTube ID from which the audio data was retrieved. We use this ID to retrieve thumbnail images for each audio instance. This results in a triplet dataset of text-audio-image pairs that we can analyze with Amumo.

Here we can see that the in-modal similarities of images and texts are very pronounced (similar to how the CLIP similarities of images and texts look), which indicates a modality gap. The audio embeddings, on the other hand, have a lower overall similarity to other audio embeddings. In particular, similarities between audio embeddings and embeddings from other modalities seem more evenly distributed, which we previously saw in models that do not suffer from a modality gap. A reason for this might be that audio data has higher noise levels and more variation in its spectrogram compared to natural images. Therefore, it is harder for encoders to extract only the relevant features, leaving some noise behind in the latent representation, which, in turn, could lead to lower in-modal similarities.

When clustering the similarity matrix by text embeddings, four main clusters emerge—one for each type of animal included in the dataset—and one smaller cluster that contains a mix of cats and dogs labeled “domestic animals”.

We can also use the audio-to-audio similarities for clustering the heatmap. Interestingly, the audio embeddings seem to be very similar among all instances of cats and dogs. These instances are clustered under the label “domestic animals | pets”. When listening to the audio samples in this cluster, it becomes apparent that most instances contain human chatter.

Analyzing CLOOME Using Microscopy-Molecule Data

Moving away from natural images and audio, we now analyze CLOOME—a model that was trained on microscopy images and molecular structures. We compare two versions of CLOOME—one has been trained with the CLIP objective, the other with the CLOOB objective. The model architecture is the same for both versions: a CNN as an image encoder and a feed-forward NN that encodes the structure-based vector representation (ECFP) of molecules. For analyzing the models, we use the test set used by the authors.

When looking at the similarity heatmaps of each model, we can observe that the molecule-to-molecule similarities seem to be the most pronounced, but overall, the similarities are much more evenly distributed over the entire heatmap. When looking at the modality distances of both models, we can see that the CLOOB-based model has a slightly lower distance compared to the CLIP-based model. However, the difference is much smaller than what we observed before with natural images and texts. We can also observe this in the scatter plots: the modality pairs are not as cleanly separated as they were for CLIP on natural images. This could be caused by the fact that CLIP was trained with a considerably larger dataset than CLOOME. Hence, if a molecule and its corresponding image are similar enough to a pair of samples in the training set, it would be easier for the model to have learned to encode them with high similarity. A different explanation could be the quality of the dataset. If modality pairs are matched precisely (as done in the case of molecule-microscopy pairs), the modality gap might be lower than for datasets that contain a lot of mismatched pairs (e.g., image-text pairs that are automatically retrieved from the internet). Another reason could be the different encoders used (transformer vs. feed-forward) or the properties of the data used for training (i.e., microscopy images have much lower variation than natural images, and molecular structures have very different properties compared to natural text). However, these hypotheses need to be studied further.

Augmentation Analyses

As previously demonstrated, we can identify patterns in datasets and subsets of datasets using the similarity heatmap visualizations. Now, we would also like to see if we can use the same techniques to find patterns in augmentations of a single data point. For example, we take a single image, generate rotated versions of this image, and use this augmented dataset to compute CLIP embeddings and similarities. The results of this experiment for the three CLIP-like models are shown in the visualization below (note that we again show two variants of the CLOOB model). To generate this dataset, we gradually rotate a selected image by 360 degrees over the course of 100 steps. Each step results in a “new” image and a new data point. Note that we only augment the image, but not the text, which results in a completely homogeneous similarity in the in-modal text quadrant of the heatmap and homogeneous stripes along the text dimension in the cross-modal quadrants of the heatmap.

When looking at the in-modal image similarity quadrant of the heatmap for each model using augmentations of the first image, we can see an interesting pattern emerge. In addition to the bright yellow diagonal axis that corresponds to the similarities of images to themselves, there is also the perpendicular off-diagonal axis of the matrix sticking out. When hovering along the off-diagonal, we see that the two images along this axis are actually mirrored versions of each other. It seems like all models are invariant to the horizontal flip transformation for this image. We can also see a checkerboard-like pattern emerge for some images in all models except for CLOOB_LAION400M. When looking into the darker areas of this heatmap in more detail, we can see that the pattern occurs around multiples of 90-degree rotations. The fact that this pattern occurs mainly for the three models that use a CNN-based image encoder and not for the one with the vision transformer could be an indicator that the two architectures vary in their ability to learn rotation-invariant properties. (Note that CLIP and CLOOB_LAION400M, both trained on 400M instances, were trained on a much larger dataset than CLOOB and CyCLIP, which could also be an indicator for varying robustness.) The checkerboard-like pattern seems to be consistent with findings described by Timme et al., where they tested the rotation robustness of various CNN classifiers by measuring the accuracy. The accuracy of the CNNs showed local maxima at multiples of 90-degree rotations and was lower in between those angles.

When looking at the overall distribution of similarity values, we also notice that the ViT-based CLOOB model seems to have more patches of low-similarity values compared to its CNN-based counterparts. This might indicate that ViT’s overall robustness to rotation transformations is lower. In further investigations, we might want to directly compare two versions of CLIP: the current version with the CNN-based image encoder and a version with a ViT-based image encoder, and study the phenomenon on a larger dataset.

Use the interactions to explore the heatmaps for different images yourself.

In a second experiment, we analyze the heatmaps for an image to which we add an increasingly higher noise level. When looking at the heatmaps for the first image, it seems like there is a certain level of noise for each model, after which the model cannot seem to recognize the content of the image anymore. All images with a higher level of noise than this threshold seem to look (almost) the same (as indicated by a bright yellow rectangle at the lower-right corner of the in-modal image similarity quadrant).
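An augmentation series of this kind can be generated with a few lines of NumPy. The sketch below is an illustration only: the type of noise (additive Gaussian), the linear noise schedule, the maximum noise level, and the function name `noisy_versions` are all assumptions, and a random array stands in for a real image with pixel values in [0, 1].

```python
import numpy as np

def noisy_versions(image, steps=100, max_sigma=1.0, seed=0):
    """Return `steps` copies of `image` with linearly increasing Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(0.0, max_sigma, steps)
    # One fixed noise pattern is scaled up step by step (assumption).
    noise = rng.normal(size=image.shape)
    versions = [np.clip(image + sigma * noise, 0.0, 1.0) for sigma in sigmas]
    return versions, sigmas

# Placeholder image (assumption): 32x32 RGB array with values in [0, 1].
image = np.random.default_rng(1).uniform(size=(32, 32, 3))
versions, sigmas = noisy_versions(image)
```

Each element of `versions` would then be passed through the image encoder, and the resulting embeddings would form the rows of the in-modal image quadrant of the heatmap.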

Similarly, for blurry images, we see that at a certain point of blurriness, all images look the same to the models, and they cannot map images and texts together. You can use the dropdown menu to explore the effects of various augmentations.


Conclusion

Throughout this article, we investigated latent embeddings of CLIP-like models. Using scatter plots and similarity heatmaps, we visualized and analyzed the modality gap that naturally occurs for CLIP embeddings. Closing this gap without losing significant performance can be important for downstream tasks like image generation, visual analytics, or human understanding. We showed how to close the gap using CyCLIP in combination with a post-processing method that aligns the embedding spaces and investigated another model (CLOOB) that is able to align the spaces during training. Finally, we introduced Amumo, an interactive visual prototype that allows users to explore embeddings from multi-modal contrastive learning models to help with the understanding of their latent space embeddings. We used Amumo to analyze various (sub-)sets and augmentations of data. We demonstrate Amumo’s capabilities and generalizability using image, text, audio, and molecule data in combination with several different models. We believe that Amumo, and the similarity heatmap, in particular, are useful tools to create intuition about multi-modal latent space embeddings. It allows for the comparison of multi-modal models (e.g., their robustness to transformations) and can help to formulate hypotheses or ideas about such models. However, we want to stress that the analysis is based on a small subset of data points, and insights must still be verified on a larger scale.

Acknowledgments

This work was funded by the Austrian Marshall Plan Foundation under the Marshall Plan Scholarship, the Austrian Science Fund under grant number FWF DFH 23--N, and under the Human-Interpretable Machine Learning project (funded by the State of Upper Austria). The project was conducted during a research visit at the MIT-IBM Watson AI Lab in Cambridge, MA. ASF’s research position was funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training Network - European Industrial Doctorate grant agreement No. 956832, “Advanced machine learning for Innovative Drug Discovery”. Finally, we want to give special thanks to the reviewers of the VISxAI Workshop 2023.

Research Material Statements

The data shown in this article was produced with the Amumo python package (archived version). We provide notebooks to reproduce the results of CLIP, CyCLIP, and CLOOB, CLOOB ablation, and CLOOME. We provide the notebooks for exporting the data used in the interactive article: CLIP, CyCLIP, and CLOOB, CLOOB ablation, and CLOOME.

Authorship

Christina Humer: Conceptualization, Software, Validation, Investigation, Writing - Original Draft, Visualization. Elisabeth Rumetshofer: Investigation, Resources, Writing - Original Draft. Ana Sánchez: Investigation, Resources, Writing - Original Draft. Vidya Prasad: Investigation, Writing - Review & Editing. Günter Klambauer: Writing - Review & Editing. Marc Streit: Conceptualization, Writing - Review & Editing. Hendrik Strobelt: Conceptualization, Writing - Review & Editing.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Conflict of Interest

The authors declare that there are no competing interests.

Appendix

Closing the Gap Using a Larger Dataset

Amumo - our interactive prototype - is an easy way to explore a small subset of image-text pairs. This analysis can help form intuition about a particular dataset or the model used to map the pairs into a latent space embedding. In addition to the interactive prototype, we also want to showcase the results of the proposed methods for closing the modality gap with a larger dataset. To that end, we again used the entire 5000-sample MSCOCO validation dataset and applied the two methods introduced in the article. We then took the aligned embeddings, projected them with UMAP, and plotted them in a static 2-d scatter plot with lines connecting the matching image-text pairs.

Manually Removing the Modality Gap

For the first method to remove the modality gap, we need to compute CyCLIP embeddings and manually move the two embedding spaces together. The following scatter plot shows the 2-dimensional projection of these modified embeddings. The plot shows that clusters form and that many connection lines stay within clusters, which means that instances that carry a similar meaning are indeed close together in the latent space embedding. However, there are also many inter-cluster connections. These long connections between clusters can be caused by various factors.

For example, the manual modification of the latent space might disturb some parts of the latent space. Although CyCLIP adds restrictions to the objective function to facilitate the emergence of symmetric embedding spaces, this is only an approximation, and the spaces are not perfectly symmetric. Another influencing factor might be that the model does not perceive some image-text pairs as similar and therefore places them in different areas of the embedding space. Finally, there may also be distortions coming from the dimensionality reduction technique we used for creating the 2-d space.

To investigate these assumptions further, we recommend the use of interactive tools that allow exploring large sets of points and clusters in a 2-d space (e.g., the Projection Space Explorer).

The following plot shows the same procedure of manually removing the modality gap, but here we used CLIP embeddings instead. This shows again that manually removing the gap by moving CLIP embeddings onto the same plane destroys the trained latent space too much for it to remain useful.
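
The manual shift discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact post-processing used for the plots: we assume the modality gap vector is the difference between the image and text centroids, that each modality is moved halfway along it, and that the embeddings are re-normalized afterwards. The function name `close_gap` and the synthetic data are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def close_gap(image_emb, text_emb):
    """Shift both modalities toward each other along the modality gap vector.

    Assumption: the gap vector is the difference of the two centroids, and
    each modality is moved by half of it before re-normalizing.
    """
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    image_shifted = l2_normalize(image_emb - 0.5 * gap)
    text_shifted = l2_normalize(text_emb + 0.5 * gap)
    return image_shifted, text_shifted

# Synthetic stand-in: shared structure plus an opposite offset per modality.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 8))
offset = np.zeros(8)
offset[0] = 3.0
image_emb = l2_normalize(base + offset)
text_emb = l2_normalize(base - offset)

image_shifted, text_shifted = close_gap(image_emb, text_emb)

# Distance between the modality centroids before and after the shift.
gap_before = np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0))
gap_after = np.linalg.norm(image_shifted.mean(axis=0) - text_shifted.mean(axis=0))
```

A single translation like this can only align the two spaces if they have the same shape up to that offset, which is why the procedure is better suited to CyCLIP than to CLIP embeddings.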

Inherently Removing the Modality Gap

The second method of removing the modality gap utilizes a bi-modal model capable of aligning the embedding spaces: CLOOB. In this case, we can directly use the image and text embeddings generated by the model and project them to a 2-d space for visualization. We see a similar result to before: clusters emerge with many intra-cluster connections, but also plenty of inter-cluster connections. There also seems to be a rather large cluster of points with many connections.

In comparison, we also show the results for the same model, but trained with 400M instances of the LAION dataset and with a vision transformer architecture for image embedding instead of a CNN architecture. We can see smaller clusters, and there seem to be fewer connections between clusters. To confirm these qualitative findings and to compare the methods, we would have to use a quantitative measure (e.g., measuring how many intra-cluster connections there are for each method).
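
One possible form of such a quantitative measure is sketched below. It assumes that cluster labels have already been computed on the joint 2-d projection with any clustering algorithm, so that image and text points share the same label ids. The function name and the toy labels are hypothetical.

```python
import numpy as np

def intra_cluster_fraction(image_labels, text_labels):
    """Fraction of image-text pairs whose two points fall into the same cluster.

    Entry i of each array is the cluster id of the image (or text) point of
    pair i; both must come from one clustering of the joint projection.
    """
    image_labels = np.asarray(image_labels)
    text_labels = np.asarray(text_labels)
    if image_labels.shape != text_labels.shape:
        raise ValueError("Expected one image label and one text label per pair.")
    return float(np.mean(image_labels == text_labels))

# Toy labels for six pairs: pairs 0, 2, 3, and 4 stay within one cluster.
image_labels = [0, 0, 1, 1, 2, 2]
text_labels = [0, 1, 1, 1, 2, 0]
fraction = intra_cluster_fraction(image_labels, text_labels)
```

Computing this fraction for each method on the same dataset would give a simple basis for comparison, although the value depends on the chosen clustering and projection.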

Visually Closing the Modality Gap

In the previous sections, we learned about two ways that can help us close the modality gap:

Manually: post-process embeddings by moving them together along the modality gap vector; this only makes sense if the two embedding spaces are symmetric, like in CyCLIP.
Inherently: define the model architecture and/or training objective in a way that aids the closing of the gap, like in CLOOB.

Let's also recall the reasons why we would like to close the gap:

It can have advantages for downstream tasks (e.g., if embeddings from two modalities need to be interchangeable).
It aids the development of multi-modal visual analytics tools.
It matches human expectations of what the embedding space should look like.

From a visualization point of view, we might not care about the other two reasons, as long as we can visually close the modality gap. By visually closing the gap, we do not change the embeddings or the model, but map them into a shared low-dimensional space. This can be accomplished in at least two ways: (i) out-of-sample projection and (ii) concatenating image-text embeddings and treating them as one combined embedding. The first method results in a low-dimensional data point for each image and each text embedding; the second method results in a combined low-dimensional space for the image-text pairs. Which method to choose depends on the goal of the visualization.

Out-of-Sample Projection

We can first project embeddings of either images or texts using UMAP. This projection builds a neighborhood graph using in-modal similarities and results in a low-dimensional projection for one modality. We can utilize UMAP's out-of-sample projection to also project the embeddings of the second modality onto the space of the first modality. Since the out-of-sample projection again tries to map each point to the most similar points in the existing low-dimensional space, you can imagine this as using the cross-modal similarities between images and texts. Since this is what CLIP was trained on (i.e., optimizing distances between texts and images), the mapping should visually remove the modality gap.

As an example, we take the 5000 sample MSCOCO validation set and first fit and transform the image embeddings. We then transform the text embeddings with the existing UMAP embedding and show the results in a scatter plot. The overall structure seems to align images and texts; while there are a lot of cross-cluster connections that show that image-text pairs are not always close to each other, most of the connections seem to be within clusters. Cross-cluster connections can be indicators of various things. For example, the pairs may not be deemed similar by CLIP, which results in embeddings that are far away from each other in the high-dimensional latent space, which would be reflected in the low-dimensional projection of the embeddings. Another issue might come from the projection method itself. As mentioned previously, projecting data to a low-dimensional space comes with a loss of information that might introduce artifacts. The fact that we use out-of-sample projection might amplify this effect even further.
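
The fit-then-transform pattern described above can be sketched as follows. To keep the example free of extra dependencies, a small NumPy PCA class stands in for UMAP; this is a deliberate substitution and does not reproduce UMAP's neighborhood-based behavior. The helper `project_out_of_sample`, the `PCAReducer` class, and the synthetic embeddings are hypothetical.

```python
import numpy as np

class PCAReducer:
    """Tiny NumPy PCA used here as a stand-in for UMAP.

    umap-learn's umap.UMAP also exposes fit and transform, so it could be
    passed to project_out_of_sample in place of this class.
    """

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, x):
        return (x - self.mean_) @ self.components_.T

def project_out_of_sample(reducer, reference_emb, other_emb):
    """Fit the projection on one modality, then map both modalities with it."""
    reducer.fit(reference_emb)
    return reducer.transform(reference_emb), reducer.transform(other_emb)

# Synthetic stand-in: each text embedding is a noisy copy of its image embedding.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(50, 32))
text_emb = image_emb + 0.1 * rng.normal(size=(50, 32))

image_2d, text_2d = project_out_of_sample(PCAReducer(2), image_emb, text_emb)

# Matching pairs should land closer together than mismatched pairs.
pair_dist = np.linalg.norm(image_2d - text_2d, axis=1).mean()
mismatch_dist = np.linalg.norm(image_2d - text_2d[::-1], axis=1).mean()
```

Only the first modality shapes the low-dimensional space here; the second modality is placed into it afterwards, which is the property that distinguishes out-of-sample projection from fitting on both modalities jointly.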

Visual analytics tools can be helpful in gaining further insights into the data and why certain pairs seem to be far away from each other. Basic interactions like hover information or selection summaries could already be a good start for further investigation. The rich nature of image and text data also allows for more advanced analytic visualizations; for example, texts can be used to extract labels for clusters that carry rich semantic meaning, or example images could be used to summarize clusters. Visually encoding the high-dimensional similarity of embedding pairs (e.g., as saturation of the lines between pairs) could be a helpful indicator that could show whether point pairs are far apart from each other due to artifacts from the projection or due to CLIP not recognizing them to be similar.
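
As a rough sketch of the line-saturation idea, the high-dimensional cosine similarity of each pair could be rescaled to the range [0, 1] and used as the opacity or saturation of the connecting line. The function name `pair_opacity` and the min-max rescaling are assumptions made for illustration.

```python
import numpy as np

def pair_opacity(image_emb, text_emb):
    """Cosine similarity of each matching pair, min-max rescaled to [0, 1]."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = np.sum(img * txt, axis=1)  # row-wise dot product of unit vectors
    span = sim.max() - sim.min()
    if span == 0:
        # All pairs are equally similar; draw every line fully opaque.
        return np.ones_like(sim)
    return (sim - sim.min()) / span

# Three toy pairs: same direction, orthogonal, and opposite direction.
image_emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
text_emb = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
opacity = pair_opacity(image_emb, text_emb)
```

A long but strongly saturated line would then hint at a projection artifact, while a long and faint line would suggest that the model itself considers the pair dissimilar.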

Concatenating Image and Text Embeddings

For a different kind of visual representation of image-text embedding spaces, we can simply concatenate the embedding vectors and transform them into a combined low-dimensional space. The low-dimensional space can be visualized in a scatter plot and visual analytics approaches can be used to explore the data. Note that with this approach, we only have one low-dimensional data point per image-text pair.
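
The concatenation approach can be sketched as follows. As a simplification, a NumPy PCA stands in for the projection step; any dimensionality reduction technique could be used instead. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def concatenate_pairs(image_emb, text_emb):
    """Join each image embedding with its matching text embedding."""
    if image_emb.shape[0] != text_emb.shape[0]:
        raise ValueError("Expected the same number of image and text embeddings.")
    return np.concatenate([image_emb, text_emb], axis=1)

def pca_2d(x):
    """Project rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: random vectors play the role of model embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(40, 16))
text_emb = rng.normal(size=(40, 16))

combined = concatenate_pairs(image_emb, text_emb)  # shape (40, 32)
pairs_2d = pca_2d(combined)                        # one 2-d point per pair
```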