Exploring the Latent Space of CLIP-like Models Using Inter-Modal Pairs
Under Review: This paper is under review on the experimental track of the Journal of Visualization and Interaction. See the reviews and issues for this paper.
Introduction
Multi-modal contrastive learning models are trained to map data from two or more modalities to a shared
embedding space. This latent data representation can then be used for zero- or few-shot classification,
cross-modal data retrieval, or generation tasks. Although remarkable results have been reported when
testing multi-modal models on these tasks, understanding the latent representations remains challenging.
In particular, many multi-modal models exhibit a phenomenon called the “modality gap”, leading to a
latent space that cleanly separates the modalities.
Conclusion
This article introduces and compares three models trained on image-text pairs. We use these models and
interactive visualizations to explain where the modality gap arises from, how it can be closed, and why
closing it is important. In the second part, we introduce “Amumo”, a framework we implemented for
analyzing multi-modal models. We describe various analysis tasks that can be performed with Amumo. In
particular, Amumo can be used for (i) analyzing models, (ii) comparing models with each other, and (iii)
analyzing multi-modal datasets. We demonstrate Amumo’s capabilities and generalizability using image,
text, audio, and molecule data in combination with several different models.
Implementation
For smooth integration into research workflows, we implemented Amumo as a Python package with Jupyter
widgets. We implemented the interactive visualizations in this article with JavaScript and plotly.js.
Demonstration & Materials
A minimal usage demonstration of Amumo is deployed with MyBinder.
We also provide a demonstration of analyzing CLOOME with Amumo.
The code for the Amumo Python package and guidelines on how to use it can be found in the GitHub
repository.
Contrastive Language Image Pre-training (CLIP)
Despite the clear objective that is supposed to bring texts and images to a shared embedding space there is
a phenomenon called "Modality Gap"
The use of dimensionality reduction methods to compute a 2-dimensional view of the data clearly shows the separation between the two modalities. However, dimensionality reduction goes hand in hand with a loss of information and possible distortion of the data. We propose a different way to visualize the modality gap. Similarity heatmaps are a simple yet effective way of visualizing latent space embeddings that also helped us to better understand the modality gap. Using this visualization also allowed us to gain interesting insights unrelated to the modality gap, which we could not have found with the scatter plot visualization alone.
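As a rough illustration of what such a heatmap contains, the following minimal sketch stacks image and text embeddings and computes all pairwise cosine similarities. The random embeddings, the function names, and the quadrant ordering (images first, then texts) are assumptions made for this example and are not taken from Amumo.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def similarity_heatmap_matrix(image_emb, text_emb):
    """Return the (2N, 2N) cosine-similarity matrix of stacked embeddings.

    Assumed layout: top-left = image-image, bottom-right = text-text,
    off-diagonal blocks = image-text (cross-modal).
    """
    stacked = np.vstack([l2_normalize(image_emb), l2_normalize(text_emb)])
    return stacked @ stacked.T

# Random stand-in data (hypothetical); real embeddings would come from a model.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 512))
text_emb = rng.normal(size=(100, 512))

sim = similarity_heatmap_matrix(image_emb, text_emb)
n = len(image_emb)
cross_modal = sim[:n, n:]                 # image-to-text similarities
pair_similarities = np.diag(cross_modal)  # similarity of each matching pair
```

The diagonal of the cross-modal block holds the similarities of the matching image-text pairs, which is where a contrastively trained model should place its highest cross-modal values.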
The similarity heatmap below shows the same subset of 100 image-text pairs previously shown in a scatter
plot. However, in this case, we show the cosine similarities
Now, where does this modality gap even come from?
As analyzed by Liang et al.
where
They also experimented with manually reducing the modality gap by moving the embeddings closer together along the gap vector.
However, since the modality subspaces trained by CLIP are not symmetric
Now, what do the similarity heatmap and scatter plot look like when we manually close the gap? As shown in the visualization below, the similarity matrix seems more homogeneous (i.e., in-modal similarities and cross-modal similarities are on a similar level), and points in the scatter plot are closer together. However, the scatter plot also shows that the edges between image-text pairs are still long, and the text embeddings concentrate more on the center compared to the image embeddings.
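The manual gap closing described above can be sketched as follows. This is a minimal example under stated assumptions: the gap vector is taken as the difference between the modality centroids, both modalities are shifted by half of it, and the results are re-normalized. The toy data and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(image_emb, text_emb):
    """Gap vector: difference between the centroids of the two modalities."""
    return image_emb.mean(axis=0) - text_emb.mean(axis=0)

def close_modality_gap(image_emb, text_emb, fraction=1.0):
    """Shift both modalities toward each other along the gap vector.

    fraction=1.0 makes the centroids coincide before re-normalization;
    smaller values only partially close the gap.
    """
    gap = modality_gap(image_emb, text_emb)
    shifted_images = image_emb - 0.5 * fraction * gap
    shifted_texts = text_emb + 0.5 * fraction * gap
    return l2_normalize(shifted_images), l2_normalize(shifted_texts)

# Toy embeddings with an artificial offset between the modalities (hypothetical).
rng = np.random.default_rng(0)
offset = np.zeros(64)
offset[0] = 3.0
image_emb = l2_normalize(rng.normal(size=(200, 64)) + offset)
text_emb = l2_normalize(rng.normal(size=(200, 64)) - offset)

distance_before = np.linalg.norm(modality_gap(image_emb, text_emb))
closed_images, closed_texts = close_modality_gap(image_emb, text_emb)
distance_after = np.linalg.norm(modality_gap(closed_images, closed_texts))
```

Because the shifted embeddings are projected back onto the unit sphere, the centroid distance after closing is small but generally not exactly zero.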
As mentioned previously, the two modalities (i.e., texts and images) live in different embedding spaces, and
the two embeddings vary in structure (i.e., they are not symmetric to each other).
Recently published papers propose different
versions of CLIP, where the objective function has been adjusted to regularize the trained embedding space.
We look closer into two promising approaches, namely, CyCLIP
According to Goel et al.
The visualizations below already hint that the embeddings are symmetric: The in-modal similarity heatmaps of the image and text quadrants look similar, and the modalities in the image-text points in the PCA projection seem almost parallel. However, it is also clear that the modality gap is still present.
We can confirm this with a slight modification of the similarity heatmap. For this, we calculate the difference between the in-modal similarities of images and texts, which results in a matrix where values approach zero (i.e., the matrix is a blue rectangle).
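A minimal sketch of this difference matrix, assuming unit-normalized embeddings and cosine similarity. The toy data is hypothetical: the "symmetric" text embeddings are a rotated copy of the image embeddings, which preserves all in-modal similarities exactly.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def in_modal_similarity_difference(image_emb, text_emb):
    """Image-image cosine similarities minus text-text cosine similarities."""
    images = l2_normalize(image_emb)
    texts = l2_normalize(text_emb)
    return images @ images.T - texts @ texts.T

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(50, 32))

# Perfectly symmetric case: texts are the images under a random rotation.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
symmetric_text_emb = image_emb @ rotation
symmetric_diff = in_modal_similarity_difference(image_emb, symmetric_text_emb)

# Unrelated case: texts have no structural relation to the images.
unrelated_text_emb = rng.normal(size=(50, 32))
unrelated_diff = in_modal_similarity_difference(image_emb, unrelated_text_emb)
```

In the rotated case the difference matrix is zero up to floating-point error, while unrelated embeddings produce clearly non-zero entries.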
When looking back to Liang et al.’s
Again, when closing the modality gap, the similarity heatmap becomes more homogeneous (i.e., the in-modal similarities and cross-modal similarities are similarly strong). The image-text pairs in the scatter plot are closer, but still scattered.
Alternatively, we can use a UMAP or t-SNE projection to better utilize the neighborhoods of similar embeddings instead of linearly determining the axes with the highest variance, as done in PCA. In the case of CyCLIP (scatter plot on the right), the use of a neighborhood-based dimensionality reduction technique results in shorter edges and better clustering of similar embeddings. For CLIP (scatter plot on the left), edges remain long, and similar images and texts are not clustered together. See the Appendix for examples with 5000 samples.
Let us summarize how the loss changes for CLIP and CyCLIP when moving the embeddings together along the modality gap vector. For that, we again use the entire MSCOCO validation dataset of 5000 samples.
Model | Original Distance | Original Loss | Closed Distance | Closed Loss | Loss Difference |
---|---|---|---|---|---|
CLIP | 0.818611 | 0.355370 | 0.035077 | 1.124780 | 0.769410 |
CyCLIP | 0.873026 | 0.763433 | 0.001218 | 0.848867 | 0.085434 |
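The columns of the table above could be computed along the following lines. This is only a sketch under assumptions: the modality distance is taken as the Euclidean distance between the modality centroids, the loss is a symmetric InfoNCE loss, and `logit_scale` is a hypothetical temperature setting. The toy data will not reproduce the reported numbers.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric InfoNCE loss; matching pairs sit on the diagonal."""
    logits = logit_scale * (image_emb @ text_emb.T)
    idx = np.arange(len(logits))
    image_to_text = -log_softmax(logits)[idx, idx].mean()
    text_to_image = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (image_to_text + text_to_image))

def modality_distance(image_emb, text_emb):
    """Euclidean distance between the modality centroids (assumed definition)."""
    return float(np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0)))

def close_gap(image_emb, text_emb):
    """Shift both modalities by half the gap vector and re-normalize."""
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    return l2_normalize(image_emb - 0.5 * gap), l2_normalize(text_emb + 0.5 * gap)

def gap_summary(image_emb, text_emb, logit_scale=100.0):
    """Collect distance and loss before and after closing the gap."""
    closed_images, closed_texts = close_gap(image_emb, text_emb)
    original_loss = contrastive_loss(image_emb, text_emb, logit_scale)
    closed_loss = contrastive_loss(closed_images, closed_texts, logit_scale)
    return {
        "original_distance": modality_distance(image_emb, text_emb),
        "original_loss": original_loss,
        "closed_distance": modality_distance(closed_images, closed_texts),
        "closed_loss": closed_loss,
        "loss_difference": closed_loss - original_loss,
    }

# Toy paired embeddings: shared content plus a modality-specific offset (hypothetical).
rng = np.random.default_rng(0)
content = rng.normal(size=(50, 16))
offset = np.zeros(16)
offset[0] = 4.0
image_emb = l2_normalize(content + offset + 0.1 * rng.normal(size=content.shape))
text_emb = l2_normalize(content - offset + 0.1 * rng.normal(size=content.shape))
summary = gap_summary(image_emb, text_emb)
```

The loss difference is defined here as closed loss minus original loss, which matches the arithmetic of the rows in the table.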
The numbers confirm that CyCLIP's loss changes far less than CLIP's loss when manually closing the gap – it neither gets better nor significantly worse. However, the question arises:
From a performance optimization point of view, this is a valid question. What's the point of interfering in
a well-performing system? On the other hand, the alignment of embedding spaces can become important for
other downstream tasks. For example, using image and text embeddings interchangeably, as done in
language-guided image generation, relies on the fact that image and text embeddings are aligned with each
other and live in the same space. Another aspect is that an aligned embedding space is closer to how humans
expect multi-modal models to see the data. Furthermore, closing the modality gap allows us to actually
visualize texts and images in the same space, and develop interactive exploration tools that help to
understand multi-modal data (e.g., analyzing pairs of human written captions and machine-generated images to
find insights about text-to-image generation models like StableDiffusion
These example use cases show why the pure "performance optimization" point of view is not the only one that matters. In fact, if we can close the modality gap without significantly losing performance, we have a win-win situation!
In the previous section, we established a way to manually close the modality gap of CyCLIP embeddings. However, wouldn’t it be better to already close the gap during training and not rely on post-hoc manipulations? While we did not experiment with further modifying (Cy)CLIP’s objective to close the gap during training, we stumbled upon a different learning approach that naturally closes the gap.
Contrastive Leave One Out Boost (short: CLOOB)
However, we also observe something else: These modifications seem to aid the closure of the modality gap!
In the following, we investigate which of the major components of CLOOB are leading to this lower modality gap (i.e., the InfoLOOB loss or the use of modern Hopfield networks). For that, we look into models from an ablation study provided by the authors of CLOOB. For their study, they trained models with four different components: (i) InfoNCE (i.e., the CLIP objective); (ii) InfoLOOB; (iii) InfoNCE in combination with modern Hopfield networks; and (iv) InfoLOOB in combination with modern Hopfield networks (i.e., the CLOOB objective).
They trained each model on two different datasets: the Conceptual Captions (CC) dataset
In the visualizations below, we can see that the modality distance for models trained with the InfoLOOB objective is lower than for models trained with the InfoNCE objective. This leads to the assumption that InfoLOOB is the main driver for closing the gap. However, the authors of CLOOB also mention in their paper that very high similarity values, for both matching and non-matching pairs, are considered a sign of overfitting. This overfitting effect can also be reflected by a lower performance in downstream tasks. We see a similar effect when comparing modality distances between models trained on CC for 31 and 128 epochs. While the model trained for 128 epochs has better downstream performance, the modality gap is similar or even slightly larger for this model.
The higher modality distance of the ‘InfoNCE - Hopfield’ models can be explained by the saturation effect of the InfoNCE objective when samples become too similar. Modern Hopfield networks induce higher similarity of retrieved samples, which in turn leads to stronger saturation of the InfoNCE objective and hampers learning.
Finally, when comparing the datasets, the modality distances for models trained on CC are very low compared to those trained on the YFCC dataset. This could be explained by the higher quality of the CC dataset. While captions in the CC dataset are almost always related to the corresponding image, captions in the YFCC dataset are noisier and less likely to contain relevant information about the image. Instead, they might only contain the camera settings used to take the image or just text that is close to the image on a website.
The following visualizations give an overview of the similarity heatmap for CLIP and CyCLIP before (left)
and after (right) manually removing the modality gap, as well as the similarity heatmap for CLOOB with the
official checkpoints from the paper and a CLOOB version that was trained on the LAION 400M dataset and used
a ViT instead of a CNN to encode images
The previous sections taught us about the modality gap, where it comes from, and ways to close it. We also gave reasons for why closing the gap might be beneficial and now demonstrate how closing the modality gap can help with analyzing multi-modal data. We also introduced a handy new way of visualizing latent space embeddings and utilize this again in the following analyses. Finally, we also want to mention that there are ways to visually close the modality gap (i.e., without actually closing the gap in the embedding space), for example using UMAP’s out-of-sample extension (see Appendix).
Using the previously introduced techniques, we implemented an interactive prototype called “Amumo” (Analyze Multi-Modal Models). Users can switch between models, explore the similarity heatmap and scatter plot visualizations, manually close the modality gap, and try various projection methods.
We can look into semantic subsets of data by filtering instances based on their captions. The following example shows the visualizations for the subset that contains the substring "dog". We notice that some lines in the similarity matrix have a darker color. When hovering over those darker lines, we can see that most of these instances correspond to images and texts about "hot dogs" or other images that do not show a dog or where a dog is in an uncommon setting. To make this even more obvious, we can use the "Cluster matrix by similarity" function that reorders the similarity heatmap such that similar lines are grouped together. One cluster that stands out in all three CLIP-like models is the "hot dog" cluster. However, we can also see clusters for "dog and frisbee", "dog and bed", or "dog and car".
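The filtering and reordering steps can be sketched as follows. How Amumo implements "Cluster matrix by similarity" is not described here, so the greedy nearest-neighbor ordering below is only a simple stand-in. The captions, the 2-dimensional embeddings, and the function names are hypothetical.

```python
import numpy as np

def filter_by_substring(captions, substring):
    """Indices of captions that contain the substring (case-insensitive)."""
    needle = substring.lower()
    return np.array(
        [i for i, caption in enumerate(captions) if needle in caption.lower()],
        dtype=int,
    )

def greedy_similarity_order(sim):
    """Order rows so that each row is followed by its most similar unvisited row."""
    n = len(sim)
    if n == 0:
        return np.array([], dtype=int)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        current = order[-1]
        nearest = max(remaining, key=lambda j: sim[current, j])
        order.append(nearest)
        remaining.remove(nearest)
    return np.array(order)

captions = [
    "a dog catching a frisbee",
    "a hot dog with mustard",
    "a cat on a sofa",
    "a dog running after a frisbee",
    "two hot dogs on a plate",
]
# Hypothetical 2-d embeddings standing in for real model outputs.
embeddings = np.array([[1.0, 0.1], [0.1, 1.0], [0.7, 0.7], [1.0, 0.2], [0.2, 1.0]])

indices = filter_by_substring(captions, "dog")
subset = embeddings[indices]
subset = subset / np.linalg.norm(subset, axis=1, keepdims=True)
sim = subset @ subset.T

order = greedy_similarity_order(sim)
reordered_sim = sim[np.ix_(order, order)]
reordered_captions = [captions[i] for i in indices[order]]
```

After reordering, the two frisbee captions and the two hot dog captions end up next to each other, so block structure becomes visible along the diagonal of the reordered matrix.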
We would like to see what the models’ latent-space embeddings look like for a dataset that is not (entirely)
procured by humans. To this end, we use DiffusionDB
With the default settings, we randomly explore the dataset and get a feeling for the data contained in this subset. We can investigate instances that are outliers in the similarity heatmap by hovering over rows or cells that have particularly high or low similarity values. For example, there are some particularly bright cells scattered in the image in-modal similarity heatmap. Upon hovering, we see that all of these images are blurry. We know that DiffusionDB added blur filters for images that were detected to show inappropriate content. Interestingly, CLIP seems to create similar latent embeddings for blurry items, causing them to show high similarity in the similarity heatmap.
For further analyses, we choose "Cluster matrix by similarity" to order the matrix in a way that groups similar rows in the heatmap and investigate the clusters that are emerging. We can see a cluster for "impressionism and crystal" that seems to have homogeneous similarities over all images (i.e., there is a distinct purple line along all images of this cluster). Upon further investigation, we see that the captions in this cluster are mostly vague texts or single words (e.g., "crystal", "impressionism") that can apply to a lot of images. The same cluster becomes apparent in the text in-modal similarity heatmap, where all captions within the cluster seem to have high similarity.
Let’s close the modality gap to investigate clusters in a 2-dimensional scatter plot. We can either do this by switching to the CLOOB model or using CyCLIP in combination with the "Close modality gap" option. We see that the embeddings are aligned and can use the interactive scatter plot to investigate clusters. For example, we can try to find the cluster of blurry images, or we can try to find the cluster with instances of "impressionism". Of course, this would be much more fun on a larger scale :)
So far, we have looked into models trained to map two modalities to a shared space. ImageBind is a model
trained to embed six different modalities
Going beyond the quantitative verification used by the authors of ImageBind, we are interested in how well the embedding spaces align for explicitly paired modalities compared to implicitly paired ones.
Let us first look into the explicit pairing of image-text modalities using our MSCOCO dataset. Similarly to
CLIP, ImageBind exhibits a modality gap between images and texts, which makes sense because ImageBind also
uses the InfoNCE loss
For analyzing the implicit pairing of image, text, and audio modalities, we use a subset of the
AudioSet
Here we can see that the in-modal similarities of images and texts are very pronounced (similar to how the CLIP similarities of images and texts look), which indicates a modality gap. The audio embeddings, on the other hand, have a lower overall similarity to other audio embeddings. In particular, similarities between audio embeddings and embeddings from other modalities seem more evenly distributed, which we previously saw in models that do not suffer from a modality gap. A reason for this might be that audio data has higher noise levels and more variation in its spectrogram compared to natural images. Therefore, it is harder for encoders to extract only the relevant features, leaving some noise behind in the latent representation, which, in turn, could lead to lower in-modal similarities.
When clustering the similarity matrix by text embeddings, four main clusters emerge—one for each type of animal included in the dataset—and one smaller cluster that contains a mix of cats and dogs labeled “domestic animals”.
We can also use the audio-to-audio similarities for clustering the heatmap. Interestingly, the audio embeddings seem to be very similar among all instances of cats and dogs. These instances are clustered under the label “domestic animals | pets”. When listening to the audio samples in this cluster, it becomes apparent that most instances contain human chatter.
Moving away from natural images and audio, we now analyze CLOOME—a model that was trained on microscopy
images and molecular structures
When looking at the similarity heatmaps of each model, we can observe that the molecule-to-molecule distances seem to be the most pronounced, but overall, the similarities are much more evenly distributed over the entire heatmap. When looking at the modality distances of both models, we can see that the CLOOB-based model has a slightly lower distance than the CLIP-based model. However, the difference is much smaller than what we observed before with natural images and texts. We can also observe this in the scatter plots: the modality pairs are not as cleanly separated as they were for CLIP on natural images. This could be caused by the fact that CLIP was trained with a considerably larger dataset than CLOOME. Hence, if a molecule and its corresponding image are similar enough to a pair of samples in the training set, it would be easier for the model to have learned to encode them with high similarity. A different explanation could be the quality of the dataset. If modality pairs are matched precisely, as is the case for molecule-microscopy pairs, the modality gap might be lower than for datasets that contain a lot of mismatched pairs (e.g., image-text pairs that are automatically retrieved from the internet). Another reason could be the different encoders used (transformer vs. feed-forward) or the properties of the data used for training (i.e., microscopy images have much lower variation than natural images, and molecular structures have very different properties compared to natural text). However, these hypotheses need to be studied further.
As previously demonstrated, we can identify patterns in datasets and subsets of datasets using the similarity heatmap visualizations. Now, we would also like to see if we can use the same techniques to find patterns in augmentations of a single data point. For example, we take a single image, generate rotated versions of this image, and use this augmented dataset to compute CLIP embeddings and similarities. The results of this experiment for the three CLIP-like models are shown in the visualization below (note that we again show two variants of the CLOOB model). To generate this dataset, we gradually rotate a selected image by 360 degrees over the course of 100 steps. Each step results in a “new” image and a new data point. Note that we only augment the image, but not the text, which results in a completely homogeneous similarity in the in-modal text quadrant of the heatmap and homogeneous stripes along the text dimension in the cross-modal quadrants of the heatmap.
When looking at the in-modal image similarity quadrant of the heatmap for each model using augmentations of
the first image, we can see an interesting pattern emerge. In addition to the bright yellow diagonal axis
that corresponds to the similarities of images to themselves, there is also the perpendicular off-diagonal
axis of the matrix sticking out. When hovering along the off-diagonal, we see that the two images along this
axis are actually mirrored versions of each other. It seems like all models are invariant to the horizontal
flip transformation for this image. We can also see a checkerboard-like pattern emerge for some images
in all models except for CLOOB_LAION400M. When looking into the darker areas of this heatmap in
more detail, we can see that the pattern occurs around multiples of 90-degree rotations. The fact that this
pattern occurs mainly for the three models that use a CNN-based image encoder
When looking at the overall distribution of similarity values, we also notice that the ViT-based CLOOB model seems to have more patches of low-similarity values compared to its CNN-based counterparts. This might indicate that ViT’s overall robustness to rotation transformations is lower. In further investigations, we might want to directly compare two versions of CLIP: the current version with the CNN-based image encoder and a version with a ViT-based image encoder, and study the phenomenon on a larger dataset.
Use the interactions to explore the heatmaps for different images yourself.
In a second experiment, we analyze the heatmaps for an image to which we add an increasingly higher noise level. When looking at the heatmaps for the first image, it seems like there is a certain level of noise for each model, after which the model cannot seem to recognize the content of the image anymore. All images with a higher level of noise than this threshold seem to look (almost) the same (as indicated by a bright yellow rectangle at the lower-right corner of the in-modal image similarity quadrant).
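A noise-augmented dataset of this kind could be generated as in the sketch below. The number of steps, the maximum noise level, and the use of Gaussian noise are assumptions. The `stand_in_encoder` simply flattens pixels and is not a real model; in practice, the image encoder of the model under study would be used instead.

```python
import numpy as np

def noisy_versions(image, steps=20, max_sigma=1.0, seed=0):
    """Return copies of `image` (floats in [0, 1]) with increasing Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(0.0, max_sigma, steps)
    noisy = [
        np.clip(image + sigma * rng.normal(size=image.shape), 0.0, 1.0)
        for sigma in sigmas
    ]
    return np.stack(noisy), sigmas

def stand_in_encoder(images):
    """Hypothetical placeholder for an image encoder: flatten and normalize pixels."""
    flat = images.reshape(len(images), -1)
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)

# Random stand-in image (hypothetical); the first augmented copy has zero noise.
rng = np.random.default_rng(1)
image = rng.uniform(size=(32, 32, 3))

augmented, sigmas = noisy_versions(image)
embeddings = stand_in_encoder(augmented)
image_similarities = embeddings @ embeddings.T  # in-modal image quadrant
```

With a real encoder, the block of near-identical similarities in the lower-right corner of `image_similarities` would mark the noise levels beyond which the model no longer distinguishes the images.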
Similarly, for blurry images, we see that at a certain point of blurriness, all images look the same to the models, and they cannot map images and texts together. You can use the dropdown menu to explore the effects of various augmentations.
Throughout this article, we investigated latent embeddings of CLIP-like models. Using scatter plots and similarity heatmaps, we visualized and analyzed the modality gap that naturally occurs for CLIP embeddings. Closing this gap without losing significant performance can be important for downstream tasks like image generation, visual analytics, or human understanding. We showed how to close the gap using CyCLIP in combination with a post-processing method that aligns the embedding spaces and investigated another model (CLOOB) that is able to align the spaces during training. Finally, we introduced Amumo, an interactive visual prototype that allows users to explore embeddings from multi-modal contrastive learning models to help with the understanding of their latent space embeddings. We used Amumo to analyze various (sub-)sets and augmentations of data. We demonstrated Amumo’s capabilities and generalizability using image, text, audio, and molecule data in combination with several different models. We believe that Amumo, and the similarity heatmap in particular, are useful tools for building intuition about multi-modal latent space embeddings. They allow for the comparison of multi-modal models (e.g., their robustness to transformations) and can help to formulate hypotheses or ideas about such models. However, we want to stress that the analysis is based on a small subset of data points, and insights must still be verified on a larger scale.
This work was funded by the Austrian Marshall Plan Foundation under the Marshall Plan Scholarship, the Austrian Science Fund under grant number FWF DFH 23--N, and under the Human-Interpretable Machine Learning project (funded by the State of Upper Austria). The project was conducted during a research visit at the MIT-IBM Watson AI Lab in Cambridge, MA. ASF’s research position was funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training Network - European Industrial Doctorate grant agreement No. 956832, “Advanced machine learning for Innovative Drug Discovery”. Finally, we want to give special thanks to the reviewers of the VISxAI Workshop 2023.
The data shown in this article was produced with the Amumo Python package. We provide notebooks to reproduce the results of CLIP, CyCLIP, and CLOOB, CLOOB ablation, and CLOOME. We provide the notebooks for exporting the data used in the interactive article: CLIP, CyCLIP, and CLOOB, CLOOB ablation, and CLOOME.
Christina Humer: Conceptualization, Software, Validation, Investigation, Writing - Original Draft, Visualization. Elisabeth Rumetshofer: Investigation, Resources, Writing - Original Draft. Ana Sánchez: Investigation, Resources, Writing - Original Draft. Vidya Prasad: Investigation, Writing - Review & Editing. Günter Klambauer: Writing - Review & Editing. Marc Streit: Conceptualization, Writing - Review & Editing. Hendrik Strobelt: Conceptualization, Writing - Review & Editing.
The authors declare that there are no competing interests.
Amumo - our interactive prototype - is an easy way to explore a small subset of image-text pairs. This analysis can help form intuition about a particular dataset or the model used to map them into a latent space embedding. In addition to the interactive prototype, we also want to showcase the results of the proposed methods for closing the modality gap with a larger dataset. To that end, we again used the entire 5000 sample MSCOCO validation dataset and applied the two methods introduced in the article. We then take the aligned embeddings, project them with UMAP, and plot them in a static 2-d scatter plot with lines connecting the matching image-text pairs.
For the first method to remove the modality gap, we need to compute CyCLIP embeddings and manually move the two embedding spaces together. The following scatter plot shows the 2-dimensional projection of these modified embeddings. The plot shows that clusters are forming and a lot of connection lines are within clusters, which means that instances that carry a similar meaning are indeed close together in the latent space embedding. However, there are also a lot of inter-cluster connections. The emergence of these long connections between the clusters can be caused by various factors.
For example, the manual modification of the latent space might disturb some parts of the latent space. Although CyCLIP does add restrictions to the objective function to facilitate the emergence of symmetric embedding spaces, it is only an approximation and the spaces are not perfectly symmetric. Another influencing factor might be that the image-text pairs are not perceived as similar by the model and are therefore placed in different areas of the embedding space. Finally, there may also be distortions coming from the dimensionality reduction technique we used for creating the 2-d space.
To investigate these assumptions further, we recommend the use of interactive tools that allow exploring
large sets of points and clusters in a 2-d space (e.g., the Projection Space Explorer
The following plot shows the same procedure of manually removing the modality gap, but here we used CLIP embeddings instead. This shows again that manually removing the gap by moving the CLIP embeddings together distorts the trained latent space too much for it to remain useful.
The second method of removing the modality gap utilizes a bi-modal model capable of aligning the embedding spaces: CLOOB. In this case, we can directly use the image and text embeddings generated by the model and project them to a 2-d space for visualization. We see a similar result to before: clusters emerge with a lot of intra-cluster connections, but there are also plenty of inter-cluster connections. There also seems to be a rather large cluster of points with many connections.
In comparison, we also show the results for the same model, but trained with 400M instances of the LAION dataset and a vision transformer architecture used for image embedding instead of a CNN architecture. We can see smaller clusters, and there seem to be fewer connections between clusters. To confirm these qualitative findings and be able to compare the methods, we would have to use a quantitative measure (e.g., measuring how many intra-cluster connections there are for each method).
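One possible form of such a quantitative measure is the fraction of image-text pairs whose two points fall into the same cluster. The sketch below assumes that cluster labels for the projected image and text points are already available from some clustering step; the labels and the function name are hypothetical.

```python
import numpy as np

def intra_cluster_fraction(image_labels, text_labels):
    """Fraction of image-text pairs whose two points share a cluster label."""
    image_labels = np.asarray(image_labels)
    text_labels = np.asarray(text_labels)
    if image_labels.shape != text_labels.shape:
        raise ValueError("Each image needs exactly one paired text label.")
    return float(np.mean(image_labels == text_labels))

# Hypothetical cluster assignments for five image-text pairs.
image_labels = [0, 0, 1, 1, 2]
text_labels = [0, 1, 1, 1, 0]
fraction = intra_cluster_fraction(image_labels, text_labels)
```

A higher fraction would indicate that more connection lines stay within clusters, which makes the methods comparable on the same dataset as long as the same clustering procedure is used for each.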
In the previous sections, we learned about two ways that can help us close the modality gap:
Let's also recall the reasons for why we would like to close the gap:
From a visualization point of view, we might not care about the other two reasons, as long as we can visually close the modality gap. By visually closing the gap, we do not change the embeddings or the model, but map them into a shared low-dimensional space. This can be accomplished in two ways: (i) out-of-sample projection, or (ii) concatenating image-text embeddings and treating them as one combined embedding. The first method results in a low-dimensional data point for each image and each text embedding; the second method results in a combined low-dimensional space for the image-text pairs. The method you would want to choose depends on the goal of the visualization.
We can first project embeddings of either images or texts using UMAP. This projection builds a neighborhood graph using in-modal similarities and results in a low-dimensional projection for one modality. We can utilize UMAP's out-of-sample projection to also project the embeddings of the second modality onto the space of the first modality. Since the out-of-sample projection again tries to map each point to the most similar points in the existing low-dimensional space, you can imagine this as using the cross-modal similarities between images and texts. Since this is what CLIP was trained on (i.e., optimizing distances between texts and images), the mapping should visually remove the modality gap.
As an example, we take the 5000 sample MSCOCO validation set and first fit and transform the image embeddings. We then transform the text embeddings with the existing UMAP embedding and show the results in a scatter plot. The overall structure seems to align images and texts; while there are a lot of cross-cluster connections that show that image-text pairs are not always close to each other, most of the connections seem to be within clusters. Cross-cluster connections can be indicators of various things. For example, the pairs may not be deemed similar by CLIP, which results in embeddings that are far away from each other in the high-dimensional latent space, which would be reflected in the low-dimensional projection of the embeddings. Another issue might come from the projection method itself. As mentioned previously, projecting data to a low-dimensional space comes with a loss of information that might introduce artifacts. The fact that we use out-of-sample projection might amplify this effect even further.
Visual analytics tools can be helpful in gaining further insights into the data and why certain pairs seem to be far away from each other. Basic interactions like hover information or selection summaries could already be a good start for further investigation. The rich nature of image and text data also allows for more advanced analytic visualizations; for example, texts can be used to extract labels for clusters that carry rich semantic meaning, or example images could be used to summarize clusters. Visually encoding the high-dimensional similarity of embedding pairs (e.g., as saturation of the lines between pairs) could be a helpful indicator that could show whether point pairs are far apart from each other due to artifacts from the projection or due to CLIP not recognizing them to be similar.
For a different kind of visual representation of image-text embedding spaces, we can simply concatenate the embedding vectors and transform them into a combined low-dimensional space. The low-dimensional space can be visualized in a scatter plot and visual analytics approaches can be used to explore the data. Note that with this approach, we only have one low-dimensional data point per image-text pair.
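The concatenation approach can be sketched as follows. The projection method is not fixed by the description above, so this example uses a plain PCA via singular value decomposition as a stand-in; any other dimensionality reduction technique could take its place. The random embeddings and function names are hypothetical.

```python
import numpy as np

def concatenate_pairs(image_emb, text_emb):
    """Join each image embedding with its paired text embedding into one vector."""
    return np.hstack([image_emb, text_emb])

def pca_project(x, n_components=2):
    """Project rows of x onto their first principal components (PCA via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_components].T

# Random stand-in embeddings for 200 image-text pairs (hypothetical).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 64))
text_emb = rng.normal(size=(200, 64))

combined = concatenate_pairs(image_emb, text_emb)
points_2d = pca_project(combined)
```

Each row of `points_2d` represents one image-text pair, so there are no connection lines to draw in this representation.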