The use of CLIP embeddings to assess the fidelity of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures how well a generated image aligns with its prompt, it does not quantify the diversity of images generated by a text-to-image model.
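As a point of reference, below is a minimal sketch of computing CLIPScore with Hugging Face's `transformers` CLIP implementation. The checkpoint name, image path, and prompt are placeholders, and note that the literature-standard CLIPScore additionally rescales the cosine similarity by 100 and clips it at zero:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; any CLIP variant works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sample.png")  # placeholder path
prompt = "a photo of a corgi on a beach"  # placeholder prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# CLIPScore is (a rescaling of) the cosine similarity of the two embeddings.
score = torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
```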
In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, i.e., their ability to generate diverse images in response to similar text prompts, which we refer to as prompt-aware diversity. To this end, we propose decomposing the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. We perform this decomposition using the Schur complement of the joint image-text kernel covariance matrix, and we define the matrix-based entropy of the decomposed component as the Schur Complement ENtropy DIversity (Scendi) score, a measure of the prompt-aware diversity of prompt-guided generative models.
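To make the construction concrete, here is a minimal NumPy sketch of a Scendi-style computation under simplifying assumptions of ours: a linear (cosine) kernel on L2-normalized CLIP embeddings, a small ridge term for numerical invertibility, and the order-1 (von Neumann) matrix-based entropy. The paper's exact kernel choice and entropy order may differ:

```python
import numpy as np

def matrix_entropy(cov, eps=1e-12):
    """Order-1 (von Neumann) matrix-based entropy of a PSD matrix,
    computed from its eigenvalues normalized to a probability vector."""
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against numerical negatives
    p = eigvals / max(eigvals.sum(), eps)
    p = p[p > eps]
    return float(-np.sum(p * np.log(p)))

def scendi_score(img_emb, txt_emb, ridge=1e-6):
    """Sketch of a Scendi-style score with a linear kernel.

    img_emb, txt_emb: (n, d) arrays of L2-normalized CLIP embeddings
    for n generated images and their corresponding prompts.
    """
    n = img_emb.shape[0]
    # Blocks of the joint image-text kernel covariance matrix.
    c_ii = img_emb.T @ img_emb / n
    c_tt = txt_emb.T @ txt_emb / n
    c_it = img_emb.T @ txt_emb / n
    # Schur complement of the text block: the image covariance with
    # the text-explained (prompt-induced) component removed.
    c_tt_inv = np.linalg.inv(c_tt + ridge * np.eye(c_tt.shape[0]))
    schur = c_ii - c_it @ c_tt_inv @ c_it.T
    # Matrix-based entropy of the non-text component.
    return matrix_entropy(schur)
```

Exponentiating the returned entropy would give an effective-rank-style "number of distinct modes" interpretation, analogous to Vendi-type diversity scores.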
Additionally, we discuss the application of the Schur complement-based decomposition to nullify the influence of a given prompt on the CLIP embedding of an image, enabling the embedded vectors to focus on, or defocus from, specific objects. We present several numerical results applying the proposed Scendi score to evaluate text-to-image and text-to-text (LLM) models. These results demonstrate that the Scendi score successfully captures the intrinsic diversity of prompt-guided generative models.
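Under the same linear-kernel assumption as above, the prompt-cancellation step can be sketched as the least-squares residual of image embeddings regressed on text embeddings (the function name and ridge term are ours, not the paper's):

```python
import numpy as np

def cancel_text_component(img_emb, txt_emb, ridge=1e-6):
    """Sketch: remove the text-explained part of each image embedding
    by regressing image embeddings on text embeddings (linear kernel).

    img_emb: (n, d_i) image embeddings; txt_emb: (n, d_t) text embeddings.
    Returns (n, d_i) residual embeddings with the prompt influence nullified.
    """
    n = img_emb.shape[0]
    c_tt = txt_emb.T @ txt_emb / n
    c_it = img_emb.T @ txt_emb / n
    # Linear map predicting the image embedding from the text embedding.
    proj = c_it @ np.linalg.inv(c_tt + ridge * np.eye(c_tt.shape[0]))
    return img_emb - txt_emb @ proj.T
```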
To cite this work, please use the following BibTeX entry:
```bibtex
@inproceedings{ospanov2025scendi,
  title     = {Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings},
  author    = {Azim Ospanov and Mohammad Jalali and Farzan Farnia},
  booktitle = {International Conference on Computer Vision},
  year      = {2025}
}
```