Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings

The Chinese University of Hong Kong, Department of Computer Science & Engineering
Accepted at ICCV 2025 (Highlight)

Scendi captures the diversity stemming from the generative model itself (left) or from the specified prompts (right)

Abstract

The use of CLIP embeddings to assess the fidelity of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the alignment of a generated image with its prompt, it does not quantify the diversity of images generated by a text-to-image model.

In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, that is, their ability to generate diverse images from similar text prompts, which we refer to as prompt-aware diversity. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. We perform this decomposition using the Schur complement of the joint image-text kernel covariance matrix and define the matrix-based entropy of the decomposed component as the Schur Complement ENtropy DIversity (Scendi) score, a measure of prompt-aware diversity for prompt-guided generative models.

Additionally, we discuss the application of the Schur complement-based decomposition to nullify the influence of a given prompt on the CLIP embedding of an image, enabling focus or defocus of the embedded vectors on specific objects. We present several numerical results that apply our proposed Scendi score to evaluate text-to-image and LLM (text-to-text) models. Our numerical results indicate the success of the Scendi score in capturing the intrinsic diversity of prompt-guided generative models.
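The decomposition described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation. It assumes a linear kernel (so the kernel covariance reduces to the ordinary second-moment matrix of the embeddings), a small ridge term `eps` for invertibility, and the von Neumann entropy of the trace-normalized Schur complement. The function names and `eps` are hypothetical, and the embeddings are assumed to be precomputed CLIP vectors.

```python
import numpy as np

def schur_complement(img_emb, txt_emb, eps=1e-6):
    """Image second-moment matrix with the text-explained part removed.

    img_emb: (n, d_x) image embeddings, txt_emb: (n, d_t) paired text embeddings.
    Assumption: linear kernel; eps is a hypothetical ridge regularizer.
    """
    n, d_t = txt_emb.shape
    c_xx = img_emb.T @ img_emb / n          # image block
    c_xt = img_emb.T @ txt_emb / n          # image-text cross block
    c_tt = txt_emb.T @ txt_emb / n          # text block
    explained = c_xt @ np.linalg.solve(c_tt + eps * np.eye(d_t), c_xt.T)
    s = c_xx - explained                    # Schur complement of the text block
    return (s + s.T) / 2                    # symmetrize against round-off

def matrix_entropy(psd):
    """Von Neumann entropy of a PSD matrix after normalizing its trace to 1."""
    eig = np.clip(np.linalg.eigvalsh(psd), 0.0, None)
    total = eig.sum()
    if total <= 0.0:
        return 0.0
    p = eig / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

Calling `matrix_entropy(schur_complement(images, texts))` on paired embeddings gives an entropy that grows with the number of directions of image variation not linearly explained by the text. The paper's exact kernel and normalization may differ from this sketch.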

Overview of Scendi: Decomposing CLIP Embeddings

Scendi applies a Schur complement-based decomposition to kernel covariance matrices to remove the directions specified by a prompt. The example shows the correlation of the input image with ImageNet samples after a concept is removed: for "a guitar with cabbage" minus "guitar", the result correlates most strongly with "cabbage" images.
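At the level of individual embeddings, removing a prompt direction can be sketched as subtracting the part of the image embedding that is linearly predictable from the text embedding. This is a hedged illustration under a linear-kernel assumption, not the authors' code; `fit_text_to_image_map`, `remove_prompt_direction`, and `eps` are hypothetical names. As `eps` goes to zero, the second-moment matrix of these residuals on the fitting set equals the Schur complement of the text block.

```python
import numpy as np

def fit_text_to_image_map(img_emb, txt_emb, eps=1e-6):
    """Ridge-regularized linear map B such that img_emb is approximated by txt_emb @ B.

    B = (C_tt + eps * I)^{-1} C_tx, estimated from paired reference embeddings.
    """
    n, d_t = txt_emb.shape
    c_tt = txt_emb.T @ txt_emb / n
    c_tx = txt_emb.T @ img_emb / n
    return np.linalg.solve(c_tt + eps * np.eye(d_t), c_tx)

def remove_prompt_direction(img_vec, txt_vec, text_to_image):
    """Subtract the text-predictable component from one or many image embeddings."""
    return img_vec - txt_vec @ text_to_image
```

In the guitar example, `img_vec` would be the CLIP image embedding of the picture and `txt_vec` the CLIP text embedding of "guitar". Obtaining those embeddings is outside the scope of this sketch.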

Scendi Score with Fixed Prompts or Generated Image Distributions

Comparing the Scendi Score with Other Metrics for Prompt-Based Diversity Evaluation

Scendi distinguishes between two sources of diversity: the underlying prompts and the generative model itself. When a specific breed is mentioned in the prompts, the score attributes variations in cat breeds to the prompts. However, if the prompts only specify 'cats', then increased diversity in breeds is attributed to the generative model.
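A toy experiment can illustrate this attribution. The sketch below uses synthetic stand-ins rather than real CLIP embeddings: each "image" is a one-hot breed code plus a random pose coordinate, and the prompts either name the breed or are all identical. It assumes a linear kernel and a hypothetical `residual_entropy` helper, so it shows the qualitative behaviour only and does not reproduce the paper's score.

```python
import numpy as np

def residual_entropy(img_emb, txt_emb, eps=1e-6):
    """Entropy of the trace-normalized Schur complement (linear-kernel assumption)."""
    n, d_t = txt_emb.shape
    c_xx = img_emb.T @ img_emb / n
    c_xt = img_emb.T @ txt_emb / n
    c_tt = txt_emb.T @ txt_emb / n
    s = c_xx - c_xt @ np.linalg.solve(c_tt + eps * np.eye(d_t), c_xt.T)
    eig = np.clip(np.linalg.eigvalsh((s + s.T) / 2), 0.0, None)
    total = eig.sum()
    if total <= 0.0:
        return 0.0
    p = eig[eig > 0] / total
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
n, k = 500, 5
breed = np.arange(n) % k                      # balanced synthetic breed labels
breed_onehot = np.eye(k)[breed]
pose = rng.normal(size=(n, 1))                # variation not tied to any prompt
images = np.hstack([breed_onehot, pose])      # toy "image embeddings"

prompts_with_breed = breed_onehot             # each prompt names the breed
prompts_generic = np.ones((n, 1))             # every prompt is just "a cat"

h_specific = residual_entropy(images, prompts_with_breed)
h_generic = residual_entropy(images, prompts_generic)
print(f"breed in prompt: {h_specific:.3f}, generic prompt: {h_generic:.3f}")
```

When the prompts name the breed, breed variation is explained by the text and only the pose direction remains, so the residual entropy is close to zero. With generic prompts, breed variation stays in the residual and the entropy is larger.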

Effect of CLIP Direction Removal on K-PCA Clusters

Before decomposition, K-PCA clusters exhibit distinct groupings for animals and fruits. After decomposition, the clusters primarily reflect the remaining direction. For example, after removing the 'fruit' component, the clusters focus on animal species, and vice versa.
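The effect can be reproduced on synthetic data. The sketch below assumes that K-PCA denotes kernel PCA with a Gaussian kernel, and it uses two-dimensional toy embeddings with an "animal" axis and a stronger "fruit" axis in place of CLIP vectors. The helper names, `sigma`, and `eps` are hypothetical, and the removal step is the linear residualization used as a stand-in for the paper's decomposition.

```python
import numpy as np

def residualize(img_emb, txt_emb, eps=1e-6):
    """Remove the component of img_emb that is linearly predictable from txt_emb."""
    n, d_t = txt_emb.shape
    c_tt = txt_emb.T @ txt_emb / n
    c_tx = txt_emb.T @ img_emb / n
    return img_emb - txt_emb @ np.linalg.solve(c_tt + eps * np.eye(d_t), c_tx)

def top_kernel_pc(points, sigma=1.0):
    """Projection of each point onto the leading Gaussian-kernel principal component."""
    sq = np.sum(points**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * points @ points.T, 0.0)
    k = np.exp(-d2 / (2 * sigma**2))
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    vals, vecs = np.linalg.eigh(h @ k @ h)    # eigenvalues in ascending order
    return vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))

rng = np.random.default_rng(0)
animal = np.tile([-1.0, 1.0], 100)            # e.g. cat vs. dog
fruit = np.repeat([-1.0, 1.0], 100)           # e.g. apple vs. banana
emb = np.column_stack([animal, 2.0 * fruit]) + 0.1 * rng.normal(size=(200, 2))
fruit_text = fruit[:, None]                   # toy text embedding of the fruit concept

before = top_kernel_pc(emb)
after = top_kernel_pc(residualize(emb, fruit_text))
print("before, |corr| with fruit: ", abs(np.corrcoef(before, fruit)[0, 1]))
print("after,  |corr| with animal:", abs(np.corrcoef(after, animal)[0, 1]))
```

In this toy setup the leading component separates the fruit labels before removal and the animal labels afterwards, which mirrors the cluster behaviour described above.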

Scendi and Typographic Attacks

Scendi-based CLIP decomposition can reduce susceptibility to typographic attacks.

BibTeX

To cite this work, please use the following BibTeX entry:

@inproceedings{ospanov2025scendi,
      title = {Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings},
      author = {Azim Ospanov and Mohammad Jalali and Farzan Farnia},
      booktitle = {International Conference on Computer Vision},
      year = {2025}
}