The use of CLIP embeddings to assess the fidelity of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures how well a generated image aligns with its prompt, it does not quantify the diversity of images generated by a text-to-image model.
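As a point of reference, below is a minimal sketch of computing CLIPScore with Hugging Face's `transformers` CLIP implementation. The checkpoint name, image path, and prompt are placeholders, and note that the literature-standard CLIPScore additionally rescales the cosine similarity by 100 and clips it at zero:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; any CLIP variant works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sample.png")  # placeholder path
prompt = "a photo of a corgi on a beach"  # placeholder prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# CLIPScore is (a rescaling of) the cosine similarity of the two embeddings.
score = torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
```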
In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, i.e., their ability to generate diverse images in response to similar text prompts, which we refer to as prompt-aware diversity. To this end, we propose decomposing the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. We perform this decomposition using the Schur complement of the joint image-text kernel covariance matrix, and we define the matrix-based entropy of the decomposed component as the Schur Complement ENtropy DIversity (Scendi) score, a measure of the prompt-aware diversity of prompt-guided generative models.
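To make the construction concrete, here is a minimal NumPy sketch of a Scendi-style computation under simplifying assumptions of ours: a linear (cosine) kernel on L2-normalized CLIP embeddings, a small ridge term for numerical invertibility, and the order-1 (von Neumann) matrix-based entropy. The paper's exact kernel choice and entropy order may differ:

```python
import numpy as np

def matrix_entropy(cov, eps=1e-12):
    """Order-1 (von Neumann) matrix-based entropy of a PSD matrix,
    computed from its eigenvalues normalized to a probability vector."""
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against numerical negatives
    p = eigvals / max(eigvals.sum(), eps)
    p = p[p > eps]
    return float(-np.sum(p * np.log(p)))

def scendi_score(img_emb, txt_emb, ridge=1e-6):
    """Sketch of a Scendi-style score with a linear kernel.

    img_emb, txt_emb: (n, d) arrays of L2-normalized CLIP embeddings
    for n generated images and their corresponding prompts.
    """
    n = img_emb.shape[0]
    # Blocks of the joint image-text kernel covariance matrix.
    c_ii = img_emb.T @ img_emb / n
    c_tt = txt_emb.T @ txt_emb / n
    c_it = img_emb.T @ txt_emb / n
    # Schur complement of the text block: the image covariance with
    # the text-explained (prompt-induced) component removed.
    c_tt_inv = np.linalg.inv(c_tt + ridge * np.eye(c_tt.shape[0]))
    schur = c_ii - c_it @ c_tt_inv @ c_it.T
    # Matrix-based entropy of the non-text component.
    return matrix_entropy(schur)
```

Exponentiating the returned entropy would give an effective-rank-style "number of distinct modes" interpretation, analogous to Vendi-type diversity scores.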
Additionally, we discuss the application of the Schur complement-based decomposition to nullify the influence of a given prompt on the CLIP embedding of an image, enabling the embedded vectors to focus on, or defocus from, specific objects. We present several numerical results applying the proposed Scendi score to evaluate text-to-image and text-to-text (LLM) models. These results demonstrate that the Scendi score successfully captures the intrinsic diversity of prompt-guided generative models.
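Under the same linear-kernel assumption as above, the prompt-cancellation step can be sketched as the least-squares residual of image embeddings regressed on text embeddings (the function name and ridge term are ours, not the paper's):

```python
import numpy as np

def cancel_text_component(img_emb, txt_emb, ridge=1e-6):
    """Sketch: remove the text-explained part of each image embedding
    by regressing image embeddings on text embeddings (linear kernel).

    img_emb: (n, d_i) image embeddings; txt_emb: (n, d_t) text embeddings.
    Returns (n, d_i) residual embeddings with the prompt influence nullified.
    """
    n = img_emb.shape[0]
    c_tt = txt_emb.T @ txt_emb / n
    c_it = img_emb.T @ txt_emb / n
    # Linear map predicting the image embedding from the text embedding.
    proj = c_it @ np.linalg.inv(c_tt + ridge * np.eye(c_tt.shape[0]))
    return img_emb - txt_emb @ proj.T
```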
To cite this work, please use the following BibTeX entry:
```bibtex
@inproceedings{ospanov2025scendi,
  title     = {Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings},
  author    = {Azim Ospanov and Mohammad Jalali and Farzan Farnia},
  booktitle = {International Conference on Computer Vision},
  year      = {2025}
}
```