ComCa - Marco Garosi

Compositional Caching for Training-free Open-vocabulary Attribute Detection
arXiv Preprint

Vision-Language Models, Zero-shot, Training-free, Attribute detection
2025
Paper (coming soon)
Project Page (coming soon)
Figure: ComCa's cache construction. Given a list of attributes and objects, we compute their compatibility from a large database $D_r$ and with an LLM. The scores are merged and normalized to obtain the compatibility distribution, from which we sample cache entries and construct the cache. We enrich the cache with soft labels from the VLM-based similarity between cache images and attributes.
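
As a rough illustration of the caching idea summarized in the caption above (not the paper's exact implementation), the Python sketch below merges database- and LLM-derived compatibility scores, normalizes them into a distribution, and samples attribute-object pairs for the cache. The function names, array shapes, and the mixing weight alpha are illustrative assumptions.

import numpy as np

def compatibility_distribution(db_scores, llm_scores, alpha=0.5):
    """Merge database- and LLM-based attribute-object compatibility scores
    (both shaped [num_attributes, num_objects]) and normalize each row into
    a probability distribution over objects. `alpha` is an assumed mixing weight."""
    merged = alpha * db_scores + (1.0 - alpha) * llm_scores
    return merged / merged.sum(axis=1, keepdims=True)

def sample_cache_pairs(compat, attributes, objects, per_attribute=4, seed=0):
    """Sample (attribute, object) pairs proportionally to the compatibility
    distribution; such pairs would then be used to retrieve cache images."""
    rng = np.random.default_rng(seed)
    pairs = []
    for attribute, row in zip(attributes, compat):
        picks = rng.choice(len(objects), size=per_attribute, p=row)
        pairs.extend((attribute, objects[i]) for i in picks)
    return pairs

# Toy usage with made-up scores for two attributes and three objects.
attributes = ["red", "wooden"]
objects = ["car", "chair", "banana"]
db = np.array([[0.6, 0.2, 0.8], [0.1, 0.9, 0.05]])
llm = np.array([[0.7, 0.3, 0.6], [0.2, 0.8, 0.1]])
compat = compatibility_distribution(db, llm)
print(sample_cache_pairs(compat, attributes, objects))
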
ABSTRACT

Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes, which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when annotators are not carefully instructed. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. These soft labels are aggregated at inference time based on the similarity between the input image and the cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic and compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines and competes with recent training-based methods, showing that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.
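
The inference-time refinement mentioned in the abstract can be pictured with a minimal, generic sketch of cache-based refinement (in the spirit of training-free cache adapters; not necessarily ComCa's exact formulation). It assumes precomputed, L2-normalized VLM features; the function name and the beta and cache_weight hyperparameters are assumptions.

import torch
import torch.nn.functional as F

def refine_with_cache(image_feat, cache_feats, cache_soft_labels,
                      zero_shot_logits, beta=5.0, cache_weight=1.0):
    """Aggregate the cache's soft attribute labels, weighted by the similarity
    between the query image and each cache image, and add the result to the
    VLM's zero-shot attribute logits.
    image_feat: (d,), cache_feats: (n, d), cache_soft_labels: (n, A),
    zero_shot_logits: (A,). Image features are assumed L2-normalized."""
    sims = cache_feats @ image_feat                  # cosine similarities, shape (n,)
    affinities = torch.exp(-beta * (1.0 - sims))     # larger weights for closer cache images
    cache_logits = affinities @ cache_soft_labels    # aggregated soft labels, shape (A,)
    return zero_shot_logits + cache_weight * cache_logits

# Toy usage with random, normalized features.
d, n, A = 8, 5, 3
image_feat = F.normalize(torch.randn(d), dim=0)
cache_feats = F.normalize(torch.randn(n, d), dim=1)
cache_soft_labels = torch.softmax(torch.randn(n, A), dim=1)
zero_shot_logits = torch.randn(A)
print(refine_with_cache(image_feat, cache_feats, cache_soft_labels, zero_shot_logits))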

In summary, our contributions are:
• We propose ComCa, an approach specifically designed for training-free open-vocabulary attribute detection. ComCa exploits the compositional nature of objects and attributes to build an auxiliary cache that is used as an anchor to refine the model's predictions.
• We make caching scalable by extracting priors from databases and LLMs about which objects are associated with each attribute, and using these priors to populate the cache. Soft-labeling further accounts for the multiplicity of attributes within an image (see the sketch after this list).
• We empirically show that ComCa outperforms all the cache-based baselines and achieves competitive performance against costly training-based methods, demonstrating the practical feasibility of training-free open-vocabulary attribute detection.
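
To make the soft-labeling point above concrete, here is a hedged sketch assuming precomputed, L2-normalized CLIP-style image and text features; the function name and the temperature value are illustrative, not taken from the paper.

import torch
import torch.nn.functional as F

def soft_label_cache(cache_image_feats, attribute_text_feats, temperature=0.01):
    """Assign each cache image a distribution over all attributes (soft labels)
    based on its VLM similarity to the attribute prompts, instead of a single
    hard label. cache_image_feats: (n, d), attribute_text_feats: (A, d),
    both assumed L2-normalized."""
    sims = cache_image_feats @ attribute_text_feats.T   # cosine similarities, shape (n, A)
    return F.softmax(sims / temperature, dim=-1)         # soft labels, each row sums to 1

# Toy usage with random, normalized features.
n, A, d = 4, 6, 8
imgs = F.normalize(torch.randn(n, d), dim=1)
texts = F.normalize(torch.randn(A, d), dim=1)
print(soft_label_cache(imgs, texts))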

We provide additional details and ablations in the supplementary material. You can find them at the end of the arXiv paper.

• Marco Garosi (DISI, University of Trento)
• Alessandro Conti (DISI, University of Trento)
• Gaowen Liu (Cisco Research)
• Elisa Ricci (DISI, University of Trento)
• Massimiliano Mancini (DISI, University of Trento)

You can cite this work as:


@inproceedings{garosi2025comca,
  author    = {Garosi, Marco and Conti, Alessandro and Liu, Gaowen and Ricci, Elisa and Mancini, Massimiliano},
  title     = {Compositional Caching for Training-free Open-Vocabulary Attribute Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}