While Scalable Vector Graphics (SVG) code can be viewed either as plain text or, once rendered, as an image, it is a structured representation that encodes geometric and layout information. However, existing methods typically convert SVGs into raster images, discarding these structural details. Similarly, previous sentence embedding methods generate high-quality text embeddings but do not extend to structured or visual modalities such as SVGs. To address these challenges, we propose the first training-free multimodal embedding method that uses a Multimodal Large Language Model (MLLM) to project text, images, and SVG code into an aligned space. Our method consists of two main components: (1) multimodal Explicit One-word Limitation (mEOL), which produces compact, semantically grounded embeddings across modalities without training; and (2) a semantic SVG module that rewrites SVG code by generating missing or non-descriptive components through visual reasoning. This lets the model embed structural signals overlooked in prior work. Our approach not only introduces the first SVG retrieval setting but also achieves strong empirical performance, surpassing prior methods, including training-based models, by up to +20.5% Recall@1 on a repurposed VGBench dataset. These results demonstrate that structural cues can significantly enhance semantic alignment in multimodal embeddings, enabling effective retrieval without any fine-tuning.
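To make the mEOL idea concrete, below is a minimal sketch of how a one-word-constrained prompt to an off-the-shelf MLLM can yield a compact embedding for text or SVG code. This is not the authors' implementation: the backbone (`llava-hf/llava-1.5-7b-hf`), the prompt wording, and the last-token pooling choice are illustrative assumptions.

```python
# Hedged sketch of mEOL-style embedding extraction (illustrative, not the paper's exact code).
# Assumptions: a LLaVA-style MLLM from HuggingFace transformers; prompt text and pooling
# are placeholders. Image inputs would additionally need the model's image placeholder token.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # hypothetical choice of backbone
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def eol_embed(prompt_text, image=None):
    """Return the final-token hidden state as a compact embedding.

    The "in one word" constraint pushes the model to summarize the input,
    so the representation at the last token position serves as the embedding.
    """
    inputs = processor(text=prompt_text, images=image, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    emb = outputs.hidden_states[-1][:, -1, :]  # last layer, last token -> (1, hidden_dim)
    return torch.nn.functional.normalize(emb.float(), dim=-1)

# Text and SVG code are embedded with text-only prompts; retrieval ranks by cosine similarity.
svg_code = '<svg viewBox="0 0 10 10"><circle cx="5" cy="5" r="4"/></svg>'
svg_emb = eol_embed(f'This SVG code: "{svg_code}" means in one word:')
txt_emb = eol_embed('This sentence: "a red circle icon" means in one word:')
similarity = (svg_emb @ txt_emb.T).item()
```

Because all modalities pass through the same frozen MLLM and the same one-word prompt template, the resulting embeddings land in a shared space without any fine-tuning, which is what enables the training-free retrieval setting described above.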
@inproceedings{kim2026meol,
  title     = {Training-free Multimodal Embedding for Structure-Aware Retrieval of Scalable Vector Graphics and Images},
  author    = {Kim, Kyeong Seon and Baek, Seong-Eun and Lee, Jung-Mok and Oh, Tae-Hyun},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}