Researchers at MIT CSAIL found that large language models (LLMs) trained only on text data develop a meaningful understanding of visual concepts. By prompting LLMs to write code that renders images, the researchers collected a dataset of simple digital illustrations. Notably, the LLMs could iteratively improve these illustrations when prompted to do so, demonstrating visual knowledge acquired purely from textual descriptions.
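
The generate-render-refine loop described above might look roughly like the sketch below. This is a minimal illustration, not the researchers' actual pipeline: the OpenAI chat API as the LLM backend, matplotlib as the rendering target, the model name, the prompts, and the example concept are all assumptions.

```python
"""Sketch of prompting an LLM for drawing code, rendering it, and asking for refinements.
Backend, model name, prompts, and the example concept are assumptions, not from the article."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def query_llm(prompt: str) -> str:
    """Send one prompt to the LLM and return its text reply (model choice is an assumption)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def strip_fences(text: str) -> str:
    """Drop any markdown code-fence lines the model wraps around its answer."""
    return "\n".join(l for l in text.splitlines() if not l.lstrip().startswith("`"))


def render(code: str, out_path: str) -> None:
    """Execute the generated matplotlib code off-screen and save whatever it draws."""
    import matplotlib
    matplotlib.use("Agg")                  # headless rendering, no display needed
    import matplotlib.pyplot as plt
    exec(code, {"plt": plt})               # caution: runs untrusted generated code
    plt.savefig(out_path)
    plt.close("all")


concept = "a red bicycle leaning against a tree"   # hypothetical example concept

code = strip_fences(query_llm(
    f"Write Python matplotlib code, and nothing else, that draws {concept}."
))
render(code, "draft_0.png")

# Iterative self-improvement: show the model its own code and ask for a better drawing.
for step in range(1, 4):
    code = strip_fences(query_llm(
        f"This matplotlib code is meant to draw {concept}:\n\n{code}\n\n"
        "Improve it so the drawing is more recognizable. Return only Python code."
    ))
    render(code, f"draft_{step}.png")
```

Each pass feeds the previous program back to the model as text, so the "improvement" signal never involves the model seeing an actual image.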

Using this LLM-generated dataset for training, the MIT team built a computer vision system that recognizes objects in real photos despite never being trained on photographs. Their approach outperformed methods trained on procedurally generated images. However, the LLMs sometimes failed to recognize human re-creations of the very images they could generate, revealing inconsistencies in how their visual knowledge is represented.
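
For intuition, here is a rough sketch of training a recognizer only on the rendered illustrations and then evaluating it on real photographs. The article does not describe the team's training recipe, so this assumes a plain supervised classifier trained from scratch; the directory names, ResNet-18 backbone, and hyperparameters are all hypothetical.

```python
"""Sketch: train on LLM-rendered drawings only, evaluate on real photos.
Paths, model, and hyperparameters are assumptions; the article gives no training details."""
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# "llm_renders/<class>/*.png" holds the LLM-generated illustrations;
# "real_photos/<class>/*.jpg" holds held-out real photographs (hypothetical layout).
train_set = datasets.ImageFolder("llm_renders", transform=tfm)
test_set = datasets.ImageFolder("real_photos", transform=tfm)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)

model = models.resnet18(weights=None)                      # no photo pretraining at all
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                                    # epoch count is arbitrary
    model.train()
    for x, y in train_loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Evaluate on real photographs the model never saw during training.
model.eval()
correct = total = 0
with torch.no_grad():
    for x, y in test_loader:
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
print(f"real-photo accuracy: {correct / total:.2%}")
```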


Understanding the visual knowledge of language models
LLMs trained primarily on text can generate complex visual concepts through code with self-correction. Researchers used these illustrations to train an image-free computer vision system to recognize real photos.
https://news.mit.edu/2024/understanding-visual-knowledge-language-models-0617