LessWrong (30+ Karma)

“Show, not tell: GPT-4o is more opinionated in images than in text” by Daniel Tan, eggsyntax

Apr 2, 2025

21:49

Epistemic status: This should be considered an interim research note. Feedback is appreciated.

Introduction

We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well.

In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o's image generation API. GPT-4o is one of the first LLMs to produce images natively rather than creating a text prompt which is sent to a separate image model, outputting images and autoregressive token sequences (ie in the same way as text).

We find that GPT-4o tends to respond in a consistent manner to similar prompts. We also find that it tends to more readily express emotions [...]

---

Outline:

(00:53) Introduction

(02:19) What we did

(03:47) Overview of results

(03:54) Models more readily express emotions / preferences in images than in text

(05:38) Quantitative results

(06:25) What might be going on here?

(08:01) Conclusions

(09:04) Acknowledgements

(09:16) Appendix