September 24, 2025
by Katia Savchuk

What color is the number three? What color is anger? Or the made-up word “gricker”? In a recent experiment, researchers posed questions like these to both people and a large language model (LLM) that had never “seen” anything but massive quantities of text.
Their goal was to shed light on how humans and artificial intelligence interpret color: How much is based on language and how much relies on sensory experience? In short, do you need to visually perceive color to understand it, or is it enough to have only read about it? Answering that question — often framed as a debate between the “embodied view” and the “statistical view” — offers clues about how closely generative artificial intelligence could come to replicating the way humans think.
The findings, published in Cognitive Science, weren’t black and white. “Our results point to the need to combine these different perspectives,” says Douglas Guilbeault, an assistant professor of organizational behavior at Stanford Graduate School of Business. “Statistical inferences about color can get you surprisingly far, but it’s clear that embodied experience is also a critical part of human cognition.”
Guilbeault and his collaborators conducted a series of experiments looking at color metaphors — the associations and meanings attached to particular hues. They recruited more than 500 men with normal vision and over 150 with colorblindness. (They focused on men, who have higher rates of colorblindness, to avoid potential differences in color perception between genders.) From a list of a dozen colors, participants were asked to select ones that most and least closely matched various terms. The researchers also prompted GPT-3.5, a popular large language model developed by OpenAI, to respond to each of the same questions more than 100 times.
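The study's aggregation step — prompting the model more than 100 times per term and looking at which color dominates — amounts to tallying repeated responses into a modal choice. A minimal sketch of that tallying, with purely illustrative response values (the paper's actual prompts, color list, and sampling setup are not reproduced here):

```python
from collections import Counter

# Hypothetical responses from repeated prompts for a single term.
# These values are illustrative, not data from the study.
responses = ["blue", "blue", "green", "blue", "gray", "blue", "purple"]

def modal_color(choices):
    """Return the most frequently chosen color and its share of responses."""
    counts = Counter(choices)
    color, n = counts.most_common(1)[0]
    return color, n / len(choices)

color, share = modal_color(responses)
print(color, round(share, 2))
```

The share alongside the modal color matters because the researchers compared not just which color won, but how strongly each group converged on it.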
A handful of words, such as “grass” and “blood,” had common associations, yet many had no standard color connotation. These included emotions, such as shame and desire; academic disciplines, such as math and sociology; the numbers one through five (both spelled out and as digits); and made-up words, such as “ambous” and “smeex.”
Within all three groups, there was a striking amount of agreement on which colors are linked to abstract concepts. Around 20% to 40% of participants with normal vision and colorblindness chose the same color to represent any given term. (There was an 8% chance of this occurring randomly.) For example, the color most commonly associated with “math” was blue, and people most often chose gray for “gricker.”
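The roughly 8% random baseline follows from the dozen-color list: two choosers picking independently and uniformly at random agree with probability 1/12 ≈ 8.3%. A quick sketch confirming that figure analytically and with a small Monte Carlo check (illustrative only, not the paper's analysis):

```python
import random

NUM_COLORS = 12  # the study offered a list of a dozen colors

# Analytic baseline: two independent uniform choosers match with
# probability 1/12, or about 8.3%.
analytic = 1 / NUM_COLORS

# Monte Carlo estimate of the same quantity.
random.seed(0)
trials = 100_000
matches = sum(random.randrange(NUM_COLORS) == random.randrange(NUM_COLORS)
              for _ in range(trials))
print(round(analytic, 3), round(matches / trials, 3))
```

Against that ~8% baseline, the observed 20% to 40% agreement is several times what chance alone would produce.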
“I was surprised by such strong signs of synesthesia in regular people,” says Guilbeault, referring to the tendency to automatically connect sensory inputs like music or words to a color. He says that this finding lends support to the embodied view. “The fact that there’s this robust pattern in reasoning and cognition still needs to be explained — it’s mostly unexplored terrain.”
Color by Numbers
Even more “mind-blowing,” Guilbeault says, was that the LLM also revealed a consistent pattern of associations between words and colors, even when these weren’t evident in the data it had been trained on. The LLM associated “math” with blue more strongly than humans, but overwhelmingly linked “gricker” to green.
“It’s not obvious why GPT-3.5 — a purely statistical engine — would end up having a kind of synesthesia,” he says. The fact that it does, he notes, suggests that an embodied understanding of color is a fundamental part of cognition that’s hidden in the way humans speak and write: “It must be in the language at a really fundamental level.”
However, the LLM could only get so far in approximating human reasoning about color. Colorblind people and those with normal vision shared more associations with each other than with GPT-3.5, suggesting that even a limited sensory experience of color is key to discerning its meaning.
“LLMs should nail color metaphors if the answer is purely statistical,” says co-author Ethan Nadler, a professor of astronomy at the University of California, San Diego, who earned a doctorate in physics at Stanford. “In reality, they fall short of humans because human reasoning is grounded in perception of the world. Our results point to the limits of computational models of the mind.”
When asked for colors that were the opposite of an abstract term, the LLM’s responses often diverged wildly from those of human participants. For example, pink was the color people most frequently chose as the opposite of “math,” whereas the LLM most often selected purple. “Its answers are often nonsensical,” Guilbeault says. “LLMs don’t seem to have a clear idea of what it would mean to be the opposite color because there’s no past data that would tell it, whereas humans ace this because we actually understand what that means.”
In a second part of the experiment, the researchers asked both people and the LLM to explain the reasoning behind their color associations. This time, the human cohort included professional painters. “Humans were more likely to use embodied strategies, and that was strongest among painters,” Guilbeault says, lending further support to the importance of directly experiencing color. “If embodiment helps you understand color, it should show up more in painters, who interact with color more frequently and think a lot more about its meaning.”
Whether LLMs can get significantly better at reasoning about color and other concepts with a sensory component remains an open question. Guilbeault thinks it’s unlikely that more textual training alone would do the trick, noting that GPT-3.5 was already trained on virtually the entire internet.
It’s possible that adding images, videos, or data from sensors could improve LLMs’ performance, he says. Still, his team’s research provides some evidence that LLMs may never fully be able to approximate how humans think. “It may be that there’s only so far that you can get to reproducing the embodied aspects of human cognition from language alone,” Guilbeault says.