What are LLMs Bad At? And Why?

Large Language Models (LLMs) are being talked to death right now, but for good reason. They seem to capture a kind of conversational experience that a lot of people find useful to have at their fingertips. Yet we've all had the experience of trying to remember a song's lyrics and thinking, "Oh, I bet ChatGPT would be great for this!", only to find that it gives you largely incorrect, made-up lyrics.

You can tell it was trained on the lyrics in some fashion, and a lot of them are right! So why didn't it know them better? And why didn't it just admit that it didn't know the answer?

In this article, we'll look at the tasks that LLMs like the ones driving ChatGPT are bad at, and try to understand how LLMs work in order to see why that's the case.

What are LLMs good at doing?

LLMs have many potential pitfalls, many of which are hard to grapple with until you've run into them yourself. In contrast, there are a handful of essential things that LLMs are very good at doing. By looking at some of these, we can hopefully understand what separates the strengths from the pitfalls:

  • Context recall
  • Word definitions
  • Explaining well-established concepts

Let's look at each of these individually.

Context recall - ChatGPT is remarkably good at recalling the exact text of the growing context of a conversation, right up to the point where the conversation gets too long. Context management is its own potential blog post, but suffice it to say that there's (effectively) a fixed amount of raw, untrained text that the model can work against at a time. You and the model both contribute to this context, through your inputs and ChatGPT's responses. As you send messages, older text may get pushed out of 'working memory' and into the realm of being forgotten, and the same can happen as the LLM generates the next words in its response.

Word definitions - LLMs are generally good at recalling the "spirit" of their training data. They've seen many definitions of a given word, that word used in context, and countless definitions in general, so they know how it 'feels' to define a word with text. All of these factors combine to cover each other's faults. The LLM won't remember its training data verbatim, but it knows the shape of defining a word. So as long as it has enough semantic information about the word in the sort of use or context you want it defined, it can give you a tailor-made definition that fits your situation near-perfectly, while still being a good general definition of the word.

Explaining well-established concepts - ChatGPT in particular has been trained for pedagogical purposes - that is, to teach and explain to the widest possible audience. The model tries to give each response in a vacuum, since it doesn't know who you are or what you might already understand about the topic. If you were to ask it to "explain quantum physics", it might give you a high-level overview of how quantum physics relates to the average person's understanding of it. As you ask more and more particular questions - as long as the topic is well covered in the training data, which usually means it's well understood by society at large - the model can combine its general "explaining" ability with its trained understanding of the topic and its semantics to continuously meet you where you are in learning.

Okay, but why are they good at those things?

The easiest part to explain is the context. No matter how the model manages what's in the context, you can think of it quite simply as the text of your conversation that the model has direct access to. Contrast this with the more ephemeral nature of the source text the model was trained on. The context - your conversation's direct text - is kept intact so long as the conversation stays short enough, or sticks to simple, directly relatable subjects without many unique words.

A good LLM platform like ChatGPT does a lot with context management, but it's almost like a layer outside of the actual language model. When you send a new message, some amount of your previous context is shoved around (and parts are likely shoved out) to make room for your prompt's text. The same is true when you get a response from the LLM - as it generates text, it may shove older or more tangentially related text out of the context.
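
To make that concrete, here's a minimal sketch of the kind of trimming a context-management layer might do before each model call. Everything in it - the token budget, the whitespace "tokenizer", the stand-in model call - is an assumption for illustration, not how ChatGPT actually implements it.

```python
# Toy context-management layer that sits outside the model itself.
# The budget, "tokenizer", and model call are illustrative stand-ins.

MAX_CONTEXT_TOKENS = 4000  # hypothetical fixed context budget


def count_tokens(text: str) -> int:
    # Real systems use a proper tokenizer; whitespace splitting is a rough proxy.
    return len(text.split())


def trim_to_budget(messages: list[str], budget: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Drop the oldest messages until the remaining conversation fits the budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # the oldest text gets 'shoved out' of working memory
    return kept


def fake_model_generate(visible_context: list[str]) -> str:
    # Stand-in for the actual LLM call; just reports what it can still "see".
    total = sum(count_tokens(m) for m in visible_context)
    return f"(model sees {len(visible_context)} messages, {total} tokens)"


conversation: list[str] = []


def send(user_message: str) -> str:
    conversation.append(user_message)
    visible = trim_to_budget(conversation)  # older messages fall out of view here
    reply = fake_model_generate(visible)
    conversation.append(reply)
    return reply
```

The important part is that all of this bookkeeping happens around the model: whatever gets trimmed away is simply invisible to it on the next turn.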

We can consider the context to be the only "source text" that the model has direct access to at any given point. So the tasks the model tends to be good at are, at least partially, ones that don't require perfect recall of text that isn't already in the conversation's context. This means that asking it to recite Hamlet from "memory", or our song lyrics from earlier, is among the tasks it's going to be worst at. Despite being trained on these pieces - likely in many different forms and contexts for something as ubiquitous as Hamlet - the model's training process doesn't keep the definite and true source material.

So what does it do?

Training a model to recognize the "shape" and "spirit" of text

When we talk about training a large language model, depending on your familiarity with machine learning and language semantics, you may already have some intuition for what we mean by shape and spirit in this context.

If you're intimidated by the thought that you'll have to learn how the model achieves this, rest assured - it's not that important to know. However machine learning manages to recognize the "shape" of any given pattern - language use, handwriting recognition, chess board analysis - it comes down to some flavor of pattern recognition whose details don't matter here. When we say the "shape of a definition", as we did above, you sort of intuitively know what I mean.

💡
Shape of a definition - The fuzzy pattern evoked when the general sensation of imagining a typical definition is approached.

It kind of doesn't even matter if that's a good definition (which it isn't), because it itself clearly embodies what we all agree is the "shape" of a definition. Our inability to define that is tightly linked to the reasons that language works the way it does at all. We may not always be able to define something, but if we know someone else is already familiar with what we mean, then as long as we can successfully evoke that understanding with our words, those words did their job.

💡
Spirit of the topic - The abstract 'feeling' of what it means for a concept to be that concept, in the context of its use, as it embodies the typical understanding of that concept.

So, that's a pretty opaque way to phrase it, but you hopefully get the gist.

And, honestly, that's precisely what we mean by the "spirit" of a topic: the "gist". You could probably explain it in your own terms to demonstrate understanding, and the terms you use could reveal a misunderstanding to whoever gave you the original definition. This quasi-form of active listening highlights how our understanding of a definition is tied to our existing understanding of the term(s) in context.

We're able to make definitions that match our own understanding of a word, but we have a very hard time delivering a definition or explanation of concepts or sensations that aren't already well captured by the terms we have for them. It's the je ne sais quoi of explaining something you understand.

Why do I bring any of this up? Because with enough exposure to various perspectives and ways of discussing interconnected terms and topics, you can begin to see the terms and vocabulary being used in terms of each other. This is how Large Language Models achieve their capabilities: by analysing the implied semantic relationships between the vocabulary words that humans throw around in typical language use - and then mimicking the way those words get used.
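
To get a crude feel for what "seeing words in terms of each other" means, here's a toy sketch - entirely my own illustration, not anything a real LLM does literally - that builds co-occurrence counts from a tiny corpus and compares words by the company they keep.

```python
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "a king rules the kingdom",
    "a queen rules the kingdom",
]

# Count how often each word appears near each other word (window of 2).
cooc: dict[str, Counter] = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if i != j:
                cooc[word][words[j]] += 1

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two words' co-occurrence profiles."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Words used in similar company end up with similar profiles.
print(cosine(cooc["cat"], cooc["dog"]))      # relatively high
print(cosine(cooc["king"], cooc["queen"]))   # relatively high
print(cosine(cooc["cat"], cooc["king"]))     # noticeably lower
```

Real models learn far richer representations than raw counts, but the intuition is the same: words that show up in similar contexts end up represented similarly, and that web of relationships is what the model leans on when it generates text.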

How does only capturing the "spirit" and "shape" of language use limit LLMs?

The answer to this is, by and large, the answer to the original question of the article: What are LLMs bad at? And why?

We can see that if you train a large language model on song lyrics, even many times over, even with nearly perfect analytical interpretations of those lyrics provided as well, it's simply too much text to 'remember' that way given only the shape and spirit of the text. LLMs don't 'memorize' text in any literal sense. Any appearance of perfect recall of training data is really just a sign of particularly close-knit use of that language - those vocabulary words in that exact sequence, evoked by the context of your prompt.

Consider a famous quote you've encountered multiple times. ChatGPT doesn't 'remember' this quote in the traditional sense. Instead, it's the quote's short, crisp nature and its frequent appearances throughout the training data that increase the likelihood of the model generating a similar phrase. This is because the model's understanding of language, derived from diverse and extensive training data, tends to converge on a reliable yet flexible representation of common, concise expressions.

Now, think of a different situation where the quote is lengthy or less common in the training data. Here, the model is likely to generate a paraphrased version rather than an exact match. This is because these rarer or more complex expressions are represented in a more diffuse manner across the model's language understanding, which leads to less precise recall.
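
You can poke at this effect directly with a small open model. The sketch below uses GPT-2 through the Hugging Face transformers library - an assumption made purely for illustration, since GPT-2 is not the model behind ChatGPT - to compare how concentrated the next-token probabilities are after a famous phrase versus an obscure, made-up one.

```python
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def top_next_tokens(prompt: str, k: int = 5) -> list[tuple[str, float]]:
    """Return the model's k most likely next tokens and their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(idx)), round(p.item(), 3))
            for idx, p in zip(top.indices, top.values)]


# A short, endlessly repeated phrase: one continuation tends to dominate.
print(top_next_tokens("To be, or not to be, that is the"))

# An obscure, made-up lead-in: probability is spread across many plausible words.
print(top_next_tokens("The third line of my cousin's favorite song is"))
```

For the Hamlet prompt, a single token usually carries most of the probability mass; for the made-up prompt, no single continuation stands out - the same contrast as exact recall versus loose paraphrase.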

A grey area would be a line from an obscure book or a lesser-known song lyric - phrases that are present in the training data but not especially common. The model might generate a near-perfect version, but miss or misinterpret a word or two.

The crucial takeaway here is that ChatGPT doesn't have different 'modes' of remembering. It doesn't 'store' quotes or sentences as is. Instead, its capacity to produce similar phrases or quotes is largely a function of the patterns it discerns from its training data, which depends on how often and how concisely these phrases or quotes were represented.

In conclusion...

If you've ever found yourself asking why ChatGPT doesn't seem to remember that quote from a movie you watched, or why it can't provide the exact line from a book you once read, now you know. It's not that ChatGPT is forgetful - it simply doesn't have the ability to recall verbatim. It's not holding onto that original data in a way we would consider memory.

Instead, it's taking the context you've given it and using its training to generate the closest possible match. It's akin to a skilled improviser, not a trained parrot. It's creating something new each time based on patterns it's seen before. And though it won't provide the perfect recollection you might want, it's always ready to offer a fresh take, a new perspective, in response to your input. But even the best improviser would need to have at least seen the scene from your favorite movie you want them to recite.

This won't answer all of your questions, or maybe any of them. But I believe it engages enough of our preconceptions about AI and Language Models to help foster a growing intuition for why certain prompts don't work as well as you'd hoped.

In the words of Shakespeare - "Help, I'm a monkey chained to a typewriter, please come save us!"