Innovation Spotlights

Large Language Models Have Trouble Counting

Spotlight #1

Large Language Models famously have trouble accurately counting words.

Consider the following prompt:

How many words does the following sentence have (output the number and nothing else, don’t count the period):
“The cat jumped high.”

For gpt-4o-mini, this usually results in an answer of 5, which is clearly incorrect as the sentence only contains 4 words.

Note that since the OpenAI API is not fully deterministic, you might get a slightly different result.
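A minimal sketch of this single-prompt check with the OpenAI Python SDK could look like the following (the model name and prompt are taken from above; the exact reply can vary between runs):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "How many words does the following sentence have "
    "(output the number and nothing else, don't count the period):\n"
    '"The cat jumped high."'
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Often prints "5", even though the correct answer is 4.
print(response.choices[0].message.content)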

Of course, one example is not statistically significant, so let’s take a text file containing 100 sentences of different lengths and ask an LLM to count the words in these sentences.



We will use gpt-4o-mini for experiments (simply because it’s cheap but still very strong). Additionally, we set temperature to 0, top_p to 0 and use the seed parameter. This won’t make the experiment completely deterministic, but that’s as close as we can get with the OpenAI API.

We will use the following prompt:

How many words does the following sentence have (output the number and nothing else, don’t count the period):
{sentence}

If we fail to parse the resulting number, we will simply return -1.
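Put together, the counting experiment could be sketched roughly like this (the file name sentences.txt, the seed value, and the whitespace-based ground truth are my own assumptions):

from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "How many words does the following sentence have "
    "(output the number and nothing else, don't count the period):\n{sentence}"
)

def llm_word_count(sentence: str) -> int:
    # Ask gpt-4o-mini for the word count; return -1 if the reply is not a number.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
        temperature=0,
        top_p=0,
        seed=42,  # a fixed seed; this still does not guarantee full determinism
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return -1

def actual_word_count(sentence: str) -> int:
    # Ground truth: strip the final period and split on whitespace.
    return len(sentence.strip().rstrip(".").split())

with open("sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Positive differences mean the model overcounted.
differences = [llm_word_count(s) - actual_word_count(s) for s in sentences]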

If we run this on a file containing randomly generated sentences, we get the following.

Frequency of Differences in Word Counts

There are a few important things to note:

First of all, even an LLM as strong as gpt-4o-mini is really bad at counting words: more than half of the word counts are incorrect.

 

Second of all, the errors can sometimes be quite large. In a few cases, gpt-4o-mini overcounted the number of words by 5!

One example would be this sentence:

“They spent a magical evening at the outdoor theater, watching a Shakespeare play under the stars.”

Here, gpt-4o-mini returned a count of 21 instead of the correct count of 16.

 

Third, at least gpt-4o-mini seems to practically always overcount and only rarely undercount. The explanation probably has something to do with the fact that LLMs think in terms of tokens rather than words, and the number of tokens is basically always larger than the number of words.



In fact, if we instead plot the difference between the count returned by the LLM and the actual token count of the sentence, the picture gets slightly better.

Frequency of Differences in Token Counts

This should not be taken to mean that LLMs literally count tokens: the count differences are still quite large. But these two graphs certainly seem to point in the direction that LLMs “count” in terms of tokens more than they count in terms of words.
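Token counts for this comparison can be computed offline with the tiktoken library. A small sketch, assuming the o200k_base encoding used by the GPT-4o model family:

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")

sentence = (
    "They spent a magical evening at the outdoor theater, "
    "watching a Shakespeare play under the stars."
)

print(len(sentence.rstrip(".").split()))  # 16 words
print(len(encoding.encode(sentence)))     # token count, somewhat higher than the word count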

Counting Accuracy Decreases with Sentence Length

The next natural question that comes to mind is: Does counting get worse for longer sentences?

Average Count Difference by Actual Word Count

This chart, to put it mildly, is quite weird. Of course, some of this is probably a statistical artifact (100 sentences is not that many), but there does seem to be an exactly linear relationship between count difference and word count.

However, once the actual word count becomes larger than 10, the average count difference rises very dramatically and then stays quite bad. We can probably conclude that counting words in sentences with 10 or fewer words works okay-ish (and if there are errors, they are usually off-by-one errors), but counting words in sentences with more than 10 words works really poorly.

Basically, once you have more than 10 words, trying to naively count them with an LLM is hopeless. You would need to resort to various trickery, such as asking the LLM to count the words one by one (which works pretty well, as sketched below).
Of course, this would waste a lot of output tokens.
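One version of this trickery might look like the sketch below, reusing the client from above; the exact prompt wording and the TOTAL marker are illustrative assumptions, not the setup used in the experiments:

COUNT_ONE_BY_ONE_PROMPT = (
    "List every word of the following sentence on its own line, numbered starting from 1, "
    "ignoring the final period. Then output the total on a last line as 'TOTAL: <number>'.\n{sentence}"
)

def llm_word_count_one_by_one(sentence: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": COUNT_ONE_BY_ONE_PROMPT.format(sentence=sentence)}],
        temperature=0,
    )
    # Read the count off the final 'TOTAL: ...' line; return -1 if it cannot be parsed.
    for line in reversed(response.choices[0].message.content.splitlines()):
        if line.strip().upper().startswith("TOTAL:"):
            try:
                return int(line.split(":", 1)[1].strip())
            except ValueError:
                return -1
    return -1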

Generating Sentences with Certain Word Counts

So LLMs are not good at naively counting words. Why is this important? No one will ever attempt to count words using LLMs anyway (hopefully).

This fact is interesting, because for many applications, you want to restrict word counts of generated texts. For example, in educational applications, you often need to generate sentences of predefined lengths for different language levels.

How well does this work?

Frequency of Differences in Word Counts (Generation), T=0

This is really counterintuitive. Usually, generation is harder than classification. Especially in the case of counting words, generation should be much harder than classification.

Instead, the share of sentences with incorrect word counts seems to hover between 10% and 20%. To be sure, this is not great, but it means that we can actually quite reliably generate sentences with correct word counts by simply generating a number of sentence candidates and then throwing away those that have an incorrect word count.

After all, you can’t just generate a bunch of words and call it a day: the generated words need to form a coherent sentence. This means that the model has to look ahead. For example, when it is generating the penultimate token, it needs to make sure that it won’t “block itself” from generating a valid sentence with the last token. It is also generally weird that LLMs can suddenly count at the generation stage. The answer probably has to do with the fact that for this second task, the LLM is given “room to reason”.

Instead of just demanding that an LLM spit out a count, here we ask it to generate a sentence with the correct count. You can almost imagine the LLM counting the remaining number of words in its “head”.

Interestingly enough, this doesn’t deteriorate much even if we set the temperature to a higher value (e.g. 0.4).

To explore this, I repeated the generation experiment at the higher temperature, generating sentences with specific word counts and checking how well the model adheres to these limits. Plotting the results, we get the chart below.

Frequency of Differences in Word Counts (Generation), T=0.4

Broader Implications

There are three broader implications we can probably take away here.

First, this experiment underscores that even advanced LLMs may struggle with tasks that seem straightforward to humans. This serves as a reminder of the models’ limitations and the need for caution when applying them to tasks requiring precise outputs.

Second, LLMs are weird. I don’t think that many people would have predicted that generating sentences with correct word counts works fine, but counting words directly does not. We are often tempted to reason about LLMs in an anthropocentric way (understandable, given how good they seem to be at reasoning tasks nowadays), but results like this show that their failure modes can be quite different from human ones.

Third, generation-validation pipelines tend to be more reliable than generation-only pipelines. For example, when asked to generate sentences of predefined lengths using LLMs, we could try to improve the prompt until we turn blue (generation-only approach). Or we could just generate more sentences than necessary and then filter out those that don’t match the given criteria (generation-validation approach), as sketched below.
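A minimal sketch of such a generation-validation loop, again reusing the client from above (the prompt wording, the candidate limit, and the temperature are illustrative assumptions):

def generate_sentence_with_word_count(target: int, max_candidates: int = 5) -> str | None:
    prompt = (
        f"Write a single English sentence that contains exactly {target} words. "
        "Output only the sentence."
    )
    for _ in range(max_candidates):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.4,  # some variety between candidates
        )
        candidate = response.choices[0].message.content.strip()
        # Validation step: keep only candidates with the exact target word count.
        if len(candidate.rstrip(".").split()) == target:
            return candidate
    return None  # all candidates failed the check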

Further Investigation

There are a couple of interesting research directions that might be taken from here:

First, this experiment should be repeated with much larger datasets that will allow for more specific and statistically significant conclusions.

Second, it would likely be interesting to look at the log probs for the word counts, especially for the generation task.
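As a starting point, the Chat Completions API can return log probabilities for the generated tokens. A small sketch for the counting prompt, reusing the client, prompt template, and a sentence from the sketches above:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
    temperature=0,
    logprobs=True,
    top_logprobs=5,  # also return the five most likely alternatives per output token
)

for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
    for alternative in token_info.top_logprobs:
        print("   ", alternative.token, alternative.logprob)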

Mikhail Berkov

CTO of Titanom Technologies
