The raspberry test is a crap AI dunk.
You have probably seen it: "If AI is so clever, why can't it count the r's in raspberry?" Hur hur hur.
Fair question if you are testing a human child. Dumb kid.
But we ain’t. If you are testing a machine that does not see words the way you do.
This is today’s AI 101 lesson: Tokens and Tokenisation. Context windows. The hidden budget behind every AI chat.
Yes, it sounds boring. Unfortunately, it explains a lot. So…it’s fun? Right?
Why models make weird mistakes. Why a huge context window does not magically make results better. Why your API bill can get stupid. Why long chats start to drift. Why dumping every document you own into one conversation can make the answer worse, not better.
It’s all tokens, baby.
Ok so what is a token?
A token is the small “chunk” a large language model reads and writes.

They are roughly equivalent to individual words. Annoyingly so at times.
Sometimes a full word. Sometimes part of a word. Sometimes punctuation. Sometimes a space stuck to the next word because apparently computers enjoy making life mildly irritating.
OpenAI's own rough guide is that one token is about four characters, or about three quarters of an English word. Good enough for normal humans. Not exact enough if you are building something where cost matters.
Here’s a chunk of text split up into it’s constituent tokens using the GPT-4 tokeniser:

You can play with this yourself by the way:
Type a sentence in.
Watch it break apart (“chunking”)
You will immediately see why "tokens are words" is not quite right.
The model is not passing your beautiful sentence around as a sentence. It is chunking it into tokens, mapping those tokens to IDs, processing those IDs, then producing new tokens back out.

So what about raspberry? And how many Rs does it really have?
The raspberry example helps, just not as a gotcha.
The lazy take is: "AI can't count letters, so AI is rubbish. Lul”
Yawn. It’s an easy cope from people who aren’t willing to even try using AI to improve their life. I can’t help them,
A better read: the model is usually not seeing letters in the way you are. It sees token chunks. Let’s chunk it using the GPT5 tokeniser and see:

Huh.
Raspberry is made of three tokens inside GPT5.
The model is usually not looking at the word the way you are.
You see: r a s p b e r r y
The model “sees”: ras / p / berry
Suddenly the problem is a bit less mysterious.
You are asking a letter-level question to a system that is not naturally built around a letter-level view of words.
That does not mean the system is useless. It just means you need to understand the interface and what LLMs are and are not good at.
This is the same reason AI can be incredible at summarising a 40-page report, spotting patterns across messy notes, writing code, explaining concepts, and helping you think — while still occasionally fumbling something that looks obvious to a human child.
Because “easy for humans” and “easy for machines” are not the same category.
And if you genuinely want a machine that is good at counting we have them: let me present to you…computers. Shit, that’s the EASY part. We’ve had counting machines for ages!
So when someone snarkily asks: “How many Rs are in raspberry?” they aren’t exposing the death of artificial intelligence. They are exposing a mismatch between the question and the way the system processes text. Much less exciting.
Now we have a concept of what a token is and how we create them using tokenisation let’s look at this practical usage of this knowledge: context windows.

The context window is the model's working memory for the current task.
Your prompt. The previous chat. Uploaded documents. Images. Tool calls. Search results. System instructions. The answer it is writing back to you.
All of that lives in the context window.
When the context window fills up, the app has a few options. Older chatbots used to just kick you into a new thread. Lovely. Very sophisticated! Now many systems compact the context instead. They take the previous mess, summarise it, and carry the summary forward into a cleaner window.
Better than being booted out.
Bu still lossy. Context compaction is a photocopy of a photocopy. It can preserve the main shape and quietly lose the detail that actually mattered. Some constraint. A preference. A weird edge case. The reason you told the agent not to do something. Those little (but important!) nuances may get lost in the compaction.

That matters in normal chats sure. It’s annoying. It matters even more with agents! If the boundary gets compacted badly, the agent can drift. Because the instruction that kept it on track got blurred. That’s why over time some agentic processes actually get worse.
So: easy fix right? Bigger context window? EASY.
Yes. Maybe
Here are the various context window sizes right now (Jun 2026):

These WILL change.
Right now 1M is the top end. And 1 millions is a LOT.
That is ~1575 A4 pages at size 12 Arial font. That’s an lot of info to be able to throw at a model.
That lets you work with longer transcripts, bigger reports, more files, larger codebases, ongoing projects, and fewer annoying resets.
But "bigger" and "better" are not the same word.
The bigger the window, the more tempted you are to throw everything into it.
Whole database? In it goes.
Whole codebase? Why not.
Twenty entire textbook PDFs because one of them probably matters? Sure, sod it.
Then you ask a very specific question and the model has to search through a skip full of context you dumped in because the box was large enough.
You just made a bigger mess. And now the model has to go sorting through it all to find what matters.
More context helps when it adds signal. It hurts when it adds noise. OR: A larger desk is useful if it holds the right papers. Empty every drawer onto it and suddenly you cannot find the passport.
To make it worse the models actual use tokens when they are digging around in the mess you’ve made. All that additional work comes in the form of reasoning tokens.
For the most part when you are using the chat app version of an AI this doesn’t really matter. Whether it’s ChatGPT, Claude, Gemini or something else you don’t get much visibility into your token usage and importantly you aren’t charged per token.
In chat apps, you mostly feel this as quality, limits, speed, or a thread getting a bit weird because it’s overloaded.
In the API, you feel it as money.
Tokens are the billing unit. Input tokens. Output tokens. Cached tokens sometimes. System prompts. User prompts. Tool results. The model's answer. Everything counts.
Here are the (high!) costs of GPT5.5 right now:

That’s $5.00 for every 1M input token and $30.00 for every 1M output tokens. That’s second mortgage time right there.
This is why when you BUILD with AI you need to be on top of this sort of thing. Otherwise costs can spiral.
So, how do you know the costs?
Easiest is to check the current docs:
Or if you want a quick summary of all the models from all providers use Artificial Analysis
Ultimately you need to test in your actual product and choose the best model for the best price you can. This is where "sod it, just use the best model with the biggest context" becomes expensive advice.
If a cheaper model with a smaller useful context does the job, use it.
This is also where the Chinese models are chomping at the heels of the US models. Here is DeepSeek V4 Pro pricing for instance:

Compared to $5/$30 of GPT5.5 you can see why this is a problem for OpenAI!
Is DeepSeek V4 Pro as good as GPT5.5? No. But its 35x cheaper for output tokens. And good enough for a LOT of work. Your customer does not care how many tokens at what expense you heroically burned in the background. They care whether the thing works.
This brings us to the trend of companies telling people to "tokenmaxx".
There has been a trend over the last few months to push engineers to use AI more internally. Even setting up leader boards to see how many tokens engineers can use.
Spend more tokens. Push the tools harder. Generate more. Do more AI activity.
The logic here is that the more AI is used the more productivity. Yeah…dumb.
But activity is not output.
More lines of code is a terrible metric of productivity. A 90-page AI-generated report nobody reads is not productivity. A huge code diff nobody can maintain is not progress. A chatbot conversation that burns a million tokens to produce a waffly answer is not impressive.
Spending tokens is the wrong target. Getting the job done is the target. That’s been forgotten by simply tokenmaxxing.
Instead are the boring rules I would actually use:

Do these and you’ll do just fine.
To the Task,
Kyle
Why it matters: : What are Tokens?
Get the full daily digest in your inbox. Browse recent issues →
: Don't Automate (Yet) Co-founder of OpenAI shared this prompt Kyle Balmer May 27, 2026 https://youtu.be/fq4P6oZJOrY Greg…