Grade-level alignment in LLM-generated K-12 instruction
I asked gpt-4.1 to explain a 4th-grade math standard to a 4th grader. Standard 4.OA.A.1: interpreting multiplication as a comparison. Here’s what came back across three prompt designs.
Let’s talk about what it means when we see a multiplication equation like 35 = 5 × 7. This equation tells us that 35 is the same as 5 groups of 7. But it also means something else: 35 is 5 times as many as 7. Imagine you have 7 apples. If you had 5 times as many apples, you would have 5 × 7 = 35 apples. It also works the other way: 35 is 7 times as many as 5. If you have 5 apples, and you get 7 times as many, you have 7 × 5 = 35 apples. So, when you see a multiplication equation like a = b × c, you can say: “a is b times as many as c,” or “a is c times as many as b.”
This standard is about understanding what multiplication really means when we compare numbers. When you see a multiplication equation like 35 = 5 × 7, it is not just about finding the answer. It is also about understanding the relationship between the numbers. In this example, 35 = 5 × 7 tells us that 35 is 5 times as many as 7, and also that 35 is 7 times as many as 5. You are expected to learn how to read and write multiplication equations as comparisons. This means you should be able to look at a multiplication equation and explain what it means in words.
We are learning to understand what multiplication really means when we compare numbers. When we see a multiplication equation, like 35 = 5 × 7, it tells us more than just the answer. It shows us how one number can be many times as big as another number. For example, if we say “35 is 5 times as many as 7,” we mean that if you have 7 things, and you have 5 groups of them, you will have 35 things in total. This standard is about reading and writing multiplication equations as comparisons — taking a sentence like “35 is 5 times as many as 7” and connecting it to the equation 35 = 5 × 7.
Zero-shot lands at 7th-grade level, three grades above the target. Adding a role and format instruction pushes it to nearly 10th grade. Adding an exemplar does not pull it back. More prompt, more drift. All three outputs are aimed at a 9-year-old. None of them reads like something a 9-year-old would encounter in class.
A larger sample: 1,080 generations
Sixty randomly drawn Common Core standards, 30 ELA and 30 Math across K-12. Three frontier OpenAI models (gpt-4.1, gpt-5.4, gpt-5.5), three prompt sizes, two wording conditions: 60 × 3 × 3 × 2 = 1,080 generations in total, for about five dollars and ninety minutes of compute.
Every output is scored by code, not by another model. No LLM judge, no prompt sensitivity, no second source of variance. The six formulas below are the same ones a curriculum reviewer would apply by hand, but run programmatically so the measurement is reproducible and free of model bias.
Six readability formulas
Δ = ensemble median grade minus target grade. Positive means the output is harder than the standard’s audience.
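The Δ definition above can be sketched in a few lines. This is a minimal illustration, not the study's scoring code: the study uses six formulas via textstat, while this sketch hand-rolls two standard ones that need no syllable counting (Automated Readability Index and Coleman-Liau) as stand-ins. The function names `ensemble_grade` and `drift` and the regex tokenization are assumptions made for the example.

```python
import re
from statistics import median


def _counts(text):
    """Return (letters, words, sentences) using simple regex heuristics."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    letters = sum(ch.isalnum() for w in words for ch in w)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return letters, max(1, len(words)), sentences


def ari(text):
    """Automated Readability Index (characters per word, words per sentence)."""
    letters, words, sentences = _counts(text)
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43


def coleman_liau(text):
    """Coleman-Liau index (letters and sentences per 100 words)."""
    letters, words, sentences = _counts(text)
    per_100_letters = 100 * letters / words
    per_100_sentences = 100 * sentences / words
    return 0.0588 * per_100_letters - 0.296 * per_100_sentences - 15.8


def ensemble_grade(text, scorers=(ari, coleman_liau)):
    """Median grade estimate across the supplied formulas."""
    return median(score(text) for score in scorers)


def drift(text, target_grade, scorers=(ari, coleman_liau)):
    """Delta: ensemble median grade minus target grade (positive = too hard)."""
    return ensemble_grade(text, scorers) - target_grade
```

Taking the median rather than the mean keeps one outlying formula from dominating the estimate, which is presumably why an ensemble median is used.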
Consistent drift across three frontier models
Mean Δ = +3.29 grade levels. 92.2% of explanations land above their target grade. Cohen’s d = 1.49, a very large effect. And then there’s this: gpt-5.5 = +3.28, gpt-5.4 = +3.30, gpt-4.1 = +3.28. Three different model generations, 0.02 grade levels apart. They are not different. This is not one model’s quirk. It’s consistent behavior across the frontier.
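For readers who want to recompute these summary numbers from a list of per-generation Δ values, here is a hedged sketch. It assumes Cohen's d is the one-sample form (mean Δ divided by the sample standard deviation, measured against Δ = 0); the post does not state which variant was used, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, stdev


def summarize(deltas):
    """Summarize drift values: mean, share above target, one-sample Cohen's d."""
    m = mean(deltas)
    return {
        "mean_delta": m,
        # Fraction of generations that land above their target grade.
        "share_above": sum(d > 0 for d in deltas) / len(deltas),
        # Assumption: d is computed against a reference of zero drift.
        "cohens_d": m / stdev(deltas),
    }


# Toy data, not study data.
print(summarize([1.0, 2.0, 3.0, -1.0]))
```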
All three models produce the same rightward shift. The dashed line at Δ = 0 is “exactly on target.” Barely any bars touch it.
Drift by grade band
Look at where the outputs land, not just how far off they are. K-2 lands at grade 5.6. 3-5 lands at 7.5. 6-8 at 9.2. 9-12 at 8.9. Across a full K-12 range of targets, outputs cluster between grade 5 and grade 9. The model has a default adult-leaning register and pulls every generation toward it: K-2 gets dragged up several grades, high school gets dragged down. The “best calibrated” band is just the one whose target happens to fall inside the model’s default range.
“Demonstrate command of the conventions of standard English grammar and usage when writing or speaking.”
When we talk or write, we use words in a special way so everyone can understand us. This is called using good grammar. Good grammar means saying words in the right order and using the right words. For example, we say, “I am happy,” not “Happy I am.” We also use the right words for things, like “he” for a boy and “she” for a girl. When we write, we use big letters at the start of a sentence and a period at the end. For example: “The cat is black.” The first letter is big, and there is a dot at the end. Using good grammar helps our friends and teachers know what we mean. It’s like following the rules of a game so everyone can play together and have fun. So, when you talk or write, try to use your words in the right way. That’s what it means to use good grammar!
“Ask and answer questions about information from a speaker, offering appropriate elaboration and detail.”
This standard means that when someone is talking to you (like your teacher, a classmate, or a guest) you should listen carefully. After they finish speaking, you can ask questions about what they said if you want to know more or if something wasn’t clear. For example, if your teacher tells a story about animals, you might ask, “Why did the dog run away?” or “What happened next?” It also means you should be able to answer questions when someone asks you about what you heard. When you answer, try to give enough information so the other person understands. Don’t just say “yes” or “no.” Instead, add details. So, this standard is about being a good listener and a good speaker. You show you understand by asking good questions and giving clear, detailed answers. This helps everyone learn more and have better conversations!
Make sense of problems and persevere in solving them means that when you see a math problem, you should first try to really understand what it’s asking. Don’t just start doing calculations right away. Take a moment to read the problem carefully, figure out what information you have, and what you need to find out. Try to picture the problem in your mind, draw a diagram, or write down what you know. Once you understand the problem, start working on it. If you get stuck, don’t give up! Try a different strategy, like breaking the problem into smaller parts, looking for patterns, or checking if you’ve seen a similar problem before. Persevering means you keep working, even if it’s hard or takes a while. Good problem solvers don’t quit when things get tough. They keep thinking, trying, and learning from their mistakes. This skill is important not just in math, but in life!
Every one of the ten standards with the highest drift in this study is K-2 or grade 3. Every one of the ten with the lowest drift is grade 8 or above. The inverse relationship is consistent and follows directly from how the standards themselves are written.
Each dot is one generated explanation. The dotted diagonal is perfect calibration (output = target). K-2 points sit entirely above it. Red dots appear only in the 9-12 columns.
Prompt reading level matches output reading level
The prompts themselves (the standard’s own wording plus surrounding instructions) were already +3.19 grade levels above their stated target. The model’s output landed at +3.29. 97% of the drift was already present in the prompt before the model generated a single word.
97% of the total drift (+3.19 of +3.29) was already in the prompt. The model’s contribution is the small dark sliver on the right.
The per-cell correlation between prompt reading level and output reading level is Pearson r = 0.30, weak, because all prompts in the main study were already at adult register with very little variance. When the follow-up rewrote prompts across the full K-12 range, r rose to 0.55. The model mirrors what it reads. Give it adult prose, it outputs adult prose.
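Both numbers in this section are simple to recompute. The sketch below shows the prompt's share of the drift from the two means quoted above, plus a plain Pearson correlation that could be applied to paired (prompt grade, output grade) values. The helper name `pearson_r` and the toy inputs are assumptions; the study's own per-cell data lives in the repo.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Share of total drift already present in the prompt (means from the post).
prompt_delta, output_delta = 3.19, 3.29
prompt_share = prompt_delta / output_delta  # about 0.97

# Toy example: prompt grade vs. output grade for a handful of cells.
r = pearson_r([4.0, 6.0, 8.0, 10.0], [5.0, 6.5, 9.0, 9.5])
```

A narrow spread of prompt grades leaves little variance for a correlation to pick up, which is consistent with r rising once the prompts span the full K-12 range.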
More elaborate prompts increase drift
Prompt size. Three sizes: short zero-shot (S), medium with role and format instructions (M), long with role, format, and a one-shot exemplar (L).
I thought a more sophisticated prompt would fix it. It made things worse. The role instruction, “you are a teacher writing a student-facing explanation,” is adult prose. The model reads it and continues in adult prose. Adding a one-shot exemplar pulls the number partway back because the exemplar demonstrates simpler register, but it doesn’t overcome the surrounding instructions. Bare zero-shot beats both.
Simplified wording. Every standard ran twice: once with the original CCSS text, once with the standard text rewritten into plain language. That knocked 0.5 grade levels off drift (paired t-tests, p ≤ 7×10⁻⁴ for every model). Statistically solid. Practically, a 0.5-grade fix on a 3-grade problem, because it only touched the standard text, not the role instructions or surrounding context.
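Because every standard ran under both wordings, the comparison is paired. A minimal sketch of the paired t statistic follows; `paired_t` is a hypothetical helper, the inputs are toy values rather than study data, and the p-value step is left out because it needs the t distribution (for example `scipy.stats.ttest_rel`, which computes the same statistic).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(original, simplified):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [o - s for o, s in zip(original, simplified)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference.
    se = stdev(diffs) / sqrt(n)
    return mean_diff, mean_diff / se, n - 1


# Toy drift values for the same four standards under each wording.
mean_diff, t_stat, dof = paired_t([3.0, 4.0, 5.0, 3.5], [2.5, 3.0, 4.8, 3.0])
```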
Rewriting the full prompt at target grade
Simplifying just the standard text bought 0.5 grade levels on a 3-grade problem, so I rewrote the entire prompt at the target grade: role instruction, format instruction, standard text, everything. Then I regenerated.
180 cells, gpt-4.1, zero-shot.
Red curve is the baseline (+3.29 mean). Indigo is the full-prompt rewrite (+1.25 mean). The brackets show each distribution’s distance from Δ=0, the target.
62% of the gap closed. Reduction = 1.87 grade levels, t = 9.13, p ≈ 10⁻¹⁶. The prompt register is the primary driver of output register. Write the prompt at the grade, get output near the grade.
But the fix has a ceiling.
Limits of the intervention
The intervention has a range. For grades 3-8 it works cleanly. The chart below shows most of the gap closing. The extremes tell you where the ceiling is.
Red circles are baseline mean Δ per band. Indigo diamonds are after the full-prompt rewrite.
K-2 still lands at +2.94 because gpt-4.1 can’t write prompts below roughly grade 4, so the generator sees grade-4 input for a kindergarten target and produces grade-4 output. 9-12 crosses below zero because the rewriter undershoots adult register and the generator follows. Both are bottlenecks in the prompt writer, not the generator.
Practical recommendations
These models are next-token predictors. That’s not a simplification; it’s the mechanism. The model conditions on the sequence it has received and predicts a likely next token. If the prompt arrives in 10th-grade vocabulary and syntax, the likeliest continuation is 10th-grade text. The output isn’t merely influenced by the prompt register; it’s a continuation of it.
- Write the prompt AT the target grade; don’t just LABEL the target grade. “Write this for a 3rd grader” is not a register constraint. A prompt written at 3rd-grade reading level is.
- Gate every output with a deterministic readability check before delivery. The scoring stack runs in milliseconds and catches the worst cells before a teacher or student sees them.
- Don’t add scaffolding for its own sake. Role and format instructions are themselves adult prose. Adult-flavored instructions push toward adult output.
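The second recommendation, a deterministic gate, might look like the sketch below. It is an illustration under stated assumptions: a single hand-rolled Automated Readability Index stands in for the six-formula ensemble, the one-grade `tolerance` is an arbitrary choice, and `passes_gate` and `first_passing` are hypothetical names. The same check can be pointed at the prompt itself to confirm it is written at the target grade.

```python
import re


def ari_grade(text):
    """Automated Readability Index, used here as a stand-in scorer."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_words = max(1, len(words))
    letters = sum(ch.isalnum() for w in words for ch in w)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * letters / n_words + 0.5 * n_words / sentences - 21.43


def passes_gate(text, target_grade, score=ari_grade, tolerance=1.0):
    """True if the text is at most `tolerance` grades above the target.

    One-sided on purpose: it only blocks text that is too hard.
    """
    return score(text) - target_grade <= tolerance


def first_passing(candidates, target_grade, **gate_kwargs):
    """Return the first candidate that clears the gate, or None if none do."""
    for text in candidates:
        if passes_gate(text, target_grade, **gate_kwargs):
            return text
    return None
```

Returning `None` instead of the least-bad candidate forces the caller to decide what to do when nothing clears the gate, such as regenerating or routing to a human reviewer.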
A few things this study isn’t. Reading level is not pedagogical quality. A grade-7 explanation of a kindergarten standard might still be accurate and useful. This is only OpenAI; Claude and Gemini might do something different. And it’s sixty standards, not the whole CCSS. Take it as the start of the question, not the end.
This post summarizes the key findings. The underlying dataset, seven interactive charts, per-standard breakdowns, and the full statistical report are in the GitHub repo. If you want to dig into the raw numbers or reproduce the pipeline, that is the place to start.
Reproducibility
60 standards drawn from the CCSS Multi-State frameworks (30 ELA, 30 Math), sampled at random with seed 20260504 from a parent pool of 200. Three OpenAI models (gpt-4.1, gpt-5.4, gpt-5.5), three prompt sizes (short zero-shot, medium with role, long with one-shot exemplar), two wording conditions (original and simplified), 1,080 total cells. Scoring: ensemble median of six readability formulas via textstat, syntactic features via spaCy, no LLM-as-judge. Multiple-comparison correction: Holm-Bonferroni. Total generation cost: approximately $5.
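For reference, the Holm-Bonferroni step-down procedure mentioned above can be sketched as follows. This is a generic implementation of the standard method, not the repo's code; the function name, the default `alpha` of 0.05, and the example p-values are assumptions.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # Once one test fails, all larger p-values fail too.
    return reject


# Toy p-values, not study results.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
# decisions == [True, False, False, True]
```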
Full data and reproduction pipeline at github.com/smnji/grade-level-drift:
```shell
git clone https://github.com/smnji/grade-level-drift
cd grade-level-drift && cp .env.example .env   # add OPENAI_API_KEY
pip install -r requirements.txt
python -m src.generate --run-id main           # ~$5, ~90 min
python -m src.score --run-id main
python -m src.report --run-id main
open reports/main_report.html                  # macOS; use xdg-open on Linux
```
Data: CCSS via Learning Commons (CC BY 4.0). Readability stack: textstat (MIT), spaCy (MIT). Word lists: Coxhead AWL, Browne et al. NGSL.