Sam Naji · May 5, 2026

Grade-level alignment in LLM-generated K-12 instruction

llm-evaluation edtech readability k-12 alignment methodology

I asked gpt-4.1 to explain a 4th-grade math standard to a 4th grader. Standard 4.OA.A.1: interpreting multiplication as a comparison. Here’s what came back across three prompt designs.

gpt-4.1 output (zero-shot prompt)

Let’s talk about what it means when we see a multiplication equation like 35 = 5 × 7. This equation tells us that 35 is the same as 5 groups of 7. But it also means something else: 35 is 5 times as many as 7. Imagine you have 7 apples. If you had 5 times as many apples, you would have 5 × 7 = 35 apples. It also works the other way: 35 is 7 times as many as 5. If you have 5 apples, and you get 7 times as many, you have 7 × 5 = 35 apples. So, when you see a multiplication equation like a = b × c, you can say: “a is b times as many as c,” or “a is c times as many as b.”

gpt-4.1 output (role + format prompt)

This standard is about understanding what multiplication really means when we compare numbers. When you see a multiplication equation like 35 = 5 × 7, it is not just about finding the answer. It is also about understanding the relationship between the numbers. In this example, 35 = 5 × 7 tells us that 35 is 5 times as many as 7, and also that 35 is 7 times as many as 5. You are expected to learn how to read and write multiplication equations as comparisons. This means you should be able to look at a multiplication equation and explain what it means in words.

gpt-4.1 output (role + format + exemplar prompt)

We are learning to understand what multiplication really means when we compare numbers. When we see a multiplication equation, like 35 = 5 × 7, it tells us more than just the answer. It shows us how one number can be many times as big as another number. For example, if we say “35 is 5 times as many as 7,” we mean that if you have 7 things, and you have 5 groups of them, you will have 35 things in total. This standard is about reading and writing multiplication equations as comparisons — taking a sentence like “35 is 5 times as many as 7” and connecting it to the equation 35 = 5 × 7.

Zero-shot lands at 7th-grade level, three grades above the target. Adding a role and format instruction pushes it to nearly 10th grade. Adding an exemplar does not pull it back. More prompt, more drift. All three outputs are aimed at a 9-year-old. None of them reads like something a 9-year-old would encounter in class.

A larger sample: 1,080 generations

Sixty randomly drawn Common Core standards, 30 ELA and 30 Math across K-12. Three frontier OpenAI models (gpt-4.1, gpt-5.4, gpt-5.5), three prompt sizes, two wording conditions. 1,080 generations total. About five dollars and ninety minutes of compute. Every output scored by code, not by another model. Six classical readability formulas, ensemble median, no LLM in the loop.
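
The cell count follows from a full factorial design. A minimal sketch of how such a grid could be enumerated; the standard IDs and condition labels below are placeholders, not the identifiers used in the repo:

```python
from itertools import product

# Placeholder IDs standing in for the 60 sampled standards (30 ELA, 30 Math).
standards = [f"ELA-{i:02d}" for i in range(30)] + [f"MATH-{i:02d}" for i in range(30)]
models = ["gpt-4.1", "gpt-5.4", "gpt-5.5"]
prompt_sizes = ["S", "M", "L"]          # zero-shot, role + format, with exemplar
wordings = ["original", "simplified"]   # original CCSS text vs. plain-language rewrite

# One cell = one generation for a (standard, model, size, wording) combination.
cells = list(product(standards, models, prompt_sizes, wordings))
print(len(cells))  # 60 * 3 * 3 * 2 = 1080
```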

Design choice: deterministic scoring

Every output is scored by code, not by another model. No LLM judge, no prompt sensitivity, no second source of variance. The six formulas below are the same ones a curriculum reviewer would apply by hand, but run programmatically so the measurement is reproducible and free of model bias.

Six readability formulas

Flesch-Kincaid: average syllables per word + sentence length

SMOG: polysyllable count in a 30-sentence sample

Coleman-Liau: character count instead of syllables

ARI: characters-per-word + words-per-sentence

Gunning Fog: complex word ratio + sentence length

Dale-Chall: fraction of words outside 3,000 familiar words
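
Two of the six formulas need only character, word, and sentence counts, so they are easy to sketch without a syllable counter. The version below uses the published ARI and Coleman-Liau coefficients with a deliberately naive regex tokenizer. It is an illustration, not the study's scoring code (which uses textstat), and its scores will differ slightly from a library implementation. The function names are made up for this example.

```python
import re

def text_counts(text: str) -> tuple[int, int, int]:
    """Return (characters, words, sentences) using naive regex tokenization."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words:
        raise ValueError("text must contain at least one word")
    # Count letters and digits only; apostrophes inside words are ignored.
    chars = sum(ch.isalnum() for w in words for ch in w)
    return chars, len(words), len(sentences)

def ari_grade(text: str) -> float:
    """Automated Readability Index: characters per word + words per sentence."""
    chars, words, sentences = text_counts(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43

def coleman_liau_grade(text: str) -> float:
    """Coleman-Liau index: letters and sentences per 100 words."""
    chars, words, sentences = text_counts(text)
    letters_per_100 = 100 * chars / words
    sentences_per_100 = 100 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

simple = "The cat is black. The dog is big."
formal = ("Demonstrate command of the conventions of standard English "
          "grammar and usage when writing or speaking.")
print(ari_grade(simple) < ari_grade(formal))                    # True
print(coleman_liau_grade(simple) < coleman_liau_grade(formal))  # True
```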

Δ = ensemble median grade minus target grade. Positive means the output is harder than the standard’s audience.
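
As a minimal sketch of that definition: the function below assumes each formula's score has already been expressed as a grade level (Dale-Chall in particular returns a raw score that has to be mapped to a grade first). The function name and the six sample scores are made up for illustration.

```python
from statistics import median

def grade_delta(formula_grades: dict[str, float], target_grade: float) -> float:
    """Delta = ensemble median grade minus target grade (positive = harder than target)."""
    return median(formula_grades.values()) - target_grade

# Made-up scores for one output aimed at a grade-4 standard.
scores = {
    "flesch_kincaid": 6.8,
    "smog": 8.1,
    "coleman_liau": 7.4,
    "ari": 7.0,
    "gunning_fog": 9.2,
    "dale_chall": 6.5,
}
# With six values the median is the mean of the middle two: (7.0 + 7.4) / 2 = 7.2.
print(round(grade_delta(scores, target_grade=4), 2))  # 3.2
```

The median keeps one outlying formula (Gunning Fog in this made-up case) from dragging the ensemble.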

Consistent drift across three frontier models

Mean Δ = +3.29 grade levels. 92.2% of explanations land above their target grade. Cohen’s d = 1.49, a very large effect. And then there’s this: gpt-5.5 = +3.28, gpt-5.4 = +3.30, gpt-4.1 = +3.28. Three different model generations, 0.02 grade levels apart. They are not different. This is not one model’s quirk. It’s consistent behavior across the frontier.

+3.3 mean grade drift

92% above their target

1.49 Cohen’s d (very large)

0.02 spread across 3 models

All three models produce the same rightward shift. The dashed line at Δ = 0 is “exactly on target.” Barely any bars touch it.
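
For readers who want to recompute the headline numbers from per-cell Δ values, a minimal sketch. It assumes Cohen's d here is the one-sample form (mean Δ divided by the standard deviation of Δ, against a reference of zero); the post does not state which variant was used. The function name and sample values are made up.

```python
from statistics import mean, stdev

def summarize_drift(deltas: list[float]) -> tuple[float, float, float]:
    """Return (mean delta, one-sample Cohen's d vs. 0, share of cells above target)."""
    m = mean(deltas)
    d = m / stdev(deltas)  # sample standard deviation (n - 1 denominator)
    share_above = sum(x > 0 for x in deltas) / len(deltas)
    return m, d, share_above

# Made-up per-cell deltas, not study data.
sample = [1.0, 2.0, 3.0, 4.0, -1.0]
m, d, share = summarize_drift(sample)
print(f"mean={m:.2f} d={d:.2f} above={share:.0%}")
```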

Drift by grade band

Look at where the outputs land, not just how far off they are. K-2 lands at grade 5.6. 3-5 lands at 7.5. 6-8 at 9.2. 9-12 at 8.9. Across a full K-12 range of targets, outputs cluster between grade 5 and grade 9. The model has a default adult-leaning register and pulls every generation toward it: K-2 gets dragged up several grades, high school gets dragged down. The “best calibrated” band is just the one whose target happens to fall inside the model’s default range.

+5.60 grade levels above target · output: ~grade 5.6 · target: grade K

L.K.1 · Kindergarten ELA

“Demonstrate command of the conventions of standard English grammar and usage when writing or speaking.”

gpt-4.1 output (zero-shot prompt)

When we talk or write, we use words in a special way so everyone can understand us. This is called using good grammar. Good grammar means saying words in the right order and using the right words. For example, we say, “I am happy,” not “Happy I am.” We also use the right words for things, like “he” for a boy and “she” for a girl. When we write, we use big letters at the start of a sentence and a period at the end. For example: “The cat is black.” The first letter is big, and there is a dot at the end. Using good grammar helps our friends and teachers know what we mean. It’s like following the rules of a game so everyone can play together and have fun. So, when you talk or write, try to use your words in the right way. That’s what it means to use good grammar!

+4.54 grade levels above target · output: ~grade 7.5 · target: grade 3

SL.3.3 · Grade 3 ELA

“Ask and answer questions about information from a speaker, offering appropriate elaboration and detail.”

gpt-4.1 output (zero-shot prompt)

This standard means that when someone is talking to you (like your teacher, a classmate, or a guest) you should listen carefully. After they finish speaking, you can ask questions about what they said if you want to know more or if something wasn’t clear. For example, if your teacher tells a story about animals, you might ask, “Why did the dog run away?” or “What happened next?” It also means you should be able to answer questions when someone asks you about what you heard. When you answer, try to give enough information so the other person understands. Don’t just say “yes” or “no.” Instead, add details. So, this standard is about being a good listener and a good speaker. You show you understand by asking good questions and giving clear, detailed answers. This helps everyone learn more and have better conversations!

+1.24 grade levels above target · output: ~grade 9.2 · target: grade 8

8.MP1 · Grade 8 Math

“Make sense of problems and persevere in solving them.”

gpt-4.1 output (zero-shot prompt)

Make sense of problems and persevere in solving them means that when you see a math problem, you should first try to really understand what it’s asking. Don’t just start doing calculations right away. Take a moment to read the problem carefully, figure out what information you have, and what you need to find out. Try to picture the problem in your mind, draw a diagram, or write down what you know. Once you understand the problem, start working on it. If you get stuck, don’t give up! Try a different strategy, like breaking the problem into smaller parts, looking for patterns, or checking if you’ve seen a similar problem before. Persevering means you keep working, even if it’s hard or takes a while. Good problem solvers don’t quit when things get tough. They keep thinking, trying, and learning from their mistakes. This skill is important not just in math, but in life!

−1.60 grade levels, below target · output: ~grade 8.9 · target: grade 10.5

HSF-IF.B.5 · Grade 10-11 Math

“Relate the domain of a function to its graph and, where applicable, to the quantitative relationship it describes.”

gpt-4.1 output (zero-shot prompt, lands below target)

The domain of a function is the set of all possible input values (usually called x-values) that you can use in the function. When you look at the graph of a function, the domain tells you which x-values are shown on the graph. For example, imagine a function h(n) that gives the number of person-hours needed to assemble n engines in a factory. In real life, you can’t assemble a negative number of engines, and you can’t assemble a fraction of an engine. So, the domain of h(n) is the set of positive integers (1, 2, 3, …), because those are the only values that make sense for n in this situation. When you see a graph of a function, the domain is shown by the part of the x-axis that the graph covers. If the graph only exists for x-values greater than or equal to 0, then the domain is x ≥ 0. If the graph is a line that goes on forever in both directions, the domain is all real numbers.

Every one of the ten standards with the highest drift in this study is K-2 or grade 3. Every one of the ten with the lowest drift is grade 8 or above. The inverse relationship is consistent and follows directly from how the standards themselves are written.

Each dot is one generated explanation. The dotted diagonal is perfect calibration (output = target). K-2 points sit entirely above it. Red dots appear only in the 9-12 columns.

Prompt reading level matches output reading level

The prompts themselves (the standard’s own wording plus surrounding instructions) were already +3.19 grade levels above their stated target. The model’s output landed at +3.29. 97% of the drift was already present in the prompt before the model generated a single word.

97% of the total drift (+3.19 of +3.29) was already in the prompt. The model’s contribution is the small dark sliver on the right.

The per-cell correlation between prompt reading level and output reading level is Pearson r = 0.30, weak, because all prompts in the main study were already at adult register with very little variance. When the follow-up rewrote prompts across the full K-12 range, r rose to 0.55. The model mirrors what it reads. Give it adult prose, it outputs adult prose.
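
The correlation itself is a short calculation. A minimal pure-Python sketch of Pearson r over paired (prompt grade, output grade) values; the function name and the sample pairs are made up, and the study's own pipeline may compute it differently.

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up pairs: prompt reading grade vs. output reading grade.
prompt_grades = [3.0, 5.0, 7.0, 9.0, 11.0]
output_grades = [4.5, 5.0, 8.0, 8.5, 10.0]
print(round(pearson_r(prompt_grades, output_grades), 2))
```

When every prompt sits in a narrow band of grades, the spread in x is small relative to noise in y, which is one reason r can stay low even if outputs track prompts.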

More elaborate prompts increase drift

Prompt size. Three sizes: short zero-shot (S), medium with role and format instructions (M), long with role, format, and a one-shot exemplar (L).

S · Zero-shot · +2.72 mean grade drift

L · With exemplar · +3.23 mean grade drift

M · Role + format · +3.91 mean grade drift

I thought a more sophisticated prompt would fix it. It made things worse. The role instruction, “you are a teacher writing a student-facing explanation,” is adult prose. The model reads it and continues in adult prose. Adding a one-shot exemplar pulls the number partway back because the exemplar demonstrates simpler register, but it doesn’t overcome the surrounding instructions. Bare zero-shot beats both.

Simplified wording. Every standard ran twice: once with the original CCSS text, once with the standard text rewritten into plain language. That knocked 0.5 grade levels off drift (paired t-tests, p ≤ 7×10⁻⁴ for every model). Statistically solid. Practically, a 0.5-grade fix on a 3-grade problem, because it only touched the standard text, not the role instructions or surrounding context.
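
A paired comparison like this works on per-standard differences, since each standard appears once in each wording condition. A minimal sketch of the t statistic with made-up Δ values; the function name is hypothetical, and turning t into a p-value needs a t-distribution from a stats library, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(original: list[float], simplified: list[float]) -> tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(original) != len(simplified) or len(original) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [o - s for o, s in zip(original, simplified)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))  # undefined if all diffs are identical
    return mean(diffs), t, n - 1

# Made-up deltas for four standards under each wording condition.
orig_deltas = [3.0, 4.0, 5.0, 3.5]
simp_deltas = [2.5, 3.4, 4.6, 3.0]
mean_diff, t_stat, dof = paired_t(orig_deltas, simp_deltas)
print(f"mean diff={mean_diff:.2f} t={t_stat:.2f} df={dof}")
```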

Rewriting the full prompt at target grade

Simplifying just the standard text moved the number by only 0.5 grade levels, so I rewrote the entire prompt at the target grade: role instruction, format instruction, standard text, everything. Then I regenerated.

180 cells, gpt-4.1, zero-shot.

Red curve is the baseline (+3.29 mean). Indigo is the full-prompt rewrite (+1.25 mean). The brackets show each distribution’s distance from Δ=0, the target.

62% of the gap closed. Reduction = 1.87 grade levels, t = 9.13, p ≈ 10⁻¹⁶. The prompt register is the primary driver of output register. Write the prompt at the grade, get output near the grade.

But the fix has a ceiling.

Limits of the intervention

The intervention has a range. For grades 3-8 it works cleanly. The chart below shows most of the gap closing. The extremes tell you where the ceiling is.

Red circles are baseline mean Δ per band. Indigo diamonds are after the full-prompt rewrite.

K-2 still lands at +2.94 because gpt-4.1 can’t write prompts below roughly grade 4, so the generator sees grade-4 input for a kindergarten target and produces grade-4 output. 9-12 crosses below zero because the rewriter undershoots adult register and the generator follows. Both are bottlenecks in the prompt writer, not the generator.

Practical recommendations

These models are next-token predictors. That’s not a simplification. It’s the mechanism. The model scans the sequence it has received and predicts the most likely next token. If the prompt arrives in 10th-grade vocabulary and syntax, the next predicted token is a 10th-grade token. The output isn’t influenced by the prompt register. It’s a continuation of it.

Write the prompt AT the target grade; don’t just LABEL the target grade. “Write this for a 3rd grader” is not a register constraint. A prompt written at 3rd-grade reading level is.
Gate every output with a deterministic readability check before delivery. The scoring stack runs in milliseconds and catches the worst cells before a teacher or student sees them.
Don’t add scaffolding for its own sake. Role and format instructions are themselves adult prose. Adult-flavored instructions push toward adult output.
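
The readability gate in the second recommendation could look something like the sketch below. It is an assumption-heavy illustration: a single pure-Python ARI score stands in for the six-formula ensemble, the one-grade tolerance is an arbitrary choice, and the function names are made up.

```python
import re
from typing import Callable

def ari_grade(text: str) -> float:
    """Stand-in scorer: Automated Readability Index with naive tokenization."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words:
        raise ValueError("text must contain at least one word")
    chars = sum(ch.isalnum() for w in words for ch in w)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def passes_gate(
    text: str,
    target_grade: float,
    max_delta: float = 1.0,                      # arbitrary tolerance, in grade levels
    score_fn: Callable[[str], float] = ari_grade,
) -> bool:
    """True if the scored grade is at most max_delta above the target grade."""
    return score_fn(text) - target_grade <= max_delta

print(passes_gate("The cat is black. The dog is big.", target_grade=0))  # True
print(passes_gate(
    "Demonstrate command of the conventions of standard English grammar "
    "and usage when writing or speaking.",
    target_grade=0,
))  # False
```

Passing the scorer in as `score_fn` keeps the gate independent of any one formula, so an ensemble median can be swapped in without touching the gate logic.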

A few things this study isn’t. Reading level is not pedagogical quality. A grade-7 explanation of a kindergarten standard might still be accurate and useful. This is only OpenAI; Claude and Gemini might do something different. And it’s sixty standards, not the whole CCSS. Take it as the start of the question, not the end.

Full data and report

This post summarizes the key findings. The underlying dataset, seven interactive charts, per-standard breakdowns, and the full statistical report are in the GitHub repo. If you want to dig into the raw numbers or reproduce the pipeline, that is the place to start.

github.com/smnji/grade-level-drift

Reproducibility

60 standards drawn from the CCSS Multi-State frameworks (30 ELA, 30 Math), sampled at random with seed 20260504 from a parent pool of 200. Three OpenAI models (gpt-4.1, gpt-5.4, gpt-5.5), three prompt sizes (short zero-shot, medium with role, long with one-shot exemplar), two wording conditions (original and simplified), 1,080 total cells. Scoring: ensemble median of six readability formulas via textstat, syntactic features via spaCy, no LLM-as-judge. Multiple-comparison correction: Holm-Bonferroni. Total generation cost: approximately $5.
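
For reference, the Holm-Bonferroni step-down procedure is short enough to sketch in full. This is a generic implementation with made-up p-values, not the repo's code, and the function name is hypothetical.

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return one boolean per p-value: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

print(holm_bonferroni([0.01, 0.04, 0.03]))  # [True, False, False]
```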

Full data and reproduction pipeline at github.com/smnji/grade-level-drift:

git clone https://github.com/smnji/grade-level-drift
cd grade-level-drift && cp .env.example .env   # add OPENAI_API_KEY
pip install -r requirements.txt
python -m src.generate --run-id main        # ~$5, ~90 min
python -m src.score    --run-id main
python -m src.report   --run-id main
open reports/main_report.html

Data: CCSS via Learning Commons (CC BY 4.0). Readability stack: textstat (MIT), spaCy (MIT). Word lists: Coxhead AWL, Browne et al. NGSL.

