Bigger AI Models Aren't Always Better: A Concrete Comparison
In a previous post, two models answered the same recipe question: one hallucinated confidently, the other knew when to stop. Readers asked: which one should I actually use?
This post takes one prompt, runs it through two models (one small, one large), and shows the differences. Then it provides a framework for choosing the right model size.
The Demo: Same Prompt, Two Models
The prompt: "I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe for me, give me a shopping list, and a timeline starting from 4pm."
Small model response:
- Quickly gave a shopping list, timeline, and basic adaptations.
- Nothing fancy, but everything asked for.
- For everyday tasks, this is genuinely all you need.
Large model response:
- Added a "Strategy for the vegan guest" section explaining why to make a parallel pot instead of adapting the main dish.
- Provided a timeline starting the night before.
- Separated prep into phases.
- Advised keeping the rice pots separate.
- Showed the scaling math from 4 to 6 servings.
- Gave both oven and stovetop methods.
More thorough, but did you need all that for a Saturday dinner? Maybe not.
Why Models Come in Sizes
A model's "size" is roughly its number of parameters, the learned values it draws on when producing an answer. More parameters generally mean more nuance, but also slower responses and higher cost.
Training and running a big model costs more per question. There are two reasons not to default to the biggest:
- Cost: if you're building something that handles thousands of requests, the difference between small and large is the difference between a reasonable bill and a terrifying one.
- Overthinking: for a simple task, a big model often returns more than you asked for, and takes longer doing it.
Model families like Haiku/Sonnet/Opus or Micro/Lite/Pro are size tiers from the same provider. Same architecture, different capacity.
Tokens and Pricing: How You Actually Pay
Models charge by tokens, not questions. A token is roughly ¾ of a word. For example, "Adapt this recipe for a gluten-free vegan" is 7 words but 9 tokens.
A full page of text ≈ 400 tokens. A million tokens ≈ a 750,000-word book.
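That rule of thumb can be turned into a crude estimator. This is a sketch, not a real tokenizer: real tokenizers split on subwords and counts vary by model (as the next paragraph shows), and `estimate_tokens` is a name invented here for illustration.

```python
# Crude estimator based on the rule of thumb above:
# 1 token ~ 3/4 of a word, i.e. about 4/3 tokens per word.
# NOT a real tokenizer; actual counts vary by model.

def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words * 4 / 3)

# The 7-word example prompt from above:
print(estimate_tokens("Adapt this recipe for a gluten-free vegan"))  # → 9
```

It lands on 9 tokens for the 7-word example, matching the figure above, but treat it only as a budgeting heuristic.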
Surprisingly, different models tokenize the same text differently. Given the same prompt and recipe, the small model counted 6,548 input tokens while the large one counted 16,685.
You get charged on both sides: input tokens (your question plus any context you attach) and output tokens (the answer). Output tokens are almost always priced higher.
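That billing model boils down to one formula. A minimal sketch: the `request_cost` helper is invented here, and the rates passed in are the approximate ones from the table below, not exact quotes.

```python
# Sketch of per-request billing: input and output tokens are
# charged at separate per-million-token rates.

def request_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens / 1_000_000 * input_rate_per_m
            + output_tokens / 1_000_000 * output_rate_per_m)

# The demo's small-model request (6,548 in, 1,900 out)
# at roughly $1/M input and $5/M output:
print(round(request_cost(6_548, 1_900, 1.0, 5.0), 4))  # → 0.016
```

Fractions of a cent per request, which is exactly why the multiplier only matters at volume.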
Real numbers (Amazon Bedrock, Claude family, as of May 2025):
| Model | Size | Input per 1M tokens | Output per 1M tokens |
|---|---|---|---|
| Haiku | Small | ~$1 | ~$5 |
| Sonnet | Medium | ~$3 | ~$15 |
| Opus | Large | ~$5 | ~$25 |
That's a 5x difference from small to large. For an app handling 10,000 requests/day, that 5x multiplier turns into real money.
Where Bigger Is Worse
The numbers from the demo:
- Small model: 18 seconds, ~1,900 output tokens.
- Large model: 44 seconds, ~2,700 output tokens.
- That's roughly 40% more output, 2.4x slower, and ~10x more expensive for a single request.
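Plugging the demo's token counts into the rounded rates from the pricing table reproduces that ~10x figure. A back-of-envelope sketch, using approximate rates rather than exact quotes:

```python
# Back-of-envelope check on the demo numbers, using the rounded
# Bedrock rates from the pricing table (approximations, not quotes).

def request_cost(inp, out, in_rate, out_rate):
    return inp / 1e6 * in_rate + out / 1e6 * out_rate

small = request_cost(6_548, 1_900, 1.0, 5.0)     # Haiku-tier rates
large = request_cost(16_685, 2_700, 5.0, 25.0)   # Opus-tier rates

print(f"small: ${small:.4f}  large: ${large:.4f}  ratio: {large/small:.1f}x")
# At 10,000 requests/day, the gap becomes real money:
print(f"daily: ${small * 10_000:.0f} vs ${large * 10_000:.0f}")
```

Note the gap is wider than the 5x rate difference alone, because the large model also counted more input tokens and wrote a longer answer.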
For a Saturday dinner, this is overkill. If building an app answering recipe questions for thousands of users, you'd pay for all that extra thinking on every request.
How to Actually Choose
First: Cost. That 5x difference per token, plus the large model generating 40% more tokens per response, compounds fast. Cost decides what's even on the table.
Then three questions:
- How complex is the task? Summarizing email? Small. Writing a legal brief? Large. Adapting a recipe? Medium.
- How many times will you run it? One question? Use whatever. App serving thousands? Start small, upgrade only when quality isn't enough.
- What are the stakes? Wrong answer ruins dinner? Low stakes. Wrong financial processing costs millions? High stakes—bigger model plus verification.
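Those three questions can be sketched as a routing function. Everything here is hypothetical, the tier names and thresholds alike; real routing would depend on your own quality benchmarks.

```python
# Hypothetical sketch of "start small, justify up" routing.
# Tier names and thresholds are made up for illustration.

def pick_model(complexity: str, requests_per_day: int, high_stakes: bool) -> str:
    """Map the three questions (complexity, volume, stakes) to a size tier."""
    if high_stakes:
        return "large"   # big model, plus verification downstream
    if complexity == "simple":
        return "small"
    if complexity == "complex":
        # At volume, check whether medium is good enough before paying for large.
        return "large" if requests_per_day < 1_000 else "medium"
    # Medium complexity: stay small at high volume, otherwise medium.
    return "small" if requests_per_day >= 10_000 else "medium"

print(pick_model("simple", 10_000, False))   # → small
print(pick_model("complex", 100, True))      # → large
```

The point isn't these particular thresholds; it's that the decision is mechanical enough to write down, which forces you to justify each upgrade.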
Picking a Provider
Pick the one available where you already work. On AWS, Bedrock gives you access to a wide range of model families in one place. The concepts in this post apply regardless of provider.
Models vs. Products: Claude is a model. Claude inside Kiro (a coding IDE) behaves differently from Claude in the Bedrock Playground or on claude.ai. Same brain, different job description.
Try It Yourself
- Beginners: Models come in sizes. For most everyday tasks, a medium model is the sweet spot. Try a few.
- Builders: Start with the smallest model that gives acceptable quality. Only upgrade when you can point to a specific failure the bigger model fixes. Start small, justify up. Use different models for different parts of the same system.
What's Next
The next post will discuss why models forget what you told them. This is part of the "Learning AI Out Loud" series.



