Bigger AI Models Aren't Always Better: A Concrete Comparison

In a previous post, two models answered the same recipe question: one hallucinated confidently, the other knew when to stop. Readers asked: which one should I actually use?

This post takes one prompt, runs it through two models (one small, one large), and shows the differences. Then it provides a framework for choosing the right model size.

The Demo: Same Prompt, Two Models

The prompt: "I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe for me, give me a shopping list, and a timeline starting from 4pm."

Small model response:

Large model response:

More thorough, but did you need all that for a Saturday dinner? Maybe not.

Why Models Come in Sizes

A model's "size" is roughly its number of parameters, the learned weights it applies when producing an answer. More parameters generally mean more nuance, but also slower and costlier responses.

Training and running a big model costs more per question. Two reasons not to always use the biggest:

  1. Cost: if you're building something that handles thousands of requests, the difference between small and large is the difference between a reasonable bill and a terrifying one.
  2. Overthinking: for simple tasks, a big model can return more than you asked for and take longer doing it.

Model families like Haiku/Sonnet/Opus or Micro/Lite/Pro are size tiers from the same provider. Same architecture, different capacity.

Tokens and Pricing: How You Actually Pay

Models charge by tokens, not questions. A token is roughly ¾ of a word. For example, "Adapt this recipe for a gluten-free vegan" is 7 words but 9 tokens.

A full page of text ≈ 400 tokens. A million tokens ≈ a 750,000-word book.
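These rules of thumb are easy to turn into a quick estimator. A rough sketch, assuming the ¾-word-per-token ratio and 400-tokens-per-page figures above (real tokenizers vary by model, as the next paragraph shows):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~3/4 of a word per token, i.e. 4 tokens per 3 words."""
    words = len(text.split())
    return round(words * 4 / 3)

def estimate_pages(tokens: int, tokens_per_page: int = 400) -> float:
    """Convert a token count to pages, at ~400 tokens per full page."""
    return tokens / tokens_per_page

prompt = "Adapt this recipe for a gluten-free vegan"
print(estimate_tokens(prompt))     # 7 words -> about 9 tokens
print(estimate_pages(1_000_000))   # a million tokens -> 2500.0 pages
```

Treat the output as a ballpark only; the next paragraph shows how far real tokenizers can diverge.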

Surprisingly, different models tokenize the same text differently. With the same prompt and recipe, the small model counted 6,548 input tokens while the large one counted 16,685.

You get charged twice: once for input tokens (your question) and once for output tokens (the answer). Output tokens are typically priced several times higher than input tokens.
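That two-part billing can be written down directly. A minimal sketch; the token counts below are hypothetical, and the prices are the approximate small-tier rates from the table in this section:

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request: input and output tokens are billed at separate rates."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical request: 6,500 input tokens, 1,200 output tokens,
# at roughly small-tier prices (~$1 in / ~$5 out per 1M tokens).
cost = request_cost(6_500, 1_200, 1.0, 5.0)
print(f"${cost:.4f}")  # $0.0125
```

Notice that the 1,200 output tokens cost almost as much as the 6,500 input tokens; the output rate dominates even when the answer is short.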

Real numbers (Amazon Bedrock, Claude family, as of May 2025):

  Model    Size     Input per 1M tokens    Output per 1M tokens
  Haiku    Small    ~$1                    ~$5
  Sonnet   Medium   ~$3                    ~$15
  Opus     Large    ~$5                    ~$25

That's a 5x difference from small to large. For an app handling 10,000 requests/day, that 5x multiplier turns into real money.
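To see how that multiplier turns into real money, multiply the per-request cost out. A sketch, assuming a hypothetical 7,000 input / 1,500 output tokens per request at the approximate prices above:

```python
def monthly_cost(requests_per_day, in_tok, out_tok, in_price, out_price, days=30):
    """Monthly bill: per-request cost times request volume."""
    per_request = (in_tok / 1e6) * in_price + (out_tok / 1e6) * out_price
    return per_request * requests_per_day * days

# 10,000 requests/day, hypothetical 7,000 input / 1,500 output tokens each.
small = monthly_cost(10_000, 7_000, 1_500, 1.0, 5.0)    # ~small-tier prices
large = monthly_cost(10_000, 7_000, 1_500, 5.0, 25.0)   # ~large-tier prices
print(f"small: ${small:,.0f}/mo, large: ${large:,.0f}/mo")
# small: $4,350/mo, large: $21,750/mo
```

With both rates 5x higher, the whole bill scales by 5x: the gap between the two tiers here is over $17,000 a month before you even account for the large model writing longer answers.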

Where Bigger Is Worse

Large model response metrics:

For a Saturday dinner, this is overkill. If you were building an app that answers recipe questions for thousands of users, you'd be paying for all that extra thinking on every request.

How to Actually Choose

First, cost. That 5x difference per token, plus the large model generating about 40% more tokens per response, compounds fast. Cost decides what's even on the table.

Then three questions:

  1. How complex is the task? Summarizing email? Small. Writing a legal brief? Large. Adapting a recipe? Medium.
  2. How many times will you run it? One question? Use whatever. App serving thousands? Start small, upgrade only when quality isn't enough.
  3. What are the stakes? Wrong answer ruins dinner? Low stakes. Wrong financial processing costs millions? High stakes—bigger model plus verification.
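The three questions above can be sketched as a simple routing rule. A toy illustration of the framework, not a production policy; the tier names, inputs, and thresholds are all assumptions:

```python
def pick_tier(complexity: str, requests_per_day: int, high_stakes: bool) -> str:
    """Toy router over the three questions: complexity, volume, stakes."""
    if high_stakes:
        return "large"      # high stakes: bigger model, plus verification on top
    if complexity == "high":
        return "large"      # e.g. a legal brief
    if complexity == "low":
        return "small"      # e.g. summarizing an email
    # Medium complexity: at scale, start small and upgrade only if
    # quality falls short; for a one-off, a medium tier is fine.
    return "small" if requests_per_day > 1_000 else "medium"

print(pick_tier("low", 10_000, False))   # small
print(pick_tier("medium", 1, False))     # medium
print(pick_tier("high", 1, False))       # large
```

The point is not the exact thresholds but the ordering: stakes override everything, then task complexity, and volume only breaks the tie in the middle.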

Picking a Provider

Pick the provider that's available where you already work. On AWS, you have access to many model families through Bedrock. The concepts are the same everywhere.

Models vs. Products: Claude is a model. Claude inside Kiro (a coding IDE) behaves differently from Claude in the Bedrock Playground or on claude.ai. Same brain, different job description.

Try It Yourself

What's Next

The next post will discuss why models forget what you told them. This is part of the "Learning AI Out Loud" series.