Bigger AI Models Aren't Always Better: A Concrete Comparison
In a previous post, two models answered the same recipe question: one hallucinated confidently, the other knew when to stop. Readers asked: which one should I actually use?
This post takes one prompt, runs it through two models (one small, one large), and shows the differences. Then it provides a framework for choosing the right model size.
The Demo: Same Prompt, Two Models
The prompt: "I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe for me, give me a shopping list, and a timeline starting from 4pm."
Small model response:
- Quickly gave a shopping list, timeline, and basic adaptations.
- Nothing fancy, but everything asked for.
- For everyday tasks, this is genuinely all you need.
Large model response:
- Added a "Strategy for the vegan guest" section explaining why to make a parallel pot instead of adapting the main dish.
- Provided a timeline starting the night before.
- Separated prep into phases.
- Advised keeping the rice pots separate.
- Showed the scaling math from 4 to 6 servings.
- Gave both oven and stovetop methods.
More thorough, but did you need all that for a Saturday dinner? Maybe not.
Why Models Come in Sizes
A model's "size" is roughly its number of parameters, the learned values it draws on when producing an answer. More parameters generally mean more nuance, but also slower responses and higher cost.
Training and running a big model costs more per question. There are two reasons not to default to the biggest:
- Cost: if you're building something that handles thousands of requests, the difference between small and large is the difference between a reasonable bill and a terrifying one.
- Overthinking: for a simple task, a big model often returns more than you asked for, and takes longer doing it.
Model families like Haiku/Sonnet/Opus or Micro/Lite/Pro are size tiers from the same provider. Same architecture, different capacity.
Tokens and Pricing: How You Actually Pay
Models charge by tokens, not questions. A token is roughly ¾ of a word. For example, "Adapt this recipe for a gluten-free vegan" is 7 words but 9 tokens.
A full page of text ≈ 400 tokens. A million tokens ≈ a 750,000-word book.
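That rule of thumb can be turned into a crude estimator. This is a sketch, not a real tokenizer: real tokenizers split on subwords and counts vary by model (as the next paragraph shows), and `estimate_tokens` is a name invented here for illustration.

```python
# Crude estimator based on the rule of thumb above:
# 1 token ~ 3/4 of a word, i.e. about 4/3 tokens per word.
# NOT a real tokenizer; actual counts vary by model.

def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words * 4 / 3)

# The 7-word example prompt from above:
print(estimate_tokens("Adapt this recipe for a gluten-free vegan"))  # → 9
```

It lands on 9 tokens for the 7-word example, matching the figure above, but treat it only as a budgeting heuristic.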
Surprisingly, different models tokenize the same text differently. Given the same prompt and recipe, the small model counted 6,548 input tokens while the large one counted 16,685.
You get charged on both sides: input tokens (your question plus any context you attach) and output tokens (the answer). Output tokens are almost always priced higher.
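That billing model boils down to one formula. A minimal sketch: the `request_cost` helper is invented here, and the rates passed in are the approximate ones from the table below, not exact quotes.

```python
# Sketch of per-request billing: input and output tokens are
# charged at separate per-million-token rates.

def request_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens / 1_000_000 * input_rate_per_m
            + output_tokens / 1_000_000 * output_rate_per_m)

# The demo's small-model request (6,548 in, 1,900 out)
# at roughly $1/M input and $5/M output:
print(round(request_cost(6_548, 1_900, 1.0, 5.0), 4))  # → 0.016
```

Fractions of a cent per request, which is exactly why the multiplier only matters at volume.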
Real numbers (Amazon Bedrock, Claude family, as of May 2025):
| Model | Size | Input per 1M tokens | Output per 1M tokens |
|---|---|---|---|
| Haiku | Small | ~$1 | ~$5 |
| Sonnet | Medium | ~$3 | ~$15 |
| Opus | Large | ~$5 | ~$25 |
That's a 5x difference from small to large. For an app handling 10,000 requests/day, that 5x multiplier turns into real money.
Where Bigger Is Worse
The numbers from the demo:
- Small model: 18 seconds, ~1,900 output tokens.
- Large model: 44 seconds, ~2,700 output tokens.
- That's roughly 40% more output, 2.4x slower, and ~10x more expensive for a single request.
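Plugging the demo's token counts into the rounded rates from the pricing table reproduces that ~10x figure. A back-of-envelope sketch, using approximate rates rather than exact quotes:

```python
# Back-of-envelope check on the demo numbers, using the rounded
# Bedrock rates from the pricing table (approximations, not quotes).

def request_cost(inp, out, in_rate, out_rate):
    return inp / 1e6 * in_rate + out / 1e6 * out_rate

small = request_cost(6_548, 1_900, 1.0, 5.0)     # Haiku-tier rates
large = request_cost(16_685, 2_700, 5.0, 25.0)   # Opus-tier rates

print(f"small: ${small:.4f}  large: ${large:.4f}  ratio: {large/small:.1f}x")
# At 10,000 requests/day, the gap becomes real money:
print(f"daily: ${small * 10_000:.0f} vs ${large * 10_000:.0f}")
```

Note the gap is wider than the 5x rate difference alone, because the large model also counted more input tokens and wrote a longer answer.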
For a Saturday dinner, this is overkill. If building an app answering recipe questions for thousands of users, you'd pay for all that extra thinking on every request.
How to Actually Choose
First: Cost. That 5x difference per token, plus the large model generating 40% more tokens per response, compounds fast. Cost decides what's even on the table.
Then three questions:
- How complex is the task? Summarizing email? Small. Writing a legal brief? Large. Adapting a recipe? Medium.
- How many times will you run it? One question? Use whatever. App serving thousands? Start small, upgrade only when quality isn't enough.
- What are the stakes? Wrong answer ruins dinner? Low stakes. Wrong financial processing costs millions? High stakes—bigger model plus verification.
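Those three questions can be sketched as a routing function. Everything here is hypothetical, the tier names and thresholds alike; real routing would depend on your own quality benchmarks.

```python
# Hypothetical sketch of "start small, justify up" routing.
# Tier names and thresholds are made up for illustration.

def pick_model(complexity: str, requests_per_day: int, high_stakes: bool) -> str:
    """Map the three questions (complexity, volume, stakes) to a size tier."""
    if high_stakes:
        return "large"   # big model, plus verification downstream
    if complexity == "simple":
        return "small"
    if complexity == "complex":
        # At volume, check whether medium is good enough before paying for large.
        return "large" if requests_per_day < 1_000 else "medium"
    # Medium complexity: stay small at high volume, otherwise medium.
    return "small" if requests_per_day >= 10_000 else "medium"

print(pick_model("simple", 10_000, False))   # → small
print(pick_model("complex", 100, True))      # → large
```

The point isn't these particular thresholds; it's that the decision is mechanical enough to write down, which forces you to justify each upgrade.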
Picking a Provider
Pick the one available where you already work. On AWS, Bedrock gives you access to a wide range of model families in one place. The concepts in this post apply regardless of provider.
Models vs. Products: Claude is a model. Claude inside Kiro (a coding IDE) behaves differently from Claude in the Bedrock Playground or on claude.ai. Same brain, different job description.
Try It Yourself
- Beginners: Models come in sizes. For most everyday tasks, a medium model is the sweet spot. Try a few.
- Builders: Start with the smallest model that gives acceptable quality. Only upgrade when you can point to a specific failure the bigger model fixes. Start small, justify up. Use different models for different parts of the same system.
What's Next
The next post will discuss why models forget what you told them. This is part of the "Learning AI Out Loud" series.



