Ollama num_ctx: Why Setting It Higher Than the Model Supports Backfires


When running local LLMs with Ollama, you can set num_ctx to control the context window size. But there’s a ceiling you might not expect.

The Gotcha

Every model has an architectural limit baked into its training. Setting num_ctx higher than that limit doesn’t give you more context — it gives you garbage output or silent truncation:

# This model was trained with 8K context
ollama run llama3
>>> /set parameter num_ctx 32768
# Result: degraded output beyond 8K, not extended context

The num_ctx parameter allocates memory for the KV cache, but the model’s positional embeddings only know how to handle positions it saw during training.
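In practice, num_ctx is set per request (via the API options), in a Modelfile, or interactively, rather than as a run flag. Here is a minimal sketch using Ollama's REST API; it assumes the default endpoint at localhost:11434, and the model name and prompt are placeholders.

# Minimal sketch: setting num_ctx per request via Ollama's REST API.
# Assumes the default endpoint; "llama3" is a placeholder model name.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize this changelog: ...",
        "stream": False,
        # num_ctx sizes the KV cache for this request; keep it at or
        # below the model's trained context length.
        "options": {"num_ctx": 8192},
    },
)
print(resp.json()["response"])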

How to Check the Real Limit

# Check the model metadata for the trained context length
ollama show llama3
# Look for "context length" under the Model section

# The Modelfile only lists num_ctx if it was explicitly overridden
ollama show llama3 --modelfile | grep num_ctx

The model card or GGUF metadata will tell you the trained context length. That’s your actual ceiling.
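You can also read the trained limit programmatically. Below is a hedged sketch against the /api/show endpoint; the metadata key is architecture-dependent (for example llama.context_length), llama3 is a placeholder, and older Ollama versions expect "name" instead of "model" in the request body.

# Hedged sketch: reading the trained context length from /api/show.
# The exact key depends on the architecture reported in general.architecture.
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3"},
).json()

arch = info["model_info"]["general.architecture"]           # e.g. "llama"
trained_ctx = info["model_info"][f"{arch}.context_length"]  # e.g. 8192
print(f"Trained context length: {trained_ctx}")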

What About YaRN and RoPE Scaling?

Some models support extended context through YaRN (Yet another RoPE extensioN) or other RoPE scaling methods. These are baked into the model weights during fine-tuning — you can’t just enable them with a flag.

If a model advertises 128K context, it was trained or fine-tuned with RoPE scaling to handle that. If it advertises 8K, setting num_ctx=128000 won’t magically give you 128K.
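If you want to see whether a model ships with RoPE scaling baked in, the GGUF metadata is the place to look. The sketch below assumes the conventional {arch}.rope.* key names, which vary by model and are often absent, so treat an empty result as "no scaling advertised" rather than proof.

# Hedged sketch: listing any RoPE-related metadata a model exposes.
# Key names follow GGUF conventions ({arch}.rope.*) and vary by model.
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3"},
).json()

arch = info["model_info"]["general.architecture"]
rope_keys = {k: v for k, v in info["model_info"].items()
             if k.startswith(f"{arch}.rope.")}
print(rope_keys or "No RoPE scaling metadata reported")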

The Rule

Match num_ctx to what the model actually supports. Going lower saves memory. Going higher wastes memory and produces worse output. Check the model card, not your wishful thinking.
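One way to enforce the rule is to clamp whatever context you would like to what the model actually supports. This is a hypothetical helper, not an Ollama feature; it reuses the /api/show lookup from the earlier sketch and the same placeholder model name.

# Hypothetical helper: clamp a requested num_ctx to the model's trained limit.
import requests

OLLAMA = "http://localhost:11434"

def trained_context_length(model: str) -> int:
    """Read the trained context length from the model's metadata."""
    info = requests.post(f"{OLLAMA}/api/show", json={"model": model}).json()
    arch = info["model_info"]["general.architecture"]
    return int(info["model_info"][f"{arch}.context_length"])

def generate(model: str, prompt: str, requested_ctx: int) -> str:
    # Never ask for more context than the model was trained to handle.
    num_ctx = min(requested_ctx, trained_context_length(model))
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False,
              "options": {"num_ctx": num_ctx}},
    )
    return resp.json()["response"]

print(generate("llama3", "Hello!", requested_ctx=32768))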

