From 4K to 1M Tokens: The Technical Journey of Long-Context LLMs

JC
January 30, 2026 at 12:14 AM
5 min read

Imagine the difference between reading a single chapter and reading an entire book in one sitting. That’s the leap large language models have made in just a few years — from GPT-3’s 2K tokens to today’s models handling over a million. Behind it lie fundamental algorithmic and engineering breakthroughs that make processing entire codebases, legal documents, and multi-hour conversations possible.

📚 What You’ll Learn

  1. The Core Problem — Why context length was limited and the quadratic attention bottleneck
  2. Architectural Breakthroughs — Flash Attention, sparse patterns, and alternative architectures
  3. The KV Cache Challenge — The hidden inference bottleneck and solutions like GQA
  4. Quality vs. Quantity — The “lost in the middle” problem and what it means
  5. State of the Art — Comparing today’s leading models
  6. What’s Next — Future directions in long-context AI

1. The Core Problem: Quadratic Attention

The Bottleneck

Standard transformer self-attention computes relationships between every pair of tokens:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

For a sequence of length n:

  • Memory: O(n²) for attention scores
  • Compute: O(n²d) operations

Real-world impact:

  • 4K tokens → ~16 million attention scores
  • 100K tokens → ~10 billion attention scores (625x increase!)
  • 100K context at fp16 → ~20GB for a single attention score matrix (one head, one layer)

This made long contexts impractical with naive implementations.
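The numbers above are easy to verify with quick arithmetic (a minimal sketch; it assumes 2-byte fp16 scores and counts one attention matrix, i.e. a single head in a single layer):

```python
def attention_score_memory(n_tokens, bytes_per_score=2):
    """Memory for one n x n attention score matrix (one head, one layer)."""
    return n_tokens ** 2 * bytes_per_score

scores_4k = 4_000 ** 2        # ~16 million scores
scores_100k = 100_000 ** 2    # ~10 billion scores
print(scores_100k // scores_4k)                 # 625x increase
print(attention_score_memory(100_000) / 1e9)    # ~20 GB per matrix
```

Multiply that 20 GB by every head and every layer, and the naive approach collapses immediately.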

2. Architectural Breakthroughs

Flash Attention: The Memory Revolution

Core insight: Don’t materialize the full attention matrix in GPU memory.

How it works:

  • Tile-based computation using fast GPU SRAM instead of slow HBM
  • Fuses operations to avoid storing intermediate results
  • Recomputes during backward pass

Impact:

  • 2–4x speedup
  • Memory drops from O(n²) to O(n)
  • Enabled 10x longer sequences on same hardware

# Standard: creates a massive [n, n] score matrix
attention_scores = Q @ K.T  # doesn't scale!

# Flash Attention: block-wise, never stores the full matrix
for block_q in Q_blocks:
    for block_k, block_v in zip(K_blocks, V_blocks):
        block_output = compute_attention_tile(block_q, block_k, block_v)
        accumulate(block_output)  # O(n) memory overall

Sparse Attention Patterns

Philosophy: Not all tokens need to attend to all others.

Sliding Window Attention (Mistral, Longformer)

  • Each token attends only to nearby tokens (e.g., window of 4096)
  • Complexity: O(n × W) instead of O(n²)
  • Trade-off: Local context preserved, limited long-range dependencies

Other patterns:

  • Strided attention — Attend to every k-th token for global context
  • Random attention (BigBird) — Local + random + global tokens
  • All maintain O(n) or O(n log n) complexity
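To make the sliding-window idea concrete, the attention mask can be built in a few lines (a sketch using plain Python lists; real implementations fuse this into the attention kernel rather than materializing a mask):

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True iff token i may attend to token j:
    causal (j <= i) and within the last `window` positions."""
    return [[max(0, i - window + 1) <= j <= i for j in range(n)]
            for i in range(n)]

mask = sliding_window_mask(6, window=3)
# Token 5 attends only to tokens 3, 4, 5 -- each row has at most
# `window` True entries, so total work is O(n * W), not O(n^2).
```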

Positional Encoding: RoPE & Interpolation

Problem: Traditional position embeddings don’t extrapolate beyond training length.

RoPE (Rotary Position Embeddings):

  • Encodes relative positions through rotation matrices
  • Extends to longer sequences more gracefully than learned absolute embeddings (though going far beyond the training length still needs interpolation or scaling)
  • Used in Llama, Mistral, most modern LLMs

Position Interpolation:

  • Simple technique to extend context 10–25x
  • Compress position indices to fit within trained range
  • Enabled Llama 2 to go from 4K → 32K+ with minimal fine-tuning
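Position interpolation itself is essentially a one-liner: rescale every position index so the longest new position maps back inside the trained range (a sketch; the 4096 and 32768 values are illustrative):

```python
def interpolate_positions(positions, trained_len, target_len):
    """Compress positions from [0, target_len) into [0, trained_len)."""
    scale = trained_len / target_len   # e.g. 4096 / 32768 = 0.125
    return [p * scale for p in positions]

# A token at raw position 32767 in a 32K context is presented to the
# model as position ~4095.9 -- inside the 4K range it was trained on.
compressed = interpolate_positions([0, 16384, 32767], 4096, 32768)
```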

State Space Models: A Different Paradigm

Mamba, RWKV — Radical departure from attention:

h_t = A * h_{t-1} + B * x_t  # recurrent state update
y_t = C * h_t                # output

Key properties:

  • Linear complexity: O(n) vs O(n²)
  • Constant memory: Fixed-size state regardless of sequence length
  • Parallelizable training: Can be formulated as convolutions

Trade-off: Different inductive biases; still being researched for quality parity with transformers.
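The recurrence above runs in one O(n) pass with a fixed-size state. A scalar-state sketch makes this tangible (toy values for A, B, C, not a trained model):

```python
def ssm_scan(xs, A=0.9, B=1.0, C=0.5):
    """Linear state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    Memory stays constant no matter how long `xs` is."""
    h, ys = 0.0, []
    for x in xs:
        h = A * h + B * x    # recurrent state update
        ys.append(C * h)     # output
    return ys

# An impulse input decays geometrically through the state:
ys = ssm_scan([1.0, 0.0, 0.0])  # 0.5, 0.45, 0.405
```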


Hybrid models (Jamba) combine attention + SSM layers for best of both worlds.

3. The KV Cache Challenge

Training-time efficiency is largely solved, but inference has a different bottleneck: the Key-Value (KV) cache.

What is KV Caching?

Transformers cache previous key/value computations to avoid recomputation:

# With a KV cache, each step computes keys/values for the new token only
K_new = compute_keys(new_token_only)
K_cache = concatenate(K_cache, K_new)  # just append
attention = softmax(Q @ K_cache.T / sqrt(d_k)) @ V_cache

The Memory Problem

KV cache = 2 × layers × kv_heads × head_dim × seq_length × precision
Example (typical 7B model, 100K tokens):
= 2 × 32 × 32 × 128 × 100,000 × 2 bytes
≈ 52 GB just for cache!

At 100K+ tokens, KV cache can exceed the model weights in memory consumption.
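The arithmetic above generalizes into a small helper (a sketch; the parameter values mirror the typical-7B example):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=100_000)
print(size / 1e9)  # ~52.4 GB with full multi-head attention
# Sharing KV heads (e.g. 8 instead of 32) shrinks the same cache 4x
```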

Solutions

Grouped Query Attention (GQA) — Llama 2/3, Mistral

  • Share KV heads across multiple query heads
  • 4–8x KV cache reduction with minimal quality loss
# Standard: 32 query heads, 32 KV heads
# GQA: 32 query heads, 8 KV heads → 4x smaller cache
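The sharing amounts to a simple head-index mapping (illustrative only; production implementations instead repeat the KV tensors along the head axis):

```python
def query_to_kv_head(q_head, n_q_heads=32, n_kv_heads=8):
    """GQA: each group of n_q_heads // n_kv_heads query heads
    shares a single KV head."""
    group_size = n_q_heads // n_kv_heads   # 4 query heads per KV head
    return q_head // group_size

# Query heads 0-3 all read KV head 0; heads 4-7 read KV head 1; ...
mapping = [query_to_kv_head(h) for h in range(32)]
```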

Multi-Query Attention (MQA) — Even more aggressive

  • Single KV head shared across all queries
  • Maximum savings, potential quality trade-off

PagedAttention (vLLM)

  • Treat KV cache like OS virtual memory
  • Break into fixed-size pages, can be non-contiguous
  • Share pages between requests with common prefixes
  • 2–4x better throughput in production
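The paging idea can be sketched with a tiny block table (hypothetical page size and helper names, not vLLM’s actual API):

```python
PAGE_SIZE = 16  # tokens per KV page (illustrative)

def pages_needed(seq_len, page_size=PAGE_SIZE):
    """Fixed-size KV pages for a sequence; the last page may be partial."""
    return (seq_len + page_size - 1) // page_size

def token_location(pos, block_table, page_size=PAGE_SIZE):
    """Map a token position to (physical page, offset). Pages need not be
    contiguous, so block_table entries can point anywhere in the pool."""
    return block_table[pos // page_size], pos % page_size

# A 100-token sequence needs 7 pages; two requests sharing a 32-token
# prefix can point their first two block-table entries at the same pages.
block_table = [40, 41, 7, 8, 9, 10, 11]   # arbitrary physical page ids
```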

Cache Compression

  • StreamingLLM: Keep first/recent tokens, evict middle
  • H2O: Track and keep “important” KV pairs based on attention patterns
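StreamingLLM’s eviction policy is simple to state: keep the first few “attention sink” tokens plus a recent window, and evict everything in between (a sketch over token indices; `n_sink` and `window` are the tunable knobs):

```python
def streaming_llm_keep(seq_len, n_sink=4, window=1000):
    """Indices of KV entries retained: initial sink tokens + recent window."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))          # nothing to evict yet
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

kept = streaming_llm_keep(100_000)
# Cache stays at n_sink + window entries however long the stream runs
```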

4. The “Lost in the Middle” Problem

Critical finding: Longer context ≠ uniformly better performance.

Needle-in-Haystack Benchmark

Insert a fact at different positions in long context, measure retrieval accuracy:

Position 0-10%:    90%+ accuracy   ✅
Position 20-80%:   50-70% accuracy ⚠️
Position 90-100%:  85%+ accuracy   ✅

U-shaped curve — models struggle with information in the middle.

Why This Happens

  • Training data bias (important info often at start/end)
  • Attention dilution across very long contexts
  • Position embedding artifacts

What Works

Models improving:

  • Claude 2.1+ — Near-perfect retrieval across all positions
  • GPT-4 Turbo — Strong improvements
  • Gemini 1.5 — Good but still shows some degradation

Practical strategies:

  • Place critical information at beginning/end when possible
  • Use explicit instructions: “Pay attention to details throughout”
  • Consider hybrid retrieval + LLM approaches for critical tasks
  • Test thoroughly across different positions

5. Current State of the Art (2025)

Leading Models

[Image: comparison table of today’s leading long-context models]

When to Use Long Context vs. RAG

Use long context when:

  • Analyzing entire documents requiring full context
  • Multi-turn conversations needing complete history
  • Cross-document reasoning
  • Codebase understanding

Use RAG when:

  • Cost-sensitive applications
  • Need explicit citations/sources
  • Well-structured knowledge bases
  • Specific information retrieval

Cost reality: 100K context ≈ 10–20x more expensive than 4K due to compute and KV cache.

6. What’s Next

Emerging Directions

Infinite context approaches:

  • Learned compression of old context
  • Hierarchical memory (recent/important/archived tiers)
  • External memory integration with on-demand retrieval

Quality improvements:

  • Uniform attention across all positions
  • Faster inference at long context
  • Better reasoning over long dependencies

Architectural innovation:

  • Hybrid transformers + SSMs
  • Adaptive context compression
  • Dynamic attention budget allocation

Multimodal long-context:

  • Hours of video + transcripts + documents
  • Multiple meeting recordings with context
  • Code + documentation + issue history

Hardware Co-Design

Next frontier: Specialized chips optimized for long-context operations, better quantization techniques, and memory hierarchies designed for massive KV caches.

Conclusion

The journey from 4K to 1M+ tokens involved breakthroughs across multiple dimensions:

Solved problems:

  • ✅ Flash Attention conquered the memory wall
  • ✅ Sparse patterns made computation tractable
  • ✅ GQA/MQA addressed KV cache bottleneck
  • ✅ RoPE enabled length extrapolation
  • ✅ SSMs offered alternative O(n) architectures

Remaining challenges:

  • ⚠️ Cost at scale
  • ⚠️ Uniform quality across entire context
  • ⚠️ Finding optimal context length for each task
  • ⚠️ Making it practical for production systems

For practitioners: The tools exist. Experiment, but choose wisely — longer context is powerful but not always optimal. Consider your use case, costs, and quality requirements.

For researchers: Enormous opportunities remain in attention mechanisms, training techniques, inference optimization, and entirely new paradigms.

The 1M token barrier is broken. The next frontier is making it practical, cost-effective, and reliably high-quality.

Key References

  • Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention
  • Liu et al. (2023) — Lost in the Middle: How Language Models Use Long Contexts
  • Gu & Dao (2023) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  • Ainslie et al. (2023) — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  • Kwon et al. (2023) — Efficient Memory Management for LLM Serving with PagedAttention