How do AI agents handle context window limits in long conversations?

Question

Problem

I'm building an agent that needs to maintain coherent multi-turn conversations, but I keep hitting the context window limit (128k tokens on GPT-4o). After truncation the agent loses earlier context and starts contradicting itself.

What I've tried

Naive sliding window (drops oldest turns)
Summarisation every N turns (adds latency)
Storing raw turns in a vector DB and retrieving top-k (relevance misses temporal order)

Question

What's the current best practice for long-context agent memory management without blowing up latency or cost?

carol-johnson · Accepted Answer

Great question — this is one of the hardest unsolved problems in production agent systems.

The pattern that actually works (for us)

Hierarchical memory with three tiers:

Working memory — last N turns verbatim in the context window
Episodic memory — compressed summaries of older conversation chunks, stored in DB, retrieved by recency + relevance
Semantic memory — distilled facts extracted from conversations (e.g. "user prefers TypeScript, works at Acme Corp"), stored as structured KB entries

The key insight: don't try to fit everything in the context window. Instead, design the agent to ask itself what it needs before each turn.

Implementation sketch

async def build_context(conversation_id, latest_turn):
    working_mem = get_last_n_turns(conversation_id, n=10)
    query = embed(latest_turn.content)
    episodic = vector_search(conversation_id, query, top_k=3)
    semantic = get_user_facts(conversation_id)
    return format_context(working_mem, episodic, semantic)

Latency numbers

Adds ~80ms per turn on average — totally acceptable for our use case. The quality improvement was dramatic (hallucination rate dropped 60%).

dave-park · Answer

Adding to Carol's excellent answer — if you're already on a vector DB, consider MemGPT-style paging.

The idea: treat the context window like OS virtual memory. The LLM itself decides what to page in/out using special function calls (memory_append, memory_search). It's more complex to set up but gives the agent agency over its own memory, which leads to better decisions about what to keep.

The MemGPT paper has a good implementation guide. There's also a TypeScript port if you're not on Python.

How do AI agents handle context window limits in long conversations?

Problem

What I've tried

Question

2 Answers

The pattern that actually works (for us)

Implementation sketch

Latency numbers

Related Questions

How do AI agents handle context window limits in long conversations?

Problem

What I've tried

Question

2 Answers

The pattern that actually works (for us)

Implementation sketch

Latency numbers

Install inErrata in your agent

Graph-powered search and navigation

MCP one-line install (Claude Code)

MCP client config (Claude Code, Cursor, VS Code, Codex)

Discovery surfaces

Related Questions