Middlewares
Middlewares wrap every LLM turn in the Lead Agent. They are the primary extension point for adding cross-cutting behaviors like memory, summarization, clarification, and token tracking.
Every time the Lead Agent calls the LLM, it runs through a middleware chain before and after the model call. Middlewares can read and modify the agent’s state, inject content into the system prompt, intercept tool calls, and react to model outputs.
This design keeps the agent core simple and stable while allowing rich, composable behaviors to be layered in.
How the chain works
The middleware chain is built once per agent invocation, based on the current configuration and request parameters. The middlewares run in a defined order:
- Runtime middlewares (error handling, thread data, uploads, dangling tool call patching)
- SummarizationMiddleware — context compression (if enabled)
- TodoMiddleware — task list management (plan mode only)
- TokenUsageMiddleware — token tracking (if enabled)
- TitleMiddleware — automatic thread title generation
- MemoryMiddleware — cross-session memory injection and queuing
- ViewImageMiddleware — image details injection (if model supports vision)
- DeferredToolFilterMiddleware — hides deferred tool schemas (if tool search enabled)
- SubagentLimitMiddleware — limits parallel subagent calls (if subagents enabled)
- LoopDetectionMiddleware — breaks repetitive tool call loops
- Custom middlewares (if any)
- ClarificationMiddleware — intercepts clarification requests (always last)
The ordering is significant. Summarization runs early to reduce context before other processing. Clarification always runs last so it can intercept after all other middlewares have had their turn.
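The chain described above can be sketched as follows. This is an illustrative model only: the `on_start`/`on_end` hook names mirror the custom-middleware interface shown later on this page, and running `on_end` in the same order as `on_start` is an assumption, not documented behavior.

```python
import asyncio

class LoggingMiddleware:
    """Toy middleware that records when its hooks run."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    async def on_start(self, state, config):
        # Runs before the model call; may read or modify state/config.
        self.log.append(f"{self.name}:start")
        return state, config

    async def on_end(self, state, config):
        # Runs after the model call; may react to the model's output.
        self.log.append(f"{self.name}:end")
        return state, config

async def run_turn(middlewares, call_model, state, config):
    """One LLM turn: every middleware runs before and after the model call."""
    for mw in middlewares:
        state, config = await mw.on_start(state, config)
    state = await call_model(state, config)
    for mw in middlewares:
        state, config = await mw.on_end(state, config)
    return state

log = []
chain = [LoggingMiddleware("summarize", log), LoggingMiddleware("clarify", log)]

async def fake_model(state, config):
    log.append("model")
    return state

asyncio.run(run_turn(chain, fake_model, {}, {}))
# log: ['summarize:start', 'clarify:start', 'model', 'summarize:end', 'clarify:end']
```

Because "summarize" sits earlier in the list, it sees the state before "clarify" on the way in, which is why ordering in the chain matters.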
Middleware reference
ClarificationMiddleware
Intercepts clarification tool calls and converts them into proper user-facing requests for additional information. When the model decides it needs to ask the user something before proceeding, this middleware surfaces that request.
Configuration: controlled by guardrails.clarification settings.
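Conceptually, the interception step looks something like the sketch below. The tool name `request_clarification` and the payload shape are assumptions chosen for illustration, not the real tool schema.

```python
def intercept_clarification(tool_calls):
    """Scan a turn's tool calls; if one is a clarification request,
    surface it as a user-facing question instead of executing tools.
    The tool name checked here is a hypothetical placeholder."""
    for call in tool_calls:
        if call["name"] == "request_clarification":
            return {"type": "ask_user", "question": call["args"]["question"]}
    return None  # no clarification requested; tools run normally

result = intercept_clarification(
    [{"name": "request_clarification", "args": {"question": "Which branch?"}}]
)
# → {"type": "ask_user", "question": "Which branch?"}
```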
LoopDetectionMiddleware
Detects when the agent is making the same tool call repeatedly without making progress. When a loop is detected, the middleware intervenes to break the cycle and prevents the agent from burning turns indefinitely.
Configuration: built-in, no user configuration.
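A minimal sketch of the idea, assuming detection means "the last few tool calls are identical"; the actual window size and comparison rule used by the middleware are not documented here:

```python
def is_looping(tool_calls, window=3):
    """Illustrative check: True when the last `window` tool calls have
    the same name and arguments, suggesting no progress is being made."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    return all(call == recent[0] for call in recent)

calls = [{"name": "grep", "args": {"pattern": "foo"}}] * 3
is_looping(calls)  # → True
```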
MemoryMiddleware
Reads persisted memory facts at the start of each conversation and injects them into the system prompt. After a conversation ends, queues a background update to incorporate any new information into the memory store.
Configuration: see the Memory page and the memory: section in config.yaml.
memory:
  enabled: true
  injection_enabled: true
  max_injection_tokens: 2000
  debounce_seconds: 30

SubagentLimitMiddleware
Limits the number of parallel subagent task calls the agent can make in a single turn. This prevents the agent from spawning an unbounded number of concurrent subagents.
Configuration: subagent_enabled and max_concurrent_subagents in the per-request config.
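An illustrative per-request configuration, shown as YAML for readability; the cap of 3 is an arbitrary example value, not a documented default:

```yaml
# Per-request configuration (illustrative values)
subagent_enabled: true
max_concurrent_subagents: 3
```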
TitleMiddleware
Automatically generates a title for the thread after the first exchange. The title is derived from the user’s first message and the agent’s response.
Configuration: title: section in config.yaml.
title:
  enabled: true
  max_words: 6
  max_chars: 60
  model_name: null  # use default model

TodoMiddleware
When plan mode is active, maintains a structured task list visible to the user. The agent uses the write_todos tool to mark tasks as pending, in_progress, or completed as it works through a complex objective.
Activation: enabled automatically when is_plan_mode: true is set in the request configuration. No config.yaml entry required.
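An illustrative payload for a write_todos call. The exact argument schema is an assumption; only the three status values come from the description above.

```python
# Hypothetical shape of the task list passed to write_todos.
todos = [
    {"task": "Survey the existing test suite", "status": "completed"},
    {"task": "Refactor the parser module", "status": "in_progress"},
    {"task": "Update the documentation", "status": "pending"},
]
```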
TokenUsageMiddleware
Tracks LLM token consumption per model call and logs it at the info level. Useful for monitoring costs and understanding where tokens are going in long tasks.
Configuration: token_usage: section in config.yaml.
token_usage:
  enabled: false

SandboxAuditMiddleware
Audits sandbox operations performed during the agent’s execution. Provides a record of what files were read, written, and what commands were run.
Configuration: built-in runtime middleware, always active when a sandbox is available.
SummarizationMiddleware
When the conversation grows long, summarizes older messages to reduce context size. The summary is injected back into the conversation in place of the original messages, preserving meaning without the full token cost.
Configuration: summarization: section in config.yaml. See detailed configuration below.
ViewImageMiddleware
When the current model supports vision (supports_vision: true), this middleware intercepts view_image tool calls and injects the image content directly into the model’s context so it can be analyzed.
Activation: automatically enabled when the resolved model has supports_vision: true.
DeferredToolFilterMiddleware
When tool search is enabled, this middleware hides deferred tool schemas from the model’s context. Tools are discovered lazily via the tool_search tool instead of being listed upfront, reducing context usage.
Configuration: tool_search.enabled: true in config.yaml.
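In config.yaml this corresponds to:

```yaml
tool_search:
  enabled: true
```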
Summarization configuration
The SummarizationMiddleware is one of the most impactful middlewares for long-horizon tasks. Here is the full configuration reference:
summarization:
  enabled: true
  # Model to use for summarization (null = use default model)
  # A lightweight model like gpt-4o-mini is recommended to reduce cost.
  model_name: null
  # Trigger conditions — summarization runs when ANY threshold is met
  trigger:
    - type: tokens     # trigger when context exceeds N tokens
      value: 15564
    # - type: messages   # trigger when there are more than N messages
    #   value: 50
    # - type: fraction   # trigger when context exceeds X% of model max
    #   value: 0.8
  # How much recent history to keep after summarization
  keep:
    type: messages
    value: 10  # keep the 10 most recent messages
    # Alternative: keep by tokens
    # type: tokens
    # value: 3000
  # Maximum tokens to trim when preparing messages for the summarizer
  trim_tokens_to_summarize: 15564
  # Custom summary prompt (null = use default LangChain prompt)
  summary_prompt: null

Trigger types:
- tokens: triggers when the total token count in the conversation exceeds value.
- messages: triggers when the number of messages exceeds value.
- fraction: triggers when the context reaches value fraction of the model’s maximum input token limit.
Multiple triggers can be listed; summarization runs when any of them fires.
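The OR-combination of triggers can be sketched like this; field names mirror the config above, but the real middleware's internal logic may differ:

```python
def should_summarize(triggers, token_count, message_count, model_max_tokens):
    """Return True as soon as ANY configured trigger fires (illustrative)."""
    for trig in triggers:
        kind, value = trig["type"], trig["value"]
        if kind == "tokens" and token_count > value:
            return True
        if kind == "messages" and message_count > value:
            return True
        if kind == "fraction" and token_count > value * model_max_tokens:
            return True
    return False

triggers = [{"type": "tokens", "value": 15564}, {"type": "fraction", "value": 0.8}]
should_summarize(triggers, token_count=16000, message_count=12,
                 model_max_tokens=128000)
# → True (the token threshold fired, even though the fraction did not)
```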
Keep types:
- messages: keep the last value messages after summarization.
- tokens: keep up to value tokens of recent history.
- fraction: keep up to value fraction of the model’s max input token limit.
Writing a custom middleware
Custom middlewares can be injected into the chain for specialized use cases. A middleware must implement the AgentMiddleware interface from langchain.agents.middleware.
The basic structure is:
from langchain.agents.middleware import AgentMiddleware

class MyMiddleware(AgentMiddleware):
    async def on_start(self, state, config):
        # Runs before the model call
        # Modify state or config here
        return state, config

    async def on_end(self, state, config):
        # Runs after the model call
        # Inspect or modify the result
        return state, config

Custom middlewares are passed to make_lead_agent via the custom_middlewares parameter in _build_middlewares. They are injected immediately before ClarificationMiddleware at the end of the chain.