Provider-agnostic middleware
The following middleware work with any LLM provider:

| Middleware | Description |
|---|---|
| Summarization | Automatically summarize conversation history when approaching token limits. |
| Human-in-the-loop | Pause execution for human approval of tool calls. |
| Model call limit | Limit the number of model calls to prevent excessive costs. |
| Tool call limit | Control tool execution by limiting call counts. |
| Model fallback | Automatically fall back to alternative models when the primary model fails. |
| PII detection | Detect and handle Personally Identifiable Information (PII). |
| To-do list | Equip agents with task planning and tracking capabilities. |
| LLM tool selector | Use an LLM to select relevant tools before calling the main model. |
| Tool retry | Automatically retry failed tool calls with exponential backoff. |
| Model retry | Automatically retry failed model calls with exponential backoff. |
| LLM tool emulator | Emulate tool execution using an LLM for testing purposes. |
| Context editing | Manage conversation context by trimming or clearing tool uses. |
Summarization
Automatically summarize conversation history when approaching token limits, preserving recent messages while compressing older context. Summarization is useful for the following:
- Long-running conversations that exceed context windows.
- Multi-turn dialogues with extensive history.
- Applications where preserving full conversation context matters.
Configuration options
- Model for generating summaries. Can be a model identifier string (e.g., 'openai:gpt-4o-mini') or a BaseChatModel instance.
- Conditions for triggering summarization (trigger). Can be:
  - A single condition object (all properties must be met - AND logic)
  - An array of condition objects (any condition must be met - OR logic)
  Each condition can include:
  - fraction (number): Fraction of the model’s context size (0-1)
  - tokens (number): Absolute token count
  - messages (number): Message count
- How much context to preserve after summarization (keep). Specify exactly one of:
  - fraction (number): Fraction of the model’s context size to keep (0-1)
  - tokens (number): Absolute token count to keep
  - messages (number): Number of recent messages to keep
- Custom token counting function. Defaults to character-based counting.
- Custom prompt template for summarization. Uses the built-in template if not specified. The template should include a {messages} placeholder where the conversation history will be inserted.
- Maximum number of tokens to include when generating the summary. Messages are trimmed to fit this limit before summarization.
- Prefix to add to the summary message. If not provided, a default prefix is used.
- Deprecated: Use trigger: { tokens: value } instead. Token threshold for triggering summarization.
- Deprecated: Use keep: { messages: value } instead. Recent messages to preserve.
Full example
The summarization middleware monitors message token counts and automatically summarizes older messages when thresholds are reached.
Trigger conditions control when summarization runs:
- Single condition object (all properties must be met - AND logic)
- Array of conditions (any condition must be met - OR logic)
- Each condition can use fraction (of the model’s context size), tokens (absolute count), or messages (message count)
Keep conditions control how much context to preserve (specify exactly one):
- fraction - Fraction of the model’s context size to keep
- tokens - Absolute token count to keep
- messages - Number of recent messages to keep
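To illustrate the AND/OR semantics of trigger conditions, here is a minimal sketch of how a trigger could be evaluated. The TriggerCondition and ConversationStats types, the function names, and the use of >= comparisons are assumptions made for this example; the middleware's actual implementation may differ.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
type TriggerCondition = {
  fraction?: number; // fraction of the model's context size (0-1)
  tokens?: number; // absolute token count
  messages?: number; // message count
};

type ConversationStats = {
  tokenCount: number;
  messageCount: number;
  contextSize: number; // model's maximum context size, in tokens
};

// A single condition is met only when every property it sets is reached (AND).
function conditionMet(cond: TriggerCondition, stats: ConversationStats): boolean {
  const checks: boolean[] = [];
  if (cond.fraction !== undefined) {
    checks.push(stats.tokenCount >= cond.fraction * stats.contextSize);
  }
  if (cond.tokens !== undefined) checks.push(stats.tokenCount >= cond.tokens);
  if (cond.messages !== undefined) checks.push(stats.messageCount >= cond.messages);
  // Assumption: a condition with no properties never triggers.
  return checks.length > 0 && checks.every(Boolean);
}

// An array of conditions triggers when any one of them is met (OR).
function shouldSummarize(
  trigger: TriggerCondition | TriggerCondition[],
  stats: ConversationStats,
): boolean {
  const conditions = Array.isArray(trigger) ? trigger : [trigger];
  return conditions.some((c) => conditionMet(c, stats));
}
```

With 5,000 tokens and 12 messages in the conversation, `{ tokens: 4000, messages: 20 }` would not trigger in this sketch (the message count is not reached), while `[{ tokens: 4000, messages: 20 }, { messages: 10 }]` would.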
Human-in-the-loop
Pause agent execution for human approval, editing, or rejection of tool calls before they execute. Human-in-the-loop is useful for the following:
- High-stakes operations requiring human approval (e.g. database writes, financial transactions).
- Compliance workflows where human oversight is mandatory.
- Long-running conversations where human feedback guides the agent.
Model call limit
Limit the number of model calls to prevent infinite loops or excessive costs. Model call limit is useful for the following:
- Preventing runaway agents from making too many API calls.
- Enforcing cost controls on production deployments.
- Testing agent behavior within specific call budgets.
Configuration options
Tool call limit
Control agent execution by limiting the number of tool calls, either globally across all tools or for specific tools. Tool call limits are useful for the following:
- Preventing excessive calls to expensive external APIs.
- Limiting web searches or database queries.
- Enforcing rate limits on specific tool usage.
- Protecting against runaway agent loops.
Configuration options
- Name of the specific tool to limit. If not provided, limits apply to all tools globally.
- threadLimit: Maximum tool calls across all runs in a thread (conversation). Persists across multiple invocations with the same thread ID. Requires a checkpointer to maintain state. undefined means no thread limit.
- runLimit: Maximum tool calls per single invocation (one user message → response cycle). Resets with each new user message. undefined means no run limit.
Note: At least one of threadLimit or runLimit must be specified.
- Behavior when the limit is reached:
  - 'continue' (default) - Block exceeded tool calls with error messages and let other tools and the model continue. The model decides when to end based on the error messages.
  - 'error' - Throw a ToolCallLimitExceededError exception, stopping execution immediately.
  - 'end' - Stop execution immediately with a ToolMessage and AI message for the exceeded tool call. Only works when limiting a single tool; throws an error if other tools have pending calls.
Full example
Specify limits with:
- Thread limit - Max calls across all runs in a conversation (requires checkpointer)
- Run limit - Max calls per single invocation (resets each turn)
Behavior when a limit is reached:
- 'continue' (default) - Block exceeded calls with error messages; the agent continues
- 'error' - Raise an exception immediately
- 'end' - Stop with a ToolMessage + AI message (single-tool scenarios only)
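The interaction between thread limits, run limits, and the three behaviors can be sketched as a small decision function. This is only an illustration: the exitBehavior option name, the LimitConfig, CallCounts, and LimitDecision types, and the checkToolCall function are hypothetical, and a plain Error stands in for ToolCallLimitExceededError.

```typescript
type LimitExitBehavior = "continue" | "error" | "end";

type LimitConfig = {
  threadLimit?: number; // undefined means no thread limit
  runLimit?: number; // undefined means no run limit
  exitBehavior?: LimitExitBehavior; // hypothetical option name
};

// Calls already made: across the whole thread, and within the current run.
type CallCounts = { thread: number; run: number };

type LimitDecision =
  | { action: "allow" }
  | { action: "block"; message: string } // 'continue': the call is blocked, the agent keeps going
  | { action: "end"; message: string }; // 'end': execution stops with a message

function checkToolCall(config: LimitConfig, counts: CallCounts): LimitDecision {
  if (config.threadLimit === undefined && config.runLimit === undefined) {
    throw new Error("At least one of threadLimit or runLimit must be specified");
  }
  const overThread = config.threadLimit !== undefined && counts.thread >= config.threadLimit;
  const overRun = config.runLimit !== undefined && counts.run >= config.runLimit;
  if (!overThread && !overRun) return { action: "allow" };

  const message = `Tool call limit reached (${overThread ? "thread" : "run"} limit)`;
  switch (config.exitBehavior ?? "continue") {
    case "error":
      throw new Error(message); // stand-in for ToolCallLimitExceededError
    case "end":
      return { action: "end", message };
    default:
      return { action: "block", message };
  }
}
```

In this sketch the run counter would be reset to 0 at the start of each new user message, while the thread counter would be restored from persisted state, which is why a thread limit needs a checkpointer.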
Model fallback
Automatically fall back to alternative models when the primary model fails. Model fallback is useful for the following:
- Building resilient agents that handle model outages.
- Cost optimization by falling back to cheaper models.
- Provider redundancy across OpenAI, Anthropic, etc.
Configuration options
The middleware accepts a variable number of string arguments representing fallback models in order:
One or more fallback model strings to try in order when the primary model fails
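The "try in order" behavior can be sketched as follows. ModelCall is a hypothetical stand-in for invoking a chat model, and invokeWithFallback is an illustrative helper rather than the middleware's API; real models return richer message objects than a string.

```typescript
// Hypothetical stand-in for a chat model call.
type ModelCall = (prompt: string) => Promise<string>;

// Try the primary model first, then each fallback in order.
// If every model fails, rethrow the last error seen.
async function invokeWithFallback(
  primary: ModelCall,
  fallbacks: ModelCall[],
  prompt: string,
): Promise<string> {
  let lastError: unknown;
  for (const call of [primary, ...fallbacks]) {
    try {
      return await call(prompt);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next model
    }
  }
  throw lastError;
}
```

Because the list is walked in order, placing cheaper or more reliable models later in the list determines which one absorbs the traffic when the primary is unavailable.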
PII detection
Detect and handle Personally Identifiable Information (PII) in conversations using configurable strategies. PII detection is useful for the following:
- Healthcare and financial applications with compliance requirements.
- Customer service agents that need to sanitize logs.
- Any application handling sensitive user data.
Custom PII types
You can create custom PII types by providing a detector parameter. This allows you to detect patterns specific to your use case beyond the built-in types.
Three ways to create custom detectors:
- Regex pattern string - Simple pattern matching
- RegExp object - More control over regex flags
- Custom function - Complex detection logic with validation
Custom detector functions return an array of PIIMatch objects.
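To make the three detector forms concrete, the sketch below normalizes a pattern string, a RegExp, or a function into a single function of type (content: string) => PIIMatch[]. The PIIMatch fields shown here (text, start, end) are an assumption for illustration, and toDetector is a hypothetical helper, not part of the library.

```typescript
// Assumed shape of a match; the real PIIMatch type may carry different fields.
type PIIMatch = { text: string; start: number; end: number };
type DetectorFn = (content: string) => PIIMatch[];

// Normalize the three detector forms into one function.
function toDetector(detector: string | RegExp | DetectorFn): DetectorFn {
  if (typeof detector === "function") return detector;
  const source = typeof detector === "string" ? detector : detector.source;
  const baseFlags = typeof detector === "string" ? "" : detector.flags;
  // The global flag is needed to find every occurrence, not just the first.
  const flags = baseFlags.includes("g") ? baseFlags : baseFlags + "g";
  return (content) => {
    const regex = new RegExp(source, flags);
    const matches: PIIMatch[] = [];
    let m: RegExpExecArray | null;
    while ((m = regex.exec(content)) !== null) {
      if (m[0].length === 0) {
        regex.lastIndex++; // skip empty matches to avoid an infinite loop
        continue;
      }
      matches.push({ text: m[0], start: m.index, end: m.index + m[0].length });
    }
    return matches;
  };
}

// A pattern string for API-key-like tokens, and a RegExp with its own flags.
const detectApiKey = toDetector("sk-[a-zA-Z0-9]{32}");
const detectSecretWord = toDetector(/secret/i);
```

A custom function is passed through unchanged, which is where validation logic (for example, a checksum on a candidate match) would live.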
Configuration options
- Type of PII to detect. Can be a built-in type (email, credit_card, ip, mac_address, url) or a custom type name.
- How to handle detected PII. Options:
  - 'block' - Throw an error when detected
  - 'redact' - Replace with [REDACTED_TYPE]
  - 'mask' - Partially mask (e.g., ****-****-****-1234)
  - 'hash' - Replace with a deterministic hash (e.g., <email_hash:a1b2c3d4>)
- detector: Custom detector. Can be:
  - RegExp - Regex pattern for matching
  - string - Regex pattern string (e.g., "sk-[a-zA-Z0-9]{32}")
  - function - Custom detector function (content: string) => PIIMatch[]
- Check user messages before model call
- Check AI messages after model call
- Check tool result messages after execution
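The four handling strategies can be sketched as a single function applied to each detected value. This is an illustration only: applyPiiStrategy is a hypothetical helper, the uppercase type name in the redaction marker is an assumption, and the small FNV-1a hash is a dependency-free stand-in for whatever hash the middleware actually uses.

```typescript
type PiiStrategy = "block" | "redact" | "mask" | "hash";

// Small deterministic FNV-1a hash, used here only to keep the sketch dependency-free.
function fnv1aHex(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

function applyPiiStrategy(value: string, piiType: string, strategy: PiiStrategy): string {
  switch (strategy) {
    case "block":
      throw new Error(`PII of type ${piiType} detected`);
    case "redact":
      return `[REDACTED_${piiType.toUpperCase()}]`;
    case "mask":
      // Keep the last four characters; hide other letters and digits, preserving separators.
      return value.slice(0, -4).replace(/[A-Za-z0-9]/g, "*") + value.slice(-4);
    case "hash":
      // The same input always yields the same token, so values stay correlatable.
      return `<${piiType}_hash:${fnv1aHex(value)}>`;
  }
}
```

Hashing keeps records linkable without exposing the value, masking keeps a recognizable suffix for the user, and redaction removes the value entirely.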
To-do list
Equip agents with task planning and tracking capabilities for complex multi-step tasks. To-do lists are useful for the following:
- Complex multi-step tasks requiring coordination across multiple tools.
- Long-running operations where progress visibility is important.
This middleware automatically provides agents with a write_todos tool and system prompts to guide effective task planning.
Configuration options
No configuration options available (uses defaults).
LLM tool selector
Use an LLM to intelligently select relevant tools before calling the main model. LLM tool selectors are useful for the following:
- Agents with many tools (10+) where most aren’t relevant per query.
- Reducing token usage by filtering irrelevant tools.
- Improving model focus and accuracy.
Configuration options
- Model for tool selection. Can be a model identifier string (e.g., 'openai:gpt-4o-mini') or a BaseChatModel instance. Defaults to the agent’s main model.
- Instructions for the selection model. Uses the built-in prompt if not specified.
- maxTools: Maximum number of tools to select. If the model selects more, only the first maxTools are used. No limit if not specified.
- Tool names to always include regardless of selection. These do not count against the maxTools limit.
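How maxTools and the always-included tools combine can be sketched as a post-processing step on the selection model's output. The function name, the alwaysInclude parameter name, and the choice to return tools in their original order are assumptions made for this example.

```typescript
// Combine the selection model's picks with the configured limits.
function applyToolSelection(
  available: string[],
  selectedByModel: string[],
  maxTools: number | undefined,
  alwaysInclude: string[], // hypothetical name for the always-included tools
): string[] {
  // Ignore unknown names, and don't let always-included tools consume the budget.
  const valid = selectedByModel.filter(
    (t) => available.includes(t) && !alwaysInclude.includes(t),
  );
  // If the model selected more than maxTools, only the first maxTools are used.
  const limited = maxTools === undefined ? valid : valid.slice(0, maxTools);
  const keep = new Set([...limited, ...alwaysInclude]);
  // Return the surviving tools in their original order.
  return available.filter((t) => keep.has(t));
}
```

For example, with maxTools set to 2 and "search" always included, a model selection of weather, email, and calendar would leave search, weather, and email available to the main model.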
Tool retry
Automatically retry failed tool calls with configurable exponential backoff. Tool retry is useful for the following:
- Handling transient failures in external API calls.
- Improving reliability of network-dependent tools.
- Building resilient agents that gracefully handle temporary errors.
toolRetryMiddleware
Configuration options
- maxRetries: Maximum number of retry attempts after the initial call (default: 2, for 3 total attempts). Must be >= 0.
- Optional array of tools or tool names to apply retry logic to. Can be a list of BaseTool instances or tool name strings. If undefined, applies to all tools.
- Either an array of error constructors to retry on, or a function that takes an error and returns true if it should be retried. Default is to retry on all errors.
- onFailure: Behavior when all retries are exhausted. Options:
  - 'continue' (default) - Return a ToolMessage with error details, allowing the LLM to handle the failure and potentially recover
  - 'error' - Re-raise the exception, stopping agent execution
  - Custom function - Function that takes the exception and returns a string for the ToolMessage content, allowing custom error formatting
  Deprecated values: 'raise' (use 'error' instead) and 'return_message' (use 'continue' instead). These deprecated values still work but will show a warning.
- backoffFactor: Multiplier for exponential backoff. Each retry waits initialDelayMs * (backoffFactor ** retryNumber) milliseconds. Set to 0.0 for constant delay. Must be >= 0.
- initialDelayMs: Initial delay in milliseconds before the first retry. Must be >= 0.
- maxDelayMs: Maximum delay in milliseconds between retries (caps exponential backoff growth). Must be >= 0.
- jitter: Whether to add random jitter (±25%) to the delay to avoid a thundering herd.
Full example
The middleware automatically retries failed tool calls with exponential backoff.
Key configuration:
- maxRetries - Number of retry attempts (default: 2)
- backoffFactor - Multiplier for exponential backoff (default: 2.0)
- initialDelayMs - Starting delay in milliseconds (default: 1000ms)
- maxDelayMs - Cap on delay growth (default: 60000ms)
- jitter - Add random variation (default: true)
Failure handling:
- onFailure: "continue" (default) - Return an error message
- onFailure: "error" - Re-raise the exception
- Custom function - Function returning an error message
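The delay formula initialDelayMs * (backoffFactor ** retryNumber), capped at maxDelayMs and optionally jittered by ±25%, can be worked through in a short sketch. The retryDelayMs function, the zero-based retryNumber, the injectable random source, and applying jitter after the cap are assumptions made for this example.

```typescript
type BackoffOptions = {
  backoffFactor: number;
  initialDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
};

// Defaults as listed above.
const defaultBackoff: BackoffOptions = {
  backoffFactor: 2.0,
  initialDelayMs: 1000,
  maxDelayMs: 60000,
  jitter: true,
};

// retryNumber is assumed to be zero-based: 0 for the first retry, 1 for the second, and so on.
function retryDelayMs(
  retryNumber: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  // A backoffFactor of 0 is treated as "constant delay".
  const base =
    opts.backoffFactor === 0
      ? opts.initialDelayMs
      : opts.initialDelayMs * opts.backoffFactor ** retryNumber;
  const capped = Math.min(base, opts.maxDelayMs);
  if (!opts.jitter) return capped;
  // ±25% jitter: scale by a random factor in [0.75, 1.25).
  return capped * (0.75 + random() * 0.5);
}
```

With the defaults and jitter disabled, the first three retries would wait 1000 ms, 2000 ms, and 4000 ms; by the seventh retry the uncapped value (64000 ms) exceeds maxDelayMs and is clamped to 60000 ms.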
Model retry
Automatically retry failed model calls with configurable exponential backoff. Model retry is useful for the following:
- Handling transient failures in model API calls.
- Improving reliability of network-dependent model requests.
- Building resilient agents that gracefully handle temporary model errors.
modelRetryMiddleware
Configuration options
- maxRetries: Maximum number of retry attempts after the initial call (3 total attempts with default). Must be >= 0.
- Either an array of error constructors to retry on, or a function that takes an error and returns true if it should be retried. Default is to retry on all errors.
- onFailure: Behavior when all retries are exhausted. Options:
  - 'continue' (default) - Return an AIMessage with error details, allowing the agent to potentially handle the failure gracefully
  - 'error' - Re-raise the exception, stopping agent execution
  - Custom function - Function that takes the exception and returns a string for the AIMessage content, allowing custom error formatting
- backoffFactor: Multiplier for exponential backoff. Each retry waits initialDelayMs * (backoffFactor ** retryNumber) milliseconds. Set to 0.0 for constant delay. Must be >= 0.
- initialDelayMs: Initial delay in milliseconds before the first retry. Must be >= 0.
- maxDelayMs: Maximum delay in milliseconds between retries (caps exponential backoff growth). Must be >= 0.
- jitter: Whether to add random jitter (±25%) to the delay to avoid a thundering herd.
Full example
The middleware automatically retries failed model calls with exponential backoff.
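The overall retry loop, including the retry filter and the failure behaviors, can be sketched as below. This is not the middleware's implementation: the retryOn, delayMs, and sleep option names and the callModelWithRetry function are hypothetical, a plain string stands in for the AIMessage content, and sleep is injected so the sketch carries no timer dependency (in practice it would typically wrap setTimeout in a promise).

```typescript
type RetryOn = Array<new (...args: any[]) => Error> | ((error: Error) => boolean);
type OnFailure = "continue" | "error" | ((error: Error) => string);

type RetryOptions = {
  maxRetries: number; // retries after the initial call
  retryOn?: RetryOn; // hypothetical name; default is to retry on all errors
  onFailure?: OnFailure;
  delayMs?: (retryNumber: number) => number; // e.g. an exponential backoff function
  sleep: (ms: number) => Promise<void>; // injected wait function
};

function shouldRetry(error: Error, retryOn?: RetryOn): boolean {
  if (retryOn === undefined) return true;
  if (typeof retryOn === "function") return retryOn(error);
  return retryOn.some((ctor) => error instanceof ctor);
}

async function callModelWithRetry(
  call: () => Promise<string>,
  opts: RetryOptions,
): Promise<string> {
  const onFailure = opts.onFailure ?? "continue";
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (raw) {
      const error = raw instanceof Error ? raw : new Error(String(raw));
      if (attempt < opts.maxRetries && shouldRetry(error, opts.retryOn)) {
        await opts.sleep(opts.delayMs ? opts.delayMs(attempt) : 0);
        continue;
      }
      // Retries exhausted (or the error is not retryable): apply the failure behavior.
      if (onFailure === "error") throw error;
      if (typeof onFailure === "function") return onFailure(error);
      return `Model call failed: ${error.message}`;
    }
  }
}
```

With maxRetries set to 2, a call that fails twice and then succeeds would be attempted three times in total, waiting once before each retry.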
LLM tool emulator
Emulate tool execution using an LLM for testing purposes, replacing actual tool calls with AI-generated responses. LLM tool emulators are useful for the following:
- Testing agent behavior without executing real tools.
- Developing agents when external tools are unavailable or expensive.
- Prototyping agent workflows before implementing actual tools.
Configuration options
- List of tool names (string) or tool instances to emulate. If undefined (default), all tools are emulated. If an empty array [], no tools are emulated. If an array of tool names or instances, only those tools are emulated.
- Model to use for generating emulated tool responses. Can be a model identifier string (e.g., 'anthropic:claude-sonnet-4-5-20250929') or a BaseChatModel instance. Defaults to the agent’s model if not specified.
Full example
The middleware uses an LLM to generate plausible responses for tool calls instead of executing the actual tools.
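The three cases for choosing which tools to emulate (undefined, an empty array, or an explicit list) can be sketched as a small resolver. The EmulatedToolRef type and the resolveEmulatedTools function are hypothetical, and tool instances are reduced to objects with a name field for illustration.

```typescript
// A tool may be referenced by name or by an instance carrying a name.
type EmulatedToolRef = string | { name: string };

function resolveEmulatedTools(
  allToolNames: string[],
  tools?: EmulatedToolRef[],
): string[] {
  // undefined (default): every tool is emulated.
  if (tools === undefined) return [...allToolNames];
  // An explicit list: only those tools are emulated; [] therefore means none.
  const requested = new Set(tools.map((t) => (typeof t === "string" ? t : t.name)));
  return allToolNames.filter((n) => requested.has(n));
}
```

The distinction between undefined and an empty array matters here: omitting the option emulates everything, while passing [] leaves every real tool in place.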
Context editing
Manage conversation context by clearing older tool call outputs when token limits are reached, while preserving recent results. This helps keep context windows manageable in long conversations with many tool calls. Context editing is useful for the following:
- Long conversations with many tool calls that exceed token limits
- Reducing token costs by removing older tool outputs that are no longer relevant
- Maintaining only the most recent N tool results in context
Configuration options
- Array of ContextEdit strategies to apply.
ClearToolUsesEdit options:
- Token count that triggers the edit. When the conversation exceeds this token count, older tool outputs are cleared.
- Minimum number of tokens to reclaim when the edit runs. If set to 0, clears as much as needed.
- Number of most recent tool results that must be preserved. These are never cleared.
- Whether to clear the originating tool call parameters on the AI message. When true, tool call arguments are replaced with empty objects.
- List of tool names to exclude from clearing. These tools never have their outputs cleared.
- Placeholder text inserted for cleared tool outputs. This replaces the original tool message content.
Full example
The middleware applies context editing strategies when token limits are reached. The most common strategy is ClearToolUsesEdit, which clears older tool results while preserving recent ones.
How it works:
- Monitor token count in conversation
- When threshold is reached, clear older tool outputs
- Keep most recent N tool results
- Optionally preserve tool call arguments for context
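The steps above can be sketched as a pure function over a simplified message list. The ChatMessage and ClearToolUsesOptions types, their field names, and the injected token counter are all hypothetical; the sketch also omits the minimum-tokens-to-reclaim option and the clearing of tool call arguments.

```typescript
type ChatMessage =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; toolName: string; content: string };

// Hypothetical option names for illustration.
type ClearToolUsesOptions = {
  triggerTokens: number; // token count that triggers the edit
  keep: number; // most recent tool results to preserve
  excludeTools: string[]; // tools whose outputs are never cleared
  placeholder: string; // text that replaces cleared outputs
};

function clearOldToolOutputs(
  messages: ChatMessage[],
  opts: ClearToolUsesOptions,
  countTokens: (msgs: ChatMessage[]) => number,
): ChatMessage[] {
  // Below the threshold, nothing is edited.
  if (countTokens(messages) <= opts.triggerTokens) return messages;

  // Indexes of all tool results, oldest first.
  const toolIndexes = messages
    .map((m, i) => (m.role === "tool" ? i : -1))
    .filter((i) => i >= 0);
  // The most recent `keep` tool results are protected.
  const protectedIndexes = new Set<number>(
    opts.keep > 0 ? toolIndexes.slice(-opts.keep) : [],
  );

  return messages.map((m, i): ChatMessage => {
    if (m.role !== "tool") return m;
    if (protectedIndexes.has(i) || opts.excludeTools.includes(m.toolName)) return m;
    return { ...m, content: opts.placeholder };
  });
}
```

Only the tool outputs are replaced; user and assistant messages pass through untouched, so the conversation structure stays intact while the bulk of the tokens is reclaimed.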
Provider-specific middleware
These middleware are optimized for specific LLM providers.
Anthropic
Middleware specifically designed for Anthropic’s Claude models.

| Middleware | Description |
|---|---|
| Prompt caching | Reduce costs by caching repetitive prompt prefixes |
Prompt caching
Reduce costs and latency by caching static or repetitive prompt content (like system prompts, tool definitions, and conversation history) on Anthropic’s servers. This middleware implements a conversational caching strategy that places cache breakpoints after the most recent message, allowing the entire conversation history (including the latest user message) to be cached and reused in subsequent API calls. Prompt caching is useful for the following:
- Applications with long, static system prompts that don’t change between requests
- Agents with many tool definitions that remain constant across invocations
- Conversations where early message history is reused across multiple turns
- High-volume deployments where reducing API costs and latency is critical
Learn more about Anthropic prompt caching strategies and limitations.
Configuration options
- Time to live for cached content. Valid values: '5m' or '1h'
Full example
The middleware caches content up to and including the latest message in each request. On subsequent requests within the TTL window (5 minutes or 1 hour), previously seen content is retrieved from cache rather than reprocessed, significantly reducing costs and latency.
How it works:
- First request: System prompt, tools, and the user message “Hi, my name is Bob” are sent to the API and cached
- Second request: The cached content (system prompt, tools, and first message) is retrieved from cache. Only the new message “What’s my name?” needs to be processed, plus the model’s response from the first request
- This pattern continues for each turn, with each request reusing the cached conversation history
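One way to picture "a cache breakpoint after the most recent message" is as a cache_control marker on the last content block of the last message in the request. The sketch below shows that idea on a simplified message shape; the TextBlock and ApiMessage types and the withCacheBreakpoint function are illustrative assumptions, and the middleware may place its breakpoint differently.

```typescript
type CacheTtl = "5m" | "1h";

// Simplified request shapes, for illustration only.
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl: CacheTtl };
};
type ApiMessage = { role: "user" | "assistant"; content: TextBlock[] };

// Return a copy of the messages with a cache breakpoint on the final block
// of the most recent message; everything up to that point becomes cacheable.
function withCacheBreakpoint(messages: ApiMessage[], ttl: CacheTtl): ApiMessage[] {
  const last = messages[messages.length - 1];
  if (last === undefined || last.content.length === 0) return messages;
  const lastIndex = last.content.length - 1;
  const content = last.content.map((block, i): TextBlock =>
    i === lastIndex ? { ...block, cache_control: { type: "ephemeral", ttl } } : block,
  );
  return [...messages.slice(0, -1), { ...last, content }];
}
```

Because the marker moves to the newest message on every turn, each request can reuse the prefix cached by the previous one and only pays full processing cost for what was added since.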
OpenAI
Middleware specifically designed for OpenAI models.

| Middleware | Description |
|---|---|
| Content moderation | Moderate agent traffic using OpenAI’s moderation endpoint |
Content moderation
Moderate agent traffic (user input, model output, and tool results) using OpenAI’s moderation endpoint to detect and handle unsafe content. Content moderation is useful for the following:
- Applications requiring content safety and compliance
- Filtering harmful, hateful, or inappropriate content
- Customer-facing agents that need safety guardrails
- Meeting platform moderation requirements
Learn more about OpenAI’s moderation models and categories.
Configuration options
Full example
The middleware integrates OpenAI’s moderation endpoint to check content at different stages.
Moderation stages:
- check_input - User messages before model call
- check_output - AI messages after model call
- check_tool_results - Tool outputs before model call
Exit behaviors:
- 'end' (default) - Stop execution with a violation message
- 'error' - Raise an exception for application handling
- 'replace' - Replace flagged content and continue
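The three behaviors for flagged content can be sketched as a function applied to a moderation result. The ModerationResult shape keeps only the flagged and categories fields of a moderation response; the ModerationOutcome type, the handleModeration function, and the wording of the violation message are assumptions for this example, and no network call is made.

```typescript
// Reduced view of a moderation result: an overall flag plus per-category flags.
type ModerationResult = { flagged: boolean; categories: Record<string, boolean> };
type ModerationExit = "end" | "error" | "replace";

// What the agent should do next: the content to carry forward, and whether to stop.
type ModerationOutcome = { content: string; stop: boolean };

function handleModeration(
  content: string,
  result: ModerationResult,
  exitBehavior: ModerationExit = "end",
): ModerationOutcome {
  if (!result.flagged) return { content, stop: false };

  const hits = Object.keys(result.categories)
    .filter((c) => result.categories[c])
    .join(", ");
  const notice = `Content flagged for: ${hits}`;

  switch (exitBehavior) {
    case "error":
      throw new Error(notice); // left to the application to handle
    case "replace":
      return { content: notice, stop: false }; // swap the content and continue
    case "end":
      return { content: notice, stop: true }; // stop with a violation message
  }
}
```

The same function could be applied at each stage (input, output, tool results), with only the content being checked changing between them.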