Overview
The `langchain-nvidia-ai-endpoints` package contains LangChain integrations for chat models and embeddings powered by NVIDIA AI Foundation Models, and hosted on the NVIDIA API Catalog.
NVIDIA AI Foundation models are community- and NVIDIA-built models that are optimized to deliver the best performance on NVIDIA-accelerated infrastructure. You can use the API to query live endpoints that are available on the NVIDIA API Catalog to get quick results from a DGX-hosted cloud compute environment, or you can download models from NVIDIA’s API catalog with NVIDIA NIM, which is included with the NVIDIA AI Enterprise license. The ability to run models on-premises gives your enterprise ownership of your customizations and full control of your IP and AI application.
NIM microservices are packaged as container images on a per model/model family basis and are distributed as NGC container images through the NVIDIA NGC Catalog. At their core, NIM microservices are containers that provide interactive APIs for running inference on an AI Model.
This example goes over how to use LangChain to interact with NVIDIA models via the ChatNVIDIA class.
For more information on accessing embedding models through this API, refer to the NVIDIAEmbeddings documentation.
Integration details
| Class | Package | Serializable | JS support | Downloads | Version |
|---|---|---|---|---|---|
| ChatNVIDIA | langchain-nvidia-ai-endpoints | beta | ❌ | | |
Model features
| Tool calling | Structured output | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs |
|---|---|---|---|---|---|---|---|---|
| ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ |
Install the package
To install the integration package, run `pip install -U langchain-nvidia-ai-endpoints`.
Access the NVIDIA API Catalog
To get access to the NVIDIA API Catalog, do the following:
- Create a free account on the NVIDIA API Catalog and log in.
- Click your profile icon, and then click API Keys. The API Keys page appears.
- Click Generate API Key. The Generate API Key window appears.
- Click Generate Key. You should see API Key Granted, and your key appears.
- Copy and save the key as `NVIDIA_API_KEY`.
- To verify your key, use the following code.
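For instance, the key can be exported and sanity-checked as follows (a minimal sketch; keys issued by the NVIDIA API Catalog begin with `nvapi-`):

```python
import getpass
import os

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")

# Keys issued by the NVIDIA API Catalog start with "nvapi-".
assert os.environ["NVIDIA_API_KEY"].startswith("nvapi-"), "Not a valid key!"
```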
Instantiation
Now we can access models in the NVIDIA API Catalog.
Invocation
Self-host with NVIDIA NIM Microservices
When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. For more information, refer to NVIDIA NIM Microservices. To connect to a locally hosted NIM microservice, pass its `base_url` when constructing `ChatNVIDIA`.
Stream, batch, and async
These models natively support streaming and, as is the case with all LangChain chat models, expose a `batch` method for handling concurrent requests, as well as async counterparts of `invoke`, `stream`, and `batch`.
Supported models
Querying `available_models` will give you all of the models offered by your API credentials.
The playground_ prefix is optional.
Model types
All of the models above are supported and can be accessed via `ChatNVIDIA`.
Some model types support unique prompting techniques and chat messages. We will review a few important ones below.
To find out more about a specific model, navigate to the API section of its page in the NVIDIA API Catalog.
General chat
Models such as `meta/llama3-8b-instruct` and `mistralai/mixtral-8x22b-instruct-v0.1` are good all-around models that you can use with any LangChain chat messages. Example below.
Code generation
These models accept the same arguments and input structure as regular chat models, but they tend to perform better on code-generation and structured code tasks. An example of this is `meta/codellama-70b`.
Multimodal
NVIDIA also supports multimodal inputs, meaning you can provide both images and text for the model to reason over. An example model supporting multimodal inputs is `nvidia/neva-22b`.
Below is an example use:
Passing an image as a URL
Passing an image as a base64 encoded string
At the moment, some extra processing happens client-side to support larger images like the one above. But for smaller images (and to better illustrate the process going on under the hood), we can directly pass in the image as shown below:
Directly within the string
The NVIDIA API uniquely accepts images as base64 strings inlined within `<img/>` HTML tags. While this isn't interoperable with other LLMs, you can directly prompt the model accordingly.
Example usage within a RunnableWithMessageHistory
Like any other integration, ChatNVIDIA supports chat utilities such as RunnableWithMessageHistory, which is analogous to using the legacy ConversationChain. Below, we show the LangChain RunnableWithMessageHistory example applied to the `mistralai/mixtral-8x22b-instruct-v0.1` model.
Tool calling
Starting in v0.2, ChatNVIDIA supports bind_tools.
ChatNVIDIA provides integration with the variety of models on build.nvidia.com as well as local NIM microservices. Not all of these models are trained for tool calling. Be sure to select a model that supports tool calling for your experimentation and applications.
You can get a list of models that are known to support tool calling with the following code:
Use with NVIDIA Dynamo
NVIDIA Dynamo is a distributed inference-serving framework built to deploy models in multi-node environments at data center scale. It simplifies and automates the complexities of distributed serving by disaggregating the various phases of inference across different GPUs, intelligently routing requests to the appropriate GPU to avoid redundant computation, and extending GPU memory through data caching to cost-effective storage tiers.
ChatNVIDIADynamo is a drop-in replacement for ChatNVIDIA that automatically injects nvext.agent_hints into every request. These hints tell the Dynamo deployment:
- osl (output sequence length) — how many tokens to expect, so the scheduler can plan memory allocation
- iat (inter-arrival time) — how quickly requests arrive, so the router can anticipate load
- latency_sensitivity — how latency-critical a request is, so interactive calls get priority routing
- priority — request priority, so background work can yield to critical-path requests
prefix_id is auto-generated for every request, enabling the router to track KV cache affinity.
This section assumes you have a running NVIDIA Dynamo deployment.
Basic usage
Swap ChatNVIDIA for ChatNVIDIADynamo, and every request automatically includes routing hints. All standard ChatNVIDIA parameters are supported.
ChatNVIDIADynamo accepts four additional parameters beyond those supported by ChatNVIDIA:
| Parameter | Type | Default | Description |
|---|---|---|---|
| osl | int | 512 | Expected output sequence length (tokens) |
| iat | int | 250 | Expected inter-arrival time (ms) |
| latency_sensitivity | float | 1.0 | Higher latency sensitivities get priority routing |
| priority | int | 1 | Lower priority settings receive more scheduling priority |
Set defaults at construction time
Configure Dynamo hints when creating the model instance. This is useful when a model instance always serves a particular role, such as a high-priority interactive assistant versus a low-priority background summarizer.
Override per invocation
Dynamo parameters can also be overridden on each call. This is useful when the same model instance handles requests with varying characteristics.
Stream with Dynamo hints
Dynamo hints are included in the initial streaming request. Dynamo uses them to select the optimal worker before tokens start flowing.
Inspect the payload
For debugging, inspect the exact payload that ChatNVIDIADynamo sends to the NIM endpoint using the internal _get_payload method.
The resulting payload includes an nvext.agent_hints section:
API reference
For detailed documentation of all ChatNVIDIA features and configurations, head to the API reference: https://python.langchain.com/api_reference/nvidia_ai_endpoints/chat_models/langchain_nvidia_ai_endpoints.chat_models.ChatNVIDIA.html
Related topics
- langchain-nvidia-ai-endpoints package README
- Overview of NVIDIA NIM for Large Language Models (LLMs)
- Overview of NeMo Retriever Embedding NIM
- Overview of NeMo Retriever Reranking NIM
- NVIDIAEmbeddings Model for RAG Workflows
- NVIDIA Provider Page
- NVIDIA Dynamo — open-source inference framework
- Dynamo Quickstart Guide — get a local deployment running
- KV Cache-Aware Routing — how the Smart Router works