Apify Dataset is scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, which can then be exported to formats like JSON, CSV, or Excel. Datasets are mainly used to save the results of Apify Actors, serverless cloud programs for web scraping, crawling, and data extraction use cases.
This notebook shows how to load Apify datasets to LangChain.

Integration details

| Class | Package | Serializable | JS support | Version |
| --- | --- | --- | --- | --- |
| ApifyDatasetLoader | langchain-apify | | | PyPI |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| --- | --- | --- |
| Apify Dataset | | |

Prerequisites

You need to have an existing dataset on the Apify platform. This example shows how to load a dataset produced by the Website Content Crawler.
pip install -qU langchain langchain-apify langchain-openai
First, import ApifyDatasetLoader into your source code:
from langchain_apify import ApifyDatasetLoader
from langchain_core.documents import Document
Find your Apify API token and OpenAI API key and set them as environment variables:
import os

os.environ["APIFY_TOKEN"] = "your-apify-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

Pricing

Apify Actors can be priced in different ways, depending on the Actor you run. Many Actors support Pay-Per-Event (PPE) pricing, where you pay for explicit events defined by the Actor author (for example, per dataset item). This can be a good fit for agent workloads where you want clear, per-operation costs.

Map dataset items to documents

Next, define a function that maps Apify dataset record fields to LangChain Document format. For example, if your dataset items are structured like this:
{
    "url": "https://apify.com",
    "text": "Apify is the best web scraping and automation platform."
}
The mapping function in the code below converts them to LangChain Document format, so that you can use them with any LLM (e.g., for question answering).
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
    ),
)
data = loader.load()
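As a LangChain document loader, ApifyDatasetLoader also exposes lazy_load() from the BaseLoader interface, which yields documents one at a time. A minimal sketch (depending on the implementation, this may simply iterate over the eagerly loaded results):
# Iterate over documents one at a time instead of materializing a list.
for document in loader.lazy_load():
    print(document.metadata["source"])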

An example with question answering

In this example, we use data from a dataset to answer a question.
from langchain.indexes import VectorstoreIndexCreator
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
index = VectorstoreIndexCreator(
    vectorstore_cls=InMemoryVectorStore, embedding=OpenAIEmbeddings()
).from_loaders([loader])
llm = ChatOpenAI(model="gpt-5-mini")
query = "What is Apify?"
result = index.query_with_sources(query, llm=llm)
print(result["answer"])
print(result["sources"])
 Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.

https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples
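VectorstoreIndexCreator is a convenience wrapper; the same flow can be built directly from its underlying pieces. A minimal sketch, reusing the loader and the illustrative query from above:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Load the dataset items, embed them into an in-memory vector store,
# and retrieve the documents most similar to the query.
docs = loader.load()
vectorstore = InMemoryVectorStore.from_documents(docs, embedding=OpenAIEmbeddings())
for doc in vectorstore.similarity_search("What is Apify?", k=4):
    print(doc.metadata["source"])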

Using the Apify MCP server

Unsure which Actor to use or what parameters it requires? The Apify MCP (Model Context Protocol) server can help you discover available Actors, explore their input schemas, and understand parameter requirements. When connecting to the Apify MCP server over HTTP, include your Apify token in the request headers:
Authorization: Bearer <APIFY_TOKEN>
For more information, see the LangChain MCP documentation.
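For illustration, here is a minimal sketch of connecting to the server with the langchain-mcp-adapters package; the endpoint URL is an assumption, so check Apify's MCP documentation for the current one:
import asyncio
import os

from langchain_mcp_adapters.client import MultiServerMCPClient

async def list_apify_tools():
    # The server URL below is assumed; the Apify token goes in the
    # Authorization header as shown above.
    client = MultiServerMCPClient(
        {
            "apify": {
                "url": "https://mcp.apify.com",
                "transport": "streamable_http",
                "headers": {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"},
            }
        }
    )
    tools = await client.get_tools()
    print([tool.name for tool in tools])

asyncio.run(list_apify_tools())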