This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.

Overview

Apify is a cloud platform for web scraping and data extraction that provides an ecosystem of more than 10,000 ready-made apps called Actors for various web scraping, crawling, and data extraction use cases. This guide shows how to load documents from an Apify Dataset: a scalable, append-only storage built for structured web scraping results, such as lists of products or Google SERPs, which can be exported to formats like JSON, CSV, or Excel. Datasets typically store the results of Actor runs. For example:
  • Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and stores the text content of webpages into a dataset
  • RAG Web Browser Actor queries Google Search, scrapes the top N pages from the results, and returns the cleaned content in Markdown format for further processing by a large language model
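As a sketch of what ends up in a dataset, each item is a JSON object whose fields map naturally onto a LangChain document's `pageContent` and `metadata`. The `{ url, text }` item shape below is an assumption taken from the mapping examples later in this guide, and the plain objects stand in for LangChain's `Document` class:

```typescript
// Sketch: a dataset item as stored by a content-crawling Actor
// (field names assumed from the mapping examples below).
type DatasetItem = { url: string; text: string };

const items: DatasetItem[] = [
  { url: "https://apify.com", text: "Apify is a web scraping platform." },
];

// Plain-object stand-in for LangChain's Document shape
const docs = items.map((item) => ({
  pageContent: item.text,
  metadata: { source: item.url },
}));

console.log(docs[0].metadata.source); // "https://apify.com"
```

This is exactly the transformation that `datasetMappingFunction` performs in the loader examples below.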

Integration details

| Class | Package | Local | Serializable | PY support |
| --- | --- | --- | --- | --- |
| ApifyDatasetLoader | @langchain/community | | | |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| --- | --- | --- |
| Apify Dataset | | |

Setup

Credentials

You’ll need to sign up for an Apify account and retrieve your Apify API token. Set it as an environment variable:
process.env.APIFY_TOKEN = "your-apify-token"

Installation

You’ll first need to install the official Apify client and LangChain packages:
npm install apify-client @langchain/community @langchain/core @langchain/openai hnswlib-node

Pricing

Many Actors support Pay-Per-Event (PPE) pricing, where you pay for explicit events defined by the Actor author (for example, per dataset item). This can be a good fit for agent workloads where you want clear, per-operation costs. Apify also offers pay-per-use pricing with a free tier available. Pricing varies by Actor – some Actors are free (you only pay for platform usage), while others charge per result or event. See Apify pricing for details.
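As a back-of-the-envelope illustration of how PPE billing composes, the cost of a run is simply the number of charged events times the per-event price. The rate below is an invented example, not a real Apify price:

```typescript
// Hypothetical PPE estimate: cost = charged events * price per event.
// 0.001 USD per dataset item is an assumed example rate, not a real price.
const pricePerDatasetItem = 0.001;
const itemsStored = 500;
const estimatedCostUsd = itemsStored * pricePerDatasetItem;
console.log(estimatedCostUsd.toFixed(2)); // "0.50"
```

Check the pricing tab of the specific Actor you plan to run for its actual events and rates.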

Usage

From a new dataset (crawl a website and store the data in an Apify dataset)

If you don’t already have an existing dataset on the Apify platform, initialize the document loader by calling an Actor and waiting for the results.

In the example below, we use the Website Content Crawler Actor to crawl the LangChain documentation, store the results in an Apify Dataset, and then load the dataset using the ApifyDatasetLoader. For this demonstration, we use the fast Cheerio crawler type and limit the number of crawled pages to 10.

Note: Running the Website Content Crawler may take some time, depending on the size of the website. For large sites, it can take several hours or even days! Here’s an example:
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "@langchain/classic/chains/combine_documents";
import { createRetrievalChain } from "@langchain/classic/chains/retrieval";

const APIFY_TOKEN = "YOUR-APIFY-TOKEN"; // or set as process.env.APIFY_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    maxCrawlPages: 10,
    crawlerType: "cheerio",
    startUrls: [{ url: "https://js.langchain.com/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: APIFY_TOKEN,
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-5-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/

When to use Apify

Apify is ideal when you need:
  • Access to thousands of pre-built Actors for various platforms (social media, e-commerce, search engines, etc.)
  • Custom web scraping and automation workflows beyond simple search
  • Flexible Actor ecosystem – run any Actor from the Apify Store

From an existing dataset

If you’ve already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly using the constructor:
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "@langchain/classic/chains/retrieval";
import { createStuffDocumentsChain } from "@langchain/classic/chains/combine_documents";

const APIFY_TOKEN = "YOUR-APIFY-TOKEN"; // or set as process.env.APIFY_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: APIFY_TOKEN,
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-5-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/

Additional Actor examples

The Apify Store contains thousands of pre-built Actors. Here are examples of other popular Actors you can use with the document loader:

Instagram Scraper

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { Document } from "@langchain/core/documents";

const docs = await ApifyDatasetLoader.fromActorCall(
  "apify/instagram-scraper",
  {
    directUrls: ["https://www.instagram.com/p/ABC123/"],
    resultsType: "posts",
    resultsLimit: 10,
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: item.caption || "",
        metadata: {
          source: item.url,
          likesCount: item.likesCount,
          commentsCount: item.commentsCount,
        },
      }),
    clientOptions: { token: process.env.APIFY_TOKEN },
  }
);

Google Search Results Scraper

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { Document } from "@langchain/core/documents";

const searchDocs = await ApifyDatasetLoader.fromActorCall(
  "apify/google-search-scraper",
  {
    queries: "langchain javascript tutorial",
    maxPagesPerQuery: 1,
    countryCode: "us",
    languageCode: "en",
  },
  {
    datasetMappingFunction: (item) => {
      const organicResults = Array.isArray(item.organicResults)
        ? item.organicResults
        : [];

      const pageContent = organicResults
        .map(
          (r) =>
            `${r.position}. ${r.title}\n${r.url}\n${r.description ?? ""}`.trim()
        )
        .join("\n\n");

      return new Document({
        pageContent,
        metadata: {
          source: item.searchQuery?.url ?? item.url,
          query: item.searchQuery?.term,
          page: item.searchQuery?.page,
        },
      });
    },
    clientOptions: { token: process.env.APIFY_TOKEN },
  }
);
Browse the Apify Store to discover more Actors for your use case.

Using Apify MCP server

Unsure which Actor to use or what parameters it requires? The Apify MCP (Model Context Protocol) server can help you discover available Actors, explore their input schemas, and understand parameter requirements. When connecting to the Apify MCP server over HTTP, include your Apify token in the request headers:
Authorization: Bearer <APIFY_TOKEN>
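A minimal sketch of building that header in code (the MCP server endpoint itself is omitted here; see the Apify MCP server docs for the URL and transport details):

```typescript
// Sketch: building the Authorization header for HTTP requests
// to the Apify MCP server.
const APIFY_TOKEN = process.env.APIFY_TOKEN ?? "your-apify-token";

const headers: Record<string, string> = {
  Authorization: `Bearer ${APIFY_TOKEN}`,
};

// e.g. pass `headers` to your MCP client's HTTP transport or to fetch()
console.log(headers.Authorization.startsWith("Bearer ")); // true
```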
For more information, see the LangChain MCP documentation and Apify MCP server.