Apify Dataset is a scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and for exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of Apify Actors, serverless cloud programs for various web scraping, crawling, and data extraction use cases.

This notebook shows how to load Apify datasets into LangChain.
Integration details
| Class | Package | Serializable | JS support |
|---|---|---|---|
| ApifyDatasetLoader | langchain-apify | ❌ | ✅ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| Apify Dataset | ❌ | ❌ |
Prerequisites
You need to have an existing dataset on the Apify platform. This example shows how to load a dataset produced by the Website Content Crawler. First, import ApifyDatasetLoader into your source code:
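Before importing the loader, the integration package needs to be installed and an Apify API token made available. A minimal setup sketch (the environment variable name follows the usual Apify convention; the token value is a placeholder):

```shell
# Install the integration package listed in the table above
pip install langchain-apify

# The loader authenticates with the Apify platform via an API token,
# conventionally read from this environment variable (placeholder value)
export APIFY_API_TOKEN="<your-apify-api-token>"
```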
Pricing
Apify Actors can be priced in different ways, depending on the Actor you run. Many Actors support Pay-Per-Event (PPE) pricing, where you pay for explicit events defined by the Actor author (for example, per dataset item). This can be a good fit for agent workloads where you want clear, per-operation costs.

Map dataset items to documents
Next, define a function that maps Apify dataset record fields to the LangChain Document format.
For example, if your dataset items are structured like this:
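A hypothetical item shape is sketched below; the "url" and "text" field names are assumptions for illustration, matching what a content-crawling Actor typically emits (the actual fields depend on the Actor that produced the dataset):

```python
# A hypothetical dataset item; field names are illustrative assumptions
item = {
    "url": "https://apify.com",
    "text": "Apify is the web scraping and automation platform.",
}
```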
then the mapping function will convert them to the LangChain Document format, so that you can use them further with any LLM model (e.g. for question answering).
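The mapping logic can be sketched as follows. To keep the example dependency-free, a minimal stand-in for LangChain's Document class is defined here; in real code you would import it from langchain_core.documents instead. The "url" and "text" field names are assumptions carried over from the item example above:

```python
from dataclasses import dataclass, field

# Minimal stand-in for langchain_core.documents.Document,
# so the mapping can run without installing langchain-core
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def dataset_mapping_function(item: dict) -> Document:
    # Map the assumed "text" field to page_content and keep the URL as metadata
    return Document(
        page_content=item.get("text", ""),
        metadata={"source": item.get("url", "")},
    )

sample = {"url": "https://apify.com", "text": "Apify is the web scraping and automation platform."}
doc = dataset_mapping_function(sample)
print(doc.metadata["source"])  # → https://apify.com
```

This function is what you would pass as the dataset_mapping_function argument when constructing ApifyDatasetLoader, so each dataset record becomes one Document.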