File: `tutorials/data/common-crawl/README.md` (new, 76 additions)
# Common Crawl Data Preprocessing Tutorial with NeMo Curator

This guide explains how to prepare the Common Crawl dataset for language model pretraining.

**Common Crawl** is a non-profit organization that maintains an open repository of web crawl data. The dataset consists of petabytes of raw web page data, metadata extracts, and text extracts collected monthly since 2008. Each crawl contains billions of web pages from across the internet, making it one of the largest openly available datasets for training language models.

Dataset source: [Common Crawl](https://commoncrawl.org/)

## Dataset overview

The Common Crawl dataset is organized by **snapshots**, with each snapshot representing a complete web crawl:

- **Main crawls**: General purpose web crawl covering diverse websites across the internet. Released approximately monthly (format: `CC-MAIN-YYYY-WW`, e.g., `CC-MAIN-2025-30`)
- **News crawls**: Focused crawl of news websites and articles, useful for domain-specific training on journalism and current events. Released monthly (format: `YYYY-MM`, e.g., `2025-08`)

Each `CC-MAIN` snapshot contains multiple WARC (Web ARChive) files, each approximately **1 GB compressed**, alongside a `warc.paths.gz` file that lists all WARC files in that snapshot. A typical main crawl snapshot includes around **80,000 WARC files**, totaling roughly **70-100 TB compressed**. Each WARC file contains raw HTTP response data (HTML pages, headers, metadata) that requires extraction and filtering before it can be used for language model training.

## Requirements

Install NeMo Curator directly from the GitHub repository with the CPU text processing extension:

```bash
pip install "nemo-curator[text_cpu] @ git+https://github.com/NVIDIA-NeMo/Curator.git@main"
```

For more information about NeMo Curator, visit the [official repository](https://github.com/NVIDIA-NeMo/Curator).
For detailed installation instructions, see the [installation guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html).

## Usage

To **download, uncompress, filter, and tokenize** Common Crawl data, run the `prepare_commoncrawl.py` script.
All stages of the pipeline are handled within a single script thanks to **NeMo Curator**, which provides modular processing stages for downloading, cleaning, and tokenizing large-scale text datasets.

You will need to specify a few configuration options when running the script:

* **`--output_dir`**
Path where processed data will be stored.

* Raw files will be placed in: `output_dir/raw`
* Tokenized files will be placed in: `output_dir/tokens`

* **`--tokenizer-model`**
The tokenizer model identifier from the Hugging Face Hub used to tokenize the dataset.

* **`--append-eod`**
When set, appends an End-of-Document (EOD) token to every sample.

* **`--filter-data`**
Enables NeMo Curator’s filtering stages to remove short documents, documents with excessive URLs, and documents with highly repetitive n-grams.

* **`--use-aws-to-download`**
  Enables downloading Common Crawl snapshots from S3 (via `s5cmd`).

* **`--start-snapshot`** and **`--end-snapshot`**
  Specify the range of snapshots to download, in `YYYY-WW` format (e.g., `2025-30`).

* **`--url-limit`**
Maximum number of files to download. Each file is approximately **1 GB**.
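
The repetition heuristic behind the last `--filter-data` stage can be illustrated with a minimal sketch: score a document by the fraction of its n-grams taken up by the single most frequent n-gram, and drop documents where that fraction is high. The 2-gram choice here is illustrative, not NeMo Curator's internal configuration:

```python
from collections import Counter

def top_ngram_ratio(text: str, n: int = 2) -> float:
    """Fraction of all n-grams accounted for by the most frequent n-gram."""
    words = text.split()
    ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    (_, top_count), = Counter(ngrams).most_common(1)
    return top_count / len(ngrams)

clean = "the quick brown fox jumps over the lazy dog"
spam = "buy now buy now buy now buy now buy now"
print(top_ngram_ratio(clean))  # low: no dominant 2-gram
print(top_ngram_ratio(spam))   # high: "buy now" dominates
```

The URL and word-count filters follow the same score-then-threshold pattern, which is why each one is wrapped in a `ScoreFilter` stage in the script below.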

### Example

```bash
python3 prepare_commoncrawl.py \
--output_dir $DATASETS_PATH/CommonCrawl \
--tokenizer-model nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
--append-eod \
--filter-data \
--url-limit 5
```

---

When the script completes, it will automatically generate a **`dataset-prefixes.txt`** file in the output directory.
This file contains the dataset file prefixes required by **Megatron-LM** and **Megatron-Bridge** via the `--data-args-path` configuration.
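
The prefixes file is produced by a simple rule: every `.bin` file under the tokens directory, minus its extension (Megatron datasets come as `.bin`/`.idx` pairs sharing a prefix). A sketch of that logic, using hypothetical shard names rather than real `MegatronTokenizerWriter` output:

```python
import tempfile
from pathlib import Path

def write_prefix_file(tokens_dir: str, data_args_path: str) -> list[str]:
    """Collect Megatron dataset prefixes (each .bin path minus its extension)."""
    prefixes = sorted(str(p)[: -len(".bin")] for p in Path(tokens_dir).glob("**/*.bin"))
    Path(data_args_path).write_text("".join(p + "\n" for p in prefixes))
    return prefixes

# Demo with hypothetical shard names
with tempfile.TemporaryDirectory() as tokens_dir:
    for name in ("shard_000", "shard_001"):
        Path(tokens_dir, f"{name}.bin").touch()  # token data
        Path(tokens_dir, f"{name}.idx").touch()  # index file
    print(write_prefix_file(tokens_dir, Path(tokens_dir) / "dataset-prefixes.txt"))
```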

For more details about the new `MegatronTokenizerWriter` stage, refer to the ["megatron-tokenizer" tutorial in NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text/megatron-tokenizer).
File: `tutorials/data/common-crawl/prepare_commoncrawl.py` (new, 129 additions)
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import time
from pathlib import Path

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.filters import RepeatingTopNGramsFilter, UrlsFilter, WordCountFilter
from nemo_curator.stages.text.io.writer import MegatronTokenizerWriter
from nemo_curator.stages.text.modules import ScoreFilter


def main(args: argparse.Namespace) -> None: # noqa: D103
# Initialize and start the Ray client
ray_client = RayClient()
ray_client.start()

raw_dir = os.path.join(args.output_dir, "raw")
tokens_dir = os.path.join(args.output_dir, "tokens")

os.makedirs(raw_dir, exist_ok=True)
os.makedirs(tokens_dir, exist_ok=True)

print("Filtering and tokenizing Common Crawl data for Megatron-LM training")
print(f" The raw dataset will be written to '{raw_dir}'")
print(f" The filtered and tokenized dataset will be written to '{tokens_dir}'")

# Create a pipeline with the stages
pipeline = Pipeline(
name="commoncrawl-filter-and-tokenize",
description="Filter and tokenize Common Crawl data for Megatron-LM training.",
)
# Download and extract the Common Crawl data
pipeline.add_stage(
CommonCrawlDownloadExtractStage(
start_snapshot=args.start_snapshot,
end_snapshot=args.end_snapshot,
download_dir=raw_dir,
crawl_type="main",
use_aws_to_download=args.use_aws_to_download,
verbose=True,
url_limit=args.url_limit,
)
)

# Filter the data
if args.filter_data:
## Filter short documents
pipeline.add_stage(ScoreFilter(filter_obj=WordCountFilter(min_words=50)))
## Filter documents with too many URLs
pipeline.add_stage(ScoreFilter(filter_obj=UrlsFilter(max_url_to_text_ratio=0.1)))
## Filter documents with too many repeating n-grams
pipeline.add_stage(ScoreFilter(filter_obj=RepeatingTopNGramsFilter()))

# Tokenize the data
pipeline.add_stage(
MegatronTokenizerWriter(
path=tokens_dir,
model_identifier=args.tokenizer_model,
append_eod=args.append_eod,
)
)

print("Starting the filtering and tokenization pipeline")
start_time = time.time()
# Run the pipeline
results = pipeline.run()
end_time = time.time()
execution_time = end_time - start_time
    # Report timing and output locations
    print(f"\n\nFiltering and tokenization pipeline finished (took {execution_time:.2f} seconds)")
    print(f"The results were written to: {[result.data for result in results]}")

# Stop the Ray client
ray_client.stop()

# Create --data-args-path file
data_args_path = os.path.join(args.output_dir, "dataset-prefixes.txt")
    file_prefixes = [str(file.with_suffix("")) for file in Path(tokens_dir).glob("**/*.bin")]
with open(data_args_path, "w") as f:
for file_prefix in file_prefixes:
f.write(file_prefix + "\n")

print(f"The --data-args-path file was written to '{data_args_path}'")


if __name__ == "__main__":
parser = argparse.ArgumentParser()
group = parser.add_argument_group(title="input data")
group.add_argument("--output_dir", type=str, required=True, help="Path to output directory")
group.add_argument(
"--tokenizer-model", type=str, required=True, help="Hugging Face model identifier for the tokenizer"
)
group.add_argument("--append-eod", action="store_true", help="Append an <eod> token to the end of each sample.")
group.add_argument(
"--filter-data",
action="store_true",
help="Filter short documents, documents with too many URLs, and documents with too many repeating n-grams using NeMo Curator's filters",
)
group.add_argument(
"--use-aws-to-download",
action="store_true",
help="Use the s5cmd command to download from the Common Crawl's S3 bucket",
)
    group.add_argument("--start-snapshot", type=str, default="2025-30", help="First snapshot to download (YYYY-WW)")
    group.add_argument("--end-snapshot", type=str, default="2025-30", help="Last snapshot to download (YYYY-WW)")
group.add_argument(
"--url-limit",
type=int,
default=2,
help="Limit the number of URLs/WARC files to download. Each WARC file is ~1GB",
)
args = parser.parse_args()
main(args)