NeMo Curator — GPU-accelerated data curation for LLM training
NeMo Curator

GPU-accelerated data curation for LLM training. Supports text, image, video, and audio. Features fuzzy deduplication (16× faster than CPU), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
Skill metadata
| Field | Value |
|---|---|
| Source | Optional — install with hermes skills install official/mlops/nemo-curator |
| Path | optional-skills/mlops/nemo-curator |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | nemo-curator, cudf, dask, rapids |
| Tags | Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
NeMo Curator - GPU-Accelerated Data Curation
NVIDIA’s toolkit for preparing high-quality training data for LLMs.
When to use NeMo Curator
Use NeMo Curator when:
- Preparing LLM training data from web scrapes (Common Crawl)
- Deduplicating large corpora quickly (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster
Performance:
- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes
Use alternatives instead:
- datatrove: CPU-based, open-source data processing
- dolma: Allen AI’s data toolkit
- Ray Data: General ML data processing (no curation focus)
Quick start
Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```
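Before running a pipeline, it can be worth confirming the GPU stack imports cleanly. A minimal sketch, assuming the CUDA 12 extras above pulled in RAPIDS cuDF:

```python
# Minimal GPU sanity check: cudf is the RAPIDS DataFrame library
# NeMo Curator builds on for GPU execution.
import cudf

s = cudf.Series([1, 2, 3])
print(s.sum())  # runs on the GPU; prints 6
```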
Basic text curation pipeline

```python
import pandas as pd

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset.from_pandas(df)

# Quality filtering: keep documents longer than five words
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```
Data curation pipeline

Stage 1: Quality filtering
```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ available) by wrapping each in ScoreFilter

# Word count filter
dataset = ScoreFilter(WordCountFilter(min_words=50, max_words=100000))(dataset)

# Remove repetitive content
dataset = ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))(dataset)

# URL ratio filter
dataset = ScoreFilter(UrlRatioFilter(max_url_ratio=0.2))(dataset)
```
Stage 2: Deduplication

Exact deduplication:
```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

Fuzzy deduplication (16× faster on GPU):
```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,   # MinHash signature length
    num_buckets=20,   # LSH buckets
    hash_method="md5",
)

deduped = fuzzy_dedup(dataset)
```
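For intuition on these defaults: under the standard MinHash-LSH banding analysis (assuming, as in the usual scheme, that the 260 hashes are split evenly into 20 buckets of 13 rows each), two documents with Jaccard similarity s become a candidate pair with probability

```latex
P(\text{candidate}) = 1 - \left(1 - s^{13}\right)^{20}
```

which is roughly 0.18 at s = 0.7, 0.68 at s = 0.8, and 0.997 at s = 0.9, so these settings target near-duplicates rather than loosely related documents.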
Semantic deduplication:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```
Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```
Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```
GPU acceleration

GPU vs CPU performance

| Operation | CPU | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |
Multi-GPU scaling
```python
import dask_cuda

from nemo_curator import get_client

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```
Multi-modal curation

Image curation
```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```
Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```
Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by word error rate (WER)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```
Common patterns

Web scrape curation (Common Crawl)
```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline: each stage is callable on a DocumentDataset
pipeline = [
    # 1. Quality filtering
    ScoreFilter(WordCountFilter(min_words=100, max_words=50000)),
    ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.2)),
    ScoreFilter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3)),
    ScoreFilter(UrlRatioFilter(max_url_ratio=0.3)),

    # 2. Language filtering
    ScoreFilter(LanguageIdentificationFilter(target_languages=["en"])),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    Modify(PIIRedactor()),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute each stage in order
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```
Distributed processing

```python
from dask_cuda import LocalCUDACluster

from nemo_curator import get_client

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```
Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)
- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×
Exact deduplication (1TB)
- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×
Quality filtering (100GB)
- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×
Cost comparison
CPU-based curation (AWS c5.18xlarge × 10):
- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320
GPU-based curation (AWS p4d.24xlarge × 2):
- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55
Savings: 89% reduction ($3,828 saved)
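A quick check of the arithmetic above (a standalone snippet, not part of NeMo Curator):

```python
# Reproduce the cost comparison figures
cpu_total = 3.60 * 10 * 120    # $/hour x 10 instances x 120 hours = $4,320.00
gpu_total = 32.77 * 2 * 7.5    # $/hour x 2 instances x 7.5 hours = $491.55
saved = cpu_total - gpu_total  # $3,828.45
print(f"${saved:,.2f} saved ({saved / cpu_total:.0%} reduction)")
```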
Supported data formats
- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal
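For example, a JSONL-to-Parquet conversion might look like this (a minimal sketch; assumes DocumentDataset.read_json mirrors the read_parquet call used earlier):

```python
from nemo_curator.datasets import DocumentDataset

# Read JSONL shards (read_json assumed to mirror read_parquet above)
dataset = DocumentDataset.read_json("raw_data/*.jsonl")

# Write Parquet, the recommended output format
dataset.to_parquet("curated_data/")
```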
Use cases
Production deployments:
- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile
References
- Filtering Guide - 30+ quality filters, heuristics
- Deduplication Guide - Exact, fuzzy, semantic methods
Resources
- GitHub: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0