HuggingFace Tokenizers
Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1 GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
Skill metadata
| Field | Value |
|---|---|
| Source | Optional — install with `hermes skills install official/mlops/huggingface-tokenizers` |
| Path | `optional-skills/mlops/huggingface-tokenizers` |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `tokenizers`, `transformers`, `datasets` |
| Tags | Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease of use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when:
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI’s BPE tokenizer for GPT models
- transformers `AutoTokenizer`: for loading pretrained tokenizers only (it uses this library internally)
Quick start
Installation
Section titled “Installation”# Install tokenizerspip install tokenizers
# With transformers integrationpip install tokenizers transformersLoad pretrained tokenizer
```python
from tokenizers import Tokenizer

# Load from the HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

Train custom BPE tokenizer
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

Training time: ~1-2 minutes for a 100 MB corpus, ~10-20 minutes for 1 GB.
Batch encoding with padding
```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```

Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find most frequent character pair
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
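To make the merge loop concrete, here is a toy pure-Python sketch of BPE training. The corpus counts and the `bpe_train` helper are made up for illustration; the library implements this loop in optimized Rust.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: words maps word -> count; returns the merge list."""
    # Represent each word as a tuple of symbols (characters to start)
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge to every word in the vocabulary
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

print(bpe_train({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
# e.g. [('e', 's'), ('es', 't'), ('l', 'o')]
```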
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:
- Handles OOV words well (breaks into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages
Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
WordPiece
How it works:
- Start with character vocabulary
- Score each candidate pair: frequency(pair) / (frequency(first) × frequency(second))
- Merge the highest-scoring pair
- Repeat until vocabulary size reached
Used by: BERT, DistilBERT, MobileBERT
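A toy score calculation (all frequencies invented) shows why the formula favors pairs whose parts rarely occur apart:

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    # score = freq(pair) / (freq(first) * freq(second))
    return pair_freq / (first_freq * second_freq)

# "u" + "g": often adjacent, but each part is very common on its own
print(wordpiece_score(pair_freq=500, first_freq=10_000, second_freq=8_000))  # 6.25e-06

# "q" + "u": "q" is rare and almost always followed by "u", so the pair wins
print(wordpiece_score(pair_freq=300, first_freq=320, second_freq=8_000))     # ~1.17e-04
```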
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
```

Advantages:
- Prioritizes meaningful merges (high score = semantically related)
- Used successfully in BERT (state-of-the-art results)
Trade-offs:
- Unknown words become `[UNK]` if no subword match exists
- Saves the vocabulary, not the merge rules (larger files)
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
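The idea of picking the most likely tokenization can be illustrated with a small Viterbi-style dynamic program over an assumed token inventory; the probabilities and the `best_segmentation` helper below are hypothetical:

```python
import math

def best_segmentation(text, logp):
    """Viterbi: most probable segmentation under a unigram LM."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i] = best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i] = start index of last token
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap token length at 10 chars
            tok = text[j:i]
            if tok in logp and best[j] + logp[tok] > best[i]:
                best[i] = best[j] + logp[tok]
                back[i] = j
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Made-up token probabilities: multi-character tokens beat character chains
logp = {t: math.log(p) for t, p in
        {"un": 0.10, "happy": 0.08, "u": 0.02, "n": 0.02,
         "h": 0.02, "a": 0.02, "p": 0.02, "y": 0.02}.items()}
print(best_segmentation("unhappy", logp))  # ['un', 'happy']
```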
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts
Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
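Each stage is a swappable attribute on the `Tokenizer` object. A minimal sketch wiring all four stages together (the special-token ids are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))  # model
tokenizer.normalizer = Lowercase()                   # normalization
tokenizer.pre_tokenizer = Whitespace()               # pre-tokenization
tokenizer.post_processor = TemplateProcessing(       # post-processing
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```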
Normalization
Clean and standardize text:
```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents(),  # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
```

Common normalizers:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement
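Normalizers can also be tested in isolation with `normalize_str`; for example, a small sketch composing `Replace` and `Strip`:

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace, Sequence, Strip

# Collapse runs of whitespace to a single space, then trim the ends
normalizer = Sequence([Replace(Regex(r"\s+"), " "), Strip()])
print(normalizer.normalize_str("  Héllo   world  "))  # "Héllo world"
```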
Pre-tokenization
Split text into word-like units:
```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation(),
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
```

Common pre-tokenizers:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
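Pre-tokenizers can likewise be inspected in isolation with `pre_tokenize_str`, which returns each piece together with its character offsets (outputs shown are indicative):

```python
from tokenizers.pre_tokenizers import Digits, Whitespace

print(Whitespace().pre_tokenize_str("Hello world!"))
# [('Hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]

print(Digits(individual_digits=True).pre_tokenize_str("A1B23"))
# [('A', (0, 1)), ('1', (1, 2)), ('B', (2, 3)), ('2', (3, 4)), ('3', (4, 5))]
```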
Post-processing
Add special tokens for model input:
```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

Common patterns:
```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)],
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)],
)
```

Alignment tracking
Track token positions in the original text:
```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, (start, end) in zip(output.tokens, output.offsets):
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```

Use cases:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
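For the NER use case, a minimal sketch that maps per-token predictions back to character spans via offsets. The example text and `labels` are made-up model outputs, and it assumes the encoding adds no special tokens:

```python
text = "Hugging Face is based in New York City"
output = tokenizer.encode(text)

# Hypothetical per-token predictions, one label per token
labels = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC"]

# Walk tokens and offsets, grouping contiguous B-/I- labels into spans
spans = []
start = end = None
for (s, e), label in zip(output.offsets, labels):
    if label.startswith("B-"):
        if start is not None:
            spans.append(text[start:end])
        start, end = s, e
    elif label.startswith("I-") and start is not None:
        end = e
    elif start is not None:
        spans.append(text[start:end])
        start = None
if start is not None:
    spans.append(text[start:end])

print(spans)  # e.g. ['Hugging Face', 'New York City']
```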
Integration with transformers
Load with AutoTokenizer
```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using the fast tokenizer
print(tokenizer.is_fast)  # True

# Access the underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

Convert custom tokenizer to transformers
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```

Common patterns
Train from iterator (large datasets)
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset),  # For progress bar
)
```

Performance: processes 1 GB in ~10-20 minutes.
Enable truncation and padding
```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512,  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

Multi-processing
```python
from multiprocessing import Pool
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process a large corpus in parallel
# (corpus: a list of strings, assumed loaded elsewhere)
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

Speedup: 5-8× with 8 cores.
Performance benchmarks
Training speed
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
Memory usage
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Supported models
Pre-trained tokenizers available via `from_pretrained()`:
BERT family:
- `bert-base-uncased`, `bert-large-cased`
- `distilbert-base-uncased`
- `roberta-base`, `roberta-large`
GPT family:
- `gpt2`, `gpt2-medium`, `gpt2-large`
- `distilgpt2`
T5 family:
- `t5-small`, `t5-base`, `t5-large`
- `google/flan-t5-xxl`
Other:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`
Browse all: https://huggingface.co/models?library=tokenizers
References
- Training Guide - Train custom tokenizers, configure trainers, handle large datasets
- Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
- Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
- Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens
Resources
- Docs: https://huggingface.co/docs/tokenizers
- GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
- Version: 0.20.0+
- Course: https://huggingface.co/learn/nlp-course/chapter6/1
- Papers: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)