# GPT-X2
A 32,768-vocabulary byte-level BPE tokenizer trained on 50GB of FineWeb-Edu (the sample-100BT subset) using the HuggingFace tokenizers library.
| Property | Value |
|---|---|
| Vocab Size | 32,768 |
| Type | Byte-level BPE |
| Training Data | 50GB FineWeb-Edu (sample-100BT) |
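As a rough sketch, a tokenizer with the properties in the table above could be trained with the HuggingFace tokenizers library roughly as follows. The special token name, the in-memory corpus stand-in, and the use of `train_from_iterator` are assumptions for illustration, not the exact recipe used for this release:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Stand-in corpus; the real run would stream FineWeb-Edu (sample-100BT) text.
corpus = ["The quick brown fox jumps over the lazy dog."] * 100

# Byte-level BPE, matching the table above.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    # Seed the vocab with all 256 byte symbols so any input can be encoded.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>"],  # assumed special token, not confirmed by the card
)
tokenizer.train_from_iterator(corpus, trainer)

# Byte-level BPE round-trips arbitrary text losslessly.
ids = tokenizer.encode("hello world").ids
assert tokenizer.decode(ids) == "hello world"
```

Because the initial alphabet covers every byte, the trained tokenizer never needs an `<unk>` token; unseen words simply fall back to shorter merges or single bytes.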
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gpt-tok-32k-fineweb")

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")
print(tokenizer.decode(tokens))
```
Trained using the HuggingFace tokenizers library with: