# GPT-X2
A 32,768-vocabulary byte-level BPE tokenizer trained on 50GB of FineWeb-Edu (the sample-100BT subset) using the HuggingFace tokenizers library.
| Property | Value |
|---|---|
| Vocab Size | 32,768 |
| Type | Byte-level BPE |
| Training Data | 50GB FineWeb-Edu (sample-100BT) |
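As a rough sketch, a tokenizer with the properties in the table above could be trained with the HuggingFace tokenizers library roughly as follows. The special token name, the in-memory corpus stand-in, and the use of `train_from_iterator` are assumptions for illustration, not the exact recipe used for this release:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Stand-in corpus; the real run would stream FineWeb-Edu (sample-100BT) text.
corpus = ["The quick brown fox jumps over the lazy dog."] * 100

# Byte-level BPE, matching the table above.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    # Seed the vocab with all 256 byte symbols so any input can be encoded.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>"],  # assumed special token, not confirmed by the card
)
tokenizer.train_from_iterator(corpus, trainer)

# Byte-level BPE round-trips arbitrary text losslessly.
ids = tokenizer.encode("hello world").ids
assert tokenizer.decode(ids) == "hello world"
```

Because the initial alphabet covers every byte, the trained tokenizer never needs an `<unk>` token; unseen words simply fall back to shorter merges or single bytes.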
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gpt-tok-32k-fineweb")

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")
print(tokenizer.decode(tokens))
```
Trained using the HuggingFace tokenizers library with: