AxiomicLabs/GPT-X2-125M
AxiomicLabs • code
GPT-X2-125M is the second-generation GPT-X model: 125M parameters, 75B tokens, custom 32K tokenizer, 30 layers. Trained from scratch using a 4-source progressive curriculum with AST-normalized code, it achieves near state-of-the-art performance on both natural language and structured reasoning benchmarks at the 125M scale with a fraction of the training data used by comparable models.
Results

GPT-X2-125M achieves competitive performance with leading models despite using significantly less data. Most notably, it matches SmolLM2-135M within ~1 point on aggregate while using ~27x fewer tokens and having ~8% fewer parameters.
Evaluated with an internal harness modeled on EleutherAI/lm-eval-harness; all benchmarks are zero-shot.
| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 48.78% | 48.46% | 33.26% | 47.64% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 49.12% | 49.01% | 34.78% | 46.72% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 43.89% | 50.43% | 32.34% | 46.62% | 600B |
| Meta | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 45.04% | 50.28% | 46.94% | 46.07% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 43.83% | 50.83% | 30.52% | 44.39% | 15B |
| Meta | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 Medium (355M) | 39.40% | 34.80% | 66.30% | 44.90% | 50.40% | 34.80% | 43.94% | ~10B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | 44.52% | 48.54% | 32.80% | 42.01% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | 40.87% | 49.41% | 28.06% | 39.45% | ~225B |
| Meta | OPT-125M | 31.39% | 31.53% | 62.02% | 43.81% | 49.96% | 27.48% | 41.03% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 45.40% | 49.09% | 29.98% | 41.37% | 300B |
LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.
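To illustrate what "procedural" means here, a generator can scale difficulty by operand size. The exact LogicMark/ArithMark item formats are not documented in this card, so the template below is purely hypothetical:

```python
import random

# Hypothetical sketch of a procedurally generated arithmetic item.
# The real ArithMark format is not specified here; scaling difficulty by
# operand magnitude is an assumption for illustration only.
def make_arith_item(level: int, rng: random.Random) -> tuple[str, str]:
    lo, hi = 10 ** (level - 1), 10 ** level - 1    # larger operands at higher levels
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    op = rng.choice(["+", "-", "*"])
    question = f"{a} {op} {b} ="
    answer = str(eval(f"{a} {op} {b}"))
    return question, answer

rng = random.Random(0)
for level in (1, 2, 3):
    q, a = make_arith_item(level, rng)
    print(f"level {level}: {q} {a}")
```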
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-X2-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
What's New in v2
| Change | GPT-X (v1) | GPT-X2 (v2) | Why |
|---|---|---|---|
| Tokenizer | GPT-2 BPE (50K) | Custom 32K trained on FineWeb-Edu | ~9% better compression, frees params for layers |
| Depth | 27 layers | 30 layers | Saved embedding params reinvested into 3 extra layers |
| rope_theta | 10,000 | 100,000 | Better long-range attention (SmolLM2-proven) |
| Learning rate | 6e-4 | 1.5e-3 | SmolLM used 3e-3 at this scale; v1 was too conservative |
| LR decay | Cosine to 1/10th peak | WSD decay to 0 | SmolLM-style; tighter convergence during cooldown |
| Warmup | 1,000 steps | 2,000 steps | Higher LR needs longer warmup for stability |
| Data | 15B tokens, FineWeb-Edu only | 75B tokens, 4-source curriculum | 5x more data with progressive multi-source mixing |
| Training | Messy resume | Clean from-scratch | Controlled curriculum with planned distribution shifts |
Architecture
| Component | Details |
|---|---|
| Position encoding | RoPE (theta=100,000) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention -- 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss on logit magnitudes (coefficient 1e-4 for the first 31B tokens, then 0) |
| Depth | 30 layers x 576 hidden |
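A minimal PyTorch sketch of the attention path implied by this table (bias-free GQA with 9 query / 3 KV heads, per-head QK-Norm applied before RoPE). This is an illustrative reconstruction from the table, not the released model code, and the RoPE helper is deliberately simplified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N_Q, N_KV, HD = 576, 9, 3, 64        # hidden size, query heads, kv heads, head dim
THETA = 100_000.0                        # rope_theta

def rope(x: torch.Tensor, theta: float = THETA) -> torch.Tensor:
    """Minimal rotary position embedding; x is (batch, heads, seq, head_dim)."""
    b, h, t, d = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class GQAAttention(nn.Module):
    """Bias-free grouped-query attention (9Q/3KV) with per-head QK-Norm before RoPE."""
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(D, N_Q * HD, bias=False)
        self.wk = nn.Linear(D, N_KV * HD, bias=False)
        self.wv = nn.Linear(D, N_KV * HD, bias=False)
        self.wo = nn.Linear(N_Q * HD, D, bias=False)
        self.q_norm = nn.RMSNorm(HD)     # QK-Norm over head_dim (PyTorch >= 2.4)
        self.k_norm = nn.RMSNorm(HD)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, N_Q, HD).transpose(1, 2)
        k = self.wk(x).view(b, t, N_KV, HD).transpose(1, 2)
        v = self.wv(x).view(b, t, N_KV, HD).transpose(1, 2)
        q, k = rope(self.q_norm(q)), rope(self.k_norm(k))
        k = k.repeat_interleave(N_Q // N_KV, dim=1)      # expand 3 KV heads to 9
        v = v.repeat_interleave(N_Q // N_KV, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(b, t, N_Q * HD))

x = torch.randn(1, 16, D)
print(GQAAttention()(x).shape)           # torch.Size([1, 16, 576])
```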
Config
```text
vocab_size   = 32,768   (custom BPE trained on FineWeb-Edu)
n_layer      = 30
n_head       = 9        (query heads)
n_kv_heads   = 3        (key-value heads, 3:1 GQA)
n_embd       = 576
head_dim     = 64
intermediate = 1,536    (SwiGLU, 2.67x ratio)
block_size   = 1,024
rope_theta   = 100,000
total params = 125,081,664
```
Parameter Breakdown
| Component | Params |
|---|---|
| Token embeddings (32768 x 576) | 18,874,368 |
| Per block (x30): attention + SwiGLU + norms | 3,540,224 |
| 30 transformer blocks | 106,206,720 |
| Final RMSNorm | 576 |
| LM head (tied with embeddings) | 0 |
| Total | 125,081,664 |
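The breakdown can be reproduced from the config values above. The per-block decomposition below (GQA projections + SwiGLU MLP + two block RMSNorms + shared QK-Norm weights) is inferred from the architecture table rather than stated explicitly, but it matches the published totals exactly:

```python
# Reproduce the parameter breakdown from the config values above.
vocab, d, n_layer = 32_768, 576, 30
n_q, n_kv, head_dim, ffn = 9, 3, 64, 1_536

embeddings = vocab * d                                                     # 18,874,368 (tied LM head)
attn = d * n_q * head_dim + 2 * d * n_kv * head_dim + n_q * head_dim * d  # Wq, Wk, Wv, Wo
swiglu = 3 * d * ffn                                                       # gate, up, down projections
norms = 2 * d + 2 * head_dim                                               # two block RMSNorms + QK-Norm (q, k)
per_block = attn + swiglu + norms                                          # 3,540,224
total = embeddings + n_layer * per_block + d                               # + final RMSNorm
print(per_block, total)                                                    # 3540224 125081664
```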
Training
Data
The curriculum gradually introduces specialized data (math/code) after the model learns core language, improving reasoning without harming fluency. Four data sources mixed via a progressive curriculum:
| Source | Dataset | Purpose |
|---|---|---|
| FineWeb-Edu | HuggingFaceFW/fineweb-edu (sample-100BT) | Primary educational web text |
| DCLM | mlfoundations/dclm-baseline-1.0 | High-quality diverse web text |
| FineMath | HuggingFaceTB/finemath (finemath-4plus) | Mathematical reasoning (score >= 4) |
| NPset-Python | AxiomicLabs/NPset-python | AST-normalized Python code |
- Tokens: 75B (143,051 steps x 524,288 tokens/step)
- Tokenizer: Custom 32K BPE trained on 50GB of FineWeb-Edu
- Final val loss: 2.7525 (FineWeb-Edu held-out)
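A sketch of how a 32K byte-level BPE like this one could be trained with the Hugging Face tokenizers library; the file paths and special tokens below are illustrative assumptions, not the exact GPT-X2 recipe:

```python
from tokenizers import ByteLevelBPETokenizer

# Sketch only: the input shard name and special-token list are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["fineweb_edu_shard_000.txt"],   # ~50GB of FineWeb-Edu text in practice
    vocab_size=32_768,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("gpt-x2-tokenizer")
```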
Progressive Data Curriculum
Rather than a fixed mixture, data sources are introduced progressively to let the model build foundational language capabilities before adding specialized data:
| Phase | Token Range | FineWeb-Edu | DCLM | FineMath | Code |
|---|---|---|---|---|---|
| Early | 0 -- 18B | 58% | 40% | 1% | 1% |
| Ramp | 18B -- 20B | 58% -> 54% | 40% -> 36% | 1% -> 6% | 1% -> 4% |
| Hold | 20B -- 45B | 54% | 36% | 6% | 4% |
| Taper | 45B -- 58B | 54% -> 55% | 36% -> 38% | 6% -> 4.5% | 4% -> 2.5% |
| Hold | 58B -- 75B | 55% | 38% | 4.5% | 2.5% |
FineMath and code are drip-fed at 1% from the start so the model sees early examples, then ramped to peak during the stable LR phase for maximum learning, and gradually tapered back toward primary sources during the LR decay phase.
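One way to implement such a schedule is to linearly interpolate the source weights between phase boundaries based on tokens seen so far. The phase table above is the source of truth; the helper below is an illustrative sketch, not the training code:

```python
import bisect

# (token budget in billions, [FineWeb-Edu, DCLM, FineMath, Code]) at each breakpoint.
# Weights between breakpoints are linearly interpolated; holds repeat the same mix.
BREAKPOINTS = [
    (0,  [0.58, 0.40, 0.010, 0.010]),
    (18, [0.58, 0.40, 0.010, 0.010]),
    (20, [0.54, 0.36, 0.060, 0.040]),
    (45, [0.54, 0.36, 0.060, 0.040]),
    (58, [0.55, 0.38, 0.045, 0.025]),
    (75, [0.55, 0.38, 0.045, 0.025]),
]

def mixture(tokens_seen_b: float) -> list[float]:
    """Return the sampling weights for the current point in training."""
    xs = [b for b, _ in BREAKPOINTS]
    i = min(bisect.bisect_right(xs, tokens_seen_b), len(xs) - 1)
    (x0, w0), (x1, w1) = BREAKPOINTS[i - 1], BREAKPOINTS[i]
    t = 0.0 if x1 == x0 else (tokens_seen_b - x0) / (x1 - x0)
    return [a + t * (b - a) for a, b in zip(w0, w1)]

print(mixture(19))   # halfway through the 18B -> 20B ramp
```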
AST-Based Code Normalization
Raw Python code is parsed into an AST and converted to a compact pseudocode representation (TinyDSL) before tokenization. This yields ~1.25x token compression over raw code, letting the model learn programming reasoning without dedicating tokenizer vocabulary to code-specific symbols and with less syntactic overhead to absorb.
The normalizer:
- Strips comments, docstrings, whitespace, and syntactic noise
- Replaces Python builtins with full English words (`len` -> `length`, `str` -> `string`, etc.)
- Uses natural-language keywords with spaces (`end function`, `for else`, `list comprehension`, etc.)
- Expands comprehensions, decorators, exception handling, and all Python AST node types
Example:
```python
# Raw Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Normalized TinyDSL
# function fibonacci n
# begin
# if n <= 1
# begin
# return n
# end
# return call fibonacci n - 1 + call fibonacci n - 2
# end function
```
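A minimal sketch of the AST-based approach using Python's ast module. The real TinyDSL normalizer covers every Python node type (plus builtin renaming and comprehension expansion); this fragment only handles the function/if/return subset needed to reproduce the fibonacci example above:

```python
import ast

# Tiny fragment of an AST -> pseudocode normalizer, for illustration only.
def expr(node) -> str:
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "+", ast.Sub: "-"}[type(node.op)]
        return f"{expr(node.left)} {op} {expr(node.right)}"
    if isinstance(node, ast.Compare):
        op = {ast.LtE: "<="}[type(node.ops[0])]
        return f"{expr(node.left)} {op} {expr(node.comparators[0])}"
    if isinstance(node, ast.Call):
        args = " ".join(expr(a) for a in node.args)
        return f"call {expr(node.func)} {args}"
    raise NotImplementedError(type(node))

def stmt(node, out):
    if isinstance(node, ast.FunctionDef):
        out.append(f"function {node.name} " + " ".join(a.arg for a in node.args.args))
        out.append("begin")
        for s in node.body:
            stmt(s, out)
        out.append("end function")
    elif isinstance(node, ast.If):
        out.append(f"if {expr(node.test)}")
        out.append("begin")
        for s in node.body:
            stmt(s, out)
        out.append("end")
    elif isinstance(node, ast.Return):
        out.append(f"return {expr(node.value)}")

source = (
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n - 1) + fibonacci(n - 2)\n"
)
lines: list[str] = []
for node in ast.parse(source).body:
    stmt(node, lines)
print("\n".join(lines))   # matches the TinyDSL example above
```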
Optimization
- Optimizer: AdamW (betas=0.9/0.95; weight_decay 0.1 for the first 31B tokens, then 0.01)
- Learning rate: 1.5e-3 max, decays to 0
- Schedule: WSD -- 2,000 step warmup, stable phase (80%), linear decay to 0 over final 20%
- Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, grad_accum=64)
- Precision: bfloat16 mixed precision
- Gradient clipping: 1.0
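The WSD schedule above can be expressed as a simple step-to-LR function. This is a sketch assembled from the numbers listed here, not the exact training code:

```python
MAX_LR, WARMUP, TOTAL_STEPS = 1.5e-3, 2_000, 143_051
DECAY_START = int(TOTAL_STEPS * 0.80)       # stable phase covers ~80% of training

def wsd_lr(step: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat peak, linear decay to 0."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP                       # linear warmup
    if step < DECAY_START:
        return MAX_LR                                       # stable phase at peak LR
    return MAX_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - DECAY_START)  # decay to 0

print(wsd_lr(1_000), wsd_lr(50_000), wsd_lr(TOTAL_STEPS))   # 0.00075 0.0015 0.0
```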
Hardware
- 1x RTX 3080 Ti
- Training time: ~500 hours
Design Decisions
- 30 layers x 576 hidden -- 3 more layers than v1, made possible by smaller 32K vocab. Depth is the primary driver of quality at 125M scale (SmolLM, MobileLLM).
- Custom 32K tokenizer -- Trained on FineWeb-Edu for ~9% better compression than GPT-2 BPE. Fewer vocab entries = smaller embedding table = more params for transformer layers.
- rope_theta=100K -- Matches SmolLM2-135M. Better extrapolation and long-range dependency modeling.
- 1.5e-3 learning rate -- GPT-X v1 used 6e-4, which was too conservative. SmolLM showed 3e-3 works at the 135M scale with a ~1M-token batch, so 1.5e-3 was used here with a 524K-token batch.
- GQA 3:1 -- 9 query heads, 3 KV heads. Saves attention parameters reinvested into SwiGLU capacity.
- QK-Norm -- Critical for 30-layer stability. RMSNorm on Q and K before RoPE.
- z-loss -- Prevents logit-magnitude drift (as in PaLM and T5); applied only during early training (see the sketch after this list).
- Progressive curriculum -- FineMath and NPset-Python introduced at 1% early, ramped to peak during stable LR, tapered during decay. Lets the model build language foundations first, then learn specialized reasoning and logic patterns.
- AST code normalization (NPset-Python) -- 1.25x token compression via Python AST to TinyDSL conversion. Strips syntactic noise and normalizes identifiers so the model learns programming structure rather than memorizing variable names.
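A minimal sketch of the z-loss term, following the PaLM-style formulation that penalizes the squared log-partition function; the 1e-4 coefficient comes from the architecture table, while the annealing to 0 after ~31B tokens would be handled outside this function:

```python
import torch
import torch.nn.functional as F

def loss_with_z(logits: torch.Tensor, targets: torch.Tensor, z_coef: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus z-loss, which keeps log(sum(exp(logits))) near zero."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    z = torch.logsumexp(logits, dim=-1)      # log partition function per position
    return ce + z_coef * (z ** 2).mean()

logits = torch.randn(2, 8, 32_768)           # (batch, seq, vocab)
targets = torch.randint(0, 32_768, (2, 8))
print(loss_with_z(logits, targets))
```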
Limitations
- Hardware: Training took roughly 500 hours for 75B tokens on a single RTX 3080 Ti; as a result, ablations were not feasible, and VRAM capped the context window at 1,024 tokens
- Small model: 125M parameters limits reasoning and factual recall
- Educational data only: Primarily trained on educational datasets; not representative of general web text
- Not instruction-tuned: Base model only, not aligned for chat
- English only
- 1024 context window
Citation
```bibtex
@misc{gptx2_2025,
  title={GPT-X2: Data-Efficient Language Modeling at 125M Scale},
  author={Axiomic Labs},
  year={2025},
  howpublished={\url{https://huggingface.co/AxiomicLabs/GPT-X2-125M}},
  note={Trained on 75B tokens with a progressive curriculum and custom tokenizer}
}
```