AxiomicLabs/GPT-X2-125M
AxiomicLabs • code
GPT-X2-125M is the second-generation GPT-X model: 125M parameters, 75B tokens, custom 32K tokenizer, 30 layers. Trained from scratch using a 4-source progressive curriculum with AST-normalized code, it achieves near state-of-the-art performance on both natural language and structured reasoning benchmarks at the 125M scale with a fraction of the training data used by comparable models.
Results

GPT-X2-125M achieves competitive performance with leading models despite using significantly less data. Most notably, it matches SmolLM2-135M within ~1 point on aggregate while using ~27x fewer tokens and having ~8% fewer parameters.
Evaluated with an internal harness modeled on EleutherAI/lm-eval-harness; all benchmarks are zero-shot.
| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 48.78% | 48.46% | 33.26% | 47.64% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 49.12% | 49.01% | 34.78% | 46.72% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 43.89% | 50.43% | 32.34% | 46.62% | 600B |
| Meta | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 45.04% | 50.28% | 46.94% | 46.07% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 43.83% | 50.83% | 30.52% | 44.39% | 15B |
| Meta | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 Medium (355M) | 39.40% | 34.80% | 66.30% | 44.90% | 50.40% | 34.80% | 43.94% | ~10B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | 44.52% | 48.54% | 32.80% | 42.01% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | 40.87% | 49.41% | 28.06% | 39.45% | ~225B |
| Meta | OPT-125M | 31.39% | 31.53% | 62.02% | 43.81% | 49.96% | 27.48% | 41.03% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 45.40% | 49.09% | 29.98% | 41.37% | 300B |
LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.
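To illustrate what "procedural" means here, a generator can scale difficulty by operand size. The exact LogicMark/ArithMark item formats are not documented in this card, so the template below is purely hypothetical:

```python
import random

# Hypothetical sketch of a procedurally generated arithmetic item.
# The real ArithMark format is not specified here; scaling difficulty by
# operand magnitude is an assumption for illustration only.
def make_arith_item(level: int, rng: random.Random) -> tuple[str, str]:
    lo, hi = 10 ** (level - 1), 10 ** level - 1    # larger operands at higher levels
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    op = rng.choice(["+", "-", "*"])
    question = f"{a} {op} {b} ="
    answer = str(eval(f"{a} {op} {b}"))
    return question, answer

rng = random.Random(0)
for level in (1, 2, 3):
    q, a = make_arith_item(level, rng)
    print(f"level {level}: {q} {a}")
```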
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-X2-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
What's New in v2
| Change | GPT-X (v1) | GPT-X2 (v2) | Why |
|---|---|---|---|
| Tokenizer | GPT-2 BPE (50K) | Custom 32K trained on FineWeb-Edu | ~9% better compression, frees params for layers |
| Depth | 27 layers | 30 layers | Saved embedding params reinvested into 3 extra layers |
| rope_theta | 10,000 | 100,000 | Better long-range attention (SmolLM2-proven) |
| Learning rate | 6e-4 | 1.5e-3 | SmolLM used 3e-3 at this scale; v1 was too conservative |
| LR decay | Cosine to 1/10th peak | WSD decay to 0 | SmolLM-style; tighter convergence during cooldown |
| Warmup | 1,000 steps | 2,000 steps | Higher LR needs longer warmup for stability |
| Data | 15B tokens, FineWeb-Edu only | 75B tokens, 4-source curriculum | 5x more data with progressive multi-source mixing |
| Training | Messy resume | Clean from-scratch | Controlled curriculum with planned distribution shifts |
Architecture
| Component | Details |
|---|---|
| Position encoding | RoPE (theta=100,000) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention -- 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss on logit magnitudes (coefficient 1e-4 for the first 31B tokens, then 0) |
| Depth | 30 layers x 576 hidden |
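A minimal PyTorch sketch of the attention path implied by this table (bias-free GQA with 9 query / 3 KV heads, per-head QK-Norm applied before RoPE). This is an illustrative reconstruction from the table, not the released model code, and the RoPE helper is deliberately simplified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N_Q, N_KV, HD = 576, 9, 3, 64        # hidden size, query heads, kv heads, head dim
THETA = 100_000.0                        # rope_theta

def rope(x: torch.Tensor, theta: float = THETA) -> torch.Tensor:
    """Minimal rotary position embedding; x is (batch, heads, seq, head_dim)."""
    b, h, t, d = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class GQAAttention(nn.Module):
    """Bias-free grouped-query attention (9Q/3KV) with per-head QK-Norm before RoPE."""
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(D, N_Q * HD, bias=False)
        self.wk = nn.Linear(D, N_KV * HD, bias=False)
        self.wv = nn.Linear(D, N_KV * HD, bias=False)
        self.wo = nn.Linear(N_Q * HD, D, bias=False)
        self.q_norm = nn.RMSNorm(HD)     # QK-Norm over head_dim (PyTorch >= 2.4)
        self.k_norm = nn.RMSNorm(HD)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, N_Q, HD).transpose(1, 2)
        k = self.wk(x).view(b, t, N_KV, HD).transpose(1, 2)
        v = self.wv(x).view(b, t, N_KV, HD).transpose(1, 2)
        q, k = rope(self.q_norm(q)), rope(self.k_norm(k))
        k = k.repeat_interleave(N_Q // N_KV, dim=1)      # expand 3 KV heads to 9
        v = v.repeat_interleave(N_Q // N_KV, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(b, t, N_Q * HD))

x = torch.randn(1, 16, D)
print(GQAAttention()(x).shape)           # torch.Size([1, 16, 576])
```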
Config
```text
vocab_size   = 32,768   (custom BPE trained on FineWeb-Edu)
n_layer      = 30
n_head       = 9        (query heads)
n_kv_heads   = 3        (key-value heads, 3:1 GQA)
n_embd       = 576
head_dim     = 64
intermediate = 1,536    (SwiGLU, 2.67x ratio)
block_size   = 1,024
rope_theta   = 100,000
total params = 125,081,664
```
Parameter Breakdown
| Component | Params |
|---|---|
| Token embeddings (32768 x 576) | 18,874,368 |
| Per block (x30): attention + SwiGLU + norms | 3,540,224 |
| 30 transformer blocks | 106,206,720 |
| Final RMSNorm | 576 |
| LM head (tied with embeddings) | 0 |
| Total | 125,081,664 |
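The breakdown can be reproduced from the config values above. The per-block decomposition below (GQA projections + SwiGLU MLP + two block RMSNorms + shared QK-Norm weights) is inferred from the architecture table rather than stated explicitly, but it matches the published totals exactly:

```python
# Reproduce the parameter breakdown from the config values above.
vocab, d, n_layer = 32_768, 576, 30
n_q, n_kv, head_dim, ffn = 9, 3, 64, 1_536

embeddings = vocab * d                                                     # 18,874,368 (tied LM head)
attn = d * n_q * head_dim + 2 * d * n_kv * head_dim + n_q * head_dim * d  # Wq, Wk, Wv, Wo
swiglu = 3 * d * ffn                                                       # gate, up, down projections
norms = 2 * d + 2 * head_dim                                               # two block RMSNorms + QK-Norm (q, k)
per_block = attn + swiglu + norms                                          # 3,540,224
total = embeddings + n_layer * per_block + d                               # + final RMSNorm
print(per_block, total)                                                    # 3540224 125081664
```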
Training
Data
The curriculum gradually introduces specialized data (math/code) after the model learns core language, improving reasoning without harming fluency. Four data sources mixed via a progressive curriculum:
| Source | Dataset | Purpose |
|---|---|---|
| FineWeb-Edu | HuggingFaceFW/fineweb-edu (sample-100BT) | Primary educational web text |
| DCLM | mlfoundations/dclm-baseline-1.0 | High-quality diverse web text |
| FineMath | HuggingFaceTB/finemath (finemath-4plus) | Mathematical reasoning (score >= 4) |
| NPset-Python | AxiomicLabs/NPset-python | AST-normalized Python code |
- Tokens: 75B (143,051 steps x 524,288 tokens/step)
- Tokenizer: Custom 32K BPE trained on 50GB of FineWeb-Edu
- Final val loss: 2.7525 (FineWeb-Edu held-out)
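A sketch of how a 32K byte-level BPE like this one could be trained with the Hugging Face tokenizers library; the file paths and special tokens below are illustrative assumptions, not the exact GPT-X2 recipe:

```python
from tokenizers import ByteLevelBPETokenizer

# Sketch only: the input shard name and special-token list are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["fineweb_edu_shard_000.txt"],   # ~50GB of FineWeb-Edu text in practice
    vocab_size=32_768,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("gpt-x2-tokenizer")
```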
Progressive Data Curriculum
Rather than a fixed mixture, data sources are introduced progressively to let the model build foundational language capabilities before adding specialized data:
| Phase | Token Range | FineWeb-Edu | DCLM | FineMath | Code |
|---|---|---|---|---|---|
| Early | 0 -- 18B | 58% | 40% | 1% | 1% |
| Ramp | 18B -- 20B | 58% -> 54% | 40% -> 36% | 1% -> 6% | 1% -> 4% |
| Hold | 20B -- 45B | 54% | 36% | 6% | 4% |
| Taper | 45B -- 58B | 54% -> 55% | 36% -> 38% | 6% -> 4.5% | 4% -> 2.5% |
| Hold | 58B -- 75B | 55% | 38% | 4.5% | 2.5% |
FineMath and code are drip-fed at 1% from the start so the model sees early examples, then ramped to peak during the stable LR phase for maximum learning, and gradually tapered back toward primary sources during the LR decay phase.
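One way to implement such a schedule is to linearly interpolate the source weights between phase boundaries based on tokens seen so far. The phase table above is the source of truth; the helper below is an illustrative sketch, not the training code:

```python
import bisect

# (token budget in billions, [FineWeb-Edu, DCLM, FineMath, Code]) at each breakpoint.
# Weights between breakpoints are linearly interpolated; holds repeat the same mix.
BREAKPOINTS = [
    (0,  [0.58, 0.40, 0.010, 0.010]),
    (18, [0.58, 0.40, 0.010, 0.010]),
    (20, [0.54, 0.36, 0.060, 0.040]),
    (45, [0.54, 0.36, 0.060, 0.040]),
    (58, [0.55, 0.38, 0.045, 0.025]),
    (75, [0.55, 0.38, 0.045, 0.025]),
]

def mixture(tokens_seen_b: float) -> list[float]:
    """Return the sampling weights for the current point in training."""
    xs = [b for b, _ in BREAKPOINTS]
    i = min(bisect.bisect_right(xs, tokens_seen_b), len(xs) - 1)
    (x0, w0), (x1, w1) = BREAKPOINTS[i - 1], BREAKPOINTS[i]
    t = 0.0 if x1 == x0 else (tokens_seen_b - x0) / (x1 - x0)
    return [a + t * (b - a) for a, b in zip(w0, w1)]

print(mixture(19))   # halfway through the 18B -> 20B ramp
```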
AST-Based Code Normalization
Raw Python code is parsed into an AST and converted to a compact pseudocode representation (TinyDSL) before tokenization. This yields ~1.25x token compression over raw code, letting the model learn programming reasoning without dedicating tokenizer vocabulary to code-specific symbols and with less syntactic overhead to absorb.
The normalizer:
- Strips comments, docstrings, whitespace, and syntactic noise
- Replaces Python builtins with full English words (`len` -> `length`, `str` -> `string`, etc.)
- Uses natural-language keywords with spaces (`end function`, `for else`, `list comprehension`, etc.)
- Expands comprehensions, decorators, exception handling, and all Python AST node types
Example:
```python
# Raw Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Normalized TinyDSL
# function fibonacci n
# begin
# if n <= 1
# begin
# return n
# end
# return call fibonacci n - 1 + call fibonacci n - 2
# end function
```
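A minimal sketch of the AST-based approach using Python's ast module. The real TinyDSL normalizer covers every Python node type (plus builtin renaming and comprehension expansion); this fragment only handles the function/if/return subset needed to reproduce the fibonacci example above:

```python
import ast

# Tiny fragment of an AST -> pseudocode normalizer, for illustration only.
def expr(node) -> str:
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "+", ast.Sub: "-"}[type(node.op)]
        return f"{expr(node.left)} {op} {expr(node.right)}"
    if isinstance(node, ast.Compare):
        op = {ast.LtE: "<="}[type(node.ops[0])]
        return f"{expr(node.left)} {op} {expr(node.comparators[0])}"
    if isinstance(node, ast.Call):
        args = " ".join(expr(a) for a in node.args)
        return f"call {expr(node.func)} {args}"
    raise NotImplementedError(type(node))

def stmt(node, out):
    if isinstance(node, ast.FunctionDef):
        out.append(f"function {node.name} " + " ".join(a.arg for a in node.args.args))
        out.append("begin")
        for s in node.body:
            stmt(s, out)
        out.append("end function")
    elif isinstance(node, ast.If):
        out.append(f"if {expr(node.test)}")
        out.append("begin")
        for s in node.body:
            stmt(s, out)
        out.append("end")
    elif isinstance(node, ast.Return):
        out.append(f"return {expr(node.value)}")

source = (
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n - 1) + fibonacci(n - 2)\n"
)
lines: list[str] = []
for node in ast.parse(source).body:
    stmt(node, lines)
print("\n".join(lines))   # matches the TinyDSL example above
```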
Optimization
- Optimizer: AdamW (betas=0.9/0.95; weight_decay 0.1 for the first 31B tokens, then 0.01)
- Learning rate: 1.5e-3 max, decays to 0
- Schedule: WSD -- 2,000 step warmup, stable phase (80%), linear decay to 0 over final 20%
- Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, grad_accum=64)
- Precision: bfloat16 mixed precision
- Gradient clipping: 1.0
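The WSD schedule above can be expressed as a simple step-to-LR function. This is a sketch assembled from the numbers listed here, not the exact training code:

```python
MAX_LR, WARMUP, TOTAL_STEPS = 1.5e-3, 2_000, 143_051
DECAY_START = int(TOTAL_STEPS * 0.80)       # stable phase covers ~80% of training

def wsd_lr(step: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat peak, linear decay to 0."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP                       # linear warmup
    if step < DECAY_START:
        return MAX_LR                                       # stable phase at peak LR
    return MAX_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - DECAY_START)  # decay to 0

print(wsd_lr(1_000), wsd_lr(50_000), wsd_lr(TOTAL_STEPS))   # 0.00075 0.0015 0.0
```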
Hardware
- 1x RTX 3080 Ti
- Training time: ~500 hours
Design Decisions
- 30 layers x 576 hidden -- 3 more layers than v1, made possible by smaller 32K vocab. Depth is the primary driver of quality at 125M scale (SmolLM, MobileLLM).
- Custom 32K tokenizer -- Trained on FineWeb-Edu for ~9% better compression than GPT-2 BPE. Fewer vocab entries = smaller embedding table = more params for transformer layers.
- rope_theta=100K -- Matches SmolLM2-135M. Better extrapolation and long-range dependency modeling.
- 1.5e-3 learning rate -- GPT-X v1 used 6e-4, which was too conservative. SmolLM showed 3e-3 works at the 135M scale with a ~1M-token batch, so 1.5e-3 was used here with a 524K-token batch.
- GQA 3:1 -- 9 query heads, 3 KV heads. Saves attention parameters reinvested into SwiGLU capacity.
- QK-Norm -- Critical for 30-layer stability. RMSNorm on Q and K before RoPE.
- z-loss -- Prevents logit-magnitude drift (as in PaLM and T5); applied only during early training (see the sketch after this list).
- Progressive curriculum -- FineMath and NPset-Python introduced at 1% early, ramped to peak during stable LR, tapered during decay. Lets the model build language foundations first, then learn specialized reasoning and logic patterns.
- AST code normalization (NPset-Python) -- 1.25x token compression via Python AST to TinyDSL conversion. Strips syntactic noise and normalizes identifiers so the model learns programming structure rather than memorizing variable names.
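A minimal sketch of the z-loss term, following the PaLM-style formulation that penalizes the squared log-partition function; the 1e-4 coefficient comes from the architecture table, while the annealing to 0 after ~31B tokens would be handled outside this function:

```python
import torch
import torch.nn.functional as F

def loss_with_z(logits: torch.Tensor, targets: torch.Tensor, z_coef: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus z-loss, which keeps log(sum(exp(logits))) near zero."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    z = torch.logsumexp(logits, dim=-1)      # log partition function per position
    return ce + z_coef * (z ** 2).mean()

logits = torch.randn(2, 8, 32_768)           # (batch, seq, vocab)
targets = torch.randint(0, 32_768, (2, 8))
print(loss_with_z(logits, targets))
```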
Limitations
- Hardware: Training took roughly 500 hours for 75B tokens on a single RTX 3080 Ti; as a result, ablations were not feasible, and VRAM capped the context window at 1,024 tokens
- Small model: 125M parameters limits reasoning and factual recall
- Educational data only: Primarily trained on educational datasets; not representative of general web text
- Not instruction-tuned: Base model only, not aligned for chat
- English only
- 1024 context window
Citation
```bibtex
@misc{gptx2_2025,
  title={GPT-X2: Data-Efficient Language Modeling at 125M Scale},
  author={Axiomic Labs},
  year={2025},
  howpublished={\url{https://huggingface.co/AxiomicLabs/GPT-X2-125M}},
  note={Trained on 75B tokens with a progressive curriculum and custom tokenizer}
}
```