AxiomicLabs/GPT-X2-125M


GPT-X2-125M is the second-generation GPT-X model: 125M parameters, 75B training tokens, a custom 32K tokenizer, and 30 layers. It was trained from scratch using a 4-source progressive curriculum with AST-normalized code, and achieves near state-of-the-art performance on both natural language and structured reasoning benchmarks at the 125M scale on a fraction of the training data.

Results

Average Score vs Training Compute

GPT-X2-125M achieves competitive performance with leading models despite using significantly less data. Most notably, it matches SmolLM2-135M within ~1 point on aggregate while using ~27x fewer tokens and having ~8% fewer parameters.

Evaluated with an internal harness modeled on EleutherAI/lm-eval-harness; all benchmarks are zero-shot.

| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 48.78% | 48.46% | 33.26% | 47.64% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 49.12% | 49.01% | 34.78% | 46.72% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 43.89% | 50.43% | 32.34% | 46.62% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 45.04% | 50.28% | 46.94% | 46.07% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 43.83% | 50.83% | 30.52% | 44.39% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 Medium (355M) | 39.40% | 34.80% | 66.30% | 44.90% | 50.40% | 34.80% | 43.94% | ~10B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | 44.52% | 48.54% | 32.80% | 42.01% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | 40.87% | 49.41% | 28.06% | 39.45% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 43.81% | 49.96% | 27.48% | 41.03% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 45.40% | 49.09% | 29.98% | 41.37% | 300B |

LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.


Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-X2-125M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```

What's New in v2

| Change | GPT-X (v1) | GPT-X2 (v2) | Why |
|---|---|---|---|
| Tokenizer | GPT-2 BPE (50K) | Custom 32K trained on FineWeb-Edu | ~9% better compression, frees params for layers |
| Depth | 27 layers | 30 layers | Saved embedding params reinvested into 3 extra layers |
| rope_theta | 10,000 | 100,000 | Better long-range attention (SmolLM2-proven) |
| Learning rate | 6e-4 | 1.5e-3 | SmolLM used 3e-3 at this scale; v1 was too conservative |
| LR decay | Cosine to 1/10th peak | WSD decay to 0 | SmolLM-style; tighter convergence during cooldown |
| Warmup | 1,000 steps | 2,000 steps | Higher LR needs longer warmup for stability |
| Data | 15B tokens, FineWeb-Edu only | 75B tokens, 4-source curriculum | 5x more data with progressive multi-source mixing |
| Training | Messy resume | Clean from-scratch | Controlled curriculum with planned distribution shifts |

Architecture

| Component | Details |
|---|---|
| Position encoding | RoPE (theta=100,000) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention: 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss on logit magnitudes (coeff 1e-4 for the first 31B tokens, then 0) |
| Depth | 30 layers x 576 hidden |
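
The z-loss row can be made concrete. Below is a minimal sketch of the PaLM-style auxiliary term the card describes; the exact reduction (mean over tokens) is an assumption, only the 1e-4 coefficient comes from the card:

```python
import torch

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # PaLM-style auxiliary loss: penalize the squared log-partition
    # (logsumexp) per token so logit magnitudes don't drift upward.
    # coeff=1e-4 for the first 31B tokens, then 0 (per the table above).
    log_z = torch.logsumexp(logits.float(), dim=-1)  # per-token log Z
    return coeff * (log_z ** 2).mean()
```

In practice this term is simply added to the cross-entropy loss during the early phase of training.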

Config

```text
vocab_size     = 32,768    (custom BPE trained on FineWeb-Edu)
n_layer        = 30
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 100,000
total params   = 125,081,664
```

Parameter Breakdown

| Component | Params |
|---|---|
| Token embeddings (32768 x 576) | 18,874,368 |
| Per block (x30): attention + SwiGLU + norms | 3,540,224 |
| 30 transformer blocks | 106,206,720 |
| Final RMSNorm | 576 |
| LM head (tied with embeddings) | 0 |
| Total | 125,081,664 |
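
The totals above can be re-derived from the config. The sketch below reconstructs the per-block split; the two QK-Norm gain vectors (64 params each for Q and K) are an inference from the architecture table, needed to make the per-block count match exactly:

```python
# Re-deriving the parameter breakdown from the published config.
vocab, d, n_layer = 32_768, 576, 30
n_head, n_kv, head_dim, ffn = 9, 3, 64, 1_536

emb = vocab * d                          # token embeddings: 18,874,368
attn = d * (n_head * head_dim)           # Q projection
attn += 2 * d * (n_kv * head_dim)        # K and V projections (GQA)
attn += (n_head * head_dim) * d          # output projection
qk_norm = 2 * head_dim                   # RMSNorm gains on Q and K (assumed)
swiglu = 3 * d * ffn                     # gate, up, and down matrices
norms = 2 * d                            # two pre-norms per block
block = attn + qk_norm + swiglu + norms  # 3,540,224 per block
total = emb + n_layer * block + d        # +576 final norm; tied head adds 0
print(total)                             # 125,081,664
```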

Training

Data

The curriculum gradually introduces specialized data (math/code) after the model learns core language, improving reasoning without harming fluency. Four data sources are mixed via a progressive curriculum:

| Source | Dataset | Purpose |
|---|---|---|
| FineWeb-Edu | HuggingFaceFW/fineweb-edu (sample-100BT) | Primary educational web text |
| DCLM | mlfoundations/dclm-baseline-1.0 | High-quality diverse web text |
| FineMath | HuggingFaceTB/finemath (finemath-4plus) | Mathematical reasoning (score >= 4) |
| NPset-Python | AxiomicLabs/NPset-python | AST-normalized Python code |
  • Tokens: 75B (143,051 steps x 524,288 tokens/step)
  • Tokenizer: Custom 32K BPE trained on 50GB of FineWeb-Edu
  • Final val loss: 2.7525 (FineWeb-Edu held-out)

Progressive Data Curriculum

Rather than a fixed mixture, data sources are introduced progressively to let the model build foundational language capabilities before adding specialized data:

| Phase | Token Range | FineWeb-Edu | DCLM | FineMath | Code |
|---|---|---|---|---|---|
| Early | 0 -- 18B | 58% | 40% | 1% | 1% |
| Ramp | 18B -- 20B | 58% -> 54% | 40% -> 36% | 1% -> 6% | 1% -> 4% |
| Hold | 20B -- 45B | 54% | 36% | 6% | 4% |
| Taper | 45B -- 58B | 54% -> 55% | 36% -> 38% | 6% -> 4.5% | 4% -> 2.5% |
| Hold | 58B -- 75B | 55% | 38% | 4.5% | 2.5% |

FineMath and code are drip-fed at 1% from the start so the model sees early examples, then ramped to peak during the stable LR phase for maximum learning, and gradually tapered back toward primary sources during the LR decay phase.
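
The schedule above can be sketched as an interpolation over phase boundaries. The boundaries and endpoint weights come from the table; linear interpolation inside the ramp and taper phases is an assumption:

```python
# Progressive mixture schedule: each entry is (phase end in billions of
# tokens, source weights at that point). Ramp/taper phases interpolate
# linearly between the previous and current endpoints.
PHASES = [
    (18, dict(fineweb=0.58, dclm=0.40, finemath=0.01, code=0.01)),   # Early
    (20, dict(fineweb=0.54, dclm=0.36, finemath=0.06, code=0.04)),   # Ramp
    (45, dict(fineweb=0.54, dclm=0.36, finemath=0.06, code=0.04)),   # Hold
    (58, dict(fineweb=0.55, dclm=0.38, finemath=0.045, code=0.025)), # Taper
    (75, dict(fineweb=0.55, dclm=0.38, finemath=0.045, code=0.025)), # Hold
]

def mixture(tokens_b: float) -> dict:
    """Source sampling weights at a given training position (in B tokens)."""
    prev_end, prev_w = 0.0, PHASES[0][1]
    for end, w in PHASES:
        if tokens_b <= end:
            frac = (tokens_b - prev_end) / (end - prev_end)
            return {k: prev_w[k] + frac * (w[k] - prev_w[k]) for k in w}
        prev_end, prev_w = end, w
    return dict(PHASES[-1][1])
```

For example, at 19B tokens (halfway through the ramp) FineMath sits at 3.5%, halfway between its 1% and 6% endpoints.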

AST-Based Code Normalization

Raw Python code is parsed into an AST and converted to a compact pseudocode representation (TinyDSL) before tokenization. This achieves ~1.25x token compression compared to raw code, letting the model learn programming reasoning without code-specific tokens and with less syntactic overhead to absorb.

The normalizer:

  • Strips comments, docstrings, whitespace, and syntactic noise
  • Replaces Python builtins with full English words (len -> length, str -> string, etc.)
  • Uses natural language keywords with spaces (end function, for else, list comprehension, etc.)
  • Expands comprehensions, decorators, exception handling, and all Python AST node types

Example:

```python
# Raw Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Normalized TinyDSL
# function fibonacci n
# begin
# if n <= 1
# begin
# return n
# end
# return call fibonacci n - 1 + call fibonacci n - 2
# end function
```
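
A toy version of such a normalizer can be built directly on Python's `ast` module. This sketch handles only the node types in the fibonacci example; the real normalizer covers all AST node types plus the builtin and keyword renamings described above:

```python
import ast

def expr(node):
    # Render an expression node as TinyDSL-style text.
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "+", ast.Sub: "-"}[type(node.op)]
        return f"{expr(node.left)} {op} {expr(node.right)}"
    if isinstance(node, ast.Compare):
        op = {ast.LtE: "<="}[type(node.ops[0])]
        return f"{expr(node.left)} {op} {expr(node.comparators[0])}"
    if isinstance(node, ast.Call):
        args = " ".join(expr(a) for a in node.args)
        return f"call {node.func.id} {args}"
    raise NotImplementedError(type(node))

def stmt(node):
    # Render a statement node as a list of TinyDSL lines.
    if isinstance(node, ast.FunctionDef):
        args = " ".join(a.arg for a in node.args.args)
        body = [line for s in node.body for line in stmt(s)]
        return [f"function {node.name} {args}", "begin", *body, "end function"]
    if isinstance(node, ast.If):
        body = [line for s in node.body for line in stmt(s)]
        return [f"if {expr(node.test)}", "begin", *body, "end"]
    if isinstance(node, ast.Return):
        return [f"return {expr(node.value)}"]
    raise NotImplementedError(type(node))

source = (
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n - 1) + fibonacci(n - 2)\n"
)
for line in stmt(ast.parse(source).body[0]):
    print(line)
```

Running this on the fibonacci source reproduces the normalized form shown above; comments, docstrings, and formatting vanish because the AST never contains them.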

Optimization

  • Optimizer: AdamW (betas=0.9/0.95; weight_decay 0.1 for the first 31B tokens, then 0.01)
  • Learning rate: 1.5e-3 max, decays to 0
  • Schedule: WSD -- 2,000 step warmup, stable phase (80%), linear decay to 0 over final 20%
  • Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, 64 gradient-accumulation steps)
  • Precision: bfloat16 mixed precision
  • Gradient clipping: 1.0
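
The WSD schedule above can be sketched in a few lines. The step counts come from the card; treating warmup as part of the first 80% of steps (so decay begins at exactly 80% of training) is an assumption:

```python
def wsd_lr(step: int, max_lr: float = 1.5e-3,
           total_steps: int = 143_051, warmup: int = 2_000) -> float:
    """Warmup-Stable-Decay learning-rate schedule."""
    decay_start = int(0.8 * total_steps)  # decay over the final 20%
    if step < warmup:                     # linear warmup to peak LR
        return max_lr * step / warmup
    if step < decay_start:                # stable phase at peak LR
        return max_lr
    # linear decay from peak to 0 at the final step
    return max_lr * (total_steps - step) / (total_steps - decay_start)
```

Unlike cosine decay to 1/10th of peak (used in v1), this holds the peak LR for most of training and only cools down at the end, which is where the curriculum taper also happens.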

Hardware

  • 1x RTX 3080 Ti
  • Training time: ~500 hours

Design Decisions

  • 30 layers x 576 hidden -- 3 more layers than v1, made possible by smaller 32K vocab. Depth is the primary driver of quality at 125M scale (SmolLM, MobileLLM).
  • Custom 32K tokenizer -- Trained on FineWeb-Edu for ~9% better compression than GPT-2 BPE. Fewer vocab entries = smaller embedding table = more params for transformer layers.
  • rope_theta=100K -- Matches SmolLM2-135M. Better extrapolation and long-range dependency modeling.
  • 1.5e-3 learning rate -- GPT-X v1 used 6e-4, which was too conservative. SmolLM showed 3e-3 works at 135M scale with a ~1M-token batch, so 1.5e-3 was used here at the 524,288-token batch size.
  • GQA 3:1 -- 9 query heads, 3 KV heads. Saves attention parameters reinvested into SwiGLU capacity.
  • QK-Norm -- Critical for 30-layer stability. RMSNorm on Q and K before RoPE.
  • z-loss -- Prevents logit-magnitude drift (PaLM, T5); applied early in training, then disabled.
  • Progressive curriculum -- FineMath and NPset-Python introduced at 1% early, ramped to peak during stable LR, tapered during decay. Lets the model build language foundations first, then learn specialized reasoning and logic patterns.
  • AST code normalization (NPset-Python) -- 1.25x token compression via Python AST to TinyDSL conversion. Strips syntactic noise and normalizes identifiers so the model learns programming structure rather than memorizing variable names.

Limitations

  • Hardware: Total training took roughly 500 hours for 75B tokens; as a result, ablations were not feasible, and VRAM capped the context window at 1,024 tokens
  • Small model: 125M parameters limits reasoning and factual recall
  • Educational data only: Primarily trained on educational datasets; not representative of general web text
  • Not instruction-tuned: Base model only, not aligned for chat
  • English only
  • 1024 context window

Citation

@misc{gptx2_2025,
  title={GPT-X2: Data-Efficient Language Modeling at 125M Scale},
  author={Axiomic Labs},
  year={2025},
  howpublished={\url{https://huggingface.co/AxiomicLabs/GPT-X2-125M}},
  note={Trained on 75B tokens with a progressive curriculum and custom tokenizer}
}