dtestnyrr/talkie-1930-13b-base-gptq-int4

talkie-1930-13b-base — GPTQ int4

A 4-bit GPTQ quantization of talkie-lm/talkie-1930-13b-base — the 13B "vintage" language model trained on 260B tokens of pre-1931 English-language text, by Alec Radford, Nick Levine, and David Duvenaud.

This quantization shrinks the model from 24.7 GB (bf16) to ~7.4 GB (int4), allowing it to fit comfortably on a single 16 GB consumer GPU (RTX 4080/5080-class) with room for activations and KV cache.
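
Those sizes are easy to sanity-check from the architecture numbers below (a sketch; it assumes the embedding and lm_head matrices stay in bf16, which is the usual GPTQ behavior, so the shard total lands slightly above the headline figure):

GiB = 1024**3
hidden, inter, layers, vocab = 5120, 13696, 40, 65536

attn = 4 * hidden * hidden              # q, k, v, o projections
mlp = 3 * hidden * inter                # SwiGLU gate, up, down
linear_params = layers * (attn + mlp)   # the weights GPTQ quantizes
embed_params = 2 * vocab * hidden       # input embedding + lm_head

print((linear_params + embed_params) * 2 / GiB)             # ~24.7 GiB in bf16
print((linear_params * 4.29 / 8 + embed_params * 2) / GiB)  # ~7.5 GiB at ~4.29 BPW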

Use it

from gptqmodel import GPTQModel
import talkie_hf.talkie_qmodel  # registers TalkieQModel in GPTQModel's MODEL_MAP

# trust_remote_code pulls in the custom TalkieForCausalLM implementation
model = GPTQModel.load("dtestnyrr/talkie-1930-13b-base-gptq-int4", trust_remote_code=True)

ids = model.tokenizer("If scientists discover life on other planets,", return_tensors="pt").input_ids.cuda()
out = model.generate(input_ids=ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

You need:

  • gptqmodel >= 6.0.3
  • transformers >= 5.4
  • torch >= 2.8 with CUDA 12.x (Blackwell/sm_120 supported via Triton kernels)

Quantization details

  • Method: GPTQ
  • Bits: 4
  • Group size: 128
  • Activation order: False
  • Symmetric: True
  • Effective bits per weight: ~4.29 BPW
  • Calibration corpus: 256 × 2048-token windows from 8 pre-1931 Project Gutenberg classics
  • Calibration tokens: ~524,288
  • Quantization framework: GPTQModel v6.0.3

The calibration corpus was deliberately curated from public-domain works published before 1931 (Tolstoy, Austen, Dickens, Doyle, Twain, Melville, Wilde, Shelley) to match the temporal distribution of the talkie pretraining data. Calibrating on modern text would have introduced a systematic mismatch between the activation statistics the quantizer optimizes for and the distribution the model was actually trained on.
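
For reference, the quantization run looked roughly like this (a hedged sketch, not the exact script; the local paths are hypothetical, and GPTQModel also accepts other calibration-data formats):

import random
from gptqmodel import GPTQModel, QuantizeConfig
import talkie_hf.talkie_qmodel  # registers the custom architecture, as above

config = QuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one scale per 128-weight group
    desc_act=False,  # activation order off
    sym=True,        # symmetric quantization
)

# "talkie-1930-13b-base-hf" is the local bf16 conversion described below.
model = GPTQModel.load("talkie-1930-13b-base-hf", config, trust_remote_code=True)

# 256 windows x 2048 tokens each = ~524,288 calibration tokens.
ids = model.tokenizer(open("gutenberg_pre1931.txt").read()).input_ids
starts = random.sample(range(len(ids) - 2048), 256)
calib = [{"input_ids": ids[s : s + 2048]} for s in starts]

model.quantize(calib, batch_size=1)
model.save("talkie-1930-13b-base-gptq-int4")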

Hardware

  • VRAM: ~7.6 GB during inference (peak, with short context). Fits a 16 GB GPU comfortably.
  • CPU fallback: Supported via GPTQModel's CPU kernel, but very slow.
  • Tested on: NVIDIA RTX 5080 (Blackwell, sm_120) under WSL2 / Ubuntu 24.04 / CUDA 12.8 / PyTorch 2.11.

Inference throughput on RTX 5080: ~10–14 tokens/sec greedy decode (no KV cache yet — see "Limitations" below).
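
One way to reproduce that measurement (a sketch, reusing the model from "Use it" above; prompt and token count are arbitrary):

import time, torch

ids = model.tokenizer("The aeroplane is", return_tensors="pt").input_ids.cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(input_ids=ids, max_new_tokens=128, do_sample=False)  # greedy decode
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"{(out.shape[1] - ids.shape[1]) / elapsed:.1f} tok/s")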

How this was made

The talkie reference repo ships the model as a single 53 GB final.ckpt pickle of a custom TalkieModel (decoder-only GPT with RoPE θ=10⁶, SwiGLU MLP, RMSNorm, QK-norm, learnable per-head and per-residual gain modules, and an embedding-skip connection at every layer). It is not a HuggingFace transformers model.

To make the GPTQ tooling work, I ported the architecture into a HuggingFace PreTrainedModel (TalkieForCausalLM), converted the weights into sharded safetensors (bf16, 24.7 GB), and built an HF tokenizer that matches the original tiktoken BPE bit-perfectly. Bit-identity tests confirmed:

  • Every weight tensor in the converted safetensors is bit-identical to the original .ckpt after fp32→bf16 cast (443/443 tensors).
  • The HF wrapper produces identical logits to the reference talkie code on random weights (max abs diff = 0).
  • The HF tokenizer encodes test strings to exactly the same token IDs as tiktoken.
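
The tensor-level check is conceptually just a cast-and-compare. A minimal sketch (shard and key names are illustrative; the real script iterates over every shard and maps HF parameter names back to the original checkpoint's keys before comparing):

import torch
from safetensors.torch import load_file

ref = torch.load("final.ckpt", map_location="cpu", weights_only=False)  # original fp32 pickle
ref = ref.get("state_dict", ref)                  # exact layout depends on talkie's saver

converted = load_file("model-00001-of-00005.safetensors")
for name, tensor in converted.items():
    expected = ref[name].to(torch.bfloat16)       # same fp32 -> bf16 cast as the conversion
    assert torch.equal(expected, tensor), f"bit mismatch in {name}"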

Quantization then ran through GPTQModel's auto-detection, which correctly discovered the 7-Linears-per-layer structure across all 40 layers.

The conversion + quantization scripts are at: [link to your GitHub repo or gist]

Architecture (carried over from talkie)

  • 13.28B parameters, 40 layers, hidden=5120, heads=40, head_dim=128, intermediate=13696, vocab=65536
  • RoPE base θ = 1,000,000 (notably larger than the 10⁴–10⁵ typical of Llama-family models)
  • F.rms_norm everywhere (no learnable RMSNorm scale)
  • QK-norm: RMSNorm on Q and K post-RoPE
  • Per-head learnable gain on Q before SDPA (HeadGain)
  • Per-residual learnable gain on attention/MLP output (ActGain)
  • Embedding-skip connection at every layer (post-norm input embedding added back via ActGain(0.0) initial scale)
  • Scalar WeightGain on the lm_head matrix
  • Tokenizer: tiktoken BPE, 65,535 merges + 1 special token (<|endoftext|>)
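
Putting those pieces together, one talkie block looks roughly like this (a sketch reconstructed from the list above; helper and attribute names like apply_rope, head_gain, emb_gain are descriptive guesses, not the reference code):

import torch
import torch.nn as nn
import torch.nn.functional as F

D, H, HD, I = 5120, 40, 128, 13696   # hidden, heads, head_dim, intermediate

def rms_norm(x):
    return F.rms_norm(x, (x.shape[-1],))  # no learnable scale

def apply_rope(x, theta=1_000_000.0):
    # Interleaved-pair rotary embedding with talkie's large base theta.
    b, h, t, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    ang = torch.arange(t, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1).flatten(-2)

class TalkieBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # The 7 Linears GPTQ quantizes per layer: q, k, v, o, gate, up, down.
        self.q, self.k, self.v, self.o = (nn.Linear(D, D, bias=False) for _ in range(4))
        self.gate = nn.Linear(D, I, bias=False)
        self.up = nn.Linear(D, I, bias=False)
        self.down = nn.Linear(I, D, bias=False)
        self.head_gain = nn.Parameter(torch.ones(H, 1, 1))  # HeadGain: per-head gain on Q
        self.attn_gain = nn.Parameter(torch.ones(1))        # ActGain on attention output
        self.mlp_gain = nn.Parameter(torch.ones(1))         # ActGain on MLP output
        self.emb_gain = nn.Parameter(torch.zeros(1))        # embedding skip, init 0.0

    def forward(self, x, emb):
        b, t, _ = x.shape
        h = rms_norm(x)
        split = lambda z: z.view(b, t, H, HD).transpose(1, 2)
        q, k, v = split(self.q(h)), split(self.k(h)), split(self.v(h))
        q, k = rms_norm(apply_rope(q)), rms_norm(apply_rope(k))  # QK-norm, post-RoPE
        a = F.scaled_dot_product_attention(q * self.head_gain, k, v, is_causal=True)
        x = x + self.attn_gain * self.o(a.transpose(1, 2).reshape(b, t, D))
        h = rms_norm(x)
        x = x + self.mlp_gain * self.down(F.silu(self.gate(h)) * self.up(h))
        return x + self.emb_gain * rms_norm(emb)  # embedding-skip at every layer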

Limitations

  • No KV cache in the current wrapper. Each generation step recomputes attention over the full prefix — fine for short prompts, slow for long ones. Adding KV cache support is a future TODO; the sketch after this list shows the caching pattern involved.
  • Pre-1931 worldview. The base model is trained only on pre-1931 English text and has no knowledge of post-1930 events, science, or culture. It does not know about WWII, computers, transistors, antibiotics beyond early-stage research, etc.
  • Quantization quality loss: GPTQ at 4-bit / group=128 typically incurs a 1–3% perplexity penalty vs bf16. I did not run a quantitative perplexity benchmark for this release.
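
The missing caching pattern is small in principle. A generic illustration (not talkie-specific; the dict-based cache layout is a common convention, not the planned implementation):

import torch
import torch.nn.functional as F

def attend(q, k, v, cache=None):
    # q, k, v: (batch, heads, new_tokens, head_dim)
    if cache is not None:
        k = torch.cat([cache["k"], k], dim=2)  # reuse keys/values from prior steps
        v = torch.cat([cache["v"], v], dim=2)
    # With a cache and a single new query token, every past position is
    # visible, so no causal mask is needed; the initial prompt pass still masks.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=cache is None)
    return out, {"k": k, "v": v}  # carry forward to the next decode step

With that in place, each decode step projects only the new token instead of rerunning the whole prefix through every layer.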

License & attribution

Released under the Apache 2.0 license, matching the upstream model.

Original model credit: talkie-lm/talkie-1930-13b-base, by Alec Radford, Nick Levine, and David Duvenaud.

If you use this quantized model in your work, please cite the original talkie release.
