dtestnyrr/talkie-1930-13b-base-gptq-int4

talkie-1930-13b-base — GPTQ int4

A 4-bit GPTQ quantization of talkie-lm/talkie-1930-13b-base — the 13B "vintage" language model trained on 260B tokens of pre-1931 English-language text, by Alec Radford, Nick Levine, and David Duvenaud.

This quantization shrinks the model from 24.7 GB (bf16) to ~7.4 GB (int4), allowing it to fit comfortably on a single 16 GB consumer GPU (RTX 4080/5080-class) with room for activations and KV cache.
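
Those sizes are easy to sanity-check from the architecture numbers below (a sketch; it assumes the embedding and lm_head matrices stay in bf16, which is the usual GPTQ behavior, so the shard total lands slightly above the headline figure):

GiB = 1024**3
hidden, inter, layers, vocab = 5120, 13696, 40, 65536

attn = 4 * hidden * hidden              # q, k, v, o projections
mlp = 3 * hidden * inter                # SwiGLU gate, up, down
linear_params = layers * (attn + mlp)   # the weights GPTQ quantizes
embed_params = 2 * vocab * hidden       # input embedding + lm_head

print((linear_params + embed_params) * 2 / GiB)             # ~24.7 GiB in bf16
print((linear_params * 4.29 / 8 + embed_params * 2) / GiB)  # ~7.5 GiB at ~4.29 BPW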

Use it

from gptqmodel import GPTQModel
import talkie_hf.talkie_qmodel  # registers TalkieQModel in GPTQModel's MODEL_MAP

# trust_remote_code pulls in the custom TalkieForCausalLM implementation
model = GPTQModel.load("dtestnyrr/talkie-1930-13b-base-gptq-int4", trust_remote_code=True)

ids = model.tokenizer("If scientists discover life on other planets,", return_tensors="pt").input_ids.cuda()
out = model.generate(input_ids=ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

You need:

  • gptqmodel >= 6.0.3
  • transformers >= 5.4
  • torch >= 2.8 with CUDA 12.x (Blackwell/sm_120 supported via Triton kernels)

Quantization details

  • Method: GPTQ
  • Bits: 4
  • Group size: 128
  • Activation order: False
  • Symmetric: True
  • Effective bits per weight: ~4.29 BPW
  • Calibration corpus: 256 × 2048-token windows from 8 pre-1931 Project Gutenberg classics
  • Calibration tokens: ~524,288
  • Quantization framework: GPTQModel v6.0.3

The calibration corpus was deliberately curated from public-domain works published before 1931 (Tolstoy, Austen, Dickens, Doyle, Twain, Melville, Wilde, Shelley) to match the temporal distribution of the talkie pretraining data. Calibrating on modern text would have introduced a systematic mismatch between the activation statistics the quantizer optimizes for and the distribution the model was actually trained on.
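
For reference, the quantization run looked roughly like this (a hedged sketch, not the exact script; the local paths are hypothetical, and GPTQModel also accepts other calibration-data formats):

import random
from gptqmodel import GPTQModel, QuantizeConfig
import talkie_hf.talkie_qmodel  # registers the custom architecture, as above

config = QuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one scale per 128-weight group
    desc_act=False,  # activation order off
    sym=True,        # symmetric quantization
)

# "talkie-1930-13b-base-hf" is the local bf16 conversion described below.
model = GPTQModel.load("talkie-1930-13b-base-hf", config, trust_remote_code=True)

# 256 windows x 2048 tokens each = ~524,288 calibration tokens.
ids = model.tokenizer(open("gutenberg_pre1931.txt").read()).input_ids
starts = random.sample(range(len(ids) - 2048), 256)
calib = [{"input_ids": ids[s : s + 2048]} for s in starts]

model.quantize(calib, batch_size=1)
model.save("talkie-1930-13b-base-gptq-int4")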

Hardware

  • VRAM: ~7.6 GB during inference (peak, with short context). Fits a 16 GB GPU comfortably.
  • CPU fallback: Supported via GPTQModel's CPU kernel, but very slow.
  • Tested on: NVIDIA RTX 5080 (Blackwell, sm_120) under WSL2 / Ubuntu 24.04 / CUDA 12.8 / PyTorch 2.11.

Inference throughput on RTX 5080: ~10–14 tokens/sec greedy decode (no KV cache yet — see "Limitations" below).
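
One way to reproduce that measurement (a sketch, reusing the model from "Use it" above; prompt and token count are arbitrary):

import time, torch

ids = model.tokenizer("The aeroplane is", return_tensors="pt").input_ids.cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(input_ids=ids, max_new_tokens=128, do_sample=False)  # greedy decode
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"{(out.shape[1] - ids.shape[1]) / elapsed:.1f} tok/s")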

How this was made

The talkie reference repo ships the model as a single 53 GB final.ckpt pickle of a custom TalkieModel (decoder-only GPT with RoPE θ=10⁶, SwiGLU MLP, RMSNorm, QK-norm, learnable per-head and per-residual gain modules, and an embedding-skip connection at every layer). It is not a HuggingFace transformers model.

To make the GPTQ tooling work, I ported the architecture into a HuggingFace PreTrainedModel (TalkieForCausalLM), converted the weights into sharded safetensors (bf16, 24.7 GB), and built an HF tokenizer that matches the original tiktoken BPE bit-perfectly. Bit-identity tests confirmed:

  • Every weight tensor in the converted safetensors is bit-identical to the original .ckpt after fp32→bf16 cast (443/443 tensors).
  • The HF wrapper produces identical logits to the reference talkie code on random weights (max abs diff = 0).
  • The HF tokenizer encodes test strings to exactly the same token IDs as tiktoken.
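
The tensor-level check is conceptually just a cast-and-compare. A minimal sketch (shard and key names are illustrative; the real script iterates over every shard and maps HF parameter names back to the original checkpoint's keys before comparing):

import torch
from safetensors.torch import load_file

ref = torch.load("final.ckpt", map_location="cpu", weights_only=False)  # original fp32 pickle
ref = ref.get("state_dict", ref)                  # exact layout depends on talkie's saver

converted = load_file("model-00001-of-00005.safetensors")
for name, tensor in converted.items():
    expected = ref[name].to(torch.bfloat16)       # same fp32 -> bf16 cast as the conversion
    assert torch.equal(expected, tensor), f"bit mismatch in {name}"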

Quantization then ran through GPTQModel's auto-detection, which correctly discovered the 7-Linears-per-layer structure across all 40 layers.

The conversion + quantization scripts are at: [link to your GitHub repo or gist]

Architecture (carried over from talkie)

  • 13.28B parameters, 40 layers, hidden=5120, heads=40, head_dim=128, intermediate=13696, vocab=65536
  • RoPE base θ = 1,000,000 (notably larger than the 10⁴–10⁵ typical of Llama-family models)
  • F.rms_norm everywhere (no learnable RMSNorm scale)
  • QK-norm: RMSNorm on Q and K post-RoPE
  • Per-head learnable gain on Q before SDPA (HeadGain)
  • Per-residual learnable gain on attention/MLP output (ActGain)
  • Embedding-skip connection at every layer (post-norm input embedding added back via ActGain(0.0) initial scale)
  • Scalar WeightGain on the lm_head matrix
  • Tokenizer: tiktoken BPE, 65,535 merges + 1 special token (<|endoftext|>)
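
Putting those pieces together, one talkie block looks roughly like this (a sketch reconstructed from the list above; helper and attribute names like apply_rope, head_gain, emb_gain are descriptive guesses, not the reference code):

import torch
import torch.nn as nn
import torch.nn.functional as F

D, H, HD, I = 5120, 40, 128, 13696   # hidden, heads, head_dim, intermediate

def rms_norm(x):
    return F.rms_norm(x, (x.shape[-1],))  # no learnable scale

def apply_rope(x, theta=1_000_000.0):
    # Interleaved-pair rotary embedding with talkie's large base theta.
    b, h, t, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    ang = torch.arange(t, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1).flatten(-2)

class TalkieBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # The 7 Linears GPTQ quantizes per layer: q, k, v, o, gate, up, down.
        self.q, self.k, self.v, self.o = (nn.Linear(D, D, bias=False) for _ in range(4))
        self.gate = nn.Linear(D, I, bias=False)
        self.up = nn.Linear(D, I, bias=False)
        self.down = nn.Linear(I, D, bias=False)
        self.head_gain = nn.Parameter(torch.ones(H, 1, 1))  # HeadGain: per-head gain on Q
        self.attn_gain = nn.Parameter(torch.ones(1))        # ActGain on attention output
        self.mlp_gain = nn.Parameter(torch.ones(1))         # ActGain on MLP output
        self.emb_gain = nn.Parameter(torch.zeros(1))        # embedding skip, init 0.0

    def forward(self, x, emb):
        b, t, _ = x.shape
        h = rms_norm(x)
        split = lambda z: z.view(b, t, H, HD).transpose(1, 2)
        q, k, v = split(self.q(h)), split(self.k(h)), split(self.v(h))
        q, k = rms_norm(apply_rope(q)), rms_norm(apply_rope(k))  # QK-norm, post-RoPE
        a = F.scaled_dot_product_attention(q * self.head_gain, k, v, is_causal=True)
        x = x + self.attn_gain * self.o(a.transpose(1, 2).reshape(b, t, D))
        h = rms_norm(x)
        x = x + self.mlp_gain * self.down(F.silu(self.gate(h)) * self.up(h))
        return x + self.emb_gain * rms_norm(emb)  # embedding-skip at every layer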

Limitations

  • No KV cache in the current wrapper. Each generation step recomputes attention over the full prefix — fine for short prompts, slow for long ones. Adding KV cache support is a future TODO; the sketch after this list shows the caching pattern involved.
  • Pre-1931 worldview. The base model is trained only on pre-1931 English text and has no knowledge of post-1930 events, science, or culture. It does not know about WWII, computers, transistors, antibiotics beyond early-stage research, etc.
  • Quantization quality loss: GPTQ at 4-bit / group=128 typically incurs a 1–3% perplexity penalty vs bf16. I did not run a quantitative perplexity benchmark for this release.
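
The missing caching pattern is small in principle. A generic illustration (not talkie-specific; the dict-based cache layout is a common convention, not the planned implementation):

import torch
import torch.nn.functional as F

def attend(q, k, v, cache=None):
    # q, k, v: (batch, heads, new_tokens, head_dim)
    if cache is not None:
        k = torch.cat([cache["k"], k], dim=2)  # reuse keys/values from prior steps
        v = torch.cat([cache["v"], v], dim=2)
    # With a cache and a single new query token, every past position is
    # visible, so no causal mask is needed; the initial prompt pass still masks.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=cache is None)
    return out, {"k": k, "v": v}  # carry forward to the next decode step

With that in place, each decode step projects only the new token instead of rerunning the whole prefix through every layer.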

License & attribution

Released under the Apache 2.0 license, matching the upstream model.

Original model credit: talkie-lm/talkie-1930-13b-base, by Alec Radford, Nick Levine, and David Duvenaud.

If you use this quantized model in your work, please cite the original talkie release.
