AuriAetherwiing/G4-26B-A4B-Musica-v1
Gemma-4-26B-A4B Musica v1
RP/storygen/writing/conversational tune of Gemma-4-26B-A4B-it, the third model in the Musica series. A bit of a wild card: I liked the prose and the creativity in scenarios more than the 31B version's, but this model is also somewhat less stable and, to be honest, not as smart. It's still quite decent though, imo.
Both reasoning and non-reasoning modes work, though reasoning seems quite yappy by default; prefilling "Okay, let's see" after <|channel>thought usually makes it a bit more concise.
Instruction following seems a bit inconsistent: sometimes it follows everything perfectly, sometimes it just goes against some constraints in its reasoning; seems like a bit of MoE chaos there. Generally, though, it sticks to the system prompt decently well. Refusals still do not exist. Swipe diversity is quite good.
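The prefill trick works by pre-seeding the model's turn through a raw completion request. A minimal sketch below, assuming a Gemma-style chat template (`<start_of_turn>`/`<end_of_turn>`) and the thought-channel tag exactly as quoted; verify both against your inference backend's actual template before relying on it:

```python
# Sketch: pre-seeding the thought channel to keep reasoning concise.
# The "<|channel>thought" tag is copied verbatim from the note above,
# and the turn markers assume a Gemma-style template; both are
# assumptions to check against your backend's chat template.

def build_prefilled_prompt(user_msg: str) -> str:
    """Build a raw completion prompt whose model turn is pre-seeded."""
    return (
        "<start_of_turn>user\n"
        f"{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
        "<|channel>thought\n"
        "Okay, let's see"  # the prefill suggested above
    )

prompt = build_prefilled_prompt("Write a short scene in a rainy city.")
```

Send the result to a plain completions-style endpoint rather than a chat endpoint, so the server continues from the prefill instead of opening a fresh model turn.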
This training run was sponsored by ArliAI
Training notes
Surprisingly, much less of a pain than 31B, which is wild given that it's a MoE. Used the same Axolotl commit, with the grouped_mm MoE kernel. Scattermoe doesn't seem to be implemented in that commit yet, but I don't think it matters much. The graphs were very, very similar to 31B's, except loss landed a bit higher; I think that's just a result of sparsity. Honestly, it feels like Google just overfitted these models on Gemini logits, lol. It also trained very fast compared to 31B, despite still using only SDPA.
r64/a64 LoRA, lr 1e-5, 1 epoch, constant schedule with warmup. 9 hours on 2x RTX Pro 6000 Blackwell.
- allura-forge/musica-sft-v1-gemma4-pretok - pretokenized dataset.
- CometML Project - training graphs and stats.
- AuriAetherwiing/G4-26B-A4B-Musica-v1-lora - LoRA adapter.
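Since the run uses r = alpha = 64, the LoRA scaling factor alpha/r is exactly 1.0, so the low-rank update is folded into the base weights at full strength when the adapter is merged. A toy numpy sketch of the merge arithmetic (dimensions are made up for illustration):

```python
import numpy as np

# Toy sketch of merging a LoRA adapter into a base weight matrix.
# With r = alpha = 64, as in the run above, the scale alpha/r is 1.0.

def merge_lora(W, A, B, r=64, alpha=64):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 64           # toy dims; real proj layers are far larger
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))            # B starts at zero, so the merge is a no-op at init
merged = merge_lora(W, A, B)
```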
Recommended Samplers
- Temperature: 1
- Min-P: 0.02
- NSigma: 2
Don't use repetition penalties of any kind; they do more harm than good.
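For intuition, here is a minimal numpy sketch of how the two recommended filters prune a toy logit vector, using the common definitions (min-p keeps tokens whose probability is at least min_p times the top probability; top-nsigma keeps logits within n standard deviations of the maximum logit). This is illustrative, not your sampler backend's exact implementation:

```python
import numpy as np

# Sketch: applying top-nsigma then min-p to raw logits at temperature 1.
# Semantics follow the common definitions of both filters; check your
# backend for its exact order of operations.

def filter_logits(logits, min_p=0.02, nsigma=2.0, temperature=1.0):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # top-nsigma: drop logits more than n standard deviations below the max
    keep = logits >= logits.max() - nsigma * logits.std()
    masked = np.where(keep, logits, -np.inf)
    # min-p: drop tokens whose probability is under min_p * max probability
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    keep &= probs >= min_p * probs.max()
    masked = np.where(keep, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return probs

# nsigma removes the -3.0 outlier; min-p then prunes the weak 0.5 token
probs = filter_logits([5.0, 4.9, 0.5, -3.0])
```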
Axolotl config
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-26B-A4B-it
# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
- axolotl.integrations.liger.LigerPlugin # not sure if it works with Gemma 4 but it doesn't crash at least
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin # must have! KV cache is too expensive otherwise
- axolotl.integrations.kernels.KernelsPlugin # required for scattermoe and batched_mm for efficient MoE training
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true
use_kernels: true
use_scattermoe: true
experts_implementation: grouped_mm
# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false
# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
- path: allura-forge/musica-sft-v1-gemma4-pretok # finally, pretokenized datasets
ds_type: parquet
type:
dataset_prepared_path: ./last_run_prepared
val_set_size: 0
# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true
# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192 # ideally 16384 but Gemma 4 31B has too expensive KV cache
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false
# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
lora_target_parameters:
- experts.gate_up_proj
- experts.down_proj
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 0.5
weight_decay: 0.05
# =============================================================================
# PRECISION
# =============================================================================
bf16: auto
# =============================================================================
# ATTENTION
# =============================================================================
sdp_attention: true
#flash_attention: true # Doesn't work on Gemma 4 currently
#flex_attention: true # up to 40% less memory use with compile, but slower than SDPA
#torch_compile: true # speed up, but unreliable and breaks often
#gemma4_hybrid_attn_impl: true
# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-26b-a4b
logging_steps: 1
# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4
gradient_checkpointing: false
gradient_checkpointing_kwargs:
use_reentrant: false
# =============================================================================
# FSDP
# =============================================================================
fsdp_config:
fsdp_version: 2
offload_params: false
cpu_ram_efficient_loading: false
auto_wrap_policy: TRANSFORMER_BASED_WRAP
transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
state_dict_type: FULL_STATE_DICT
sharding_strategy: FULL_SHARD
reshard_after_forward: true
activation_checkpointing: true
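As a quick sanity check on the hyperparameters above: with sample packing, each optimizer step processes micro_batch_size × gradient_accumulation_steps × world_size packed sequences. World size 2 is an assumption based on the "2xRTX Pro 6000" note in the training notes; everything else is straight from the config:

```python
# Effective batch math for the run above. world_size = 2 is an
# assumption from the two-GPU note; the rest mirrors the config.
micro_batch_size = 2
gradient_accumulation_steps = 4
world_size = 2
sequence_len = 8192

sequences_per_step = micro_batch_size * gradient_accumulation_steps * world_size
tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step)  # 16 packed sequences per optimizer step
print(tokens_per_step)     # 131072 tokens per optimizer step
```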