tori29umai/rtdetrv4-x-manga109s
RT-DETRv4-X Manga109-s
An RT-DETRv4 (X size) model that detects three classes on manga pages: panel frames (frame), characters (body) and dialogue (text). It was finetuned for 30 epochs on 15,264 single-page images obtained by splitting Manga109-s two-page spreads.
Overview
RT-DETRv4 (X-size) finetuned on Manga109-s for 3-class object detection on Japanese manga pages:
- 0: body — characters / human figures
- 1: text — dialogue balloons and text regions
- 2: frame — panel borders
Trained on 15,264 single-page images (split from Manga109-s spreads) for 30 epochs with DINOv2 ViT-B/14 feature distillation. Designed for ComfyUI workflows, automated panel processing pipelines, and manga-domain research.
| Base architecture | RT-DETRv4 X-size (HGNetv2-B5 backbone + DFINETransformer decoder) |
| Distillation teacher | DINOv2 ViT-B/14 (Apache 2.0) |
| Training data | Manga109-s — 87 commercially-licensed titles |
| Input resolution | 1280 × 1280 |
| Number of classes | 3 (body / text / frame) |
Examples
bbox color coding: yellow-green = frame (panel) / blue = body (character) / red = text (dialogue)

A finished, inked manga page. All three classes (panel / character / dialogue) are picked up with high precision.

A rough hand-drawn "name" (storyboard / pre-inking sketch). Although the training data contains only finished pages, the model still recognises panels, characters and dialogue regions reasonably well at this rough-draft stage.
Performance
Evaluated on the Manga109-s validation split (1,212 pages, 23,619 boxes).
| Class | mAP | AP50 | AP75 | AR100 |
|---|---|---|---|---|
| body | 76.2% | 96.2% | 85.8% | 84.6% |
| text | 77.0% | 96.9% | 84.7% | 82.5% |
| frame | 96.4% | 98.6% | 97.9% | 98.4% |
| Macro average | 83.2% | 97.2% | 89.5% | 88.5% |
AP50 is at or above 95% for all three classes, so missed detections are rare in practical use. The remaining headroom for body / text lies in localisation accuracy under the stricter IoU threshold (AP75) rather than in recall: the improvement target is tighter boxes, not fewer misses.
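For reference, this style of per-class evaluation can be reproduced with pycocotools, assuming the validation ground truth and the model's detections have been exported to COCO-format JSON (both file names below are hypothetical); a minimal sketch:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("manga109s_val.json")              # hypothetical COCO-format GT export
coco_dt = coco_gt.loadRes("detections_val.json")  # hypothetical predictions
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.params.catIds = [2]  # e.g. restrict to the frame class
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP50, AP75, AR@100, ...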
Files
| File | Description |
|---|---|
| model.onnx | ONNX, opset 17, static 1×3×1280×1280 input |
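The exported names and the static input shape can be confirmed straight from the graph; a quick sketch with onnxruntime:

import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for t in sess.get_inputs():
    print("input: ", t.name, t.shape, t.type)   # images, orig_target_sizes
for t in sess.get_outputs():
    print("output:", t.name, t.shape, t.type)   # labels, boxes, scores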
Inference
The ONNX graph exposes:
- inputs: images (float32, NCHW, normalised to [0, 1]) and orig_target_sizes (int64, [N, 2] = [width, height])
- outputs: labels (int, [N, 300]), boxes (float32, [N, 300, 4], xyxy in original image coordinates) and scores (float32, [N, 300])
Minimum working example with onnxruntime:
import numpy as np
import onnxruntime as ort
from PIL import Image, ImageDraw

CLASS_NAMES = {0: "body", 1: "text", 2: "frame"}
INPUT_SIZE = 1280
CONF_THRESHOLD = 0.5

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

image = Image.open("page.jpg").convert("RGB")
W, H = image.size

# Preprocess: resize to 1280x1280, CHW float32 in [0, 1]
resized = image.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
arr = np.asarray(resized, dtype=np.float32) / 255.0
arr = arr.transpose(2, 0, 1)[None]  # 1x3x1280x1280
orig_size = np.array([[W, H]], dtype=np.int64)

labels, boxes, scores = session.run(
    None, {"images": arr, "orig_target_sizes": orig_size}
)
labels, boxes, scores = labels[0], boxes[0], scores[0]

# Filter by confidence (boxes are already in original image coordinates)
keep = scores >= CONF_THRESHOLD
print(f"Detected {int(keep.sum())} objects")
for cid, (x1, y1, x2, y2), s in zip(labels[keep], boxes[keep], scores[keep]):
    print(f"  {CLASS_NAMES[int(cid)]:5s} conf={s:.3f} "
          f"bbox=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# Visualise
draw = ImageDraw.Draw(image)
colors = {0: "blue", 1: "red", 2: "yellow"}
for cid, (x1, y1, x2, y2) in zip(labels[keep], boxes[keep]):
    draw.rectangle([x1, y1, x2, y2], outline=colors[int(cid)], width=3)
image.save("output.png")
For CPU-only inference, use providers=["CPUExecutionProvider"]. For GPU, install onnxruntime-gpu.
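For the automated panel pipelines mentioned in the overview, the frame boxes can be cut out directly. A minimal sketch continuing from the variables in the example above; run it before the visualisation step (or re-open the page) so the drawn rectangles don't end up in the crops:

# Crop each detected panel (class 2 = frame) to its own file.
FRAME_ID = 2
panel_boxes = [box for cid, box in zip(labels[keep], boxes[keep]) if int(cid) == FRAME_ID]
for i, (x1, y1, x2, y2) in enumerate(panel_boxes):
    panel = image.crop((int(x1), int(y1), int(x2), int(y2)))
    panel.save(f"panel_{i:02d}.png")

Note that detections are not returned in reading order, so sorting panels right-to-left, top-to-bottom is left to the caller.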
Training
| Epochs | 30 (flat 15 + cosine 11 + no-aug 4; see the schedule sketch after this table) |
| Batch size | 16 (single GPU) |
| Optimiser | AdamW — lr=2.5e-4, backbone lr=2.5e-6, weight_decay=1.25e-4 |
| Augmentation | Mosaic / RandomPhotometricDistort / RandomZoomOut / RandomIoUCrop, Mixup (epoch 2–15) |
| Distillation | DINOv2 ViT-B/14 feature distillation, loss_distill weight 20 (adaptive) |
| Train / val split | train: 15,264 images (14,798 single pages + 466 retained spreads) / val: 1,212 images (1,116 single pages + 96 retained spreads); 290,200 / 23,619 bboxes |
| Per-class bbox count (train) | body 109,480 / text 105,139 / frame 75,581 |
| Per-class bbox count (val) | body 8,645 / text 8,433 / frame 6,541 |
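One plausible reading of the epoch schedule, as a sketch: hold the learning rate flat for 15 epochs, cosine-decay it over the next 11, then run the final 4 epochs with augmentations disabled at the decayed rate. The decay floor (final lr = 0) and the behaviour during the no-aug phase are assumptions, not confirmed settings:

import math

BASE_LR, FINAL_LR = 2.5e-4, 0.0   # FINAL_LR = 0 is an assumed floor
FLAT, COSINE, NO_AUG = 15, 11, 4  # flat 15 + cosine 11 + no-aug 4 = 30 epochs

def lr_at(epoch: int) -> float:
    """Learning rate for a 0-indexed epoch under the assumed schedule."""
    if epoch < FLAT:
        return BASE_LR
    t = min(epoch - FLAT, COSINE) / COSINE  # clamp: no-aug epochs stay at the floor
    return FINAL_LR + 0.5 * (BASE_LR - FINAL_LR) * (1.0 + math.cos(math.pi * t))

def augmentations_enabled(epoch: int) -> bool:
    return epoch < FLAT + COSINE  # the last 4 epochs run without augmentation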
Manga109-s spreads were split at the centerline into single pages prior to training, except when an annotated bbox crossed the centerline; in that case the page was retained as a spread (mixed mode). This keeps the inference contract single-page-friendly while preserving cross-spread ground-truth boxes (e.g. panels or characters that span both pages) instead of clipping them away.
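A minimal sketch of that mixed-mode rule, assuming xyxy pixel-coordinate annotations; the actual preprocessing script is not published here, so the function and its details are illustrative:

from PIL import Image

def split_spread(img: Image.Image, bboxes: list[tuple[float, float, float, float]]):
    """Split a spread at the vertical centerline, unless any xyxy bbox
    crosses it, in which case the spread is kept intact."""
    cx = img.width / 2
    if any(x1 < cx < x2 for x1, _, x2, _ in bboxes):
        return [(img, bboxes)]  # a box spans both pages: keep the spread
    left = [(x1, y1, x2, y2) for x1, y1, x2, y2 in bboxes if x2 <= cx]
    right = [(x1 - cx, y1, x2 - cx, y2) for x1, y1, x2, y2 in bboxes if x1 >= cx]
    return [
        (img.crop((0, 0, int(cx), img.height)), left),
        (img.crop((int(cx), 0, img.width, img.height)), right),
    ]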
License & Attribution
Model: Apache License 2.0.
This model was trained on Manga109-s, whose terms of use require the following acknowledgements:
- The dataset itself is not bundled with this release. Obtain Manga109-s through the official channel.
- Using this model to commercially redistribute or sell reproductions / derivatives of Manga109-s manga images is prohibited by the dataset terms.
- The two papers below must be cited when reporting results that depend on Manga109-s.
Citation
@article{multimedia_aizawa_2020,
author={Kiyoharu Aizawa and Azuma Fujimoto and Atsushi Otsubo and Toru Ogawa and Yusuke Matsui and Koki Tsubota and Hikaru Ikuta},
title={Building a Manga Dataset ``Manga109'' with Annotations for Multimedia Applications},
journal={IEEE MultiMedia},
volume={27},
number={2},
pages={8--18},
doi={10.1109/mmul.2020.2987895},
year={2020}
}
@article{mtap_matsui_2017,
author={Yusuke Matsui and Kota Ito and Yuji Aramaki and Azuma Fujimoto and Toru Ogawa and Toshihiko Yamasaki and Kiyoharu Aizawa},
title={Sketch-based Manga Retrieval using Manga109 Dataset},
journal={Multimedia Tools and Applications},
volume={76},
number={20},
pages={21811--21838},
doi={10.1007/s11042-016-4020-z},
year={2017}
}
Acknowledgements
- RT-DETRv4 / D-FINE (Apache 2.0) — base architecture and training code
- DINOv2 (Apache 2.0) — distillation teacher
- HGNetv2 (Apache 2.0) — backbone