tori29umai/rtdetrv4-x-manga109s
RT-DETRv4-X Manga109-s
An RT-DETRv4 (X size) model that detects three classes on manga pages: panel frames (frame), characters (body) and dialogue (text). It was finetuned for 30 epochs on 15,264 single-page images obtained by splitting Manga109-s two-page spreads.
Overview
RT-DETRv4 (X-size) finetuned on Manga109-s for 3-class object detection on Japanese manga pages:
- 0: body — characters / human figures
- 1: text — dialogue balloons and text regions
- 2: frame — panel borders
Trained on 15,264 single-page images (split from Manga109-s spreads) for 30 epochs with DINOv2 ViT-B/14 feature distillation. Designed for ComfyUI workflows, automated panel processing pipelines, and manga-domain research.
| Base architecture | RT-DETRv4 X-size (HGNetv2-B5 backbone + DFINETransformer decoder) |
| Distillation teacher | DINOv2 ViT-B/14 (Apache 2.0) |
| Training data | Manga109-s — 87 commercially-licensed titles |
| Input resolution | 1280 × 1280 |
| Number of classes | 3 (body / text / frame) |
Examples
bbox color coding: yellow-green = frame (panel) / blue = body (character) / red = text (dialogue)

A finished, inked manga page. All three classes (panel / character / dialogue) are picked up with high precision.

A rough hand-drawn "name" (storyboard / pre-inking sketch). Although the training data contains only finished pages, the model still recognises panels, characters and dialogue regions reasonably well at this rough-draft stage.
Performance
Evaluated on the Manga109-s validation split (1,212 pages, 23,619 boxes).
| Class | mAP | AP50 | AP75 | AR100 |
|---|---|---|---|---|
| body | 76.2% | 96.2% | 85.8% | 84.6% |
| text | 77.0% | 96.9% | 84.7% | 82.5% |
| frame | 96.4% | 98.6% | 97.9% | 98.4% |
| Macro average | 83.2% | 97.2% | 89.5% | 88.5% |
AP50 is at or above 95% for all three classes, so missed detections are rare in practical use. The remaining headroom for body / text lies in localisation accuracy under the stricter IoU threshold (AP75) rather than in recall: the improvement target is tighter boxes, not fewer misses.
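For reference, this style of per-class evaluation can be reproduced with pycocotools, assuming the validation ground truth and the model's detections have been exported to COCO-format JSON (both file names below are hypothetical); a minimal sketch:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("manga109s_val.json")              # hypothetical COCO-format GT export
coco_dt = coco_gt.loadRes("detections_val.json")  # hypothetical predictions
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.params.catIds = [2]  # e.g. restrict to the frame class
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP50, AP75, AR@100, ...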
Files
| File | Description |
|---|---|
| model.onnx | ONNX, opset 17, static 1×3×1280×1280 input |
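The exported names and the static input shape can be confirmed straight from the graph; a quick sketch with onnxruntime:

import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for t in sess.get_inputs():
    print("input: ", t.name, t.shape, t.type)   # images, orig_target_sizes
for t in sess.get_outputs():
    print("output:", t.name, t.shape, t.type)   # labels, boxes, scores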
Inference
The ONNX graph exposes:
- inputs: images (float32, NCHW, normalised to [0, 1]) and orig_target_sizes (int64, [N, 2] = [width, height])
- outputs: labels (int, [N, 300]), boxes (float32, [N, 300, 4], xyxy in original image coordinates) and scores (float32, [N, 300])
Minimum working example with onnxruntime:
import numpy as np
import onnxruntime as ort
from PIL import Image, ImageDraw

CLASS_NAMES = {0: "body", 1: "text", 2: "frame"}
INPUT_SIZE = 1280
CONF_THRESHOLD = 0.5

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

image = Image.open("page.jpg").convert("RGB")
W, H = image.size

# Preprocess: resize to 1280x1280, CHW float32 in [0, 1]
resized = image.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
arr = np.asarray(resized, dtype=np.float32) / 255.0
arr = arr.transpose(2, 0, 1)[None]  # 1x3x1280x1280
orig_size = np.array([[W, H]], dtype=np.int64)

labels, boxes, scores = session.run(
    None, {"images": arr, "orig_target_sizes": orig_size}
)
labels, boxes, scores = labels[0], boxes[0], scores[0]

# Filter by confidence (boxes are already in original image coordinates)
keep = scores >= CONF_THRESHOLD
print(f"Detected {int(keep.sum())} objects")
for cid, (x1, y1, x2, y2), s in zip(labels[keep], boxes[keep], scores[keep]):
    print(f"  {CLASS_NAMES[int(cid)]:5s} conf={s:.3f} "
          f"bbox=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# Visualise
draw = ImageDraw.Draw(image)
colors = {0: "blue", 1: "red", 2: "yellow"}
for cid, (x1, y1, x2, y2) in zip(labels[keep], boxes[keep]):
    draw.rectangle([x1, y1, x2, y2], outline=colors[int(cid)], width=3)
image.save("output.png")
For CPU-only inference, use providers=["CPUExecutionProvider"]. For GPU, install onnxruntime-gpu.
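For the automated panel pipelines mentioned in the overview, the frame boxes can be cut out directly. A minimal sketch continuing from the variables in the example above; run it before the visualisation step (or re-open the page) so the drawn rectangles don't end up in the crops:

# Crop each detected panel (class 2 = frame) to its own file.
FRAME_ID = 2
panel_boxes = [box for cid, box in zip(labels[keep], boxes[keep]) if int(cid) == FRAME_ID]
for i, (x1, y1, x2, y2) in enumerate(panel_boxes):
    panel = image.crop((int(x1), int(y1), int(x2), int(y2)))
    panel.save(f"panel_{i:02d}.png")

Note that detections are not returned in reading order, so sorting panels right-to-left, top-to-bottom is left to the caller.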
Training
| Epochs | 30 (flat 15 + cosine 11 + no-aug 4; see the schedule sketch after this table) |
| Batch size | 16 (single GPU) |
| Optimiser | AdamW — lr=2.5e-4, backbone lr=2.5e-6, weight_decay=1.25e-4 |
| Augmentation | Mosaic / RandomPhotometricDistort / RandomZoomOut / RandomIoUCrop, Mixup (epoch 2–15) |
| Distillation | DINOv2 ViT-B/14 feature distillation, loss_distill weight 20 (adaptive) |
| Train / val split | train: 15,264 images (14,798 single pages + 466 retained spreads) / val: 1,212 images (1,116 single pages + 96 retained spreads); 290,200 / 23,619 bboxes |
| Per-class bbox count (train) | body 109,480 / text 105,139 / frame 75,581 |
| Per-class bbox count (val) | body 8,645 / text 8,433 / frame 6,541 |
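One plausible reading of the epoch schedule, as a sketch: hold the learning rate flat for 15 epochs, cosine-decay it over the next 11, then run the final 4 epochs with augmentations disabled at the decayed rate. The decay floor (final lr = 0) and the behaviour during the no-aug phase are assumptions, not confirmed settings:

import math

BASE_LR, FINAL_LR = 2.5e-4, 0.0   # FINAL_LR = 0 is an assumed floor
FLAT, COSINE, NO_AUG = 15, 11, 4  # flat 15 + cosine 11 + no-aug 4 = 30 epochs

def lr_at(epoch: int) -> float:
    """Learning rate for a 0-indexed epoch under the assumed schedule."""
    if epoch < FLAT:
        return BASE_LR
    t = min(epoch - FLAT, COSINE) / COSINE  # clamp: no-aug epochs stay at the floor
    return FINAL_LR + 0.5 * (BASE_LR - FINAL_LR) * (1.0 + math.cos(math.pi * t))

def augmentations_enabled(epoch: int) -> bool:
    return epoch < FLAT + COSINE  # the last 4 epochs run without augmentation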
Manga109-s spreads were split at the centerline into single pages prior to training, except when an annotated bbox crossed the centerline; in that case the page was retained as a spread (mixed mode). This keeps the inference contract single-page-friendly while preserving cross-spread ground-truth boxes (e.g. panels or characters that span both pages) instead of clipping them away.
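A minimal sketch of that mixed-mode rule, assuming xyxy pixel-coordinate annotations; the actual preprocessing script is not published here, so the function and its details are illustrative:

from PIL import Image

def split_spread(img: Image.Image, bboxes: list[tuple[float, float, float, float]]):
    """Split a spread at the vertical centerline, unless any xyxy bbox
    crosses it, in which case the spread is kept intact."""
    cx = img.width / 2
    if any(x1 < cx < x2 for x1, _, x2, _ in bboxes):
        return [(img, bboxes)]  # a box spans both pages: keep the spread
    left = [(x1, y1, x2, y2) for x1, y1, x2, y2 in bboxes if x2 <= cx]
    right = [(x1 - cx, y1, x2 - cx, y2) for x1, y1, x2, y2 in bboxes if x1 >= cx]
    return [
        (img.crop((0, 0, int(cx), img.height)), left),
        (img.crop((int(cx), 0, img.width, img.height)), right),
    ]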
License & Attribution
Model: Apache License 2.0.
This model was trained on Manga109-s, whose terms of use require the following acknowledgements:
- The dataset itself is not bundled with this release. Obtain Manga109-s through the official channel.
- Using this model to commercially redistribute or sell reproductions / derivatives of Manga109-s manga images is prohibited by the dataset terms.
- The two papers below must be cited when reporting results that depend on Manga109-s.
Citation
@article{multimedia_aizawa_2020,
author={Kiyoharu Aizawa and Azuma Fujimoto and Atsushi Otsubo and Toru Ogawa and Yusuke Matsui and Koki Tsubota and Hikaru Ikuta},
title={Building a Manga Dataset ``Manga109'' with Annotations for Multimedia Applications},
journal={IEEE MultiMedia},
volume={27},
number={2},
pages={8--18},
doi={10.1109/mmul.2020.2987895},
year={2020}
}
@article{mtap_matsui_2017,
author={Yusuke Matsui and Kota Ito and Yuji Aramaki and Azuma Fujimoto and Toru Ogawa and Toshihiko Yamasaki and Kiyoharu Aizawa},
title={Sketch-based Manga Retrieval using Manga109 Dataset},
journal={Multimedia Tools and Applications},
volume={76},
number={20},
pages={21811--21838},
doi={10.1007/s11042-016-4020-z},
year={2017}
}
Acknowledgements
- RT-DETRv4 / D-FINE (Apache 2.0) — base architecture and training code
- DINOv2 (Apache 2.0) — distillation teacher
- HGNetv2 (Apache 2.0) — backbone