
tori29umai/rtdetrv4-x-manga109s


RT-DETRv4-X Manga109-s


An RT-DETRv4 (X-size) model that detects three classes on manga pages: panel frames (frame), characters (body), and dialogue text (text). Finetuned for 30 epochs on 15,264 images obtained by splitting Manga109-s spreads into single pages.



Overview

RT-DETRv4 (X-size) finetuned on Manga109-s for 3-class object detection on Japanese manga pages:

  • 0: body — characters / human figures
  • 1: text — dialogue balloons & text regions
  • 2: frame — panel borders

Trained on 15,264 single-page images (split from Manga109-s spreads) for 30 epochs with DINOv2 ViT-B/14 feature distillation. Designed for ComfyUI workflows, automated panel processing pipelines, and manga-domain research.

Base architectureRT-DETRv4 X-size (HGNetv2-B5 backbone + DFINETransformer decoder)
Distillation teacherDINOv2 ViT-B/14 (Apache 2.0)
Training dataManga109-s — 87 commercially-licensed titles
Input resolution1280 × 1280
Number of classes3 (body / text / frame)

Examples

bbox color coding: yellow-green = frame (panel) / blue = body (character) / red = text (dialogue)

Detection on inked manga page

A finished, inked manga page. All three classes (panel / character / dialogue) are picked up with high precision.

Detection on rough hand-drawn name

A rough hand-drawn "name" (storyboard / pre-inking sketch). Although the training data only contains finished manga, the model still recognises panels, characters and dialogue regions reasonably well at the rough-draft stage.

Performance

Evaluated on the Manga109-s validation split (1,212 pages, 23,619 boxes).

Class     mAP      AP50     AP75     AR100
body      76.2%    96.2%    85.8%    84.6%
text      77.0%    96.9%    84.7%    82.5%
frame     96.4%    98.6%    97.9%    98.4%
average   83.2%    97.2%    89.5%    88.5%

AP50 is at or above 95% for all three classes, so missed detections are rare in practical use. The remaining headroom for body / text shows up mainly at the stricter AP75 threshold: the improvement opportunity is tighter box localisation rather than recall.
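AP50 and AP75 count a prediction as correct when its IoU with a ground-truth box exceeds 0.5 and 0.75 respectively, so the gap between them measures how tightly boxes are localised. A minimal IoU computation for xyxy boxes (illustrative only; `iou_xyxy` is not part of this release):

```python
def iou_xyxy(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A box shifted by 25% of its width still overlaps heavily:
print(iou_xyxy((0, 0, 100, 100), (25, 0, 125, 100)))  # 0.6: passes the 0.5 threshold, fails 0.75
```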

Files

File        Description
model.onnx  ONNX, opset 17, static 1×3×1280×1280 input

Inference

The ONNX graph exposes:

  • inputs: images (float32, NCHW, normalised to [0, 1]), orig_target_sizes (int64, [N, 2] = [width, height])
  • outputs: labels (int, [N, 300]), boxes (float32, [N, 300, 4], xyxy in original image coordinates), scores (float32, [N, 300])

Minimum working example with onnxruntime:

import numpy as np
import onnxruntime as ort
from PIL import Image, ImageDraw

CLASS_NAMES = {0: "body", 1: "text", 2: "frame"}
INPUT_SIZE = 1280
CONF_THRESHOLD = 0.5

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

image = Image.open("page.jpg").convert("RGB")
W, H = image.size

# Preprocess: resize to 1280x1280, CHW float32 in [0, 1]
resized = image.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
arr = np.asarray(resized, dtype=np.float32) / 255.0
arr = arr.transpose(2, 0, 1)[None]  # 1x3x1280x1280
orig_size = np.array([[W, H]], dtype=np.int64)

labels, boxes, scores = session.run(
    None, {"images": arr, "orig_target_sizes": orig_size}
)
labels, boxes, scores = labels[0], boxes[0], scores[0]

# Filter by confidence (boxes are already in original image coordinates)
keep = scores >= CONF_THRESHOLD
print(f"Detected {int(keep.sum())} objects")
for cid, (x1, y1, x2, y2), s in zip(labels[keep], boxes[keep], scores[keep]):
    print(f"  {CLASS_NAMES[int(cid)]:5s}  conf={s:.3f}  "
          f"bbox=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# Visualise
draw = ImageDraw.Draw(image)
colors = {0: "blue", 1: "red", 2: "yellow"}
for cid, (x1, y1, x2, y2) in zip(labels[keep], boxes[keep]):
    draw.rectangle([x1, y1, x2, y2], outline=colors[int(cid)], width=3)
image.save("output.png")

For CPU-only inference, use providers=["CPUExecutionProvider"]. For GPU, install onnxruntime-gpu.
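For the panel pipelines mentioned in the overview, the detected frame boxes usually need to be put in manga reading order (right to left, top to bottom). A rough heuristic sketch assuming a standard Japanese layout (`sort_panels_reading_order` is a hypothetical helper, not shipped with the model):

```python
def sort_panels_reading_order(boxes, row_tolerance=0.5):
    """Sort frame boxes (x1, y1, x2, y2) right-to-left, top-to-bottom.

    Boxes are grouped into rows when their vertical centres lie within
    `row_tolerance` of a box height; each row is then read right to left.
    """
    boxes = sorted(boxes, key=lambda b: (b[1] + b[3]) / 2)  # by vertical centre
    rows = []
    for box in boxes:
        cy = (box[1] + box[3]) / 2
        h = box[3] - box[1]
        if rows and abs(cy - rows[-1][0]) <= row_tolerance * h:
            rows[-1][1].append(box)  # same row as the previous box
        else:
            rows.append([cy, [box]])  # start a new row
    ordered = []
    for _, row in rows:
        ordered.extend(sorted(row, key=lambda b: -b[2]))  # rightmost panel first
    return ordered
```

The half-panel-height tolerance is a guess; irregular or diagonal layouts will need a smarter grouping.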

Training

Epochs: 30 (flat 15 + cosine 11 + no-aug 4)
Batch size: 16 (single GPU)
Optimiser: AdamW — lr=2.5e-4, backbone lr=2.5e-6, weight_decay=1.25e-4
Augmentation: Mosaic / RandomPhotometricDistort / RandomZoomOut / RandomIoUCrop, Mixup (epochs 2–15)
Distillation: DINOv2 ViT-B/14 feature distillation, loss_distill weight 20 (adaptive)
Train / val split: train 15,264 images (14,798 single pages + 466 retained spreads) / val 1,212 images (1,116 single pages + 96 retained spreads); 290,200 / 23,619 bboxes
Per-class bbox count (train): body 109,480 / text 105,139 / frame 75,581
Per-class bbox count (val): body 8,645 / text 8,433 / frame 6,541

Manga109-s spreads were split at the centerline into single pages prior to training, except when an annotated bbox crossed the centerline — in that case the page was retained as a spread (mixed mode). This keeps the inference contract single-page-friendly while preserving cross-spread groundtruth boxes (e.g. panels or characters that span both pages) instead of clipping them away.
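The mixed-mode rule above can be sketched as a simple predicate over a spread's annotations (hypothetical helpers; the actual preprocessing script is not included in this release):

```python
def crosses_centerline(bbox, page_width):
    """True if an xyxy bbox spans the vertical centerline of a spread."""
    center = page_width / 2
    x1, _, x2, _ = bbox
    return x1 < center < x2

def should_split(bboxes, page_width):
    """Split a spread into two single pages only if no annotation crosses it."""
    return not any(crosses_centerline(b, page_width) for b in bboxes)

# A panel contained in the right half does not block splitting; one that
# spans both halves keeps the page as a spread (mixed mode):
print(should_split([(900, 50, 1500, 400)], 1600))   # True
print(should_split([(600, 50, 1200, 400)], 1600))   # False
```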

License & Attribution

Model: Apache License 2.0.

This model was trained on Manga109-s, whose terms of use require the following acknowledgements:

  • The dataset itself is not bundled with this release. Obtain Manga109-s through the official channel.
  • Using this model to commercially redistribute or sell reproductions / derivatives of Manga109-s manga images is prohibited by the dataset terms.
  • The two papers below must be cited when reporting results that depend on Manga109-s.

Citation

@article{multimedia_aizawa_2020,
    author={Kiyoharu Aizawa and Azuma Fujimoto and Atsushi Otsubo and Toru Ogawa and Yusuke Matsui and Koki Tsubota and Hikaru Ikuta},
    title={Building a Manga Dataset ``Manga109'' with Annotations for Multimedia Applications},
    journal={IEEE MultiMedia},
    volume={27},
    number={2},
    pages={8--18},
    doi={10.1109/mmul.2020.2987895},
    year={2020}
}

@article{mtap_matsui_2017,
    author={Yusuke Matsui and Kota Ito and Yuji Aramaki and Azuma Fujimoto and Toru Ogawa and Toshihiko Yamasaki and Kiyoharu Aizawa},
    title={Sketch-based Manga Retrieval using Manga109 Dataset},
    journal={Multimedia Tools and Applications},
    volume={76},
    number={20},
    pages={21811--21838},
    doi={10.1007/s11042-016-4020-z},
    year={2017}
}

Acknowledgements

  • RT-DETRv4 / D-FINE (Apache 2.0) — base architecture and training code
  • DINOv2 (Apache 2.0) — distillation teacher model
  • HGNetv2 (Apache 2.0) — backbone
