Real-Time On-Device Diffusion: Practical Acceleration via Fused Low-Bit Kernels

Mar 21, 2026
Xia Ruize
Summary
This paper turns low-bit diffusion acceleration into a real systems result. Building on MoDiff, it introduces a practical kernel implementation and a cache-update fusion strategy that reduces memory traffic during inference, achieving measured runtime gains on hardware rather than only operation-count estimates.
Type: Publication (Compiled Research Manuscript)

Authors: Xia Ruize, Weizhi Gao, Jiapeng Hu, Xiaorui Liu
Compiled from: draft/Real_Time_On_Device_Diffusion__Practical_Acceleration_via_Fused_Low_Bit_Kernels/main.tex

Abstract

Diffusion models have demonstrated remarkable effectiveness as generative models, but their high computational cost, caused by the iterative denoising process and computationally heavy backbone networks, remains a major barrier to on-device and real-time deployment. Modulated Diffusion (MoDiff) provides a post-training quantization framework that supports highly efficient low-bit activation quantization. However, prior work evaluates the acceleration benefits of MoDiff only through operation-count analysis, without validating them on real hardware. In this paper, we present a kernel implementation of MoDiff for practical hardware acceleration. In particular, we propose a cache-update fusion paradigm that eliminates additional I/O operations during inference. Experiments and ablation studies demonstrate that our implementation achieves up to a 1.8× runtime speedup and up to a 42.2% reduction in memory I/O compared to FP32 models.
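To make the core idea concrete, here is a minimal NumPy sketch of the two ingredients the abstract describes: quantizing only the small inter-step activation residual to low bits, and fusing dequantization with the cache update in a single in-place pass rather than materializing the dequantized residual and re-reading the cache. This is an illustrative assumption-laden sketch, not the paper's actual kernel; the function names (`quantize_int8`, `fused_cache_update`) and the symmetric int8 scheme are hypothetical stand-ins for whatever quantizer and fused kernel the implementation uses.

```python
import numpy as np

def quantize_int8(x):
    """Uniform symmetric int8 quantization (illustrative scheme only)."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def fused_cache_update(cache, q_residual, scale):
    """Fused pass: dequantize the low-bit residual and accumulate it into
    the cache in place. A naive pipeline would write the dequantized
    residual to memory, then read both it and the cache back to add them;
    fusing the two steps removes that extra round trip."""
    cache += q_residual.astype(np.float32) * scale
    return cache

# Demo: activations at two adjacent denoising steps differ by a small residual.
rng = np.random.default_rng(0)
x_prev = rng.standard_normal(1024).astype(np.float32)
x_curr = x_prev + 0.05 * rng.standard_normal(1024).astype(np.float32)

cache = x_prev.copy()
q, s = quantize_int8(x_curr - x_prev)      # quantize only the small residual
cache = fused_cache_update(cache, q, s)    # fused dequant + accumulate

print(float(np.max(np.abs(cache - x_curr))))  # reconstruction error stays small
```

Because the residual between adjacent denoising steps has a much smaller dynamic range than the raw activations, the same bit budget yields a finer quantization grid, which is the intuition behind MoDiff's low-bit activation quantization.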

Read the paper