Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation
Authors: Ruize Xia
Published in: IEEE Access
DOI: 10.1109/ACCESS.2026.3686260
Source: ORCID record 0009-0000-0501-0943
Abstract
Sign language is a primary communication channel for millions of Deaf and hard-of-hearing people, yet generating signer video directly from text remains difficult because video diffusion models are expensive to train and evaluate. This article presents Text2Sign, a single-GPU diffusion baseline for text-to-sign language video generation. The model combines a frozen vision-language text encoder with a three-dimensional encoder-decoder backbone and factorized spatial and temporal attention, reducing the cost of full spatio-temporal attention while preserving motion coherence. The results indicate that pretrained text conditioning improves generalization under limited data, although the system remains limited to low-resolution short clips and does not yet include expert linguistic evaluation. The contribution should therefore be read as an efficiency-oriented research baseline rather than a complete sign-language production system.
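The cost argument for factorized attention can be illustrated with simple counting. The sketch below is not taken from the article: the function names and the 16-frame, 8x8 token grid are hypothetical values chosen for illustration. It assumes that full spatio-temporal attention compares every token in the clip with every other token, while the factorized variant runs a spatial pass within each frame followed by a temporal pass at each spatial location.

```python
def full_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs for a spatial pass followed by a temporal pass."""
    spatial_tokens = height * width
    # Spatial pass: tokens attend only to tokens within their own frame.
    spatial = frames * spatial_tokens * spatial_tokens
    # Temporal pass: each spatial location attends across all frames.
    temporal = spatial_tokens * frames * frames
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical token grid (not from the article): 16 frames of 8x8 tokens.
    full = full_attention_pairs(16, 8, 8)
    factorized = factorized_attention_pairs(16, 8, 8)
    print(f"full: {full}, factorized: {factorized}, ratio: {full / factorized:.1f}")
```

Under these assumed sizes the factorized scheme evaluates 81,920 query-key pairs against 1,048,576 for full attention, roughly a 12.8x reduction. The gap widens as the clip grows, since the full cost scales with the square of the total token count.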