Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation

Jan 1, 2026
Ruize Xia
Abstract
This IEEE Access article presents Text2Sign, a single-GPU diffusion baseline for text-to-sign language video generation. The system combines a frozen vision-language text encoder with a 3D encoder-decoder backbone and factorized spatial-temporal attention, balancing generation quality with realistic compute limits.
Type
Publication
IEEE Access

Authors: Ruize Xia
Published in: IEEE Access
DOI: 10.1109/ACCESS.2026.3686260
Source: ORCID record 0009-0000-0501-0943

Abstract

Sign language is a primary communication channel for millions of Deaf and hard-of-hearing people, yet generating signer video directly from text remains difficult because video diffusion models are expensive to train and evaluate. This article presents Text2Sign, a single-GPU diffusion baseline for text-to-sign language video generation. The model combines a frozen vision-language text encoder with a three-dimensional encoder-decoder backbone and factorized spatial and temporal attention, reducing the cost of full spatio-temporal attention while preserving motion coherence. The results indicate that pretrained text conditioning improves generalization under limited data, although the system remains limited to low-resolution short clips and does not yet include expert linguistic evaluation. The contribution should therefore be read as an efficiency-oriented research baseline rather than a complete sign-language production system.
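
To make the cost argument concrete, the following is a minimal NumPy sketch of factorized spatial and temporal attention. It is an illustration under stated assumptions, not the article's implementation: it uses a single head with no learned projections, the function names (`attention`, `factorized_attention`, `score_counts`) are hypothetical, and the clip dimensions in the usage note are chosen only for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x):
    # Single-head self-attention over the second-to-last axis.
    # Assumption: queries, keys, and values are the tokens themselves
    # (no learned projections), to keep the sketch short.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def factorized_attention(video):
    # video has shape (T, H, W, C): frames, height, width, channels.
    T, H, W, C = video.shape
    # Spatial pass: the H*W tokens of each frame attend to each other.
    x = attention(video.reshape(T, H * W, C))      # (T, H*W, C)
    # Temporal pass: the T tokens at each location attend to each other.
    x = attention(x.transpose(1, 0, 2))            # (H*W, T, C)
    return x.transpose(1, 0, 2).reshape(T, H, W, C)


def score_counts(T, H, W):
    # Number of pairwise attention scores computed per layer.
    n = H * W
    full = (T * n) ** 2                 # every token attends to every token
    factorized = T * n * n + n * T * T  # spatial pass + temporal pass
    return full, factorized
```

For a hypothetical clip of 16 frames on an 8 x 8 token grid, `score_counts(16, 8, 8)` returns `(1048576, 81920)`, so the factorized form computes about 12.8 times fewer attention scores than full spatio-temporal attention. The temporal pass still links every location across all frames, which is how this factorization can retain motion coherence.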
