Attention Heatmap Drift in a Contrastively Pretrained Vision–Language Model: A Controlled Matched-Learning-Rate Comparison of Full Fine-Tuning and Low-Rank Adaptation
Authors: Ruize Xia
DOI: 10.20944/preprints202604.0317.v1
Source: ORCID record 0009-0000-0501-0943
Abstract
Fine-tuning contrastive vision–language models such as CLIP is an essential step for downstream task specialization, yet the associated changes in internal attention geometry remain insufficiently characterized. This preprint investigates attention heatmap drift in a contrastively pretrained vision–language model through a controlled matched-learning-rate comparison of full fine-tuning and low-rank adaptation. The study uses a multi-faceted metric suite, including CLS-to-patch attention entropy, effective receptive field, concentration, head diversity, attention rollout, representational analysis, and subset-sensitivity tests. The results provide a structural lens for understanding transformer adaptation and suggest that preserving attention structure is an important mechanism for maintaining foundation-model generalization.
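Two of the metrics named in the abstract, CLS-to-patch attention entropy and attention rollout, can be illustrated with a minimal sketch. The abstract does not give the preprint's exact definitions, so the code below assumes the commonly used ones: Shannon entropy of the CLS query's attention renormalised over patch tokens, and rollout as the layer-wise product of head-averaged attention mixed equally with the identity to account for residual connections. The function names, the convention that token index 0 is CLS, and the 0.5 mixing weight are all assumptions made for illustration, not details taken from the preprint.

```python
import math


def cls_to_patch_entropy(cls_row):
    """Shannon entropy (in nats) of CLS attention over patch tokens.

    cls_row: attention weights from the CLS query to every token.
    Assumption: index 0 is the CLS token itself and is excluded.
    """
    patches = cls_row[1:]
    total = sum(patches)
    probs = [w / total for w in patches]  # renormalise over patches only
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def _matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Attention rollout over a stack of head-averaged attention matrices.

    layers: list of n x n row-stochastic matrices, first layer first.
    Assumption: each layer is mixed as 0.5 * A + 0.5 * I before multiplying.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = _matmul(mixed, rollout)  # later layers multiply on the left
    return rollout


# Hypothetical toy inputs: 1 CLS token plus 4 patches.
uniform_row = [0.2, 0.2, 0.2, 0.2, 0.2]   # attention spread evenly
peaked_row = [0.0, 1.0, 0.0, 0.0, 0.0]    # attention on a single patch
entropy_gap = cls_to_patch_entropy(uniform_row) - cls_to_patch_entropy(peaked_row)
```

Under these assumptions, a drift measure for one image could be as simple as the difference between the entropy computed from the fine-tuned model's CLS row and that from the pretrained model's CLS row. Row 0 of the rollout matrix gives a CLS-to-token heatmap that can be compared in the same way.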