Attention Heatmap Drift in a Contrastively Pretrained Vision–Language Model: A Controlled Matched-Learning-Rate Comparison of Full Fine-Tuning and Low-Rank Adaptation

Apr 6, 2026

Ruize Xia

Overview image generated from the compiled PDF.

Abstract

This preprint analyzes attention heatmap drift in a contrastively pretrained vision-language model, comparing full fine-tuning and low-rank adaptation (LoRA) at matched learning rates. It studies the attention behavior of CLIP ViT-B/32 on downstream visual classification tasks using attention entropy, effective receptive field, concentration, head diversity, attention rollout, and representational similarity.
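For readers unfamiliar with the two adaptation regimes being compared, the following is a minimal sketch of a low-rank adapted linear layer next to its fully fine-tuned counterpart. It is not taken from the paper: the function names, the rank of 8, and the scaling factor `alpha` are illustrative assumptions, and only the hidden width of 768 corresponds to the ViT-B vision tower.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    alpha: hypothetical scaling hyperparameter (assumption, not from the paper)
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def trainable_params(d_in, d_out, r=None):
    """Trainable parameter count for one weight matrix.

    r=None means full fine-tuning (every entry of W is trained);
    otherwise only the two low-rank factors A and B are trained.
    """
    if r is None:
        return d_in * d_out
    return r * (d_in + d_out)

# Illustrative setup: width 768 as in ViT-B; rank 8 is an assumed value.
rng = np.random.default_rng(0)
d, r = 768, 8
W = rng.standard_normal((d, d)) * 0.02
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))          # zero init: the adapted layer starts at the pretrained function
x = rng.standard_normal((4, d))

y_pretrained = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha=16.0)
```

With `B` initialized to zero, the adapted layer reproduces the pretrained output exactly at the start of training, and under these assumed settings it trains 12,288 parameters per square projection instead of 589,824.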

Type

Preprint

Publication

Preprints

Authors: Ruize Xia
DOI: 10.20944/preprints202604.0317.v1
Source: ORCID record 0009-0000-0501-0943

Abstract

Fine-tuning contrastive vision-language models such as CLIP is an essential step for downstream task specialization, yet the associated changes in internal attention geometry remain insufficiently characterized. This preprint investigates attention heatmap drift in a contrastively pretrained vision-language model through a controlled matched-learning-rate comparison of full fine-tuning and low-rank adaptation. The study uses a multi-faceted metric suite, including CLS-to-patch attention entropy, effective receptive field, concentration, head diversity, attention rollout, representational analysis, and subset-sensitivity tests. The results provide a structural lens for understanding transformer adaptation and suggest that preserving attention structure is an important mechanism for maintaining foundation-model generalization.
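Two of the metrics named above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: it assumes attention tensors of shape (heads, tokens, tokens) with the CLS token at index 0, computes entropy in nats over the renormalized CLS-to-patch weights, and uses the common rollout recipe of averaging heads, mixing in the identity for the residual path, and multiplying across layers. The paper's exact definitions may differ, and the function names are hypothetical.

```python
import numpy as np

def cls_attention_entropy(attn, eps=1e-12):
    """Per-head Shannon entropy (nats) of CLS-to-patch attention.

    attn: array of shape (heads, tokens, tokens), rows sum to 1, token 0 is CLS.
    The CLS-to-CLS weight is dropped and the remaining weights are renormalized.
    """
    cls_to_patch = attn[:, 0, 1:]
    p = cls_to_patch / cls_to_patch.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + eps)).sum(axis=-1)

def attention_rollout(layers):
    """Attention rollout across layers.

    layers: list of arrays of shape (heads, tokens, tokens), ordered input to output.
    Each layer is head-averaged, mixed equally with the identity to account for
    the residual connection, row-normalized, and composed by matrix product.
    """
    n = layers[0].shape[-1]
    rollout = np.eye(n)
    for attn in layers:
        a = attn.mean(axis=0)
        a = 0.5 * a + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    return rollout

# Sanity check on uniform attention, with shapes matching ViT-B/32 at an
# assumed 224x224 input: 12 heads, 12 layers, 49 patches + 1 CLS token.
uniform = np.full((12, 50, 50), 1.0 / 50)
entropy = cls_attention_entropy(uniform)
rollout = attention_rollout([uniform] * 12)
```

Uniform attention gives the maximum entropy, log(49), for every head. Lower values after fine-tuning would indicate that the CLS token concentrates on fewer patches, which is the kind of drift such a metric is meant to expose.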

Last updated on Apr 6, 2026