Universitat Politècnica de Catalunya. Universitat Rovira i Virgili
Universitat Rovira i Virgili
Universitat de Barcelona
Moreno Ribas, Antonio
David Sánchez Ruenes, Josep Domingo i Ferrer
2026-01-27
Large language models fine-tuned on domain-specific data are vulnerable to membership inference attacks, which can reveal whether particular examples were used in training. While prior work has established that fine-tuned models exhibit higher vulnerability than pre-trained models, this research has focused almost exclusively on endpoint comparisons: it evaluates vulnerability after fine-tuning is complete, without examining how that vulnerability develops during training. This thesis investigates the progressive emergence of membership inference vulnerability across training epochs and its relationship with overfitting. We evaluate five membership inference attacks across five fine-tuning methods (full fine-tuning, LoRA, BitFit, adapter tuning, and prefix tuning), three model scales (1B, 6.9B, and 12B parameters), and five training epochs, yielding 375 attack evaluations. To ensure methodological rigor, we employ bag-of-words validation to verify that evaluation datasets are free from distribution artifacts that have confounded prior benchmarks. The central finding is a strong correlation between the training-validation loss gap, a standard measure of overfitting, and attack effectiveness across all experimental conditions. Pearson correlations range from 0.838 to 0.996 across attack methods, with all correlations statistically significant (p < 0.001). This relationship holds consistently across fine-tuning methods and model scales, suggesting that membership inference attacks primarily succeed when models are overfitted rather than by exploiting fundamental architectural vulnerabilities. Reference-based attacks, which compare the fine-tuned model's behavior against the original base model, show amplified sensitivity compared to attacks that examine only the fine-tuned model, achieving high effectiveness at lower overfitting levels. These findings suggest that standard generalization practices may reduce membership inference vulnerability alongside their benefits for model quality.
The loss gap, already monitored by practitioners for model selection, could serve as a practical privacy risk indicator during fine-tuning without requiring attack implementation. The core contributions of this thesis have been accepted for publication at RECSI 2026 (XVIII Reunión Española sobre Criptología y Seguridad de la Información).
Master thesis
English
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic; Àrees temàtiques de la UPC::Informàtica::Seguretat informàtica; Machine learning; Computer security; Atacs d'inferència de pertinença; Models de llenguatge grans; Ajust fi; Privacitat; Ajust fi eficient en paràmetres; Sobreajust; Membership inference attacks; Large language models; Fine-tuning; Privacy; Parameter-efficient fine-tuning; Overfitting; Aprenentatge automàtic; Seguretat informàtica
Universitat Politècnica de Catalunya
Open Access
Treballs acadèmics [82686]