Vision-Language Multimodal Fusion in Dermatological Disease Classification
La Quatra M.; Cilia N. D.; Conti V.; Sorce S.; Garraffa G.; Salerno V. M.
2025-01-01
Abstract
Accurate diagnosis of dermatological diseases traditionally depends on expert analysis of visual data, which can be both resource-intensive and difficult to scale. In response to these challenges, this study explores the use of pre-trained Vision-Language Models (VLMs) to automate the annotation of dermatological images. These models generate detailed text descriptions that mimic expert dermatological assessments, providing an additional modality for disease classification. While visual data remains the primary factor for high accuracy, the inclusion of VLM-generated text annotations leads to a measurable improvement in classification performance. We experimented with several late-fusion strategies to combine visual and textual data, demonstrating that this multimodal approach enhances the overall effectiveness of disease classification systems. The results indicate that although VLM-generated descriptions are secondary to visual information, they add valuable context that improves diagnostic accuracy when combined with image-based models. This approach contributes to making automated disease classification more effective and accessible. We release the automatically generated annotations to support further research in this domain.
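The abstract does not specify the exact fusion architecture, but as a rough illustration of what a concatenation-based late-fusion classifier can look like, the sketch below combines a precomputed image embedding (e.g., from a vision backbone) with a text embedding of a VLM-generated description and passes the result to a small classification head. All module names, embedding dimensions, and layer widths are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Toy late-fusion head: concatenate precomputed image and text
    embeddings, then classify. Sizes are illustrative only."""

    def __init__(self, img_dim=768, txt_dim=768, hidden_dim=512, num_classes=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Late fusion: each modality is encoded separately upstream and the
        # features are only combined here, just before the classifier.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)


# Random stand-ins for image features (e.g., from a ViT) and text features
# (e.g., from a text encoder applied to VLM-generated descriptions).
model = LateFusionClassifier()
img_emb = torch.randn(4, 768)
txt_emb = torch.randn(4, 768)
logits = model(img_emb, txt_emb)
print(logits.shape)  # torch.Size([4, 7])
```

Other late-fusion variants mentioned in the abstract (e.g., weighted averaging of per-modality predictions) would replace the concatenation step above; the choice of strategy is one of the factors the study compares.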