
Vision-Language Multimodal Fusion in Dermatological Disease Classification

La Quatra M.; Cilia N. D.; Conti V.; Sorce S.; Garraffa G.; Salerno V. M.
2025-01-01

Abstract

Accurate diagnosis of dermatological diseases traditionally depends on expert analysis of visual data, which can be both resource-intensive and difficult to scale. In response to these challenges, this study explores the use of pre-trained Vision-Language Models (VLMs) to automate the annotation of dermatological images. These models generate detailed text descriptions that mimic expert dermatological assessments, providing an additional modality for disease classification. While visual data remains the primary factor for high accuracy, the inclusion of VLM-generated text annotations leads to a measurable improvement in classification performance. We experimented with several late-fusion strategies to combine visual and textual data, demonstrating that this multimodal approach enhances the overall effectiveness of disease classification systems. The results indicate that although VLM-generated descriptions are secondary to visual information, they add valuable context that improves diagnostic accuracy when combined with image-based models. This approach contributes to making automated disease classification more effective and accessible. We release the automatically generated annotations to support further research in this domain.
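To make the late-fusion idea concrete, the sketch below shows one common variant: concatenating precomputed per-modality embeddings and classifying the fused vector. This is a minimal illustration, assuming an image embedding from a vision encoder and a text embedding computed over the VLM-generated description; the embedding dimensions, class count, and head architecture are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenation-based late fusion of image and text embeddings.

    Each modality is encoded independently upstream; the modalities
    meet only at this classification head (hence "late" fusion).
    Dimensions and class count below are illustrative assumptions.
    """

    def __init__(self, img_dim=768, txt_dim=768, num_classes=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Fuse by concatenation along the feature dimension.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)

# Example: a batch of 4 samples with precomputed unimodal embeddings.
img_emb = torch.randn(4, 768)  # e.g. from an image encoder
txt_emb = torch.randn(4, 768)  # e.g. from a text encoder over VLM captions
logits = LateFusionClassifier()(img_emb, txt_emb)
print(logits.shape)  # torch.Size([4, 7])
```

Other late-fusion strategies the abstract alludes to could, for instance, average or weight per-modality logits instead of concatenating features; which variants the authors compare is not specified in this record.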
2025
ISBN: 9783031882166; 9783031882173


Use this identifier to cite or link to this document: https://hdl.handle.net/11387/194673