Vision-Language Multimodal Fusion in Dermatological Disease Classification
La Quatra M.; Cilia N. D.; Conti V.; Sorce S.; Garraffa G.; Salerno V. M.
2025-01-01
Abstract
Accurate diagnosis of dermatological diseases traditionally depends on expert analysis of visual data, which can be both resource-intensive and difficult to scale. In response to these challenges, this study explores the use of pre-trained Vision-Language Models (VLMs) to automate the annotation of dermatological images. These models generate detailed text descriptions that mimic expert dermatological assessments, providing an additional modality for disease classification. While visual data remains the primary factor for high accuracy, the inclusion of VLM-generated text annotations leads to a measurable improvement in classification performance. We experimented with several late-fusion strategies to combine visual and textual data, demonstrating that this multimodal approach enhances the overall effectiveness of disease classification systems. The results indicate that although VLM-generated descriptions are secondary to visual information, they add valuable context that improves diagnostic accuracy when combined with image-based models. This approach contributes to making automated disease classification more effective and accessible. We release the automatically generated annotations to support further research in this domain.
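The abstract does not specify the exact fusion architecture, but as a rough illustration of what a concatenation-based late-fusion classifier can look like, the sketch below combines a precomputed image embedding (e.g., from a vision backbone) with a text embedding of a VLM-generated description and passes the result to a small classification head. All module names, embedding dimensions, and layer widths are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Toy late-fusion head: concatenate precomputed image and text
    embeddings, then classify. Sizes are illustrative only."""

    def __init__(self, img_dim=768, txt_dim=768, hidden_dim=512, num_classes=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Late fusion: each modality is encoded separately upstream and the
        # features are only combined here, just before the classifier.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)


# Random stand-ins for image features (e.g., from a ViT) and text features
# (e.g., from a text encoder applied to VLM-generated descriptions).
model = LateFusionClassifier()
img_emb = torch.randn(4, 768)
txt_emb = torch.randn(4, 768)
logits = model(img_emb, txt_emb)
print(logits.shape)  # torch.Size([4, 7])
```

Other late-fusion variants mentioned in the abstract (e.g., weighted averaging of per-modality predictions) would replace the concatenation step above; the choice of strategy is one of the factors the study compares.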