Text-guided diverse-expression diffusion model for molecule generation
Pau, Giovanni
2025-01-01
Abstract
The task of text-guided molecule generation has been proposed to generate molecules that match given text descriptions. Mainstream methods typically represent molecules with the simplified molecular-input line-entry system (SMILES) and rely on diffusion models or autoregressive architectures for modeling. However, the one-to-many mapping between a molecule and its SMILES strings forces existing methods to adopt complex model architectures and larger training datasets to improve performance, which reduces the efficiency of both training and generation. In this paper, we propose a text-guided diverse-expression diffusion (TGDD) model for molecule generation. TGDD combines SMILES and self-referencing embedded strings (SELFIES) into a novel diverse-expression molecular representation, enabling precise mapping from natural language to molecules. By leveraging this diverse-expression representation, TGDD simplifies the segmented diffusion generation process, achieving faster training and reduced memory consumption while exhibiting stronger alignment with natural language. TGDD outperforms both TGM-LDM and the autoregressive model MolT5-Base on most evaluation metrics.
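The abstract does not detail how the diverse-expression representation is constructed, so the following is only a minimal sketch of the underlying idea: pairing a canonical SMILES string with its SELFIES equivalent for the same molecule. It uses the open-source rdkit and selfies packages; the helper name diverse_expression is illustrative and is not part of the TGDD code.

```python
# Illustrative sketch (not the TGDD implementation): encode one molecule as both
# a canonical SMILES string and a SELFIES string, using rdkit and selfies.
from rdkit import Chem
import selfies as sf


def diverse_expression(smiles: str) -> dict:
    """Return canonical SMILES and SELFIES views of a molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    canonical = Chem.MolToSmiles(mol)    # canonicalization collapses one-to-many SMILES variants
    selfies_str = sf.encoder(canonical)  # SELFIES: every syntactically valid string decodes to a valid molecule
    return {"smiles": canonical, "selfies": selfies_str}


if __name__ == "__main__":
    # Two different SMILES spellings of caffeine map to the same canonical pair.
    for variant in ["CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "Cn1cnc2c1c(=O)n(C)c(=O)n2C"]:
        print(diverse_expression(variant))
```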