Text-guided diverse-expression diffusion model for molecule generation
Pau, Giovanni
2025-01-01
Abstract
The task of text-guided molecule generation has been proposed to generate molecules that match given text descriptions. Mainstream methods typically represent molecules with the simplified molecular-input line-entry system (SMILES) and rely on diffusion models or autoregressive architectures for modeling. However, the one-to-many mapping between a molecule and its SMILES strings forces existing methods to adopt complex model architectures and larger training datasets to improve performance, which reduces the efficiency of both training and generation. In this paper, we propose a text-guided diverse-expression diffusion (TGDD) model for molecule generation. TGDD combines SMILES and self-referencing embedded strings (SELFIES) into a novel diverse-expression molecular representation, enabling precise mapping from natural language to molecules. By leveraging this diverse-expression representation, TGDD simplifies the segmented diffusion generation process, achieving faster training and reduced memory consumption while exhibiting stronger alignment with natural language. TGDD outperforms both TGM-LDM and the autoregressive model MolT5-Base on most evaluation metrics.
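The abstract does not detail how the diverse-expression representation is constructed, so the following is only a minimal sketch of the underlying idea: pairing a canonical SMILES string with its SELFIES equivalent for the same molecule. It uses the open-source rdkit and selfies packages; the helper name diverse_expression is illustrative and is not part of the TGDD code.

```python
# Illustrative sketch (not the TGDD implementation): encode one molecule as both
# a canonical SMILES string and a SELFIES string, using rdkit and selfies.
from rdkit import Chem
import selfies as sf


def diverse_expression(smiles: str) -> dict:
    """Return canonical SMILES and SELFIES views of a molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    canonical = Chem.MolToSmiles(mol)    # canonicalization collapses one-to-many SMILES variants
    selfies_str = sf.encoder(canonical)  # SELFIES: every syntactically valid string decodes to a valid molecule
    return {"smiles": canonical, "selfies": selfies_str}


if __name__ == "__main__":
    # Two different SMILES spellings of caffeine map to the same canonical pair.
    for variant in ["CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "Cn1cnc2c1c(=O)n(C)c(=O)n2C"]:
        print(diverse_expression(variant))
```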