Unraveling learning characteristics of transformer models for molecular design

Published in Patterns, 2025

In drug design, transformer networks adapted from natural language processing are applied in a variety of ways. We used sequence-based generative compound design as a model system to explore the learning characteristics of transformers and to determine whether these models learn information relevant for protein-ligand interactions. The analysis reveals that sequence-based prediction of active compounds with transformer models required retention of at least ∼60% of the original test sequences. Moreover, the predictions depended on sequence and compound similarity between training and test data and on compound memorization effects. The predictions were purely statistically driven, associating sequence patterns with molecular structures, which rationalizes their strict dependence on detectable similarities. Furthermore, the transformer models did not learn target sequence information relevant for ligand binding. While these results do not generally call sequence-based compound design approaches into question, they caution against over-interpretation of transformer models used for such applications.
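
To make the ∼60% sequence-retention finding concrete, a truncation probe of this kind could be set up roughly as sketched below. This is illustrative only: `model.generate_compounds` is a hypothetical stand-in for a trained sequence-to-compound transformer's sampling interface, not the study's actual code, and compounds are assumed to be represented as canonical SMILES strings so that set comparison is meaningful.

```python
# Hypothetical probe: how much of a target sequence does a sequence-to-compound
# transformer need before it recovers known active compounds?
import random

def truncate(sequence, fraction, rng):
    """Keep a random contiguous stretch covering `fraction` of the residues."""
    keep = max(1, int(len(sequence) * fraction))
    start = rng.randrange(len(sequence) - keep + 1)
    return sequence[start:start + keep]

def recovery_curve(model, test_pairs, fractions, seed=0):
    """For each retained fraction, report how often a known active is recovered.

    `test_pairs` is a list of (target_sequence, known_active_smiles) tuples;
    `model.generate_compounds` is a hypothetical sampling call.
    """
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        hits = 0
        for sequence, known_actives in test_pairs:
            sampled = model.generate_compounds(truncate(sequence, frac, rng))
            hits += bool(set(sampled) & set(known_actives))
        curve[frac] = hits / len(test_pairs)
    return curve

# e.g. recovery_curve(model, test_pairs, fractions=[0.2, 0.4, 0.6, 0.8, 1.0]);
# in the study, recovery required retaining at least ~60% of the sequence.
```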
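The dependence on compound similarity and memorization could be quantified along similar lines. The following is a rough RDKit-based sketch (the paper's exact analysis may differ): it flags generated compounds that exactly reproduce training compounds after SMILES canonicalization and reports each compound's nearest-neighbor Tanimoto similarity to the training set.

```python
# Rough memorization / similarity check for generated compounds
# (illustrative only; not the protocol used in the paper).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def canonical(smiles):
    """Return a canonical SMILES string, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol else None

def analyze(generated, training, radius=2, n_bits=2048):
    train_canon = {c for c in (canonical(s) for s in training) if c}
    train_fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, n_bits)
        for s in train_canon
    ]
    for smi in generated:
        can = canonical(smi)
        if can is None:
            continue  # skip invalid SMILES
        # Exact reproduction of a training compound = memorization
        memorized = can in train_canon
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(can), radius, n_bits
        )
        # Nearest-neighbor Tanimoto similarity to the training set
        nn_sim = max(DataStructs.TanimotoSimilarity(fp, tfp) for tfp in train_fps)
        print(f"{can}\tmemorized={memorized}\tNN-Tanimoto={nn_sim:.2f}")

# Toy example (hypothetical data): "CCO" and "OCC" canonicalize identically,
# so the first generated compound would be flagged as memorized.
analyze(generated=["CCO", "c1ccccc1O"], training=["OCC", "c1ccccc1"])
```

Canonicalizing both sets before comparison is what makes the exact-match test meaningful, since the same molecule can be written as many different SMILES strings.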