Presentation

Scalable Lead Prediction with Transformers Using HPC Resources

Session

Ninth Computational Approaches for Cancer Workshop (CAFCW23)

Description

A promising direction in cancer drug discovery is high-throughput screening of large compound datasets to identify compounds with advantageous properties, including the ability to interact with relevant biomolecules such as proteins. However, traditional structure-based approaches for assessing binding affinity, such as free energy methods or molecular docking, become significant computational bottlenecks at this scale. To address this, we have developed a docking surrogate called the SMILES transformer (ST), which learns molecular features from the SMILES representation of compounds and approximates their binding affinity. SMILES strings are first tokenized with a well-established SMILES-pair tokenizer and fed into a BERT-like Transformer model, which generates a vector embedding for each molecule. These embeddings are then passed to a regression model that predicts the binding affinity. Leveraging the high-performance computing resources at Argonne National Laboratory, we devised a workflow to scale model training and inference across multiple supercomputing nodes. To evaluate the performance and accuracy of the workflow, we ran experiments using molecular docking binding affinity data for multiple receptors, comparing ST with another state-of-the-art docking surrogate. Both surrogates yielded comparable validation R² (val-r2) values of between 70% and 90%, affirming the capability of ST to learn molecular features directly from language-based data. A further advantage of the ST approach is that its tokenization preprocessing is notably faster than that of the alternative method, which requires generating molecular descriptors with Mordred. Our workflow enabled screening of ~3 billion compounds on 48 nodes of the Polaris supercomputer in approximately an hour.
In summary, our approach provides an efficient means of screening extensive compound databases for molecules whose properties make them potential lead compounds targeting cancer. Looking ahead, an important future direction for our workflow is integrating de novo drug design, which would let us scale our efforts to explore the limits of synthesizable compounds within chemical space.
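
To make the first stage of the pipeline and the reported throughput more concrete, the following is a minimal sketch and not the authors' code. It shows only an atom-level split of a SMILES string, using a regular expression that is common in the cheminformatics literature; the SMILES-pair tokenizer named in the description would additionally merge frequent token pairs, a step that is not reproduced here. The function names `tokenize_smiles` and `per_node_rate` are hypothetical, and the throughput figure simply divides the numbers quoted above, assuming "approximately an hour" means 3600 seconds.

```python
import re

# Atom-level SMILES pattern (an assumption: a regex commonly used in the
# literature, not necessarily the tokenizer used in this work).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if some characters are not covered by the pattern,
    so that silently dropped symbols cannot corrupt the model input.
    """
    tokens = SMILES_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def per_node_rate(n_compounds: float, n_nodes: int, seconds: float) -> float:
    """Average compounds screened per node per second."""
    return n_compounds / n_nodes / seconds


if __name__ == "__main__":
    # Aspirin: every character is its own token in this case.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
    # Two-letter halogens and bracket atoms stay together as single tokens.
    print(tokenize_smiles("ClCBr"))
    print(tokenize_smiles("[NH4+]"))
    # ~3 billion compounds, 48 nodes, assumed 3600 s.
    print(round(per_node_rate(3e9, 48, 3600)))
```

Under these assumptions, the quoted screening run corresponds to an average of roughly 17,000 compounds per node per second, which illustrates why a cheap preprocessing step such as tokenization matters at this scale.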

Author/Presenters

Archit Vasan

Argonne National Laboratory