mit / diffdock

Predicts the 3D structure of how a molecule interacts with a protein.

Tags: BioNeMo | Chemistry | Docking | NIM | Drug Discovery
Accelerated by DGX Cloud

Model Overview

Description:

DiffDock is a generative diffusion model for drug discovery in molecular blind docking.

DiffDock consists of two models: a Score model and a Confidence model. The Score model generates a set of candidate protein-ligand binding poses by running a reverse diffusion process, and the Confidence model ranks them.

DiffDock does not require any information about a binding pocket. During the diffusion process, the molecule's position relative to the protein, its orientation, and its torsion angles are all allowed to change. Running the learned reverse diffusion process transforms a distribution of noisy prior poses into the pose distribution learned by the model. As a result, DiffDock outputs many sampled poses and ranks them with its Confidence model.
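
The generate-then-rank workflow can be summarized as follows. This is a minimal conceptual sketch, not the DiffDock implementation: score_model, confidence_model, and sample_prior are placeholder callables, and the real model works on featurized protein-ligand graphs with a learned noise schedule.

```python
import torch


def blind_dock(score_model, confidence_model, sample_prior,
               protein, num_poses=10, steps=20):
    """Conceptual generate-then-rank loop in the spirit of DiffDock.

    All callables are placeholders supplied by the caller; the real model
    operates on featurized protein-ligand graphs with a learned noise
    schedule, not on the raw objects sketched here.
    """
    poses = []
    for _ in range(num_poses):
        # Draw a noisy prior pose: random translation, orientation, torsions.
        pose = sample_prior()
        # Reverse diffusion: the Score model iteratively refines the ligand's
        # position, orientation, and torsion angles toward the learned
        # pose distribution.
        for t in reversed(range(steps)):
            pose = score_model(pose, protein, t)
        poses.append(pose)
    # Rank all sampled poses with the Confidence model (higher = more reliable).
    confidences = torch.stack([confidence_model(pose, protein) for pose in poses])
    order = torch.argsort(confidences, descending=True)
    return [poses[int(i)] for i in order], confidences[order]
```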

Leveraging the same neural-network architecture as the original DiffDock from MIT, model v2.2.0 is trained by NVIDIA on PLINDER and SAIR, state-of-the-art datasets of well-curated and labeled protein-ligand complexes, and therefore delivers much higher accuracy on molecular docking tasks.

This model is ready for commercial and non-commercial use.

License/Terms of Use:

Use of this model is governed by the NVIDIA Open Model License. Additional Information: MIT.

Deployment Geography:

Global

Use Case:

DiffDock is designed for computational chemists, bioinformaticians, and pharmaceutical researchers. Its primary use case is predicting the binding poses of small molecules (ligands) to target proteins, facilitating drug discovery by identifying and optimizing potential therapeutic compounds.

Release Date:

  • GitHub: 10/03/2022 via github.com/gcorso/DiffDock
  • Hugging Face: 12/01/2022 via huggingface.co/spaces/simonduerr/diffdock
  • build.nvidia.com: 01/08/2026 via build.nvidia.com/mit/diffdock
  • NGC: 01/08/2026 via catalog.ngc.nvidia.com

References:

@article {Durairaj2024.07.17.603955,
	author = {Durairaj, Janani and Adeshina, Yusuf and Cao, Zhonglin and Zhang, Xuejin and Oleinikovas, Vladas and Duignan, Thomas and McClure, Zachary and Robin, Xavier and Studer, Gabriel and Kovtun, Daniel and Rossi, Emanuele and Zhou, Guoqing and Veccham, Srimukh and Isert, Clemens and Peng, Yuxing and Sundareson, Prabindh and Akdel, Mehmet and Corso, Gabriele and St{\"a}rk, Hannes and Tauriello, Gerardo and Carpenter, Zachary and Bronstein, Michael and Kucukbenli, Emine and Schwede, Torsten and Naef, Luca},
	title = {PLINDER: The protein-ligand interactions dataset and evaluation resource},
	elocation-id = {2024.07.17.603955},
	year = {2024},
	doi = {10.1101/2024.07.17.603955},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities. To address these shortcomings, we present PLINDER, the largest and most annotated dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction and ligand levels, and paired unbound (apo) and predicted structures. We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when retrained with different kinds of splits. Competing Interest Statement: The authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2024/07/19/2024.07.17.603955.1},
	eprint = {https://www.biorxiv.org/content/early/2024/07/19/2024.07.17.603955.1.full.pdf},
	journal = {bioRxiv}
}
@article{corso2023diffdock,
      title={DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking}, 
      author = {Corso, Gabriele and Stärk, Hannes and Jing, Bowen and Barzilay, Regina and Jaakkola, Tommi},
      journal={International Conference on Learning Representations (ICLR)},
      year={2023}
}

Model Architecture:

Architecture Type: Score-Based Diffusion Model (SBDM)
Network Architecture: Graph Convolutional Neural Network

The Score model is a 3D-equivariant graph neural network with three stages: an embedding layer, an interaction stage with six graph convolution layers, and an output layer. In total, the Score model has 20M parameters.
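
For orientation, the three-stage layout can be sketched as a PyTorch skeleton. The layer widths and the plain linear layers standing in for graph convolutions are illustrative assumptions, not the actual equivariant DiffDock network.

```python
import torch
import torch.nn as nn


class ScoreModelSkeleton(nn.Module):
    """Illustrative three-stage layout: embedding -> six interaction
    (graph-convolution) layers -> output head. The widths and the plain
    linear layers standing in for graph convolutions are placeholders,
    not the real equivariant DiffDock network."""

    def __init__(self, in_dim=64, hidden_dim=128, out_dim=3, num_conv_layers=6):
        super().__init__()
        self.embedding = nn.Linear(in_dim, hidden_dim)
        self.interaction = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_conv_layers)]
        )
        self.output = nn.Linear(hidden_dim, out_dim)

    def forward(self, node_features):
        h = torch.relu(self.embedding(node_features))
        for conv in self.interaction:
            # Placeholder for message passing over the protein-ligand graph.
            h = torch.relu(conv(h))
        return self.output(h)
```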

Input:

Input Type(s): Text (Ligand, Protein); Number (Poses to Generate, Batch Size, Diffusion Steps, Diffusion Time Divisions); Binary (No Final Step Noise, Save Diffusion Trajectory, Skip Gen Conformer)
Input Format(s): Text: String (SMILES, Structural Data File (SDF), or Tripos molecule structure (Mol2) for Ligand; Protein Data Bank (PDB) for Protein); Number: Integer; Binary: Boolean
Input Parameters: Text: One-Dimensional (1D); Number: One-Dimensional (1D); Binary: One-Dimensional (1D)
Other Properties Related to Input: No maximum sequence length

Output:

Output Type(s): Text (Docked Ligand 3D Positions, 3D), Text (Diffusion Trajectory Visualization Files, 3D), Number (List of Confidence Scores, 1D)
Output Format: Text: Structural Data File (SDF); Text: Protein Data Bank (PDB); Number: Array of Floating Point 32
Output Parameters: docked_ligand, visualizations_files, pose_confidence
Other Properties Related to Output: Output includes ranked binding poses with associated confidence scores. Higher confidence scores indicate more reliable predictions.
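
As a usage illustration, the inputs and outputs above map naturally onto an HTTP request to the hosted NIM. The following sketch assumes a NIM-style endpoint; the URL, environment variable, and JSON field names are assumptions drawn from the parameter lists above and should be checked against the API Reference.

```python
import os

import requests

# Hypothetical endpoint; confirm the exact URL and JSON schema in the API Reference.
INVOKE_URL = "https://health.api.nvidia.com/v1/biology/mit/diffdock"


def dock(protein_pdb_path, ligand_sdf_path, num_poses=10, steps=18, time_divisions=20):
    """Submit a blind-docking request and return poses sorted by confidence.

    The request/response field names mirror the input and output parameters
    listed above, but they are assumptions, not a verified schema.
    """
    with open(protein_pdb_path) as f:
        protein = f.read()
    with open(ligand_sdf_path) as f:
        ligand = f.read()

    payload = {
        "protein": protein,            # PDB text
        "ligand": ligand,              # SMILES, SDF, or Mol2 text
        "ligand_file_type": "sdf",     # assumed field for the ligand format
        "num_poses": num_poses,
        "steps": steps,
        "time_divisions": time_divisions,
        "save_trajectory": False,      # set True to also get visualization PDBs
    }
    headers = {
        "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",  # assumed env var
        "Accept": "application/json",
    }
    response = requests.post(INVOKE_URL, headers=headers, json=payload)
    response.raise_for_status()
    result = response.json()

    # Per the output list above: docked ligand SDF blocks and one confidence
    # score per pose; higher confidence indicates a more reliable prediction.
    poses = result.get("docked_ligand", [])
    confidences = result.get("pose_confidence", [])
    return sorted(zip(poses, confidences), key=lambda x: x[1], reverse=True)
```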

Software Integration:

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Runtime Engine(s):

  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Ada Lovelace
  • NVIDIA Hopper
  • NVIDIA Grace Hopper

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
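
Before a local deployment, a quick check that the PyTorch runtime engine can reach a supported NVIDIA GPU can save debugging time; the snippet below is generic PyTorch, not DiffDock-specific.

```python
import torch

# Quick sanity check that the PyTorch runtime can reach a supported NVIDIA GPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Using {name} (compute capability {major}.{minor})")
else:
    device = torch.device("cpu")
    print("No CUDA device found; inference will fall back to CPU and run slower.")
```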

Model Version(s):

DiffDock v2.2

Training & Evaluation Dataset:

Data Modality:

Other: 3D Molecular Structures (protein-ligand complexes)

Training Data Size:

486,000 protein-ligand complexes (450,000 from PLINDER + 36,000 selected from SAIR)

Training:

Link: PLINDER

Link: SAIR

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)): 486,000 protein-ligand complexes (450,000 from PLINDER automatically curated using the PDB database + 36,000 selected from SAIR). For more information, see Technical Paper.

Evaluation:

Link: PoseBusters benchmark (PDB) set

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)): Protein-ligand complexes from the PoseBusters benchmark set, curated from the PDB. For more information, see Technical Paper.

Inference:

Engine: PyTorch
Test Hardware:

  • A10G, A100, RTX6000-Ada, H100, L40, L40S

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
