ipd/proteinmpnn

ProteinMPNN is a deep learning model for predicting amino acid sequences for protein backbones.

Model Overview

Description:

ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based graph neural network designed to predict amino acid sequences for given protein backbones. This network leverages evolutionary, functional, and structural information to generate sequences that are likely to fold into the desired 3D structures.

This model is available for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case.

License/Terms of Use:

This model is released under the MIT License.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

References:

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}

Model Architecture:

Architecture Type: Protein Amino Acid Sequence Prediction
Network Architecture: ProteinMPNN

Input:

Input Type(s): Protein in Protein Data Bank (PDB) format
Input Format(s): String
Input Parameters: One-Dimensional (1D)

Output:

Output Type(s): Amino Acid Sequence
Output Format: Multi-FASTA (text file)
Output Parameters: 1D
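
The input and output conventions above can be illustrated with a minimal Python sketch. It assumes the returned multi-FASTA text follows the standard layout of a `>` header line followed by one or more sequence lines. The function name `parse_multi_fasta` and the sample records are hypothetical; they are not part of any ProteinMPNN or NVIDIA API, and the sequences shown are not real model output.

```python
from typing import Dict


def parse_multi_fasta(text: str) -> Dict[str, str]:
    """Parse multi-FASTA text into a {header: sequence} mapping.

    Assumes standard FASTA layout: a '>' header line followed by one or
    more sequence lines. Blank lines are skipped. If two records share a
    header, the later one replaces the earlier one.
    """
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Hypothetical multi-FASTA text, for illustration only.
example_output = """>design_1
MKTAYIAKQR
QISFVKSHFS
>design_2
MKSAYLAKQR
"""

designs = parse_multi_fasta(example_output)
for name, sequence in designs.items():
    print(name, len(sequence), sequence)
```

The PDB input would be passed as one string in the same way, for example by reading a local file with `open(path).read()`; the exact request shape depends on the deployment and is not specified here.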

Software Integration:

Runtime Engine(s):

  • Triton

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Lovelace
  • NVIDIA Turing

Supported Operating System(s):

  • Linux

Model Version(s):

ProteinMPNN 1.0.0

Training & Evaluation:

Training Dataset:

Link: The Protein Data Bank

Data Collection Method by dataset:

  • Hybrid: Automatic, Human

For the PDB dataset, scientists worldwide submit structural data determined by X-ray crystallography or cryo-electron microscopy (cryo-EM). Submissions include atomic coordinates, experimental data, and metadata about the biological macromolecules.

Labeling Method by dataset:

  • Hybrid: Automatic, Human

For the PDB dataset, expert biocurators review the submitted data to ensure accuracy and completeness. This involves checking the plausibility of the data and annotating it with relevant biological and chemical information.

The CATH 4.2 dataset is derived from the PDB dataset. The CATH (Class, Architecture, Topology, Homologous superfamily) database hierarchically classifies protein domain structures obtained from protein structures deposited in the PDB. The data in CATH are sourced from PDB files and include structures determined at a resolution of 4 angstroms or better. The classification process combines manual and automated methods to ensure accurate domain identification and classification.

For ProteinMPNN, the data underwent quality filtering to ensure high accuracy; this involved removing structures with low resolution and potential errors.
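
As a rough illustration of resolution-based quality filtering, the sketch below reads the resolution from the `REMARK   2 RESOLUTION.` record of a PDB-format file and keeps only entries at or below a cutoff. This is not the filtering pipeline used for ProteinMPNN, which is not described here. The function names, the sample header fragments, and the default cutoff of 4.0 angstroms (borrowed from the CATH figure quoted above) are assumptions made for the example.

```python
import re
from typing import Optional

# Matches the PDB-format record "REMARK   2 RESOLUTION.    2.00 ANGSTROMS."
_RESOLUTION_RE = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+([0-9]*\.?[0-9]+)\s+ANGSTROMS",
    re.MULTILINE,
)


def read_resolution(pdb_text: str) -> Optional[float]:
    """Return the reported resolution in angstroms, or None if absent."""
    match = _RESOLUTION_RE.search(pdb_text)
    return float(match.group(1)) if match else None


def passes_resolution_filter(pdb_text: str, max_resolution: float = 4.0) -> bool:
    """Keep a structure only if it reports a resolution <= max_resolution.

    Lower values mean finer detail. Entries with no numeric resolution
    are rejected in this sketch; that policy is an assumption.
    """
    resolution = read_resolution(pdb_text)
    return resolution is not None and resolution <= max_resolution


# Hypothetical header fragments, for illustration only.
samples = {
    "sharp": "HEADER    EXAMPLE\nREMARK   2\nREMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n",
    "coarse": "REMARK   2 RESOLUTION.    6.50 ANGSTROMS.\n",
    "unknown": "REMARK   2 RESOLUTION. NOT APPLICABLE.\n",
}

kept = [name for name, text in samples.items() if passes_resolution_filter(text)]
print(kept)
```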

Properties (Quantity, Dataset Descriptions, Sensor(s)): The model was trained by the Institute for Protein Design. The training dataset consisted of 23,358 sequences. Datasets: CATH 4.2, PDB. Sensors: X-ray crystallography, cryo-EM.

Evaluation Dataset:

Data Collection Method by dataset:

  • Automatic: random splits from the PDB dataset.

Labeling Method by dataset:

  • Automatic: random splits from the PDB dataset.

The training, validation, and test splits were derived from protein assemblies in the PDB, which includes structures determined by X-ray crystallography or cryo-electron microscopy (cryo-EM). The dataset was divided into random splits with 23,358 sequences for training, 1,464 for validation, and 1,529 for testing.
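
To make the split proportions concrete, the short sketch below computes the total and the share of each split from the three counts quoted above. Only the counts come from this card; the variable names are illustrative.

```python
# Split sizes as quoted in this model card.
splits = {"train": 23_358, "validation": 1_464, "test": 1_529}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:>10}: {count:>6} ({fractions[name]:.1%})")
print(f"{'total':>10}: {total:>6}")
```

By this arithmetic the three splits sum to 26,351 entries, of which roughly 88.6% are used for training, 5.6% for validation, and 5.8% for testing.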

Inference:

Engine: Triton
Test Hardware:

  • A100
  • L40
  • H100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.