ipd/proteinmpnn
ProteinMPNN is a deep learning model for predicting amino acid sequences for protein backbones.
Model Overview
Description:
ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based
graph neural network designed to predict amino acid sequences for given protein
backbones. This network leverages evolutionary, functional, and structural
information to generate sequences that are likely to fold into the desired 3D
structures.
This model is available for commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed
and built to a third-party’s requirements for this application and use case.
License/Terms of Use:
This model is released under the MIT License.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
References:
@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}
Model Architecture:
Architecture Type: Protein Amino Acid Sequence Prediction
Network Architecture: ProteinMPNN
Input:
Input Type(s): Protein in Protein Data Bank (PDB) format
Input Format(s): String
Input Parameters: One-Dimensional (1D)
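Because the input is a PDB-format string and the model works from the protein backbone, it can help to check which backbone atoms a file actually contains before submitting it. The following is a minimal sketch under stated assumptions: it relies only on the fixed-column layout of PDB ATOM records, the function name `extract_backbone` is hypothetical, and it does not call ProteinMPNN or reflect how the model itself parses its input.

```python
from typing import Dict, Tuple

# Backbone atom names in the PDB naming convention.
BACKBONE_ATOMS = ("N", "CA", "C", "O")

Coord = Tuple[float, float, float]
ResidueKey = Tuple[str, int]  # (chain ID, residue sequence number)


def extract_backbone(pdb_text: str) -> Dict[ResidueKey, Dict[str, Coord]]:
    """Collect backbone atom coordinates per residue from a PDB string.

    Hypothetical helper for illustration only; it ignores alternate
    locations and insertion codes.
    """
    residues: Dict[ResidueKey, Dict[str, Coord]] = {}
    for line in pdb_text.splitlines():
        # Only ATOM records long enough to hold x, y, z are considered.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        atom_name = line[12:16].strip()      # columns 13-16
        if atom_name not in BACKBONE_ATOMS:
            continue
        chain_id = line[21]                  # column 22
        res_seq = int(line[22:26])           # columns 23-26
        xyz = (
            float(line[30:38]),              # columns 31-38
            float(line[38:46]),              # columns 39-46
            float(line[46:54]),              # columns 47-54
        )
        residues.setdefault((chain_id, res_seq), {})[atom_name] = xyz
    return residues
```

A residue whose entry lacks one of the four backbone atoms can then be flagged before the structure is used as input.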
Output:
Output Type(s): Amino Acid Sequence
Output Format: Multi-FASTA (text file)
Output Parameters: 1D
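Since the output is Multi-FASTA text, downstream code usually needs to split it into individual records. The sketch below is one way to do that in plain Python, assuming standard FASTA conventions (a `>` header line followed by one or more sequence lines). The function name `parse_multi_fasta` and the header strings in the example are hypothetical; the exact header fields the model emits are not specified here.

```python
from typing import Iterator, List, Optional, Tuple


def parse_multi_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from Multi-FASTA text.

    Hypothetical helper for illustration; sequence lines that belong to
    one record are joined into a single string.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Example with made-up headers and sequences.
example = ">design_1\nMKT\nAYI\n>design_2\nGSH\n"
records = list(parse_multi_fasta(example))
```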
Software Integration:
Runtime Engine(s):
- Triton
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Ada Lovelace
- NVIDIA Turing
Supported Operating System(s):
- Linux
Model Version(s):
ProteinMPNN 1.0.0
Training & Evaluation:
Training Dataset:
Link:
The Protein Data Bank
Data Collection Method by dataset:
- Hybrid: Automatic, Human
For the PDB dataset, scientists worldwide submit structural data
determined by X-ray crystallography or cryo-electron microscopy (cryo-EM).
This includes atomic coordinates, experimental data, and metadata about the
biological macromolecules.
Labeling Method by dataset:
- Hybrid: Automatic, Human
For the PDB dataset, expert biocurators review the submitted data to
ensure accuracy and completeness. This involves checking the plausibility of
the data and annotating it with relevant biological and chemical information.
The CATH 4.2 dataset is derived from the PDB dataset. The CATH
(Class, Architecture, Topology, Homologous superfamily) database
hierarchically classifies protein domain structures that are obtained from
protein structures deposited in the PDB. The data in CATH are specifically
sourced from PDB files and include structures determined at a resolution of 4
angstroms or better. The classification process involves both manual and
automated methods to ensure accurate domain identification and classification.
For ProteinMPNN, the data underwent quality filtering to ensure high accuracy.
This involved removing structures with low resolution and potential errors.
Properties (Quantity, Dataset Descriptions, Sensor(s)): The model was trained by the Institute for Protein Design. The training dataset consisted of 23,358 sequences. Dataset: CATH 4.2, PDB. Sensors: X-ray crystallography, cryo-EM.
Evaluation Dataset:
Data Collection Method by dataset:
- Automatic: random splits from PDB dataset.
Labeling Method by dataset:
- Automatic: random splits from PDB dataset.
The training, validation, and test splits were derived from protein assemblies
in the PDB, which includes structures determined by X-ray
crystallography or cryo-electron microscopy (cryo-EM). The dataset was divided
into random splits with 23,358 sequences for training, 1,464 for validation, and
1,529 for testing.
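The random split described above can be illustrated with a short sketch. It assumes a plain item-level shuffle with a fixed seed; the actual procedure, seed, and unit of splitting used to build the ProteinMPNN dataset are not stated here, and the function name `random_split` and the placeholder IDs are hypothetical.

```python
import random
from typing import List, Sequence


def random_split(
    items: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> List[List[str]]:
    """Shuffle items reproducibly and cut them into consecutive splits.

    Hypothetical helper for illustration; the sizes must sum to the
    number of items so that every item lands in exactly one split.
    """
    if sum(sizes) != len(items):
        raise ValueError("split sizes must sum to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    splits: List[List[str]] = []
    start = 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


# Split sizes taken from the text above; the IDs are placeholders.
SIZES = (23_358, 1_464, 1_529)
ids = [f"entry_{i}" for i in range(sum(SIZES))]
train, valid, test = random_split(ids, SIZES)
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.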
Inference:
Engine: Triton
Test Hardware:
- A100
- L40
- H100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.