meta/esm2-650m

PREVIEW

Generates embeddings of proteins from their amino acid sequences.

Model Overview

Description:

ESM2 is a state-of-the-art language model for protein sequences. It outputs numerical representations of input protein sequences that are suitable for downstream tasks; in particular, its output is used for protein folding in the ESMFold model. ESM2 comes in several sizes: 650M, 3B, and 15B parameters. ESM2 was developed and trained by Meta. The larger models tend to produce more accurate results on downstream tasks, but they also have longer runtimes.
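As a minimal sketch of generating these embeddings locally, the snippet below uses the publicly available Hugging Face transformers checkpoint facebook/esm2_t33_650M_UR50D; the hosted version of this model may expose a different interface.

```python
# Minimal sketch: per-residue ESM2-650M embeddings via Hugging Face
# transformers, using the public checkpoint facebook/esm2_t33_650M_UR50D.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states for every token, including the <cls> and <eos> specials;
# the 650M model uses a hidden size of 1280.
per_residue = outputs.last_hidden_state  # shape: (1, len(sequence) + 2, 1280)
```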

License

MIT License

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This model is available for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Model Card.

References:

@article{rives2021esm,
  title    = "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences",
  author   = "Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and Fergus, Rob",
  journal  = "Proc. Natl. Acad. Sci. U. S. A.",
  volume   = 118,
  number   = 15,
  month    = apr,
  year     = 2021,
  keywords = "deep learning; generative biology; protein language model; representation learning; synthetic biology",
  language = "en",
  doi      = {10.1073/pnas.2016239118}
}

For the self-attention contact prediction:

@article{rao2020transformer,
  author  = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title   = {Transformer protein language models are unsupervised structure learners},
  year    = {2020},
  doi     = {10.1101/2020.12.15.422761},
  url     = {https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal = {bioRxiv}
}

Model Architecture:

Architecture Type: Transformer encoder (BERT-style) with rotary position embeddings
Network Architecture: ESM2-650M

Input:

Input Type(s): Protein Sequence
Input Format(s): String
Input Parameters: 1D
Other Properties Related to Input: Protein sequence matching the regular expression ^[ARNDCQEGHILKMFPSTWYVXBOU]*$, up to 1024 characters
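A small helper for checking these constraints before submitting a request may look like the following; the function name is illustrative, not part of any API.

```python
import re

# Documented input constraints: amino-acid alphabet plus X/B/O/U,
# at most 1024 characters.
_VALID = re.compile(r"^[ARNDCQEGHILKMFPSTWYVXBOU]*$")
MAX_LEN = 1024

def is_valid_input(seq: str) -> bool:
    """Return True if seq satisfies the model card's input constraints."""
    return len(seq) <= MAX_LEN and _VALID.fullmatch(seq) is not None
```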

Output:

Output Type(s): Embeddings
Output Format: Float16 Array
Output Parameters: 1D
Other Properties Related to Output: Not Applicable
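Downstream tasks often need one fixed-length vector per protein rather than per-residue embeddings; mean pooling is one common convention (an assumption here, not something this card mandates). A minimal sketch:

```python
import numpy as np

def mean_pooled(per_residue: np.ndarray) -> np.ndarray:
    """Collapse a (length, hidden_dim) array of per-residue embeddings
    into a single hidden_dim-sized vector, cast to float16 to match
    the documented output format."""
    return per_residue.mean(axis=0).astype(np.float16)
```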

Software Integration:

Runtime Engine(s):

  • Not Applicable (N/A)

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Ada Lovelace (e.g., L40)

Supported Operating System(s):

  • Linux

Model Version(s):

ESM2-650M

Training & Evaluation:

Training Dataset:

Link: UniRef50
Data Collection Method by dataset:

  • Not Applicable

Labeling Method by dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): UniRef50, September 2021 version, is used for the training of ESM models. The training dataset was partitioned by randomly selecting 0.5% (≈ 250,000) sequences to form the validation set. The training set has sequences removed via the procedure described in the ESM2 publication.
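An illustrative sketch of the random 0.5% holdout described above; the helper and seed are hypothetical, and the actual partitioning procedure is defined in the ESM2 publication.

```python
import random

def holdout_split(sequences, frac=0.005, seed=0):
    """Randomly hold out ~0.5% of sequences as a validation set,
    mirroring the split described above (illustrative only)."""
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    n_val = int(len(sequences) * frac)
    val = [sequences[i] for i in indices[:n_val]]
    train = [sequences[i] for i in indices[n_val:]]
    return train, val
```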

Evaluation Dataset:

UniRef50
Data Collection Method by dataset:

  • Not Applicable

Labeling Method by dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): The evaluation set is the validation split described above: 0.5% (≈ 250,000) sequences randomly held out from the September 2021 version of UniRef50.

Inference:

Engine: Triton
Test Hardware:

  • Other (Not Listed)
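As a hedged sketch of querying a Triton deployment with the official tritonclient package: the server URL and the model/tensor names below are assumptions and must be taken from the actual deployed model repository configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumed endpoint and tensor names -- replace with the values from
# the deployed Triton model repository's config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

seqs = np.array([b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], dtype=np.object_)
inp = httpclient.InferInput("sequences", seqs.shape, "BYTES")
inp.set_data_from_numpy(seqs)

result = client.infer(model_name="esm2-650m", inputs=[inp])
embeddings = result.as_numpy("embeddings")  # float16 embedding array
```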

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
