Generates embeddings of proteins from their amino acid sequences.
ESM2 is a state-of-the-art language model for protein sequences. It outputs numerical representations of input protein sequences that are suitable for downstream tasks; in particular, its output is used for protein folding in the ESMFold model. ESM2 comes in several sizes: 650M, 3B, and 15B parameters. ESM2 was developed and trained by Meta. The larger models tend to produce more accurate results on downstream tasks, but they also have longer runtimes.
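As a minimal sketch (not part of this card's specification), the snippet below shows one common way to obtain a per-sequence embedding from the 650M checkpoint using Meta's fair-esm package; the choice of the final layer and the mean-pooling step are illustrative assumptions.

```python
import torch
import esm  # pip install fair-esm

# Load the 650M-parameter ESM2 checkpoint and its tokenization alphabet.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Input is a list of (label, amino acid sequence) pairs.
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    # Request hidden states from the final (33rd) layer.
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]  # shape: (batch, seq_len + 2, 1280)

# Mean-pool over residues, skipping the BOS/EOS positions,
# to reduce the per-residue output to one 1D embedding per sequence.
embedding = per_residue[0, 1 : len(data[0][1]) + 1].mean(dim=0)
```

Mean-pooling is only one way to collapse the per-residue representations; other reductions are also used in practice depending on the downstream task.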
MIT License
Copyright (c) Meta Platforms, Inc. and affiliates.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This model is available for commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Model Card.
@article{rives2021esm,
  title    = "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences",
  author   = "Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and Fergus, Rob",
  journal  = "Proc. Natl. Acad. Sci. U. S. A.",
  volume   = 118,
  number   = 15,
  month    = apr,
  year     = 2021,
  keywords = "deep learning; generative biology; protein language model; representation learning; synthetic biology",
  language = "en",
  doi      = {10.1073/pnas.2016239118}
}

For the self-attention contact prediction:

@article{rao2020transformer,
  author  = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title   = {Transformer protein language models are unsupervised structure learners},
  year    = {2020},
  doi     = {10.1101/2020.12.15.422761},
  url     = {https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal = {bioRxiv}
}
Architecture Type: BERT-style transformer encoder with rotary position embeddings
Network Architecture: ESM2-650M
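For intuition, here is a minimal sketch of the rotary-position-embedding idea named above; the actual ESM2 implementation applies the rotation to query/key vectors inside attention and may pair dimensions differently.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, dim); dim must be even.

    Positions are encoded by rotating pairs of feature dimensions by a
    position-dependent angle, so relative offsets are preserved under dot products.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-dimension rotation frequencies, decaying geometrically.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    # Angle for each (position, frequency) pair.
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1, x2) dimension pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```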
Input Type(s): Protein Sequence
Input Format(s): String
Input Parameters: 1D
Other Properties Related to Input: Protein sequence matching the regular expression ^[ARNDCQEGHILKMFPSTWYVXBOU]*$, up to 1024 characters
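A minimal sketch of client-side input validation against the constraints above (the function name and structure are illustrative, not part of any published API):

```python
import re

# Allowed residue codes and maximum length, as documented above.
VALID_SEQ = re.compile(r"^[ARNDCQEGHILKMFPSTWYVXBOU]*$")
MAX_LEN = 1024

def is_valid_input(seq: str) -> bool:
    """Return True if seq satisfies the documented input constraints."""
    return len(seq) <= MAX_LEN and bool(VALID_SEQ.match(seq))
```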
Output Type(s): Embeddings
Output Format: Float16 array
Output Parameters: 1D
Other Properties Related to Output: NA
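Since each output is a 1D float16 embedding, downstream comparison is straightforward; as a hedged sketch, two proteins can be compared by cosine similarity of their embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cast float16 embeddings up to float32 before accumulating,
    # which avoids precision loss in the dot product and norms.
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    return float(a32 @ b32 / (np.linalg.norm(a32) * np.linalg.norm(b32)))
```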
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
ESM2
Link:
UniRef50
** Data Collection Method by dataset
** Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)): UniRef50, September 2021 version, is used for the training of ESM models. The training dataset was partitioned by randomly selecting 0.5% (≈ 250,000) of the sequences to form the validation set. The training set then has sequences removed, via the procedure described in the ESM2 publication, to prevent overlap with the validation set.
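A minimal sketch of the random 0.5% hold-out described above (the exact selection and sequence-removal procedure is defined in the ESM2 publication, not reproduced here):

```python
import random

def split_uniref50(sequence_ids: list[str], val_fraction: float = 0.005, seed: int = 0):
    """Randomly hold out ~0.5% of sequences as a validation set."""
    rng = random.Random(seed)
    ids = sequence_ids[:]
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train_ids, val_ids)
```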
Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.