Evo 2 is a biological foundation model that integrates information over long genomic sequences while retaining sensitivity to single-nucleotide changes. At 40 billion parameters, the model understands the genetic code for all domains of life and is the largest AI model for biology to date. Evo 2 was trained on a dataset of nearly 9 trillion nucleotides.
Here, we show the predicted structure of the protein encoded in the Evo 2-generated DNA sequence. Prodigal is used to predict the coding region, and ESMFold is used to predict the structure of the protein.
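As an illustration of that pipeline, below is a minimal sketch (not the exact code behind this demo) that runs the Prodigal CLI on a generated sequence and folds the first predicted protein with ESMFold via the open-source fair-esm package; the file names and the generated DNA are placeholders.

```python
# Minimal sketch: predict a coding region with Prodigal, then fold it with ESMFold.
# Assumes the `prodigal` binary and `fair-esm[esmfold]` are installed;
# "generated.fna" is a placeholder FASTA containing an Evo 2-generated sequence.
import subprocess

import esm
import torch

# Predict genes in the generated DNA; -a writes the translated proteins as FASTA.
subprocess.run(
    ["prodigal", "-i", "generated.fna", "-a", "proteins.faa", "-p", "meta"],
    check=True,
)

# Read the first predicted protein sequence (Prodigal appends '*' stop symbols).
with open("proteins.faa") as handle:
    records = handle.read().split(">")[1:]
header, *seq_lines = records[0].splitlines()
protein = "".join(seq_lines).rstrip("*")

# Fold the protein with ESMFold and write the predicted structure to a PDB file.
model = esm.pretrained.esmfold_v1().eval()
model = model.cuda() if torch.cuda.is_available() else model
with torch.no_grad():
    pdb_string = model.infer_pdb(protein)
with open("predicted_structure.pdb", "w") as out:
    out.write(pdb_string)
```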
This model is ready for commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. The NIM container is governed by the NVIDIA Software License Agreement and Product-Specific Terms for AI Products. Use of this model is governed by the NVIDIA Open Model License Agreement. ADDITIONAL INFORMATION: Apache 2.0 License.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Global
Evo is able to perform zero-shot function prediction for genes. Evo can also perform multi-element generation tasks, such as generating synthetic CRISPR-Cas molecular complexes. Evo 2 can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to at least 1 Mb in length. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
2/19/2025
The user supplies an uppercase DNA sequence which may optionally be prepended with a specifically formatted taxonomy prompt (also uppercase) for an organism. The format of this prompt is |D__[DOMAIN];P__[PHYLUM];C__[CLASS];O__[ORDER];F__[FAMILY];G__[GENUS];S__[SPECIES]|. For example, the token for Escherichia coli would be |D__BACTERIA;P__PSEUDOMONADOTA;C__GAMMAPROTEOBACTERIA;O__ENTEROBACTERALES;F__ENTEROBACTERIACEAE;G__ESCHERICHIA;S__ESCHERICHIA|. You can include a taxonomy token anywhere in your input DNA for your target organism if you want to bias the model's generated result in that direction.
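A minimal sketch of how such a prompt might be assembled client-side; the helper name and example sequence are illustrative and not part of the NIM API, while the taxonomy token follows the format shown above.

```python
# Minimal sketch: build an Evo 2 input that prepends a taxonomy prompt to a DNA sequence.
# The helper name and example values are illustrative, not part of the NIM API.
def build_prompt(dna: str, domain: str, phylum: str, klass: str, order: str,
                 family: str, genus: str, species: str) -> str:
    taxonomy = (
        f"|D__{domain};P__{phylum};C__{klass};O__{order};"
        f"F__{family};G__{genus};S__{species}|"
    )
    # Both the taxonomy token and the DNA sequence must be uppercase.
    return taxonomy.upper() + dna.upper()

prompt = build_prompt(
    "atgaaaggcattttg",  # placeholder DNA sequence
    domain="BACTERIA", phylum="PSEUDOMONADOTA", klass="GAMMAPROTEOBACTERIA",
    order="ENTEROBACTERALES", family="ENTEROBACTERIACEAE",
    genus="ESCHERICHIA", species="ESCHERICHIA",
)
print(prompt)
```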
evo2-40b:1.0.0
Multiple datasets were used for training, testing and evaluation (see details below).
Link: https://www.science.org/doi/10.1126/science.ado9336
Data Collection Method: Automatic/Sensors/Human
Labeling Method by dataset: N/A (no labels)
The previously published OpenGenome dataset was used in its entirety as part of the training data for this study. This included representative prokaryotic genomes available through GTDB release v214.1, and curated phage and plasmid sequences retrieved through IMG/VR and IMG/PR.
Link: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab776/6370255
Data Collection Method: Automatic/Sensors/Human
Labeling Method by dataset: Automatic
New prokaryotic reference genomes made available through the GTDB release 220.0 update were added to the training data for this study. New genomes were identified by selecting all species' reference genomes that had no previously published (release 214.1) genomes within their species cluster, resulting in 28,174 additional prokaryotic genomes.
Link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x
Data Collection Method: Automatic/Sensors/Human
Labeling Method by dataset: Automatic
All available eukaryotic reference genomes were downloaded from NCBI on 05/31/2024, excluding atypical genomes, metagenome-assembled genomes, and genomes from large multi-isolate projects. This resulted in 16,704 genomes comprising an estimated ~10.7 trillion nucleotides. Only contigs that were annotated as 'Primary Assembly', 'non-nuclear', or 'aGasCar1.hap1' (an aberrant annotation that applied only to GCA_027917425.1) were retained. Mash sketch was run on each individual genome with the flag "-s 10000", and the mash distance was calculated between all genomes as an estimate of their pairwise 1-ANI (average nucleotide identity). All genome pairs with a mash distance < 0.01 were joined with edges in a graph, and clusters were identified by finding connected components. One representative genome per cluster was chosen, prioritizing genomes with a higher assembly level and genomes with longer total sequence length. This clustering resulted in 15,148 candidate genomes. Genomes were further filtered by removing ambiguous nucleotides at the termini of each contig, by removing regions annotated as "centromere" in an available GFF file, and by removing contigs that were less than 10 kb in total length. Finally, contigs that were composed of more than 5% ambiguous nucleotides were removed. This final filtered set included 15,032 genomes and 6.98 trillion nucleotides.
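A minimal sketch of the clustering step described above, assuming the `mash dist` output has already been written to a tab-separated file (standard Mash output: reference, query, distance, p-value, shared hashes) and that assembly-level and length metadata are available; the file name and metadata mapping are placeholders.

```python
# Minimal sketch: cluster genomes by Mash distance < 0.01 via connected components.
# Assumes mash_dist.tsv holds standard `mash dist` output (ref, query, distance, p-value, hashes)
# and genome_info maps genome -> (assembly_level_rank, total_length); both are placeholders.
import networkx as nx

THRESHOLD = 0.01

graph = nx.Graph()
with open("mash_dist.tsv") as handle:
    for line in handle:
        ref, query, distance, _pvalue, _hashes = line.rstrip("\n").split("\t")
        graph.add_node(ref)
        graph.add_node(query)
        if ref != query and float(distance) < THRESHOLD:
            graph.add_edge(ref, query)

# Placeholder metadata: higher assembly-level rank and longer total length are preferred.
genome_info = {}  # genome -> (assembly_level_rank, total_length)

representatives = []
for cluster in nx.connected_components(graph):
    best = max(cluster, key=lambda g: genome_info.get(g, (0, 0)))
    representatives.append(best)
```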
Link: https://www.nature.com/articles/s41586-024-07552-4
Data Collection Method: Automatic/Sensors/Human
Labeling Method by dataset: Automatic
A previously described metagenomics dataset was further curated as part of the training data. This included 41,253 metagenomes and metagenome-assembled genomes from NCBI, JGI IMG, MGnify, MG-RAST, Tara Oceans samples, and Youngblut et al. animal gut metagenomes. All contigs were split at consecutive stretches of ambiguous nucleotides of length 5 bp or longer, the split contigs were filtered by a minimum sequence length of 1 kb, and only contigs with at least one open reading frame as predicted by Prodigal were kept. Contig-encoded proteins were previously clustered at 90% identity using MMseqs. To further remove redundant sequences, contigs were sorted by descending length, and each contig was retained only if at least 90% of its respective protein clusters were not already in the sequence collection (determined using a Bloom filter).
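A minimal sketch of the contig splitting and length filtering described above; the regex split on runs of ambiguous nucleotides (N) of length 5 bp or longer and the 1 kb cutoff mirror the description, while the Prodigal ORF check and protein-cluster deduplication are omitted here.

```python
# Minimal sketch: split contigs at runs of >= 5 ambiguous nucleotides and keep pieces >= 1 kb.
# The input dict of contigs is a placeholder; the Prodigal ORF check described above is omitted.
import re

MIN_LENGTH = 1_000  # 1 kb
AMBIGUOUS_RUN = re.compile(r"N{5,}")

def split_and_filter(contigs: dict[str, str]) -> dict[str, str]:
    kept = {}
    for name, sequence in contigs.items():
        for i, piece in enumerate(AMBIGUOUS_RUN.split(sequence.upper())):
            if len(piece) >= MIN_LENGTH:
                kept[f"{name}_part{i}"] = piece
    return kept
```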
Link: https://www.ncbi.nlm.nih.gov/datasets/organelle
Data Collection Method: Automatic/Sensors/Human
Labeling Method by dataset: Automatic
Eukaryotic organelle genomes: 33,457 organelle genomes were identified and downloaded using the "NCBI Organelle" web resource. Ambiguous nucleotides at the terminal ends of the organelle genome sequences were removed. Sequences that had over 25 ambiguous nucleotides were removed. This resulted in 32,241 organelle genomes that were used for training, including 1,613 mitochondria, 12,856 chloroplasts, 1,751 plastids, 18 apicoplasts, 1 cyanelle, and 1 kinetoplast.
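A minimal sketch of the organelle-genome cleanup described above, assuming ambiguous bases are represented as N; the function and variable names are illustrative.

```python
# Minimal sketch: trim ambiguous nucleotides (N) from sequence termini and drop
# sequences with more than 25 ambiguous nucleotides overall; names are illustrative.
def clean_organelle_genomes(genomes: dict[str, str], max_ambiguous: int = 25) -> dict[str, str]:
    cleaned = {}
    for name, sequence in genomes.items():
        trimmed = sequence.upper().strip("N")
        if trimmed.count("N") <= max_ambiguous:
            cleaned[name] = trimmed
    return cleaned
```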
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
This workflow shows how generative AI can produce DNA sequences that can be translated into proteins for bioengineering.