NVIDIA
Copyright © 2025 NVIDIA Corporation

adept

fuyu-8b

PREVIEW

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image understanding, language generation, multimodal, computer vision, cv, image, image-to-text, video, vlm
Accelerated by DGX Cloud

Model Overview

Description:

Fuyu-8B is a multi-modal transformer introduced by Adept AI. It can perform a wide range of tasks, including image understanding, text generation, and code generation. Architecturally, Fuyu is a vanilla decoder-only transformer with no separate image encoder. Instead, image patches are linearly projected directly into the first layer of the transformer, bypassing the embedding lookup that text tokens go through. The decoder is thus treated like an image transformer, except that it uses causal attention and no pooling.
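The input path described above can be illustrated with a minimal NumPy sketch. It is not Adept's implementation: the function names (`patchify`, `embed_inputs`, `causal_mask`), the toy patch size, hidden size, and vocabulary size, and the random weights are all assumptions chosen for illustration. It shows only the idea that image patches receive a linear projection while text tokens use an embedding lookup, and that both end up in one causally masked sequence.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, 3) RGB image into flattened, non-overlapping patches in raster order."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, c) -> (rows, cols, patch, patch, c)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def embed_inputs(image, token_ids, proj_w, proj_b, embed_table, patch):
    """Build one input sequence: projected image patches followed by text embeddings."""
    patches = patchify(image.astype(np.float32), patch)
    image_embeds = patches @ proj_w + proj_b  # linear projection, no embedding lookup
    text_embeds = embed_table[token_ids]      # ordinary embedding lookup for text
    return np.concatenate([image_embeds, text_embeds], axis=0)


def causal_mask(n: int) -> np.ndarray:
    """Boolean mask where position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


# Toy sizes for illustration only; these are NOT Fuyu-8B's real dimensions.
PATCH, D_MODEL, VOCAB = 4, 16, 100
rng = np.random.default_rng(0)

image = rng.random((8, 12, 3))       # hypothetical 8x12 RGB image
token_ids = np.array([5, 17, 42])    # hypothetical text token ids
proj_w = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))
proj_b = np.zeros(D_MODEL)
embed_table = rng.normal(size=(VOCAB, D_MODEL))

seq = embed_inputs(image, token_ids, proj_w, proj_b, embed_table, PATCH)
mask = causal_mask(len(seq))
print(seq.shape)   # (9, 16): 6 image patches + 3 text tokens
print(mask.shape)  # (9, 9)
```

Because every patch becomes its own sequence position and nothing is pooled, the sequence length grows with image size, and the same causal mask covers image and text positions alike.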

Terms of use

By accessing this model, you agree to the Fuyu-8B terms and conditions under the CC BY-NC license.

Third-Party Community Consideration:

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see Fuyu's Hugging Face Model Card.

Reference(s):

  • Fuyu-8B Blog Post by adept.ai
  • Fuyu-8B Model Card on Hugging Face

Model Architecture:

Architecture Type: Transformer
Network Architecture: Fuyu-8B
Model Version: N/A

Input:

Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: None

Output:

Output Format: Text
Output Parameters: None

Software Integration:

Supported Hardware Platform(s): Hopper, Ampere/Turing
Supported Operating System(s): Linux

Inference:

Engine: Triton
Test Hardware: Other