Multi-modal model for a wide range of tasks, including image understanding and language generation.
Fuyu-8B is a multi-modal transformer introduced by Adept AI. It can perform a wide range of tasks, including image understanding, text generation, and code generation. Architecturally, Fuyu is a vanilla decoder-only transformer with no image encoder. Instead, image patches are linearly projected directly into the first layer of the transformer, bypassing the embedding lookup. The transformer decoder is thus treated like an image transformer, albeit one with causal attention and no pooling.
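The patch projection described above can be illustrated with a small, dependency-free sketch. It is not Adept's implementation: the helper names (`patchify`, `linear_project`), the toy sizes, and the random weights are all assumptions chosen for readability. It only shows the idea that image patches reach the decoder through a linear layer while text tokens go through an embedding lookup, and that both end up in one input sequence.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "pad the image first"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches

def linear_project(vectors, weight, bias):
    """Apply y = x W + b to each vector; weight has shape in_dim x out_dim."""
    columns = list(zip(*weight))  # out_dim columns, each of length in_dim
    return [
        [sum(x * w for x, w in zip(vec, col)) + b for col, b in zip(columns, bias)]
        for vec in vectors
    ]

random.seed(0)
# Toy sizes for illustration only; these are NOT Fuyu-8B's real dimensions.
PATCH, CHANNELS, HIDDEN, VOCAB = 2, 3, 8, 16

# A 4 x 6 RGB image with random pixel values.
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(6)] for _ in range(4)]

# Image path: flatten each patch and project it straight to the hidden size.
patches = patchify(image, PATCH)                      # 6 patches of 12 values
in_dim = PATCH * PATCH * CHANNELS
weight = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(in_dim)]
bias = [0.0] * HIDDEN
image_embeddings = linear_project(patches, weight, bias)

# Text path: ordinary embedding-table lookup by token id (hypothetical ids).
embedding_table = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(VOCAB)]
text_token_ids = [3, 7, 1]
text_embeddings = [embedding_table[token_id] for token_id in text_token_ids]

# Both kinds of vectors form a single sequence for the decoder.
decoder_input = image_embeddings + text_embeddings
print(len(decoder_input), len(decoder_input[0]))
```

Because the image never passes through a separate encoder, the only image-specific parameters in this sketch are `weight` and `bias`; everything after `decoder_input` would be a standard causal decoder stack.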
By accessing this model, you agree to the Fuyu-8B terms and conditions of the CC BY-NC license.
This model is not owned or developed by NVIDIA. It was developed and built to a third party’s requirements for this application and use case; see Fuyu's Hugging Face Model Card.
Architecture Type: Transformer
Network Architecture: Fuyu-8B
Model Version: N/A
Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: None
Output Format: Text
Output Parameters: None
Supported Hardware Platform(s): Hopper, Ampere/Turing
Supported Operating System(s): Linux
Engine: Triton
Test Hardware: Other