MiniMaxAI/MiniMax-M3 is MiniMaxAI's first vision-language model: a 428B-parameter Mixture-of-Experts (MoE) model with 22B active parameters (A22B) that combines long-context reasoning, agentic workflows, and creative capabilities in a single platform. The multimodal MoE model evolved from MiniMax M2 along three axes: width, attention, and visual grounding.
MiniMax-M3 enables advanced use cases such as long-form video understanding, extended coding tasks (8+ hours), and high-quality design workflows.
To set up your environment to fine-tune this model with NeMo AutoModel, follow the installation guide.
Use image/video instruction data that matches the target agent workflow. Good candidates include:
For a full walkthrough of how multimodal datasets are preprocessed and integrated into NeMo AutoModel, including chat-template conversion and collate functions, see the Multi-Modal Dataset Guide.
NeMo AutoModel supports several ways to launch training: the AutoModel CLI with Slurm, interactive sessions, torchrun, and more. For full details on Slurm batch jobs, multi-node configuration, and environment variables, see the Run on a Cluster guide.
Before running, make sure your cluster environment is configured following the Run on a Cluster guide.
export TRANSFORMERS_OFFLINE=1
export HF_HOME=/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key
srun --output=output.out \
     --error=output.err \
     --container-image /path/to/automodel26.04.image.sqsh \
     --no-container-mount-home bash -c "
  CUDA_DEVICE_MAX_CONNECTIONS=1 automodel \
    /path/to/minimax_m3_vl.yaml \
    --nproc-per-node=8 \
    --model.pretrained_model_name_or_path=/path/to/MiniMax-M3 \
    --processor.pretrained_model_name_or_path=/path/to/MiniMax-M3"
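Because the job runs with the Hugging Face libraries in offline mode, a misconfigured variable usually only shows up after the allocation has started. The sketch below is a hypothetical preflight helper, not part of NeMo AutoModel; the function name `preflight_check` is an assumption made for illustration. It only inspects the variables exported above on the submitting host, so it cannot confirm that the cache is actually visible from every compute node.

```shell
# Hypothetical helper (not provided by NeMo AutoModel): validate the
# environment variables exported above before calling srun.
preflight_check() {
    rc=0

    # HF_HOME must be set and must be an existing directory on this host.
    if [ -z "${HF_HOME:-}" ]; then
        echo "error: HF_HOME is not set" >&2
        rc=1
    elif [ ! -d "$HF_HOME" ]; then
        echo "error: HF_HOME is not a directory: $HF_HOME" >&2
        rc=1
    fi

    # Both offline switches are expected to be exactly "1".
    if [ "${TRANSFORMERS_OFFLINE:-}" != "1" ]; then
        echo "error: TRANSFORMERS_OFFLINE should be 1" >&2
        rc=1
    fi
    if [ "${HF_DATASETS_OFFLINE:-}" != "1" ]; then
        echo "error: HF_DATASETS_OFFLINE should be 1" >&2
        rc=1
    fi

    # Assumption: the W&B key only matters when logging to W&B is enabled,
    # so a missing key is reported as a warning, not a failure.
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "warning: WANDB_API_KEY is not set" >&2
    fi

    return "$rc"
}
```

Calling `preflight_check || exit 1` just before the `srun` line would stop the submission early when one of the variables is missing or wrong.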
The full fine-tuning (SFT) and LoRA recipes can be found at:
- examples/vlm_finetune/minimax_m3/minimax_m3_vl_sft_ep32pp4.yaml
- examples/vlm_finetune/minimax_m3/minimax_m3_vl_lora_pp4ep8_8node.yaml
Before you start:
- Make sure HF_HOME points to a shared cache visible from all nodes.
- Set HF_DATASETS_OFFLINE=1.
- Enable the wandb section in the recipe to record loss, throughput, and memory curves.
The SFT and LoRA fine-tuning loss curves are shown below.
Figure: SFT loss curve.
Figure: LoRA loss curve.