NVIDIA

Copyright © 2026 NVIDIA Corporation

5 results for

    Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.
    Skill
    Developer
    209
    11d

    Select, validate, patch, and deploy existing NVIDIA Dynamo Kubernetes recipes. Use for model/backend/GPU/deployment-mode recipe bring-up; use router-starter for router-only mode work and troubleshoot for broken deployments.
    Skill
    Developer
    219
    11d

    Start or patch Dynamo router modes and run router endpoint smoke checks. Use for round-robin, KV-aware, least-loaded, or device-aware routing setup; use recipe-runner for recipe deployment and troubleshoot for failure diagnosis.
    Skill
    Developer
    221
    11d

    How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions,
    Skill
    Developer
    205
    9d

    Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
    Skill
    Developer
    185
    11d