Model Overview

Description:

Qwen-Image-Edit-NVPCB-OVSL2SL transforms synthetic solder-light printed-circuit-board (PCB) component crops — produced in NVIDIA Omniverse — into the photographic solder-light style captured at NVIDIA PCB inspection stations, so that downstream PCB inspection models trained on real solder-light photographs can be augmented with Omniverse-generated synthetic data. The release is an NVIDIA fine-tuned version of the Qwen-Image-Edit image-to-image diffusion pipeline (diffusion transformer, Qwen2.5-VL text encoder, Qwen-Image VAE, tokenizer, image processor, and scheduler configuration), specialized for the Omniverse → NVPCB solder-light style transfer. Qwen-Image-Edit-NVPCB-OVSL2SL v1.0.0 was developed by NVIDIA as part of the NVPCB inspection-data harmonization pipeline. This model is ready for commercial use.

License/Terms of Use:

Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement. Additional Information: Apache License, Version 2.0.

Deployment Geography:

Global

Use Case:

This model is intended for NVIDIA engineers and researchers building PCB inspection / automated optical inspection (AOI) systems whose training or evaluation data is to be augmented with Omniverse-generated synthetic data. The model converts Omniverse-rendered solder-light PCB component crops into the photographic solder-light style produced by NVIDIA's physical inspection stations, closing the sim-to-real style gap so that inspection models trained on real photographs can be evaluated on, or augmented with, synthetic Omniverse data. This model is not intended to be the primary inspection decision-maker; it is a sim-to-real data-translation step. Inspection pass/fail decisions must come from a downstream inspection model with human review.

Release Date:

GitHub 06/02/2026 via https://github.com/NVIDIA/paidf-augmentation

Reference(s):

Qwen-Image-Edit base model — https://huggingface.co/Qwen/Qwen-Image-Edit
Qwen2.5-VL text encoder — https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
LoRA: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," 2021 — https://arxiv.org/abs/2106.09685
Diffusers library — https://github.com/huggingface/diffusers

Model Architecture:

Architecture Type: Transformer (diffusion transformer with cross-modal conditioning)

Network Architecture: This release is a self-contained HuggingFace diffusers pipeline directory. All of the following components are redistributed as part of the release artifact:

transformer/ (NVIDIA fine-tuned, redistributed): the upstream Qwen-Image-Edit flow-matching image-to-image diffusion transformer, fine-tuned by NVIDIA on the attention and feed-forward projections of QwenImageTransformerBlock (to_q, to_k, to_v, to_out, the cross-attention add_{q,k,v}_proj / to_add_out, and img_mlp / txt_mlp).
text_encoder/ (redistributed unmodified): Qwen2.5-VL; attends over both the input image and the instruction prompt.
vae/ (redistributed unmodified): Qwen-Image VAE.
tokenizer/, processor/, scheduler/, and model_index.json (redistributed unmodified): Qwen-Image-Edit tokenizer, image processor, scheduler configuration, and pipeline entry point.
Fine-tuning methodology: NVIDIA fine-tuned the transformer using LoRA (rank 16, ~1.7 × 10^8 parameters introduced during training); the resulting weights were then merged back into the transformer for release, so the released artifact is a standalone diffusers pipeline that requires no separate adapter file at inference time.
The released pipeline directory can be loaded directly with diffusers.QwenImageEditPipeline.from_pretrained(...).
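
The merge step described under the fine-tuning methodology can be illustrated with a minimal sketch. It assumes the standard LoRA formulation from Hu et al., W' = W + (alpha / r) * B * A. The alpha value, the toy matrix sizes, and the helper names `matmul` and `merge_lora` are hypothetical and are not taken from the NVIDIA training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is out x in, lora_b is out x rank, lora_a is rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight with a rank-1 adapter (illustrative values only).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[3.0, 4.0]]
merged = merge_lora(base, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)  # [[7.0, 8.0], [12.0, 17.0]]
```

Once the low-rank update is folded into the base weight in this way, the adapter matrices are no longer needed, which is why the released artifact loads without a separate adapter file.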

This model was developed based on Qwen-Image-Edit.

Number of model parameters: Approximately 2.0 × 10^10 (20B) total parameters in the released checkpoint. Of these, ~1.7 × 10^8 (170M) parameters in the diffusion transformer were updated by NVIDIA fine-tuning; the remaining parameters come from the upstream Qwen-Image-Edit pipeline (transformer + Qwen2.5-VL text encoder + Qwen-Image VAE) and are redistributed unmodified.

Cumulative Compute: ~0.6 GPU-hour total on a single NVIDIA H100 SXM (~0.5 GPU-hour for the 1500-step fine-tuning run + ~5 GPU-minutes for the latent/embedding cache build).

Estimated Energy and Emissions for Model Training: ~0.4 kWh and ~0.15 kgCO2e total. Methodology: GPU energy = 0.6 GPU-hour × 0.7 kW (H100 SXM rated TDP) × 0.6 average utilization (typical for LoRA fine-tuning, which is not consistently GPU-bound) ≈ 0.25 kWh; multiplied by an assumed datacenter PUE of 1.5 to account for cooling and facility overhead ≈ 0.38 kWh; multiplied by 0.4 kgCO2e/kWh (U.S. national-grid average) ≈ 0.15 kgCO2e. Estimates use rated TDP rather than measured wall-power and are therefore conservative upper bounds; actual emissions depend on the specific datacenter's PUE and regional grid carbon intensity at training time.
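
The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in the methodology (0.6 GPU-hour, 0.7 kW TDP, 0.6 utilization, PUE 1.5, 0.4 kgCO2e/kWh); the function name `training_footprint` is hypothetical. Carrying the unrounded intermediate values through gives roughly 0.15 kgCO2e.

```python
def training_footprint(gpu_hours, tdp_kw, utilization, pue, kg_co2e_per_kwh):
    """Return (GPU energy in kWh, facility energy in kWh, emissions in kgCO2e)."""
    gpu_kwh = gpu_hours * tdp_kw * utilization   # energy drawn by the GPU itself
    facility_kwh = gpu_kwh * pue                 # add cooling and facility overhead
    kg_co2e = facility_kwh * kg_co2e_per_kwh     # convert energy to emissions
    return gpu_kwh, facility_kwh, kg_co2e


gpu_kwh, facility_kwh, kg_co2e = training_footprint(
    gpu_hours=0.6, tdp_kw=0.7, utilization=0.6, pue=1.5, kg_co2e_per_kwh=0.4
)
print(f"{gpu_kwh:.2f} kWh GPU, {facility_kwh:.2f} kWh facility, {kg_co2e:.2f} kgCO2e")
# 0.25 kWh GPU, 0.38 kWh facility, 0.15 kgCO2e
```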

Input(s):

Input Type(s): Image, Text

Input Format(s):

Image: PNG / JPG, Red, Green, Blue (RGB)
Text: UTF-8 instruction prompt (English)

Input Parameters:

Image: Two-Dimensional (2D)
Text: One-Dimensional (1D)

Other Properties Related to Input: Fine-tuned at target area 262,144 pixels (~512×512); other resolutions are accepted by the underlying diffusers pipeline but NVIDIA fine-tuning was not performed at them, so style fidelity may degrade. Input must be a single Omniverse-rendered PCB component crop on an approximately black background, similar to the synthetic solder-light style the model was fine-tuned on. The accompanying instruction prompt is the fixed English sentence used during training, recovered automatically from the cached training metadata (cache/_meta.pt) at inference time; the trained prompt is:

"Render this PCB component crop in the style of an NVPCB raked-solder-light photograph: dark reddish board with bright orange-red and blue specular highlights on the solder pads, photorealistic textures."
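
For crops that are not square, a matching working resolution can be derived from the fine-tuning target area. This is a minimal sketch, assuming the aspect ratio is preserved and each side is snapped to a multiple of 32 pixels; the multiple and the helper name `target_size` are assumptions for illustration and are not stated in this card.

```python
import math

TARGET_AREA = 262_144  # pixels; the fine-tuning target area (~512x512)


def target_size(width, height, area=TARGET_AREA, multiple=32):
    """Return (width, height) near `area` pixels with the input aspect ratio."""
    ratio = width / height
    new_w = math.sqrt(area * ratio)
    new_h = new_w / ratio

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(new_w), snap(new_h)


print(target_size(512, 512))   # (512, 512)
print(target_size(1024, 512))  # (736, 352)
```

Inputs far from this target area fall outside the resolution NVIDIA fine-tuned at, so style fidelity may degrade as noted above.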

Output(s):

Output Type(s): Image

Output Format(s): PNG; Red, Green, Blue (RGB)

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: Output resolution matches the input target area used during caching (~512×512). The output preserves component identity and board layout while transferring the rendering style from Omniverse-synthetic solder-light to NVPCB photographic solder-light.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

PyTorch (via HuggingFace diffusers QwenImageEditPipeline)

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere (A100)
NVIDIA Hopper (H100)
NVIDIA Lovelace (RTX 40-series)

Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

v1.0.0 — qwen-image-edit-nvpcb-OVSL2SL (NVIDIA fine-tune of the upstream Qwen-Image-Edit pipeline, 1500 training steps). The released artifact is a self-contained HuggingFace diffusers pipeline directory containing transformer/, text_encoder/, vae/, tokenizer/, processor/, scheduler/, and model_index.json.

Training, Testing, and Evaluation Datasets:

Dataset Overview

Total Number of Datasets: 1 (paired Omniverse-synthetic / NVPCB-photographic solder-light component crops; NVIDIA-internal)
Total Size: 228 paired component crops (228 Omniverse-rendered synthetic solder-light inputs + 228 NVPCB photographic solder-light targets), ~512×512 each
Dataset partition: Training ~95%, Validation ~5% (held-out by filename stem; no separate test split — evaluation is qualitative side-by-side plus CLIP-style embedding distance against held-out targets)
Time period for data collection: H1 2026 (January – June 2026)
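
One way to hold out pairs by filename stem is a deterministic hash of the stem, so that the synthetic input and the photographic target of a pair always land in the same split. This is a sketch only: the hashing scheme, the file paths, and the function name `split_for` are hypothetical, and this card does not state how the actual split was produced.

```python
import hashlib
from pathlib import Path


def split_for(path, val_fraction=0.05):
    """Assign a file to "train" or "val" based only on its filename stem."""
    stem = Path(path).stem
    digest = hashlib.sha256(stem.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable value in [0, 9999]
    return "val" if bucket < val_fraction * 10_000 else "train"


# Both sides of a pair share a stem, so they receive the same assignment.
print(split_for("synthetic/u12_0001.png") == split_for("photo/u12_0001.jpg"))  # True
```

A hash-based split yields the requested fraction only approximately on a set as small as 228 pairs; an explicit list of held-out stems gives exact control.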

Training Dataset:

Data Modality:

Image
Text (fixed instruction prompt)

Image Training Data Size:

[Less than a Million Images]

Text Training Data Size:

[Less than a Billion Tokens]

Data Collection Method by dataset:

Hybrid: Synthetic, Human (Omniverse synthetic rendering for the input side + manually-collected photography at NVIDIA inspection stations for the target side)

Labeling Method by dataset:

Not Applicable — paired-image translation is supervised by the target image itself; no per-image label is required beyond the fixed instruction prompt.

Properties: 228 paired component crops with Omniverse-rendered synthetic solder-light as input and NVPCB photographic solder-light as target. Modality: image + a fixed English instruction prompt. Content nature: synthetic renders and photographs of inanimate PCBs (NVIDIA-internal). No personal data, no copyright-protected web content, no machine-generated text/speech. No human subjects are depicted.

Testing Dataset:

Data Collection Method by dataset:

[Not Applicable]

Labeling Method by dataset:

[Not Applicable]

Properties: Not Applicable. The model is qualitatively evaluated on the held-out validation split (~5%) via side-by-side HTML reports rather than against a separate test split.

Evaluation Dataset:

Data Collection Method by dataset:

Hybrid: Synthetic, Human (Omniverse synthetic rendering + manually-collected photography)

Labeling Method by dataset:

Not Applicable

Properties: Held-out validation pairs (~5%) from the same paired set, used to produce qualitative side-by-side comparison HTML reports and to compute CLIP-style embedding distance between generated and ground-truth target images.
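
The embedding-distance metric can be sketched as follows. This card does not name the embedding model or the exact distance function, so the example assumes cosine distance over embedding vectors that have already been computed; `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors standing in for generated and ground-truth image embeddings.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Lower values mean the generated image sits closer to its held-out photographic target in embedding space.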

Inference:

Acceleration Engine: PyTorch (native, bf16 weights) via diffusers.QwenImageEditPipeline.from_pretrained(...), loaded directly from the released pipeline directory.
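
A minimal inference sketch, assuming a diffusers release that includes `QwenImageEditPipeline`, a CUDA-capable GPU, and Pillow for image I/O. The step count, seed, file paths, and the function name `translate` are illustrative assumptions. The prompt is hard-coded here from the trained prompt quoted in this card rather than recovered from cache/_meta.pt.

```python
PROMPT = (
    "Render this PCB component crop in the style of an NVPCB raked-solder-light "
    "photograph: dark reddish board with bright orange-red and blue specular "
    "highlights on the solder pads, photorealistic textures."
)


def translate(pipeline_dir, in_path, out_path, steps=50, seed=0):
    """Run one Omniverse crop through the released pipeline and save the result."""
    # Heavy imports are kept local so the module can be imported without a GPU stack.
    import torch
    from diffusers import QwenImageEditPipeline
    from PIL import Image

    # Load the self-contained pipeline directory in bf16 and move it to the GPU.
    pipe = QwenImageEditPipeline.from_pretrained(pipeline_dir, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    image = Image.open(in_path).convert("RGB")
    result = pipe(
        image=image,
        prompt=PROMPT,
        num_inference_steps=steps,           # assumed value, not specified in this card
        generator=torch.manual_seed(seed),   # fixed seed for repeatable output
    )
    result.images[0].save(out_path)
    return out_path
```

A call might look like `translate("path/to/pipeline_dir", "crop.png", "crop_sl.png")`, where all three paths are placeholders.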

Test Hardware:

NVIDIA H100
NVIDIA A100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Privacy, and Safety & Security tabs of ModelCard++.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Get Help

Getting started with the NIM

Deploying and integrating the NIM is straightforward thanks to our industry standard APIs. Visit the Visual Generative AI NIM page for release documentation, deployment guides and more.

Enterprise Support

Get access to knowledge base articles and support cases or submit a ticket.
