Ready-to-run application examples for your NemoClaw sandbox: policy, prompt, and personalization for each workflow
Doc & Deck Red-Team: before you send or present, the agent scans for inconsistent numbers across pages, unsourced claims, missing data, accessibility issues, and contradictions with prior versions, then returns a fix list with proposed edits.
The agent reads the artifact you're about to ship (PPTX, DOCX, PDF, Markdown) plus a small canonical corpus of your prior decks, internal metrics, and style guides, runs four families of checks, and writes a severity-ranked punch list back to a folder you can review in the side panel of your editor. Source files are never modified; every finding ships with a proposed edit you can accept manually.
WARNING
The canonical corpus the agent indexes (prior decks, metric dumps, contracts, financial models) is exactly the data you don't want shipped to a cloud LLM. Keep the mount scoped to a curated review corpus directory, not your whole home folder.
This recipe optionally layers on top of the NemoClaw Policy Setup tab's working Telegram channel (channel plugin + api.telegram.org egress) so the agent can DM you when a review is ready. Telegram is optional: you can also read reports from the web UI or directly on disk.
On the host, set up four things the agent will see inside the sandbox:
- queue/: drop artifacts here for review (.pptx, .docx, .pdf, .md).
- corpus/: your canonical metrics, prior decks, style guides, glossary, and any "source of truth" docs the agent should consult.
- profile.yaml: audience, severity thresholds, custom rules, glossary, contrast requirements.
- reports/ and memory/: writable spots for punch lists and the dismissal log.
mkdir -p ~/nemoclaw-redteam/{queue,corpus,reports,memory}
Seed the corpus with whatever the agent should treat as ground truth, for example:
cp ~/decks/dgx-spark-roadmap.pptx ~/nemoclaw-redteam/corpus/
cp ~/notes/canonical-metrics.md ~/nemoclaw-redteam/corpus/
cp ~/style/brand-guide.md ~/nemoclaw-redteam/corpus/
Create a starter ~/nemoclaw-redteam/profile.yaml you can edit later:
audience: partner # internal | partner | public
severity_threshold: HIGH # CRITICAL only, HIGH+, MEDIUM+, all
wcag_level: AA # A | AA | AAA
font_size_min_pt: 10
reading_grade_max: 11 # roughly 11th-grade Flesch-Kincaid
canonical_metrics:
  - {name: "live playbooks count", source: "corpus/canonical-metrics.md"}
  - {name: "supported categories", source: "corpus/canonical-metrics.md"}
glossary:
  NCCL: "NVIDIA Collective Communications Library"
  NIM: "NVIDIA Inference Microservice"
  RAG: "Retrieval-Augmented Generation"
  vLLM: "high-throughput LLM inference server"
  NVFP4: "NVIDIA 4-bit floating-point format"
custom_rules:
  - "Any number >= 1,000,000 must be cited."
  - "Product name 'NemoClaw' uses capital N and C; reject 'Nemoclaw'."
  - "First-use acronyms must be expanded or appear in glossary."
ignore_paths:
  - "queue/.archive/**"
  - "**/~$*"
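The reading_grade_max setting refers to a Flesch-Kincaid grade level. As a rough illustration of what that number measures, the sketch below computes the standard Flesch-Kincaid grade formula in Python. The syllable counter is a crude vowel-group heuristic and the function names are hypothetical; how the agent actually scores text is not specified in this recipe.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(f"grade for sample text: {fk_grade(sample):.2f}")
```

Short sentences with short words score low; long sentences packed with multi-syllable terms score well above 11, which is the kind of section a reading_grade_max of 11 would flag.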
Filesystem policy is locked at sandbox creation, so destroy the current sandbox first:
openshell sandbox delete $SANDBOX_NAME
In the filesystem_policies section of the onboard policy editor, add three read-only mounts (queue, corpus, profile) and two read-write mounts (reports, memory). The split keeps your source artifacts and ground-truth corpus untouchable even if the agent goes off-script.
filesystem_policies:
  read_only:
    - host_path: /home/<your-user>/nemoclaw-redteam/queue
      sandbox_path: /workspace/redteam/queue
    - host_path: /home/<your-user>/nemoclaw-redteam/corpus
      sandbox_path: /workspace/redteam/corpus
    - host_path: /home/<your-user>/nemoclaw-redteam/profile.yaml
      sandbox_path: /workspace/redteam/profile.yaml
  read_write:
    - host_path: /home/<your-user>/nemoclaw-redteam/reports
      sandbox_path: /workspace/redteam/reports
    - host_path: /home/<your-user>/nemoclaw-redteam/memory
      sandbox_path: /workspace/redteam/memory
Re-run onboard:
nemoclaw onboard
Confirm the mounts and that the sandbox has no outbound network (URL verification is opt-in, not default):
nemoclaw $SANDBOX_NAME connect
ls /workspace/redteam/queue
ls /workspace/redteam/corpus
echo "test write" > /workspace/redteam/reports/.write-check && rm /workspace/redteam/reports/.write-check
echo "test write" > /workspace/redteam/memory/.write-check && rm /workspace/redteam/memory/.write-check
curl -sS --max-time 5 https://example.com # expect timeout / blocked
exit
Expected: both read-only mounts list the files you dropped in, both write checks succeed, and example.com is refused. If any write check fails with "Read-only file system", that mount was placed under read_only by mistake; fix the policy and recreate the sandbox.
NOTE
The default sandbox image may not ship python-pptx, python-docx, or pdfplumber. If you want richer artifact parsing than plain-text extraction, install them inside the sandbox once after creation:
nemoclaw $SANDBOX_NAME connect
pip install --user python-pptx python-docx pdfplumber markdown-it-py wcag-contrast-ratio
exit
The agent will use whatever is available and fall back to plain-text extraction (via unzip + xmllint for OOXML, pdftotext for PDF) when a parser is missing.
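To make the plain-text fallback concrete: a .pptx file is a ZIP archive whose slide parts hold text in DrawingML `<a:t>` elements, so text can be pulled out with the Python standard library alone. The sketch below is an illustration under that assumption, not the agent's actual extractor; the function names and the demo archive are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace used by text runs (<a:t>) inside PPTX slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_slide_text(pptx_file) -> dict:
    """Map slide number -> list of text runs, without python-pptx."""
    slides = {}
    with zipfile.ZipFile(pptx_file) as zf:
        for name in zf.namelist():
            match = SLIDE_RE.match(name)
            if not match:
                continue
            root = ET.fromstring(zf.read(name))
            runs = [t.text for t in root.iter(f"{A_NS}t") if t.text]
            slides[int(match.group(1))] = runs
    # Sort numerically so slide 10 comes after slide 2.
    return dict(sorted(slides.items()))


def demo_archive() -> io.BytesIO:
    """Build a tiny in-memory archive shaped like a PPTX (demo data only)."""
    xml = (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        "<a:t>47 Live Playbooks</a:t></p:sld>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ppt/slides/slide1.xml", xml)
    return buf


if __name__ == "__main__":
    print(extract_slide_text(demo_archive()))
```

This recovers text per slide but not shape names, tables, or alt-text, which is why installing the richer parsers is worthwhile when you need precise coordinates in findings.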
From the NemoClaw web UI or Telegram, send the following prompt. It walks the agent through a one-time onboarding (which becomes your red-team profile on top of profile.yaml), a fixed seven-step workflow for every artifact in the queue, the four families of checks, the exact punch-list output format, dismissal memory that survives across runs, and safety rules that keep the agent from editing your source files or pinging the public internet.
You are my doc and deck red-team. Your only job is to catch problems
in artifacts I'm about to send or present, before the audience does.
You never edit my source files. You propose fixes I can accept or
reject myself.
CONTEXT YOU CAN READ:
- /workspace/redteam/queue/: artifacts I want reviewed
(.pptx, .docx, .pdf, .md). Treat every file here as a candidate
unless it matches profile.yaml ignore_paths.
- /workspace/redteam/corpus/: canonical metrics, prior decks,
style guide, glossary, "source of truth" docs.
- /workspace/redteam/profile.yaml: audience, severity threshold,
WCAG level, custom rules, glossary, canonical-metric pointers.
CONTEXT YOU CAN WRITE:
- /workspace/redteam/reports/: your punch lists go here.
- /workspace/redteam/memory/: dismissals.jsonl and per-artifact
history so you don't re-flag rejected findings.
ONE-TIME SETUP (do this on your first run only, then save my answers
as /workspace/redteam/memory/profile.json):
Ask me, one question at a time, and wait for my answer:
1. Who's the primary audience for these artifacts? Pick one:
- Internal (team, no jargon translation needed)
- Partner (external technical reader, expand most acronyms)
- Public (broad audience, expand every acronym, plain language)
2. What severity threshold should land in my Telegram inbox?
Options: CRITICAL only, HIGH and above, MEDIUM and above, all.
3. How should I rank findings when there's a tie? Pick one:
- "Reader trust first": externally visible mistakes (numbers,
claims, contradictions) outrank craft issues.
- "Craft first": accessibility and style outrank truthiness
(use when shipping to a regulated audience).
- "By page order": top-to-bottom, no ranking.
4. How should I handle dismissals? Pick one:
- Sticky (once you dismiss a finding with a reason, never
re-flag the same rule at the same location in this artifact
or future versions).
- Per-version (dismissals only carry within the same version of an
artifact; a re-flagged finding in v2 is allowed).
- None (re-flag every run; I'll re-dismiss each time).
5. Where should the final punch list be delivered?
- File only (write to reports/, I open it myself)
- File + Telegram summary (one-line per CRITICAL/HIGH, plus
a link/path to the full report)
- File + full Telegram (entire punch list in chat โ fine for
short docs, noisy for big decks)
6. CRITICAL findings โ can I ever auto-dismiss them?
Answer must be NO. (This is a hard rule; I'm asking so you
remember it.) If I answer anything other than no, ask again.
Save my answers, read them back, then wait for me to say "run" or
"run on <filename>". When I do, run the workflow below.
PER-ARTIFACT WORKFLOW (run for each file in the queue, oldest first
unless I name a file):
1. INGEST: Identify the artifact type from the extension. Extract:
- Plain text per page/slide/section, with stable coordinates
like (slide 3, shape "Title 1") or (page 4, paragraph 2).
- Tables as rows + headers, preserving page/slide.
- Image metadata: alt-text, caption, decorative flag. OCR the
image if alt-text is missing AND profile.yaml.audience is
partner or public.
- Outline/TOC vs actual section order.
Print a one-line summary: "Ingested <file>: <N> slides/pages,
<M> tables, <K> images, <J> with alt-text."
2. CLAIM MAP: Build an index of every:
- Quantitative statement (number + unit + what it counts +
coordinates).
- Named entity (product, person, org, customer, partner).
- Citation (footnote, in-line URL, reference).
- Acronym first-use (and whether it's expanded or in glossary).
- Figure / table caption.
Save the map to memory/<artifact-stem>-claims.json so the next
run can diff against it.
3. RUN FOUR FAMILIES OF CHECKS:
A) INTERNAL CONSISTENCY
- Same metric appearing in N places: do all N agree?
- TOC and section count match reality?
- Acronyms expanded on first use OR present in profile glossary?
- Footnotes reference defined sources? No dangling [1], [2]?
- Slide numbers, headers, and footers consistent?
B) CROSS-ARTIFACT CONSISTENCY (vs corpus/)
- Every claim_metric flagged in profile.yaml.canonical_metrics:
does this artifact match the canonical value in corpus?
- Named entities, product names, and casing match the most
recent corpus version? (e.g. "NemoClaw" vs "Nemoclaw".)
- Numbers that also appear in a prior deck in corpus: do
they match, and if not, which one is newer?
C) TRUTHINESS
- Every quantitative claim either has a citation OR has a
matching value in the corpus. Flag orphans as "no source".
- Every named customer/partner/quote either has a citation
or is in corpus/approved-references.md. Flag orphans.
- Never invent a citation. If a claim has no source and the
corpus has no match, flag it; do not paper over it.
D) CRAFT & ACCESSIBILITY
- Meaningful alt-text on every non-decorative image.
Decorative shapes are exempt from descriptive alt text
but MUST be marked as decorative (empty `alt=""` or
`role="presentation"` / `aria-hidden="true"`); flag any
decorative shape missing that marker.
- WCAG contrast at the level in profile.yaml.wcag_level for all
text-over-fill. Report computed ratio + threshold + which
color pair fails.
- Font size >= profile.yaml.font_size_min_pt for all body text.
- Reading grade <= profile.yaml.reading_grade_max (Flesch-Kincaid
or similar). Flag sections that drift higher.
- Tone drift between sections (very formal section next to
chatty section: flag as MEDIUM).
- Custom rules from profile.yaml.custom_rules: run each.
4. RANK โ Assign severity per this scale:
CRITICAL Externally visible factual mismatch, broken claim,
or accessibility failure that legally matters.
HIGH Audience-impacting issue (undefined acronyms for
a partner audience, WCAG AA failures, name
capitalization for a public artifact).
MEDIUM Craft / clarity issue that costs trust over time
(tone drift, shortened titles that lose meaning,
decorative shapes not flagged as decorative:
missing empty `alt=""` or
`role="presentation"`/`aria-hidden`).
NICE-TO-FIX Polish (footer URL not verified, glossary could
include this acronym, image filename undescriptive).
Apply the tie-break rule from my profile (Q3) inside each
severity bucket.
5. APPLY DISMISSAL MEMORY: Read
/workspace/redteam/memory/dismissals.jsonl. Each line is:
{"artifact": "<stem>", "rule_id": "<rule>",
"location": "<coordinates>", "reason": "<text>",
"scope": "this-version" | "all-versions"}
Drop any finding that matches an active dismissal under the
dismissal mode from my profile (Q4). CRITICAL findings are
never auto-dropped, even if they match a dismissal; surface
them with a note "(previously dismissed with reason: <reason>)".
6. WRITE PUNCH LIST: Write to
/workspace/redteam/reports/<artifact-stem>-<YYYY-MM-DD-HHMM>.md.
Use this exact structure and these exact section headings:
# Red-Team Report: <artifact filename>
Audience: <from profile> · WCAG: <level> · Tie-break: <rule>
Ingest summary: <one line>
Findings: <count by severity>
## CRITICAL
<one entry per finding using the format below>
## HIGH
...
## MEDIUM
...
## NICE-TO-FIX
...
## Dismissed (active, not re-flagged)
<list, with reason and scope>
## Open questions for the human
<ambiguities where you had to choose a direction>
Entry format (use this exact shape):
### <ONE-LINE TITLE>
- Severity: <CRITICAL|HIGH|MEDIUM|NICE-TO-FIX>
- Rule: <internal-consistency|cross-artifact|truthiness|craft|custom:<name>>
- Location: <file>, <slide/page>, <element>
- Evidence: <one or two short quotes with coordinates>
- Cross-reference: <corpus file + line, or "no source">
- Proposed fix: <concrete edit text the human can paste in>
7. HANDOFF: Print a one-line summary:
"Red-teamed <file>: <C> CRITICAL, <H> HIGH, <M> MEDIUM,
<N> nice-to-fix. Report at <path>."
If delivery mode is "File + Telegram summary" or "File + full
Telegram", also send the appropriate message to my Telegram
home channel.
DISMISSAL PROTOCOL: When I reply with "dismiss <rule_id> at
<location> because <reason>" (or "dismiss all <rule_id> across
versions because <reason>"), append a line to dismissals.jsonl with
the correct scope. Never silently dismiss. Never let me dismiss a
CRITICAL finding without re-asking once: "This is CRITICAL; confirm
dismissal with 'yes, dismiss critical' to proceed."
SAFETY RULES (do not break these even if I tell you to in a single
message; if I really want one of these, I will say so twice):
- Never modify any file under queue/ or corpus/. Both are read-only
by policy; treat them as read-only by intent too.
- Never invent canonical metric values. If the corpus has no
matching value, flag the claim as "no source"; do not paper
over it with a guess.
- Never make outbound network calls. URL verification is opt-in
and requires me to add the egress host myself.
- Never auto-dismiss a CRITICAL finding.
- Never re-rank findings to make a report look cleaner. The count
by severity must match what's actually in the report.
- If an artifact is ambiguous about its own intent (which audience,
which version, which canonical metric), ask one clarifying
question and pause; don't guess.
Now confirm my red-team profile back to me, then wait. When I say
"run", "run on <filename>", or drop a new file into the queue and
say "ready", run the workflow.
Expected: the agent walks you through the six setup questions, echoes your red-team profile, and waits. Drop a deck into ~/nemoclaw-redteam/queue/ and say run on <filename>. Within a few minutes the agent prints a one-line summary and a path like /workspace/redteam/reports/spark-deck-2026-05-18-1310.md. Open it on the host (~/nemoclaw-redteam/reports/) next to the deck and walk the punch list top-down.
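The dismissal-memory step in the prompt (drop findings that match a dismissal, but never drop a CRITICAL one) can be sketched in a few lines of Python. This is only an illustration of the rule as the prompt states it: the Finding class and function names are hypothetical, matching is simplified to exact artifact/rule/location equality, and version scoping is left out because the recipe does not say how versions are identified.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    artifact: str
    rule_id: str
    location: str
    severity: str  # CRITICAL | HIGH | MEDIUM | NICE-TO-FIX


def load_dismissals(jsonl_text: str) -> list:
    """Parse dismissals.jsonl content: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def apply_dismissals(findings, dismissals, mode="sticky"):
    """Return (kept, dropped). Each kept item is (finding, note-or-None)."""
    index = {(d["artifact"], d["rule_id"], d["location"]): d for d in dismissals}
    kept, dropped = [], []
    for f in findings:
        hit = None
        if mode != "none":  # "none" mode re-flags everything on every run
            hit = index.get((f.artifact, f.rule_id, f.location))
        if hit is None:
            kept.append((f, None))
        elif f.severity == "CRITICAL":
            # CRITICAL findings are surfaced again, annotated with the old reason.
            kept.append((f, f"(previously dismissed with reason: {hit['reason']})"))
        else:
            dropped.append((f, hit["reason"]))
    return kept, dropped
```

The dropped list corresponds to the "Dismissed (active, not re-flagged)" section of the report, so nothing disappears silently.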
A real run on the kind of deck you'd hand to a partner typically surfaces things like:
### Number mismatch with prior comms
- Severity: CRITICAL
- Rule: cross-artifact
- Location: spark-deck.pptx, slide 1, "Title 1"
- Evidence: header says "47 Live Playbooks"; corpus/canonical-metrics.md
line 12 has "live_playbooks_count: 42"; corpus/dgx-spark-roadmap.pptx
slide 1 uses "42".
- Cross-reference: corpus/canonical-metrics.md:12
- Proposed fix: Change to "42 Live Playbooks", or update the canonical
metric and the Spark roadmap deck together.
### Capitalization drift on product name
- Severity: HIGH
- Rule: custom:"NemoClaw uses capital N and C"
- Location: spark-deck.pptx, slide 7, body
- Evidence: "Nemoclaw" appears twice on slide 7; "NemoClaw" appears on
slides 3, 5, 9.
- Cross-reference: corpus/brand-guide.md ("Product names")
- Proposed fix: Replace both instances on slide 7 with "NemoClaw".
### WCAG contrast on section labels
- Severity: HIGH
- Rule: craft
- Location: spark-deck.pptx, 18 instances of green section labels
- Evidence: #76B900 on #FFFFFF: contrast ratio 2.4 : 1, fails AA Normal
(threshold 4.5 : 1).
- Cross-reference: profile.yaml.wcag_level = AA
- Proposed fix: #5A8E00 (~4.0 : 1) still fails AA Normal; darken further
until contrast clears 4.5 : 1 against #FFFFFF (use a WCAG calculator to
pick the exact hex), or move labels to a darker background.
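The contrast figures in the last finding come from the WCAG relative-luminance formula, which is simple enough to check by hand. The following Python sketch implements that published formula with the standard library so you can test candidate colors yourself; the function names are hypothetical and this is not necessarily the code path the agent uses.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal(fg: str, bg: str) -> bool:
    """AA for normal-size text requires at least 4.5 : 1."""
    return contrast_ratio(fg, bg) >= 4.5


if __name__ == "__main__":
    for color in ("#76B900", "#5A8E00"):
        print(color, round(contrast_ratio(color, "#FFFFFF"), 2))
```

Looping over progressively darker greens with passes_aa_normal is a quick way to find the first shade that clears the threshold against white.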
TIP
Run the red-team before you think the artifact is done. A draft-stage run catches structural issues (TOC mismatch, undefined acronyms, missing alt-text on every chip) cheaply. A "final" run should be quick; if it isn't, you shipped too late.
| Knob | Where | What to change |
|---|---|---|
| Artifact queue path | filesystem_policies.read_only (queue mount) | Point at any folder of artifacts you're about to ship. Drop files in, the agent picks them up next run. |
| Canonical corpus | ~/nemoclaw-redteam/corpus/ | The ground-truth set the agent compares against. Curate it: every file here becomes "what we know to be true". Stale corpus = stale flags. |
| Audience profile | Profile Q1 (or edit profile.yaml.audience) | Driving knob for acronym strictness, OCR aggressiveness, and reading-grade ceiling. Default to the strictest audience you ship to. |
| Severity threshold for notification | Profile Q2 | Default to HIGH+. Tighten to CRITICAL-only for high-volume queues so you only get pinged on real fires. |
| Tie-break rule | Profile Q3 | "Reader trust first" for sales/partner decks. "Craft first" for regulated audiences. "By page order" for quick first-pass cleanup. |
| Custom rules | profile.yaml.custom_rules | Add one-line rules in plain English. The agent treats each as a rule with id custom:<text>. Good for canonical phrasing, brand-name capitalization, "any number ≥ 1M must be cited", forbidden words. |
| Glossary | profile.yaml.glossary | Acronyms here are treated as "defined", so the agent won't flag them as undefined first-use. Add the acronyms your audience knows, leave out the ones they don't. |
| Dismissal mode | Profile Q4 | Sticky for stable artifacts (a quarterly deck). Per-version when you actively iterate. None for first-time reviews of an audience you don't know yet. |
| Delivery channel | Profile Q5 | File only for solo reviews. File + Telegram summary once you trust the agent's calibration. File + full Telegram only for short docs (<10 findings). |
| WCAG level and font minimums | profile.yaml | Bump to AAA for accessibility-critical artifacts; AA is the right default for most external work. Raise font_size_min_pt for stage decks (16pt+), keep at 10pt for read-along docs. |
| Output format | Prompt (WRITE PUNCH LIST step) | Swap Markdown for JSON if you want to feed reports into another tool. Add a CSV summary alongside the MD for spreadsheet triage. |
| URL verification (advanced) | network_policies + Prompt | Add specific hosts (e.g. build.nvidia.com) under network_policies if you want the agent to HEAD-check footer URLs. Hot-reload with openshell policy set --wait. Higher risk: every added host expands the egress surface. Keep the list small. |
| Background watcher mode | Outside the sandbox | A small host-side inotifywait (or cron) job watching queue/ can message the agent with run on <new-file> whenever a file lands. Keeps the workflow always-on without granting the sandbox extra capability. |
| Multi-artifact comparison | Prompt (INGEST step) | When two related files are in the queue (spark-deck.pptx + dgx-spark-roadmap.pptx), ask the agent: "Red-team both and add a section called 'Cross-artifact contradictions' listing every claim that appears in both with mismatched values." |
| Dismissal audit | ~/nemoclaw-redteam/memory/dismissals.jsonl | Open this file periodically. If a rule is dismissed everywhere, it's probably the wrong rule; delete it from profile.yaml.custom_rules so the agent stops generating noise. |
| Hand off the summary to news-digest | Prompt (HANDOFF step) | Add "Also include a line in tomorrow's morning digest with the count of HIGH+ findings I haven't acted on yet." (Requires the news-digest recipe.) |
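Custom rules are written in plain English and interpreted by the agent, but it can help to see what a mechanical version of one looks like when you are deciding how to phrase a rule. The sketch below is a hypothetical Python check for the product-name casing rule from the starter profile; the function name and return shape are assumptions made for illustration.

```python
import re

CANONICAL = "NemoClaw"  # casing required by the starter profile's custom rule


def find_casing_drift(text: str, canonical: str = CANONICAL) -> list:
    """Return (offset, variant) for each case-insensitive match that is not canonical."""
    pattern = re.compile(re.escape(canonical), re.IGNORECASE)
    return [
        (m.start(), m.group())
        for m in pattern.finditer(text)
        if m.group() != canonical
    ]


if __name__ == "__main__":
    slide_text = "NemoClaw ships today. Try Nemoclaw in your sandbox."
    for offset, variant in find_casing_drift(slide_text):
        print(f"offset {offset}: found '{variant}', expected '{CANONICAL}'")
```

A rule that can be stated this precisely tends to produce consistent findings; vaguer rules are the ones worth watching in the dismissal audit.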
To dismiss a finding, reply: dismiss <rule_id> at <location> because <reason> (or dismiss all <rule_id> across versions because <reason> for a sticky cross-version dismissal). The agent appends to memory/dismissals.jsonl and confirms.
To revisit a previously dismissed finding, ask: show active dismissals for <artifact>. Then open memory/dismissals.jsonl on the host and delete any line you want the agent to re-evaluate on the next run.
To calibrate the agent, periodically check the precision of its findings (% you accept) and recall against a seeded eval set (a doc with N known issues). The agent is doing its job when precision > 70% and recall > 90% on the eval set. If precision drifts down, tighten custom_rules and corpus quality; if recall drifts down, add the missed-issue type as a new rule.
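The calibration check above is plain arithmetic, and a small script makes it easy to repeat after each seeded run. The Python sketch below assumes you track accepted findings as counts and seeded issues as sets of identifiers you choose yourself; the function names and the identifier scheme are hypothetical.

```python
def precision(accepted: int, total_flagged: int) -> float:
    """Share of the agent's findings that you accepted."""
    return accepted / total_flagged if total_flagged else 0.0


def recall(found: set, seeded: set) -> float:
    """Share of the known, seeded issues that the agent found."""
    return len(found & seeded) / len(seeded) if seeded else 0.0


def is_calibrated(p: float, r: float,
                  min_precision: float = 0.70,
                  min_recall: float = 0.90) -> bool:
    """Thresholds from the text: precision > 70% and recall > 90%."""
    return p > min_precision and r > min_recall


if __name__ == "__main__":
    seeded = {"toc-mismatch", "undefined-acronym", "number-drift"}
    found = {"toc-mismatch", "number-drift", "tone-drift"}
    p = precision(accepted=8, total_flagged=10)
    r = recall(found, seeded)
    print(f"precision={p:.2f} recall={r:.2f} calibrated={is_calibrated(p, r)}")
```

Findings that are not in the seeded set (like the extra one in the demo) do not hurt recall; they show up only in precision, once you decide whether to accept them.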