[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83836":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":22,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":10,"trendingCount":15,"starSnapshotCount":15,"syncStatus":13,"lastSyncTime":32,"discoverSource":33},83836,"VeloQ","lucifer1004\u002FVeloQ","lucifer1004","Agent-friendly GPU profile-query CLI","https:\u002F\u002Flucifer1004.github.io\u002FVeloQ\u002F",null,"Rust",68,2,1,0,14,9,51.83,"MIT License",false,"main",true,[24,25,26,27,28,29],"cli","cuda","gpu","ncu","nsys","profiling","2026-06-12 04:01:42","\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Flogo.svg\" alt=\"VeloQ logo\" width=\"104\" height=\"104\" \u002F>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">VeloQ\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\u003Cem>Pure CLI in \u002F JSON contract out; no GUI required.\u003C\u002Fem>\u003C\u002Fp>\n\n**Agent-friendly profile-query CLI family.** JSON by default, with\nCSV\u002Ftable projections where they are useful. One shot per call.\nDesigned so a coding agent (or a shell script) can reason about GPU\nprofiles without ever opening a GUI.\n\nVeloQ covers three profile sources today — **Nsight Systems** (timeline\ntraces), **Nsight Compute** (kernel reports), and experimental\n**PyTorch\u002FKineto** Chrome traces — through a single binary with a shared\nenvelope and a pluggable `ProfileSource` trait. 
The PyTorch\u002FKineto source\ncovers the Perfetto-style Chrome trace shape used by PyTorch profiler.\n\n## Status\n\n16 NSys verbs (timeline analysis, kernel overlap, NCU handoff, prep\u002Fcache helpers, and schema) +\n11 NCU verbs (`summary`, `launches`, `inspect`, `metrics`, `disasm`,\n`ranges`, `graphs`, `sources`, `source-metrics`, `warp-stalls`,\nplus `schema`) +\n10 experimental PyTorch verbs (`summary`, `search`, `inspect`, `stats`,\n`correlate`, `timeline`, `slices`, `collectives`, `prep`, `schema`) +\nfive root meta verbs (`info`, `sources`, `clean`, `recipes`, `self-update`). JSON output returns the same\nv1 envelope on stdout — every list response uses canonical\n`data.rows[]` with a stable per-row `key`; NSys trace responses also\ncarry top-level `trace_span` for per-second normalization. Errors come\nback through the same envelope shape with a non-zero exit code.\n\n### NSys ingestion\n\nNSys traces are read through `nsys export -t parquetdir`. **Minimum\nrequired nsys version is 2024.6** (the release that introduced the\n`parquetdir` `--type`). All VeloQ-generated products live under one\n`\u003Creport>.veloq\u002F` artifact root; the NSys parquet cache is its\n`parquetdir\u002F` child with ctime invalidation.\n\n## How it compares\n\nFor a GPU-profile question an agent otherwise reaches for one of three\nthings. 
VeloQ is built to beat each on the axes that matter to an agent —\na **stable typed contract**, **token economy**, and **scriptability** —\nnot raw speed: VeloQ reads `nsys export -t parquetdir` output, it does not\nreplace `nsys`\u002F`ncu`.\n\n|                              | Nsight GUI | Raw `nsys`\u002F`ncu` text in context | Hand-rolled SQLite + jq |             **VeloQ**              |\n| ---------------------------- | :--------: | :------------------------------: | :---------------------: | :--------------------------------: |\n| Scriptable \u002F one-shot        |     ✗      |             ~ ad hoc             |            ✓            |                 ✓                  |\n| Token-efficient for an agent |    n\u002Fa     |        ✗ dumps everything        |            ~            | ✓ shaped rows + truncation signals |\n| Stable typed contract        |     ✗      |           ✗ free text            |    ✗ schema you own     |     ✓ versioned JSON envelope      |\n| Cross-capture diffable       |     ✗      |                ✗                 |            ~            |       ✓ stable per-row `key`       |\n| Zero setup per query         |     ✓      |                ✓                 |            ✗            |                 ✓                  |\n\n**When _not_ to use VeloQ:** interactive, human exploration of a timeline,\nor a one-off eyeball check — the Nsight GUI is genuinely better there.\nVeloQ is for programmatic, repeatable, agent- or script-driven querying.\n\n## Install\n\nFor Linux and macOS, the install script is the shortest path: it installs\nboth the `veloq` binary and the bundled Agent Skills.\n\n```bash\n# Linux x86_64 \u002F aarch64 and macOS x86_64 \u002F arm64\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Flucifer1004\u002Fveloq\u002Fmain\u002Fscripts\u002Finstall.sh | bash\n```\n\nInstalls the `veloq` binary under `~\u002F.local\u002Fbin` and the Agent Skills\nfor profile analysis (`nsys-profile-analysis`, 
`ncu-profile-analysis`,\n`pytorch-profile-analysis`) under\n`~\u002F.agents\u002Fskills\u002F`. Pass `--no-skills` to install just the\nbinary, or `--no-binary` to refresh the skills when you manage the\nVeloQ CLI separately. The skills are VeloQ-backed: they can be\ninstalled separately, but profile evidence extraction still requires a\n`veloq` binary on `PATH`. `--bin-dir \u003Cpath>` overrides the binary\ninstall location.\n\nFor Windows, use `cargo binstall veloq` below or grab\n`veloq-x86_64-windows.exe` from the\n[Releases](https:\u002F\u002Fgithub.com\u002Flucifer1004\u002Fveloq\u002Freleases) page directly.\n\n### Cargo binstall (binary only)\n\nIf you use [`cargo-binstall`](https:\u002F\u002Fgithub.com\u002Fcargo-bins\u002Fcargo-binstall),\ninstall the prebuilt `veloq` binary from the GitHub release:\n\n```bash\ncargo binstall veloq\n```\n\n`cargo binstall` installs only the executable. To fetch the bundled\nskills from the latest release without replacing the binstall-managed\nVeloQ binary, run:\n\n```bash\nveloq self-update --no-binary\n```\n\nUse `--skills-dir \u003Cpath>` on that second command to install skills under a\nnon-default root such as `.claude\u002Fskills\u002F`.\n\n```bash\nveloq self-update --no-binary --skills-dir .claude\n```\n\n### Codex plugin (alternative)\n\nVeloQ ships Codex plugin metadata under `.codex-plugin\u002F` and a local\nmarketplace under `.agents\u002Fplugins\u002F`. From a VeloQ checkout:\n\n```bash\ncodex plugin marketplace add .\ncodex plugin add veloq@veloq\n```\n\nThe plugin install handles the Agent Skills only. 
Those skills require\nthe VeloQ CLI for evidence extraction, so install the `veloq` binary\nseparately via `cargo binstall veloq` or `scripts\u002Finstall.sh --no-skills`.\n\nThe repo's canonical Agent Skills source lives under `.agents\u002Fskills\u002F`.\nThe legacy `.claude\u002Fskills` path is kept as a compatibility alias.\n\n### Claude Code plugin (alternative)\n\nVeloQ ships a one-plugin marketplace listing under\n`.claude-plugin\u002F`. Users running Claude Code's plugin manager can:\n\n```text\n\u002Fplugin marketplace add https:\u002F\u002Fgithub.com\u002Flucifer1004\u002Fveloq.git\n\u002Fplugin install veloq@veloq\n```\n\nThis uses the same Agent Skills through the Claude-specific plugin\nmetadata under `.claude-plugin\u002F`.\n\n### Updating\n\n```bash\nveloq self-update                              # binary AND bundled Agent Skills\nveloq self-update --check                      # is a newer release out? (JSON)\nveloq self-update --no-skills                  # binary only\nveloq self-update --no-binary                  # Agent Skills only; keep your binary manager\nveloq self-update --skills-dir .claude          # install skills to .claude\u002Fskills\u002F\n```\n\n`self-update` pulls the latest GitHub release: it replaces the running\nbinary and re-installs the bundled Agent Skills, matching what\n`install.sh` does — so a self-updated binary never leaves stale skills\nbehind. Skills go to `~\u002F.agents\u002Fskills\u002F` by default; `--skills-dir \u003Cpath>`\n(or `VELOQ_SKILLS_DIR`) points them under a different root — a\nproject-local `.agents`, a Claude-specific `.claude`, etc. By convention\nskills live in a `skills\u002F` subdir, so `skills\u002F` is appended automatically\n(pass the root or the full skills dir, either works); the skills\nthemselves are portable `SKILL.md` files. `--check` only reports\n`update_available` without touching anything. All emit the standard\nenvelope on stdout. 
Re-running `install.sh` (same `--skills-dir`) also\nworks: it refreshes bundled Agent Skills and removes stale files from prior\nAgent Skills installs.\n\nIf the binary was installed with `cargo-binstall` and you want\n`cargo-binstall` to remain the binary manager, use\n`veloq self-update --no-binary` to refresh Agent Skills only, and use\n`cargo binstall` again for binary updates.\n\n## Quick start\n\nThese examples assume `veloq` is on `PATH` (via one of the install\nmethods above). For contributors building from source, see\n[Build from source](#build-from-source) — the binary lands at\n`target\u002Frelease\u002Fveloq`.\n\n```bash\n# ── NSys (timeline) — hoisted to the top level and also available as `veloq nsys ...`\n# Summarize a trace\nveloq summary path\u002Fto\u002Ftrace.nsys-rep\nveloq nsys summary path\u002Fto\u002Ftrace.nsys-rep\n\n# Top kernels by total time\nveloq stats path\u002Fto\u002Ftrace.nsys-rep --limit 10\n# Aggregate attributable kernels by full NVTX hierarchy path\nveloq stats path\u002Fto\u002Ftrace.nsys-rep --type kernel --group-by nvtx-path\n\n# Human-friendly comfy-table view\nveloq stats path\u002Fto\u002Ftrace.nsys-rep --limit 10 --format table\n\n# Find kernels by name. 
On large traces, --name-regex prunes the scan\n# before name resolution and runs several times faster than the\n# equivalent --name '*...*' glob (identical results).\nveloq search path\u002Fto\u002Ftrace.nsys-rep --type kernel --name-regex 'gemm' --sort duration:desc --limit 10\n\n# Discover canonical workflows (nvtx-breakdown, gpu-idle-audit,\n# memcpy-asymmetry, cold-kernel-hotspot, ...)\nveloq recipes\nveloq recipes nvtx-breakdown\n\n# GPU performance-counter samples (needs --gpu-metrics-devices at capture time)\nveloq metrics path\u002Fto\u002Ftrace.nsys-rep --type gpu --limit 8 --sort=mean:desc\n# Same data as a 50ms time series\nveloq metrics path\u002Fto\u002Ftrace.nsys-rep --type gpu --counter '*Throughput*' --bucket 50ms\n\n# NIC performance-counter samples (needs --nic-metrics=lf or =hf at capture time)\nveloq metrics path\u002Fto\u002Ftrace.nsys-rep --type nic --counter 'IB: Bytes*' --bucket 50ms\n\n# CPU hotspot (needs --sample=process-tree at capture time)\nveloq metrics path\u002Fto\u002Ftrace.nsys-rep --type cpu-sampling --limit 20\n# Per-thread breakdown\nveloq metrics path\u002Fto\u002Ftrace.nsys-rep --type cpu-sampling --group-by tid\n# Drill: full callchain for one sample\nveloq inspect path\u002Fto\u002Ftrace.nsys-rep cpu_sample:1234\n\n# Generate an Nsight Compute rerun command for a selected NSys kernel event\nveloq nsys ncu-command path\u002Fto\u002Ftrace.nsys-rep kernel:1234\nveloq nsys ncu-command path\u002Fto\u002Ftrace.nsys-rep kernel:1234 --print | bash\n\n# ── NCU (kernel reports) — namespaced under `ncu`\n# Slim overview (launch-derived totals + NCU-version session)\nveloq ncu summary path\u002Fto\u002Freport.ncu-rep\nveloq ncu summary --format csv path\u002Fto\u002Freport.ncu-rep\n# List launches; drill in for full per-launch metrics \u002F rules\nveloq ncu launches path\u002Fto\u002Freport.ncu-rep --kernel '*gemm*'\nveloq ncu inspect path\u002Fto\u002Freport.ncu-rep --row-id launch:0\n# Cross-launch metric projection (long form by 
default; jq-friendly diff shape)\nveloq ncu metrics path\u002Fto\u002Freport.ncu-rep --counter 'sm__*active*'\n# Per-launch SASS \u002F PTX \u002F source-line correlation (cached per cubin)\nveloq ncu disasm path\u002Fto\u002Freport.ncu-rep --row-id launch:0 \\\n  | jq '.data.rows[0] | {function_name, instruction_count: (.instructions|length)}'\n# Per-source-line warp-stall-reason histogram (from timed_warp_samples)\nveloq ncu warp-stalls path\u002Fto\u002Freport.ncu-rep --row-id launch:0\n# Other list verbs\nveloq ncu sources path\u002Fto\u002Freport.ncu-rep\nveloq ncu ranges path\u002Fto\u002Freport.ncu-rep\nveloq ncu schema launches\n\n# ── PyTorch\u002FKineto (Chrome traces) — namespaced under `pytorch`\nveloq pytorch summary path\u002Fto\u002Fworker0.pt.trace.json\nveloq pytorch search path\u002Fto\u002Fworker0.pt.trace.json --type kernel --is-comm\nveloq pytorch correlate path\u002Fto\u002Fworker0.pt.trace.json kernel:91\nveloq pytorch slices path\u002Fto\u002Fworker0.pt.trace.json --aggregate --group-by step\nveloq pytorch stats path\u002Fto\u002Fworker0.pt.trace.json --type comm --group-by comm-kind,rank\nveloq pytorch collectives path\u002Fto\u002Fworker0.pt.trace.json\nveloq pytorch schema search\n\n# ── Meta verbs\nveloq sources\nveloq info path\u002Fto\u002Ffile.ncu-rep\nveloq schema metrics\n```\n\n### Build from source\n\n```bash\n# The repo pins Rust 1.89.0 via rust-toolchain.toml.\ncargo build --release -p veloq\n# Binary lands at target\u002Frelease\u002Fveloq — either invoke it via the\n# full path or run `cp target\u002Frelease\u002Fveloq ~\u002F.local\u002Fbin\u002F` to put\n# it on PATH manually.\n.\u002Ftarget\u002Frelease\u002Fveloq --help\n```\n\n> Heads-up: nsys's GPU\u002FNIC\u002FCPU-sample\u002FSCHED buffers can silently\n> drop data on long captures. Every `metrics` response carries\n> coverage + per-type trust signals at `data.auxiliary.common`;\n> `veloq metrics --type \u003Cgpu|nic|cpu-sampling|cpu-sched> --help`\n> lists them. 
Read coverage before quoting numbers.\n\nThe first command on a new `.nsys-rep` runs `nsys export -t parquetdir`,\ncaching `\u003Ctrace>.nsys-rep.veloq\u002Fparquetdir\u002F\u003CTABLE>.parquet` for reuse;\npassing that generated `parquetdir\u002F` back resolves to the owning\n`.nsys-rep`, so sidecars stay under one artifact root. `veloq prep\n\u003Ctrace>` exports upfront; `veloq clean \u003Ctrace>` removes the generated\nproducts for one report.\n\n## Response envelope\n\nEvery successful JSON call returns the source-qualified v1 envelope:\n\n```json\n{\n  \"schema\": \"v1\",\n  \"source\": { \"kind\": \"nsys\", \"version\": \"v1\" },\n  \"command\": \"nsys.stats\",\n  \"trace\": { \"kind\": \"nsys\", \"path\": \"trace.nsys-rep\" },\n  \"trace_span\": { \"origin_ns\": 0, \"span_ns\": 12345000000 },\n  \"data\": {\n    \"count\": 50,\n    \"total_matched\": 1234,\n    \"rows\": [{ \"key\": \"kernel|...|dev:0|stream:7\", \"...\": \"...\" }]\n  }\n}\n```\n\n- `schema` — envelope-format version. Bumps on every breaking\n  envelope-shape change.\n- `source.kind` — which profile backend produced the response\n  (`\"nsys\"`, `\"ncu\"`, `\"pytorch\"`, or `\"veloq\"` for meta verbs).\n- `source.version` — per-source wire-format version. Bumps\n  independently from the envelope when the source's payload shapes\n  change. 
Currently NSys reports `v1` (the NVTX domain dimension on\n  `stats --group-by nvtx-path` rows — domain-qualified key plus resolved\n  `domain_id`\u002F`domain_pid`\u002F`domain_name`) and NCU reports `v1` (the\n  `ncu_report`-native wire — `inspect` carries no section catalog and\n  `summary.auxiliary.session` keeps only the NCU version; each\n  `ncu inspect` metric's\n  `metric_type` \u002F `metric_subtype` \u002F `rollup` is the `ncu_report` enum\n  _name_ such as `\"counter\"` rather than the integer `1`, with the raw\n  integer kept alongside as `*_code`).\n- `command` — qualified as `\u003Csource>.\u003Cverb>` for source verbs\n  (`nsys.stats`, `ncu.summary`), or just `\u003Cverb>` for meta verbs\n  (`info`, `sources`, `clean`).\n- `trace.kind` — mirrors the producing `source.kind` (or the\n  detected source kind for `veloq info`). Omitted entirely for\n  trace-less verbs (`sources`, `schema`, `ncu.schema`).\n- `trace_span` — primary-execution `(origin_ns, span_ns)` window.\n  Agents normalize totals by `span_ns` to get per-second rates\n  without a separate `summary` call. Omitted when the source does not\n  provide a trace-wide window, and on meta verbs that don't read a\n  trace.\n- `data.rows[]` — canonical primary list on every list-shaped verb.\n  Each row carries a `key: string` composed from its identifying\n  axes (e.g. `\"kernel:1234\"`, `\"bucket|0..1000000\"`,\n  `\"slice|step_42|@1234567\"`) so agents can\n  `INDEX(.data.rows[]; .key)` across two captures and diff by key.\n  Non-primary data lives under `data.auxiliary`.\n\n**Stability.** The JSON envelope and the per-source `version`s are VeloQ's\npublic contract. Additive fields are non-breaking and keep the version; any\nbreaking shape change bumps `schema` (`ENVELOPE_VERSION`) or the affected\n`source.version` and lands a CHANGELOG entry. 
The crate's `0.x` Cargo\nversion is **independent** of the wire version — pin behavior to the\nenvelope\u002Fsource versions, not the crate version.\n\nErrors share the same shape, with `data` replaced by `error`:\n\n```json\n{\n  \"schema\": \"v1\",\n  \"source\": { \"kind\": \"nsys\", \"version\": \"v1\" },\n  \"command\": \"nsys.stats\",\n  \"trace\": { \"kind\": \"nsys\", \"path\": \"trace.nsys-rep\" },\n  \"error\": {\n    \"message\": \"invalid --from `1s`: must pair with --to\",\n    \"chain\": [\"resolving --from\u002F--to\"]\n  }\n}\n```\n\nCLI-level parse failures (unknown flag, bad subcommand) omit `source`,\n`command`, and `trace`. `--help` \u002F `--version` print clap's native\nusage text unchanged.\n\nException: `veloq nsys ncu-command --print` intentionally writes a\nraw shell script on stdout for piping, and writes failures to stderr\nwithout a JSON envelope.\n\n## Subcommands\n\n### NSys verbs (hoisted to top level, also available under `nsys`)\n\n| Command             | Purpose                                                                                                                                                                                 |\n| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `summary`           | Overview: version, capabilities, per-table, primary vs full span                                                                                                                        |\n| `stats`             | Aggregation across kernel\u002Fmemcpy\u002Fmemset\u002Fsync\u002Fruntime\u002Fosrt\u002Fgraph\u002Fnvtx by name + composable axes                                                                                          |\n| `search`            | Filter events → list of `row_id`s plus headline columns                                                              
                                                                   |\n| `inspect`           | Full per-kind details for one or more `row_id`s                                                                                                                                         |\n| `correlate`         | CPU↔GPU causal chain for a `row_id`                                                                                                                                                     |\n| `ncu-command`       | Generate a native `ncu` rerun command for one selected kernel event                                                                                                                     |\n| `gaps`              | GPU idle bubbles. Default `--scope device` is cross-stream (no phantom gaps from idle peer streams); `--scope stream` for per-stream starvation; `--scope trace` for multi-GPU rig idle |\n| `timeline`          | Time-bucketed GPU activity (busy ns + per-kind breakdown per bucket)                                                                                                                    |\n| `concurrency`       | Kernel\u002Ftransfer overlap: per-device union vs sum busy time, peak concurrency, per-stream (incl. same-stream PDL) + compute\u002Fcopy overlap. 
Extraction-only (ratios in jq)                 |\n| `graph-replays`     | CUDA Graph replay decomposition: per-replay GPU work keyed by `(device, context, correlationId)`, across both `--cuda-graph-trace=graph` and `=node` captures                           |\n| `slices`            | Per-NVTX-range CPU bounds + attributed GPU work                                                                                                                                         |\n| `hardware`          | CPU \u002F GPU \u002F NIC inventory from the trace's `TARGET_INFO_*` tables                                                                                                                       |\n| `metrics`           | GPU\u002FNIC PM counters, CPU IP samples, or CPU scheduler events — hotspot summary, time series, callchain via `inspect`                                                                    |\n| `prep`              | Build the Parquet + metadata caches eagerly                                                                                                                                             |\n| `correlation-stats` | Build\u002Fload the correlation index and report counts                                                                                                                                      |\n| `schema \u003Ctarget>`   | Strict JSON Schema for one NSys verb's response                                                                                                                                         |\n\nEvery NSys command above can also be invoked as `veloq nsys \u003Ccommand>\n...`; the top-level form is kept as the default-source shorthand.\n\n### NCU verbs (namespaced under `ncu`)\n\nNCU verbs share a `\u003Ctrace>.veloq\u002Fncu-native.json.gz` sidecar built on\nfirst use; subsequent calls deserialise it instead of re-ingesting the\nreport.\n\nAll NCU detail verbs accept `--format json\\|csv\\|table`; tabular\noutput mirrors the JSON 
`data.rows[]` one row per output line\n(nested objects become dotted-key columns, `BTreeMap` fields like\n`counters` expand to one column per resolved counter name). `ncu\nschema` is JSON-only.\n\n| Command               | Formats            | Purpose                                                                                                                                                                                                                                    |\n| --------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n| `ncu summary`         | json \u002F csv \u002F table | Slim overview: one launch-derived totals row + degraded session (NCU version only). `--format csv\\|table` renders the totals + session as a `section,key,value` projection.                                                                |\n| `ncu launches`        | json \u002F csv \u002F table | List CUDA kernel launches as headline rows (`launch:\u003Cidx>`); filters: `--kernel '\u003Cglob>'`, `--nvtx-range '\u003Cglob>'`, `--grid WxHxD`, `--block WxHxD`, `--limit`                                                                             |\n| `ncu inspect`         | json \u002F csv \u002F table | Full per-launch payload (full metric list with placement-tagged instances + rules + recovered identity scalars) for one or more `--row-id launch:\u003Cidx>`; tabular emits one wide row per id with NULL columns for `not_found` variants      |\n| `ncu metrics`         | json \u002F csv \u002F table | Cross-launch metric projection. 
Default long form (one row per `(launch, counter)`); `--per-launch` for wide form (BTreeMap counters expand to one column per name)                                                                        |\n| `ncu disasm`          | json \u002F csv \u002F table | SASS \u002F PTX \u002F source-index correlation for the cubin one launch ran out of (cubin extracted from the report, cached per-cubin under `\u003Creport>.veloq\u002Fdisasm\u002F`); tabular emits one row per SASS instruction with denormalised kernel identity |\n| `ncu source-metrics`  | json \u002F csv \u002F table | Per-source-line \u002F per-SASS \u002F per-file NCU counter attribution. Joins per-PC metric instances with DWARF source-line attribution; `--by line\\|sass\\|file`. See `veloq recipes source-line-hotspots` for the canonical invocation.           |\n| `ncu warp-stalls`     | json \u002F csv \u002F table | Per-source-line warp-stall-reason histogram from `timed_warp_samples` (the raw warp-state stream); `--by line\\|sass\\|reason`, `--file '\u003Cglob>'`. Raw sample counts + `not_issued`; jq for percentages.                                     
|\n| `ncu ranges`          | json \u002F csv \u002F table | List range workloads (`--replay-mode range`)                                                                                                                                                                                               |\n| `ncu graphs`          | json \u002F csv \u002F table | List CUDA-graph workloads (`--graph-profiling graph`)                                                                                                                                                                                      |\n| `ncu sources`         | json \u002F csv \u002F table | Per-cubin source metadata (`cuda_sm_name`, `embedded_source_file_count`, `has_disasm`), one row per launch's cubin                                                                                                                         |\n| `ncu schema \u003Ctarget>` | json               | Strict JSON Schema for one NCU response. Targets: `summary \\| launches \\| inspect \\| metrics \\| disasm \\| ranges \\| graphs \\| sources \\| source-metrics \\| warp-stalls`                                                                    |\n\n### PyTorch verbs (namespaced under `pytorch`)\n\nPyTorch is an experimental `source.version = \"v0\"` source for Kineto\nChrome trace files (`.pt.trace.json` \u002F `.pt.trace.json.gz`). 
Directory\ninputs and cross-rank collective skew are planned, not shipped in v0.\nWhen one trace file contains multiple rank values, rank-scoped list and\naggregate commands require `--rank \u003Cn>` or `--all-ranks`.\nIt uses the same general VeloQ verbs instead\nof adding parallel `steps`, `memory`, or `comm` commands; communication\nquestions use `--type comm`, `--is-comm`, grouping axes, `slices`, and the\nsource-specific `collectives` verb.\n\n| Command                   | Formats            | Purpose                                                                                                       |\n| ------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------- |\n| `pytorch summary`         | json \u002F csv \u002F table | Trace inventory, capabilities, active devices, rank\u002Fworker inference, versions, capture flags                 |\n| `pytorch search`          | json \u002F csv \u002F table | Typed event refs; filters include `--type`, name glob\u002Fregex, duration, time, rank, device, stream, step       |\n| `pytorch inspect`         | json \u002F csv \u002F table | Raw args, typed args, parent\u002Fchildren, step\u002FPython context, and correlation\u002Fflow links for one or more row ids |\n| `pytorch stats`           | json \u002F csv \u002F table | Duration\u002Fcount aggregation by `name,type,step,rank,device,stream,shape,comm-kind,python-context,python-path`  |\n| `pytorch correlate`       | json \u002F csv \u002F table | CPU op \u002F annotation \u002F runtime \u002F driver \u002F GPU activity causal chain for one or more row ids                    |\n| `pytorch timeline`        | json \u002F csv \u002F table | Time buckets with CPU, GPU, communication, and per-type time                                                  |\n| `pytorch slices`          | json \u002F csv \u002F table | ProfilerStep and user annotation range instances or aggregates 
                                               |\n| `pytorch collectives`     | json \u002F csv \u002F table | Single-trace communication groups with CPU\u002FNCCL evidence row ids and link\u002Fordinal confidence                  |\n| `pytorch prep`            | json \u002F csv \u002F table | Build or inspect PyTorch sidecars under `\u003Cinput>.veloq\u002Fpytorch\u002F`                                              |\n| `pytorch schema \u003Ctarget>` | json               | Strict JSON Schema for one PyTorch response                                                                   |\n\n### Meta verbs (root, owned by the binary)\n\n| Command          | Purpose                                                                                                                                                                                                                                                                                                                           |\n| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `info \u003Ctrace>`   | First-touch trace map: source kind, filesystem facts, capability bitmap, plus (on a cached parquetdir) device\u002Fprocess inventory, NVTX domains + top paths, and `applicable_recipes` filtered by trace shape. Sub-100ms on a parquetdir; basics-only on a cold `.nsys-rep` with a `meta.next_steps` hint pointing at `veloq prep`. |\n| `recipes [\u003Cid>]` | List or show registered workflow recipes (run `veloq recipes` for the catalog, `veloq recipes \u003Cid>` for one).                                                                                                                                                                            
                                         |\n| `sources`        | Registered sources and their wire-format versions                                                                                                                                                                                                                                                                                 |\n| `clean \u003Ctrace>`  | Remove the `\u003Ctrace>.veloq\u002F` artifact root generated by VeloQ                                                                                                                                                                                                                                                                      |\n| `self-update`    | Update the binary and bundled Agent Skills from the latest GitHub release (`--check` \u002F `--no-skills` \u002F `--no-binary` \u002F `--skills-dir`)                                                                                                                                                                                           |\n\nPer-verb flag detail, response shape, sort keys, and examples live\nin `veloq \u003Cverb> --help` (which is projected from the same\n`JsonSchema` derive as the response, so it can't drift).\n\n## NVTX caveat\n\nNSys's `NVTX_EVENTS` table records CPU-side range timestamps only;\nGPU work is reached by walking `correlationId` from `NVTX → runtime\nAPI → kernel\u002Fmemcpy\u002Fmemset` with `(device, context)` disambiguation\nfrom `TARGET_INFO_CUDA_CONTEXT_INFO`. 
VeloQ does this walk in SQL for\n`stats --nvtx`\u002F`search --nvtx`\u002F`slices` and in a pre-built index\n(`\u003Ctrace>.veloq\u002Fcorrelation.bin`) for `correlate`.\n\nThe same walk runs in reverse for `inspect` (default-on) and\n`search --with-nvtx` (opt-in batched): given a kernel \u002F memcpy \u002F\nmemset \u002F sync `row_id`, VeloQ surfaces `nvtx_context: { range_id,\nname, depth, iter_index }` for the innermost enclosing NVTX range.\n`iter_index` is the 0-based ordinal among same-`(global_tid,\ndomain_id, name)` repeats — answers \"which step did this kernel\nbelong to\" without a second jq pass.\n\nFor nested NVTX, `veloq stats T --group-by nvtx-path` and\n`veloq slices T --aggregate --group-by path` group by the full\nslash-joined hierarchy path, so repeated leaf\nnames under different parents remain distinct. `inspect T nvtx:N`\nalso includes `path`, `parent_row_id`, and `parent_name` when the\nNVTX tree can be built.\n\n## Inputs\n\n| Source  | Extensions                  | Notes                                                                                                         |\n| ------- | --------------------------- | ------------------------------------------------------------------------------------------------------------- |\n| NSys    | `.nsys-rep`                 | Primary path; exported via `nsys export -t parquetdir` on first use                                           |\n| NSys    | `\u003Cstem>_pqtdir\u002F`            | Pre-exported parquetdir; opened directly                                                                      |\n| NSys    | `\u003Ctrace>.veloq\u002Fparquetdir\u002F` | Generated alias for the owning `.nsys-rep`; not a separate source                                             |\n| NCU     | `.ncu-rep`                  | Nsight Compute kernel report (ingested via NVIDIA's `ncu_report` API at prep time; no vendored proto schemas) |\n| PyTorch | `.pt.trace.json`            | PyTorch\u002FKineto Chrome trace JSON  
                                                                            |\n| PyTorch | `.pt.trace.json.gz`         | Gzipped PyTorch\u002FKineto Chrome trace JSON                                                                      |\n\n`veloq info \u003Ctrace>` reports which source claims the file based on\nthe same `detect()` heuristic the dispatcher uses, so an agent can\nprobe a path without having to maintain its own extension list.\n\nNCU ingestion runs NVIDIA's `ncu_report` Python API at **prep time only**;\nquery-time is NCU-free and the generated `\u003Creport>.veloq\u002F` sidecar is\nportable across Linux\u002FmacOS\u002FWindows. VeloQ auto-discovers the Nsight\nCompute install (`extras\u002Fpython`, or the macOS app bundle's\n`Contents\u002FMacOS\u002Fpython`). For a non-standard location, set\n`VELOQ_NCU_REPORT_DIR` to the directory containing `ncu_report.py`, and\u002For\n`VELOQ_PYTHON` to the interpreter to run the helper with.\n\n## License\n\nMIT\n","2026-06-11 04:11:36","CREATED_QUERY"]