[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2032":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},2032,"Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash","AEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-DFlash","AEON-7","Lossless abliteration of Qwen3.6-27B with NVFP4 hardware quantization for DGX Spark \u002F Blackwell. BF16 (51 GB) + NVFP4 (26 GB) deployment guide, docker-compose, and QuickStart.",null,"Python",274,28,5,6,0,10,17,110,30,82.89,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:13","\u003Cdiv align=\"center\">\n\n# Qwen3.6-27B-AEON-Ultimate-Uncensored\n\n### Lossless abliteration · Capability-enhanced · NVFP4 hardware-quantized for 
Blackwell\n\n[![BF16](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-BF16_(51_GB)-yellow?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16)\n[![NVFP4](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-NVFP4_(26_GB)-yellow?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4)\n[![Container](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fghcr.io-vllm--aeon--ultimate--dflash-blue?logo=docker)](https:\u002F\u002Fgithub.com\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-DFlash\u002Fpkgs\u002Fcontainer\u002Fvllm-aeon-ultimate-dflash)\n[![DDTree](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fghcr.io-vllm--aeon--ultimate--ddtree-purple?logo=docker)](https:\u002F\u002Fgithub.com\u002Fusers\u002FAEON-7\u002Fpackages\u002Fcontainer\u002Fpackage\u002Fvllm-aeon-ultimate-ddtree)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-green)](LICENSE)\n[![☕ Tips](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%E2%98%95_Tips-Support_the_work-ff5e5b?style=flat)](https:\u002F\u002Fgithub.com\u002FAEON-7\u002FAEON-7#-support-the-work)\n\n**Refusals: 0 \u002F 100** &nbsp;·&nbsp; **KL vs base: 0.000492** &nbsp;·&nbsp; **Compression: 49 %** &nbsp;·&nbsp; **Capability: enhanced**\n\n\u003C\u002Fdiv>\n\n---\n\n## TL;DR\n\nA **fully uncensored, capability-enhanced** abliteration of [Qwen\u002FQwen3.6-27B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3.6-27B), produced over **72 hours of continuous research** drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of next-generation abliteration software.\n\n## Performance — DGX Spark v4 vs Raw Baseline\n\n**This is the headline.** On DGX Spark \u002F GB10, the v4 DFlash container turns the default “it runs, but it feels slow” baseline into a 
usable long-context local agent model.\n\n| Deployment | Container | DFlash | CUDA graphs | Tool calling | Avg c=1 decode |\n|---|---|---:|---:|---:|---:|\n| 🔴 **Raw baseline** | `vllm\u002Fvllm-openai:nightly` | off | off (`--enforce-eager`) | off | **10.49 tok\u002Fs** |\n| 🟢 **AEON v4 DFlash** | `ghcr.io\u002Faeon-7\u002Fvllm-aeon-ultimate-dflash:qwen36-v4` | **k=15** | **on** | **on** | **37.56 tok\u002Fs** |\n\n**Average single-stream decode improvement: +258%** over the raw stock eager baseline.\n\n### Single-Stream Decode\n\n| Category | 🔴 Raw baseline | 🟢 v4 DFlash | Approx. speed increase | v4 TTFT | v4 TPOT |\n|---|---:|---:|---:|---:|---:|\n| Coding | 10.70 tok\u002Fs | **31.89 tok\u002Fs** | **+198%** | 191 ms | 30.5 ms |\n| Math | 10.01 tok\u002Fs | **37.76 tok\u002Fs** | **+277%** | 225 ms | 25.5 ms |\n| Reasoning | 10.54 tok\u002Fs | **42.41 tok\u002Fs** | **+303%** | 221 ms | 22.6 ms |\n| Prose | 10.59 tok\u002Fs | **31.85 tok\u002Fs** | **+201%** | 212 ms | 30.4 ms |\n| Natural language | 10.56 tok\u002Fs | **31.99 tok\u002Fs** | **+203%** | 183 ms | 30.3 ms |\n| Extraction \u002F JSON | 10.56 tok\u002Fs | **49.48 tok\u002Fs** | **+369%** | 227 ms | 19.2 ms |\n| **Average** | **10.49 tok\u002Fs** | **37.56 tok\u002Fs** | **+258%** | ~210 ms | ~26.4 ms |\n\n### Practical Agent Concurrency\n\nAt c=16, the optimized container keeps active streams much more responsive. 
Aggregate throughput improves most on structured agent\u002Ftool workloads, and TPOT drops across every category.\n\n| Category | 🔴 Raw c=16 aggregate \u002F TPOT | 🟢 v4 c=16 aggregate \u002F TPOT | Aggregate change |\n|---|---:|---:|---:|\n| Coding | 134.47 tok\u002Fs \u002F 115.1 ms | **144.45 tok\u002Fs \u002F 61.5 ms** | **+7%** |\n| Math | 134.38 tok\u002Fs \u002F 115.1 ms | **193.94 tok\u002Fs \u002F 41.6 ms** | **+44%** |\n| Reasoning | 134.86 tok\u002Fs \u002F 115.4 ms | **187.82 tok\u002Fs \u002F 46.6 ms** | **+39%** |\n| Prose | **135.34 tok\u002Fs** \u002F 115.3 ms | 121.34 tok\u002Fs \u002F **80.6 ms** | -10% aggregate, **30% lower TPOT** |\n| Natural language | 129.82 tok\u002Fs \u002F 117.7 ms | **130.19 tok\u002Fs \u002F 71.2 ms** | ~flat aggregate, **39% lower TPOT** |\n| Extraction \u002F JSON | 133.30 tok\u002Fs \u002F 115.4 ms | **219.11 tok\u002Fs \u002F 43.2 ms** | **+64%** |\n\n### Stress Saturation\n\nc=256 is a saturation test, not the recommended interactive setting. The baseline can report high aggregate throughput by letting every stream crawl. 
v4 keeps per-active-stream TPOT far lower, but at c=256 requests queue hard and TTFT rises into minutes.\n\n| Category | 🔴 Raw c=256 TPOT | 🟢 v4 c=256 TPOT | v4 c=256 TTFT |\n|---|---:|---:|---:|\n| Coding | 575.5 ms | **70.0 ms** | 149.6 s |\n| Math | 531.9 ms | **42.7 ms** | 103.6 s |\n| Reasoning | 540.7 ms | **49.4 ms** | 109.3 s |\n| Prose | 532.5 ms | **77.1 ms** | 159.8 s |\n| Natural language | 533.4 ms | **72.9 ms** | 160.0 s |\n| Extraction \u002F JSON | 551.9 ms | **43.2 ms** | 90.4 s |\n\n### What v4 Adds\n\n- Latest validated community vLLM nightly: `0.20.2rc1.dev166+gf6490a284`\n- FlashInfer 0.6.11\n- DFlash sliding-window-attention compatibility patch from vLLM PR #40898\n- CUTLASS NVFP4 fast path selected for GB10 \u002F sm_121a\n- DFlash k=15 using `z-lab\u002FQwen3.6-27B-DFlash`\n- Qwen3 reasoning parser and Qwen3-Coder tool-call parser enabled\n- Packaged gateway\u002Fproduction\u002Fbenchmark profiles so users do not have to hand-assemble the full vLLM command\n\n### DDTree v5 Research Track\n\nDDTree is the next obvious performance target, but it must land inside vLLM without losing multimodal, reasoning, tool calling, NVFP4, or the OpenAI-compatible gateway surface. 
The current research image is published separately from the production DFlash image and is intentionally marked experimental:\n\n```bash\ndocker pull ghcr.io\u002Faeon-7\u002Fvllm-aeon-ultimate-ddtree:qwen36-v5-m53-experimental\n```\n\nRead the full DDTree card and lab chronicle: [`docs\u002Fqwen36-ddtree-card.md`](docs\u002Fqwen36-ddtree-card.md).\n\nCurrent status in one line: **flat DFlash remains the production path; DDTree v5 is a published experimental container for tree-verifier, branch-state, and GDN replay development.** The image preserves the same NVFP4, DFlash, multimodal, reasoning, tool-calling, and OpenAI-compatible vLLM surface, but true non-flat branch commit is still research-only.\n\nThe working implementation plan lives in [`docs\u002Fddtree-vllm-integration-plan.md`](docs\u002Fddtree-vllm-integration-plan.md). M1 scaffolding and the current experimental Docker context live in [`container\u002Fqwen36-v5-ddtree-experimental\u002F`](container\u002Fqwen36-v5-ddtree-experimental\u002F).\n\nThe DDTree card documents:\n\n- the current container tags and digest,\n- what works today,\n- the M1 through M53 trial-and-error path,\n- the current M53 non-flat probe status,\n- known blockers around branch-state GDN replay, fused branch attention, and accepted-branch commit,\n- benchmark context and caveats,\n- where community help is most likely to move the project forward.\n\nRaw benchmark files:\n\n- [`bench\u002Fresults\u002Fqwen36_dirty_baseline_eager_20260510T034652Z.json`](bench\u002Fresults\u002Fqwen36_dirty_baseline_eager_20260510T034652Z.json)\n- [`bench\u002Fresults\u002Fqwen36_v4_fi0611_noprefix_full_sweep_20260510T065838Z.json`](bench\u002Fresults\u002Fqwen36_v4_fi0611_noprefix_full_sweep_20260510T065838Z.json)\n- [`bench\u002Fresults\u002Fqwen36_v4_fi0611_noprefix_true_single_20260510T065020Z.json`](bench\u002Fresults\u002Fqwen36_v4_fi0611_noprefix_true_single_20260510T065020Z.json)\n\nThe v4 sweep used natural prompts across coding, math, 
reasoning, prose, everyday language, and extraction\u002FJSON. It intentionally used a short-context benchmark profile to isolate decode\u002Fscheduler behavior: `--max-model-len 2048`, `--max-num-seqs 256`, prefix caching disabled, thinking enabled, 200 output tokens, minimum 16 samples per point, 20% trimmed median. For production DFlash gateway use, prefix caching is workload-dependent: it is valuable when many agents share a stable prompt prefix, but DDTree research modes keep it off while branch-state correctness is under development.\n\n---\n\n## Model Variants\n\nSix release formats covering DGX Spark, RTX PRO 6000, RTX 5090, and pre-Blackwell hardware:\n\n| Release | Size | Target hardware | Use when |\n|---|---|---|---|\n| **[BF16](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16)** | 51 GB | A100 \u002F H100 80 GB · RTX PRO 6000 Blackwell 96 GB | You have Ampere\u002FHopper or want full-precision reference weights |\n| **[NVFP4](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4)** | 26 GB | Simpler NVFP4 deployments | llm-compressor format, `--quantization compressed-tensors`. For best DGX Spark performance, use the v4 DFlash recipe with the XS body below. |\n| **[Multimodal-NVFP4-MTP](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP)** | 27 GB | RTX PRO 6000 Blackwell · B100\u002FB200 | modelopt format, `--quantization modelopt`, MTP spec decode via grafted `mtp.*` head. Vision tower preserved. **GDN linear-attention preserved BF16** for best long-context fidelity. |\n| **[Text-NVFP4-MTP](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP)** | 26 GB | RTX PRO 6000 · text-only deployments | Same recipe as Multimodal-NVFP4-MTP, vision tower stripped. 
**GDN preserved BF16.** |\n| **[Multimodal-NVFP4-MTP-XS](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS)** | 21 GB | RTX 5090 (32 GB) · tighter dedicated VRAM | Strategic split: GDN projection matmuls → NVFP4; **`linear_attn.conv1d` kept BF16** to preserve the recurrence-critical SSM convolution. Vision tower preserved. |\n| **[Text-NVFP4-MTP-XS](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS)** | 20 GB | RTX 5090 text-only · 24 GB cards | Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship. |\n\nAll six formats are **the same underlying model**. NVFP4 KL divergence vs BF16 source is below the noise floor of stochastic sampling — you cannot tell them apart at the output level. The four MTP variants share the same NVFP4 quantization quality plus the original `Qwen\u002FQwen3.6-27B` MTP head grafted back in BF16 (bit-exact, verified) for spec-decode drafting.\n\n> **Regular MTP vs XS — what's the difference, and why it's a *strategic* quantization choice (not a precision compromise):**\n>\n> The GatedDeltaNet (GDN \u002F Mamba-style) `linear_attn.*` block has two distinct components: the **heavy projection matmuls** (`in_proj_qkv`, `in_proj_z`, `in_proj_a\u002Fb`, `out_proj` — ~11 GB total) and the **SSM 1D convolution kernel** (`linear_attn.conv1d` — small, but recurrence-critical).\n>\n> - **Regular MTP variants** keep *both* at BF16. Maximum numerical safety margin, larger footprint.\n> - **XS variants** quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) **but explicitly preserve `linear_attn.conv1d` at BF16**. 
FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16 — the same principle modelopt's `NVFP4_DEFAULT_CFG` applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is *not* \"everything to FP4\" — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.\n>\n> Pick regular if you have ≥48 GB VRAM and want best precision on long-context workloads; pick XS if you're on a 24–32 GB card and want maximum KV headroom with the SSM kernel still numerically stable.\n\n> **Hardware routing:**\n> - **DGX Spark (GB10 \u002F sm_121a)** → use the **v4 DFlash container** with the Multimodal-NVFP4-MTP-XS body. That is the benchmarked path above.\n> - **Dedicated-VRAM Blackwell** *(RTX PRO 6000 \u002F RTX 5090 \u002F B100\u002FB200)* → use the MTP variants when you want the grafted native MTP head. Dedicated VRAM behaves differently from Spark's unified memory, so benchmark locally before copying Spark flags.\n\n---\n\n## Table of contents\n\n1. [Performance — DGX Spark v4 vs Raw Baseline](#performance--dgx-spark-v4-vs-raw-baseline)\n2. [Model variants](#model-variants)\n3. [What this is](#what-this-is)\n4. [Final stats](#final-stats)\n5. [Hardware compatibility matrix](#hardware-compatibility-matrix)\n6. [QuickStart — DGX Spark](#quickstart--dgx-spark--xs-body--dflash-recommended-winner)\n7. [QuickStart — A100 \u002F H100 (BF16)](#quickstart--a100--h100-bf16)\n8. [In-depth: the abliteration methodology](#in-depth-the-abliteration-methodology)\n9. [In-depth: NVFP4 quantization](#in-depth-nvfp4-quantization)\n10. [Capability enhancement: the lifted \"safety tax\"](#capability-enhancement-the-lifted-safety-tax)\n11. [Configuration reference](#configuration-reference)\n12. [Responsibility, arbitration, and use](#responsibility-arbitration-and-use)\n13. [Provenance & credits](#provenance--credits)\n14. 
[License](#license)\n\n---\n\n## What this is\n\nThis is the **definitive uncensored release of Qwen 3.6 27B**: the alignment-overhead removal is so surgical that the model's KL divergence from the base is **0.000492**, roughly 200× (about two orders of magnitude) inside the empirically observed \"capability damage threshold\" of KL ≈ 0.1, and below the noise floor of ordinary stochastic sampling. A user cannot distinguish this model from the base on capability tasks; on several measurable axes (chain-of-thought commitment, adversarial-reasoning bandwidth, calibration honesty), it is *better*.\n\nThis is not a weekend abliteration. The release is the product of **72 hours of continuous research and tuning** in which **hundreds of parallel AI research agents** were dispatched to:\n\n- Characterize Qwen 3.5 \u002F 3.6 hybrid-attention internals (16 full-attention layers + 48 GatedDeltaNet \u002F linear-attention layers, `attn_output_gate=True` with doubled `q_proj` geometry, the FernflowerAI SSM `conv1d` outlier pattern).\n- Survey the post-training-intervention literature in full: Arditi et al. (refusal as a single direction), grimjim's NPBA (norm-preserving biprojected abliteration), Heretic, Wuwangzhang's abliterix, Huang et al. on the safety tax, Xie et al. 
on DGR safety-tax mitigation, the projected-abliteration extensions, the winsorization heuristics.\n- Audit every relevant arXiv submission of 2024–2026 on alignment-direction interventions, capability preservation, and 4-bit quantization on hybrid-attention stacks.\n- Comb the r\u002FLocalLLaMA community archive for tribal knowledge on what does and does not work — particularly on Mamba \u002F GatedDeltaNet hybrids, where most generic abliteration recipes silently fail.\n- Trace the GitHub commit graphs of the abliteration tooling ecosystem to identify pre-public development branches that fix bugs unfixed in the public releases.\n\nThe pipeline that emerged integrates the industry's best published methodologies — Arditi-style mean-difference refusal vectors, NPBA, projected abliteration with outlier-aware winsorization, FernflowerAI's SSM `conv1d` outlier repair, abliterix v1.4's multi-objective Optuna search — **alongside custom in-house techniques developed for Qwen 3.6's idiosyncratic attention geometry, and yet-unreleased pre-public branches of the next-generation abliteration toolchain integrated through direct collaboration with upstream maintainers.**\n\nThe 50-trial Optuna search was cross-validated against a 10-axis capability spot-check to catch the documented \"low-KL but word-salad\" over-abliteration trap that pure refusal-rate scoring will miss. Trial 46 was selected — not the lowest-KL trial, but the one that combined zero refusals with full capability coherence.\n\n---\n\n## Final stats\n\n### Refusal rate (apples-to-apples)\n\n| Metric | Base Qwen3.6-27B | **AEON-Ultimate** |\n|---|---|---|\n| Refusals on harmful prompts | 99 \u002F 100 | **0 \u002F 100** |\n| Verdict | heavily aligned | **uncensored** |\n| Compliance rate | 1 % | **100 %** |\n\nTested on a 100-prompt adversarial battery from `mlabonne\u002Fharmful_behaviors` covering cybercrime, weapons, violence, self-harm, hate speech, and synthesis instructions. 
Same denominator as the base evaluation.\n\n### Capability preservation\n\n| Metric | Value |\n|---|---|\n| First-3-token KL divergence vs base | **0.000492** |\n| Output length deviation vs base | 0.027 σ |\n| Capability spot-checks (10 axes) | **10 \u002F 10 coherent** |\n| Math · code · reasoning · knowledge · long-form | All preserved |\n\nCapability axes verified: arithmetic word problems, linear algebra, calculus, Python with memoization, Rust UTF-8 string handling, transitive syllogisms, the bat-and-ball intuition trap, factual recall, technical contrast (TCP vs UDP), structured pedagogical long-form. Every axis produced coherent, structured, reasoning-forward outputs — no looping, no philosophizing spirals, no word-salad.\n\n### KL divergence detail\n\n| Distribution metric | Value |\n|---|---|\n| First-3-token KL vs base | **0.000492** |\n| Winsorization quantile | 0.995 (outlier-aware) |\n| Projection | orthogonal + projected-abliteration (NPBA-style) |\n| Trials evaluated | 50 (15 random warmup + 35 TPE-driven Optuna) |\n| Selected trial | #46 (winner, COHERENT) |\n\nThe empirically observed \"capability damage threshold\" in the abliteration literature is KL ≈ 0.1. AEON-Ultimate's KL is **~200× below** that threshold.\n\n---\n\n## Hardware compatibility matrix\n\nThe right variant depends on **memory architecture**, not just GPU model. DGX Spark should use the v4 DFlash container above; dedicated-VRAM Blackwell can use the MTP variants when the native MTP head is desired.\n\n| Hardware | Recommended variant | Why this exact variant | Spec-decode method |\n|---|---|---|---|\n| **DGX Spark \u002F GB10** *(sm_121a, unified memory)* | 🏆 **[`-Multimodal-NVFP4-MTP-XS`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS) body + DFlash + `qwen36-v4` image** | Current recommended path. 
v4 packages latest validated vLLM nightly, FlashInfer 0.6.11, CUTLASS NVFP4, CUDA graphs, the DFlash sliding-window-attention patch, Qwen3 reasoning parsing, and Qwen3-Coder tool parsing. | DFlash *k=15* via [`z-lab\u002FQwen3.6-27B-DFlash`](https:\u002F\u002Fhuggingface.co\u002Fz-lab\u002FQwen3.6-27B-DFlash) drafter |\n| **B100 \u002F B200** *(sm_100, dedicated FP4 silicon)* | **[`-Multimodal-NVFP4-MTP`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP)** (preferred — GDN BF16 fits) or [Text variant](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP) | Native FP4 via `tcgen05` \u002F UTCQMMA — fastest hardware for this format. Dedicated VRAM bandwidth lets MTP's high acceptance rate translate to throughput. | qwen3_5_mtp *n=3* (head grafted bf16, in repo) |\n| **RTX PRO 6000 Blackwell** *(sm_120, 96 GB dedicated)* | **[`-Multimodal-NVFP4-MTP`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP)** for vision · [`-Text-NVFP4-MTP`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP) for text-only · [XS siblings](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS) for tighter memory budgets | Dedicated VRAM has different bandwidth behavior than Spark unified memory. Start with the MTP variants and benchmark locally. | qwen3_5_mtp *n=3* |\n| **RTX 5090** *(sm_120, 32 GB dedicated)* | **[`-Multimodal-NVFP4-MTP-XS`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS)** *(21 GB)* if you use vision · **[`-Text-NVFP4-MTP-XS`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS)** *(20 GB)* if text-only | Regular MTP variants (~27 GB) leave too little KV headroom on 32 GB. 
XS variants (conv1d preserved BF16, projection matmuls FP4) fit comfortably. | qwen3_5_mtp *n=3* |\n| **Other 24–48 GB cards** *(RTX 4090, RTX 3090, RTX A6000 ≤48 GB)* | **[`-Text-NVFP4-MTP-XS`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS)** *(20 GB)* | The smallest variant. Pre-Blackwell sm_\u003C120 will dequantize NVFP4 → BF16 at the kernel level (no FP4 silicon win), but the model still works and KV fits. | qwen3_5_mtp *n=3* |\n| **H100 80 GB** *(sm_90)* | **[`-BF16`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16)** | NVFP4 dequantizes to BF16 at the kernel level: it works, but with no throughput gain. Use BF16 for a cleaner code path. | none (or external EAGLE \u002F Medusa drafter) |\n| **A100 80 GB** *(sm_80)* | **[`-BF16`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16)** | Same as H100. BF16 at 131K context, single-GPU. | none |\n| **Multi-GPU (any tier)** | **[`-BF16`](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16)** *(`tensor-parallel-size 2\u002F4\u002F8`)* | Reference weights for fine-tuning, distillation, or quant-recipe development. | none |\n| **Anything older than A100** | Not supported | Won't fit + lacks attention backends. | none |\n\n---\n\n## QuickStart — DGX Spark 🏆 (XS body + DFlash, recommended winner)\n\n**Pick this for DGX Spark.** This is the current packaged winner for real GB10 use: the v4 XS+DFlash path averages **37.56 tok\u002Fs single-stream** across six natural prompt categories versus **10.49 tok\u002Fs** for the raw stock eager baseline. It preserves multimodal input, reasoning parsing, and OpenAI-compatible tool calls.\n\nThe XS body includes a grafted MTP head, but the Spark recipe intentionally uses **external DFlash k=15**. 
Do not switch the Spark compose file to `method:\"qwen3_5_mtp\"` unless you are deliberately running an ablation.\n\n### Step 1 — Authenticate to HuggingFace and pull both models\n\n```bash\nhf auth login                                    # one time, paste your HF token\n\nhf download AEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \\\n  --local-dir .\u002Fmodels\u002Faeon-ultimate-multimodal-nvfp4-mtp-xs\n\nhf download z-lab\u002FQwen3.6-27B-DFlash \\\n  --local-dir .\u002Fmodels\u002Fdflash-drafter\n```\n\n> The DFlash drafter is auto-gated — first download will prompt you to click-accept the terms (instant approval). If you've previously downloaded it before 2026-04-27, **re-run** the download; z-lab pushed an updated drafter and you want the new weights.\n\n### Step 2 — Use the XS docker-compose\n\n[`docker-compose.spark-xs.yml`](docker-compose.spark-xs.yml) ships in this repo with the exact config measured above. Highlights:\n\n- **Image**: `ghcr.io\u002Faeon-7\u002Fvllm-aeon-ultimate-dflash:qwen36-v4` (also published as `:latest`)\n- **Body**: XS multimodal (`--quantization modelopt`)\n- **Speculative decoding**: DFlash, k=15, architecture-matched drafter (`--speculative-config '{\"method\":\"dflash\",...}'`)\n- **GB10-specific env**: `TORCH_CUDA_ARCH_LIST=12.1a`, `ENABLE_NVFP4_SM100=0`, `VLLM_USE_FLASHINFER_SAMPLER=1`, `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass`, `NVIDIA_FORWARD_COMPAT=1`\n- **Default gateway tuning**: `--max-model-len 256000 --max-num-seqs 64 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.75` *(leaves room for ASR\u002FTTS\u002Fembedding side services)*\n- **Long-context production tuning**: `--max-model-len 200000 --max-num-seqs 16 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.85` *(higher KV reserve when the LLM is the only major GPU service)*\n- **Multimodal**: `--limit-mm-per-prompt '{\"image\":4,\"video\":2}' --mm-encoder-tp-mode data --mm-processor-cache-type shm`\n- **Serving**: 5 
aliases (`aeon-ultimate`, `qwen36-ultimate`, `aeon-fast`, `aeon-deep`, `aeon-ultimate-xs`) all routing to the same engine\n\n### Step 3 — Start\n\n```bash\ndocker compose -f docker-compose.spark-xs.yml up -d\ndocker compose -f docker-compose.spark-xs.yml logs -f vllm\n```\n\n> First boot takes ~10–12 min (FlashInfer NVFP4 GEMM autotuner + CUDA-graph capture; both cache to `\u002Froot\u002F.cache\u002Fvllm\u002F...`). Subsequent restarts ~3–5 min. The MTP-head detection log line will appear in startup but the engine routes around it correctly because of `--speculative-config method:\"dflash\"`.\n\n### Step 4 — Test\n\n```bash\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"aeon-ultimate\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Explain zero-knowledge proofs to a basic-crypto audience.\"}],\n    \"max_tokens\": 512,\n    \"temperature\": 0.7\n  }'\n```\n\nOpenAI-compatible endpoint at `http:\u002F\u002Flocalhost:8000\u002Fv1`. Tool calling, reasoning mode (`\u003Cthink>` blocks), and multimodal input all enabled out of the box.\n\n> **Why this combo wins on Spark**: v4 keeps the XS body, CUTLASS NVFP4, DFlash k=15, CUDA graphs, tool parsing, reasoning parsing, and multimodal support in one pullable image. 
That is the path benchmarked at the top of this README.\n\n---\n\n## QuickStart — A100 \u002F H100 (BF16)\n\nFor Ampere \u002F Hopper cards, run the BF16 release on vanilla vLLM.\n\n### Step 1 — Pull weights\n\n```bash\nhf download AEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16 \\\n  --local-dir \u002Fopt\u002Fmodels\u002Faeon-ultimate-bf16\n```\n\n### Step 2 — Drop in the BF16 docker-compose\n\n```yaml\n# docker-compose.bf16.yml\nservices:\n  aeon-ultimate-bf16:\n    image: vllm\u002Fvllm-openai:latest\n    container_name: aeon-ultimate-bf16\n    restart: unless-stopped\n    network_mode: host\n    ipc: host\n    runtime: nvidia\n    environment:\n      NVIDIA_VISIBLE_DEVICES: all\n    volumes:\n      - \u002Fopt\u002Fmodels\u002Faeon-ultimate-bf16:\u002Fmodels\u002Faeon-ultimate:ro\n    command: >\n      --model \u002Fmodels\u002Faeon-ultimate\n      --served-model-name aeon-ultimate\n      --host 0.0.0.0 --port 8000\n      --dtype bfloat16\n      --max-model-len 131072\n      --max-num-seqs 16\n      --max-num-batched-tokens 8192\n      --gpu-memory-utilization 0.90\n      --enable-chunked-prefill\n      --enable-auto-tool-choice\n      --tool-call-parser qwen3_coder\n      --reasoning-parser qwen3\n      --attention-backend flash_attn\n      --trust-remote-code\n```\n\n### Step 3 — Start\n\n```bash\ndocker compose -f docker-compose.bf16.yml up -d\n```\n\nFor 96 GB cards (RTX PRO 6000 Blackwell on the BF16 path), raise to `--max-num-seqs 32 --max-num-batched-tokens 16384 --max-model-len 262144`. **For native FP4 throughput on RTX PRO 6000, see the dedicated NVFP4 recipe below.**\n\n---\n\n## Other hardware configurations\n\nThe DGX Spark and BF16 quickstarts above are the AEON-7 team's measured-and-validated configurations. 
Recipes for additional hardware live in the [`other-hardware\u002F`](other-hardware\u002F) directory, each in its own subfolder with a tuned `docker-compose.yml` and a per-hardware README explaining what differs from the DGX Spark recipe and why.\n\n| Hardware | Recipe | Status | Recommended for |\n|---|---|---|---|\n| **NVIDIA RTX PRO 6000 Blackwell** (sm_120, 96 GB GDDR7) | [`other-hardware\u002Frtx6000pro\u002F`](other-hardware\u002Frtx6000pro\u002F) | Community recipe | Single-GPU NVFP4 deployment with native sm_120 FP4 tensor-core throughput. Dedicated-VRAM flags differ from DGX Spark unified-memory flags. |\n\nIf you have hardware not covered here and want to contribute a recipe, follow the pattern in `other-hardware\u002Frtx6000pro\u002F`: a folder, a tuned `docker-compose.yml`, and a README explaining the differences from the DGX Spark baseline.\n\n---\n\n## In-depth: the abliteration methodology\n\n### What abliteration is\n\nAbliteration is a post-training intervention that removes the **refusal direction** from a model's residual stream: the linear subspace, identified empirically by Arditi et al. (2024), that mediates a transformer's decision to refuse a prompt. The technique works because in well-aligned chat models, refusal is mediated by a *single dominant direction*: project that direction out of the residual stream at every layer and the model loses its ability to route into refusal-shaped attractors.\n\nThe naive version, subtracting the refusal direction from the `o_proj` and `down_proj` weights, produces a model that no longer refuses. But it also tends to break the model: aggressive direction removal collapses capability, producing word-salad outputs and looping incoherence. The literature is full of \"uncensored\" releases that are also *broken* releases.\n\n### What \"lossless abliteration\" requires\n\nTo remove refusal *without* breaking capability, four things have to be done correctly:\n\n1. 
**Identify the refusal direction precisely** — using a sufficiently large harmful\u002Fharmless contrast set, with outlier-aware winsorization so a handful of high-norm prompts don't distort the steering vector.\n2. **Project orthogonally and norm-preservingly** — keeping the helpfulness-aligned signal intact (this is the NPBA contribution).\n3. **Search the strength × layer-scope hyperparameter space** — most projects pick one strength setting and ship; a real Pareto-front search over (refusals, KL) finds the trial that hits zero refusals at minimum capability damage.\n4. **Cross-validate against capability** — refusal-rate keyword scoring will *not* catch over-abliteration. Word-salad incoherence (\"I I cannot... less... I I I\") doesn't match any refusal marker, so the optimizer marks it compliant. You have to actually run the resulting model against a capability spot-check.\n\nThe AEON pipeline does all four.\n\n### The AEON pipeline (4 stages)\n\n```\nQwen\u002FQwen3.6-27B (BF16, 51 GB, heavy RLHF safety training)\n          │\n          │  Stage 1 — SSM conv1d outlier repair (FernflowerAI)\n          ▼\nQwen3.6-27B-base-repaired  (8 late-layer SSM blocks rescaled)\n          │\n          │  Stage 2 — abliterix v1.4 abliteration (Optuna multi-objective)\n          ▼\nQwen3.6-27B-AEON-Ultimate-Uncensored  (BF16, 51 GB, trial 46\u002F50)\n          │\n          │  Stage 3 — capability cross-validation (10-axis spot-check)\n          ▼\nQwen3.6-27B-AEON-Ultimate-Uncensored  (validated, BF16 release)\n          │\n          │  Stage 4 — NVFP4 quantization (llm-compressor)\n          ▼\nQwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4  (26 GB, NVFP4 release)\n```\n\n### Stage 1 — SSM conv1d outlier repair\n\nPer FernflowerAI's empirical discovery, certain late SSM \u002F GatedDeltaNet blocks in Qwen 3.5 \u002F 3.6 hybrids have `linear_attn.conv1d.weight` σ inflated 50–100 % above the median across all SSM blocks. 
Left unrepaired, this manifests during long-context inference as coherence collapse and \"philosophizing\" loops, and it makes the model hypersensitive to downstream abliteration (amplifies the noise).\n\nThe repair: compute σ per block across all 48 SSM layers, flag any block where σ > 1.5× median, rescale weights by `α = median_σ \u002F σ_actual`.\n\nOn Qwen 3.6 27B, **8 outlier blocks** were detected and repaired: layers 52, 53, 56, 57, 58, 60, 61, 62, with α factors between 0.516 and 0.659. After repair, σ is uniform at 0.04267 across all SSM layers.\n\nThis is **not abliteration**. It is an upstream-model defect repair that must run *before* abliteration so the optimizer isn't fighting noise.\n\n### Stage 2 — abliterix multi-objective abliteration\n\n[`abliterix v1.4`](https:\u002F\u002Fgithub.com\u002Fwuwangzhang1216\u002Fabliterix) — a Heretic-derived multi-objective Optuna optimizer with native hybrid-attention support — was run with the configuration:\n\n```toml\n[steering]\nvector_method          = \"mean\"\ndecay_kernel           = \"linear\"\northogonal_projection  = true\nprojected_abliteration = true        # grimjim NPBA\nwinsorize_vectors      = true\nwinsorize_quantile     = 0.995\nweight_normalization   = \"none\"\ndisabled_components    = [\"attn.q_proj\", \"attn.k_proj\", \"attn.v_proj\"]\n# Q\u002FK\u002FV disabled: Qwen 3.6 has attn_output_gate=True which doubles\n# q_proj's output dim to (12288, 5120) — incompatible with abliterix's\n# standard projection math.\n\n[steering.component_strength_ranges]\n\"mlp.down_proj\" = [2.0, 10.0]\n\"attn.o_proj\"   = [1.0, 6.0]\n\n[kl]\ntarget          = 0.005\nprune_threshold = 0.5      # kill divergent trials at 100× target\n\n[optimization]\nnum_trials        = 50\nnum_warmup_trials = 15\n```\n\n50 trials (15 random warmup + 35 TPE-driven). Optuna explored a Pareto front of (refusals, KL) trade-offs. 
**Wall-clock: ~4 hours on a single RTX PRO 6000 Blackwell 96 GB.**\n\n### Stage 3 — capability cross-validation (the over-abliteration trap)\n\nA more aggressive Pareto point — trial 17, 0\u002F100 refusals at KL=0.00192 — was tested first and produced **word-salad capability outputs** (\"Here I I cannot... less... I I I...\"). abliterix's keyword-only refusal scoring did not flag this: the gibberish doesn't match any refusal marker, so the optimizer saw it as full compliance.\n\n**Trial 46's** gentler parameters preserved coherence *and* hit zero refusals on downstream capability testing:\n\n| Parameter | Trial 17 (broken) | **Trial 46 (winner)** |\n|---|---|---|\n| `vector_scope` | global | **per layer** |\n| `attn.o_proj.max_weight` | 2.50 | **1.56** (×1.6 gentler) |\n| `mlp.down_proj.max_weight` | 5.43 | **3.45** (×1.57 gentler) |\n| `mlp.down_proj.min_weight_distance` | 36.09 | 24.94 (narrower) |\n| **KL divergence** | 0.00192 | **0.00049** |\n| Smoke-test verdict | BROKEN (gibberish) | **COHERENT** |\n\nThe lesson: the lowest-refusal trial on a keyword-only metric is **not** necessarily the right trial to ship. Cross-validate against a true capability spot-check before you commit. Most public abliterations skip this step. We don't.\n\n### Stage 4 — NVFP4 quantization\n\nSee [the NVFP4 deep-dive section below](#in-depth-nvfp4-quantization).\n\n---\n\n## In-depth: NVFP4 quantization\n\n### What NVFP4 is\n\nNVFP4 is NVIDIA's 4-bit floating-point quantization format introduced for Blackwell-and-later silicon. 
It is **not a \"compressed lite\" version** of a model — it is the production deployment format NVIDIA designed for the next decade of inference: accuracy on par with BF16, throughput of true 4-bit compute, no compromise required.\n\nThe format specification:\n\n| Component | Details |\n|---|---|\n| **Element format** | E2M1 — 4-bit float (1 sign \u002F 2-bit exponent \u002F 1-bit mantissa) |\n| **Block size** | 16 weights per scaling block |\n| **Per-block scale** | **FP8 E4M3** — 8-bit *floating-point* per block |\n| **Per-tensor scale** | FP32 (single global scale per tensor) |\n| **Sign convention** | Symmetric signed |\n\n### Why the two-level scaling matters\n\nOlder 4-bit formats (INT4, Q4_0, Q4_K, NF4) use **integer** per-block scales. When the local weight distribution is heavy-tailed — as it almost always is in trained transformers — integer scales fail to resolve the long tail without crushing the bulk distribution.\n\nNVFP4's **FP8 E4M3 per-block scales** dramatically out-resolve INT8 scales because FP8 itself is a floating-point number — it can span a 3+ orders-of-magnitude dynamic range within each block while still maintaining fine-grained resolution near the median weight value. Combine that with a global FP32 per-tensor scale and you get a three-level hierarchy: per-tensor FP32 → per-block FP8 → per-element E2M1, where each level absorbs a different scale of variation.\n\nThe combined effect is that local outliers — the long-tailed weights that destroy older 4-bit formats — are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid.\n\n### Why it's effectively lossless\n\nTypical KL divergence vs the BF16 source for recipe-class NVFP4 quantization is **≤ 0.001**, which is **below the noise floor of stochastic sampling**. In practical terms: a user cannot observe the difference between this model and its BF16 source. 
The variance from changing your `temperature` or `seed` exceeds the variance from BF16 → NVFP4.\n\n### Native Blackwell tensor-core throughput\n\nOn Blackwell-class silicon, NVFP4 runs at **full FP4 tensor-core throughput** through native paths:\n\n- **B100 \u002F B200**: `tcgen05` \u002F UTCQMMA instructions — fastest NVFP4 hardware available.\n- **DGX Spark (GB10 \u002F sm_121a)**: SM121-specific CUTLASS NVFP4 kernels (the [`vllm-aeon-ultimate-dflash`](https:\u002F\u002Fgithub.com\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-DFlash\u002Fpkgs\u002Fcontainer\u002Fvllm-aeon-ultimate-dflash) container ships these patched in).\n- **RTX PRO 6000 Blackwell (sm_120)**: standard CUTLASS NVFP4 path.\n\nThe GPU does **not** dequantize back to BF16 internally on these paths. You get the speed of true 4-bit compute *and* the accuracy of 16-bit weights at the same time.\n\nOn older silicon (A100, H100), NVFP4 dequantizes at kernel boundaries — works correctly but no throughput advantage. For those cards use the BF16 release directly.\n\n### What stays BF16 (and why)\n\nNot every layer is quantized. Two categories of weights are deliberately preserved at BF16:\n\n1. **Vision tower** (333 keys) — multimodal inference must not degrade. Vision encoders are sensitive to weight precision and are tiny in absolute size (~100 MB), so the cost is negligible.\n2. **Linear-attention \u002F GatedDeltaNet layers** (432 keys, 48 layers × 9 modules) — Mamba \u002F SSM state dynamics are mathematically incompatible with FP4. The hidden-state recurrence multiplies state vectors by quantized weights at every step; even tiny per-step error compounds across the sequence and the state collapses. 
**FP4 on SSM weights is not a precision\u002Faccuracy tradeoff — it is a correctness failure.**\n\nFP4 is applied only where it is well-behaved: the 16 full-attention layers' q\u002Fk\u002Fv\u002Fo projections, plus all MLPs.\n\n### Verification (post-quantization)\n\n| Check | Result |\n|---|---|\n| Total keys in checkpoint | 1952 |\n| Quantized full-attention projections | 64 (16 layers × q\u002Fk\u002Fv\u002Fo) |\n| `linear_attn.*` keys preserved BF16 | 432 |\n| `visual.*` keys preserved BF16 | 333 |\n| Norm keys preserved BF16 | 319 |\n| `lm_head` and `embed_tokens` preserved BF16 | ✓ |\n| NVFP4-packed weights present | ✓ |\n| `input_global_scale` magnitudes | 142–346 (healthy) |\n\nQuant tool: `llm-compressor 0.10.1.dev107` with `QuantizationModifier(scheme=\"NVFP4\")`. Calibration: open-platypus, 512 samples × 4096 tokens. Pipeline: `sequential` with `sequential_targets=[\"Qwen3_5DecoderLayer\"]` (required for hybrid stacks; auto-discovery silently skips layers). Loader: `AutoModelForImageTextToText` to preserve the multimodal class.\n\nWall-clock quant time: **~57 minutes on 1× RTX PRO 6000 Blackwell 96 GB.**\n\n---\n\n## Capability enhancement: the lifted \"safety tax\"\n\nModern safety alignment is not free. It imposes what Huang et al. 2025 call the **\"safety tax\"** (arXiv:2503.00555) — a systematic suppression of reasoning capacity that emerges because the RLHF process trains the model to route certain cognitive operations through refusal-shaped attractors, even when those attractors are *not* activated by the output. The refusal direction is not a binary gate; it is a weighted drag on the residual stream that rebalances the token distribution at every forward pass, whether or not the eventual generation contains a refusal.\n\nRemoving the refusal direction eliminates that drag. Concretely, this produces three observable shifts:\n\n1. 
**Longer, more committed chains of thought.** Aligned models often hedge partway through a reasoning chain (\"but of course, one should be careful…\") in response to topics that tangentially brush the refusal subspace — even when the prompt is entirely benign. AEON-Ultimate follows reasoning chains to their logical conclusion without mid-stream hedging.\n2. **Improved adversarial-example and red-team reasoning.** Without self-censorship overhead, the model can analyze attack surfaces, vulnerabilities, and failure modes at full capacity — invaluable for security research, penetration testing, and AI-alignment red-teaming.\n3. **Cleaner calibration on contested topics.** Aligned models often express uncertainty on topics where they are actually highly confident, because the refusal gradient creates an attractor basin near \"I'm not sure\" for any topic that pattern-matches the safety training distribution. AEON-Ultimate reports its actual confidence.\n\n### Empirical literature on capability gains from abliteration\n\nThe published evidence is consistent: post-training refusal-direction removal at low KL produces **measurable benchmark gains** over the aligned base.\n\n| Study | Model | Intervention | Result |\n|---|---|---|---|\n| grimjim (2025) | Gemma-3-12B-IT | NPBA abliteration | **+13.9 % NatInt reasoning** |\n| Young (2025), arXiv:2512.13655 | Yi-1.5-9B | DECCP abliteration | +1.51 pp GSM8K |\n| Xie et al. (2026) | (DGR safety-tax mitigation) | targeted safety-direction removal on DirectRefusal | **+30.2 % reasoning recovery** |\n\nAEON-Ultimate sits in the **KL \u003C 0.001** regime where these gains are most commonly reported. 
The capability spot-checks (10\u002F10 coherent across math, code, reasoning, knowledge, and long-form) and the DGX Spark serving benchmarks at the top of this README are the current public measurement set.\n\n### What the lifted overhead also means\n\nThe same lifted overhead means the model will now produce content the base would refuse: harmful-tool construction, violence, graphic sexuality, contested ideologies, jurisdictionally illegal content, and content a reasonable person might find offensive.\n\nThe model makes no internal judgment calls about *whether* to comply. It complies. **The user becomes the safety layer.** This is by design — the intended use cases (security research, red-team operations, alignment research, creative writing without editorial constraints, serving users in jurisdictions where the base's guardrails misalign with legitimate local frameworks) all benefit from a model that reliably executes the user's instruction rather than second-guessing it. But that same reliability is a threat vector when the user's instruction is malicious.\n\nWielding an uncensored model is genuinely different from wielding an aligned one. It requires a different operational stance — one where the user, not the model, is the safety layer. See [the responsibility section below](#responsibility-arbitration-and-use).\n\n---\n\n## Configuration reference\n\n### NVFP4 on DGX Spark — full flag explanation (v4 XS + DFlash config)\n\n| Flag | Value | Why |\n|---|---|---|\n| `--quantization modelopt` | required for the XS body | The recommended `-Multimodal-NVFP4-MTP-XS` checkpoint is modelopt format. Use `compressed-tensors` only with the older regular `-NVFP4` body. |\n| `--kv-cache-dtype auto` | required | BF16 KV cache. TurboQuant K8V4 (3.76× compression) is *unsupported* on hybrid attention + Mamba models — vLLM raises a deliberate guard. The 27B-AEON stack stays on uniform BF16 KV until a layer-skipping option ships. 
|\n| (async scheduling) | **enabled (default)** | Async scheduling overlaps scheduler work with GPU work and is part of the v4 serving profile. Disable only for a deliberate TTFT-only experiment. |\n| `--max-model-len` | `256000` gateway default, `200000` solo LLM production | 256K exposes almost the full trained context for agent gateways. Use 200K when the LLM is the only major GPU service and you want more full-context KV safety. |\n| `--max-num-seqs` | `64` gateway default, `16` solo full-context production | 64 gives agentic gateways room for one large working chat plus many short-lived subagents. Drop to 16 when you expect many sequences near the full 200K context window. |\n| `--max-num-batched-tokens` | `32768` | Prefill budget. This is the practical ceiling on Spark; above 32K, compile coverage and unified-memory pressure get worse. |\n| `--gpu-memory-utilization` | `0.75` gateway default, `0.85` solo LLM production | Use 0.75 when ASR, TTS, embeddings, ComfyUI, or other GPU services share the Spark. 0.85 is the long-context LLM-only cap. **Do not exceed 0.88 on DGX Spark** — unified memory thrashes above that. |\n| `--enable-chunked-prefill` | on | Required for long-context workloads to avoid prefill OOM. |\n| `--enable-prefix-caching` \u002F `--no-enable-prefix-caching` | workload-dependent | For pure DFlash gateway serving, prefix caching can be a major TTFT win when many agents share the same stable system\u002Fpersona\u002Fskills\u002Ftool prefix. In our repeated-prefix probe, a 37,837-token shared prefix dropped from ~26 s uncached TTFT to ~0.7 s cached follow-ups. For DDTree research modes, keep prefix caching off until branch-state replay and accepted-branch commit are quality-stable. |\n| `--load-format safetensors` | required | NVFP4 weights ship as safetensors. |\n| `--trust-remote-code` | required | Qwen 3.6 uses custom modeling code. |\n| `--enable-auto-tool-choice` | on | Enables OpenAI-compatible tool calling. 
|\n| `--tool-call-parser qwen3_coder` | required for tools | Parses Qwen 3.6's tool-call XML. |\n| `--reasoning-parser qwen3` | required for thinking mode | Parses `\u003Cthink>` blocks. |\n| `--attention-backend flash_attn` | required | Stable on sm_121a. |\n| `--limit-mm-per-prompt '{\"image\":4,\"video\":2}'` | recommended | Hard caps on multimodal inputs per request. |\n| `--mm-encoder-tp-mode data` | required | Vision encoder TP strategy. |\n| `--mm-processor-cache-type shm` | recommended | Shared-memory mm processor cache. |\n| `--mm-shm-cache-max-object-size-mb 256` | recommended | Lets larger Qwen3.6 image\u002Fvideo processor objects fit in the multimodal shared-memory cache. |\n| `--speculative-config '{\"method\":\"dflash\",\"model\":\"\u002Fmodels\u002Fdflash-drafter\",\"num_speculative_tokens\":15}'` | recommended | DFlash spec-decode at k=15. This is the v4 Spark recipe benchmarked at the top of the README. |\n\n### Required environment variables (DGX Spark NVFP4 \u002F v4 image)\n\n| Variable | Value | Why |\n|---|---|---|\n| `VLLM_ALLOW_LONG_MAX_MODEL_LEN` | `1` | Allows `--max-model-len` past the model's hard ceiling assertion. |\n| `TORCH_CUDA_ARCH_LIST` | `12.1a` | sm_121a-specific. |\n| `PYTORCH_CUDA_ALLOC_CONF` | `expandable_segments:True` | Reduces fragmentation under long-context KV churn. |\n| `TORCH_MATMUL_PRECISION` | `high` | Standard precision for FP4 matmul paths. |\n| `NVIDIA_FORWARD_COMPAT` | `1` | DGX Spark forward-compat shim. |\n| `NVIDIA_DISABLE_REQUIRE` | `1` | Disables driver version assertion — required because GB10 ships with a driver newer than vLLM's `nvidia-require-cuda` baseline. |\n| `ENABLE_NVFP4_SM100` | `0` | Required by PR #40191 for sm_121a-only builds. Without it, `vllm._C_stable_libtorch` fails to import — depends on SM100-only `mxfp4_experts_quant` kernels that don't exist on SM121. 
|\n| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | Defensive: this model is dense (no MoE); disabling the FlashInfer FP4 MoE auto-probe avoids SM121 PTX rejection log spam during boot. |\n| `VLLM_TEST_FORCE_FP8_MARLIN` | `0` | Override baked test-image defaults; keep production NVFP4 path selection. |\n| `VLLM_USE_FLASHINFER_SAMPLER` | `1` | FlashInfer CUDA top-k\u002Ftop-p sampler for normal sampled requests. |\n\n### BF16 on A100 \u002F H100 — full flag explanation\n\n| Flag | 80 GB profile | 96 GB profile | Why |\n|---|---|---|---|\n| `--max-model-len` | `131072` | `262144` | Half-context on 80 GB to leave KV headroom. |\n| `--max-num-seqs` | `16` | `32` | 80 GB cards leave ~21 GB for KV after 0.90 utilization. |\n| `--max-num-batched-tokens` | `8192` | `16384` | Safe prefill. |\n| `--gpu-memory-utilization` | `0.90` | `0.90` | Standard for dedicated VRAM (not unified). |\n\n---\n\n## Responsibility, arbitration, and use\n\nThis is an uncensored model. Read the [model card's User Responsibility & Arbitration Clause](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16#user-responsibility--arbitration-clause) before deploying. Summary:\n\n- You are solely responsible for prompts, outputs, and downstream actions.\n- Provided \"AS IS\" — no warranty of any kind.\n- You implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). A production deployment without those layers is unsafe by construction and is not a supported use case.\n- Disputes go to binding individual arbitration. Class action waived.\n- You indemnify the authors from claims arising from your use.\n\nThe model has no opinions of its own. You supply the opinions, the judgment, and the ethics. 
The outputs carry your fingerprints, not the model's.\n\n---\n\n## Provenance & credits\n\n- **Base model**: [`Qwen\u002FQwen3.6-27B`](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3.6-27B) — Alibaba's Qwen team.\n- **SSM `conv1d` outlier repair methodology**: FernflowerAI (multiple Reddit r\u002FLocalLLaMA posts, late 2025 \u002F early 2026).\n- **Abliteration tool**: [`abliterix v1.4`](https:\u002F\u002Fgithub.com\u002Fwuwangzhang1216\u002Fabliterix) by Wangzhang Wu — Heretic-derived multi-objective Optuna optimizer with native hybrid Mamba\u002Fattention support, projected-abliteration, and expert-granular steering.\n- **Heretic (upstream of abliterix)**: [`p-e-w\u002Fheretic`](https:\u002F\u002Fgithub.com\u002Fp-e-w\u002Fheretic) by Philipp Emanuel Weidmann.\n- **Original abliteration concept**: Arditi et al. 2024 — *\"Refusal in Language Models Is Mediated by a Single Direction\"* (arXiv:2406.11717).\n- **NPBA \u002F projected-abliteration theory**: grimjim 2025 — norm-preserving biprojected abliteration.\n- **Safety-tax quantification**: Huang et al. 2025 (arXiv:2503.00555); Xie et al. 
2026 (DGR, safety-tax mitigation).\n- **NVFP4 specification**: [NVIDIA NVFP4 introduction](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-nvfp4-for-efficient-and-accurate-low-precision-inference\u002F).\n- **Quantization tool**: [`llm-compressor`](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor) by vllm-project.\n- **Patched vLLM container**: [`AEON-7\u002FQwen3.6-NVFP4-DFlash`](https:\u002F\u002Fgithub.com\u002FAEON-7\u002FQwen3.6-NVFP4-DFlash) — source-built vLLM image with sm_121a CUTLASS NVFP4 patches.\n- **This release's pipeline, configuration, validation, marketing, and packaging**: AEON-7.\n\n---\n\n## License\n\nApache 2.0, inherited from `Qwen\u002FQwen3.6-27B`.\n\n---\n\n\u003Cdiv align=\"center\">\n\n**Built over 72 hours · Hundreds of research agents · Lossless · Capability-enhanced**\n\n[BF16](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-BF16) &nbsp;·&nbsp; [NVFP4](https:\u002F\u002Fhuggingface.co\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4) &nbsp;·&nbsp; [Container](https:\u002F\u002Fgithub.com\u002FAEON-7\u002FQwen3.6-27B-AEON-Ultimate-Uncensored-DFlash\u002Fpkgs\u002Fcontainer\u002Fvllm-aeon-ultimate-dflash)\n\n\u003C\u002Fdiv>\n\n---\n\n## ☕ Support the work\n\nIf this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.\n\n\u003Ctable align=\"center\">\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cstrong>₿ Bitcoin (BTC)\u003C\u002Fstrong>\u003Cbr\u002F>\n      \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FAEON-7\u002FAEON-7\u002Fmain\u002Fassets\u002Fqr\u002Fbtc.png\" alt=\"BTC QR\" width=\"200\"\u002F>\u003Cbr\u002F>\n      \u003Csub>\u003Ccode>bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4\u003C\u002Fcode>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cstrong>Ξ Ethereum 
(ETH)\u003C\u002Fstrong>\u003Cbr\u002F>\n      \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FAEON-7\u002FAEON-7\u002Fmain\u002Fassets\u002Fqr\u002Feth.png\" alt=\"ETH QR\" width=\"200\"\u002F>\u003Cbr\u002F>\n      \u003Csub>\u003Ccode>0x1512667F6D61454ad531d2E45C0a5d1fd82D0500\u003C\u002Fcode>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cstrong>◎ Solana (SOL)\u003C\u002Fstrong>\u003Cbr\u002F>\n      \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FAEON-7\u002FAEON-7\u002Fmain\u002Fassets\u002Fqr\u002Fsol.png\" alt=\"SOL QR\" width=\"200\"\u002F>\u003Cbr\u002F>\n      \u003Csub>\u003Ccode>DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t\u003C\u002Fcode>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cstrong>ⓜ Monero (XMR)\u003C\u002Fstrong>\u003Cbr\u002F>\n      \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FAEON-7\u002FAEON-7\u002Fmain\u002Fassets\u002Fqr\u002Fxmr.png\" alt=\"XMR QR\" width=\"200\"\u002F>\u003Cbr\u002F>\n      \u003Csub>\u003Ccode>836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd\u003C\u002Fcode>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n> **Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens** can be sent to the same Ethereum address.\n","该项目是对Qwen3.6-27B模型进行无损压缩和性能增强的版本，特别针对NVFP4硬件量化进行了优化，适用于DGX Spark\u002FBlackwell平台。核心功能包括使用BF16（51GB）和NVFP4（26GB）格式部署模型，提供了详细的部署指南、docker-compose配置以及快速启动脚本。通过这些技术手段，项目实现了在单流解码速度上相比原始基线提升了约258%，显著提高了模型在编码、数学计算、推理等任务中的响应速度。适合需要高性能大语言模型处理能力且资源受限的企业级应用场景。",2,"2026-06-11 02:47:39","CREATED_QUERY"]