[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74700":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},74700,"iris.c","antirez\u002Firis.c","antirez","Flux 2 image generation model pure C inference",null,"C",1951,140,12,0,1,14,19.45,"MIT License",false,"main",true,[],"2026-06-12 02:03:27","# Iris - a C inference pipeline for image synthesis models\n\nIris is an inference pipeline that generates images from text prompts using open weights diffusion transformer models. It is implemented entirely in C, with zero external dependencies beyond the C standard library. MPS and BLAS acceleration are optional but recommended. Under macOS, a BLAS API is part of the system, so nothing is required.\n\nThe name comes from the Greek goddess Iris, messenger of the gods and personification of the rainbow.\n\nSupported model families:\n\n- **[FLUX.2 Klein](https:\u002F\u002Fbfl.ai\u002Fmodels\u002Fflux-2-klein)** (by [Black Forest Labs](https:\u002F\u002Fbfl.ai\u002F)):\n  - **4B distilled** (4 steps, auto guidance set to 1, very fast).\n  - **4B base** (50 steps for max quality, or less. Classifier-Free Diffusion Guidance, much slower but more generation variety).\n  - **9B distilled** (4 steps, larger model, higher quality. Non-commercial license).\n  - **9B base** (50 steps, CFG, highest quality. 
Non-commercial license).\n- **[Z-Image-Turbo](https:\u002F\u002Fhuggingface.co\u002FTongyi-MAI\u002FZ-Image-Turbo)** (by Tongyi-MAI):\n  - **6B** (8 NFE \u002F 9 scheduler steps, no CFG, fast).\n\n## Quick Start\n\n```bash\n# Build (choose your backend)\nmake mps       # Apple Silicon (fastest)\n# or: make blas    # Intel Mac \u002F Linux with OpenBLAS\n# or: make generic # Pure C, no dependencies\n\n# Download a model (~16GB) - pick one:\n.\u002Fdownload_model.sh 4b                   # using curl\n# or: pip install huggingface_hub && python download_model.py 4b\n\n# Generate an image\n.\u002Firis -d flux-klein-4b -p \"A woman wearing sunglasses\" -o output.png\n```\n\nTo try the base model instead of the distilled one (much slower, higher quality), use the following instructions. The default is 50 steps; on a slow computer, pass `-s 10` instead, which still works well enough for testing (10 seconds to generate a 256x256 image on a MacBook M3 Max).\n```\n.\u002Fdownload_model.sh 4b-base\n# or: pip install huggingface_hub && python download_model.py 4b-base\n.\u002Firis -d flux-klein-4b-base -p \"A woman wearing sunglasses\" -o output.png\n```\n\nIf you want to try the 9B model (higher quality, non-commercial license, ~30GB download):\n```bash\n# 9B is a gated model - you need a HuggingFace token\n# 1. Accept the license at https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-klein-9B\n# 2. Get your token from https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\n.\u002Fdownload_model.sh 9b --token YOUR_TOKEN\n# or: python download_model.py 9b --token YOUR_TOKEN\n# or: set HF_TOKEN env var\n.\u002Firis -d flux-klein-9b -p \"A woman wearing sunglasses\" -o output.png\n```\n\nFor Z-Image-Turbo:\n```bash\n# Download Z-Image-Turbo (~12GB)\npip install huggingface_hub && python download_model.py zimage-turbo\n.\u002Firis -d zimage-turbo -p \"a fish\" -o fish.png\n```\n\nThat's it. 
No Python runtime or CUDA toolkit required at inference time.\n\n## Example Output\n\n![Woman with sunglasses](images\u002Fwoman_with_sunglasses.png)\n\n*Generated with: `.\u002Firis -d flux-klein-4b -p \"A picture of a woman in 1960 America. Sunglasses. ASA 400 film. Black and White.\" -W 512 -H 512 -o woman.png`*\n\n### Image-to-Image Example\n\n![antirez to drawing](images\u002Fantirez_to_drawing.png)\n\n*Generated with: `.\u002Firis -i antirez.png -o antirez_to_drawing.png -p \"make it a drawing\" -d flux-klein-4b`*\n\n## Features\n\n- **Zero dependencies**: Pure C implementation, works standalone. BLAS optional for ~30x speedup (Apple Accelerate on macOS, OpenBLAS on Linux)\n- **Metal GPU acceleration**: Automatic on Apple Silicon Macs. Performance matches PyTorch's optimized MPS pipeline\n- **Runs where Python can't**: Memory-mapped weights (default) enable inference on 8GB RAM systems where the Python ML stack cannot run at all\n- **Text-to-image**: Generate images from text prompts\n- **Image-to-image**: Transform existing images guided by prompts (Flux models)\n- **Multi-reference**: Combine multiple reference images (e.g., `-i car.png -i beach.png` for \"car on beach\")\n- **Integrated text encoder**: Qwen3 encoder built-in (4B or 8B depending on model), no external embedding computation needed\n- **Memory efficient**: Automatic encoder release after encoding (up to ~16GB freed)\n- **Memory-mapped weights**: Enabled by default. Reduces peak memory from ~16GB to ~4-5GB. Fastest mode on MPS; BLAS users with plenty of RAM may prefer `--no-mmap` for faster inference\n- **Size-independent seeds**: Same seed produces similar compositions at different resolutions. 
Explore at 256x256, then render at 512x512 with the same seed\n- **Terminal image display**: watch the resulting image without leaving your terminal (Ghostty, Kitty, iTerm2, WezTerm, or Konsole).\n\n### Terminal Image Display\n\n![Kitty protocol example](images\u002Fkitty-example.png)\n\nDisplay generated images directly in your terminal with `--show`, or watch the denoising process step-by-step with `--show-steps`:\n\n```bash\n# Display final image in terminal (auto-detects Kitty\u002FGhostty\u002FiTerm2\u002FWezTerm\u002FKonsole)\n.\u002Firis -d flux-klein-4b -p \"a cute robot\" -o robot.png --show\n\n# Display each denoising step (slower, but interesting to watch)\n.\u002Firis -d flux-klein-4b -p \"a cute robot\" -o robot.png --show-steps\n```\n\nRequires a terminal supporting the [Kitty graphics protocol](https:\u002F\u002Fsw.kovidgoyal.net\u002Fkitty\u002Fgraphics-protocol\u002F) (such as [Kitty](https:\u002F\u002Fsw.kovidgoyal.net\u002Fkitty\u002F) or [Ghostty](https:\u002F\u002Fghostty.org\u002F)), the iTerm2 inline image protocol ([iTerm2](https:\u002F\u002Fiterm2.com\u002F), [WezTerm](https:\u002F\u002Fwezfurlong.org\u002Fwezterm\u002F)), or [Konsole](https:\u002F\u002Fkonsole.kde.org\u002F). Terminal type is auto-detected from environment variables.\n\nUse `--zoom N` to adjust the display size (default: 2 for Retina displays, use 1 for non-HiDPI screens).\n\n## Usage\n\n### Text-to-Image\n\n```bash\n.\u002Firis -d flux-klein-4b -p \"A fluffy orange cat sitting on a windowsill\" -o cat.png\n```\n\n### Image-to-Image\n\nTransform an existing image based on a prompt:\n\n```bash\n.\u002Firis -d flux-klein-4b -p \"oil painting style\" -i photo.png -o painting.png\n```\n\nFLUX.2 uses **in-context conditioning** for image-to-image generation. Unlike traditional approaches that add noise to the input image, FLUX.2 passes the reference image as additional tokens that the model can attend to during generation. 
This means:\n\n- The model \"sees\" your input image and uses it as a reference\n- The prompt describes what you want the output to look like\n- Results tend to preserve the composition while applying the described transformation\n\n**Tips for good results:**\n- Use descriptive prompts that describe the desired output, not instructions\n- Good: `\"oil painting of a woman with sunglasses, impressionist style\"`\n- Less good: `\"make it an oil painting\"` (instructional prompts may work less well)\n\n**Super Resolution:** Since the reference image can be a different size than the output, you can use img2img for upscaling:\n\n```bash\n.\u002Firis -d flux-klein-4b -i small.png -W 1024 -H 1024 -o big.png -p \"Create an exact copy of the input image.\"\n```\n\nThe model will generate a higher-resolution version while preserving the composition and details of the input.\n\n### Multi-Reference Generation\n\nCombine elements from multiple reference images:\n\n```bash\n.\u002Firis -d flux-klein-4b -i car.png -i beach.png -p \"a sports car on the beach\" -o result.png\n```\n\nEach reference image is encoded separately and passed to the transformer with different positional embeddings (T=10, T=20, T=30, ...). The model attends to all references during generation, allowing it to combine elements from each.\n\n**Example:**\n- Reference 1: A red sports car\n- Reference 2: A tropical beach with palm trees\n- Prompt: \"combine the two images\"\n- Result: A red sports car on a tropical beach\n\nYou can specify up to 16 reference images with multiple `-i` flags. The prompt guides how the references are combined.\n\n### Interactive CLI Mode\n\nStart without `-p` to enter interactive mode:\n\n```bash\n.\u002Firis -d flux-klein-4b\n```\n\nGenerate images by typing prompts. 
Each image gets a `$N` reference ID:\n\n```\niris> a red sports car\nDone -> \u002Ftmp\u002Firis-...\u002Fimage-0001.png (ref $0)\n\niris> a tropical beach\nDone -> \u002Ftmp\u002Firis-...\u002Fimage-0002.png (ref $1)\n\niris> $0 $1 combine them\nGenerating 256x256 (multi-ref, 2 images)...\nDone -> \u002Ftmp\u002Firis-...\u002Fimage-0003.png (ref $2)\n```\n\n**Prompt syntax:**\n- `prompt` - text-to-image\n- `512x512 prompt` - set size inline\n- `$ prompt` - img2img with last image\n- `$N prompt` - img2img with reference $N\n- `$0 $3 prompt` - multi-reference (combine images)\n\n**Commands:** `!help`, `!save`, `!load`, `!seed`, `!size`, `!steps`, `!guidance`, `!linear`, `!power`, `!explore`, `!show`, `!quit`\n\n### Command Line Options\n\n**Required:**\n```\n-d, --dir PATH        Path to model directory\n-p, --prompt TEXT     Text prompt for generation\n-o, --output PATH     Output image path (.png or .ppm)\n```\n\n**Generation options:**\n```\n-W, --width N         Output width in pixels (default: 256)\n-H, --height N        Output height in pixels (default: 256)\n-s, --steps N         Sampling steps (default: auto, 4 distilled \u002F 50 base \u002F 9 zimage)\n-S, --seed N          Random seed for reproducibility\n-g, --guidance N      CFG guidance scale (default: auto, 1.0 distilled \u002F 4.0 base \u002F 0.0 zimage)\n    --linear          Use linear timestep schedule (see below)\n    --power           Use power curve timestep schedule (see below)\n    --power-alpha N   Set power schedule exponent (default: 2.0)\n    --base            Force base model mode (undistilled, CFG enabled)\n```\n\n**Image-to-image options:**\n```\n-i, --input PATH      Reference image (can be specified multiple times)\n```\n\n**Output options:**\n```\n-q, --quiet           Silent mode, no output\n-v, --verbose         Show detailed config and timing info\n    --show            Display image in terminal (auto-detects Kitty\u002FGhostty\u002FiTerm2\u002FWezTerm\u002FKonsole)\n    
--show-steps      Display each denoising step (slower)\n    --zoom N          Terminal image zoom factor (default: 2 for Retina)\n```\n\n**Other options:**\n```\n-m, --mmap            Memory-mapped weights (default, fastest on MPS)\n    --no-mmap         Disable mmap, load all weights upfront\n    --no-license-info Suppress non-commercial license warning (9B model)\n-e, --embeddings PATH Load pre-computed text embeddings (advanced)\n-h, --help            Show help\n```\n\n## Reproducibility\n\nThe seed is always printed to stderr, even when random:\n```\n$ .\u002Firis -d flux-klein-4b -p \"a landscape\" -o out.png\nSeed: 1705612345\n...\nSaving... out.png 256x256 (0.1s)\n```\n\nTo reproduce the same image, use the printed seed:\n```\n$ .\u002Firis -d flux-klein-4b -p \"a landscape\" -o out.png -S 1705612345\n```\n\n## PNG Metadata\n\nGenerated PNG images include metadata with the seed and model information, so you can always recover the seed even if you didn't save the terminal output:\n\n```bash\n# Using exiftool\nexiftool image.png | grep iris\n\n# Using Python\u002FPIL\npython3 -c \"from PIL import Image; print(Image.open('image.png').info)\"\n\n# Using ImageMagick\nidentify -verbose image.png | grep -A1 \"Properties:\"\n```\n\nThe following metadata fields are stored:\n- `iris:seed` - The random seed used for generation\n- `iris:model` - The model used\n- `Software` - Program identifier\n\n## Building\n\nChoose a backend when building:\n\n```bash\nmake            # Show available backends\nmake generic    # Pure C, no dependencies (slow)\nmake blas       # BLAS acceleration (~30x faster)\nmake mps        # Apple Silicon Metal GPU (fastest, macOS only)\n```\n\n**Recommended:**\n- macOS Apple Silicon: `make mps`\n- macOS Intel: `make blas`\n- Linux with OpenBLAS: `make blas`\n- Linux without OpenBLAS: `make generic`\n\nFor `make blas` on Linux, install OpenBLAS first:\n```bash\n# Ubuntu\u002FDebian\nsudo apt install libopenblas-dev\n\n# Fedora\nsudo dnf install 
openblas-devel\n```\n\nOther targets:\n```bash\nmake clean      # Clean build artifacts\nmake info       # Show available backends for this platform\n```\n\n## Testing\n\nRun the test suite to verify your build produces correct output:\n\n```bash\nmake test        # Run all tests\nmake test-quick  # Run only the quick 64x64 test\n```\n\nThe tests compare generated images against reference images in `test_vectors\u002F`. A test passes if the maximum pixel difference is within tolerance (to allow for minor floating-point variations across platforms).\n\n**Test cases:**\n| Test | Size | Steps | Purpose |\n|------|------|-------|---------|\n| Quick | 64x64 | 2 | Fast txt2img sanity check |\n| Full | 512x512 | 4 | Validates txt2img at larger resolution |\n| img2img | 256x256 | 4 | Validates image-to-image transformation |\n| Z-Image | 256x256 | 2 | Z-Image smoke test (auto-detected) |\n\nYou can also run the test script directly for more options:\n```bash\npython3 run_test.py --help\npython3 run_test.py --quick\npython3 run_test.py --flux-binary .\u002Firis --model-dir \u002Fpath\u002Fto\u002Fmodel\n```\n\n## Model Download\n\nDownload model weights from HuggingFace using one of these methods:\n\n**4B Distilled model** (~16GB, fast 4-step inference):\n```bash\n.\u002Fdownload_model.sh 4b                   # using curl\n# or: python download_model.py 4b        # using huggingface_hub\n```\n\n**4B Base model** (~16GB, 50-step inference with CFG, higher quality):\n```bash\n.\u002Fdownload_model.sh 4b-base\n# or: python download_model.py 4b-base\n```\n\n**9B models** (~30GB, higher quality, non-commercial license):\n```bash\n# 9B models are gated - require HuggingFace authentication\n# 1. Accept the license at https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-klein-9B\n# 2. 
Get a token from https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\n.\u002Fdownload_model.sh 9b --token YOUR_TOKEN       # distilled\n.\u002Fdownload_model.sh 9b-base --token YOUR_TOKEN   # base (CFG, highest quality)\n# or: python download_model.py 9b --token YOUR_TOKEN\n# You can also set the HF_TOKEN environment variable\n```\n\n**Z-Image-Turbo** (~12GB):\n```bash\npip install huggingface_hub && python download_model.py zimage-turbo\n```\n\n| Model | Directory | Size | Components |\n|-------|-----------|------|------------|\n| 4B distilled | `.\u002Fflux-klein-4b` | ~16GB | VAE (~300MB), Transformer (~4GB), Qwen3-4B (~8GB) |\n| 4B base | `.\u002Fflux-klein-4b-base` | ~16GB | VAE (~300MB), Transformer (~4GB), Qwen3-4B (~8GB) |\n| 9B distilled | `.\u002Fflux-klein-9b` | ~30GB | VAE (~300MB), Transformer (~17GB), Qwen3-8B (~15GB) |\n| 9B base | `.\u002Fflux-klein-9b-base` | ~30GB | VAE (~300MB), Transformer (~17GB), Qwen3-8B (~15GB) |\n| Z-Image-Turbo | `.\u002Fzimage-turbo` | ~12GB | VAE, Transformer (~6B), Qwen3-4B |\n\n## How Fast Is It?\n\nBenchmarks on **Apple M3 Max** (128GB RAM), Flux distilled model (4 steps).\n\nThe MPS implementation is faster than the PyTorch optimized pipeline at all resolutions.\n\n| Size | C (MPS) | PyTorch (MPS) |\n|------|---------|---------------|\n| 256x256 | 5.2s | 11s |\n| 512x512 | 7.6s | 13s |\n| 1024x1024 | 19s | 25s |\n\n**Notes:**\n- All times measured as wall clock, including model loading, no warmup. PyTorch times exclude library import overhead (~5-10s) to be fair.\n- The base model is roughly 25x slower (50 steps x 2 passes per step vs 4 steps x 1 pass). It actually produces acceptable results even with 10 steps, so you can tune quality\u002Ftime. 
The 25x figure is only approximate because it covers just the denoising steps: text encoding and the VAE take the same time for both models, although they account for a minor share of the generation time.\n- The C BLAS backend (CPU) is not shown.\n- The `make generic` backend (pure C, no BLAS) is approximately 30x slower than BLAS and not included in benchmarks.\n- The fastest implementation for Metal remains [the Draw Things app](https:\u002F\u002Fdrawthings.ai\u002F), which can produce a 1024x1024 image in just 14.23 seconds on the same hardware. Note, however, that it uses 6-bit quantized weights, while this implementation uses the official BF16 weights. The 6-bit quantization used by Draw Things provides both a big memory win and a moderate speed advantage (not nearly as much as it would in an LLM, where causal attention is dominated by memory bandwidth); accounting for this, the performance is comparable.\n\n### Community Benchmarks\n\nThe following timings for 512x512 generation (Flux distilled model, 4 steps) were reported by users. They give a rough indication of the performance you can expect, but results vary widely depending on the hardware, Metal availability (the code is heavily optimized for Apple Silicon via MPS), and whether BLAS acceleration is used on CPU.\n\n| Hardware | Backend | 512x512 |\n|----------|---------|---------|\n| M3 Ultra | MPS | 4.5s |\n| M3 Max | MPS | 7.6s |\n| MacBook Pro M4 | MPS | 19s |\n| MacBook Pro M1 Max | MPS | 39.9s |\n| Apple M1 Pro | MPS | 42.4s |\n| AMD Ryzen 7800X3D | BLAS | 47.8s |\n| Intel i5-1135G7 | BLAS | 218s |\n\n## Resolution Limits\n\n**Maximum resolution**: 1792x1792 pixels. 
The model produces good results up to this size; beyond this resolution image quality degrades significantly (this is a model limitation, not an implementation issue).\n\n**Minimum resolution**: 64x64 pixels.\n\nDimensions should be multiples of 16 (the VAE downsampling factor).\n\n## Model Architecture\n\n### FLUX.2 Klein\n\nAll Flux models share the same rectified flow transformer architecture, differing only in dimensions:\n\n| Component | 4B | 9B |\n|-----------|-----|-----|\n| Transformer hidden | 3072 | 4096 |\n| Attention heads | 24 | 32 |\n| Head dim | 128 | 128 |\n| Double blocks | 5 | 8 |\n| Single blocks | 20 | 24 |\n| Text Encoder | Qwen3-4B (2560 hidden, 36 layers) | Qwen3-8B (4096 hidden, 36 layers) |\n| VAE | AutoencoderKL, 128 latent channels, 8x spatial compression | Same |\n\nArchitecture dimensions are read automatically from the model's config JSON files at load time.\n\nThe distilled and base variants differ in inference:\n\n| | Distilled | Base |\n|---|-----------|------|\n| Steps | 4 | 50 (default) |\n| CFG guidance | 1.0 (none) | 4.0 (default) |\n| Passes per step | 1 | 2 (conditioned + unconditioned) |\n\nThe model type (distilled vs base, 4B vs 9B) is autodetected from the model directory. Use `--base` to force base model mode if autodetection fails.\n\n**Classifier-Free Guidance (CFG)**: The base model runs the transformer twice per step -- once with an empty prompt (unconditioned) and once with the real prompt (conditioned). The final velocity is `v = v_uncond + guidance * (v_cond - v_uncond)`. 
This makes each step ~2x slower than the distilled model, and the base model needs ~12x more steps, making it roughly 25x slower overall.\n\n### Z-Image-Turbo\n\nZ-Image-Turbo uses an S3-DiT single-stream architecture with noise and context refiners:\n\n| Component | Z-Image-Turbo |\n|-----------|---------------|\n| Transformer dim | 3840 |\n| Attention heads | 30 |\n| Head dim | 128 |\n| Main layers | 30 |\n| Refiner layers | 2 (noise) + 2 (context) |\n| Text Encoder | Qwen3-4B (hidden_states[-2]) |\n| VAE | 16 latent channels, patch_size=2 |\n\n## Timestep Schedules\n\nEach model family has its own default schedule. Alternative schedules (`--linear`, `--power`, `--sigmoid`, `--flowmatch`) are available for experimentation. Any schedule can be used with any model.\n\n### Flux Distilled (4B \u002F 9B)\n\nThe distilled models use a **shifted sigmoid** schedule (matching the official BFL distillation). This schedule concentrates most steps in the high-noise regime and is part of the distillation training -- changing it will produce poor results. Use 4 steps (the default).\n\n### Flux Base (4B-base \u002F 9B-base)\n\nThe base models default to the same shifted sigmoid schedule. At 50 steps it works very well, but 50 steps is slow. For quick tests, **10 steps** already produce decent results.\n\nThe shifted sigmoid can look extremely unbalanced at low step counts -- for example at 10 steps, the first 5 steps cover only 12% of the denoising trajectory while the last 5 cover 88%. The `--linear` flag switches to a uniform schedule where each step covers an equal portion of the trajectory, which sometimes produces more realistic results at reduced step counts. The `--power` flag provides a middle ground: a power curve (`t = 1 - (i\u002Fn)^a`) that is denser at the start and sparser at the end, but less extreme than the shifted sigmoid. 
The default exponent is 2.0 (quadratic); use `--power-alpha` to adjust it (1.0 = linear, higher = more front-loaded).\n\n```bash\n# Base model, 10 steps with default schedule\n.\u002Firis -d flux-klein-4b-base -p \"a cat\" -o cat.png -s 10\n\n# Base model with linear schedule\n.\u002Firis -d flux-klein-4b-base -p \"a cat\" -o cat.png -s 10 --linear\n\n# Base model with power schedule (quadratic by default)\n.\u002Firis -d flux-klein-4b-base -p \"a cat\" -o cat.png -s 10 --power\n\n# Power schedule with custom exponent\n.\u002Firis -d flux-klein-4b-base -p \"a cat\" -o cat.png -s 10 --power-alpha 1.5\n```\n\n### Z-Image-Turbo\n\nZ-Image-Turbo uses the official diffusers **FlowMatch Euler** schedule by default (static shift). The default is 8 NFE (9 scheduler values, with the terminal sigma at 0 making the last step a no-op).\n\nFor quick tests, **4 steps with `--linear`** works well and is twice as fast as the default:\n\n```bash\n# Default schedule (8 NFE)\n.\u002Firis -d zimage-turbo -p \"a fish\" -o fish.png\n\n# Quick test: 4 steps with linear schedule\n.\u002Firis -d zimage-turbo -p \"a fish\" -o fish.png -s 4 --linear\n```\n\n### Cross-model schedules\n\nYou can use any schedule with any model via `--sigmoid` and `--flowmatch`:\n\n```bash\n# Flux base with Z-Image's FlowMatch schedule\n.\u002Firis -d flux-klein-4b-base -p \"a cat\" -o cat.png -s 10 --flowmatch\n\n# Z-Image with Flux's shifted sigmoid schedule\n.\u002Firis -d zimage-turbo -p \"a fish\" -o fish.png --sigmoid\n```\n\nThis is not really useful with the Flux distilled models, but is interesting with both the base models of Flux and with Z-Image Turbo, even if it is a distilled model: the 9 steps training gives it enough flexibility, and other schedulers may be interesting especially at reduced steps count (quick preview).\n\n### Interactive mode\n\nIn interactive CLI mode, toggle schedules with `!linear`, `!power [alpha]`, `!sigmoid`, or `!flowmatch`.\n\nIf you have a terminal supporting the iTerm2 
or Kitty terminal graphics protocols, it is strongly suggested to test the different schedulers with `--show` and `--show-steps` options. It is quite an experience to see the denoising process happening in different ways.\n\n## Memory Requirements\n\n### 4B model\n\nWith mmap (default):\n\n| Phase | Memory |\n|-------|--------|\n| Text encoding | ~2GB (layers loaded on-demand) |\n| Diffusion | ~1-2GB (blocks loaded on-demand) |\n| Peak | ~4-5GB |\n\nWith `--no-mmap` (all weights in RAM):\n\n| Phase | Memory |\n|-------|--------|\n| Text encoding | ~8GB (encoder weights) |\n| Diffusion | ~8GB (transformer ~4GB + VAE ~300MB + activations) |\n| Peak | ~16GB (if encoder not released) |\n\n### 9B model\n\nWith mmap (default):\n\n| Phase | Memory |\n|-------|--------|\n| Text encoding | ~3-4GB (larger layers loaded on-demand) |\n| Diffusion | ~2-3GB (more\u002Flarger blocks loaded on-demand) |\n| Peak | ~8-10GB |\n\nWith `--no-mmap` (all weights in RAM):\n\n| Phase | Memory |\n|-------|--------|\n| Text encoding | ~15GB (Qwen3-8B encoder weights) |\n| Diffusion | ~17GB (transformer ~17GB + VAE ~300MB + activations) |\n| Peak | ~32GB (if encoder not released) |\n\nThe text encoder is automatically released after encoding, reducing peak memory during diffusion. If you generate multiple images with different prompts, the encoder reloads automatically.\n\n## Memory-Mapped Weights (Default)\n\nMemory-mapped weight loading is enabled by default. Use `--no-mmap` to disable and load all weights upfront.\n\n```bash\n.\u002Firis -d flux-klein-4b -p \"A cat\" -o cat.png           # mmap (default)\n.\u002Firis -d flux-klein-4b -p \"A cat\" -o cat.png --no-mmap # load all upfront\n```\n\n**How it works:** Instead of loading all model weights into RAM upfront, mmap keeps the safetensors files memory-mapped and loads weights on-demand:\n\n- **Text encoder (Qwen3):** Each of the 36 transformer layers (~400MB each) is loaded, processed, and immediately freed. 
Only ~2GB stays resident instead of ~8GB.\n- **Denoising transformer:** Each of the 5 double-blocks (~300MB) and 20 single-blocks (~150MB) is loaded on-demand and freed after use. Only ~200MB of shared weights stays resident instead of ~4GB.\n\nThis reduces peak memory from ~16GB to ~4-5GB, making inference possible on 16GB RAM systems where the Python ML stack cannot run at all.\n\n**Performance varies by backend:**\n\n- **MPS (Apple Silicon):** mmap is the **fastest** mode. The model stores weights in bf16 format, and MPS uses them directly via zero-copy pointers into the memory-mapped region. No conversion overhead, and the kernel handles paging efficiently.\n\n- **BLAS (CPU):** mmap is **slightly slower** but uses much less RAM. BLAS requires f32 weights, so each block must be converted from bf16->f32 on every step (25 blocks x 4 steps = 100 conversions). With `--no-mmap`, this conversion happens once at startup. **Recommendation:** If you have 32GB+ RAM and use BLAS, try `--no-mmap` for faster inference. If RAM is limited, mmap lets you run at all.\n\n- **Generic (pure C):** Same tradeoffs as BLAS, but slower overall.\n\n## C Library API\n\nThe library can be integrated into your own C\u002FC++ projects. Link against `libiris.a` and include `iris.h`.\n\n### Text-to-Image Generation\n\nHere's a complete program that generates an image from a text prompt:\n\n```c\n#include \"iris.h\"\n#include \u003Cstdio.h>\n\nint main(void) {\n    \u002F* Load the model. This loads VAE, transformer, and text encoder. *\u002F\n    iris_ctx *ctx = iris_load_dir(\"flux-klein-4b\");\n    if (!ctx) {\n        fprintf(stderr, \"Failed to load model: %s\\n\", iris_get_error());\n        return 1;\n    }\n\n    \u002F* Configure generation parameters. Start with defaults and customize. 
*\u002F\n    iris_params params = IRIS_PARAMS_DEFAULT;\n    params.width = 512;\n    params.height = 512;\n    params.seed = 42;  \u002F* Use -1 for random seed *\u002F\n\n    \u002F* Generate the image. This handles text encoding, diffusion, and VAE decode. *\u002F\n    iris_image *img = iris_generate(ctx, \"A fluffy orange cat in a sunbeam\", &params);\n    if (!img) {\n        fprintf(stderr, \"Generation failed: %s\\n\", iris_get_error());\n        iris_free(ctx);\n        return 1;\n    }\n\n    \u002F* Save to file. Format is determined by extension (.png or .ppm). *\u002F\n    iris_image_save(img, \"cat.png\");\n    printf(\"Saved cat.png (%dx%d)\\n\", img->width, img->height);\n\n    \u002F* Clean up *\u002F\n    iris_image_free(img);\n    iris_free(ctx);\n    return 0;\n}\n```\n\nCompile with:\n```bash\ngcc -o myapp myapp.c -L. -liris -lm -framework Accelerate  # macOS\ngcc -o myapp myapp.c -L. -liris -lm -lopenblas              # Linux\n```\n\n### Image-to-Image Transformation\n\nTransform an existing image guided by a text prompt using in-context conditioning:\n\n```c\n#include \"iris.h\"\n#include \u003Cstdio.h>\n\nint main(void) {\n    iris_ctx *ctx = iris_load_dir(\"flux-klein-4b\");\n    if (!ctx) return 1;\n\n    \u002F* Load the input image *\u002F\n    iris_image *photo = iris_image_load(\"photo.png\");\n    if (!photo) {\n        fprintf(stderr, \"Failed to load image\\n\");\n        iris_free(ctx);\n        return 1;\n    }\n\n    \u002F* Set up parameters. Output size defaults to input size. 
*\u002F\n    iris_params params = IRIS_PARAMS_DEFAULT;\n    params.seed = 123;\n\n    \u002F* Transform the image - describe the desired output *\u002F\n    iris_image *painting = iris_img2img(ctx, \"oil painting of the scene, impressionist style\",\n                                         photo, &params);\n    iris_image_free(photo);  \u002F* Done with input *\u002F\n\n    if (!painting) {\n        fprintf(stderr, \"Transformation failed: %s\\n\", iris_get_error());\n        iris_free(ctx);\n        return 1;\n    }\n\n    iris_image_save(painting, \"painting.png\");\n    printf(\"Saved painting.png\\n\");\n\n    iris_image_free(painting);\n    iris_free(ctx);\n    return 0;\n}\n```\n\n### Generating Multiple Images\n\nWhen generating multiple images with different seeds but the same prompt, you can avoid reloading the text encoder:\n\n```c\niris_ctx *ctx = iris_load_dir(\"flux-klein-4b\");\niris_params params = IRIS_PARAMS_DEFAULT;\nparams.width = 256;\nparams.height = 256;\n\n\u002F* Generate 5 variations with different seeds *\u002F\nfor (int i = 0; i \u003C 5; i++) {\n    iris_set_seed(1000 + i);\n\n    iris_image *img = iris_generate(ctx, \"A mountain landscape at sunset\", &params);\n\n    char filename[64];\n    snprintf(filename, sizeof(filename), \"landscape_%d.png\", i);\n    iris_image_save(img, filename);\n    iris_image_free(img);\n}\n\niris_free(ctx);\n```\n\nNote: The text encoder (~8GB) is automatically released after the first generation to save memory. It reloads automatically if you use a different prompt.\n\n### Error Handling\n\nAll functions that can fail return NULL on error. 
Use `iris_get_error()` to get a description:\n\n```c\niris_ctx *ctx = iris_load_dir(\"nonexistent-model\");\nif (!ctx) {\n    fprintf(stderr, \"Error: %s\\n\", iris_get_error());\n    \u002F* Prints something like: \"Failed to load VAE - cannot generate images\" *\u002F\n    return 1;\n}\n```\n\n### API Reference\n\n**Core functions:**\n```c\niris_ctx *iris_load_dir(const char *model_dir);   \u002F* Load model, returns NULL on error *\u002F\nvoid iris_free(iris_ctx *ctx);                     \u002F* Free all resources *\u002F\n\niris_image *iris_generate(iris_ctx *ctx, const char *prompt, const iris_params *params);\niris_image *iris_img2img(iris_ctx *ctx, const char *prompt, const iris_image *input,\n                          const iris_params *params);\n```\n\n**Image handling:**\n```c\niris_image *iris_image_load(const char *path);     \u002F* Load PNG, JPEG, or PPM *\u002F\nint iris_image_save(const iris_image *img, const char *path);  \u002F* 0=success, -1=error *\u002F\nint iris_image_save_with_seed(const iris_image *img, const char *path, int64_t seed);  \u002F* Save with metadata *\u002F\niris_image *iris_image_resize(const iris_image *img, int new_w, int new_h);\nvoid iris_image_free(iris_image *img);\n```\n\n**Utilities:**\n```c\nvoid iris_set_seed(int64_t seed);                  \u002F* Set RNG seed for reproducibility *\u002F\nconst char *iris_get_error(void);                  \u002F* Get last error message *\u002F\nvoid iris_release_text_encoder(iris_ctx *ctx);     \u002F* Manually free ~8GB (optional) *\u002F\nint iris_is_distilled(iris_ctx *ctx);              \u002F* 1 = distilled, 0 = base *\u002F\nvoid iris_set_base_mode(iris_ctx *ctx);            \u002F* Force base model mode *\u002F\n```\n\n### Parameters\n\n```c\ntypedef struct {\n    int width;              \u002F* Output width in pixels (default: 256) *\u002F\n    int height;             \u002F* Output height in pixels (default: 256) *\u002F\n    int num_steps;          \u002F* Denoising 
steps, 0 = auto (4 distilled, 50 base, 9 zimage) *\u002F\n    int64_t seed;           \u002F* Random seed, -1 for random (default: -1) *\u002F\n    float guidance;         \u002F* CFG guidance scale, 0 = auto (1.0 distilled, 4.0 base, 0.0 zimage) *\u002F\n    int linear_schedule;    \u002F* Use linear timestep schedule (0 = shifted sigmoid) *\u002F\n    int power_schedule;     \u002F* Use power curve timestep schedule *\u002F\n    float power_alpha;      \u002F* Exponent for power schedule (default: 2.0) *\u002F\n} iris_params;\n\n\u002F* Initialize with sensible defaults (auto steps and guidance from model type) *\u002F\n#define IRIS_PARAMS_DEFAULT { 256, 256, 0, -1, 0.0f, 0, 0, 2.0f }\n```\n\n## Debugging\n\n### Comparing with Python Reference\n\nWhen debugging img2img issues, the `--debug-py` flag allows you to run the C implementation with exact inputs saved from a Python reference script. This isolates whether differences are due to input preparation (VAE encoding, text encoding, noise generation) or the transformer itself.\n\n**Setup:**\n\n1. Set up the Python environment:\n```bash\npython -m venv flux_env\nsource flux_env\u002Fbin\u002Factivate\npip install torch diffusers transformers safetensors einops huggingface_hub\n```\n\n2. Clone the flux2 reference (for the model class):\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fblack-forest-labs\u002Fflux flux2\n```\n\n3. Run the Python debug script to save inputs:\n```bash\npython debug\u002Fdebug_img2img_compare.py\n```\n\nThis saves to `\u002Ftmp\u002F`:\n- `py_noise.bin` - Initial noise tensor\n- `py_ref_latent.bin` - VAE-encoded reference image\n- `py_text_emb.bin` - Text embeddings from Qwen3\n\n4. Run C with the same inputs:\n```bash\n.\u002Firis -d flux-klein-4b --debug-py -W 256 -H 256 --steps 4 -o \u002Ftmp\u002Fc_debug.png\n```\n\n5. 
Compare the outputs visually or numerically.\n\n**What this helps diagnose:**\n- If C and Python produce identical outputs with identical inputs, any differences in normal operation are due to input preparation (VAE, text encoder, RNG)\n- If outputs differ even with identical inputs, the issue is in the transformer or sampling implementation\n\n### Debug Scripts\n\nThe `debug\u002F` directory contains Python scripts for comparing C and Python implementations:\n\n- `debug_img2img_compare.py` - Full img2img comparison with step-by-step statistics\n- `debug_rope_img2img.py` - Verify RoPE position encoding matches between C and Python\n\n## License\n\nMIT\n","Iris.c is an inference pipeline for image generation models, implemented in pure C, that generates images from text prompts. The project is built entirely on the C standard library with no additional dependencies, and supports optional MPS and BLAS acceleration for better performance. It suits scenarios that need a lightweight, cross-platform solution, especially when the environment does not allow Python or CUDA. It currently supports several model families, including FLUX.2 Klein and Z-Image-Turbo, and users can choose pretrained models of different sizes and quality levels to deploy according to their needs.",2,"2026-06-11 03:50:28","high_star"]