[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84000":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":11,"contributorsCount":11,"subscribersCount":11,"size":11,"stars1d":11,"stars7d":13,"stars30d":13,"stars90d":11,"forks30d":11,"starsTrendScore":13,"compositeScore":14,"rankGlobal":8,"rankLanguage":8,"license":15,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":8,"pushedAt":20,"updatedAt":21,"readmeContent":22,"aiSummary":8,"trendingCount":11,"starSnapshotCount":11,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},84000,"inference_driven_model_compiler","naomili0924\u002Finference_driven_model_compiler","naomili0924",null,"Python",99,0,15,17,47.2,"MIT License",false,"main",true,[],"2026-06-08 06:51:38","2026-06-08 22:54:45","# Inference-Driven Model Compiler\n\nExport 🤗 Transformers models to ONNX by **observing a real inference pass**\ninstead of relying on hand-written, per-architecture ONNX configurations.\n\nStandard [Optimum](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Foptimum) ONNX export requires\na model-specific `OnnxConfig` that declares every input\u002Foutput and which tensor\ndimensions are dynamic. 
This project removes that requirement: it runs the model\non your actual inputs, traces the tensor shapes that flow through it, determines\nwhich dimensions are dynamic **empirically**, and exports the result — all behind\nthe familiar `from_pretrained(...)` interface.\n\nIt is built entirely on top of an **unmodified** `optimum` \u002F `optimum-onnx`\ninstallation.\n\n---\n\n## Installation\n\n```bash\npip install torch transformers onnx onnxruntime\npip install \"optimum @ git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Foptimum\"\n# the ONNX exporter\u002Fruntime now lives in the separate optimum-onnx package\npip install \"optimum-onnx[onnxruntime] @ git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Foptimum-onnx\"\n\ngit clone https:\u002F\u002Fgithub.com\u002Fnaomili0924\u002Finference_driven_model_compiler.git\n# Put the *repo directory itself* on PYTHONPATH. This activates the shadow\n# `optimum` package, which (a) makes `optimum-cli` use the inference-driven\n# exporter and (b) auto-registers the new export flags onto\n# `optimum-cli export onnx` (see optimum\u002Fcommands\u002Fregister\u002Fregister_idmc.py).\n# Add the parent dir too if you also want `import inference_driven_model_compiler`\n# or the `idmc` CLI.\nexport PYTHONPATH=\u002Fpath\u002Fto\u002Finference_driven_model_compiler:$PYTHONPATH\n```\n\nWith this repo *off* PYTHONPATH, `optimum-cli` behaves exactly as stock — the\nintegration is inert unless the shadow `optimum` is active.\n\n---\n\n## CLI export\n\nWhen this repo is on `PYTHONPATH` (see Installation), the standard\n`optimum-cli export onnx` command gains three extra flags — no separate tool or\nlauncher needed. They are registered automatically via\n`optimum\u002Fcommands\u002Fregister\u002Fregister_idmc.py`, which optimum-cli auto-discovers:\n\n| Flag | Description |\n|---|---|\n| `--export_by_inference` | Enable inference-driven export (traces the model instead of using a hand-written `OnnxConfig`). 
|\n| `--module_fixed_axis_fields` | JSON dict mapping submodule names to config field names whose values should be treated as **static** tensor dimensions. |\n| `--inference_kwargs` | JSON dict of inputs used to trace the model (overrides the auto-generated dummy inputs). |\n\n### Encoder model\n\n```bash\noptimum-cli export onnx \\\n    --model sentence-transformers\u002Fparaphrase-MiniLM-L12-v2 \\\n    \u002Fdev\u002Fshm\u002Fparaphrase-MiniLM \\\n    --export_by_inference=true \\\n    --module_fixed_axis_fields='{\"transformer\": [\"hidden_size\",\"intermediate_size\",\"type_vocab_size\",\"vocab_size\"]}'\n```\n\n### Decoder model (with KV cache)\n\n```bash\noptimum-cli export onnx \\\n    --model Qwen\u002FQwen3-4B-Thinking-2507 \\\n    \u002Fdev\u002Fshm\u002Fqwen3-4b-thinking-onnx \\\n    --task text-generation-with-past \\\n    --export_by_inference=true \\\n    --dtype fp16 --device cpu\n```\n\n> The inference-driven tracer builds its dummy inputs on CPU, so export decoder\n> models with `--device cpu` (a `--device cuda` model would mismatch the\n> CPU-resident traced inputs).\n\n`--module_fixed_axis_fields` is optional for decoder models — the dynamic-axis\ninference step figures out `num_heads`, `head_dim`, etc. 
automatically.\n\n> **Tip for large models:** if your disk is limited, export to `\u002Fdev\u002Fshm` (a\n> RAM-backed tmpfs typically >= 80 GB on GPU instances) and copy the result\n> elsewhere afterwards.\n\n---\n\n## Python API\n\n### Encoder model (BERT feature extraction)\n\n```python\nfrom transformers import AutoTokenizer\nfrom inference_driven_model_compiler.optimum.onnxruntime import (\n    OnTheFlyORTModelForFeatureExtraction,\n)\n\nckpt = \"bert-base-uncased\"\ntokenizer = AutoTokenizer.from_pretrained(ckpt)\nencoded = tokenizer(\"ONNX Runtime accelerates inference.\", return_tensors=\"pt\")\n\nmodel = OnTheFlyORTModelForFeatureExtraction.from_pretrained(\n    ckpt,\n    inference_kwargs=dict(encoded),\n    export_by_inference=True,\n    export=True,\n    module_fixed_axis_fields={\"transformer\": [\"hidden_size\", \"num_attention_heads\"]},\n)\n\nout = model(**encoded)\nprint(out.last_hidden_state.shape)      # (1, seq_len, 768)\n```\n\n### Decoder model (GPT-2 text generation, with KV cache)\n\n```python\nfrom transformers import GPT2Tokenizer\nfrom inference_driven_model_compiler.optimum.onnxruntime import OnTheFlyORTModelForCausalLM\n\nckpt = \"gpt2\"\ntokenizer = GPT2Tokenizer.from_pretrained(ckpt)\nencoded = tokenizer(\"Replace me by any text you'd like.\", return_tensors=\"pt\")\n\nmodel = OnTheFlyORTModelForCausalLM.from_pretrained(\n    ckpt,\n    inference_kwargs=dict(encoded),\n    export_by_inference=True,\n    export=True,\n    module_fixed_axis_fields={\"transformer\": [\"n_ctx\", \"n_embd\"]},\n)\n\noutput_ids = model.generate(**encoded)\nprint(tokenizer.decode(output_ids[0]))\n```\n\n### Diffusion pipeline (text-to-video)\n\nPass `export_by_inference=True` together with `inference_kwargs` (the same kwargs\nyou would pass to the pipeline `__call__`). 
The pipeline runs once in PyTorch to\ncapture real tensor shapes for every submodule, then exports each one to ONNX\nautomatically — no hand-written `OnnxConfig` required.\n\n```python\nimport torch\nfrom inference_driven_model_compiler.optimum.onnxruntime import ORTDiffusionPipeline\n\ninf_kwargs = {\n    \"prompt\": \"A cat walks on the grass, realistic\",\n    \"negative_prompt\": \"low quality, blurred\",\n    \"height\": 240,\n    \"width\": 416,\n    \"num_frames\": 21,\n    \"guidance_scale\": 5.0,\n}\n\npipe = ORTDiffusionPipeline.from_pretrained(\n    \"Wan-AI\u002FWan2.1-T2V-1.3B-Diffusers\",\n    provider=\"CUDAExecutionProvider\",\n    torch_dtype=torch.float16,\n    export_by_inference=True,\n    inference_kwargs=inf_kwargs,\n    module_fixed_axis_fields={\n        \"text_encoder\": [\"d_model\", \"vocab_size\"],\n        \"transformer\":  [\"in_channels\", \"text_dim\"],\n        \"vae_decoder\":  [\"base_dim\", \"z_dim\"],\n    },\n)\n\noutput = pipe(**inf_kwargs).frames[0]\n```\n\nThis exports three ONNX files to a temporary directory and immediately loads them\ninto ORT sessions — all in one `from_pretrained` call:\n\n| Submodule | ONNX file | Typical size |\n|---|---|---|\n| `text_encoder` | `text_encoder\u002Fmodel.onnx` | ~13 GB (fp16) |\n| `transformer` | `transformer\u002Fmodel.onnx` | ~3 GB (fp16) |\n| `vae_decoder` | `vae_decoder\u002Fmodel.onnx` | ~137 MB (fp16) |\n\n### Loading pre-exported ONNX weights\n\n```python\npipe = ORTDiffusionPipeline.from_pretrained(\n    \"optimum\u002Fstable-diffusion-v1-5\",   # Hub repo with pre-exported ONNX weights\n    export=False,\n)\n```\n\n### Common arguments\n\n| Argument | Meaning |\n|---|---|\n| `inference_kwargs` | Inputs used to trace the model (e.g. a tokenized prompt, or full pipeline kwargs for diffusion). |\n| `export_by_inference=True` | Enable the inference-driven export path. |\n| `export=True` | Force a fresh ONNX export (transformer models). 
|\n| `module_fixed_axis_fields` | Per-submodule config field names whose values should be treated as fixed (static) tensor dims. |\n| `skip_random_generation` | Keep the actual traced tensors as fixed dummy inputs instead of regenerating them. |\n| `n_trials` | Number of inference passes for dynamic-axis detection, transformer models only (default `3`). |\n\n---\n\n## Available classes\n\nAll live in `inference_driven_model_compiler.optimum.onnxruntime` and share the\nsame `from_pretrained(...)` interface:\n\n### Transformer models\n\n| Class | Task |\n|---|---|\n| `OnTheFlyORTModelForCausalLM` | Text generation (decoder-only, KV cache) |\n| `OnTheFlyORTModelForFeatureExtraction` | Embeddings \u002F hidden states |\n| `OnTheFlyORTModelForMaskedLM` | Masked language modeling |\n| `OnTheFlyORTModelForSequenceClassification` | Sequence classification |\n| `OnTheFlyORTModelForTokenClassification` | Token classification \u002F NER |\n| `OnTheFlyORTModelForQuestionAnswering` | Extractive QA |\n\n### Diffusion pipelines\n\n| Class | Purpose |\n|---|---|\n| `ORTDiffusionPipeline` | Generic base — wraps **any** `diffusers.DiffusionPipeline` |\n| `ORTUnet` | ORT session wrapper for a UNet2D\u002F3D denoiser |\n| `ORTTransformer` | ORT session wrapper for a DiT\u002Ftransformer denoiser |\n| `ORTTextEncoder` | ORT session wrapper for a text encoder |\n| `ORTVaeEncoder` | ORT session wrapper for a VAE encoder |\n| `ORTVaeDecoder` | ORT session wrapper for a VAE decoder |\n| `ORTVae` | Combines `ORTVaeEncoder` + `ORTVaeDecoder` behind the standard `vae` API |\n\n`ORTDiffusionPipeline` requires no model-specific subclass. When called as the\nbase class it reads `_class_name` from the model's `model_index.json` and creates\nan `ORT\u003CClassName>` wrapper on the fly via `_make_ort_pipeline_class`. 
Any\ndiffusers pipeline, including ones not yet written, is wrapped this way without a\ndedicated subclass; only the models listed under Verified models below have been\ntested end-to-end.\n\nSupported text-to-video pipeline names (as of diffusers 0.38):\n`AnimateDiffPipeline`, `AnimateDiffSDXLPipeline`, `CogVideoXPipeline`,\n`HunyuanVideo15Pipeline`, `HunyuanVideoPipeline`, `LTXPipeline`, `LTX2Pipeline`,\n`LattePipeline`, `MochiPipeline`, `SanaVideoPipeline`, `TextToVideoSDPipeline`,\n`WanPipeline`, `WanAnimatePipeline`.\n\n---\n\n## Verified models\n\n### Transformer models\n\n| Model | Type | Task tested |\n|---|---|---|\n| GPT-2 | decoder-only | text generation (KV cache) |\n| Gemma 4 (2B) | decoder-only | text generation (KV cache, fp16) |\n| BERT-base | encoder | masked-LM, seq-cls, token-cls, QA, feature extraction |\n| Sentence-Transformers \u002F paraphrase-MiniLM-L12-v2 | encoder | feature extraction |\n| T5-small | encoder-decoder | feature extraction (encoder) |\n| BART-base | encoder-decoder | feature extraction (encoder) |\n| ViT-base | vision encoder | feature extraction |\n| CLIP-ViT-base | vision encoder | feature extraction |\n| Whisper-tiny | audio encoder | feature extraction |\n\n### Diffusion pipelines\n\n| Model | Pipeline | Submodules exported | Notes |\n|---|---|---|---|\n| Wan2.1-T2V-1.3B | `WanPipeline` | text_encoder, transformer, vae_decoder | Verified end-to-end on CUDA; 50-step inference at ~7.4 it\u002Fs |\n\n---\n\n## Limitations\n\n- Encoder-decoder models (T5, BART, Whisper) are exported **encoder-only** for\n  feature extraction; full encoder-decoder generation is not yet wired up.\n- CLIP exports the **vision encoder** (the full CLIP forward needs both text and\n  image inputs and returns embeddings rather than `last_hidden_state`).\n- The exported ONNX is written to a temporary directory; call\n  `model.save_pretrained(...)` to persist it.\n- Diffusion pipeline export runs one full inference pass before exporting, which\n  requires enough GPU\u002FCPU memory to hold the full PyTorch pipeline during tracing.\n- VAE encoder 
export is included in the export spec but the WAN pipeline does not\n  use it during text-to-video inference; it is exported as a no-op placeholder\n  when the submodule exists on the VAE.\n\n---\n\n## How it works\n\n### Transformer models\n\n```\nfrom_pretrained(export_by_inference=True)\n        │\n        ▼\n1. Load the PyTorch model (TasksManager)\n        │\n        ▼\n2. trace_model_shapes()  ── run N inference passes with varied input shapes\n        │                    • encoder-only      → single forward\n        │                    • encoder-decoder   → encoder submodule only\n        │                    • decoder-only      → prefill + decode (KV cache)\n        ▼\n3. _compute_dynamic_axes()  ── a dim is \"dynamic\" iff its size changed across runs\n        │                       (dim-0\u002Fbatch always dynamic; hidden_size, vocab,\n        ▼                        num_heads, head_dim, image H\u002FW … stay static)\n4. DummyOnnxConfig  ── a generic OnnxConfig built from the traced shapes + axes\n        │\n        ▼\n5. export_models()  ── standard Optimum ONNX export (disable_dynamic_axes_fix=True)\n        │\n        ▼\n6. ORTModel._from_pretrained()  ── load the ONNX model into an ORT session\n```\n\n### Diffusion pipelines\n\n```\nORTDiffusionPipeline.from_pretrained(export_by_inference=True, inference_kwargs={...})\n        │\n        ▼\n1. Load the PyTorch diffusion pipeline (diffusers)\n        │\n        ▼\n2. Register forward pre-hooks on each submodule\n   (text_encoder, transformer\u002Funet, vae.post_quant_conv)\n        │\n        ▼\n3. Run ONE full pipeline inference pass with the provided inference_kwargs\n   — all submodule inputs are captured live as real tensors\n        │\n        ▼\n4. 
For each submodule:\n   • Move captured tensors to CPU\n   • Build DummyOnnxConfig from the observed shapes + dynamic axes\n   • Export to ONNX (constant folding disabled for text encoders to\n     avoid 89 GB inflation from precomputed attention bias)\n        │\n        ▼\n5. VAE decoder special case: export post_quant_conv + decoder as a single\n   _VaeFullDecodeWrapper (WAN VAE decodes per-frame with caching in PyTorch;\n   ONNX needs one call over the full latent video)\n        │\n        ▼\n6. ORTDiffusionPipeline loaded with each ONNX submodule in an ORT session\n```\n\n### Dynamic-axis inference\n\nRather than guessing from config fields, the compiler runs the model several\ntimes (default `n_trials=3`) with randomly varied **batch sizes** and **sequence\nlengths** (and, for decoder models, a varied **decode query length**). A\ndimension is marked dynamic only if its value actually changes between runs;\neverything else — `hidden_size`, `num_attention_heads`, `head_dim`,\n`vocab_size`, image height\u002Fwidth, ViT patch count, etc. 
— is correctly kept\nstatic.\n\nFor each tensor seen across the trial runs:\n\n- **Dimension 0** → always dynamic (`batch`).\n- **Any other dimension** → dynamic **iff** its size differed between at least\n  two trials; otherwise static.\n- For decoder KV-cache tensors `past_key_values.{i}.key\u002Fvalue`, the\n  past-sequence dimension (axis 2) is dynamic while `num_heads` (axis 1) and\n  `head_dim` (axis 3) stay static.\n\n---\n\n## Why\n\n| Standard Optimum export | Inference-driven export |\n|---|---|\n| Needs a hand-written `OnnxConfig` per architecture | Works with any model that runs a forward pass |\n| Dynamic axes declared manually | Dynamic axes inferred from multiple varied runs |\n| New architectures require code changes upstream | New architectures work out of the box |\n\n---\n\n## Project layout\n\n```\ninference_driven_model_compiler\u002F\n├── cli.py                        # idmc CLI — wraps optimum-cli with extra flags\n├── optimum\u002F\n│   ├── exporters\u002Fonnx\u002F\n│   │   ├── utils.py              # trace_model_shapes(), dynamic-axis inference\n│   │   ├── model_configs.py      # DummyOnnxConfig — generic shape-driven OnnxConfig\n│   │   ├── input_generators.py   # DummyTupleInputGenerator — dtype-aware dummies\n│   │   └── __init__.py           # main_export wrapper (renamed params)\n│   └── onnxruntime\u002F\n│       ├── modeling.py           # _OnTheFlyORTMixin + 5 encoder model classes\n│       ├── modeling_decoder.py   # OnTheFlyORTModelForCausalLM\n│       ├── modeling_diffusion.py # ORTDiffusionPipeline + submodule wrappers\n│       └── utils.py              # load_shapes_as_torch_size and helpers\n└── on_the_fly_pipeline_tests\u002F    # per-model tests + dynamic-axis + diffusion suites\n```\n",2,"2026-06-11 04:12:01","CREATED_QUERY"]