[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82682":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":11,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":11,"stars30d":13,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":14,"rankGlobal":8,"rankLanguage":8,"license":15,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":18,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":12,"starSnapshotCount":12,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},82682,"TinyEdgeBench","keys2023190905023\u002FTinyEdgeBench","keys2023190905023",null,"Python",106,1,0,5,38.9,"MIT License",false,"main",true,[],"2026-06-12 04:01:38","# TinyEdgeBench\n\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9%2B-blue)](https:\u002F\u002Fwww.python.org\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green)](LICENSE)\n[![Version](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fversion-v0.1.0-black)](pyproject.toml)\n[![Backend](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbackend-CPU%20%7C%20GPU-lightgrey)](#verified-local-cpu-and-gpu-results)\n[![Verified](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbenchmarked-RTX%204060-76b900)](docs\u002Fresults\u002Frtx4060_laptop\u002F)\n\nTinyEdgeBench is a reproducible local benchmark suite for low-bit edge AI on the user's own CPU and GPU.\n\nIt connects operator simulation, model-block benchmarking, and real local backend comparison into one inspectable Python workflow. 
The goal is simple: when someone installs TinyEdgeBench on their own computer, the generated report reflects that machine's actual CPU, GPU, driver, CUDA, PyTorch, and ONNX Runtime stack.\n\n[Website](docs\u002F) | [Quick Start](#quick-start) | [Benchmark Protocol](docs\u002Fbenchmark_protocol.md) | [Verified Results](docs\u002Fhardware_results.md) | [Roadmap](#roadmap)\n\n## Why TinyEdgeBench\n\nEdge-AI work often starts with practical questions:\n\n- How fast is this operator on my laptop or edge box?\n- How much error does an INT8-style approximation introduce?\n- Which layer family is the likely latency bottleneck?\n- Does the same workload behave differently on NumPy CPU, PyTorch CPU, ONNX Runtime CPU, PyTorch CUDA, or ONNX Runtime CUDA?\n- What is the memory, power, and energy tradeoff, not only latency?\n\nTinyEdgeBench is not a production inference runtime. It is a small, inspectable benchmarking harness for deployment decisions on local CPU\u002FGPU machines: operator diagnosis, precision-error tradeoff, backend comparison, and reproducible report generation.\n\n## Verified Local CPU And GPU Results\n\nThe repository now includes verified local result artifacts under [docs\u002Fresults\u002F](docs\u002Fresults\u002F). 
A result is treated as verified only when the directory includes the generated CSV\u002Freport\u002Fplots plus system information.\n\n| Platform | Backend | Workload | Precision | Key Result |\n| --- | --- | --- | --- | --- |\n| Laptop CPU | NumPy \u002F Torch CPU \u002F ONNX CPU | Conv3x3 \u002F MatMul128 | FP32 | [summary.csv](docs\u002Fresults\u002Fcpu_baseline\u002Fsummary.csv) with median, P90, std, RSS, error |\n| RTX 4060 Laptop | Torch CUDA \u002F ONNX CUDA plus CPU baselines | MatMul256 \u002F Conv3x3 | FP32 | [summary.csv](docs\u002Fresults\u002Frtx4060_laptop\u002Fsummary.csv) with CUDA memory and estimated energy |\n\n## Highlights\n\n| Capability | Status |\n| --- | --- |\n| Local CPU execution | Supported by default |\n| YAML benchmark configs | Supported |\n| Interactive CLI wizard | Supported |\n| Streamlit Web UI | Supported |\n| CSV, Markdown, and PNG outputs | Supported |\n| 100+ operator microbenchmarks | Supported |\n| 25+ network\u002Fblock presets | Supported |\n| Verified CPU and RTX 4060 result artifacts | Supported |\n| Memory, P90\u002Fstd latency, and estimated CUDA energy columns | Supported |\n| Benchmark protocol documentation | Supported |\n| FP32 baseline | Supported |\n| Real `torch_cpu` \u002F `onnxruntime_cpu` comparison | Optional |\n| Real `torch_cuda` \u002F `onnxruntime_cuda` comparison | Optional, local GPU required |\n| ONNX Runtime TensorRT Provider comparison | Optional, local TensorRT provider required |\n| OpenVINO \u002F TVM \u002F native TensorRT backend registry | Planned executors with availability checks |\n| Model-level benchmark presets | Supported |\n| Historical run comparison | Supported |\n| Simulated INT8 | Supported |\n| Shift-only approximation | Supported |\n| CUDA\u002FGPU execution | Supported through optional local backends |\n\n## Installation\n\nClone the repository and install it in editable mode:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkeys2023190905023\u002FTinyEdgeBench.git\ncd 
TinyEdgeBench\npython -m pip install -e \".[dev]\"\n```\n\nTinyEdgeBench requires Python 3.9 or newer. CUDA is not required.\n\n## Where Benchmarks Run\n\nTinyEdgeBench is designed for local deployment-style measurements:\n\n- GitHub hosts the source code, documentation, and static project website.\n- GitHub Pages is only a showcase page; it cannot run CPU or GPU benchmarks for visitors.\n- `python -m tinyedgebench.benchmark ...`, `tinyedgebench wizard`, and `tinyedgebench web` execute on the machine where the command is launched.\n- Reported latency and error data reflect that local machine's Python environment, CPU, GPU, drivers, and installed runtimes.\n\nThis means a user with an NVIDIA GPU can run `torch_cuda` or `onnxruntime_cuda` locally and generate real local GPU measurements, while a CPU-only machine still works through the default NumPy CPU backend.\n\n## Quick Start\n\nRun the default benchmark suite:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Fdefault.yaml\n```\n\nOutputs are written to `results\u002F`:\n\n```text\nresults\u002F\n  summary.csv\n  report.md\n  latency_plot.png\n  error_plot.png\n```\n\n## Web UI\n\nLaunch the local Streamlit application:\n\n```bash\ntinyedgebench web\n```\n\nThen open:\n\n```text\nhttp:\u002F\u002Flocalhost:8501\n```\n\nThe Web UI runs locally on your own computer. The browser is a control panel for the local Python process, so benchmark data is generated by your own CPU\u002FGPU environment. 
From the browser you can choose:\n\n- single-operator benchmarks\n- network or model-block presets\n- precision modes\n- tensor or matrix shapes\n- warmup runs and benchmark runs\n- output directory\n\nAfter a run, the app shows a summary table, latency chart, numerical error chart, Markdown report preview, and download buttons for generated artifacts.\n\nTo choose a different Streamlit port:\n\n```bash\ntinyedgebench web -- --server.port 8502\n```\n\nThe Web UI also supports uploaded YAML configs, historical run comparison, Plotly charts when Plotly is installed, and one-click ZIP downloads for generated reports.\n\n## Project Website\n\nThe repository includes a static, GitHub Pages-ready website in [docs\u002F](docs\u002F):\n\n```text\ndocs\u002F\n  index.html\n  styles.css\n  app.js\n  assets\u002Fhero-edge-bench.png\n```\n\nTo publish it on GitHub, enable Pages in the repository settings and choose `main` plus the `\u002Fdocs` folder as the source.\n\n## CLI Wizard\n\nUse the interactive terminal wizard:\n\n```bash\ntinyedgebench wizard\n```\n\nThe wizard asks for the operator, shape parameters, precision modes, backend, and output directory. 
CPU is the default supported backend.\n\n## YAML Usage\n\nCreate a benchmark config:\n\n```yaml\noutput_dir: results\nwarmup: 2\nruns: 5\nbackend: cpu\nseed: 42\nbenchmarks:\n  - name: conv2d_small\n    operator: conv2d\n    input_shape: [1, 3, 16, 16]\n    output_channels: 8\n    kernel_size: [3, 3]\n    stride: 1\n    padding: 1\n    precision_modes: [fp32, int8_sim, shift_only]\n\n  - name: matmul_small\n    operator: matmul\n    matrix_m: 32\n    matrix_k: 64\n    matrix_n: 16\n    precision_modes: [fp32, int8_sim, shift_only]\n```\n\nRun it:\n\n```bash\npython -m tinyedgebench.benchmark --config path\u002Fto\u002Fconfig.yaml\n```\n\nSee [configs\u002Fdefault.yaml](configs\u002Fdefault.yaml), [configs\u002Fextended_operators.yaml](configs\u002Fextended_operators.yaml), and [configs\u002Fmodel_presets.yaml](configs\u002Fmodel_presets.yaml) for complete examples.\n\n## Real Backend Comparison\n\nBy default, `cpu` uses the built-in NumPy benchmark path. TinyEdgeBench can also compare against real local deployment-style kernels through optional backends:\n\n| Backend | What it measures |\n| --- | --- |\n| `cpu` | Default NumPy CPU implementation |\n| `torch_cpu` | PyTorch CPU operator kernels |\n| `torch_cuda` | PyTorch CUDA kernels on the local NVIDIA GPU |\n| `onnxruntime_cpu` | ONNX Runtime CPUExecutionProvider kernels |\n| `onnxruntime_cuda` | ONNX Runtime CUDAExecutionProvider kernels on the local NVIDIA GPU |\n| `onnxruntime_tensorrt` | ONNX Runtime TensorrtExecutionProvider kernels when available locally |\n| `openvino_cpu` | Registered CPU deployment target with availability checks; executor integration is planned |\n| `tvm_cpu`, `tvm_cuda` | Registered compiler-runtime targets with availability checks; executor integration is planned |\n| `tensorrt_cuda` | Registered native TensorRT target; use `onnxruntime_tensorrt` today for TensorRT-provider runs |\n\nInstall optional backend dependencies:\n\n```bash\npython -m pip install -e 
\".[real-backends]\"\n```\n\nFor ONNX Runtime CUDA provider experiments, install the GPU extra in an environment with compatible NVIDIA drivers and CUDA runtime support:\n\n```bash\npython -m pip install -e \".[real-backends-gpu]\"\n```\n\nRun a backend comparison suite:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Freal_backends.yaml\n```\n\nExample config:\n\n```yaml\noutput_dir: results_real_backends\nwarmup: 2\nruns: 10\nbackends: [cpu, torch_cpu, onnxruntime_cpu]\nbenchmarks:\n  - name: deploy_matmul\n    operator: matmul\n    matrix_m: 128\n    matrix_k: 256\n    matrix_n: 128\n    precision_modes: [fp32]\n```\n\nThese backend rows are measured on your local machine and reflect the installed PyTorch or ONNX Runtime kernels. ONNX Runtime benchmark graphs freeze weights as model initializers where practical, which is closer to deployment-style inference than feeding every tensor as an input. `int8_sim` and `shift_only` remain simulation modes unless a backend-specific quantized kernel is added.\n\nExample local GPU config:\n\n```yaml\noutput_dir: results_gpu_backends\nwarmup: 5\nruns: 20\nbackends: [cpu, torch_cpu, torch_cuda, onnxruntime_cpu, onnxruntime_cuda]\nbenchmarks:\n  - name: gpu_matmul_256\n    operator: matmul\n    matrix_m: 256\n    matrix_k: 256\n    matrix_n: 256\n    precision_modes: [fp32]\n```\n\nSee [configs\u002Fgpu_backends.example.yaml](configs\u002Fgpu_backends.example.yaml). 
Use CUDA backends only on a local machine where PyTorch CUDA or ONNX Runtime CUDAExecutionProvider is available.\n\nFor TensorRT Provider experiments through ONNX Runtime:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Fdeployment_backends.example.yaml\n```\n\nIf the local ONNX Runtime install does not expose `TensorrtExecutionProvider`, remove `onnxruntime_tensorrt` from the `backends` list.\n\n## Network Presets\n\nTinyEdgeBench can run lightweight suites that approximate common model blocks:\n\n| Preset | Description |\n| --- | --- |\n| `tiny_cnn` | Conv\u002FBN\u002FReLU\u002FPool\u002FLinear image pipeline |\n| `mobilenet_block` | Depthwise separable convolution block |\n| `resnet_basic_block` | Residual Conv\u002FBN\u002FReLU\u002FAdd block |\n| `transformer_encoder_tiny` | Attention, normalization, MLP, and softmax block |\n| `mlp_edge` | Small MLP-style matrix and activation block |\n| `efficientnet_mbconv` | Mobile inverted bottleneck convolution block |\n| `convnext_block` | ConvNeXt-style depthwise convolution and pointwise MLP block |\n| `unet_encoder_block` | UNet downsampling encoder block |\n| `unet_decoder_block` | UNet upsampling decoder block |\n| `deeplab_aspp_tiny` | Tiny segmentation ASPP-style block |\n| `fpn_lateral_block` | Feature pyramid lateral fusion block |\n| `yolo_head_tiny` | Tiny detection head block |\n| `detection_neck_pan` | PAN-style detection neck fusion block |\n| `segmentation_head` | Lightweight semantic segmentation head |\n| `vit_patch_embed` | Vision Transformer patch embedding block |\n| `swin_window_attention_tiny` | Tiny Swin-style attention and MLP block |\n| `bert_ffn_block` | BERT-style feed-forward block |\n| `gpt_decoder_tiny` | Tiny causal decoder block |\n| `recommender_embedding_mlp` | Embedding plus MLP recommendation block |\n| `speech_command_cnn` | Small speech-command CNN block |\n| `wav2vec_conv_frontend` | Speech representation frontend approximation |\n| `autoencoder_bottleneck` | 
Encoder bottleneck and decoder projection block |\n| `gan_generator_block` | Generator-style upsampling convolution block |\n| `super_resolution_block` | Pixel-shuffle-like super-resolution block |\n| `lstm_gate_block` | LSTM gate approximation block |\n| `gru_gate_block` | GRU gate approximation block |\n| `pointnet_mlp_block` | PointNet-style per-point MLP and global reduction block |\n| `graphsage_mlp_block` | GraphSAGE-style aggregate and projection block |\n| `anomaly_mlp` | Small anomaly-detection MLP block |\n| `mobilenetv2_tiny` | Layer-wise MobileNetV2-style tiny model |\n| `resnet18_tiny` | Layer-wise ResNet18-style tiny image model |\n| `efficientnet_lite_tiny` | Layer-wise EfficientNet-Lite-style model |\n| `yolo_tiny_head` | Layer-wise YOLO tiny detection head |\n| `tinybert_block` | Layer-wise TinyBERT encoder block |\n| `whisper_tiny_encoder` | Layer-wise Whisper encoder approximation |\n| `llama_mlp_attention_micro` | Layer-wise LLaMA attention and MLP microbenchmark |\n\nExample preset config:\n\n```yaml\nnetwork_presets:\n  - name: tiny_cnn\n    precision_modes: [fp32, int8_sim, shift_only]\n  - name: transformer_encoder_tiny\n    precision_modes: [fp32, int8_sim]\n```\n\nRun model-level presets:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Fmodel_level.yaml\n```\n\n## Historical Comparison\n\nSave a timestamped copy of a run:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Fdefault.yaml --history\n```\n\nThis writes the normal output directory and also copies artifacts into:\n\n```text\nresults\u002Fruns\u002F\u003Ctimestamp>\u002F\n```\n\nCompare two saved runs:\n\n```bash\ntinyedgebench compare results\u002Fruns\u002F\u003Cbaseline> results\u002Fruns\u002F\u003Ccandidate>\n```\n\nThe comparison generates:\n\n```text\nresults\u002Fcompare\u002F\n  comparison.csv\n  comparison.md\n```\n\n## Supported Operators\n\n| Category | Operators |\n| --- | --- |\n| Convolution | `conv2d`, `depthwise_conv2d`, `pointwise_conv2d` 
|\n| Matrix and linear | `matmul`, `batch_matmul`, `linear` |\n| Activations | `relu`, `relu6`, `sigmoid`, `tanh`, `gelu`, `silu`, `leaky_relu`, `elu`, `selu`, `celu`, `softplus`, `softsign`, `hard_sigmoid`, `hard_swish`, `mish`, `prelu`, `glu`, `swiglu`, `geglu` |\n| Pooling and image ops | `maxpool2d`, `avgpool2d`, `global_avgpool2d`, `upsample_nearest2d`, `pad` |\n| Normalization | `batchnorm2d`, `layernorm`, `rmsnorm`, `groupnorm`, `instance_norm`, `l2_normalize` |\n| Tensor ops | `add`, `sub`, `mul`, `div`, `maximum`, `minimum`, `bias_add`, `where`, `masked_fill`, `greater`, `less`, `equal`, `not_equal`, `concat`, `transpose`, `reshape`, `flatten`, `squeeze`, `expand_dims`, `tile`, `slice`, `gather`, `one_hot` |\n| Layout\u002Fimage transforms | `channel_shuffle`, `space_to_depth`, `depth_to_space` |\n| Pooling extras | `adaptive_avgpool2d`, `adaptive_maxpool2d` |\n| Reductions and probabilities | `softmax`, `log_softmax`, `reduce_mean`, `reduce_sum`, `reduce_max`, `reduce_min`, `reduce_prod`, `argmax`, `argmin`, `topk`, `sort`, `cumsum`, `cumprod` |\n| Unary math | `identity`, `abs`, `neg`, `square`, `sqrt`, `rsqrt`, `exp`, `log`, `log1p`, `pow`, `sin`, `cos`, `reciprocal`, `floor`, `ceil`, `round`, `clip`, `sign`, `standardize`, `minmax_normalize`, `pixel_norm`, `dropout_inference` |\n| Similarity and distance | `cosine_similarity`, `pairwise_distance` |\n| Sequence\u002Fmodel ops | `embedding`, `scaled_dot_product_attention`, `causal_self_attention`, `rotary_embedding` |\n\n## Precision Modes\n\n| Mode | Meaning |\n| --- | --- |\n| `fp32` | Float32 reference path |\n| `int8_sim` | Symmetric INT8-style quantization simulation with float dequantization |\n| `shift_only` | Signed power-of-two operand approximation for shift-like experiments |\n\n## Output Files\n\n| File | Purpose |\n| --- | --- |\n| `summary.csv` | Machine-readable benchmark summary |\n| `report.md` | Markdown report with system information and result table |\n| `latency_plot.png` | Latency 
comparison chart |\n| `error_plot.png` | Numerical error chart |\n\nThe report records the local execution machine, operating system, Python version, CPU\u002FGPU information, CUDA visibility, PyTorch CUDA status, ONNX Runtime providers, backend ranking, bottleneck rows, memory fields, optional CUDA power\u002Fenergy estimates, and reproducibility commands.\n\n## Example CSV\n\n```csv\nname,operator,precision,backend,input_description,latency_ms,throughput_ops_per_s,mean_abs_error,max_abs_error,latency_median_ms,latency_p90_ms,latency_std_ms,valid_runs,failed_runs,oom_runs,peak_memory_mb,gpu_memory_allocated_mb,gpu_memory_reserved_mb,power_w,energy_mj,edp_mj_ms,preprocess_ms,inference_ms,postprocess_ms\nrtx4060_matmul_256,matmul,fp32,onnxruntime_cuda,256x256 @ 256x256,0.539750,62166617878.78,0.00093977,0.00499487,0.539750,0.653900,0.082526,20,0,0,584.684,,,14.213,7.672,4.141,0.000000,0.539750,0.000000\n```\n\n## Project Layout\n\n```text\nTinyEdgeBench\u002F\n  benchmark_suites\u002F          story-driven benchmark suites\n  configs\u002F                  YAML benchmark examples\n  docs\u002F\n    benchmark_protocol.md    reproducibility and measurement protocol\n    hardware_results.md      verified hardware result index\n    results\u002F                 CPU\u002FGPU benchmark artifacts\n  src\u002Ftinyedgebench\u002F         package source\n    benchmark.py             YAML entry point\n    cli.py                   CLI commands\n    web_app.py               Streamlit application\n    runner.py                benchmark orchestration\n    operators.py             NumPy operator implementations\n    artifacts.py             CSV, report, and plot generation\n    network_presets.py       common model-block presets\n  tests\u002F                     pytest suite\n```\n\n## Development\n\nInstall development dependencies:\n\n```bash\npython -m pip install -e \".[dev]\"\n```\n\nRun tests:\n\n```bash\npython -m pytest\n```\n\nRun end-to-end examples:\n\n```bash\npython 
-m tinyedgebench.benchmark --config configs\u002Fdefault.yaml\npython -m tinyedgebench.benchmark --config configs\u002Fextended_operators.yaml\npython -m tinyedgebench.benchmark --config configs\u002Fmodel_presets.yaml\npython -m tinyedgebench.benchmark --config configs\u002Fmodel_level.yaml\npython -m tinyedgebench.benchmark --config configs\u002Freal_backends.yaml\n```\n\nIf your local machine has CUDA-enabled PyTorch and\u002For ONNX Runtime GPU providers:\n\n```bash\npython -m tinyedgebench.benchmark --config configs\u002Fgpu_backends.example.yaml\n```\n\n## Screenshots\n\nStatic project website with the refined local benchmark dashboard:\n\n![TinyEdgeBench local benchmark dashboard preview](docs\u002Fassets\u002Ftinyedgebench-dynamic-preview.png)\n\n## Continuous Integration\n\nThe repository includes GitHub Actions CI in `.github\u002Fworkflows\u002Fci.yml`. It installs the package, runs `pytest`, and verifies the default YAML config on every push and pull request.\n\n## Roadmap\n\n- More CPU\u002FGPU deployment backends such as CuPy, OpenVINO CPU, TensorRT, and TVM\n- Backend-specific quantized INT8 kernels beyond the current simulation path\n- More fused kernels and model-specific operator groups\n- PyPI release packaging and versioned benchmark artifacts\n\n## License\n\nTinyEdgeBench is released under the MIT License. See [LICENSE](LICENSE).\n","TinyEdgeBench is a reproducible local benchmark suite for evaluating low-bit edge AI performance on the user's own CPU and GPU. Its core features include more than 100 operator microbenchmarks and 25+ network\u002Fblock presets, and it generates CSV, Markdown, and PNG reports with metrics such as memory usage, P90\u002Fstd latency, and estimated CUDA energy. It also provides an interactive CLI wizard and a Streamlit Web UI for configuring and running benchmarks. The tool suits decisions about operator performance, precision-error tradeoffs, and backend comparison on specific hardware such as laptops or edge devices, and is particularly useful to edge-AI developers choosing a deployment option.",2,"2026-06-11 04:08:56","CREATED_QUERY"]