[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82329":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":8,"languages":8,"totalLinesOfCode":8,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":10,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":13,"starSnapshotCount":13,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},82329,"Step-3.7-Flash","stepfun-ai\u002FStep-3.7-Flash","stepfun-ai",null,231,19,105,3,0,4,31,115,74.4,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:37","# Step 3.7 Flash\n\n- **[ModelPage]**: https:\u002F\u002Fstatic.stepfun.com\u002Fblog\u002Fstep-3.7-flash\u002F\n- **[HuggingFace]**:\n    - BF16: https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FStep-3.7-Flash\u002F\n    - FP8: https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FStep-3.7-Flash-FP8\n    - NVFP4: https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FStep-3.7-Flash-NVFP4\n    - GGUF: https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FStep-3.7-Flash-GGUF\n\n## 1. Introduction\n\nStep 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. 
Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.\n\nWe built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.\n\n## 2. Capabilities & Performance\n\n### Multimodal Perception and Verification\n\nThe model delivers top-tier visual intelligence, securing first place on SimpleVQA (Search) with a 79.2 and achieving frontier parity on V* (Python) at 95.3. These metrics reflect strong visual grounding and retrieval-augmented reasoning beyond basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, to map them into structured code. When it encounters an incomplete visual asset, it can independently identify missing data and execute lookups to verify context before returning a factually verified conclusion.\n\n### Workflow Integrity and Tool Orchestration\n\nExecution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, which significantly outperforms the next closest competitor at 59.8. This performance demonstrates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Backed by scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, this profile ensures high trajectory integrity. 
Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.\n\n### Code Engineering and Professional Baselines\n\nStep 3.7 Flash is built for live engineering tasks and secured a definitive second-place finish on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. While evaluations like Terminal-Bench 2.1 (59.5) and GDPVal-AA (45.8) show clear areas for future optimization compared to the absolute peak of the cohort, they establish a dependable baseline for system interactions and structured professional deliverables.\n\n![Step 3.7 Flash benchmark results across General Agent, Agentic Coding, and Multimodal evaluations](assets\u002Fbenchmarks.png)\n\n### NVFP4 + MTP\n\nStep 3.7 Flash is also available in an NVFP4-quantized variant for efficient deployment on NVIDIA GPUs. The latest [NVFP4 checkpoint](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FStep-3.7-Flash-NVFP4) includes MTP draft layers and supports vLLM speculative decoding with:\n\n```bash\n--speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}'\n```\n\nOn GPQA Diamond avg@16, the NVFP4 + MTP checkpoint matches quality within statistical noise compared with the same NVFP4 checkpoint without MTP: **77.81% vs. 
78.41%** item accuracy over 3168 records.\n\nOn a GB200 TP=4 vLLM setup with GPQA-style long-reasoning streaming prompts (~250-token prompt, ~1.6K-token completion), NVFP4 + MTP improves aggregate decode throughput:\n\n| Concurrency | NVFP4 + MTP | NVFP4 no-MTP | Speedup |\n|---:|---:|---:|---:|\n| 8 | 1309 tok\u002Fs | 1155 tok\u002Fs | 1.13x |\n| 32 | 4391 tok\u002Fs | 3480 tok\u002Fs | 1.26x |\n| **64** | **8229 tok\u002Fs** | 5667 tok\u002Fs | **1.45x** |\n\nThis makes the NVFP4 checkpoint a practical option for high-throughput long-reasoning workloads. This benchmark characterizes short-prompt, decode-heavy reasoning rather than long-context prefill performance.\n\n## 3. Pricing\n\n| Token Type | Price |\n|---|---|\n| Input (cache miss) | $0.20 \u002F M tokens |\n| Input (cache hit) | $0.04 \u002F M tokens |\n| Output | $1.15 \u002F M tokens |\n\n## 4. Availability, Deployment, and Ecosystem\n\n- Availability: Step 3.7 Flash is available on the StepFun Open Platform, at [platform.stepfun.ai](https:\u002F\u002Fplatform.stepfun.ai) (Global) and [platform.stepfun.com](https:\u002F\u002Fplatform.stepfun.com) (China), as well as on OpenRouter and NVIDIA NIM. StepFun is also partnering with DeepInfra, Fireworks AI, and Modal to expand availability soon.\n- Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio \u002F MacBook Pro devices with at least 128 GB unified memory.\n- Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. 
For model development & customization workflows, StepFun model support has landed in the NVIDIA NeMo ecosystem, including AutoModel, Megatron Core, and Megatron Bridge. Step 3.7 Flash is also available as an NVIDIA NIM inference microservice for on-prem, cloud, or hybrid deployment.\n\n## 5. Examples\n\nYou can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.\n\n> Pick the right `base_url` for your region. StepFun operates two regional platforms with separate API hosts. The `base_url` you pass to the OpenAI client must match the platform where your API key was issued; otherwise, requests will be rejected as unauthorized.\n>\n> - **Global**: [platform.stepfun.ai](https:\u002F\u002Fplatform.stepfun.ai) with `base_url=https:\u002F\u002Fapi.stepfun.ai\u002Fv1`\n> - **China**: [platform.stepfun.com](https:\u002F\u002Fplatform.stepfun.com) with `base_url=https:\u002F\u002Fapi.stepfun.com\u002Fv1`\n>\n> To avoid hard-coding the wrong region, the examples below read both the API key and base URL from environment variables. Export them once before running:\n>\n> ```bash\n> export STEP_API_KEY=\"sk-...\"\n> export STEP_BASE_URL=\"https:\u002F\u002Fapi.stepfun.ai\u002Fv1\"   # use https:\u002F\u002Fapi.stepfun.com\u002Fv1 for the China platform\n> ```\n\n### 5.1 Chat Example\n\n```python\nimport os\nfrom openai import OpenAI\n\nclient = OpenAI(\n    api_key=os.environ[\"STEP_API_KEY\"],\n    base_url=os.environ[\"STEP_BASE_URL\"],\n)\n\ncompletion = client.chat.completions.create(\n    model=\"step-3.7-flash\",\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are an AI assistant provided by StepFun. 
You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.\",\n        },\n        {\n            \"role\": \"user\",\n            \"content\": \"Introduce StepFun's artificial intelligence capabilities.\"\n        },\n    ],\n)\n\nprint(completion)\n```\n\n### 5.2 Text and Image Input Example\n\n```python\nimport os\nfrom openai import OpenAI\n\nclient = OpenAI(\n    api_key=os.environ[\"STEP_API_KEY\"],\n    base_url=os.environ[\"STEP_BASE_URL\"],\n)\n\ncompletion = client.chat.completions.create(\n    model=\"step-3.7-flash\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"What is in this picture?\"},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\"url\": \"https:\u002F\u002Fexample.com\u002Fphoto.jpg\"},\n                },\n            ],\n        },\n    ],\n)\n\nprint(completion)\n```\n\n## 6. Local Deployment\n\nStep 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.\n\n### 6.1 vLLM\n\nWe recommend using StepFun's prebuilt vLLM Docker image with Step 3.7 support.\n\n1. Install vLLM.\n\n```bash\n# via Docker\ndocker pull vllm\u002Fvllm-openai:stepfun37\n```\n\n2. 
Launch the server.\n\n  - For FP8 model\n  ```bash\n  vllm serve \u003CMODEL_PATH_OR_HF_ID> \\\n  --served-model-name step3p7-flash \\\n  --tensor-parallel-size 8 \\\n  --enable-expert-parallel \\\n  --disable-cascade-attn \\\n  --reasoning-parser step3p5 \\\n  --enable-auto-tool-choice \\\n  --tool-call-parser step3p5 \\\n  --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}' \\\n  --trust-remote-code\n  ```\n  - For BF16 model\n  ```bash\n  vllm serve \u003CMODEL_PATH_OR_HF_ID> \\\n  --served-model-name step3p7-flash-bf16 \\\n  --tensor-parallel-size 8 \\\n  --enable-expert-parallel \\\n  --disable-cascade-attn \\\n  --reasoning-parser step3p5 \\\n  --enable-auto-tool-choice \\\n  --tool-call-parser step3p5 \\\n  --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}' \\\n  --trust-remote-code\n  ```\n\n  - For NVFP4 model\n  Compared to standard precisions, running the FP4 quantized version requires modelopt activation and FP8 KV Cache alignment.\n  ```bash\n  python3 -m vllm.entrypoints.openai.api_server \\\n  --host 0.0.0.0 \\\n  --port ${PORT} \\\n  --model stepfun-ai\u002FStep-3.7-Flash-NVFP4 \\\n  --served-model-name step3p7 \\\n  --tensor-parallel-size 4 \\\n  --gpu-memory-utilization 0.9 \\\n  --enable-expert-parallel \\\n  --trust-remote-code \\\n  --quantization modelopt \\\n  --kv-cache-dtype fp8 \\\n  --max-model-len 8192 \\\n  --reasoning-parser step3p5 \\\n  --enable-auto-tool-choice \\\n  --tool-call-parser step3p5 \\\n  --async-scheduling \\\n  --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}'\n  ```\n\n### 6.2 SGLang\n\n1. Install SGLang.\n\n```bash\n# via Docker\ndocker pull lmsysorg\u002Fsglang:dev-step-3.7-flash\n\n# or from source (pip)\npip install \"sglang[all] @ git+https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang.git\"\n```\n\n2. 
Launch the server.\n\n> **Note:** For Blackwell GPUs, `--mm-attention-backend fa4` may be used.\n\n- For BF16 model\n\n```bash\nsglang serve --model-path stepfun-ai\u002FStep-3.7-Flash \\\n  --tp 8 \\\n  --reasoning-parser step3p5 \\\n  --tool-call-parser step3p5 \\\n  --enable-multimodal \\\n  --speculative-algorithm EAGLE \\\n  --speculative-num-steps 3 \\\n  --speculative-eagle-topk 1 \\\n  --speculative-num-draft-tokens 4 \\\n  --enable-multi-layer-eagle \\\n  --trust-remote-code \\\n  --host 0.0.0.0 \\\n  --port 8000\n```\n\n- For FP8 model\n\n```bash\nsglang serve --model-path stepfun-ai\u002FStep-3.7-Flash-FP8 \\\n  --tp 8 \\\n  --ep 4 \\\n  --reasoning-parser step3p5 \\\n  --tool-call-parser step3p5 \\\n  --enable-multimodal \\\n  --speculative-algorithm EAGLE \\\n  --speculative-num-steps 3 \\\n  --speculative-eagle-topk 1 \\\n  --speculative-num-draft-tokens 4 \\\n  --enable-multi-layer-eagle \\\n  --trust-remote-code \\\n  --host 0.0.0.0 \\\n  --port 8000\n```\n\n- For NVFP4 model\n\n```bash\nsglang serve --model-path stepfun-ai\u002FStep-3.7-Flash-NVFP4 \\\n  --tp 4 --ep 4 \\\n  --moe-runner-backend flashinfer_trtllm \\\n  --kv-cache-dtype fp8_e4m3 \\\n  --quantization modelopt_fp4 \\\n  --trust-remote-code \\\n  --reasoning-parser step3p5 \\\n  --tool-call-parser step3p5 \\\n  --attention-backend trtllm_mha\n```\n\n### 6.3 Transformers (Debug \u002F Verification)\n\nUse this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.\n\n> **Note:** Deployment of this model requires `transformers` 5.0 or later.\n\n```python\nfrom transformers import AutoProcessor, AutoModelForCausalLM\n\nMODEL_PATH = \"\u003CMODEL_PATH_OR_HF_ID>\"\n\n# 1. Setup\nprocessor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    MODEL_PATH,\n    device_map=\"auto\",\n    dtype=\"auto\",\n    trust_remote_code=True\n)\n\n# 2. 
Prepare Input\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image\", \"url\": \"https:\u002F\u002Fexample.com\u002Fphoto.jpg\"},\n            {\"type\": \"text\", \"text\": \"What is in this picture?\"}\n        ]\n    },\n]\ninputs = processor.apply_chat_template(\n    messages,\n    tokenize=True,\n    add_generation_prompt=True,\n    return_dict=True,\n    return_tensors=\"pt\",\n).to(model.device)\n\n# 3. Generate\ngenerated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)\noutput_text = processor.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\n\nprint(output_text)\n```\n\n### 6.4 llama.cpp\n\n**System Requirements**\n\nGGUF Model Weights:\n\n| Component | Quantization | File Size |\n|---|---|---|\n| Language Model | Q4_K_S | 111.5 GB |\n| Language Model | IQ4_XS | 104.99 GB |\n| Language Model | Q3_K_L | 102.5 GB |\n| Multimodal Projector | FP16 | 3.97 GB |\n\n- **Runtime Overhead:** ~7 GB\n- **Minimum unified memory \u002F VRAM:** 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)\n- **Recommended:** 128 GB unified memory\n\n**Steps**\n\n1. Use llama.cpp:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002Fllama.cpp.git\ncd llama.cpp\ngit checkout -b step3.7 origin\u002Fstep3.7\n```\n\n2. Build llama.cpp on Mac:\n\n```bash\ncmake -B build-macos -S . \\\n    -DCMAKE_BUILD_TYPE=Release \\\n    -DBUILD_SHARED_LIBS=ON \\\n    -DLLAMA_BUILD_SERVER=ON \\\n    -DLLAMA_BUILD_TESTS=ON \\\n    -DGGML_METAL=ON \\\n    -DGGML_METAL_EMBED_LIBRARY=ON \\\n    -DGGML_BLAS=ON \\\n    -DGGML_BLAS_VENDOR=Apple \\\n    -DGGML_ACCELERATE=ON \\\n    -DGGML_NATIVE=ON\ncmake --build build-macos -j8\n```\n\n3. Build llama.cpp on DGX-Spark:\n\n```bash\ncmake -S . 
-B build-cuda \\\n  -DCMAKE_BUILD_TYPE=Release \\\n  -DGGML_CUDA=ON \\\n  -DGGML_CUDA_GRAPHS=ON \\\n  -DGGML_CUDA_FORCE_MMQ=ON \\\n  -DLLAMA_OPENSSL=OFF \\\n  -DLLAMA_BUILD_COMMON=ON \\\n  -DLLAMA_BUILD_TOOLS=ON \\\n  -DLLAMA_BUILD_SERVER=ON \\\n  -DLLAMA_BUILD_EXAMPLES=OFF \\\n  -DLLAMA_BUILD_TESTS=OFF\ncmake --build build-cuda -j8\n```\n\n4. Build llama.cpp on AMD Windows:\n\n```bash\ncmake -S . -B build-vulkan \\\n  -DCMAKE_BUILD_TYPE=Release \\\n  -DGGML_VULKAN=ON \\\n  -DGGML_NATIVE=ON \\\n  -DLLAMA_BUILD_SERVER=ON \\\n  -DLLAMA_BUILD_UI=OFF \\\n  -DLLAMA_BUILD_TOOLS=ON\ncmake --build build-vulkan -j8\n```\n\n5. Run with `llama-cli`:\n\n```bash\n.\u002Fllama-cli -m Step3.7_Q4_K_S.gguf -b 2048 -ub 2048 -fa on --temp 1.0 -p \"What's your name?\"\n```\n\n6. Test performance with `llama-batched-bench`:\n\n```bash\n.\u002Fllama-batched-bench -m step3.7_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1\n```\n\n## 7. Using Step 3.7 Flash on Agent Platforms\n\nYou can use Step 3.7 Flash on Agent platforms such as Hermes Agent, OpenClaw, Kilo Code, and more.\n\n## 8. Getting in Touch\n\nAs we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop — your insights directly influence our priorities.\n\n- **Join the Conversation:** Our [Discord](https:\u002F\u002Fdiscord.gg\u002FRcMJhNVAQc) community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀\n- **Report Friction:** Encountering limitations? 
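You can open an issue or start a discussion on GitHub \u002F HuggingFace, or flag it directly in our Discord support channels.\n\n## 📄 License\n\nThis project is open-sourced under the [Apache 2.0 License](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0).\n","Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Optimized for high-frequency production workloads, it activates approximately 11B parameters per token and delivers throughput of up to 400 tokens\u002Fsecond. It supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can balance speed, cost, and cognitive depth. Step 3.7 Flash is suited to scaling workflows that combine perception, search, and reasoning, including intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.",2,"2026-06-11 04:08:25","CREATED_QUERY"]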