[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80904":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},80904,"sam2-mlx","avbiswas\u002Fsam2-mlx","avbiswas",null,"Python",34,5,33,0,1,2.33,false,"main",true,[],"2026-06-12 02:04:08","# mlx-sam\n\nMLX-native SAM models for Apple Silicon. The first supported model family is\nMeta SAM 2.1 for interactive image segmentation and video object tracking.\n\nThe goal of this repo is practical local video segmentation: load an MLX SAM2\ncheckpoint, click objects in a video, add positive\u002Fnegative corrections, track\nforward or backward from the edited frame, and render masks back to reviewable\noverlay videos. 
The default runtime is Python 3.14 + MLX and does not install\nPyTorch.\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0946cad0-8af8-4efc-b504-f7416083d64c\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F868b1156-6bc2-4ffd-ad1a-0971761a45d7\n\n## What You Can Do\n\n- Segment an image from points or boxes.\n- Track one or more objects through a video with SAM2 memory.\n- Add positive and negative correction clicks across frames.\n- Start from the middle of a clip and propagate forward, backward, or both.\n- Use box prompts in the video flow.\n- Render `.npy` or `.npz` masks as overlay videos for visual inspection.\n- Convert SAM2.1 Hugging Face checkpoints into MLX `.safetensors`.\n- Load converted checkpoints from local disk or Hugging Face.\n\nThe public video predictor mirrors the official SAM2 method names where the\nimplemented behavior matches closely:\n\n- `SAM2VideoPredictor.from_pretrained(...)`\n- `init_state(...)`\n- `add_new_points_or_box(...)`\n- `add_new_points(...)`\n- `add_new_mask(...)`\n- `propagate_in_video(...)`\n- `clear_all_prompts_in_frame(...)`\n- `reset_state(...)`\n\n\n## Quick API\n\nYou can pip install this library\n\n```bash\npip install mlx-sam\n```\n\nLoad a Hugging Face checkpoint, add a prompt, and stream masks as they are\ngenerated:\n\n```python\nimport numpy as np\n\nfrom mlx_sam import SAM2VideoPredictor\n\npredictor = SAM2VideoPredictor.from_pretrained(\n    \"avbiswas\u002Fsam2.1-hiera-small-mlx\"\n)\nstate = predictor.init_state(\"third_party\u002Fsam2\u002Fdemo\u002Fdata\u002Fgallery\u002F01_dog.mp4\")\n\nframe_idx, obj_ids, masks = predictor.add_new_points_or_box(\n    state,\n    frame_idx=0,\n    obj_id=1,\n    points=np.array([[625.0, 429.0]], dtype=np.float32),\n    labels=np.array([1], dtype=np.int32),\n)\n\nfor frame_idx, obj_ids, masks in predictor.propagate_in_video(state):\n    # masks is a NumPy float32 array shaped O,1,H,W in original video resolution.\n    
pass\n```\n\nFor UI or worker streaming, use `stream_in_video(...)`. It returns dictionary\nevents, throttles intermediate frame events with `yield_every`, and can emit a\nfinal stacked mask tensor:\n\n```python\nfor event in predictor.stream_in_video(state, yield_every=30, return_full=True):\n    if event[\"type\"] == \"frame\":\n        frame_idx = event[\"frame_idx\"]\n        masks = event[\"masks\"]  # O,1,H,W for this frame\n    elif event[\"type\"] == \"final\":\n        frame_indices = event[\"frame_indices\"]  # T\n        masks = event[\"masks\"]  # T,O,1,H,W for every processed frame\n```\n\nLocal checkpoint loading works the same way:\n\n```python\nfrom mlx_sam import SAM2VideoPredictor\n\npredictor = SAM2VideoPredictor(\n    checkpoint=\"checkpoints\u002Fsam2.1_hiera_small_image_segmenter.safetensors\"\n)\n```\n\nSpatial downsampling is opt-in through `image_size`. The default is `1024`,\nmatching SAM2. Lower values trade mask quality for latency and memory:\n\n```python\npredictor = SAM2VideoPredictor.from_pretrained(\n    \"avbiswas\u002Fsam2.1-hiera-small-mlx\",\n    image_size=768,\n    memory_dtype=\"float16\",\n    memory_attention_dtype=\"float16\",\n)\n```\n\nFor editor-style use, precompute image features once during `init_state`.\nThis is a parity-preserving speed path for repeated propagation and correction\npasses, at the cost of higher upfront work and cached memory:\n\n```python\nstate = predictor.init_state(\n    \"clip.mp4\",\n    precompute_image_features=True,\n    feature_batch_size=4,\n)\n```\n\nTrack from the middle of a clip in either direction by choosing\n`start_frame_idx` and `reverse`. 
Run both directions to build bidirectional\nresults around an edit frame:\n\n```python\nedit_frame = 120\n\npredictor.add_new_points_or_box(\n    state,\n    frame_idx=edit_frame,\n    obj_id=1,\n    points=np.array([[640.0, 360.0]], dtype=np.float32),\n    labels=np.array([1], dtype=np.int32),\n)\n\nfor frame_idx, obj_ids, masks in predictor.propagate_in_video(\n    state, start_frame_idx=edit_frame, reverse=True\n):\n    pass\n\nfor frame_idx, obj_ids, masks in predictor.propagate_in_video(\n    state, start_frame_idx=edit_frame, reverse=False\n):\n    pass\n```\n\nUse positive and negative clicks together by setting labels to `1` and `0`.\nPass `clear_old_points=False` to accumulate correction clicks on a frame:\n\n```python\npredictor.add_new_points_or_box(\n    state,\n    frame_idx=120,\n    obj_id=1,\n    points=np.array([[640.0, 360.0], [710.0, 370.0]], dtype=np.float32),\n    labels=np.array([1, 0], dtype=np.int32),\n    clear_old_points=False,\n)\n```\n\nTemporal downsampling is available today as an explicit preview experiment in\n`scripts\u002Fbenchmark_video_frame_skip_mlx.py`: it runs SAM2 on every `k`-th frame,\nsnaps arbitrary prompt frames to the nearest sampled frame, propagates forward\nand backward over the sampled frames, and interpolates skipped masks. Normal\n`propagate_in_video(...)` still evaluates every frame.\n\n## Manual App\n\nInstall the optional app dependencies and launch the local browser UI:\n\n```bash\nuv sync --extra app\nuv run mlx-sam-app\n```\n\nThen open:\n\n```text\nhttp:\u002F\u002F127.0.0.1:7861\n```\n\nThe frontend is a small demo client for the local API server. It lets you upload\na video, add positive and negative points, and run forward, backward, or\nbidirectional propagation. It defaults to\n`avbiswas\u002Fsam2.1-hiera-base-plus-mlx-8bit`. 
The API server is documented in\n[docs\u002FAPI_SERVER.md](docs\u002FAPI_SERVER.md).\n\nSee [scripts\u002FREADME.md](scripts\u002FREADME.md) for benchmark commands, temporal\ndownsampling experiments, quantization, conversion, parity checks, and upload\nhelpers.\n\n## Install\n\n```bash\nuv sync --python 3.14\n```\n\nTorch is only used for conversion and comparison fixtures:\n\n```bash\nuv sync --python 3.14 --extra torch-parity\n```\n\nReference repositories may exist locally for development, but they are not\nruntime dependencies:\n\n```text\nthird_party\u002Fsam2\nreferences\u002Fmlx-vlm\n```\n\n## Convert Weights\n\nConvert from Hugging Face:\n\n```bash\nuv run --extra torch-parity mlx-sam-convert \\\n  --hf-id facebook\u002Fsam2.1-hiera-small \\\n  --output-dir checkpoints\n```\n\nSupported source ids:\n\n```text\nfacebook\u002Fsam2.1-hiera-tiny\nfacebook\u002Fsam2.1-hiera-small\nfacebook\u002Fsam2.1-hiera-base-plus\nfacebook\u002Fsam2.1-hiera-large\n```\n\nConvert a local Torch checkpoint:\n\n```bash\nuv run --extra torch-parity mlx-sam-convert \\\n  --checkpoint checkpoints\u002Fsam2.1_hiera_small.pt \\\n  --model-id facebook\u002Fsam2.1-hiera-small \\\n  --output checkpoints\u002Fsam2.1_hiera_small_image_segmenter.safetensors\n```\n\nThe converted checkpoint includes the Hiera image encoder, FPN neck, prompt\nencoder, mask decoder, object pointer projection, memory encoder, and memory\nattention. 
Generated checkpoints are ignored by git.\n\nThe old script path remains as a compatibility wrapper:\n\n```bash\nuv run --extra torch-parity python scripts\u002Fconvert_image_encoder_weights.py \\\n  --checkpoint checkpoints\u002Fsam2.1_hiera_small.pt \\\n  --model-id facebook\u002Fsam2.1-hiera-small\n```\n\n## Feature Regression\n\nRun MLX feature scenarios and compare against Torch fixtures:\n\n```bash\nuv run python scripts\u002Frun_feature_regression.py --frames 130\n```\n\nRegenerate official Torch fixtures first:\n\n```bash\nuv run python scripts\u002Frun_feature_regression.py --refresh-torch --frames 130\n```\n\nCompare existing outputs without rerunning MLX:\n\n```bash\nuv run python scripts\u002Frun_feature_regression.py --skip-mlx --frames 130\n```\n\nCovered scenarios:\n\n- `multi_object`\n- `box_prompt`\n- `negative_clicks`\n- `cross_frame_corrections`\n- `bidirectional_middle`\n\n\nCurrent low-level parity results:\n\n- Image `vision_features` max abs error: about `1.63e-05`\n- Prompted low-res masks max abs error: about `4.67e-05`\n- Prompted IoU max abs error: about `4.77e-07`\n\n\n\n## Model Catalog\n\nBenchmarks below were run on an Apple M2 Max with 32 GB unified memory. The\nsource media is `third_party\u002Fsam2\u002Fdemo\u002Fdata\u002Fgallery\u002F01_dog.mp4`, a\n`1280x720`, 289-frame clip at 29.97 FPS (`9.64 s`). 
The fp32 speed and parity\nrows use the prompted first-frame fixture at `1024x1024` internal resolution;\nspeedup is MLX full-image-plus-prompt latency versus the original Torch\u002FMPS\nmodel of the same SAM2.1 family.\n\n| FP32 model | Size | Torch\u002FMPS | MLX | Speedup | Parity vs Torch main |\n| --- | ---: | ---: | ---: | ---: | --- |\n| `avbiswas\u002Fsam2.1-hiera-tiny-mlx` | `172.6 MiB` | `96.6 ms` | `71.3 ms` | `1.36x` | mask mean abs `1.17e-05`, IoU max abs `1.43e-06` |\n| `avbiswas\u002Fsam2.1-hiera-small-mlx` | `199.7 MiB` | `112.5 ms` | `84.5 ms` | `1.33x` | mask mean abs `8.14e-06`, IoU max abs `4.77e-07` |\n| `avbiswas\u002Fsam2.1-hiera-base-plus-mlx` | `336.4 MiB` | `203.5 ms` | `144.7 ms` | `1.41x` | mask mean abs `5.04e-06`, IoU max abs `3.49e-06` |\n| `avbiswas\u002Fsam2.1-hiera-large-mlx` | `892.2 MiB` | `433.0 ms` | `341.1 ms` | `1.27x` | mask mean abs `7.84e-06`, IoU max abs `2.50e-06` |\n\nQuantized checkpoints reduce memory footprint and distribution size. On current\nMLX kernels they should not be assumed to speed up video tracking; in our tests\nquantization primarily helps memory, not latency.\n\n| Quantized model | Size | Variant | Parity vs fp32 MLX |\n| --- | ---: | --- | --- |\n| `avbiswas\u002Fsam2.1-hiera-tiny-mlx-16bit` | `86.3 MiB` | fp16 | mask mean abs `5.43e-03`, IoU max abs `9.36e-04` |\n| `avbiswas\u002Fsam2.1-hiera-tiny-mlx-8bit` | `69.0 MiB` | int8 | mask mean abs `6.19e-02`, IoU max abs `2.80e-03` |\n| `avbiswas\u002Fsam2.1-hiera-tiny-mlx-4bit` | `49.2 MiB` | mixed-q4 | mask mean abs `6.29e-02`, IoU max abs `2.58e-03` |\n| `avbiswas\u002Fsam2.1-hiera-small-mlx-16bit` | `99.9 MiB` | fp16 | mask mean abs `8.24e-03`, IoU max abs `1.10e-03` |\n| `avbiswas\u002Fsam2.1-hiera-small-mlx-8bit` | `76.7 MiB` | int8 | mask mean abs `2.99e-02`, IoU max abs `1.90e-03` |\n| `avbiswas\u002Fsam2.1-hiera-small-mlx-4bit` | `56.4 MiB` | mixed-q4 | mask mean abs `2.87e-02`, IoU max abs `8.80e-04` |\n| 
`avbiswas\u002Fsam2.1-hiera-base-plus-mlx-16bit` | `168.2 MiB` | fp16 | mask mean abs `1.58e-03`, IoU max abs `8.83e-04` |\n| `avbiswas\u002Fsam2.1-hiera-base-plus-mlx-8bit` | `124.6 MiB` | int8 | mask mean abs `2.24e-02`, IoU max abs `8.98e-03` |\n| `avbiswas\u002Fsam2.1-hiera-base-plus-mlx-4bit` | `95.8 MiB` | mixed-q4 | mask mean abs `2.70e-02`, IoU max abs `6.11e-03` |\n| `avbiswas\u002Fsam2.1-hiera-large-mlx-16bit` | `446.2 MiB` | fp16 | mask mean abs `2.11e-03`, IoU max abs `8.34e-05` |\n| `avbiswas\u002Fsam2.1-hiera-large-mlx-8bit` | `300.2 MiB` | int8 | mask mean abs `1.57e-02`, IoU max abs `2.71e-03` |\n| `avbiswas\u002Fsam2.1-hiera-large-mlx-4bit` | `249.7 MiB` | mixed-q4 | mask mean abs `1.56e-02`, IoU max abs `2.61e-03` |\n\n\nAll models can be found here: https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Favbiswas\u002Fsam2-mlx\n","This project provides an MLX-native version of the SAM 2.1 model for Apple Silicon, used mainly for interactive image segmentation and video object tracking. Core features include clicking objects in a video and adding positive and negative corrections, tracking forward or backward from the edited frame, and rendering the generated masks as reviewable overlay videos. The project supports loading pretrained models from Hugging Face and converting them to MLX format, and it runs on a Python 3.14 + MLX runtime without installing PyTorch. It suits applications that need efficient local video segmentation and object tracking, such as video editing and computer vision research.",2,"2026-06-06 04:03:55","CREATED_QUERY"]