[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74151":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},74151,"JoyAI-Image","jd-opensource\u002FJoyAI-Image","jd-opensource","JoyAI-Image is the unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.","",null,"Python",2168,153,67,12,0,10,45,28.56,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:23","\u003Ch1 align=\"center\">JoyAI-Image\u003Cbr>\u003Csub>\u003Csup>Awakening Spatial Intelligence in Unified Multimodal Understanding and Generation\u003C\u002Fsup>\u003C\u002Fsub>\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\">\n\n[![Report PDF](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FReport-PDF-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.04128)\n[![Project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-JoyAI--Image-333399)](https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002FJoyAI-Image)\n[![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Checkpoint-JoyAI--Image--Edit--Diffusers-yellow)](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Image-Edit-Diffusers)&#160;\n[![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%96%20ModelScope-JoyAI--Image--Edit--Diffusers-624aff)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fjd-opensource\u002FJoyAI-Image-Edit-Diffusers)&#160;\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%9A%80%20Demo-Spatial--Edit-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FJoyAI-Image-Edit-Space)&#160;\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%9A%80%20Demo-General--Edit-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FJoyAI-Image-Edit)&#160;\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](LICENSE)\n\n\u003C\u002Fdiv>\n\n\n\n## 🔥🔥🔥 News!!\n- 2026.05.08: 🎉 Diffusers has merged our [PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fpull\u002F13444)! Using JoyAI-Image-Edit will be much easier now. See [Running with Diffusers](#diffusers) for details.\n- 2026.05.07: 🎉 Our technical report for Joy-Image is now available on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.04128).\n- 2026.04.15: 🎉 We are excited to release the **OpenSpatial data engine** and the **OpenSpatial-3M dataset**! You can find the code on [GitHub](https:\u002F\u002Fgithub.com\u002FVINHYU\u002FOpenSpatial) and the data on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjdopensource\u002FJoyAI-Image-OpenSpatial). If you find this project useful, please consider giving us a ⭐ to show your support!\n- 2026.04.11: 🎉 JoyAI-Image-Edit now supports Diffusers! 
Check out the integration and usage in [JoyAI-Image-Edit-Diffusers](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Image-Edit-Diffusers).\n- 2026.04.10: 🎉 JoyAI-Image-Edit now supports ComfyUI! Check out the integration and usage in our [ComfyUI](https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002FJoyAI-Image\u002Ftree\u002Fmain\u002Fjoyai_image_comfyui).\n- 2026.04.10: 🎉 We release the Spatial-Edit training dataset and benchmark: [JoyAI-Image-SpatialEdit](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjdopensource\u002FJoyAI-Image-SpatialEdit) and [JoyAI-Image-SpatialEdit-Bench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjdopensource\u002FJoyAI-Image-SpatialEdit-Bench). If you find our work helpful, please consider giving our repository a star; your support is greatly appreciated.\n- 2026.04.06: 🎉 The demo for spatial editing is available at [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FJoyAI-Image-Edit-Space), and the demo for general editing can be accessed at [Hugging Face Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FJoyAI-Image-Edit).\n- 2026.04.02: 🎉 We release the JoyAI-Image-Edit weights. They are available on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Image-Edit) and [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fjd-opensource\u002FJoyAI-Image-Edit).\n\n## 🐶 JoyAI-Image\n\nJoyAI-Image is a **unified multimodal foundation model** for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the **closed-loop collaboration between understanding, generation, and editing**. 
Stronger spatial understanding improves grounded generation and controllable editing through better scene parsing, relational grounding, and instruction decomposition, while generative transformations such as viewpoint changes provide complementary evidence for spatial reasoning.\n\nJoyAI-Image Architecture\n\n## 💎 Highlights\n\n- **Unified multimodal foundation**: one model family for understanding, generation, and editing through a shared MLLM-MMDiT interface.\n- **Practical data and training recipe**: a scalable pipeline with spatial understanding data ([Code](https:\u002F\u002Fgithub.com\u002FVINHYU\u002FOpenSpatial), [Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjdopensource\u002FJoyAI-Image-OpenSpatial)), long-text rendering data, editing data ([SpatialEdit](https:\u002F\u002Fgithub.com\u002FEasonXiao-888\u002FSpatialEdit)), and multi-stage optimization strategies.\n- **Awakened spatial intelligence**: stronger spatial understanding, controllable spatial editing, and novel-view-assisted reasoning through a bidirectional loop between understanding and generation.\n- **Advanced visual generation**: strong long-text typography, layout fidelity, multi-view generation, and controllable editing with better preservation of scene structure.\n\n## 📦 Model Zoo\n\n\n| Models                     | Task                     | Description                                                                                                                           | Download Link                                                                                                                                     |\n| -------------------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |\n| JoyAI-Image-Und  
          | Multimodal Understanding | A text–image understanding backbone that enables high-fidelity spatial reasoning and editing-aware perception.                        | 🤗[Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Image-Edit\u002Ftree\u002Fmain\u002FJoyAI-Image-Und)                                                  |\n| JoyAI-Image-Edit           | Image Editing            | An instruction-guided image editing model with precise and controllable spatial manipulation.                                         | 🤗[Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Image-Edit)🤖[ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fjd-opensource\u002FJoyAI-Image-Edit) |\n| JoyAI-Image-Edit-Distilled | Image Editing            | Distilled version of JoyAI-Image-Edit for faster inference.                                                                           | To be released                                                                                                                                    |\n| JoyAI-Image-Edit-Plus      | Multi-Image Editing      | An instruction-guided model that supports multi-image editing, enabling cross-image composition, consistency, and joint manipulation. | To be released                                                                                                                                    |\n| JoyAI-Image                | Text-to-Image            | A high-quality text-to-image generation model with strong multi-view consistency.                                                     
| To be released                                                                                                                                    |\n\n\n## 🔍 Visual Overview\n\n### Capability Profile\n\nJoyAI-Image demonstrates broad multimodal performance across understanding, synthesis, and editing, with particular strengths in spatial reasoning, long-text rendering, multi-view generation, and controllable editing.\n\nJoyAI-Image Capability Radar\n\n### Advanced Text Rendering Showcase\n\nJoyAI-Image is optimized for challenging text-heavy scenarios, including multi-panel comics, dense multi-line text, multilingual typography, long-form layouts, real-world scene text, and handwritten styles.\n\nJoyAI-Image Text Rendering Showcase\n\n### Multi-view Generation and Spatial Editing Showcase\n\nJoyAI-Image showcases a spatially grounded generation and editing pipeline that supports multi-view generation, geometry-aware transformations, camera control, object rotation, and precise location-specific object editing. Across these settings, it preserves scene content, structure, and visual consistency while following viewpoint-sensitive instructions more accurately.\n\nJoyAI-Image Multi-view Generation and Spatial Editing Showcase\n\n### Spatial Editing for Spatial Reasoning Showcase\n\nJoyAI-Image provides high-fidelity spatial editing, which serves as a powerful catalyst for enhancing spatial reasoning. Compared with Qwen-Image-Edit and Nano Banana Pro, JoyAI-Image-Edit synthesizes the most diagnostic viewpoints by faithfully executing camera motions. These high-fidelity novel views effectively disambiguate complex spatial relations, providing clearer visual evidence for downstream reasoning.\n\nJoyAI-Image Spatial Editing for Spatial Reasoning Showcase\n\n## 🚀 Quick Start\n\n### 1. 
Environment Setup\n\n**Requirements**: Python >= 3.10, CUDA-capable GPU\n\nCreate a virtual environment and install:\n\n```bash\nconda create -n joyai python=3.10 -y\nconda activate joyai\n\npip install -e .\n```\n\n> **Note on Flash Attention**: `flash-attn >= 2.8.0` is listed as a dependency for best performance.\n\n#### Core Dependencies\n\n\n| Package        | Version             | Purpose               |\n| -------------- | ------------------- | --------------------- |\n| `torch`        | >= 2.8              | PyTorch               |\n| `transformers` | >= 4.57.0, \u003C 4.58.0 | Text encoder          |\n| `diffusers`    | >= 0.34.0           | Pipeline utilities    |\n| `flash-attn`   | >= 2.8.0            | Fast attention kernel |\n\n\n### 2. Inference\n\n#### 2.1 Image Understanding\n\n```bash\npython inference_und.py \\\n  --ckpt-root \u002Fpath\u002Fto\u002Fckpts_infer \\\n  --image \"test_images\u002Ftest_1.jpg,test_images\u002Ftest3.png\" \\\n  --prompt \"Compare these two images.\" \\\n  --max-new-tokens 1024\n```\n\n#### CLI Reference (`inference_und.py`)\n\n\n| Argument           | Type  | Default                            | Description                                                              |\n| ------------------ | ----- | ---------------------------------- | ------------------------------------------------------------------------ |\n| `--ckpt-root`      | str   | *required*                         | Checkpoint root containing `text_encoder\u002F`                               |\n| `--image`          | str   | *required*                         | Input image path, or comma-separated paths for multiple images           |\n| `--prompt`         | str   | `\"Describe this image in detail.\"` | User question or instruction. 
When omitted, defaults to image captioning |\n| `--max-new-tokens` | int   | 2048                               | Maximum number of tokens to generate                                     |\n| `--temperature`    | float | 0.7                                | Sampling temperature. Use `0` for greedy decoding                        |\n| `--top-p`          | float | 0.8                                | Top-p (nucleus) sampling threshold                                       |\n| `--top-k`          | int   | 50                                 | Top-k sampling threshold                                                 |\n| `--output`         | str   | None                               | Optional output file to save the response text                           |\n\n\n#### 2.2 Image Editing\n\n```bash\npython inference.py \\\n  --ckpt-root \u002Fpath\u002Fto\u002Fckpts \\\n  --prompt \"Turn the plate blue\" \\\n  --image test_images\u002Ftest_1.jpg \\\n  --output outputs\u002Fresult.png \\\n  --seed 123 \\\n  --steps 30 \\\n  --guidance-scale 4.0 \\\n  --basesize 1024\n```\n\n#### CLI Reference (`inference.py`)\n\n\n| Argument           | Type  | Default       | Description                                                  |\n| ------------------ | ----- | ------------- | ------------------------------------------------------------ |\n| `--ckpt-root`      | str   | *required*    | Checkpoint root                                              |\n| `--prompt`         | str   | *required*    | Edit instruction or T2I prompt                               |\n| `--image`          | str   | None          | Input image path (required for editing, omit for T2I)        |\n| `--output`         | str   | `example.png` | Output image path                                            |\n| `--steps`          | int   | 30            | Denoising steps                                              |\n| `--guidance-scale` | float | 4.0           | Classifier-free guidance scale                     
          |\n| `--seed`           | int   | 42            | Random seed for reproducibility                              |\n| `--neg-prompt`     | str   | `\"\"`          | Negative prompt                                              |\n| `--basesize`       | int   | 1024          | Bucket base size for input image resizing (256\u002F512\u002F768\u002F1024) |\n| `--config`         | str   | auto          | Config path; defaults to `\u003Cckpt-root>\u002Finfer_config.py`       |\n| `--rewrite-prompt` | flag  | off           | Enable LLM-based prompt rewriting                            |\n| `--rewrite-model`  | str   | `gpt-5`       | Model name for prompt rewriting                              |\n| `--hsdp-shard-dim` | int   | 1             | FSDP shard dimension for multi-GPU (set to GPU count)        |\n\n\n#### Diffusers\n\n##### Install\n\n```bash\npip install torch transformers torchvision\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\n\n> **Note**: `JoyImageEditPipeline` will be included in the next official diffusers release (>0.38.0). Until then, install from source as shown above.\n\n##### Running with Diffusers\n\n```python\nimport torch\nfrom diffusers import JoyImageEditPipeline\nfrom diffusers.utils import load_image\n\npipeline = JoyImageEditPipeline.from_pretrained(\n    \"jdopensource\u002FJoyAI-Image-Edit-Diffusers\", torch_dtype=torch.bfloat16\n)\npipeline.to(\"cuda\")\n\nimg_path = \"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002Fdiffusers\u002Fastronaut.jpg\"\nimage = load_image(img_path)\n\nprompt = \"Add wings to the astronaut.\"\n\noutput = pipeline(\n    image=image,\n    prompt=prompt,\n    num_inference_steps=40,\n    guidance_scale=4.0,\n    generator=torch.Generator(\"cuda\").manual_seed(0),\n).images[0]\noutput.save(\"joyai_image_edit_output.png\")\n```\n\n### 3. 
Spatial Editing Reference\n\nJoyAI-Image supports three spatial editing prompt patterns: **Object Move**, **Object Rotation**, and **Camera Control**. For the most stable behavior, we recommend following the prompt templates below as closely as possible.\nFor more information (including data curation and evaluation strategies), please refer to [SpatialEdit](https:\u002F\u002Fgithub.com\u002FEasonXiao-888\u002FSpatialEdit).\n\n#### 3.1 Object Move\n\nUse this pattern when you want to move a target object into a specified region.\n\n**Prompt template:**\n\n```text\nMove the \u003Cobject> into the red box and finally remove the red box.\n```\n\n**Rules:**\n\n- Replace `\u003Cobject>` with a clear description of the target object to be moved.\n- The **red box** indicates the target destination in the image.\n- The phrase **\"finally remove the red box\"** means the guidance box should not appear in the final edited result.\n\n**Example:**\n\n```text\nMove the apple into the red box and finally remove the red box.\n```\n\n#### 3.2 Object Rotation\n\nUse this pattern when you want to rotate an object to a specific canonical view.\n\n**Prompt template:**\n\n```text\nRotate the \u003Cobject> to show the \u003Cview> side view.\n```\n\n**Supported `\u003Cview>` values:**\n\n```\nfront, right, left, rear, front right, front left, rear right, rear left\n```\n\n**Rules:**\n\n- Replace `\u003Cobject>` with a clear description of the object to rotate.\n- Replace `\u003Cview>` with one of the supported directions above.\n- This instruction is intended to change the **object orientation**, while keeping the object identity and surrounding scene as consistent as possible.\n\n**Examples:**\n\n```text\nRotate the chair to show the front side view.\nRotate the car to show the rear left side view.\n```\n\n#### 3.3 Camera Control\n\nUse this pattern when you want to change only the camera viewpoint while keeping the 3D scene itself unchanged.\n\n**Prompt template:**\n\n```text\nMove the 
camera.\n- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.\n- Camera zoom: in\u002Fout\u002Funchanged.\n- Keep the 3D scene static; only change the viewpoint.\n```\n\n**Rules:**\n\n- `{y_rotation}` specifies the yaw rotation angle in degrees.\n- `{p_rotation}` specifies the pitch rotation angle in degrees.\n- `Camera zoom` must be one of: `in`, `out`, `unchanged`.\n- The last line is important: it explicitly tells the model to preserve the 3D scene content and geometry, and only adjust the camera viewpoint.\n\n**Examples:**\n\n```text\nMove the camera.\n- Camera rotation: Yaw 45°, Pitch 0°.\n- Camera zoom: in.\n- Keep the 3D scene static; only change the viewpoint.\n```\n\n```text\nMove the camera.\n- Camera rotation: Yaw -90°, Pitch 20°.\n- Camera zoom: unchanged.\n- Keep the 3D scene static; only change the viewpoint.\n```\n\n#### 3.4 Application\n\n**3D Reconstruction:**\n\nThe first and third examples show point clouds reconstructed from a single given viewpoint. The second and fourth examples are augmented by [SpatialEdit](https:\u002F\u002Fgithub.com\u002FEasonXiao-888\u002FSpatialEdit), which synthesizes richer spatial observations from the sparse input view.\n\n**Conditional-Frame-Based Video Generation:**\n\nGiven the first frame, [SpatialEdit](https:\u002F\u002Fgithub.com\u002FEasonXiao-888\u002FSpatialEdit) first generates the final frame of the video, and a video generation model then creates a smooth rotational transition between them while maintaining background consistency.\n\n#### 3.5 Demo Display\n\n[https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F54cb2fcc-0646-44e9-a002-21a7228501f3](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F54cb2fcc-0646-44e9-a002-21a7228501f3)\n\n## ⚖️ License Agreement\n\nJoyAI-Image is licensed under Apache 2.0. 
\n\n## Acknowledgements\n\nThis project builds upon and benefits from the following open-source repositories.\n\n- Wan2.1: [https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1)\n- HunyuanVideo: [https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuanVideo](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuanVideo)\n\nPlease refer to the respective repositories for their licenses and citation guidelines.\n\n## ☎️  We're Hiring!\n\nWe are actively hiring Research Scientists, Engineers, and Interns to join us in building next-generation generative foundation models and bringing them into real-world applications. If you’re interested, please send your resume to: [huanghaoyang.ocean@jd.com](mailto:huanghaoyang.ocean@jd.com)\n","JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. Its core capabilities include multimodal understanding and generation, it supports a range of image processing tasks, and it integrates with tools such as Diffusers and ComfyUI for ease of use and flexibility. The project is written in Python and released under the Apache License 2.0. It suits applications that need high-quality image generation and editing, such as creative design, content creation, and research.",2,"2026-06-11 03:49:04","high_star"]