[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9748":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":18,"lastSyncTime":35,"discoverSource":36},9748,"imagen-pytorch","lucidrains\u002Fimagen-pytorch","lucidrains","Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch","",null,"Python",8414,800,113,101,0,1,2,20,3,39.71,"MIT License",false,"main",true,[27,28,29,30,31],"artificial-intelligence","deep-learning","imagination-machine","text-to-image","text-to-video","2026-06-12 02:02:12","\u003Cimg src=\".\u002Fimagen.png\" width=\"450px\">\u003C\u002Fimg>\n\n## Imagen - Pytorch\n\nImplementation of \u003Ca href=\"https:\u002F\u002Fgweb-research-imagen.appspot.com\u002F\">Imagen\u003C\u002Fa>, Google's Text-to-Image Neural Network that beats DALL-E2, in Pytorch. It is the new SOTA for text-to-image synthesis.\n\nArchitecturally, it is actually much simpler than DALL-E2. It consists of a cascading DDPM conditioned on text embeddings from a large pretrained T5 model (attention network). It also contains dynamic clipping for improved classifier free guidance, noise level conditioning, and a memory efficient unet design.\n\nIt appears neither CLIP nor prior network is needed after all. 
And so research continues.\n\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=xqDeAz0U-R4\">AI Coffee Break with Letitia\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fwww.assemblyai.com\u002Fblog\u002Fhow-imagen-actually-works\u002F\">Assembly AI\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=af6WPqvzjjk\">Yannic Kilcher\u003C\u002Fa>\n\nPlease join \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FxBPBXfcFHd\">\u003Cimg alt=\"Join us on Discord\" src=\"https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F823813159592001537?color=5865F2&logo=discord&logoColor=white\">\u003C\u002Fa> if you are interested in helping out with the replication with the \u003Ca href=\"https:\u002F\u002Flaion.ai\u002F\">LAION\u003C\u002Fa> community\n\n## Shoutouts\n\n- \u003Ca href=\"https:\u002F\u002Fstability.ai\u002F\">StabilityAI\u003C\u002Fa> for the generous sponsorship, as well as my other sponsors out there\n\n- \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002F\">🤗 Huggingface\u003C\u002Fa> for their amazing transformers library. 
The text encoder portion is pretty much taken care of because of them\n\n- \u003Ca href=\"http:\u002F\u002Fwww.jonathanho.me\u002F\">Jonathan Ho\u003C\u002Fa> for bringing about a revolution in generative artificial intelligence through \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239\">his seminal paper\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsgugger\">Sylvain\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmuellerzr\">Zachary\u003C\u002Fa> for the \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate\">Accelerate\u003C\u002Fa> library, which this repository uses for distributed training\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Farogozhnikov\">Alex\u003C\u002Fa> for \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Farogozhnikov\u002Feinops\">einops\u003C\u002Fa>, indispensable tool for tensor manipulation\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjorgemcgomes\">Jorge Gomes\u003C\u002Fa> for helping out with the T5 loading code and advice on the correct T5 version\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcrowsonkb\">Katherine Crowson\u003C\u002Fa>, for her \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcrowsonkb\u002Fv-diffusion-jax\u002Fblob\u002Fmaster\u002Fdiffusion\u002Futils.py\">beautiful code\u003C\u002Fa>, which helped me understand the continuous time version of gaussian diffusion\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmarunine\">Marunine\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNetruk44\">Netruk44\u003C\u002Fa>, for reviewing code, sharing experimental results, and help with debugging\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmarunine\">Marunine\u003C\u002Fa> for providing a \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fimagen-pytorch\u002Fissues\u002F72#issuecomment-1163275757\">potential solution\u003C\u002Fa> for a color shifting issue in the memory 
efficient u-nets. Thanks to \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjacobwjs\">Jacob\u003C\u002Fa> for sharing experimental comparisons between the base and memory-efficient unets\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmarunine\">Marunine\u003C\u002Fa> for finding numerous bugs, resolving an issue with resize right, and for sharing his experimental configurations and results\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FMalumaDev\">MalumaDev\u003C\u002Fa> for proposing the use of pixel shuffle upsampler to fix checkerboard artifacts\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FKhrulkovV\">Valentin\u003C\u002Fa> for pointing out insufficient skip connections in the unet, as well as the specific method of attention conditioning in the base-unet in the appendix\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FBIGJUN777\">BIGJUN\u003C\u002Fa> for catching a big bug with continuous time gaussian diffusion noise level conditioning at inference time\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fanimebing\">Bingbing\u003C\u002Fa> for identifying a bug with sampling and order of normalizing and noising with low resolution conditioning image\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTheFusion21\">Kay\u003C\u002Fa> for contributing one line command training of Imagen!\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FHReynaud\">Hadrien Reynaud\u003C\u002Fa> for testing out text-to-video on a medical dataset, sharing his results, and identifying issues!\n\n## Install\n\n```bash\n$ pip install imagen-pytorch\n```\n\n## Usage\n\n```python\nimport torch\nfrom imagen_pytorch import Unet, Imagen\n\n# unet for imagen\n\nunet1 = Unet(\n    dim = 32,\n    cond_dim = 512,\n    dim_mults = (1, 2, 4, 8),\n    num_resnet_blocks = 3,\n    layer_attns = (False, True, True, True),\n    layer_cross_attns = (False, True, True, True)\n)\n\nunet2 = Unet(\n    dim = 32,\n    cond_dim = 512,\n    dim_mults = (1, 2, 4, 8),\n    
num_resnet_blocks = (2, 4, 8, 8),\n    layer_attns = (False, False, False, True),\n    layer_cross_attns = (False, False, False, True)\n)\n\n# imagen, which contains the unets above (base unet and super resoluting ones)\n\nimagen = Imagen(\n    unets = (unet1, unet2),\n    image_sizes = (64, 256),\n    timesteps = 1000,\n    cond_drop_prob = 0.1\n).cuda()\n\n# mock images (get a lot of this) and text encodings from large T5\n\ntext_embeds = torch.randn(4, 256, 768).cuda()\nimages = torch.randn(4, 3, 256, 256).cuda()\n\n# feed images into imagen, training each unet in the cascade\n\nfor i in (1, 2):\n    loss = imagen(images, text_embeds = text_embeds, unet_number = i)\n    loss.backward()\n\n# do the above for many many many many steps\n# now you can sample an image based on the text embeddings from the cascading ddpm\n\nimages = imagen.sample(texts = [\n    'a whale breaching from afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles'\n], cond_scale = 3.)\n\nimages.shape # (3, 3, 256, 256)\n```\n\nFor simpler training, you can directly supply text strings instead of precomputing text encodings. 
(Although for scaling purposes, you will definitely want to precompute the textual embeddings + mask)\n\nThe number of textual captions must match the batch size of the images if you go this route.\n\n```python\n# mock images and text (get a lot of this)\n\ntexts = [\n    'a child screaming at finding a worm within a half-eaten apple',\n    'lizard running across the desert on two feet',\n    'waking up to a psychedelic landscape',\n    'seashells sparkling in the shallow waters'\n]\n\nimages = torch.randn(4, 3, 256, 256).cuda()\n\n# feed images into imagen, training each unet in the cascade\n\nfor i in (1, 2):\n    loss = imagen(images, texts = texts, unet_number = i)\n    loss.backward()\n```\n\nWith the `ImagenTrainer` wrapper class, the exponential moving averages for all of the U-nets in the cascading DDPM will be automatically taken care of when calling `update`\n\n```python\nimport torch\nfrom imagen_pytorch import Unet, Imagen, ImagenTrainer\n\n# unet for imagen\n\nunet1 = Unet(\n    dim = 32,\n    cond_dim = 512,\n    dim_mults = (1, 2, 4, 8),\n    num_resnet_blocks = 3,\n    layer_attns = (False, True, True, True),\n)\n\nunet2 = Unet(\n    dim = 32,\n    cond_dim = 512,\n    dim_mults = (1, 2, 4, 8),\n    num_resnet_blocks = (2, 4, 8, 8),\n    layer_attns = (False, False, False, True),\n    layer_cross_attns = (False, False, False, True)\n)\n\n# imagen, which contains the unets above (base unet and super resoluting ones)\n\nimagen = Imagen(\n    unets = (unet1, unet2),\n    text_encoder_name = 't5-large',\n    image_sizes = (64, 256),\n    timesteps = 1000,\n    cond_drop_prob = 0.1\n).cuda()\n\n# wrap imagen with the trainer class\n\ntrainer = ImagenTrainer(imagen)\n\n# mock images (get a lot of this) and text encodings from large T5\n\ntext_embeds = torch.randn(64, 256, 1024).cuda()\nimages = torch.randn(64, 3, 256, 256).cuda()\n\n# feed images into imagen, training each unet in the cascade\n\nloss = trainer(\n    images,\n    text_embeds = 
text_embeds,\n    unet_number = 1,            # training on unet number 1 in this example, but you will have to also save checkpoints and then reload and continue training on unet number 2\n    max_batch_size = 4          # auto divide the batch of 64 up into batch size of 4 and accumulate gradients, so it all fits in memory\n)\n\ntrainer.update(unet_number = 1)\n\n# do the above for many many many many steps\n# now you can sample an image based on the text embeddings from the cascading ddpm\n\nimages = trainer.sample(texts = [\n    'a puppy looking anxiously at a giant donut on the table',\n    'the milky way galaxy in the style of monet'\n], cond_scale = 3.)\n\nimages.shape # (2, 3, 256, 256)\n```\n\nYou can also train Imagen without text (unconditional image generation) as follows\n\n```python\nimport torch\nfrom imagen_pytorch import Unet, Imagen, SRUnet256, ImagenTrainer\n\n# unets for unconditional imagen\n\nunet1 = Unet(\n    dim = 32,\n    dim_mults = (1, 2, 4),\n    num_resnet_blocks = 3,\n    layer_attns = (False, True, True),\n    layer_cross_attns = False,\n    use_linear_attn = True\n)\n\nunet2 = SRUnet256(\n    dim = 32,\n    dim_mults = (1, 2, 4),\n    num_resnet_blocks = (2, 4, 8),\n    layer_attns = (False, False, True),\n    layer_cross_attns = False\n)\n\n# imagen, which contains the unets above (base unet and super resoluting ones)\n\nimagen = Imagen(\n    condition_on_text = False,   # this must be set to False for unconditional Imagen\n    unets = (unet1, unet2),\n    image_sizes = (64, 128),\n    timesteps = 1000\n)\n\ntrainer = ImagenTrainer(imagen).cuda()\n\n# now get a ton of images and feed it through the Imagen trainer\n\ntraining_images = torch.randn(4, 3, 256, 256).cuda()\n\n# train each unet separately\n# in this example, only training on unet number 1\n\nloss = trainer(training_images, unet_number = 1)\ntrainer.update(unet_number = 1)\n\n# do the above for many many many many steps\n# now you can sample images unconditionally from 
the cascading unet(s)\n\nimages = trainer.sample(batch_size = 16) # (16, 3, 128, 128)\n```\n\nOr train only super-resoluting unets\n\n```python\nimport torch\nfrom imagen_pytorch import Unet, NullUnet, Imagen\n\n# unet for imagen\n\nunet1 = NullUnet()  # add a placeholder \"null\" unet for the base unet\n\nunet2 = Unet(\n    dim = 32,\n    cond_dim = 512,\n    dim_mults = (1, 2, 4, 8),\n    num_resnet_blocks = (2, 4, 8, 8),\n    layer_attns = (False, False, False, True),\n    layer_cross_attns = (False, False, False, True)\n)\n\n# imagen, which contains the unets above (base unet and super resoluting ones)\n\nimagen = Imagen(\n    unets = (unet1, unet2),\n    image_sizes = (64, 256),\n    timesteps = 250,\n    cond_drop_prob = 0.1\n).cuda()\n\n# mock images (get a lot of this) and text encodings from large T5\n\ntext_embeds = torch.randn(4, 256, 768).cuda()\nimages = torch.randn(4, 3, 256, 256).cuda()\n\n# feed images into imagen, training each unet in the cascade\n\nloss = imagen(images, text_embeds = text_embeds, unet_number = 2)\nloss.backward()\n\n# do the above for many many many many steps\n# now you can sample an image based on the text embeddings as well as low resolution images\n\nlowres_images = torch.randn(3, 3, 64, 64).cuda()  # starting un-resoluted images\n\nimages = imagen.sample(\n    texts = [\n        'a whale breaching from afar',\n        'young girl blowing out candles on her birthday cake',\n        'fireworks with blue and green sparkles'\n    ],\n    start_at_unet_number = 2,              # start at unet number 2\n    start_image_or_video = lowres_images,  # pass in low resolution images to be resoluted\n    cond_scale = 3.)\n\nimages.shape # (3, 3, 256, 256)\n```\n\nAt any time you can save and load the trainer and all associated states with the `save` and `load` methods. 
It is recommended you use these methods instead of manually saving with a `state_dict` call, as there is some device memory management being done under the hood within the trainer.\n\nex.\n\n```python\ntrainer.save('.\u002Fpath\u002Fto\u002Fcheckpoint.pt')\n\ntrainer.load('.\u002Fpath\u002Fto\u002Fcheckpoint.pt')\n\ntrainer.steps # (2,) step number for each of the unets, in this case 2\n```\n\n## Dataloader\n\nYou can also rely on the `ImagenTrainer` to automatically train off `DataLoader` instances. You simply have to craft your `DataLoader` to return either `images` (for the unconditional case), or a tuple of `('images', 'text_embeds')` for text-guided generation.\n\nex. unconditional training\n\n```python\nfrom imagen_pytorch import Unet, Imagen, ImagenTrainer\nfrom imagen_pytorch.data import Dataset\n\n# unets for unconditional imagen\n\nunet = Unet(\n    dim = 32,\n    dim_mults = (1, 2, 4, 8),\n    num_resnet_blocks = 1,\n    layer_attns = (False, False, False, True),\n    layer_cross_attns = False\n)\n\n# imagen, which contains the unet above\n\nimagen = Imagen(\n    condition_on_text = False,  # this must be set to False for unconditional Imagen\n    unets = unet,\n    image_sizes = 128,\n    timesteps = 1000\n)\n\ntrainer = ImagenTrainer(\n    imagen = imagen,\n    split_valid_from_train = True # whether to split the validation dataset from the training\n).cuda()\n\n# instantiate your dataloader, which returns the necessary inputs to the DDPM as tuple in the order of images, text embeddings, then text masks. 
in this case, only images is returned as it is unconditional training\n\ndataset = Dataset('\u002Fpath\u002Fto\u002Ftraining\u002Fimages', image_size = 128)\n\ntrainer.add_train_dataset(dataset, batch_size = 16)\n\n# working training loop\n\nfor i in range(200000):\n    loss = trainer.train_step(unet_number = 1, max_batch_size = 4)\n    print(f'loss: {loss}')\n\n    if not (i % 50):\n        valid_loss = trainer.valid_step(unet_number = 1, max_batch_size = 4)\n        print(f'valid loss: {valid_loss}')\n\n    if not (i % 100) and trainer.is_main: # is_main makes sure this can run in distributed\n        images = trainer.sample(batch_size = 1, return_pil_images = True) # returns List[Image]\n        images[0].save(f'.\u002Fsample-{i \u002F\u002F 100}.png')\n\n```\n\n## Multi GPU\n\nThanks to \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Findex\">🤗 Accelerate\u003C\u002Fa>, you can do multi GPU training easily with two steps.\n\nFirst you need to invoke `accelerate config` in the same directory as your training script (say it is named `train.py`)\n\n```bash\n$ accelerate config\n```\n\nNext, instead of calling `python train.py` as you would for single GPU, you would use the accelerate CLI as so\n\n```bash\n$ accelerate launch train.py\n```\n\nThat's it!\n\n## Command-line\n\nImagen can also be used via CLI directly.\n\n### Configuration\n\nex.\n\n```bash\n$ imagen config\n```\nor\n```bash\n$ imagen config --path .\u002Fconfigs\u002Fconfig.json\n```\n\nIn the config you are able to change settings for the trainer, dataset and the imagen config.\n\nThe Imagen config parameters can be found \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fimagen-pytorch\u002Fblob\u002Ff8cc75f4d9020998c577b3770d3f260ce2ee2dcf\u002Fimagen_pytorch\u002Fconfigs.py#L68\">here\u003C\u002Fa>\n\nThe Elucidated Imagen config parameters can be found \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fimagen-pytorch\u002Fblob\u002Ff8cc75f4d9020998c577b3770d3f260ce2ee2dcf\u002Fimagen_pytorch\u002Fconfigs.py#L108\">here\u003C\u002Fa>\n\nThe Imagen Trainer config parameters can be found \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fimagen-pytorch\u002Fblob\u002Ff8cc75f4d9020998c577b3770d3f260ce2ee2dcf\u002Fimagen_pytorch\u002Ftrainer.py#L226\">here\u003C\u002Fa>\n\nFor the dataset parameters all dataloader parameters can be used.\n\n### Training\n\nThis command allows you to train or resume training your model\n\nex.\n```bash\n$ imagen train\n```\nor\n```bash\n$ imagen train --unet 2 --epoches 10\n```\n\nYou can pass the following arguments to the training command.\n\n- `--config` specify the config file to use for training [default: .\u002Fimagen_config.json]\n- `--unet` the index of the unet to train [default: 1]\n- `--epoches` how many epochs to train for [default: 50]\n\n### Sampling\n\nBe aware that when sampling, your checkpoint should have all unets trained to get a usable result.\n\nex.\n\n```bash\n$ imagen sample --model .\u002Fpath\u002Fto\u002Fmodel\u002Fcheckpoint.pt \"a squirrel raiding the birdfeeder\"\n# image is saved to .\u002Fa_squirrel_raiding_the_birdfeeder.png\n```\n\nYou can pass the following arguments to the sample command.\n\n- `--model` specify the model file to use for sampling\n- `--cond_scale` conditioning scale (classifier free guidance) in decoder\n- `--load_ema` load EMA version of unets if available\n\nIn order to use a saved checkpoint with this feature, you must either instantiate your Imagen instance using the config classes, `ImagenConfig` and `ElucidatedImagenConfig`, or create a checkpoint via the CLI directly\n\nFor proper training, you'll likely want to set up config-driven training anyway.\n\nex.\n\n```python\nimport torch\nfrom imagen_pytorch import ImagenConfig, ElucidatedImagenConfig, ImagenTrainer\n\n# in this example, using elucidated imagen\n\nimagen = 
ElucidatedImagenConfig(\n    unets = [\n        dict(dim = 32, dim_mults = (1, 2, 4, 8)),\n        dict(dim = 32, dim_mults = (1, 2, 4, 8))\n    ],\n    image_sizes = (64, 128),\n    cond_drop_prob = 0.5,\n    num_sample_steps = 32\n).create()\n\ntrainer = ImagenTrainer(imagen)\n\n# do your training ...\n\n# then save it\n\ntrainer.save('.\u002Fcheckpoint.pt')\n\n# you should see a message informing you that .\u002Fcheckpoint.pt is commandable from the terminal\n```\n\nIt really should be as simple as that\n\nYou can also pass this checkpoint file around, and anyone can continue finetuning on their own data\n\n```python\nfrom imagen_pytorch import load_imagen_from_checkpoint, ImagenTrainer\n\nimagen = load_imagen_from_checkpoint('.\u002Fcheckpoint.pt')\n\ntrainer = ImagenTrainer(imagen)\n\n# continue training \u002F fine-tuning\n```\n\n## Inpainting\n\nInpainting follows the formulation laid out by the recent \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.09865\">Repaint paper\u003C\u002Fa>. Simply pass in `inpaint_images` and `inpaint_masks` to the `sample` function on either `Imagen` or `ElucidatedImagen`\n\n```python\n\ninpaint_images = torch.randn(4, 3, 512, 512).cuda()      # (batch, channels, height, width)\ninpaint_masks = torch.ones((4, 512, 512)).bool().cuda()  # (batch, height, width)\n\ninpainted_images = trainer.sample(texts = [\n    'a whale breaching from afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles',\n    'dust motes swirling in the morning sunshine on the windowsill'\n], inpaint_images = inpaint_images, inpaint_masks = inpaint_masks, cond_scale = 5.)\n\ninpainted_images # (4, 3, 512, 512)\n```\n\nFor video, similarly pass in your videos to the `inpaint_videos` keyword on `.sample`. 
Inpainting mask can either be the same across all frames `(batch, height, width)` or different `(batch, frames, height, width)`\n\n```python\n\ninpaint_videos = torch.randn(4, 3, 8, 512, 512).cuda()   # (batch, channels, frames, height, width)\ninpaint_masks = torch.ones((4, 8, 512, 512)).bool().cuda()  # (batch, frames, height, width)\n\ninpainted_videos = trainer.sample(texts = [\n    'a whale breaching from afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles',\n    'dust motes swirling in the morning sunshine on the windowsill'\n], inpaint_videos = inpaint_videos, inpaint_masks = inpaint_masks, cond_scale = 5.)\n\ninpainted_videos # (4, 3, 8, 512, 512)\n```\n\n## Experimental\n\n\u003Ca href=\"https:\u002F\u002Fresearch.nvidia.com\u002Fperson\u002Ftero-karras\">Tero Karras\u003C\u002Fa> of StyleGAN fame has written a \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00364\">new paper\u003C\u002Fa> with results that have been corroborated by a number of independent researchers as well as on my own machine. I have decided to create a version of `Imagen`, the `ElucidatedImagen`, so that one can use the new elucidated DDPM for text-guided cascading generation.\n\nSimply import `ElucidatedImagen`, and then instantiate the instance as you did before. 
The hyperparameters are different than the usual ones for discrete and continuous time gaussian diffusion, and can be individualized for each unet in the cascade.\n\nEx.\n\n```python\nfrom imagen_pytorch import ElucidatedImagen\n\n# instantiate your unets ...\n\nimagen = ElucidatedImagen(\n    unets = (unet1, unet2),\n    image_sizes = (64, 128),\n    cond_drop_prob = 0.1,\n    num_sample_steps = (64, 32), # number of sample steps - 64 for base unet, 32 for upsampler (just an example, have no clue what the optimal values are)\n    sigma_min = 0.002,           # min noise level\n    sigma_max = (80, 160),       # max noise level, @crowsonkb recommends double the max noise level for upsampler\n    sigma_data = 0.5,            # standard deviation of data distribution\n    rho = 7,                     # controls the sampling schedule\n    P_mean = -1.2,               # mean of log-normal distribution from which noise is drawn for training\n    P_std = 1.2,                 # standard deviation of log-normal distribution from which noise is drawn for training\n    S_churn = 80,                # parameters for stochastic sampling - depends on dataset, Table 5 in paper\n    S_tmin = 0.05,\n    S_tmax = 50,\n    S_noise = 1.003,\n).cuda()\n\n# rest is the same as above\n\n```\n\n## Text to Video\n\nThis repository will also start accumulating new research around text guided video synthesis. 
For starters it will adopt the 3d unet architecture described by Jonathan Ho in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.03458\">Video Diffusion Models\u003C\u002Fa>\n\nUpdate: verified \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fimagen-pytorch\u002Fissues\u002F305#issuecomment-1407015141\">working\u003C\u002Fa> by \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FHReynaud\">Hadrien Reynaud\u003C\u002Fa>!\n\nEx.\n\n```python\nimport torch\nfrom imagen_pytorch import Unet3D, ElucidatedImagen, ImagenTrainer\n\nunet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8)).cuda()\n\nunet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4, 8)).cuda()\n\n# elucidated imagen, which contains the unets above (base unet and super resoluting ones)\n\nimagen = ElucidatedImagen(\n    unets = (unet1, unet2),\n    image_sizes = (16, 32),\n    random_crop_sizes = (None, 16),\n    temporal_downsample_factor = (2, 1),        # in this example, the first unet would receive the video temporally downsampled by 2x\n    num_sample_steps = 10,\n    cond_drop_prob = 0.1,\n    sigma_min = 0.002,                          # min noise level\n    sigma_max = (80, 160),                      # max noise level, double the max noise level for upsampler\n    sigma_data = 0.5,                           # standard deviation of data distribution\n    rho = 7,                                    # controls the sampling schedule\n    P_mean = -1.2,                              # mean of log-normal distribution from which noise is drawn for training\n    P_std = 1.2,                                # standard deviation of log-normal distribution from which noise is drawn for training\n    S_churn = 80,                               # parameters for stochastic sampling - depends on dataset, Table 5 in paper\n    S_tmin = 0.05,\n    S_tmax = 50,\n    S_noise = 1.003,\n).cuda()\n\n# mock videos (get a lot of this) and text encodings from large T5\n\ntexts = [\n    'a whale breaching from 
afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles',\n    'dust motes swirling in the morning sunshine on the windowsill'\n]\n\nvideos = torch.randn(4, 3, 10, 32, 32).cuda() # (batch, channels, time \u002F video frames, height, width)\n\n# feed images into imagen, training each unet in the cascade\n# for this example, only training unet 1\n\ntrainer = ImagenTrainer(imagen)\n\n# you can also ignore time when training on video initially, shown to improve results in video-ddpm paper. eventually will make the 3d unet trainable with either images or video. research shows it is essential (with current data regimes) to train first on text-to-image. probably won't be true in another decade. all big data becomes small data\n\ntrainer(videos, texts = texts, unet_number = 1, ignore_time = False)\ntrainer.update(unet_number = 1)\n\nvideos = trainer.sample(texts = texts, video_frames = 20) # extrapolating to 20 frames from training on 10 frames\n\nvideos.shape # (4, 3, 20, 32, 32)\n\n```\n\nYou can also train on text - image pairs first. The `Unet3D` will automatically convert it to single framed videos and learn without the temporal components (by automatically setting `ignore_time = True`), whether it be 1d convolutions or causal attention across time.\n\nThis is the current approach taken by all the big artificial intelligence labs (Brain, MetaAI, Bytedance)\n\n## FAQ\n\n- Why are my generated images not aligning well with the text?\n\nImagen uses an algorithm called \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fforum?id=qw8AKxfYbI\">Classifier Free Guidance\u003C\u002Fa>. 
When sampling, you apply a scale to the conditioning (text in this case) of greater than `1.0`.\n\nResearcher \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNetruk44\">Netruk44\u003C\u002Fa> has reported `5-10` to be optimal, with anything greater than `10` tending to break.\n\n```python\ntrainer.sample(texts = [\n    'a cloud in the shape of a roman gladiator'\n], cond_scale = 5.) # \u003C-- cond_scale is the conditioning scale, needs to be greater than 1.0 to be better than average\n```\n\n- Are there any pretrained models yet?\n\nNot at the moment but one will likely be trained and open sourced within the year, if not sooner. If you would like to participate, you can join the community of artificial neural network trainers at Laion (discord link is in the Readme above) and start collaborating.\n\n- Will this technology take my job?\n\nAll the more reason why you should start training your own model, starting today! The last thing we need is this technology being in the hands of an elite few. Hopefully this repository reduces the work to just finding the necessary compute, and augmenting with your own curated dataset.\n\n- What am I allowed to do with this repository?\n\nAnything! It is MIT licensed. In other words, you can freely copy \u002F paste for your own research, remixed for whatever modality you can think of. 
Go train amazing models for profit, for science, or simply to satiate your own personal pleasure at witnessing something divine unravel in front of you.\n\n## Cool Applications!\n\n- \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12644\">Echocardiogram synthesis\u003C\u002Fa> \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FHReynaud\u002FEchoDiffusion\">[Code]\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.10.25.564065v1\">SOTA Hi-C contact matrix synthesis\u003C\u002Fa> \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FCHNFTQ\u002FCapricorn\">[Code]\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15941\">Floor plan generation\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.01152\">Ultra High Resolution Histopathology Slides\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03043\">Synthetic Laparoscopic Images\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs42256-023-00762-x\">Designing MetaMaterials\u003C\u002Fa>\n\n## Related Works\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Farchinetai\u002Faudio-diffusion-pytorch\">Audio diffusion\u003C\u002Fa> from \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fflavioschneider\">Flavio Schneider\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FAssemblyAI-Examples\u002FMinImagen\">Mini Imagen\u003C\u002Fa> from \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Foconnoob\">Ryan O.\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fwww.assemblyai.com\u002Fblog\u002Fbuild-your-own-imagen-text-to-image-model\u002F\">AssemblyAI writeup\u003C\u002Fa>\n\n## Todo\n\n- [x] use huggingface transformers for T5-small text embeddings\n- [x] add dynamic thresholding\n- [x] add dynamic thresholding DALLE2 and video-diffusion repository as well\n- [x] allow for one to set T5-large (and perhaps small factory method 
to take in any huggingface transformer)\n- [x] add the lowres noise level with the pseudocode in appendix, and figure out what is this sweep they do at inference time\n- [x] port over some training code from DALLE2\n- [x] need to be able to use a different noise schedule per unet (cosine was used for base, but linear for SR)\n- [x] just make one master-configurable unet\n- [x] complete resnet block (biggan inspired? but with groupnorm) - complete self attention\n- [x] complete conditioning embedding block (and make it completely configurable, whether it be attention, film etc)\n- [x] consider using perceiver-resampler from https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fflamingo-pytorch in place of attention pooling\n- [x] add attention pooling option, in addition to cross attention and film\n- [x] add optional cosine decay schedule with warmup, for each unet, to trainer\n- [x] switch to continuous timesteps instead of discretized, as it seems that is what they used for all stages - first figure out the linear noise schedule case from the variational ddpm paper https:\u002F\u002Fopenreview.net\u002Fforum?id=2LdBqxc1Yv\n- [x] figure out log(snr) for alpha cosine noise schedule.\n- [x] suppress the transformers warning because only T5encoder is used\n- [x] allow setting for using linear attention on layers where full attention cannot be used\n- [x] force unets in continuous time case to use non-fouriered conditions (just pass the log(snr) through an MLP with optional layernorms), as that is what i have working locally\n- [x] removed learned variance\n- [x] add p2 loss weighting for continuous time\n- [x] make sure cascading ddpm can be trained without text condition, and make sure both continuous and discrete time gaussian diffusion works\n- [x] use primer's depthwise convs on the qkv projections in linear attention (or use token shifting before projections) - also use new dropout proposed by bayesformer, as it seems to work well with linear attention\n- [x] 
explore skip layer excitation in unet decoder\n- [x] accelerate integration\n- [x] build out CLI tool and one-line generation of image\n- [x] knock out any issues that arose from accelerate\n- [x] add inpainting ability using resampler from repaint paper https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.09865\n- [x] build a simple checkpointing system, backed by a folder\n- [x] add skip connection from outputs of all upsample blocks, used in unet squared paper and some previous unet works\n- [x] add fsspec, recommended by Romain @rom1504, for cloud \u002F local file system agnostic persistence of checkpoints\n- [x] test out persistence in gcs with https:\u002F\u002Fgithub.com\u002Ffsspec\u002Fgcsfs\n- [x] extend to video generation, using axial time attention as in Ho's video ddpm paper\n- [x] allow elucidated imagen to generalize to any shape\n- [x] allow for imagen to generalize to any shape\n- [x] add \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fx-transformers#dynamic-positional-bias\">dynamic positional bias\u003C\u002Fa> for the best type of length extrapolation across video time\n- [x] move video frames to sample function, as we will be attempting time extrapolation\n- [x] attention bias to null key \u002F values should be a learned scalar of head dimension\n- [x] add self-conditioning from \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.04202\">bit diffusion\u003C\u002Fa> paper, already coded up at \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdenoising-diffusion-pytorch\u002Fcommit\u002Fbeb2f2d8dd9b4f2bd5be4719f37082fe061ee450\">ddpm-pytorch\u003C\u002Fa>\n- [x] add v-parameterization (https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.00512) from \u003Ca href=\"https:\u002F\u002Fimagen.research.google\u002Fvideo\u002Fpaper.pdf\">imagen video\u003C\u002Fa> paper, the only thing new\n- [x] incorporate all learnings from make-a-video (https:\u002F\u002Fmakeavideo.studio\u002F)\n- [x] build out CLI tool for 
training, resuming training off config file\n- [x] allow for temporal interpolation at specific stages\n- [x] make sure temporal interpolation works with inpainting\n- [x] make sure one can customize all interpolation modes (some researchers are finding better results with trilinear)\n- [x] imagen-video : allow for conditioning on preceding (and possibly future) frames of videos. ignore time should not be allowed in that scenario\n- [x] make sure to automatically take care of temporal down\u002Fupsampling for conditioning video frames, but allow for an option to turn it off\n- [x] make sure inpainting works with video\n- [x] make sure inpainting mask for video can be customized per frame\n\n- [ ] add flash attention\n- [ ] reread \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.15868\">cogvideo\u003C\u002Fa> and figure out how frame rate conditioning could be used\n- [ ] bring in attention expertise for self attention layers in unet3d\n- [ ] consider bringing in NUWA's 3d convolutional attention\n- [ ] consider transformer-xl memories in the temporal attention blocks\n- [ ] consider \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fperceiver-ar-pytorch\">perceiver-ar approach\u003C\u002Fa> to attending to past time\n- [ ] frame dropouts during attention for achieving both regularizing effect as well as shortened training time\n- [ ] investigate frank wood's claims https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fflexible-diffusion-modeling-videos-pytorch and either add the hierarchical sampling technique, or let people know about its deficiencies\n- [ ] offer challenging moving mnist (with distractor objects) as a one-line trainable baseline for researchers to branch off of for text to video\n- [ ] preencoding of text to memmapped embeddings\n- [ ] be able to create dataloader iterators based on the old epoch style, also configure shuffling etc\n- [ ] be able to also pass in arguments (instead of requiring forward to be all keyword args on 
model)\n- [ ] bring in reversible blocks from revnets for 3d unet, to lessen memory burden\n- [ ] add ability to only train super-resolution network\n- [ ] read \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00927v1\">dpm-solver\u003C\u002Fa> and see if it is applicable to continuous time gaussian diffusion\n- [ ] allow for conditioning video frames with arbitrary absolute times (calculate RPE during temporal attention)\n- [ ] accommodate \u003Ca href=\"https:\u002F\u002Fdreambooth.github.io\u002F\">dream booth\u003C\u002Fa> fine tuning\n- [ ] add textual inversion\n- [ ] cleanup self conditioning to be extracted at imagen instantiation\n- [ ] make sure eventual dreambooth works with imagen-video\n- [ ] add framerate conditioning for video diffusion\n- [ ] make sure one can simultaneously condition on video frames as a prompt, as well as some conditioning image across all frames\n- [ ] test and add distillation technique from \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.01469\">consistency models\u003C\u002Fa>\n\n## Citations\n\n```bibtex\n@inproceedings{Saharia2022PhotorealisticTD,\n    title   = {Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding},\n    author  = {Chitwan Saharia and William Chan and Saurabh Saxena and Lala Li and Jay Whang and Emily L. 
Denton and Seyed Kamyar Seyed Ghasemipour and Burcu Karagol Ayan and Seyedeh Sara Mahdavi and Raphael Gontijo Lopes and Tim Salimans and Jonathan Ho and David Fleet and Mohammad Norouzi},\n    year    = {2022}\n}\n```\n\n```bibtex\n@article{Alayrac2022Flamingo,\n    title   = {Flamingo: a Visual Language Model for Few-Shot Learning},\n    author  = {Jean-Baptiste Alayrac et al},\n    year    = {2022}\n}\n```\n\n```bibtex\n@inproceedings{Sankararaman2022BayesFormerTW,\n    title   = {BayesFormer: Transformer with Uncertainty Estimation},\n    author  = {Karthik Abinav Sankararaman and Sinong Wang and Han Fang},\n    year    = {2022}\n}\n```\n\n```bibtex\n@article{So2021PrimerSF,\n    title   = {Primer: Searching for Efficient Transformers for Language Modeling},\n    author  = {David R. So and Wojciech Ma{\\'n}ke and Hanxiao Liu and Zihang Dai and Noam M. Shazeer and Quoc V. Le},\n    journal = {ArXiv},\n    year    = {2021},\n    volume  = {abs\u002F2109.08668}\n}\n```\n\n```bibtex\n@misc{cao2020global,\n    title   = {Global Context Networks},\n    author  = {Yue Cao and Jiarui Xu and Stephen Lin and Fangyun Wei and Han Hu},\n    year    = {2020},\n    eprint  = {2012.13375},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@article{Karras2022ElucidatingTD,\n    title   = {Elucidating the Design Space of Diffusion-Based Generative Models},\n    author  = {Tero Karras and Miika Aittala and Timo Aila and Samuli Laine},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs\u002F2206.00364}\n}\n```\n\n```bibtex\n@inproceedings{NEURIPS2020_4c5bcfec,\n    author      = {Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},\n    booktitle   = {Advances in Neural Information Processing Systems},\n    editor      = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. 
Lin},\n    pages       = {6840--6851},\n    publisher   = {Curran Associates, Inc.},\n    title       = {Denoising Diffusion Probabilistic Models},\n    url         = {https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2020\u002Ffile\u002F4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf},\n    volume      = {33},\n    year        = {2020}\n}\n```\n\n```bibtex\n@article{Lugmayr2022RePaintIU,\n    title   = {RePaint: Inpainting using Denoising Diffusion Probabilistic Models},\n    author  = {Andreas Lugmayr and Martin Danelljan and Andr{\\'e}s Romero and Fisher Yu and Radu Timofte and Luc Van Gool},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs\u002F2201.09865}\n}\n```\n\n```bibtex\n@misc{ho2022video,\n    title   = {Video Diffusion Models},\n    author  = {Jonathan Ho and Tim Salimans and Alexey Gritsenko and William Chan and Mohammad Norouzi and David J. Fleet},\n    year    = {2022},\n    eprint  = {2204.03458},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@inproceedings{rogozhnikov2022einops,\n    title   = {Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation},\n    author  = {Alex Rogozhnikov},\n    booktitle = {International Conference on Learning Representations},\n    year    = {2022},\n    url     = {https:\u002F\u002Fopenreview.net\u002Fforum?id=oapKSVM2bcj}\n}\n```\n\n```bibtex\n@misc{chen2022analog,\n    title   = {Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning},\n    author  = {Ting Chen and Ruixiang Zhang and Geoffrey Hinton},\n    year    = {2022},\n    eprint  = {2208.04202},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{Singer2022,\n    author  = {Uriel Singer},\n    url     = {https:\u002F\u002Fmakeavideo.studio\u002FMake-A-Video.pdf}\n}\n```\n\n```bibtex\n@article{Sunkara2022NoMS,\n    title   = {No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and 
Small Objects},\n    author  = {Raja Sunkara and Tie Luo},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs\u002F2208.03641}\n}\n```\n\n```bibtex\n@article{Salimans2022ProgressiveDF,\n    title   = {Progressive Distillation for Fast Sampling of Diffusion Models},\n    author  = {Tim Salimans and Jonathan Ho},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs\u002F2202.00512}\n}\n```\n\n```bibtex\n@article{Ho2022ImagenVH,\n    title   = {Imagen Video: High Definition Video Generation with Diffusion Models},\n    author  = {Jonathan Ho and William Chan and Chitwan Saharia and Jay Whang and Ruiqi Gao and Alexey A. Gritsenko and Diederik P. Kingma and Ben Poole and Mohammad Norouzi and David J. Fleet and Tim Salimans},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs\u002F2210.02303}\n}\n```\n\n```bibtex\n@misc{gilmer2023intriguing,\n    title  = {Intriguing Properties of Transformer Training Instabilities},\n    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},\n    year   = {2023},\n    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}\n}\n```\n\n```bibtex\n@inproceedings{Hang2023EfficientDT,\n    title   = {Efficient Diffusion Training via Min-SNR Weighting Strategy},\n    author  = {Tiankai Hang and Shuyang Gu and Chen Li and Jianmin Bao and Dong Chen and Han Hu and Xin Geng and Baining Guo},\n    year    = {2023}\n}\n```\n\n```bibtex\n@article{Zhang2021TokenST,\n    title   = {Token Shift Transformer for Video Classification},\n    author  = {Hao Zhang and Y. 
Hao and Chong-Wah Ngo},\n    journal = {Proceedings of the 29th ACM International Conference on Multimedia},\n    year    = {2021}\n}\n```\n\n```bibtex\n@inproceedings{anonymous2022normformer,\n    title   = {NormFormer: Improved Transformer Pretraining with Extra Normalization},\n    author  = {Anonymous},\n    booktitle = {Submitted to The Tenth International Conference on Learning Representations},\n    year    = {2022},\n    url     = {https:\u002F\u002Fopenreview.net\u002Fforum?id=GMYWzWztDx5},\n    note    = {under review}\n}\n```\n\n```bibtex\n@inproceedings{Sadat2024EliminatingOA,\n    title   = {Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models},\n    author  = {Seyedmorteza Sadat and Otmar Hilliges and Romann M. Weber},\n    year    = {2024},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:273098845}\n}\n```\n","This project is a PyTorch implementation of Imagen, a text-to-image neural network developed by Google that outperforms DALL-E2. Its core features include generating high-quality images conditioned on embeddings from a large pretrained T5 model (attention network), using a cascading DDPM architecture combined with dynamic clipping, noise level conditioning, and a memory-efficient UNet design to improve classifier-free guidance. It suits applications that need high-fidelity text-to-image conversion, such as creative design and virtual content generation.","2026-06-11 03:24:34","top_topic"]