[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9783":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},9783,"DALLE-pytorch","lucidrains\u002FDALLE-pytorch","lucidrains","Implementation \u002F replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch","",null,"Python",5628,642,91,120,0,4,39.42,"MIT License",false,"main",true,[24,25,26,27,28,29],"artificial-intelligence","attention-mechanism","deep-learning","multi-modal","text-to-image","transformers","2026-06-12 02:02:12","# DALL-E in Pytorch\n\n\u003Cp align='center'>\n  \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgist\u002Fafiaka87\u002Fb29213684a1dd633df20cab49d05209d\u002Ftrain_dalle_pytorch.ipynb\">\n         \u003Cimg alt=\"Train DALL-E w\u002F DeepSpeed\" src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FxBPBXfcFHd\">\u003Cimg alt=\"Join us on Discord\" src=\"https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F823813159592001537?color=5865F2&logo=discord&logoColor=white\">\u003C\u002Fa>\u003C\u002Fbr>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frobvanvolt\u002FDALLE-models\">Released DALLE Models\u003C\u002Fa>\u003C\u002Fbr>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002From1504\u002Fdalle-service\">Web-Hostable DALLE 
Checkpoints\u003C\u002Fa>\u003C\u002Fbr>\n\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=j4xgkjWlfL4\">Yannic Kilcher's video\u003C\u002Fa>\n\u003Cp>\nImplementation \u002F replication of \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Fblog\u002Fdall-e\u002F\">DALL-E\u003C\u002Fa> (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.12092\">paper\u003C\u002Fa>), OpenAI's Text to Image Transformer, in Pytorch.  It will also contain \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F\">CLIP\u003C\u002Fa> for ranking the generations.\n\n---\n\n\n\n[Quick Start](https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fwiki)\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdeep-daze\">Deep Daze\u003C\u002Fa> or \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fbig-sleep\">Big Sleep\u003C\u002Fa> are great alternatives!\n\nFor generating video and audio, please see \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fnuwa-pytorch\">NÜWA\u003C\u002Fa>\n\n## Appreciation\n  \nThis library could not have been possible without the contributions of \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FjanEbert\">janEbert\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fafiaka87\">Clay\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frobvanvolt\">robvanvolt\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002From1504\">Romain Beaumont\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fborzunov\">Alexander\u003C\u002Fa>! 🙏\n\n## Status\n\u003Cp align='center'>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhtoyryla\">Hannu\u003C\u002Fa> has managed to train a small 6 layer DALL-E on a dataset of just 2000 landscape images! 
(2048 visual tokens)\n\n\u003Cimg src=\".\u002Fimages\u002Flandscape.png\">\u003C\u002Fimg>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fkobiso\">Kobiso\u003C\u002Fa>, a research engineer from Naver, has trained on the CUB200 dataset \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fdiscussions\u002F131\">here\u003C\u002Fa>, using full and deepspeed sparse attention\n\n\u003Cimg src=\".\u002Fimages\u002Fbirds.png\" width=\"256\">\u003C\u002Fimg>\n\n- (3\u002F15\u002F21) \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fafiaka87\">afiaka87\u003C\u002Fa> has managed one epoch using a reversible DALL-E and the dVAE \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fissues\u002F86#issue-832121328\">here\u003C\u002Fa>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frobvanvolt\">TheodoreGalanos\u003C\u002Fa> has trained on 150k layouts with the following results\n\u003Cp>\n  \u003Cimg src=\".\u002Fimages\u002Flayouts-1.jpg\" width=\"256\">\u003C\u002Fimg>\n  \u003Cimg src=\".\u002Fimages\u002Flayouts-2.jpg\" width=\"256\">\u003C\u002Fimg>\n\u003C\u002Fp>\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002From1504\">Rom1504\u003C\u002Fa> has trained on 50k fashion images with captions with a really small DALL-E (2 layers) for just 24 hours with the following results\n\u003Cp\u002F>\n\u003Cimg src=\".\u002Fimages\u002Fclothing.png\" width=\"420\">\u003C\u002Fimg>\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fafiaka87\">afiaka87\u003C\u002Fa> trained for 6 epochs on the same dataset as before thanks to the efficient 16k VQGAN with the following \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fdiscussions\u002F322\">results\u003C\u002Fa>\n\n\u003Cp align='centered'>\n  \u003Cimg src=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F3994972\u002F123564891-b6f18780-d780-11eb-9019-8a1b6178f861.png\" width=\"420\" alt-text='a photo of 
westwood park, san francisco, from the water in the afternoon'>\u003C\u002Fimg>\n  \u003Cimg src=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F3994972\u002F123564776-4c404c00-d780-11eb-9c8e-3356df358df3.png\" width=\"420\" alt-text='a female mannequin dressed in an olive button-down shirt and gold palazzo pants'> \u003C\u002Fimg>\n\u003C\u002Fp>\n  \nThanks to the amazing \"mega b#6696\" you can generate from this checkpoint in colab - \n\u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing\">\n  \u003Cimg alt=\"Run inference on the Afiaka checkpoint in Colab\" src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\">\n\u003C\u002Fa>\n\n- (5\u002F2\u002F21) First \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsberbank-ai\u002Fru-dalle\">1.3B DALL-E\u003C\u002Fa> from 🇷🇺 has been trained and released to the public! 🎉\n\n- (4\u002F8\u002F22) Moving onwards to \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fdalle2-pytorch\">DALLE-2\u003C\u002Fa>!\n\n## Install\n\n```bash\n$ pip install dalle-pytorch\n```\n\n## Usage\n\nTrain VAE\n\n```python\nimport torch\nfrom dalle_pytorch import DiscreteVAE\n\nvae = DiscreteVAE(\n    image_size = 256,\n    num_layers = 3,           # number of downsamples - ex. 256 \u002F (2 ** 3) = (32 x 32 feature map)\n    num_tokens = 8192,        # number of visual tokens. in the paper, they used 8192, but could be smaller for downsized projects\n    codebook_dim = 512,       # codebook dimension\n    hidden_dim = 64,          # hidden dimension\n    num_resnet_blocks = 1,    # number of resnet blocks\n    temperature = 0.9,        # gumbel softmax temperature, the lower this is, the harder the discretization\n    straight_through = False, # straight-through for gumbel softmax. 
unclear if it is better one way or the other\n)\n\nimages = torch.randn(4, 3, 256, 256)\n\nloss = vae(images, return_loss = True)\nloss.backward()\n\n# train with a lot of data to learn a good codebook\n```\n\nTrain DALL-E with pretrained VAE from above\n\n```python\nimport torch\nfrom dalle_pytorch import DiscreteVAE, DALLE\n\nvae = DiscreteVAE(\n    image_size = 256,\n    num_layers = 3,\n    num_tokens = 8192,\n    codebook_dim = 1024,\n    hidden_dim = 64,\n    num_resnet_blocks = 1,\n    temperature = 0.9\n)\n\ndalle = DALLE(\n    dim = 1024,\n    vae = vae,                  # automatically infer (1) image sequence length and (2) number of image tokens\n    num_text_tokens = 10000,    # vocab size for text\n    text_seq_len = 256,         # text sequence length\n    depth = 12,                 # should aim to be 64\n    heads = 16,                 # attention heads\n    dim_head = 64,              # attention head dimension\n    attn_dropout = 0.1,         # attention dropout\n    ff_dropout = 0.1            # feedforward dropout\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = dalle(text, images, return_loss = True)\nloss.backward()\n\n# do the above for a long time with a lot of data ... then\n\nimages = dalle.generate_images(text)\nimages.shape # (4, 3, 256, 256)\n```\n\nTo prime with a starting crop of an image, simply pass two more arguments\n\n```python\nimg_prime = torch.randn(4, 3, 256, 256)\n\nimages = dalle.generate_images(\n    text,\n    img = img_prime,\n    num_init_img_tokens = (14 * 32)  # you can set the size of the initial crop, defaults to a little less than ~1\u002F2 of the tokens, as done in the paper\n)\n\nimages.shape # (4, 3, 256, 256)\n```\n\nYou may also want to generate text using DALL-E. 
For that, call this function:\n\n```python\ntext_tokens, texts = dalle.generate_texts(tokenizer, text)\n```\n\n## OpenAI's Pretrained VAE\n\nYou can also skip the training of the VAE altogether, using the pretrained model released by OpenAI! The wrapper class should take care of downloading and caching the model for you auto-magically.\n\n```python\nimport torch\nfrom dalle_pytorch import OpenAIDiscreteVAE, DALLE\n\nvae = OpenAIDiscreteVAE()       # loads pretrained OpenAI VAE\n\ndalle = DALLE(\n    dim = 1024,\n    vae = vae,                  # automatically infer (1) image sequence length and (2) number of image tokens\n    num_text_tokens = 10000,    # vocab size for text\n    text_seq_len = 256,         # text sequence length\n    depth = 1,                  # should aim to be 64\n    heads = 16,                 # attention heads\n    dim_head = 64,              # attention head dimension\n    attn_dropout = 0.1,         # attention dropout\n    ff_dropout = 0.1            # feedforward dropout\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = dalle(text, images, return_loss = True)\nloss.backward()\n```\n\n## Taming Transformer's Pretrained VQGAN VAE\n\nYou can also use the pretrained VAE offered by the authors of \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers\">Taming Transformers\u003C\u002Fa>! Currently only the VAE with a codebook size of 1024 is offered, with the hope that it may train a little faster than OpenAI's, which has a size of 8192.\n\nIn contrast to OpenAI's VAE, it also has an extra layer of downsampling, so the image sequence length is 256 instead of 1024 (this will lead to roughly a 16x reduction in attention cost, since full attention scales quadratically with sequence length: (1024 \u002F 256)^2 = 16). 
Whether it will generalize as well as the original DALL-E is up to the citizen scientists out there to discover.\n\nUpdate - \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fdiscussions\u002F131\">it works!\u003C\u002Fa>\n\n```python\nfrom dalle_pytorch import VQGanVAE\n\nvae = VQGanVAE()\n\n# the rest is the same as the above example\n```\n\nThe default VQGan is the codebook size 1024 one trained on imagenet. If you wish to use a different one, you can use the `vqgan_model_path` and `vqgan_config_path` to pass the .ckpt file and the .yaml file. These options can be used both in train-dalle script or as argument of VQGanVAE class. Other pretrained VQGAN can be found in [taming transformers readme](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers#overview-of-pretrained-models). If you want to train a custom one you can [follow this guide](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers\u002Fpull\u002F54)\n\n\n## Adjust text conditioning strength\n\nRecently there has surfaced a \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fforum?id=qw8AKxfYbI\">new technique\u003C\u002Fa> for guiding diffusion models without a classifier. The gist of the technique involves randomly dropping out the text condition during training, and at inference time, deriving the rough direction from unconditional to conditional distributions.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcrowsonkb\">Katherine Crowson\u003C\u002Fa> outlined in a \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002FRiversHaveWings\u002Fstatus\u002F1478093658716966912\">tweet\u003C\u002Fa> how this could work for autoregressive attention models. I have decided to include her idea in this repository for further exploration. 
One only has to account for two extra keyword arguments on training (`null_cond_prob`) and generation (`cond_scale`).\n\n```python\nimport torch\nfrom dalle_pytorch import DiscreteVAE, DALLE\n\nvae = DiscreteVAE(\n    image_size = 256,\n    num_layers = 3,\n    num_tokens = 8192,\n    codebook_dim = 1024,\n    hidden_dim = 64,\n    num_resnet_blocks = 1,\n    temperature = 0.9\n)\n\ndalle = DALLE(\n    dim = 1024,\n    vae = vae,\n    num_text_tokens = 10000,\n    text_seq_len = 256,\n    depth = 12,\n    heads = 16,\n    dim_head = 64,\n    attn_dropout = 0.1,\n    ff_dropout = 0.1\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = dalle(\n    text,\n    images,\n    return_loss = True,\n    null_cond_prob = 0.2  # firstly, set this to the probability of dropping out the condition, 20% is recommended as a default\n)\n\nloss.backward()\n\n# do the above for a long time with a lot of data ... then\n\nimages = dalle.generate_images(\n    text,\n    cond_scale = 3. 
# secondly, set this to a value greater than 1 to increase the conditioning beyond average\n)\n\nimages.shape # (4, 3, 256, 256)\n```\n\nThat's it!\n\n## Ranking the generations\n\nTrain CLIP\n\n```python\nimport torch\nfrom dalle_pytorch import CLIP\n\nclip = CLIP(\n    dim_text = 512,\n    dim_image = 512,\n    dim_latent = 512,\n    num_text_tokens = 10000,\n    text_enc_depth = 6,\n    text_seq_len = 256,\n    text_heads = 8,\n    num_visual_tokens = 512,\n    visual_enc_depth = 6,\n    visual_image_size = 256,\n    visual_patch_size = 32,\n    visual_heads = 8\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\nmask = torch.ones_like(text).bool()\n\nloss = clip(text, images, text_mask = mask, return_loss = True)\nloss.backward()\n```\n\nTo get the similarity scores from your trained Clipper, just do\n\n```python\nimages, scores = dalle.generate_images(text, mask = mask, clip = clip)\n\nscores.shape # (2,)\nimages.shape # (2, 3, 256, 256)\n\n# do your topk here, in paper they sampled 512 and chose top 32\n```\n\nOr you can just use the official \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP\">CLIP model\u003C\u002Fa> to rank the images from DALL-E\n\n## Scaling depth\n\nIn the blog post, they used 64 layers to achieve their results. I added reversible networks, from the \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002Freformer-pytorch\">Reformer\u003C\u002Fa> paper, in order for users to attempt to scale depth at the cost of compute. 
Reversible networks allow you to scale to any depth at no memory cost, but a little over 2x compute cost (each layer is rerun on the backward pass).\n\nSimply set the `reversible` keyword to `True` for the `DALLE` class\n\n```python\ndalle = DALLE(\n    dim = 1024,\n    vae = vae,\n    num_text_tokens = 10000,\n    text_seq_len = 256,\n    depth = 64,\n    heads = 16,\n    reversible = True  # \u003C-- reversible networks https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.04451\n)\n```\n\n## Sparse Attention\n\nThe blogpost alluded to a mixture of different types of sparse attention, used mainly on the image (while the text presumably had full causal attention). I have done my best to replicate these types of sparse attention, on the scant details released. Primarily, it seems as though they are doing causal axial row \u002F column attention, combined with a causal convolution-like attention.\n\nBy default `DALLE` will use full attention for all layers, but you can specify the attention type per layer as follows.\n\n- `full` full attention\n\n- `axial_row` axial attention, along the rows of the image feature map\n\n- `axial_col` axial attention, along the columns of the image feature map\n\n- `conv_like` convolution-like attention, for the image feature map\n\nThe sparse attention only applies to the image. Text will always receive full attention, as said in the blogpost.\n\n```python\ndalle = DALLE(\n    dim = 1024,\n    vae = vae,\n    num_text_tokens = 10000,\n    text_seq_len = 256,\n    depth = 64,\n    heads = 16,\n    reversible = True,\n    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')  # cycles between these four types of attention\n)\n```\n\n## Deepspeed Sparse Attention\n\nYou can also train with Microsoft Deepspeed's \u003Ca href=\"https:\u002F\u002Fwww.deepspeed.ai\u002Fnews\u002F2020\u002F09\u002F08\u002Fsparse-attention.html\">Sparse Attention\u003C\u002Fa>, with any combination of dense and sparse attention that you'd like. 
However, you will have to endure the installation process.\n\nFirst, you need to install Deepspeed with Sparse Attention\n\n```bash\n$ sh install_deepspeed.sh\n```\n\nNext, you need to install the pip package `triton`. It will need to be a version `\u003C 1.0` because that's what Microsoft used.\n\n```bash\n$ pip install triton==0.4.2\n```\n\nIf both of the above succeeded, now you can train with Sparse Attention!\n\n```python\ndalle = DALLE(\n    dim = 512,\n    vae = vae,\n    num_text_tokens = 10000,\n    text_seq_len = 256,\n    depth = 64,\n    heads = 8,\n    attn_types = ('full', 'sparse')  # interleave sparse and dense attention for 64 layers\n)\n```\n\n## Training\n\nThis section will outline how to train the discrete variational autoencoder as well as the final multi-modal transformer (DALL-E). We are going to use \u003Ca href=\"https:\u002F\u002Fwandb.ai\u002F\">Weights & Biases\u003C\u002Fa> for all the experiment tracking.\n\n(You can also do everything in this section in a Google Colab, link below)\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1dWvA54k4fH8zAmiix3VXbg95uEIMfqQM?usp=sharing) Train in Colab\n\n```bash\n$ pip install wandb\n```\n\nFollowed by\n\n```bash\n$ wandb login\n```\n\n### VAE\n\nTo train the VAE, you just need to run\n\n```bash\n$ python train_vae.py --image_folder \u002Fpath\u002Fto\u002Fyour\u002Fimages\n```\n\nIf you installed everything correctly, a link to the experiments page should show up in your terminal. 
You can follow your link there and customize your experiment, like the example layout below.\n\n\u003Cimg src=\".\u002Fimages\u002Fwb.png\" width=\"700px\">\u003C\u002Fimg>\n\nYou can of course open up the training script at `.\u002Ftrain_vae.py`, where you can modify the constants, what is passed to Weights & Biases, or any other tricks you know to make the VAE learn better.\n\nThe model will be saved periodically to `.\u002Fvae.pt`\n\nIn the experiment tracker, you will have to monitor the hard reconstruction, as we are essentially teaching the network to compress images into discrete visual tokens for use in the transformer as a visual vocabulary.\n\nWeights and Biases will allow you to monitor the temperature annealing, image reconstructions (encoder and decoder working properly), as well as to watch out for codebook collapse (where the network decides to only use a few tokens out of what you provide it).\n\nOnce you have trained a decent VAE to your satisfaction, you can move on to the next step with your model weights at `.\u002Fvae.pt`.\n\n### DALL-E Training\n\n## Training using an Image-Text-Folder\n\nNow you just have to invoke the `.\u002Ftrain_dalle.py` script, indicating which VAE model you would like to use, as well as the path to your folder of images and text.\n\nThe dataset I am currently working with contains a folder of images and text files, arbitrarily nested in subfolders, where each text file name matches the corresponding image name, and where each text file contains multiple descriptions, delimited by newlines. The script will find and pair all the image and text files with the same names, and randomly select one of the textual descriptions during batch creation.\n\nex.\n\n```\n📂image-and-text-data\n ┣ 📜cat.png\n ┣ 📜cat.txt\n ┣ 📜dog.jpg\n ┣ 📜dog.txt\n ┣ 📜turtle.jpeg\n ┗ 📜turtle.txt\n```\n\nex. 
`cat.txt`\n\n```text\nA black and white cat curled up next to the fireplace\nA fireplace, with a cat sleeping next to it\nA black cat with a red collar napping\n```\n\nIf you have a dataset with its own directory structure for tying together image and text descriptions, do let me know in the issues, and I'll see if I can accommodate it in the script.\n\n```bash\n$ python train_dalle.py --vae_path .\u002Fvae.pt --image_text_folder \u002Fpath\u002Fto\u002Fdata\n```\n\nYou likely will not finish DALL-E training as quickly as you did your Discrete VAE. To resume from where you left off, just run the same script, but with the path to your DALL-E checkpoints.\n\n```bash\n$ python train_dalle.py --dalle_path .\u002Fdalle.pt --image_text_folder \u002Fpath\u002Fto\u002Fdata\n```\n\n## Training using WebDataset\n\nWebDataset files are regular .tar(.gz) files which can be streamed and used for DALLE-pytorch training.\nYou just need to provide the image (first comma-separated argument) and caption (second comma-separated argument)\ncolumn keys after the `--wds` argument. 
The `--image_text_folder` argument then points to your .tar(.gz) file instead of the data folder.\n\n```bash\n$ python train_dalle.py --wds img,cap --image_text_folder \u002Fpath\u002Fto\u002Fdata.tar(.gz)\n```\n\nDistributed training with deepspeed works the same way, e.g.:\n\n```bash\n$ deepspeed train_dalle.py --wds img,cap --image_text_folder \u002Fpath\u002Fto\u002Fdata.tar(.gz) --fp16 --deepspeed\n```\n\nIf your dataset is split into shards (several .tar(.gz) files), this is also supported:\n\n```bash\n$ deepspeed train_dalle.py --wds img,cap --image_text_folder \u002Fpath\u002Fto\u002Fshardfolder --fp16 --deepspeed\n```\n\nYou can stream the data from an HTTP server or Google Cloud Storage like this:\n\n```bash\n$ deepspeed train_dalle.py --image_text_folder \"http:\u002F\u002Fstorage.googleapis.com\u002Fnvdata-openimages\u002Fopenimages-train-{000000..000554}.tar\" --wds jpg,json --taming --truncate_captions --random_resize_crop_lower_ratio=0.8 --attn_types=full --epochs=2 --fp16 --deepspeed\n```\n\nTo convert your image-text-folder to WebDataset format, you can use one of several methods: [this video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=v_PacO-3OGQ) walks through 4 examples, and [this helper script](https:\u002F\u002Fgithub.com\u002Frobvanvolt\u002FDALLE-datasets\u002Fblob\u002Fmain\u002Fwds_create_shards.py) also supports splitting your dataset into shards of .tar.gz files.\n\n### DALL-E with OpenAI's VAE\n\nYou can now also train DALL-E without having to train the Discrete VAE at all, courtesy of OpenAI open-sourcing their model. You simply have to invoke the `train_dalle.py` script without specifying the `--vae_path`\n\n```bash\n$ python train_dalle.py --image_text_folder \u002Fpath\u002Fto\u002Fcoco\u002Fdataset\n```\n\n### DALL-E with Taming Transformer's VQVAE\n\nJust use the `--taming` flag. 
It is highly recommended that you use this VAE over the OpenAI one!\n\n```bash\n$ python train_dalle.py --image_text_folder \u002Fpath\u002Fto\u002Fcoco\u002Fdataset --taming\n```\n\n### Generation\n\nOnce you have successfully trained DALL-E, you can then use the saved model for generation!\n\n```bash\n$ python generate.py --dalle_path .\u002Fdalle.pt --text 'fireflies in a field under a full moon'\n```\n\nYou should see your images saved as `.\u002Foutputs\u002F{your prompt}\u002F{image number}.jpg`\n\nTo generate multiple images, just pass in your text with the '|' character as a separator.\n\nex.\n\n```bash\n$ python generate.py --dalle_path .\u002Fdalle.pt --text 'a dog chewing a bone|a cat chasing mice|a frog eating a fly'\n```\n\nNote that DALL-E is a full image+text language model. As a consequence, you can also generate text using a DALL-E model.\n\n```bash\n$ python generate.py --dalle_path .\u002Fdalle.pt --text 'a dog chewing a bone' --gentext\n```\n\nThis will complete the provided text, save it in a caption.txt and generate the corresponding images.\n\n### Docker\n\nYou can use a docker container to make sure the versions of PyTorch and CUDA are correct for training DALL-E. 
\u003Ca href=\"https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F\">Docker\u003C\u002Fa> and \u003Ca href='#'>Docker Container Runtime\u003C\u002Fa> should be installed.\n\nTo build:\n\n```bash\ndocker build -t dalle docker\n```\n\nTo run in an interactive shell:\n\n```bash\ndocker run --gpus all -it --mount src=\"$(pwd)\",target=\u002Fworkspace\u002Fdalle,type=bind dalle:latest bash\n```\n\n### Distributed Training\n\n#### DeepSpeed\n\nThanks to \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FjanEbert\">janEbert\u003C\u002Fa>, the repository is now equipped so you can train DALL-E with Microsoft's \u003Ca href=\"https:\u002F\u002Fwww.deepspeed.ai\u002F\">Deepspeed\u003C\u002Fa>!\n\nYou can simply replace any `$ python \u003Cfile>.py [args...]` command with\n\n```sh\n$ deepspeed \u003Cfile>.py [args...] --deepspeed\n```\n\nto use the aforementioned DeepSpeed library for distributed training, speeding up your experiments.\n\nModify the `deepspeed_config` dictionary in `train_dalle.py` or\n`train_vae.py` according to the DeepSpeed settings you'd like to use\nfor each one. See the [DeepSpeed configuration\ndocs](https:\u002F\u002Fwww.deepspeed.ai\u002Fdocs\u002Fconfig-json\u002F) for more\ninformation.\n\n#### DeepSpeed - 32 and 16 bit Precision\nAs of DeepSpeed version 0.3.16, ZeRO optimizations can be used with\nsingle-precision floating point numbers. If you are using an older\nversion, you'll have to pass the `--fp16` flag to be able to enable\nZeRO optimizations.\n\n\n#### DeepSpeed - Apex Automatic Mixed Precision.\nAutomatic mixed precision is a stable alternative to fp16 which still provides a decent speedup.\nIn order to run with Apex AMP (through DeepSpeed), you will need to install DeepSpeed using either the Dockerfile or the bash script.\n\nThen you will need to install apex from source. \nThis may take awhile and you may see some compilation warnings which can be ignored. 
\n```sh\nsh install_apex.sh\n```\n\nNow, run `train_dalle.py` with `deepspeed` instead of `python` as done here:\n```sh\ndeepspeed train_dalle.py \\\n    --taming \\\n    --image_text_folder 'DatasetsDir' \\\n    --distr_backend 'deepspeed' \\\n    --amp\n```\n\n#### Horovod\n\n[Horovod](https:\u002F\u002Fhorovod.ai) offers a stable way for data parallel\ntraining.\n\nAfter [installing\nHorovod](https:\u002F\u002Fgithub.com\u002Flucidrains\u002FDALLE-pytorch\u002Fwiki\u002FHorovod-Installation),\nreplace any `$ python \u003Cfile>.py [args...]` command with\n\n```sh\n$ horovodrun -np \u003Cnum-gpus> \u003Cfile>.py [args...] --distributed_backend horovod\n```\n\nto use the Horovod library for distributed training, speeding up your\nexperiments. This will multiply your effective batch size per training\nstep by `\u003Cnum-gpus>`, so you may need to rescale the learning rate\naccordingly.\n\n#### Custom Tokenizer\n\nThis repository supports custom tokenization with \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FVKCOM\u002FYouTokenToMe\">YouTokenToMe\u003C\u002Fa>, if you wish to use it instead of the default simple tokenizer. Simply pass in an extra `--bpe_path` when invoking `train_dalle.py` and `generate.py`, with the path to your BPE model file.\n\nThe only requirement is that you use `0` as the padding during tokenization\n\nex.\n\n```sh\n$ python train_dalle.py --image_text_folder .\u002Fpath\u002Fto\u002Fdata --bpe_path .\u002Fpath\u002Fto\u002Fbpe.model\n```\n\nTo create a BPE model file from scratch, firstly\n\n```bash\n$ pip install youtokentome\n```\n\nThen you need to prepare a big text file that is a representative sample of the type of text you want to encode. You can then invoke the `youtokentome` command-line tools. 
You'll also need to specify the vocab size you wish to use, in addition to the corpus of text.\n\n```bash\n$ yttm bpe --vocab_size 8000 --data .\u002Fpath\u002Fto\u002Fbig\u002Ftext\u002Ffile.txt --model .\u002Fpath\u002Fto\u002Fbpe.model\n```\n\nThat's it! The BPE model file is now saved to `.\u002Fpath\u002Fto\u002Fbpe.model` and you can begin training!\n\n#### Chinese\n\nYou can train with a \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fbert-base-chinese\">pretrained chinese tokenizer\u003C\u002Fa> offered by Huggingface 🤗 by simply passing in an extra flag `--chinese`\n\nex.\n\n```sh\n$ python train_dalle.py --chinese --image_text_folder .\u002Fpath\u002Fto\u002Fdata\n```\n\n```sh\n$ python generate.py --chinese --text '追老鼠的猫'\n```\n\n## Citations\n\n```bibtex\n@misc{ramesh2021zeroshot,\n    title   = {Zero-Shot Text-to-Image Generation}, \n    author  = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},\n    year    = {2021},\n    eprint  = {2102.12092},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{unpublished2021clip,\n    title  = {CLIP: Connecting Text and Images},\n    author = {Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal},\n    year   = {2021}\n}\n```\n\n```bibtex\n@misc{kitaev2020reformer,\n    title   = {Reformer: The Efficient Transformer},\n    author  = {Nikita Kitaev and Łukasz Kaiser and Anselm Levskaya},\n    year    = {2020},\n    eprint  = {2001.04451},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{esser2021taming,\n    title   = {Taming Transformers for High-Resolution Image Synthesis},\n    author  = {Patrick Esser and Robin Rombach and Björn Ommer},\n    year    = {2021},\n    eprint  = {2012.09841},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{ding2021cogview,\n    title   = {CogView: Mastering 
Text-to-Image Generation via Transformers},\n    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},\n    year    = {2021},\n    eprint  = {2105.13290},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@software{peng_bo_2021_5196578,\n    author       = {PENG Bo},\n    title        = {BlinkDL\u002FRWKV-LM: 0.01},\n    month        = {aug},\n    year         = {2021},\n    publisher    = {Zenodo},\n    version      = {0.01},\n    doi          = {10.5281\u002Fzenodo.5196578},\n    url          = {https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.5196578}\n}\n```\n\n```bibtex\n@misc{su2021roformer,\n    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},\n    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},\n    year    = {2021},\n    eprint  = {2104.09864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@inproceedings{ho2021classifierfree,\n    title   = {Classifier-Free Diffusion Guidance},\n    author  = {Jonathan Ho and Tim Salimans},\n    booktitle = {NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications},\n    year    = {2021},\n    url     = {https:\u002F\u002Fopenreview.net\u002Fforum?id=qw8AKxfYbI}\n}\n```\n\n```bibtex\n@misc{crowson2022,\n    author  = {Katherine Crowson},\n    url     = {https:\u002F\u002Ftwitter.com\u002FRiversHaveWings\u002Fstatus\u002F1478093658716966912}\n}\n```\n\n```bibtex\n@article{Liu2023BridgingDA,\n    title   = {Bridging Discrete and Backpropagation: Straight-Through and Beyond},\n    author  = {Liyuan Liu and Chengyu Dong and Xiaodong Liu and Bin Yu and Jianfeng Gao},\n    journal = {ArXiv},\n    year    = {2023},\n    volume  = {abs\u002F2304.08612}\n}\n```\n\n*Those who do not want to imitate anything, produce nothing.* - 
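Dali\n","This project is a PyTorch implementation of DALL-E that turns text into images. It is built on the Transformer architecture and uses attention mechanisms and multi-modal processing to generate high-quality images. The project provides a CLIP model for ranking the generated images and supports efficient training with Deepspeed. It suits applications that need to generate images automatically from descriptive text, such as creative design and content creation. The project also provides some pretrained models and examples of deploying an online service, so users can get started quickly.",2,"2026-06-11 03:24:43","top_topic"]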