[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9780":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},9780,"x-transformers","lucidrains\u002Fx-transformers","lucidrains","A concise but complete full-attention transformer with a set of promising experimental features from various papers","",null,"Python",5892,511,54,71,0,7,42,5,39.13,"MIT License",false,"main",true,[26,27,28,29],"artificial-intelligence","attention-mechanism","deep-learning","transformers","2026-06-12 02:02:12","## x-transformers\n\n[![PyPI version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fx-transformers.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fx-transformers)\n\nA concise but fully-featured transformer, complete with a set of promising e**x**perimental features from various papers.\n\n## Install\n\n```bash\n$ pip install x-transformers\n```\n\n## Usage\n\nFull encoder \u002F decoder\n\n```python\nimport torch\nfrom x_transformers import XTransformer\n\nmodel = XTransformer(\n    dim = 512,\n    enc_num_tokens = 256,\n    enc_depth = 6,\n    enc_heads = 8,\n    enc_max_seq_len = 1024,\n    dec_num_tokens = 256,\n    dec_depth = 6,\n    dec_heads = 8,\n    dec_max_seq_len = 1024,\n    tie_token_emb = True      # tie embeddings of encoder and decoder\n)\n\nsrc = torch.randint(0, 256, (1, 1024))\nsrc_mask = torch.ones_like(src).bool()\ntgt = torch.randint(0, 256, (1, 
1024))\n\nloss = model(src, tgt, mask = src_mask) # (1, 1024, 512)\nloss.backward()\n```\n\nDecoder-only (GPT-like)\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n).cuda()\n\nx = torch.randint(0, 256, (1, 1024)).cuda()\n\nmodel(x) # (1, 1024, 20000)\n```\n\nGPT3 would be approximately the following (but you wouldn't be able to run it anyways)\n\n```python\n\ngpt3 = TransformerWrapper(\n    num_tokens = 50000,\n    max_seq_len = 2048,\n    attn_layers = Decoder(\n        dim = 12288,\n        depth = 96,\n        heads = 96,\n        attn_dim_head = 128\n    )\n).cuda()\n```\n\nEncoder-only (BERT-like)\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n).cuda()\n\nx = torch.randint(0, 256, (1, 1024)).cuda()\nmask = torch.ones_like(x).bool()\n\nmodel(x, mask = mask) # (1, 1024, 20000)\n```\n\nState of the art image classification (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.01580\">SimpleViT\u003C\u002Fa>)\n\n```python\nimport torch\nfrom x_transformers import ViTransformerWrapper, Encoder\n\nmodel = ViTransformerWrapper(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n    )\n)\n\nimg = torch.randn(1, 3, 256, 256)\nmodel(img) # (1, 1000)\n```\n\nImage -> caption\n\n```python\nimport torch\nfrom x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder\n\nencoder = ViTransformerWrapper(\n    image_size = 256,\n    patch_size = 32,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n       
 heads = 8\n    )\n)\n\ndecoder = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        cross_attend = True\n    )\n)\n\nimg = torch.randn(1, 3, 256, 256)\ncaption = torch.randint(0, 20000, (1, 1024))\n\nencoded = encoder(img, return_embeddings = True)\ndecoder(caption, context = encoded) # (1, 1024, 20000)\n```\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.06794\">PaLI\u003C\u002Fa>, state of the art language-vision model\n\n```python\nimport torch\nfrom x_transformers import ViTransformerWrapper, XTransformer, Encoder\n\n# PaLI composes of\n# 1. vision transformer (ViTransformerWrapper) +\n# 2. encoder-decoder transformer (XTransformer)\n\nvit = ViTransformerWrapper(\n    image_size = 256,\n    patch_size = 32,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8\n    )\n)\n\npali = XTransformer(\n    dim = 512,\n    enc_num_tokens = 256,\n    enc_depth = 6,\n    enc_heads = 8,\n    enc_max_seq_len = 1024,\n    dec_num_tokens = 256,\n    dec_depth = 6,\n    dec_heads = 8,\n    dec_max_seq_len = 1024\n)\n\n# training data\n\nimg = torch.randn(1, 3, 256, 256)               # images\nprompt = torch.randint(0, 256, (1, 1024))       # prompt\nprompt_mask = torch.ones(1, 1024).bool()        # prompt text mask\noutput_text = torch.randint(0, 256, (1, 1024))  # target output text\n\n# train\n\nimg_embeds = vit(\n    img,\n    return_embeddings = True\n)\n\nloss = pali(\n    prompt,\n    output_text,\n    mask = prompt_mask,\n    src_prepend_embeds = img_embeds             # will preprend image embeddings to encoder text embeddings before attention\n)\n\nloss.backward()\n\n# do the above for many steps on a 17B parameter model\n# attention is all you need\n```\n\n## Dropouts\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    
num_tokens = 20000,\n    max_seq_len = 1024,\n    emb_dropout = 0.1,         # dropout after embedding\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        layer_dropout = 0.1,   # stochastic depth - dropout entire layer\n        attn_dropout = 0.1,    # dropout post-attention\n        ff_dropout = 0.1       # feedforward dropout\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\n## Features\n\n### Flash Attention\n\n\u003Cimg src=\".\u002Fimages\u002Fflash-attention.png\" width=\"500px\">\u003C\u002Fimg>\n\nWhat originally started off as \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05682\">a short paper\u003C\u002Fa> from Markus Rabe culminated as a practical fused attention CUDA kernel, named \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.14135\">Flash Attention\u003C\u002Fa> by \u003Ca href=\"https:\u002F\u002Ftridao.me\u002F\">Tri Dao\u003C\u002Fa>.\n\nThe technique processes the attention matrix in tiles, only keeping track of the running softmax and exponentiated weighted sums. By recomputing on the backwards pass in a tiled fashion, one is able to keep the memory linear with respect to sequence length. This allows a lot of recent models  to be able to reach for longer context lengths without worrying about the memory bottleneck.\n\nOther engineering decisions made by Tri Dao led to its enormous success, namely minimizing HBM accesses so that both the forwards and backwards outperform naive attention. 
In other words, flash attention is not only more memory efficient, but faster as well, making it a necessity for training transformers.\n\nMetaAI has recently added the ability to use \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhazyresearch\u002Fflash-attention\">Tri Dao's CUDA kernel\u003C\u002Fa> through the \u003Ca href=\"https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html\">scaled_dot_product_attention\u003C\u002Fa> function in Pytorch 2.0. (They also have a `mem_efficient` attention, which is identical to flash attention design, just that the tiles are traversed differently)\n\n\u003Ca href=\"https:\u002F\u002Fai.facebook.com\u002Fblog\u002Flarge-language-model-llama-meta-ai\u002F\">Llama\u003C\u002Fa> was trained using Flash Attention. The only reason to avoid it is if you require operating on the attention matrix (dynamic positional bias, talking heads, residual attention).\n\nYou can use it in this repository by setting `attn_flash` to `True` and enjoy the immediate memory savings and increase in speed.\n\nex.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_flash = True # just set this to True if you have pytorch 2.0 installed\n    )\n)\n```\n\n### Augmenting Self-attention with Persistent Memory\n\n\u003Cimg src=\".\u002Fimages\u002Fall-attention.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1907.01470\n\nProposes adding learned memory key \u002F values prior to attention. They were able to remove feedforwards altogether and attain similar performance to the original transformers. 
I have found that keeping the feedforwards and adding the memory key \u002F values leads to even better performance.\n\n```python\nfrom x_transformers import Decoder, Encoder\n\nenc = Encoder(\n    dim = 512,\n    depth = 6,\n    heads = 8,\n    attn_num_mem_kv = 16 # 16 memory key \u002F values\n)\n```\n\n### Memory Transformers\n\n\u003Cimg src=\".\u002Fimages\u002Fmemory-transformer.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11527\n\nProposes adding learned tokens, akin to CLS tokens, named memory tokens, that is passed through the attention layers alongside the input tokens. This setting is compatible with both encoder and decoder training.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    num_memory_tokens = 20, # 20 memory tokens\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8\n    )\n)\n```\n\nUpdate: MetaAI researchers \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16588\">have found\u003C\u002Fa> that adding memory tokens (they call them register tokens), alleviates outliers (which is suspected now to be a pathology of attention networks unable to \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12929\">attend to nothing\u003C\u002Fa>).\n\nUpdate 2: a hybrid architecture out of Nvidia named \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fforum?id=A1ztozypga\">Hymba\u003C\u002Fa> used memory tokens successfully in the autoregressive case, termed meta tokens in their paper.\n\nUpdate 3: further corroborated by \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00663\">a paper\u003C\u002Fa> trying to extend memory in attention networks, termed persistent memory\n\n### Transformers Without Tears\n\n\u003Cimg 
src=\".\u002Fimages\u002Fscalenorm.png\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1910.05895\n\nThey experiment with alternatives to Layer normalization and found one that is both effective and simpler. Researchers have shared with me this leads to faster convergence.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        use_scalenorm = True # set to True to use for all layers\n    )\n)\n```\n\nYou can also use the l2 normalized embeddings proposed as part of `fixnorm`. I have found it leads to improved convergence, when paired with small initialization (proposed by \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FBlinkDL\">BlinkDL\u003C\u002Fa>). The small initialization will be taken care of as long as `l2norm_embed` is set to `True`\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    l2norm_embed = True,    # set this to True for l2 normalized embedding + small init\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8\n    )\n)\n```\n\nAlong the same lines of l2 normalized embeddings, Huggingface's \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fbigscience\u002Fbloom\">175B parameter BLOOM\u003C\u002Fa> also places a layernorm right after the embeddings and just before the tokens enter the attention layers. 
This was corroborated by Yandex's \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyandex\u002FYaLM-100B\">100B parameter YaLM\u003C\u002Fa> to stabilize training.\n\nIt is recommended that you set either `l2norm_embed` or `post_emb_norm` to `True`, but not both, as they probably serve the same purpose.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    post_emb_norm = True,    # set this to True to layernorm summed token + pos embeddings\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8\n    )\n)\n```\n\n### Root Mean Square Layer Normalization\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1910.07467\n\nThe authors propose to replace layer normalization with a simpler alternative, without mean centering and the learned bias. An investigative paper found this to be the \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11972\">best performing normalization variant\u003C\u002Fa>. It was also used in Deepmind's latest large language models, \u003Ca href=\"https:\u002F\u002Fdeepmind.com\u002Fresearch\u002Fpublications\u002F2021\u002Fimproving-language-models-by-retrieving-from-trillions-of-tokens\">Retro\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446\">Gopher\u003C\u002Fa>.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        use_rmsnorm = True # set to true to use for all layers\n    )\n)\n```\n\n*July 2023* \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14995\">A linear attention paper\u003C\u002Fa> has experiments showing that removing the learned multiplicative gamma led to no performance degradation. 
This simplifies the RMS normalization to a satisfying `l2norm(x) * sqrt(dim)`.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        use_simple_rmsnorm = True # set to true to use for all layers\n    )\n)\n```\n\n### GLU Variants Improve Transformer\n\n\u003Cimg src=\".\u002Fimages\u002Fffglu.png\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05202\n\nNoam Shazeer paper that explores gating in the feedforward, finding that simple gating with GELU leads to significant improvements. This variant also showed up in the latest mT5 architecture. You should always turn this on (I may eventually turn it on by default).\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        ff_glu = True # set to true to use for all feedforwards\n    )\n)\n```\n\nThe \u003Ca href=\"https:\u002F\u002Fai.googleblog.com\u002F2022\u002F04\u002Fpathways-language-model-palm-scaling-to.html\">PaLM\u003C\u002Fa> language model also chose to use the Swish GLU variant. 
You can turn this on by setting two flags\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        ff_swish = True, # set this to True\n        ff_glu = True    # set to true to use for all feedforwards\n    )\n)\n```\n\n### No Bias in Feedforward\n\nStarting with \u003Ca href=\"https:\u002F\u002Fai.googleblog.com\u002F2022\u002F04\u002Fpathways-language-model-palm-scaling-to.html\">PaLM\u003C\u002Fa>, there began a trend of removing biases from the transformer altogether. \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fborisdayma\">Boris Dayma\u003C\u002Fa> has run a number of experiments that showed removing biases from feedforwards led to increased throughput without any loss of accuracy. This was corroborated by \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.14034\">yet another paper\u003C\u002Fa> investigating transformer architecture variants.\n\nYou can turn off the feedforward bias as follows\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        ff_no_bias = True  # set this to True\n    )\n)\n```\n\n### ReLU²\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2109.08668\n\nThis paper used neural architecture search and found an activation, ReLU squared, that is both simpler and performs better than GELU in the autoregressive language model setting. I have confirmed this in my independent experiments. However, if one were using the GLU variant from above, GELU still performs better. 
Pending further corroboration.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        ff_relu_squared = True\n    )\n)\n```\n\n### Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection\n\n\u003Cimg src=\".\u002Fimages\u002Ftopk-attention.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1912.11637\n\nThis paper proposes an efficient way to sparsify attention by zeroing all dot-product query\u002Fkey values not within the top k values. They show that this cheap method was as effective as other more expensive operations like sparsemax or entmax15. This technique comes with the cost of an extra hyperparameter (the top k values to keep). The paper recommends a value of `k = 8`\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_sparse_topk = 8,                       # keep only the top 8 values before attention (softmax)\n        attn_sparse_topk_straight_through = True    # straight through the original gradients\n    )\n)\n```\n\nFor the extreme case of a `topk` value of `1`, you can use the following\n\n```python\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_hard = True  # will only propagate the single value of the argmax of qk logit. 
offered in the case it addresses https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01104\n    )\n)\n```\n\n### Talking-Heads Attention\n\n\u003Cimg src=\".\u002Fimages\u002Ftalking-heads.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2003.02436\n\nA Noam Shazeer paper that proposes mixing information between heads pre and post attention (softmax). This comes with the cost of extra memory and compute.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_pre_talking_heads = True,  # linear combination across pre-softmax attn logits across heads\n        attn_post_talking_heads = True  # linear combination across post-softmax attn across heads\n    )\n)\n```\n\n### One Write-Head Is All You Need\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1911.02150\n\nYet another Noam Shazeer paper (he's a legend) that proposes to only have one head for the key \u002F values, but multi-headed queries. This paper was largely ignored for a while, but recently validated at scale in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.07814\">AlphaCode\u003C\u002Fa> as well as \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.02311\">PaLM\u003C\u002Fa>. It has the property of being memory efficient when decoding extremely large language models. 
You can use it with one keyword argument as shown below.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_one_kv_head = True\n    )\n)\n```\n\nThis has been further generalized in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13245\">a recent paper\u003C\u002Fa> to allow for groups of query heads to attend to a single key \u002F value head. You can use this by specifying the `attn_kv_heads`\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8,\n        attn_kv_heads = 2 # say you want 4 query heads to attend to 1 key \u002F value head\n    )\n)\n```\n\n### Attention on Attention for Image Captioning\n\n\u003Cimg src=\".\u002Fimages\u002Fattention-on-attention.png\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1908.06954\n\nThis paper proposes to add a gated linear unit at the end of the attention layer, further gated by the original queries. Although this is not widely used outside of visual question \u002F answering, I suspect it should lead to improvements after seeing the success of the feedforward GLU variant.\n\nUpdate: After some experimentation, I found this variant actually performs worse, but if it were to be modified to not concatenate the queries before gating, it performs much better. 
That is what we will be using in this repository.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_on_attn = True  # gate output of attention layer, by queries\n    )\n)\n```\n\n### Intra-attention Gating on Values\n\n\u003Cimg src=\".\u002Fimages\u002Fgate_values.png\" width=\"400px\">\u003C\u002Fimg>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Falphafold\">Alphafold2\u003C\u002Fa> had a peculiar variant of attention where they gate the aggregated values with the input, presumably to have the block have more control over the update.\n\nA quick test shows a small but noticeable improvement, on about the same order as attention on attention.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_gate_values = True  # gate aggregated values with the input\n    )\n)\n```\n\n### Improving Transformer Models by Reordering their Sublayers\n\n\u003Cimg src=\".\u002Fimages\u002Fsandwich.png\">\u003C\u002Fimg>\n\n\u003Cimg src=\".\u002Fimages\u002Fsandwich-2.png\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1911.03864\n\nThis paper proposes to break from the normal fixed pattern of alternating attention and feedforwards, but to have blocks of only attention at the beginning followed by blocks of feedforwards at the end. 
This was further corroborated by a paper by Nvidia that reduces the number of attention layers to be 1\u002F3rd of the feedforwards without loss in performance.\n\nThe amount of interleaving is controlled by a \"sandwich coefficient\", which they found to be optimal at a value of `6`.\n\nYou can experiment with this feature as shown below\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        sandwich_coef = 6  # interleave attention and feedforwards with sandwich coefficient of 6\n    )\n)\n```\n\n### Weight-tied Layers\n\nIn the early days of the cambrian explosion of BERT, a paper explored weight tying all the layers, the model named \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.11942\">ALBERT\u003C\u002Fa>. You can use it by setting `weight_tie_layers = True`\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 12,\n        weight_tie_layers = True   # set this to True to weight tie all the layers\n    )\n)\n```\n\nIf you wish to do something more sophisticated, say 3 layers, with each layer recurrent 4 times before onto the next (similar to \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15071\">this paper\u003C\u002Fa>), that is possible as well. 
Be aware that `layers_execute_order` is 0-indexed\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        custom_layers = (\n            'a', 'f',        # 3 sets of attention and feedforward\n            'a', 'f',\n            'a', 'f'\n        ),\n        layers_execute_order = (\n            *((0, 1) * 4),   # each done 4 times before sequentially passed forward, but you can probably imagine some more interesting configurations...\n            *((2, 3) * 4),\n            *((4, 5) * 4),\n        )\n    )\n)\n```\n\n### Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View\n\n\u003Cimg src=\".\u002Fimages\u002Fmacaron-1.png\">\u003C\u002Fimg>\n\n\u003Cimg src=\".\u002Fimages\u002Fmacaron-2.png\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02762\n\nThe authors propose to view the success of transformers from a dynamical systems point of view, and then propose an improvement based on the mathematics of that point of view. Specifically, they propose to place the attention layer in between two feedforward layers. This was adopted by a paper using transformers for speech recognition, the \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.08100\">Conformer\u003C\u002Fa>.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        macaron = True  # use macaron configuration\n    )\n)\n```\n\n### T5's Simplified Relative Positional Encoding\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683\n\nT5 is one of the most successful encoder \u002F decoder transformer architectures trained to date. 
They invented a new simplified relative positional encoding based on learned bias values that are added to the attention matrix pre-softmax. This bias is shared and injected into each attention layer. I have decided to include this because it offers a cheap way to have relative positional encoding (superior to absolute positional), and I have read papers that suggest having positional encoding added to each layer (vs only before the first) is beneficial.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        rel_pos_bias = True  # adds relative positional bias to all attention layers, a la T5\n    )\n)\n```\n\n### Residual Attention\n\n\u003Cimg src=\".\u002Fimages\u002Fresidual_attn.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2012.11747\n\nThis paper from Google proposes residualizing the pre-attention scores across all layers. At the cost of no extra parameters, they show improvement on top of regular attention networks. If you turn on this setting, be aware that the best results in the paper used post-normalization, in which case a learning warmup will be needed. The authors also reported that they could use a higher learning rate and get even better gains in the same amount of steps. 
(In the paper they use `2e-4` vs `1e-4` for vanilla transformer)\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Encoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Encoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        pre_norm = False,       # in the paper, residual attention had best results with post-layernorm\n        residual_attn = True    # add residual attention\n    )\n)\n```\n\nI also tried residualizing cross attention and may have noticed an improvement in convergence. You can try it by setting the `cross_residual_attn` keyword to `True`\n\n```python\nimport torch\nfrom x_transformers import XTransformer\n\nmodel = XTransformer(\n    dim = 512,\n    enc_num_tokens = 256,\n    enc_depth = 6,\n    enc_heads = 8,\n    enc_max_seq_len = 1024,\n    dec_num_tokens = 256,\n    dec_depth = 6,\n    dec_heads = 8,\n    dec_max_seq_len = 1024,\n    dec_cross_residual_attn = True     # residualize cross attention\n)\n```\n\n### Transformer-XL recurrence\n\nYou can also do Transformer-XL recurrence, by simply passing in a `max_mem_len` in the `TransformerWrapper` class, and then making sure your `Decoder` has `rel_pos_bias` (or `rotary_pos_emb`) set to `True`.\n\nThen, you can retrieve the memories at each step with the `return_mems` keyword and pass it to the next iteration.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel_xl = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 512,\n    max_mem_len = 2048,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        rel_pos_bias = True\n    )\n)\n\nseg1 = torch.randint(0, 20000, (1, 512))\nseg2 = torch.randint(0, 20000, (1, 512))\nseg3 = torch.randint(0, 20000, (1, 512))\n\nlogits1, mems1  = model_xl(seg1, return_mems = True)\nlogits2, mems2  = model_xl(seg2, mems = mems1, return_mems = True)\nlogits3, mems3  = 
model_xl(seg3, mems = mems2, return_mems = True)\n```\n\nSetting up the logic for training and sampling from transformer xl can be a bit overwhelming. This repository offers a simple wrapper that should make this easy, with the `XLAutoregressiveWrapper`.\n\n```python\n# pass in the above model_xl\n\nxl_wrapper = XLAutoregressiveWrapper(model_xl)\n\nseg = torch.randint(0, 20000, (1, 4096)).cuda()  # sequence exceeding max length, automatically segmented and memory managed\n\nloss = xl_wrapper(seg)\nloss.backward()\n\n# then, after much training\n\nprime = seg[:, :1024]   # if prime exceeds max length, memory will be caught up before generating\n\ngenerated = xl_wrapper.generate(prime, 4096)  # (1, 4096)\n```\n\n### Enhanced recurrence\n\n\u003Cimg src=\".\u002Fimages\u002Fenhanced-recurrence.png\" width=\"400px\"\u002F>\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.15688\">This paper\u003C\u002Fa> proposes a simple technique to enhance the range of Transformer-XL. They simply route the memory segment of a layer to the layer below it, for the next recurrent step. You can enable this by setting `shift_mem_down = 1`. 
You can also shift down by an arbitrary number of layers by setting this value to `> 1`.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel_xl = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 512,\n    max_mem_len = 2048,\n    shift_mem_down = 1,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        rotary_pos_emb = True,\n        rotate_num_heads = 4   # only rotate 4 out of the 8 attention heads\n    )\n)\n\nseg1 = torch.randint(0, 20000, (1, 512))\nseg2 = torch.randint(0, 20000, (1, 512))\nseg3 = torch.randint(0, 20000, (1, 512))\n\nlogits1, mems1  = model_xl(seg1, return_mems = True)\nlogits2, mems2  = model_xl(seg2, mems = mems1, return_mems = True) # mems1 of layer N are automatically routed to the layer N-1\n```\n\n### Gated residual\n\n\u003Cimg src=\".\u002Fimages\u002Fgating.png\" width=\"500px\">\u003C\u002Fimg>\n\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F1910.06764\n\nThe authors propose gating the residual connections in the transformer network and demonstrate increased stability and performance for Transformer-XL in a variety of reinforcement learning tasks.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    max_mem_len = 2048,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 16,\n        gate_residual = True\n    )\n)\n```\n\n### Rotary Positional Embeddings\n\n\u003Cimg src=\".\u002Fimages\u002Frotary.png\" width=\"500px\">\u003C\u002Fimg>\n\nDeveloped in Beijing, this new technique quickly gained interest in NLP circles. In short, it allows you to endow the transformer with relative positional embeddings at the cost of no learned parameters. You apply a rotary operation to the queries and keys prior to their dot product in attention. 
The big idea is injecting positions through rotations.\n\nHighly recommend that you have this turned on whenever you are working on an ordered sequence.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        rotary_pos_emb = True  # turns on rotary positional embeddings\n    )\n)\n```\n\nUpdate (12\u002F2022): Rotary embedding has since been hugely successful, widely adopted in many large language models, including the largest in the world, PaLM. However, it has been uncovered in the ALiBi paper that rotary embeddings cannot length extrapolate well. This was recently addressed in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10554v1\">a Microsoft research paper\u003C\u002Fa>. They propose a way to unobtrusively add the same decay as in ALiBi, and found that this resolves the extrapolation problem. You can use it in this repository by setting `rotary_xpos = True`. Like ALiBi, it would enforce the attention to be local. You can set the receptive field with `rotary_xpos_scale_base` value, which defaults to `512`\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        rotary_xpos = True   # modified rotary to extrapolate well beyond length at which it was trained\n    )\n)\n```\n\n### Dynamic Positional Bias\n\n\u003Cimg src=\".\u002Fimages\u002Fdynamic-pos-bias.png\" width=\"150px\">\u003C\u002Fimg>\n\nThis technique bears roots from the field of vision transformers, where researchers are trying to have relative positions generalize to larger resolutions (without having to retrain the entire network). 
It was used in two recent papers, \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.00154\">CrossFormer\u003C\u002Fa>, as well as \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.09883\">SwinV2\u003C\u002Fa>.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcfoster0\">Charles Foster\u003C\u002Fa> first tried this for a language model, and found that it works. Later on \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbob80333\">Eric Engelhart\u003C\u002Fa> produced experimental results that show the same type of extrapolation holds, even for 1d sequences.\n\nEric trained at sequence lengths of 128, and showed that it generalized well to 1024. In addition, he showed that linear positions was better than log (used in SwinV2), for language.\n\nLinear distances\n\n\u003Cimg src=\".\u002Fimages\u002Fdynamic-pos-bias-linear.png\" width=\"600px\">\u003C\u002Fimg>\n\nLog distances\n\n\u003Cimg src=\".\u002Fimages\u002Fdynamic-pos-bias-log.png\" width=\"600px\">\u003C\u002Fimg>\n\nNegative control - Sinusoidal\n\n\u003Cimg src=\".\u002Fimages\u002Fdynamic-pos-bias-sinusoidal.png\" width=\"600px\">\u003C\u002Fimg>\n\nMore of Eric's experimental results can be found \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbob80333\u002Finvestigating_extrapolation\">here\u003C\u002Fa>\n\nYou can use this type of relative position if you wish to train at smaller sequence lengths and have it generalize to longer ones, for both autoregressive and bidirectional models.\n\nUpdate: \u003Ca href=\"https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fstanford-ribonanza-rna-folding\u002Fdiscussion\u002F460121\">First place RNA folding using dynamic positional bias\u003C\u002Fa>\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 256,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        dynamic_pos_bias = True,         
       # set this to True\n        dynamic_pos_bias_log_distance = False   # whether to use log distance, as in SwinV2\n    )\n)\n```\n\n### ALiBi Positional Embedding\n\n\u003Ca href=\"https:\u002F\u002Fofir.io\u002Ftrain_short_test_long.pdf\">This paper\u003C\u002Fa> proposes to simply apply a static linear bias to the attention matrix. The authors show this is not only effective as a relative positional encoding, but also allows the attention net to extrapolate to sequence lengths greater than those it was trained on, for autoregressive language models.\n\nThis repository also offers a bidirectional variant (nonsymmetric), proposed by the authors \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fofirpress\u002Fattention_with_linear_biases\u002Fissues\u002F5\">here\u003C\u002Fa>. However, this is untested. If you need bidirectional length extrapolation, the safest option would be Dynamic Positional Bias.\n\nUpdate: It may be that ALiBi enforces a strong local attention across the heads, and may hinder the network from attending at distances greater than 1k. 
To avoid any issues with global message passing, I've decided to introduce another hyperparameter `alibi_num_heads`, so one can specify fewer heads for the ALiBi bias\n\nUpdate: There are reports that ALiBi outperforms Rotary embeddings for pretraining and downstream fine-tuning.\n\nUpdate: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19466\">New paper\u003C\u002Fa> shows that no positional embedding can length extrapolate even better than explicit ones\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        alibi_pos_bias = True, # turns on ALiBi positional embedding\n        alibi_num_heads = 4    # only use ALiBi for 4 out of the 8 heads, so other 4 heads can still attend far distances\n    )\n)\n```\n\n### Shifted Tokens\n\nAn \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FBlinkDL\">independent researcher\u003C\u002Fa> has found that shifting a subset of the feature dimension along the sequence dimension by 1 token helps with convergence (\u003Ca href=\"https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F191393788\">Time-mixing\u003C\u002Fa>). I have tested this for the autoregressive case and can confirm that it leads to greatly improved convergence. This also lines up with \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.07477\">the results\u003C\u002Fa> of some papers in the vision domain.\n\nTo use it, simply set `shift_tokens = 1` (or to whatever number of shifts you desire). The feature dimension will be divided into `shift_tokens + 1` chunks, and the chunks will be shifted by `[0, shift_tokens]` positions respectively\n\nUpdate: new experiments by @sdtblck suggest this may only work for character-level training\n\nUpdate: after more experiments, it seems that in the context of BPE encoding, with rotary turned on, there is no benefit to shifting. 
For character-level training, shifting may still improve results a tiny bit\n\nUpdate: When doing BPE encoded tokens, it seems that shift of 2 will bottleneck the dimensions (divided by 5). It is recommended you always do a shift of 1, unless you are working at the character level.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        shift_tokens = 1\n    )\n)\n```\n\nIf you want finer control over how much is shifted per block (whether attention or feedforward), simply pass in a tuple with one entry per block. Attention and feedforward each count as a block, so `depth = 6` needs 12 entries.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        shift_tokens = (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0) # 12 blocks, attention and feedforward alternating, with progressively less shifting\n    )\n)\n```\n\n### Sandwich Norm\n\n\u003Cimg src=\".\u002Fimages\u002Fsandwich_norm.png\" width=\"400px\"\u002F>\n\nThis technique first made an appearance in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.13290\">the CogView paper\u003C\u002Fa>, a Chinese version of the famous text-to-image transformer DALL-E. They propose, when using pre-layernorm, to add an extra layernorm to all the branch outputs. 
I have found this to be very effective for a number of projects, when facing instability during training.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        sandwich_norm = True # set this to True\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\n### ResiDual\n\n\u003Cimg src=\".\u002Fimages\u002Fresi_dual.png\" width=\"400px\"\u002F>\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14802\">This Microsoft paper\u003C\u002Fa> proposes yet another normalization configuration, combining both pre and post layernorm. They claim this hybridization reduces representation collapse (known to be an issue with pre-layernorm with increasing depth), while maintaining stability and reducing vanishing gradients (issues with post-layernorm). Initial experiments on my end show it to work no worse than pre-layernorm or sandwich norm. More study needed by the public to see if this is actually a winning technique.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        resi_dual = True,               # set this to True\n        resi_dual_scale = 0.1           # in appendix, they said on fp16 the prenorm residual is prone to overflow. 
they claim by scaling it at each layer by a factor, it would prevent the overflow, and keep results the same (as layernorms are invariant to scaling of the input)\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\n### Normformer\n\n\u003Cimg src=\".\u002Fimages\u002Fnormformer.png\" width=\"400px\"\u002F>\n\nThis \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fforum?id=GMYWzWztDx5\">paper\u003C\u002Fa> uncovers an issue with pre-norm transformers where gradients are mismatched between the early and later layers. They propose 4 changes, of which I will be offering 3.\n\nThe first change is to offer per head scaling after aggregating the values in attention. My experiments show a slight improvement in convergence.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_head_scale = True  # set this to True\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\nThe second change is an extra layernorm right after the activation in the feedforward. I have also verified a slight improvement, at the cost of extra compute.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        ff_post_act_ln = True # set this to True\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\nFor the residual scaling, you simply have to set `scale_residual = True`. 
I have noticed slight improvements, but occasional instability as well, so use with caution.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        scale_residual = True # set this to True\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\nThe last change is a layernorm right after the output projection in attention. This is actually identical to the sandwich norm proposed by the CogView paper, so you can use this by simply setting `sandwich_norm = True`, although it would also add it to the feedforward layer.\n\n### Cosine Sim Attention\n\n\u003Cimg src=\".\u002Fimages\u002Fcosine-sim-attention.png\" width=\"400px\">\u003C\u002Fimg>\n\nThis \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.04245\">paper\u003C\u002Fa> proposes to l2 normalize the queries and keys along the head dimension before the dot product (cosine similarity), with the additional change of the scale being learned rather than static. The normalization prevents the attention operation from overflowing, and removes any need for numerical stability measures prior to softmax. Both are perennial problems when training transformers.\n\nThis was validated at scale recently by the training of \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.09883\">a 3B parameter vision transformer\u003C\u002Fa>. 
The SwinV2 paper also proposes to change the pre-layernorm to a post-layernorm for further stability.\n\nI have validated that this works just as well as dot product attention in an autoregressive setting, if one were to initialize the temperature as proposed in the QK-norm paper (as a function of the sequence length).\n\nThis flavor of attention also has \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.05498\">a connection\u003C\u002Fa> to sparse distributed memory. \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=THIIk7LR9_8\">[youtube talk]\u003C\u002Fa>\n\nUpdate: I have discovered a way to remove the learned temperature altogether, by grouping the feature dimension and doing l2-normalization on each group. This allows the queries and keys to have a similarity that is upper bounded by the number of groups. A group size of 8 or 16 was sufficient in my tests. Decided to name this technique \"Grouped QK Normalization\". The drawback is that I believe an attention head dimension 32 is too small to use this tactic (a dimension often used in vision)\n\nUpdate 2: Tero Karras has successfully used cosine sim attention in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02696\">a new paper\u003C\u002Fa>.\n\nYou can use it as follows\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_qk_norm = True,       # set this to True\n        attn_qk_norm_groups = 8    # number of groups in the feature dimension for l2norm, similarity scores will be bounded between [-group, group]. 
determines how sharp the attention can be\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\nAnother update: Simply scaling the cosine similarity (group of 1) with a fixed constant (10) may work too\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n        attn_qk_norm = True,       # set to True\n        attn_qk_norm_scale = 10    # new scale on the similarity, with groups of 1\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\n### QK RMSNorm\n\n\u003Cimg src=\".\u002Fimages\u002Fqknorm-analysis.png\" width=\"450px\">\u003C\u002Fimg>\n\nUpdate: Google Brain has proven out something similar to cosine sim attention in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.05442\">a 22B parameter model\u003C\u002Fa>. In their papers, they have analysis showing that the normalization resulted in not only extra stability, but also better results in the end (due to less need to adjust learning rate when increasing parameter count).\n\nWe are nearing the point of wiping out a source of transformer training instability with one simple intervention, in my opinion. The only slight difference in the paper is that they still have a learned scale across the feature dimension (per use of rmsnorm). Not sure how critical this is, but just to make sure we don't miss anything, I will include this here. 
You can use this by setting `attn_qk_norm_dim_scale = True`\n\nUpdate: \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002FTim_Dettmers\u002Fstatus\u002F1625531080513306627\">Counterpoint from Tim Dettmers\u003C\u002Fa>\n\nUpdate 2: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19268\">Counter\u003C\u002Fa> to Tim's assertion that outliers are needed, and potentially even \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12929\">some solutions\u003C\u002Fa>\n\nUpdate 3: Used by \u003Ca href=\"https:\u002F\u002Fwww.adept.ai\u002Fblog\u002Fpersimmon-8b\">8B parameter LLM\u003C\u002Fa> successfully\n\nUpdate 4: a MetaAI group found that they can \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16588\">alleviate outliers\u003C\u002Fa> by adding `register tokens`, also known as `memory tokens` from earlier literature (Burtsev et al). Perhaps what should be tried next is to see whether qk norm can be improved in the presence of memory tokens.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8,\n        attn_qk_norm = True,\n        attn_qk_norm_dim_scale = True # set this to True, in addition to `attn_qk_norm = True`\n    )\n)\n\nx = torch.randint(0, 256, (1, 1024))\nmodel(x)\n```\n\n### Turning off absolute positional embedding\n\nA number of papers have hinted that causal transformers (`Decoder`) can learn absolute positions in the absence of added embeddings of any sort. This was recently thoroughly investigated \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.16634\">here\u003C\u002Fa>. 
You can turn off the absolute positional embedding by setting `use_abs_pos_emb = False` in the `TransformerWrapper`\n\nGiven \u003Ca href=\"https:\u002F\u002Fai.googleblog.com\u002F2022\u002F04\u002Fpathways-language-model-palm-scaling-to.html\">PaLM\u003C\u002Fa>, the trend going forward may be to forgo absolute positional embedding (again, for causal transformers only), and add relative positional embeddings with RoPE, ALiBi, etc.\n\nUpdate: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19466\">This paper\u003C\u002Fa> shows that in the absence of any engineered absolute or relative positional embeddings, decoders can generate implicit positions, and even length generalize better than solutions of the past. They were unaware of dynamic positional bias, however.\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    use_abs_pos_emb = False,   # set this to False\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 6,\n        heads = 8,\n    )\n)\n\nx = torch.randint(0, 20000, (1, 1024))\nmodel(x)\n```\n\n### Forgetful Causal Mask\n\n\u003Cimg src=\".\u002Fimages\u002Ffcm.png\" width=\"450px\">\u003C\u002Fimg>\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13432\">This paper\u003C\u002Fa> shows convincing results that one can combine masking (from masked language modeling) with autoregressive training, leading to significantly better results.\n\nYou can use this by setting the `mask_prob` on the `AutoregressiveWrapper` class\n\n\n```python\nimport torch\nfrom x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper\n\nmodel = TransformerWrapper(\n    num_tokens = 20000,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n)\n\nmodel = AutoregressiveWrapper(\n    model,\n    mask_prob = 0.15  # in paper, they use 15%, same as 
BERT\n).cuda()\n\n# mock data\n\nx = torch.randint(0, 20000, (1, 1024)).cuda()\n\n# derive cross entropy loss, masking all taken care of\n\nloss = model(x)\nloss.backward()\n```\n\n\n## Miscellaneous\n\n### Cross Attention\n\n```python\nimport torch\nfrom x_transformers import Encoder, CrossAttender\n\nenc = Encoder(dim = 512, depth = 6)\nmodel = CrossAttender(dim = 512, depth = 6)\n\nnodes = torch.randn(1, 1, 512)\nnode_masks = torch.ones(1, 1).bool()\n\nneighbors = torch.randn(1, 5, 512)\nneighbor_masks = torch.ones(1, 5).bool()\n\nencoded_neighbors = enc(neighbors, mask = neighbor_masks)\nmodel(nodes, context = encoded_neighbors, mask = node_masks, context_mask = neighbor_masks) # (1, 1, 512)\n\n```\n\n### Continuous Embeddings\n\n```python\nimport torch\nfrom x_transformers import ContinuousTransformerWrapper, Decoder\n\nmodel = ContinuousTransformerWrapper(\n    dim_in = 32,\n    dim_out = 100,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n)\n\nx = torch.randn((1, 1024, 32))\nmask = torch.ones(1, 1024).bool()\n\nmodel(x, mask = mask) # (1, 1024, 100)\n```\n\nYou can also train a transformer that accepts continuous values autoregressively easily, in the same scheme as done successfully in \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05329\">this paper\u003C\u002Fa>\n\n```python\nimport torch\nfrom x_transformers import ContinuousTransformerWrapper, Decoder\nfrom x_transformers import ContinuousAutoregressiveWrapper\n\nmodel = ContinuousTransformerWrapper(\n    dim_in = 777,\n    dim_out = 777,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n)\n\n# wrap it with the continuous autoregressive wrapper\n\nmodel = ContinuousAutoregressiveWrapper(model)\n\n# mock data\n\nx = torch.randn((1, 1024, 777))\nmask = torch.ones(1, 1024).bool()\n\n# train on a lot of data above\n\nloss = model(x, mask = 
mask)\nloss.backward()\n\n# then generate\n\nstart_emb = torch.randn(1, 777)\ngenerated = model.generate(start_emb, 17) # (17, 777)\n```\n\n### xVal - Continuous and Discrete\n\n\u003Cimg src=\".\u002Fimages\u002Fxval.png\" width=\"400px\">\u003C\u002Fimg>\n\nThis is promising work that resulted from a collaboration across many institutes (collectively known as Polymathic AI). They found that by offering a continuously scaled number token to the transformer, the transformer was able to generalize on arithmetic and forecasting tasks better than the alternative encoding schemes.\n\nThis is corroborated by some [prior work](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Ftab-transformer-pytorch#ft-transformer)\n\n```python\nimport torch\n\nfrom x_transformers import (\n    Decoder,\n    XValTransformerWrapper,\n    XValAutoregressiveWrapper\n)\n\nmodel = XValTransformerWrapper(\n    num_tokens = 4,\n    numerical_token_id = 3,\n    max_seq_len = 1024,\n    attn_layers = Decoder(\n        dim = 512,\n        depth = 12,\n        heads = 8\n    )\n)\n\n# wrap it with the xval autoregressive wrapper\n\nmodel = XValAutoregressiveWrapper(model)\n\n# mock data\n\nids = torch.randint(0, 4, (1, 777))\nnums = torch.randn(1, 777)\n\n# train on a lot of data above\n\nloss = model(ids, nums)\nloss.backward()\n\n# then generate\n\nstart_ids = torch.randint(0, 4, (1, 1))\nstart_nums = torch.randn(1, 1)\n\nids_out, num_out, is_number_mask = model.generate(start_ids, start_nums, 17)\n\n# (1, 17), (1, 17), (1, 17)\n\n# discrete, continuous, mask for discrete \u002F continuous\n```\n\n## Citations\n\n```bibtex\n@misc{vaswani2017attention,\n    title   = {Attention Is All You Need},\n    author  = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. 
Gomez and Lukasz Kaiser and Illia Polosukhin},\n    year    = {2017},\n    eprint  = {1706.03762},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@article{DBLP:journals\u002Fcorr\u002Fabs-1907-01470,\n    author    = {Sainbayar Sukhbaatar and Edouard Grave and Guillaume Lample and Herv{\\'{e}} J{\\'{e}}gou and Armand Joulin},\n    title     = {Augmenting Self-attention with Persistent Memory},\n    journal   = {CoRR},\n    volume    = {abs\u002F1907.01470},\n    year      = {2019},\n    url       = {http:\u002F\u002Farxiv.org\u002Fabs\u002F1907.01470}\n}\n```\n\n```bibtex\n@article{1910.05895,\n    author  = {Toan Q. Nguyen and Julian Salazar},\n    title   = {Transformers without Tears: Improving the Normalization of Self-Attention},\n    year    = {2019},\n    eprint  = {arXiv:1910.05895},\n    doi     = {10.5281\u002Fzenodo.3525484},\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05202}\n}\n```\n\n```bibtex\n@inproceedings{Zoph2022STMoEDS,\n    title   = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},\n    author  = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},\n    year    = {2022}\n}\n```\n\n```bibtex\n@misc{bhojanapalli2020lowrank,\n    title   = {Low-Rank Bottleneck in Multi-head Attention Models},\n    author  = {Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar},\n    year    = {2020},\n    eprint  = {2002.07028}\n}\n```\n\n```bibtex\n@misc{burtsev2020memory,\n    title   = {Memory Transformer},\n    author  = {Mikhail S. Burtsev and Grigory V. 
Sapunov},\n    year    = {2020},\n    eprint  = {2006.11527},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{zhao2019explicit,\n    title   = {Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection},\n    author  = {Guangxiang Zhao and Junyang Lin and Zhiyuan Zhang and Xuancheng Ren and Qi Su and Xu Sun},\n    year    = {2019},\n    eprint  = {1912.11637},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{correia2019adaptively,\n    title   = {Adaptively Sparse Transformers},\n    author  = {Gonçalo M. Correia and Vlad Niculae and André F. T. Martins},\n    year    = {2019},\n    eprint  = {1909.00015},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{shazeer2020talkingheads,\n    title   = {Talking-Heads Attention},\n    author  = {Noam Shazeer and Zhenzhong Lan and Youlong Cheng and Nan Ding and Le Hou},\n    year    = {2020},\n    eprint  = {2003.02436},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{press2020improving,\n    title   = {Improving Transformer Models by Reordering their Sublayers},\n    author  = {Ofir Press and Noah A. 
Smith and Omer Levy},\n    year    = {2020},\n    eprint  = {1911.03864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{lu2019understanding,\n    title   = {Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View},\n    author  = {Yiping Lu and Zhuohan Li and Di He and Zhiqing Sun and Bin Dong and Tao Qin and Liwei Wang and Tie-Yan Liu},\n    year    = {2019},\n    eprint  = {1906.02762},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{ke2020rethinking,\n    title     = {Rethinking Positional Encoding in Language Pre-training},\n    author    = {Guolin Ke and Di He and Tie-Yan Liu},\n    year      = {2020},\n    eprint    = {2006.15595},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{dosovitskiy2020image,\n    title   = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},\n    author  = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},\n    year    = {2020},\n    eprint  = {2010.11929},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{huang2019attention,\n    title   = {Attention on Attention for Image Captioning},\n    author  = {Lun Huang and Wenmin Wang and Jie Chen and Xiao-Yong Wei},\n    year    = {2019},\n    eprint  = {1908.06954},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{raffel2020exploring,\n    title   = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},\n    author  = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. 
Liu},\n    year    = {2020},\n    eprint  = {1910.10683},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@inproceedings{martins-etal-2020-sparse,\n    title   = \"Sparse Text Generation\",\n    author  = \"Martins, Pedro Henrique  and\n        Marinho, Zita  and\n        Martins, Andr{\\'e} F. T.\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)\",\n    month   = nov,\n    year    = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url     = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.emnlp-main.348\"\n}\n```\n\n```bibtex\n@misc{he2020realformer,\n    title   = {RealFormer: Transformer Likes Residual Attention},\n    author  = {Ruining He and Anirudh Ravula and Bhargav Kanagal and Joshua Ainslie},\n    year    = {2020},\n    eprint  = {2012.11747},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{carion2020endtoend,\n    title   = {End-to-End Object Detection with Transformers},\n    author  = {Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},\n    year    = {2020},\n    eprint  = {2005.12872},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{press2021ALiBi,\n    title   = {Train Short, Test Long: Attention with Linear Biases Enable Input Length Extrapolation},\n    author  = {Ofir Press and Noah A. Smith and Mike Lewis},\n    year    = {2021},\n    url     = {https:\u002F\u002Fofir.io\u002Ftrain_short_test_long.pdf}\n}\n```\n\n```bibtex\n@misc{parisotto2019stabilizing,\n    title     = {Stabilizing Transformers for Reinforcement Learning},\n    author    = {Emilio Parisotto and H. Francis Song and Jack W. Rae and Razvan Pascanu and Caglar Gulcehre and Siddhant M. 
Jayakumar and Max Jaderberg and Raphael Lopez Kaufman and Aidan Clark and Seb Noury and Matthew M. Botvinick and Nicolas Heess and Raia Hadsell},\n    year      = {2019},\n    eprint    = {1910.06764},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{narang2021transformer,\n    title       = {Do Transformer Modifications Transfer Across Implementations and Applications?},\n    author      = {Sharan Narang and Hyung Won Chung and Yi Tay and William Fedus and Thibault Fevry and Michael Matena and Karishma Malkan and Noah Fiedel and Noam Shazeer and Zhenzhong Lan and Yanqi Zhou and Wei Li and Nan Ding and Jake Marcus and Adam Roberts and Colin Raffel},\n    year        = {2021},\n    eprint      = {2102.11972},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{zhang2019root,\n    title   = {Root Mean Square Layer Normalization},\n    author  = {Biao Zhang and Rico Sennrich},\n    year    = {2019},\n    eprint  = {1910.07467},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@inproceedings{Qin2023ScalingTT,\n    title   = {Scaling TransNormer to 175 Billion Parameters},\n    author  = {Zhen Qin and Dong Li and Weigao Sun and Weixuan Sun and Xuyang Shen and Xiaodong Han and Yunshen Wei and Baohong Lv and Fei Yuan and Xiao Luo and Y. 
Qiao and Yiran Zhong},\n    year    = {2023},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:260203124}\n}\n```\n\n```bibtex\n@misc{su2021roformer,\n    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},\n    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},\n    year    = {2021},\n    eprint  = {2104.09864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@inproceedings{Yang2025RopeTN,\n    title   = {Rope to Nope and Back Again: A New Hybrid Attention Strategy},\n    author  = {Bowen Yang and Bharat Venkitesh and Dwarak Talupuru and Hangyu Lin and David Cairuz and Phil Blunsom and Acyr F. Locatelli},\n    year    = {2025},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:276079501}\n}\n```\n\n```bibtex\n@inproceedings{Chen2023ExtendingCW,\n    title   = {Extending Context Window of Large Language Models via Positional Interpolation},\n    author  = {Shouyuan Chen and Sherman Wong and Liangjian Chen and Yuandong Tian},\n    year    = {2023}\n}\n```\n\n```bibtex\n@inproceedings{Sun2022ALT,\n  title     = {A Length-Extrapolatable Transformer},\n  author    = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},\n  year      = {2022}\n}\n```\n\n```bibtex\n@Article{AlphaFold2021,\n    author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and 
Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n    journal = {Nature},\n    title   = {Highly accurate protein structure prediction with {AlphaFold}},\n    year    = {2021},\n    doi     = {10.1038\u002Fs41586-021-03819-2},\n    note    = {(Accelerated article preview)},\n}\n```\n\n```bibtex\n@software{peng_bo_2021_5196578,\n    author       = {PENG Bo},\n    title        = {BlinkDL\u002FRWKV-LM: 0.01},\n    month        = {aug},\n    year         = {2021},\n    publisher    = {Zenodo},\n    version      = {0.01},\n    doi          = {10.5281\u002Fzenodo.5196578},\n    url          = {https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.5196578}\n}\n```\n\n```bibtex\n@misc{csordás2021devil,\n    title   = {The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers},\n    author  = {Róbert Csordás and Kazuki Irie and Jürgen Schmidhuber},\n    year    = {2021},\n    eprint  = {2108.12284},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{so2021primer,\n    title   = {Primer: Searching for Efficient Transformers for Language Modeling},\n    author  = {David R. So and Wojciech Mańke and Hanxiao Liu and Zihang Dai and Noam Shazeer and Quoc V. 
Le},\n    year    = {2021},\n    eprint  = {2109.08668},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{ding2021erniedoc,\n    title   = {ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},\n    author  = {Siyu Ding and Junyuan Shang and Shuohuan Wang and Yu Sun and Hao Tian and Hua Wu and Haifeng Wang},\n    year    = {2021},\n    eprint  = {2012.15688},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{ding2021cogview,\n    title   = {CogView: Mastering Text-to-Image Generation via Transformers},\n    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},\n    year    = {2021},\n    eprint  = {2105.13290},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@inproceedings{anonymous2022normformer,\n    title   = {NormFormer: Improved Transformer Pretraining with Extra Normalization},\n    author  = {Anonymous},\n    booktitle = {Submitted to The Tenth International Conference on Learning Representations },\n    year    = {2022},\n    url     = {https:\u002F\u002Fopenreview.net\u002Fforum?id=GMYWzWztDx5},\n    note    = {under review}\n}\n```\n\n```bibtex\n@misc{henry2020querykey,\n    title   = {Query-Key Normalization for Transformers},\n    author  = {Alex Henry and Prudhvi Raj Dachapally and Shubham Pawar and Yuxuan Chen},\n    year    = {2020},\n    eprint  = {2010.04245},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{liu2021swin,\n    title   = {Swin Transformer V2: Scaling Up Capacity and ","x-transformers is a concise yet comprehensive Transformer implementation that integrates experimental features proposed in various papers. The project supports full attention and offers a rich set of configuration options, including encoder-decoder, decoder-only (GPT-like), and encoder-only (BERT-like) modes, as well as image classification and image-to-caption functionality. It is written in Python with a modular design that is easy to integrate and extend. It suits natural language processing tasks and multimodal scenarios that call for applying or studying Transformer models efficiently and flexibly.",2,"2026-06-11 03:24:43","top_topic"]