[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9682":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":41,"readmeContent":42,"aiSummary":43,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":44,"discoverSource":45},9682,"RWKV-LM","BlinkDL\u002FRWKV-LM","BlinkDL","RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 \"Goose\". So it's combining the best of RNN and transformer - great performance, linear time, constant space (no kv-cache), fast training, infinite ctx_len, and free sentence embedding.","",null,"Python",14560,1008,142,124,0,2,13,45,8,44.01,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40],"attention-mechanism","chatgpt","deep-learning","gpt","gpt-2","gpt-3","language-model","linear-attention","lstm","pytorch","rnn","rwkv","transformer","transformers","2026-06-12 02:02:11","# RWKV: Parallelizable RNN with Transformer-level LLM Performance (pronounced as \"RwaKuv\" (rʌkuv in IPA), from 4 major params: R W K V)\n\nRWKV website: https:\u002F\u002Frwkv.com (with 150+ papers training various RWKV models)\n\nRWKV twitter: https:\u002F\u002Ftwitter.com\u002FBlinkDL_AI (latest news)\n\nRWKV discord: https:\u002F\u002Fdiscord.gg\u002FbDSBUMeFpc\n\nRWKV-7 \"Goose\" is the strongest **linear-time** & **constant-space** (no kv-cache) & **attention-free** & 100% RNN architecture on this planet at this moment, suitable for 
LLM and multimodal applications and more (see [rwkv.com](https:\u002F\u002Frwkv.com)).\n\nRWKV-7 is a [meta-in-context learner](https:\u002F\u002Fraw.githubusercontent.com\u002FBlinkDL\u002FRWKV-LM\u002Fmain\u002FRWKV-v7.png), test-time-training its state on the context via in-context gradient descent at every token.\n\nRWKV is a [Linux Foundation AI project](https:\u002F\u002Flfaidata.foundation\u002Fprojects\u002Frwkv\u002F), so totally free. RWKV runtime is [already in Windows & Office](https:\u002F\u002Fx.com\u002FBlinkDL_AI\u002Fstatus\u002F1831012419508019550).\n\nYou are welcome to ask the RWKV community (such as [RWKV discord](https:\u002F\u002Fdiscord.gg\u002FbDSBUMeFpc)) for advice on upgrading your attention\u002Fssm models to rwkv7 models :)\n\n---\n\nRWKV Chat: https:\u002F\u002Frwkv.halowang.cloud\u002F (local inference for mobile\u002Fdesktop) and https:\u002F\u002Fgithub.com\u002FRWKV-APP\u002FRWKV_APP\n\nLatest RWKV weights: https:\u002F\u002Fhuggingface.co\u002FBlinkDL\n\nGGUF: https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fshoumenchougou\u002Frwkv7-gxx-gguf\n\nEfficient inference: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FAlbatross\n* 145+ token\u002Fs RWKV-7 7.2B fp16 bsz1 decoding @ RTX5090 (always const speed & vram)\n* 10250+ token\u002Fs RWKV-7 7.2B fp16 bsz960 decoding @ RTX5090 (always const speed & vram)\n* 9650+ token\u002Fs RWKV-7 7.2B fp16 bsz320 decoding @ RTX5090 (always const speed & vram)\n* 11289 token\u002Fs RWKV-7 7.2B fp16 bsz1 prefill @ RTX5090 (always const speed & vram)\n\nMobile inference library: https:\u002F\u002Fgithub.com\u002FMollySophia\u002Frwkv-mobile\n\n---\n\nFast RWKV-7 CUDA kernels (vanilla, state-tuning, state-passing infctx): https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-CUDA\u002Ftree\u002Fmain\u002Frwkv7_fast_fused\n\nRWKV7 7.2B bf16 training on 4x8xH100 ctx8192 zero2+cp = **259k tokens\u002Fs** (note: current RWKV7 kernel is slower for 0.1\u002F0.4B vs transformer, but you can reach great 
speed with larger models)\n\n**Please use https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Ftree\u002Fmain\u002FRWKV-v7\u002Ftrain_temp as RWKV-7 reference implementation**. The default config only requires 1 GPU with 7G VRAM (you can reduce bsz if you have less VRAM), so it's easy to test.\n\nSimplified RWKV-7 training demo: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v7\u002Ftrain_temp\u002Frwkv7_train_simplified.py\n\n**Important** (all shown in rwkv7_train_simplified.py):\n* Use PreLN LayerNorm (instead of RMSNorm) for RWKV. I think it's related to better initial state, because I am not using trainable initial state (found it useless when using LayerNorm).\n* Only apply weight decay to large matrix parameters (basically projections) in your model instead of all parameters. THIS IS VERY IMPORTANT.\n* Use correct initialization.\n\nNote FLA RWKV-7 is NOT aligned with reference implementation yet, and you will get less performance.\n\nThis is because RWKV-7 is the whole model with carefully set stuffs, including different init \u002F wd \u002F lr for each parameter, so it's readily scalable and very stable (spike-free).\n\nBut the price to pay is there is no good simple \"RWKV-7 layer\" because a pytorch layer can't make sure itself is using correct init and hyperparameters.\n\nSo if you need to use RWKV-7 for another task, please study train_temp code (only several hundred lines) and change it to suit you.\n\nSee: https:\u002F\u002Fgithub.com\u002FYS-Tang\u002FRWKV-FLA-comparison\n\n\u003Cimg width=\"3318\" height=\"2475\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd9f019c2-a178-4837-8539-3a360c0e6801\" \u002F>\n\n\u003Cimg width=\"2656\" height=\"1956\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F871d358b-dcd4-4b86-a04b-45c1bcc910b7\" \u002F>\n\n===\n\nRWKV-8:\n\n\u003Cimg src=\"RWKV-8-ROSA.png\">\n\nImproving RNNs: 
https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-8.md\n\n===\n\nHistory of RWKV (from v1 to v7): [https:\u002F\u002Fwiki.rwkv.com](https:\u002F\u002Fwiki.rwkv.com\u002F) (note: AI-written. might contain errors)\n\nGradio Demo 1: https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBlinkDL\u002FRWKV-Gradio-1\n\nGradio Demo 2: https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBlinkDL\u002FRWKV-Gradio-2\n\nWebGPU Demo: https:\u002F\u002Fcryscan.github.io\u002Fweb-rwkv-puzzles\u002F#\u002Fchat\n\n===\n\nRWKV-Runner GUI: https:\u002F\u002Fgithub.com\u002FjosStorer\u002FRWKV-Runner\u002Freleases\n\nAi00 Server: https:\u002F\u002Fgithub.com\u002FAi00-X\u002Fai00_server\n\nRWKV pip pkg: https:\u002F\u002Fpypi.org\u002Fproject\u002Frwkv\u002F\n\nPEFT (Lora etc.): https:\u002F\u002Fgithub.com\u002FJL-er\u002FRWKV-PEFT\n\nRLHF: https:\u002F\u002Fgithub.com\u002FOpenMOSE\u002FRWKV-LM-RLHF\n\n400+ RWKV projects: https:\u002F\u002Fgithub.com\u002Fsearch?o=desc&q=rwkv&s=updated&type=Repositories\n\n**Faster RWKV-7 kernels**: https:\u002F\u002Fgithub.com\u002Fjohanwind\u002Fwind_rwkv\n\n===\n\nRWKV-5\u002F6 Eagle\u002FFinch paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05892\n\nChat demo code: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FAPI_DEMO_CHAT.py\n\n**RWKV-7 demo code**: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Ftree\u002Fmain\u002FRWKV-v7\n\nhttps:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v7\u002Frwkv_v7_demo.py (GPT-like mode)\n\nhttps:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v7\u002Frwkv_v7_demo_rnn.py (RNN mode)\n\nhttps:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v7\u002Frwkv_v7_demo_fast.py (Both mode, fastest)\n\nRWKV-6 demo code: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Frwkv_v6_demo.py\n\nRWKV-6 demo code: 
https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_v6_demo.py\n\n## HOW TO TRAIN RWKV-7\u002F6\u002F5 on MiniPile (1.5G tokens) ##\n\nFor reference, use python 3.10+, torch 2.5+, cuda 12.4+, latest deepspeed, but **keep pytorch-lightning==1.9.5**\n\n**Train RWKV-7:**\n```\n# you can use latest torch + latest cuda (not limited to cu121)\npip install torch --upgrade --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\npip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade\n\n# train RWKV-7\ncd RWKV-v7\u002Ftrain_temp\u002F \n\n# download minipile .bin .idx to train_temp\u002Fdata first (check demo-training-prepare.sh)\n# this will generate the initial weight rwkv-init.pth in out\u002F......\u002F\nsh .\u002Fdemo-training-prepare.sh\n\n# this will load rwkv-init.pth and train the model. you may want to log in to wandb first\nsh .\u002Fdemo-training-run.sh\n\nyour out\u002F......\u002Ftrain_log.txt should have losses similar to:\n0 4.875856 131.0863 0.00059975 2025-04-24 02:23:42.481256 0\n1 4.028621 56.1834 0.00059899 2025-04-24 02:28:16.674463 1\n2 3.801625 44.7739 0.00059773 2025-04-24 02:32:51.059568 2\n3 3.663070 38.9808 0.00059597 2025-04-24 02:37:25.409892 3\n4 3.578974 35.8368 0.00059371 2025-04-24 02:41:59.711315 4\n5 3.510906 33.4786 0.00059096 2025-04-24 02:46:33.990839 5\n6 3.462345 31.8917 0.00058771 2025-04-24 02:51:08.378331 6\n7 3.412196 30.3318 0.00058399 2025-04-24 02:55:42.927474 7\n8 3.376724 29.2747 0.00057978 2025-04-24 03:00:17.504665 8\n9 3.336911 28.1321 0.00057511 2025-04-24 03:04:52.006063 9\n10 3.313411 27.4787 0.00056999 2025-04-24 03:09:27.563336 10\n11 3.295895 27.0016 0.00056441 2025-04-24 03:14:01.786079 11\n```\n\nRWKV-7 weight example for 1.5B (L24-D2048, vocab 65536):\n\n**Make sure you only apply wd to large tensors (with \"wdecay\" in comment) here**, or the performance will be much worse.\n\n| name                | shape         | comment      | 
initialization  |\n|---------------------|---------------|--------------|-----------------|\n| emb.weight          | [65536, 2048] | wdecay       | see code        |\n| blocks.0.ln0.weight | [2048]        | for layer 0  | 1               |\n| blocks.0.ln0.bias   | [2048]        | for layer 0  | 0               |\n|                     |               |              |                 |\n| blocks.*.ln1.weight | [2048]        |              | 1               |\n| blocks.*.ln1.bias   | [2048]        |              | 0               |\n| blocks.*.att.x_r    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.x_w    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.x_k    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.x_v    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.x_a    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.x_g    | [1, 1, 2048]  |              | see code        |\n| blocks.*.att.w0     | [1, 1, 2048]  | lr 2x        | see code        |\n| blocks.*.att.w1     | [2048, 96]    |              | 0               |\n| blocks.*.att.w2     | [96, 2048]    |              | see code        |\n| blocks.*.att.a0     | [1, 1, 2048]  |              | 0               |\n| blocks.*.att.a1     | [2048, 96]    |              | 0               |\n| blocks.*.att.a2     | [96, 2048]    |              | see code        |\n| blocks.*.att.v0     | [1, 1, 2048]  | for layer 1+ | 1               |\n| blocks.*.att.v1                | [2048, 64]   | for layer 1+ | 0         |\n| blocks.*.att.v2                | [64, 2048]   | for layer 1+ | see code  |\n| blocks.*.att.g1                | [2048, 256]  |              | 0         |\n| blocks.*.att.g2                | [256, 2048]  |              | see code  |\n| blocks.*.att.k_k               | [1, 1, 2048] |              | 1         |\n| blocks.*.att.k_a               | [1, 1, 2048] |              | 1         |\n| blocks.*.att.r_k               
| [32, 64]     |              | 0         |\n| blocks.*.att.receptance.weight | [2048, 2048] | wdecay       | see code  |\n| blocks.*.att.key.weight        | [2048, 2048] | wdecay       | see code  |\n| blocks.*.att.value.weight      | [2048, 2048] | wdecay       | see code  |\n| blocks.*.att.output.weight     | [2048, 2048] | wdecay       | 0         |\n| blocks.*.att.ln_x.weight       | [2048]       |              | see code  |\n| blocks.*.att.ln_x.bias         | [2048]       |              | 0         |\n|                                |              |              |           |\n| blocks.*.ln2.weight            | [2048]       |              | 1         |\n| blocks.*.ln2.bias              | [2048]       |              | 0         |\n| blocks.*.ffn.x_k               | [1, 1, 2048] |              | see code  |\n| blocks.*.ffn.key.weight        | [8192, 2048] | wdecay       | see code  |\n| blocks.*.ffn.value.weight      | [2048, 8192] | wdecay       | 0         |\n|                                |              |              |           |\n| ln_out.weight | [2048]        |        | 1         |\n| ln_out.bias   | [2048]        |        | 0         |\n| head.weight   | [65536, 2048] | wdecay | see code  |\n\nTrain RWKV-6: use \u002FRWKV-v5\u002F and use --my_testing \"x060\" in demo-training-prepare.sh and demo-training-run.sh\n\nYour loss curve should look almost exactly the same as this, with the same ups and downs (if you use the same bsz & config):\n\n\u003Cimg src=\"RWKV-v5-minipile.png\" width=\"500\">\n\nYou can run your model using https:\u002F\u002Fpypi.org\u002Fproject\u002Frwkv\u002F (use \"rwkv_vocab_v20230424\" instead of \"20B_tokenizer.json\")\n\nUse https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Fmake_data.py to prepare binidx data from jsonl, and compute \"--my_exit_tokens\" and \"--magic_prime\".\n\nUse 
https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Fcompute_magic_prime.py to compute \"--my_exit_tokens\" and \"--magic_prime\" for existing binidx.\n\nMuch faster tokenizers for large data: https:\u002F\u002Fgithub.com\u002Fcahya-wirawan\u002Fjson2bin https:\u002F\u002Fgithub.com\u002Fcahya-wirawan\u002Frwkv-tokenizer https:\u002F\u002Fgithub.com\u002Fm8than\u002FRWKV-World-Tokenizer-CPP\n\nThe \"epoch\" in train.py is a \"mini-epoch\" (not a real epoch, only for convenience), and 1 mini-epoch = 40320 * ctx_len tokens.\n\nFor example, if your binidx has 1498226207 tokens and ctxlen=4096, set \"--my_exit_tokens 1498226207\" (this will override epoch_count), and it will be 1498226207\u002F(40320 * 4096) = 9.07 mini-epochs. The trainer will auto-exit after \"--my_exit_tokens\" tokens. Set \"--magic_prime\" to the largest 3n+2 prime smaller than datalen\u002Fctxlen-1 (= 1498226207\u002F4096-1 = 365776), which is \"--magic_prime 365759\" in this case.\n\nsimple: prepare SFT jsonl => repeat your SFT data 3 or 4 times in make_data.py. more repetition leads to overfitting.\n\nadvanced: repeat your SFT data 3 or 4 times in your jsonl (note make_data.py will shuffle all jsonl items) => add some base data (such as slimpajama) to your jsonl => and only repeat 1 time in make_data.py.\n\n**Fix training spikes**: see the \"Fixing RWKV-6 Spikes\" part on this page. \n\nOr use RWKV-7 (much better). RWKV-7 is very stable and spike-free (verified for 0.1\u002F0.4\u002F1.5\u002F2.9b):\n\u003Cimg src=\"RWKV-v7-loss.png\" width=\"500\">\n\n**Simple inference for RWKV-6**: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_v6_demo.py\n\n**Simple inference for RWKV-5**: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_v5_demo.py\n\n**Note: In [state = kv + w * state] everything must be in fp32 because w can be very close to 1. 
So we can keep state and w in fp32, and convert kv to fp32.**\n\nlm_eval: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002Frun_lm_eval.py\n\n**Tips for small model \u002F small data**: When I train RWKV music models, I use deep & narrow (such as L29-D512) dimensions, and apply wd and dropout (such as wd=2 dropout=0.02). Note RWKV-LM dropout is very effective - use 1\u002F4 of your usual value.\n\n## HOW TO TRAIN RWKV-7 on Pile (332G tokens) ##\n\nSee https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Fdemo-training-prepare-v7-pile.sh and https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Fdemo-training-run-v7-pile.sh\n\nGet these files first:\n\npile_20B_tokenizer_text_document.bin (664230651068 bytes)\n\npile_20B_tokenizer_text_document.idx (4212099722 bytes)\n\n### HOW TO FINETUNE RWKV-5 MODELS ###\n\nUse .jsonl format for your data (see https:\u002F\u002Fhuggingface.co\u002FBlinkDL\u002Frwkv-5-world for formats).\n\nUse https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v5\u002Fmake_data.py to tokenize it with the World tokenizer into binidx, suitable for finetuning World models.\n\nRename the base checkpoint in your model folder to rwkv-init.pth, and change the training commands to use --n_layer 32 --n_embd 4096 --vocab_size 65536 --lr_init 1e-5 --lr_final 1e-5 for 7B.\n\n0.1B = --n_layer 12 --n_embd 768 \u002F\u002F 0.4B = --n_layer 24 --n_embd 1024 \u002F\u002F 1.5B = --n_layer 24 --n_embd 2048 \u002F\u002F 3B = --n_layer 32 --n_embd 2560 \u002F\u002F 7B = --n_layer 32 --n_embd 4096\n\n### State-tuning (tuning the initial state. 
zero inference overhead)\n\nCurrently unoptimized implementation, takes same vram as full SFT\n\n```--train_type \"states\" --load_partial 1 --lr_init 1 --lr_final 0.01 --warmup_steps 10 (yes, use very high LR)```\n\nuse rwkv 0.8.26+ to auto-load the trained \"time_state\" \n\n### Initializing RWKV 5\u002F6 Models ###\n\nWhen you train RWKV from scratch, try my initialization for best performance. Check generate_init_weight() of src\u002Fmodel.py:\n```\nemb.weight => nn.init.uniform_(a=-1e-4, b=1e-4)\n(Note ln0 of block0 is the layernorm for emb.weight)\nhead.weight => nn.init.orthogonal_(gain=0.5*sqrt(n_vocab \u002F n_embd))\n\natt.receptance.weight => nn.init.orthogonal_(gain=1)\natt.key.weight => nn.init.orthogonal_(gain=0.1)\natt.value.weight => nn.init.orthogonal_(gain=1)\natt.gate.weight => nn.init.orthogonal_(gain=0.1)\natt.output.weight => zero\n\natt.ln_x.weight (groupnorm) => ((1 + layer_id) \u002F total_layers) ** 0.7\n\nffn.key.weight => nn.init.orthogonal_(gain=1)\nffn.value.weight => zero\nffn.receptance.weight => zero\n```\n!!! If you are using positional embedding, maybe it's better to remove block.0.ln0 and use default initialization for emb.weight instead of my uniform_(a=-1e-4, b=1e-4) !!!\n\n### Fixing RWKV-6 Spikes ###\n\n0. upgrade to RWKV-7. It's very stable.\n\n1. when training from scratch, add \"k = k * torch.clamp(w, max=0).exp()\" before \"RUN_CUDA_RWKV6(r, k, v, w, u)\", and remember to change your inference code too. you will see faster convergence.\n\n2. use \"--adam_eps 1e-18\"\n\n3. \"--beta2 0.95\" if you see spikes\n\n4. in trainer.py do \"lr = lr * (0.01 + 0.99 * trainer.global_step \u002F w_step)\" (originally 0.2 + 0.8), and \"--warmup_steps 20\"\n\n5. \"--weight_decay 0.1\" leads to better final loss if you are training lots of data. set lr_final to 1\u002F100 of lr_init when doing this.\n\n### Misc\n\nRWKV-7 can do math. 
See https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FResearch\u002Frwkv7-g0-7.2b.md for details.\n\n\u003Cimg width=\"555\" height=\"784\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F095b4576-962f-4274-ae1a-855406ec76c1\" \u002F>\n\n\u003Cimg src=\"RWKV-v7-niah.png\">\n\n## Introducing RWKV\n\nRWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the \"GPT\" mode to quickly compute the hidden state for the \"RNN\" mode.\n\nSo it's combining the best of RNN and transformer - **great performance, fast inference, saves VRAM, fast training, \"infinite\" ctx_len, and free sentence embedding** (using the final hidden state).\n\n**All latest RWKV weights:** https:\u002F\u002Fhuggingface.co\u002FBlinkDL\n\n**HF-compatible RWKV weights:** https:\u002F\u002Fhuggingface.co\u002FRWKV\n\n```python\nos.environ[\"RWKV_JIT_ON\"] = '1'\nos.environ[\"RWKV_CUDA_ON\"] = '0' # if '1' then use CUDA kernel for seq mode (much faster)\nfrom rwkv.model import RWKV                         # pip install rwkv\nmodel = RWKV(model='\u002Ffsx\u002FBlinkDL\u002FHF-MODEL\u002Frwkv-4-pile-1b5\u002FRWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')\n\nout, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json\nprint(out.detach().cpu().numpy())                   # get logits\nout, state = model.forward([187, 510], None)\nout, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)\nout, state = model.forward([310, 247], state)\nprint(out.detach().cpu().numpy())                   # same result as above\n```\n\nnanoRWKV: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FnanoRWKV (does not require custom CUDA kernel to train, works for any 
GPU\u002FCPU)\n\n**Cool Community RWKV Projects**:\n\nAll (400+) RWKV projects: https:\u002F\u002Fgithub.com\u002Fsearch?o=desc&q=rwkv&s=updated&type=Repositories\n\nhttps:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVision-RWKV Vision RWKV\n\nhttps:\u002F\u002Fgithub.com\u002Ffeizc\u002FDiffusion-RWKV Diffusion RWKV\n\nhttps:\u002F\u002Fgithub.com\u002Fcgisky1980\u002Fai00_rwkv_server Fastest WebGPU inference (nVidia\u002FAMD\u002FIntel)\n\nhttps:\u002F\u002Fgithub.com\u002Fcryscan\u002Fweb-rwkv backend for ai00_rwkv_server\n\nhttps:\u002F\u002Fgithub.com\u002FsaharNooby\u002Frwkv.cpp Fast CPU\u002FcuBLAS\u002FCLBlast inference: int4\u002Fint8\u002Ffp16\u002Ffp32\n\nhttps:\u002F\u002Fgithub.com\u002FJL-er\u002FRWKV-PEFT lora\u002Fpissa\u002FQlora\u002FQpissa\u002Fstate tuning\n\nhttps:\u002F\u002Fgithub.com\u002FRWKV\u002FRWKV-infctx-trainer Infctx trainer\n\nhttps:\u002F\u002Fgithub.com\u002Fdaquexian\u002Ffaster-rwkv\n\nhttps:\u002F\u002Fgithub.com\u002Fmlc-ai\u002Fmlc-llm\u002Fpull\u002F1275\n\nhttps:\u002F\u002Fgithub.com\u002FTheRamU\u002FFay\u002Fblob\u002Fmain\u002FREADME_EN.md Digital Assistant with RWKV\n\nhttps:\u002F\u002Fgithub.com\u002Fharrisonvanderbyl\u002Frwkv-cpp-cuda Fast GPU inference with cuda\u002Famd\u002Fvulkan\n\n**RWKV v6 in 250 lines** (with tokenizer too): https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_v6_demo.py\n\n**RWKV v5 in 250 lines** (with tokenizer too): https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_v5_demo.py\n\n**RWKV v4 in 150 lines** (model, inference, text generation): https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FChatRWKV\u002Fblob\u002Fmain\u002FRWKV_in_150_lines.py\n\n**RWKV v4 preprint** https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13048\n\n**RWKV v4 introduction, and in 100 lines of numpy**: https:\u002F\u002Fjohanwind.github.io\u002F2023\u002F03\u002F23\u002Frwkv_overview.html 
https:\u002F\u002Fjohanwind.github.io\u002F2023\u002F03\u002F23\u002Frwkv_details.html\n\n![RWKV-7](RWKV-v7.png)\n\n![MQAR](Research\u002FRWKV-6-MQAR.png)\n\n![RWKV-paper](RWKV-paper.png)\n\nRWKV v6 illustrated:\n\n![RWKV-v6](rwkv-x060.png)\n\n![RWKV-v5-benchmark-1](RWKV-v5-benchmark-1.png)\n\nA cool paper (Spiking Neural Network) using RWKV: https:\u002F\u002Fgithub.com\u002Fridgerchu\u002FSpikeGPT\n\nYou are welcome to join the RWKV discord https:\u002F\u002Fdiscord.gg\u002FbDSBUMeFpc to build upon it. We have plenty of potential compute (A100 40Gs) now (thanks to Stability and EleutherAI), so if you have interesting ideas I can run them.\n\n![RWKV-eval2](RWKV-eval2.png)\n\nRWKV [loss vs token position] for 10000 ctx4k+ documents in Pile. RWKV 1B5-4k is mostly flat after ctx1500, but 3B-4k and 7B-4k and 14B-4k have some slopes, and they are getting better. This debunks the old view that RNNs cannot model long ctxlens. We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)\n\n![RWKV-ctxlen](RWKV-ctxlen.png)\n\nChatRWKV with RWKV 14B ctx8192:\n\n![RWKV-chat](RWKV-chat.png)\n\nI believe RNN is a better candidate for fundamental models, because: (1) It's more friendly for ASICs (no kv cache). (2) It's more friendly for RL. (3) When we write, our brain is more similar to RNN. (4) The universe is like an RNN too (because of locality). Transformers are non-local models.\n\nRWKV-3 1.5B on A40 (tf32) = always 0.015 sec\u002Ftoken, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M\n\nGPT2-XL 1.3B on A40 (tf32) = 0.032 sec\u002Ftoken (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M\n\nTraining speed: (new training code) RWKV-4 14B BF16 ctxlen4096 = 114K tokens\u002Fs on 8x8 A100 80G (ZERO2+CP). 
(old training code) RWKV-4 1.5B BF16 ctxlen1024 = 106K tokens\u002Fs on 8xA100 40G.\n\nI am doing image experiments too (For example: https:\u002F\u002Fhuggingface.co\u002FBlinkDL\u002Fclip-guided-binary-autoencoder) and RWKV will be able to do txt2img diffusion :) My idea: 256x256 rgb image -> 32x32x13bit latents -> apply RWKV to compute transition probability for each of the 32x32 grid -> pretend the grids are independent and \"diffuse\" using these probabilities.\n\nSmooth training - no loss spikes! (lr & bsz change around 15G tokens)\n![RWKV-loss](RWKV-loss.png)\n\n![RWKV-eval](RWKV-eval.png)\n\nAll of the trained models will be open-source. Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.\n\nHow it works: RWKV gathers information to a number of channels, which are also decaying with different speeds as you move to the next token. It's very simple once you understand it.\n\n**RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called \"gates\"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. Moreover, you can fine-tune RWKV into a non-parallelizable RNN (then you can use outputs of later layers of the previous token) if you want extra performance.\n\n![RWKV-formula](RWKV-formula.png)\n\nHere are some of my TODOs. Let's work together :)\n\n* HuggingFace integration (check https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fissues\u002F17230\n), and optimized CPU & iOS & Android & WASM & WebGL inference. RWKV is a RNN and very friendly for edge devices. Let's make it possible to run a LLM on your phone. \n\n* Test it on bidirectional & MLM tasks, and image & audio & video tokens. 
I think RWKV can support Encoder-Decoder via this: for each decoder token, use a learned mixture of [decoder previous hidden state] & [encoder final hidden state]. Hence all decoder tokens will have access to the encoder output.\n\n* Now training RWKV-4a with one single tiny extra attention (just a few extra lines compared with RWKV-4) to further improve some difficult zeroshot tasks (such as LAMBADA) for smaller models. See https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fcommit\u002Fa268cd2e40351ee31c30c5f8a5d1266d35b41829\n\nUser feedback:\n> *I've so far toyed around the character-based model on our relatively small pre-training dataset (around 10GB of text), and the results are extremely good - similar ppl to models taking much, much longer to train.*\n\n> *dear god rwkv is fast. i switched to another tab after starting training it from scratch & when i returned it was emitting plausible english & maori words, i left to go microwave some coffee & when i came back it was producing fully grammatically correct sentences.*\n\nTweet from Sepp Hochreiter (thank you!): https:\u002F\u002Ftwitter.com\u002FHochreiterSepp\u002Fstatus\u002F1524270961314484227\n\nYou can find me (BlinkDL) in the EleutherAI Discord too: https:\u002F\u002Fwww.eleuther.ai\u002Fget-involved\u002F\n\n![RWKV-demo](RWKV-demo.png)\n\n## Quick start\n\n**IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.5 torch==1.13.1+cu117 and cuda 11.7.1 or 11.7 (note torch2 + deepspeed has weird bugs and hurts model performance)**\n\nUse https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Ftree\u002Fmain\u002FRWKV-v4neo (latest code, compatible with v4).\n\nHere is a great prompt for testing Q&A of LLMs. 
Works for any model: (found by minimizing ChatGPT ppls for RWKV 1.5B)\n```python\nprompt = f'\\nQ & A\\n\\nQuestion:\\n{qq}\\n\\nDetailed Expert Answer:\\n' # let the model generate after this\n```\n\n### Inference\n\n**Run RWKV-4 Pile models:** Download models from https:\u002F\u002Fhuggingface.co\u002FBlinkDL. Set TOKEN_MODE = 'pile' in run.py and run it. It's fast even on CPU (the default mode).\n\n**Colab for RWKV-4 Pile 1.5B**: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1F7tZoPZaWJf1fsCmZ5tjw6sYHiFOYVWM\n\nRun RWKV-4 Pile models in your browser (and onnx version): see this issue https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fissues\u002F7\n\nRWKV-4 Web Demo: https:\u002F\u002Fjosephrocca.github.io\u002Frwkv-v4-web\u002Fdemo\u002F (note: only greedy sampling for now)\n\nFor the old RWKV-2: see the release here for a 27M params model on enwik8 with 0.72 BPC(dev). Run run.py in https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Ftree\u002Fmain\u002FRWKV-v2-RNN. You can even run it in your browser: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FAI-Writer\u002Ftree\u002Fmain\u002Fdocs\u002Feng https:\u002F\u002Fblinkdl.github.io\u002FAI-Writer\u002Feng\u002F (this is using tf.js WASM single-thread mode).\n\n### Training \u002F Fine-tuning\n\npip install deepspeed==0.7.0 \u002F\u002F pip install pytorch-lightning==1.9.5 \u002F\u002F torch 1.13.1+cu117\n\nNOTE: add weight decay (0.1 or 0.01) and dropout (0.1 or 0.01) when training on a small amount of data. try x=x+dropout(att(x)) x=x+dropout(ffn(x)) x=dropout(x+att(x)) x=dropout(x+ffn(x)) etc.\n\n**Training RWKV-4 from scratch:** run train.py, which by default is using the enwik8 dataset (unzip https:\u002F\u002Fdata.deepai.org\u002Fenwik8.zip).\n\nYou will be training the \"GPT\" version because it's parallelizable and faster to train. RWKV-4 can extrapolate, so training with ctxLen 1024 can work for ctxLen of 2500+. 
You can fine-tune the model with longer ctxLen and it can quickly adapt to longer ctxLens.\n\n**Fine-tuning RWKV-4 Pile models:** use 'prepare-data.py' in https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-v2-RNN-Pile\u002Ftree\u002Fmain\u002FRWKV-v3 to tokenize .txt into train.npy data. Then use https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v4neo\u002Ftrain.py to train it.\n\nRead the inference code in src\u002Fmodel.py and try using the final hidden state (.xx .aa .bb) as a faithful sentence embedding for other tasks. Probably you should begin with .xx and .aa\u002F.bb (.aa divided by .bb).\n\nColab for fine-tuning RWKV-4 Pile models: https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fresloved\u002FRWKV-notebooks\u002Fblob\u002Fmaster\u002FRWKV_v4_RNN_Pile_Fine_Tuning.ipynb\n\n**Large corpus:** Use https:\u002F\u002Fgithub.com\u002FAbel2076\u002Fjson2binidx_tool to convert .jsonl into .bin and .idx\n\nThe jsonl format sample (one line for each document):\n```\n{\"text\": \"This is the first document.\"}\n{\"text\": \"Hello\\nWorld\"}\n{\"text\": \"1+1=2\\n1+2=3\\n2+2=4\"}\n```\ngenerated by code like this:\n```\nss = json.dumps({\"text\": text}, ensure_ascii=False)\nout.write(ss + \"\\n\")\n```\n\n**Infinite ctxlen training (WIP):** https:\u002F\u002Fgithub.com\u002FBlealtan\u002FRWKV-LM-LoRA\u002Ftree\u002Fdev-infctx\n\n### How to use RWKV hidden state as text embedding\n\nConsider RWKV 14B. The state has 200 vectors, that is, 5 vectors for each block: fp16 (xx), fp32 (aa), fp32 (bb), fp32 (pp), fp16 (xx).\n\nDo not avg pool because different vectors (xx aa bb pp xx) in the state have very different meanings and ranges. You can probably remove pp.\n\nI suggest first collecting the mean+stdev statistics of each channel of each vector, and normalize all of them (note: the normalization should be data-independent and collected from various texts). 
Then train a linear classifier.\n\n## Towards RWKV-5 (just to record some new ideas)\n\n### Latest Design\n\nRWKV-5 is multi-head; only one head is shown here. There is also a LayerNorm for each head (hence actually GroupNorm).\n\n$`\n\\begin{array}{|l|l|l|}\n\\hline & \\text { RWKV-4 with real-valued } k \\,\\&\\, v \\,\\&\\, u \\,\\&\\, w & \\text { RWKV-5 with matrix-valued } \\mathrm{k}^{\\dagger} \\mathrm{v} \\,\\&\\, \\mathrm{u} \\,\\&\\, \\mathrm{w} \\\\\n\\hline \\mathrm{y}_0 & \\mathrm{r}_0 \\frac{\\mathrm{uk}_0 \\mathrm{v}_0}{\\mathrm{uk}_0} & \\mathrm{r}_0\\left(\\mathrm{uk}_0^{\\dagger} \\mathrm{v}_0\\right) \\\\\n\\hline \\mathrm{y}_1 & \\mathrm{r}_1 \\frac{\\mathrm{uk}_1 \\mathrm{v}_1+\\mathrm{k}_0 \\mathrm{v}_0}{\\mathrm{uk}_1+\\mathrm{k}_0} & \\mathrm{r}_1\\left(\\mathrm{uk}_1^{\\dagger} \\mathrm{v}_1+\\mathrm{k}_0^{\\dagger} \\mathrm{v}_0\\right) \\\\\n\\hline \\mathrm{y}_2 & \\mathrm{r}_2 \\frac{\\mathrm{uk}_2 \\mathrm{v}_2+\\mathrm{k}_1 \\mathrm{v}_1+\\mathrm{wk}_0 \\mathrm{v}_0}{\\mathrm{uk}_2+\\mathrm{k}_1+\\mathrm{wk}_0} & \\mathrm{r}_2\\left(\\mathrm{uk}_2^{\\dagger} \\mathrm{v}_2+\\mathrm{k}_1^{\\dagger} \\mathrm{v}_1+\\mathrm{wk}_0^{\\dagger} \\mathrm{v}_0\\right) \\\\\n\\hline \\mathrm{y}_3 & \\mathrm{r}_3 \\frac{\\mathrm{uk}_3 \\mathrm{v}_3+\\mathrm{k}_2 \\mathrm{v}_2+\\mathrm{wk}_1 \\mathrm{v}_1+\\mathrm{w}^2 \\mathrm{k}_0 \\mathrm{v}_0}{\\mathrm{uk}_3+\\mathrm{k}_2+\\mathrm{wk}_1+\\mathrm{w}^2 \\mathrm{k}_0} & \\mathrm{r}_3\\left(\\mathrm{uk}_3^{\\dagger} \\mathrm{v}_3+\\mathrm{k}_2^{\\dagger} \\mathrm{v}_2+\\mathrm{wk}_1^{\\dagger} \\mathrm{v}_1+\\mathrm{w}^2 \\mathrm{k}_0^{\\dagger} \\mathrm{v}_0\\right) \\\\\n\\hline\n\\end{array}`$\n\n$`\\left[\\begin{array}{ll}\n\\mathrm{y}_{20} & \\cdots \\mathrm{y}_{2 \\mathrm{c}}\n\\end{array}\\right]=\\left[\\begin{array}{lll}\n\\mathrm{r}_{20} & \\cdots & \\mathrm{r}_{2 \\mathrm{c}}\n\\end{array}\\right]`$\n$`\\left(\\left[\\begin{array}{ccc}\n\\mathrm{u}_{00} & \\cdots & \\mathrm{u}_{0 
\\mathrm{c}} \\\\\n\\vdots & \\ddots & \\vdots \\\\\n\\mathrm{u}_{\\mathrm{c} 0} & \\cdots & \\mathrm{u}_{\\mathrm{cc}}\n\\end{array}\\right]\\left[\\begin{array}{ccc}\n\\mathrm{k}_{20} \\mathrm{v}_{20} & \\cdots & \\mathrm{k}_{20} \\mathrm{v}_{2 \\mathrm{c}} \\\\\n\\vdots & \\ddots & \\vdots \\\\\n\\mathrm{k}_{2 \\mathrm{c}} \\mathrm{v}_{20} & \\cdots & \\mathrm{k}_{2 \\mathrm{c}} \\mathrm{v}_{2 \\mathrm{c}}\n\\end{array}\\right]+\\left[\\begin{array}{ccc}\n\\mathrm{k}_{10} \\mathrm{v}_{10} & \\cdots & \\mathrm{k}_{10} \\mathrm{v}_{1 \\mathrm{c}} \\\\\n\\vdots & \\ddots & \\vdots \\\\\n\\mathrm{k}_{1 \\mathrm{c}} \\mathrm{v}_{10} & \\cdots & \\mathrm{k}_{1 \\mathrm{c}} \\mathrm{v}_{1 \\mathrm{c}}\n\\end{array}\\right]+\\left[\\begin{array}{ccc}\n\\mathrm{w}_{00} & \\cdots & \\mathrm{w}_{0 \\mathrm{c}} \\\\\n\\vdots & \\ddots & \\vdots \\\\\n\\mathrm{w}_{\\mathrm{c} 0} & \\cdots & \\mathrm{w}_{\\mathrm{cc}}\n\\end{array}\\right]\\left[\\begin{array}{ccc}\n\\mathrm{k}_{00} \\mathrm{v}_{00} & \\cdots & \\mathrm{k}_{00} \\mathrm{v}_{0 c} \\\\\n\\vdots & \\ddots & \\vdots \\\\\n\\mathrm{k}_{0 \\mathrm{c}} \\mathrm{v}_{00} & \\cdots & \\mathrm{k}_{0 \\mathrm{c}} \\mathrm{v}_{0 c}\n\\end{array}\\right]\n\\right)`$\n\n### RWKV-6\n\nDynamic Mix & Dynamic Decay. 
Example (do this for both TimeMix & ChannelMix):\n```\nTIME_MIX_EXTRA_DIM = 32\nself.time_mix_k_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))\nself.time_mix_k_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))\nself.time_mix_v_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))\nself.time_mix_v_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))\nself.time_mix_r_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))\nself.time_mix_r_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))\nself.time_mix_g_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))\nself.time_mix_g_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))\n...\ntime_mix_k = self.time_mix_k.view(1,1,-1) + (x @ self.time_mix_k_w1) @ self.time_mix_k_w2\ntime_mix_v = self.time_mix_v.view(1,1,-1) + (x @ self.time_mix_v_w1) @ self.time_mix_v_w2\ntime_mix_r = self.time_mix_r.view(1,1,-1) + (x @ self.time_mix_r_w1) @ self.time_mix_r_w2\ntime_mix_g = self.time_mix_g.view(1,1,-1) + (x @ self.time_mix_g_w1) @ self.time_mix_g_w2\n\nxx = self.time_shift(x)\nxk = x * time_mix_k + xx * (1 - time_mix_k)\nxv = x * time_mix_v + xx * (1 - time_mix_v)\nxr = x * time_mix_r + xx * (1 - time_mix_r)\nxg = x * time_mix_g + xx * (1 - time_mix_g)\n```\n\n![RWKV-v6](RWKV-v6.png)\n\n### RWKV-7\n\nUse parallelized mode to quickly generate the state, then use a finetuned full RNN (the layers of token n can use outputs of all layer of token n-1) for sequential generation.\n\n### Some old ideas\n\n1. Now time decay is like 0.999^T (0.999 is learnable). Change it to something like (0.999^T + 0.1) where 0.1 is learnable too. The 0.1 part will be kept forever. Or, A^T + B^T + C = fast-decay + slow-decay + constant. Can even use different formulas (for example, K^2 instead of e^K for a decay component, or, without normalization).\n\n2. 
Use complex-valued decay (so, rotation instead of decay) in some channels.\n\n3. Inject some trainable and extrapolatable positional encoding?\n\n4. Aside from 2d rotation, we can try other Lie groups such as 3d rotation ( SO(3) ). Non-abelian RWKV lol.\n\n5. RWKV might be great on analog devices (search for Analog Matrix-vector multiplication & Photonic Matrix-vector multiplication). The RNN mode is very hardware-friendly (processing-in-memory). Can be an SNN too (https:\u002F\u002Fgithub.com\u002Fridgerchu\u002FSpikeGPT). I wonder if it can be optimized for quantum computation.\n\n6. Trainable initial hidden state (xx aa bb pp xx).\n\n7. Layerwise (or even row\u002Fcolumn-wise, elementwise) LR, and test Lion optimizer.\n\n### Vision Tasks\n\n1. I find it's good to add a 2d pos encoding:\n```\nself.pos_emb_x = nn.Parameter(torch.zeros((1,args.my_pos_emb,args.n_embd)))\nself.pos_emb_y = nn.Parameter(torch.zeros((args.my_pos_emb,1,args.n_embd)))\n...\nx = x + pos_emb_x + pos_emb_y\n```\n\n2. In a BPE language model, it's best to use [tokenShift of 1 token] (you can mix more tokens in a char-level English model). However you can try [tokenShift of N (or N-1) (or N+1) tokens] if the image size is N x N, because that will be like mixing [the token above the current position (or the token above the to-be-predicted position)] with [current token]. You can try different tokenShift styles for \"ATT\" & \"FFN\", or mix different tokenShift styles - such as mixing [token A] with [token A-1] and [token A-(N-1)] etc.\n\n### Misc\n\nMaybe we can improve memorization by simply repeating the context (I guess 2 times is enough). Example:  Reference -> Reference(again) -> Question -> Answer\n\n#### Idea: Bytes-aware Embedding\n\nThe idea is to make sure each token in vocab understands its length and raw UTF-8 bytes.\n\nLet a = max(len(token)) for all tokens in vocab. Define AA : float[a][d_emb]\n\nLet b = max(len_in_utf8_bytes(token)) for all tokens in vocab. 
Define BB : float[b][256][d_emb]\n\nFor each token X in vocab, let [x0, x1, ..., xn] be its raw UTF-8 bytes. We will add some extra values to its embedding EMB(X):\n\nEMB(X) += AA[len(X)] + BB[0][x0] + BB[1][x1] + ... + BB[n][xn] (note: AA BB are learnable weights)\n\n* We can do this for the final Linear(d_emb, n_vocab) projection too.\n* We can use some small networks to generate AA and BB, for some extra regularization (for example, BB[m][xi] and BB[n][xi] should be related).\n\n#### Old Idea\n\nI have an idea to improve tokenization. We can hardcode some channels to have meanings. Example:\n\nChannel 0 = \"space\"\n\nChannel 1 = \"capitalize first letter\"\n\nChannel 2 = \"capitalize all letters\"\n\nTherefore:\n\nEmbedding of \"abc\":  [0, 0, 0, x0, x1, x2 , ..]\n\nEmbedding of \" abc\":  [1, 0, 0, x0, x1, x2, ..]\n\nEmbedding of \" Abc\":  [1, 1, 0, x0, x1, x2, ..]\n\nEmbedding of \"ABC\": [0, 0, 1, x0, x1, x2, ...]\n\n......\n\nso they will share most of the embedding. And we can rapidly compute the output probability of all variations of \"abc\".\n\nNote: the above method is assuming that p(\" xyz\") \u002F p(\"xyz\") is the same for any \"xyz\", which can be wrong.\n\nBetter: define emb_space emb_capitalize_first emb_capitalize_all to be a function of emb.\n\nMaybe the Best: let 'abc' ' abc' etc. to share the last 90% of their embeddings.\n\nAt this moment, all our tokenizers spend too many items to represent all variations of 'abc' ' abc' ' Abc' etc. Moreover the model cannot discover that these are actually similar if some of these variations are rare in the dataset. The method here can improve this. I plan to test this in a new version of RWKV.\n\n#### Idea: Better Initial States\n\nExample (single-round Q & A):\n\n1. Generate the final state of all wiki documents.\n\n2. For any user Q, find the best wiki document, and use its final state as the initial state.\n\n3. 
Train a model to directly generate the optimal initial state for any user Q.\n\nHowever this can be a bit more tricky for multi-round Q & A :)\n\n## How it works\n\nRWKV is inspired by Apple's AFT (https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14103).\n\nMoreover it's using a number of my tricks, such as:\n\n* SmallInitEmb: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FSmallInitEmb (applicable to all transformers) which helps the embedding quality, and stabilizes Post-LN (which is what I am using).\n\n* Token-shift: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM#token-shift-time-shift-mixing (applicable to all transformers), especially helpful for char-level models.\n\n* Head-QK: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM#the-head-qk-trick-learning-to-copy-and-avoid-tokens (applicable to all transformers). Note: it's helpful, but I disabled it in the Pile model to keep it 100% RNN.\n\n* Extra R-gate in the FFN (applicable to all transformers). I am also using reluSquared from Primer.\n\n* Better initialization: I init most of the matrices to ZERO (see RWKV_Init in https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Fblob\u002Fmain\u002FRWKV-v2-RNN\u002Fsrc\u002Fmodel.py).\n\n* You can transfer some parameters from a small model to a large model (note: I sort & smooth them too), for faster and better convergence (see https:\u002F\u002Fwww.reddit.com\u002Fr\u002FMachineLearning\u002Fcomments\u002Fumq908\u002Fr_rwkvv2rnn_a_parallelizable_rnn_with\u002F).\n\n* My CUDA kernel: https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-CUDA to speed up training.\n\n## The pseudocode (execution from top to bottom):\n\n![RWKV-v2-RNN](RWKV-v2-RNN.png)\n\nThe a b c d factors work together to build a time-decay curve: [X, 1, W, W^2, W^3, ...].\n\nWrite out the formulas for \"token at pos 2\" and \"token at pos 3\" and you will get the idea:\n* a and b: EMAs of kv and k.\n* c and d: these are a and b combined with \"self-attention\".\n\nkv \u002F k is the 
memory mechanism. The token with high k can be remembered for a long duration, if W is close to 1 in the channel.\n\nThe R-gate is important for performance. k = info strength of this token (to be passed to future tokens). r = whether to apply the info to this token.\n\n## RWKV-3 improvements\n\nUse different trainable TimeMix factors for R \u002F K \u002F V in SA and FF layers. Example:\n```python\nxx = self.time_shift(x)\nxk = x * self.time_mix_k + xx * (1 - self.time_mix_k)\nxv = x * self.time_mix_v + xx * (1 - self.time_mix_v)\nxr = x * self.time_mix_r + xx * (1 - self.time_mix_r)\n```\n\nUse preLN instead of postLN (more stable & faster convergence):\n```python\nif self.layer_id == 0:\n\tx = self.ln0(x)\nx = x + self.att(self.ln1(x))\nx = x + self.ffn(self.ln2(x))\n```\n\n## Explaining the code for RWKV-3 GPT mode\n\n### The GPT mode - overview\n\nThe building blocks of RWKV-3 GPT mode are similar to that of a usual preLN GPT.\n\nThe only difference is an extra LN after embedding. Note you can absorb this LN into the embedding after finishing the training.\n```python\nx = self.emb(idx)  # input: idx = token indices\nx = self.ln_emb(x) # extra LN after embedding\nx = x + self.att_0(self.ln_att_0(x)) # preLN\nx = x + self.ffn_0(self.ln_ffn_0(x))\n...\nx = x + self.att_n(self.ln_att_n(x))\nx = x + self.ffn_n(self.ln_ffn_n(x))\nx = self.ln_head(x) # final LN before projection\nx = self.head(x)    # output: x = logits\n```\nIt is important to initialize emb to tiny values, such as nn.init.uniform_(a=-1e-4, b=1e-4), to utilize my trick https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FSmallInitEmb.\n\nFor the 1.5B RWKV-3, I use Adam (no wd, no dropout) optimizer on 8 * A100 40G.\n\nbatchSz = 32 * 896, ctxLen = 896. I am using tf32 so the batchSz is a bit small. 
\n\nFor the first 15B tokens, LR is fixed at 3e-4, and beta=(0.9, 0.99).\n\nThen I set beta=(0.9, 0.999), and do an exponential decay of LR, reaching 1e-5 at 332B tokens.\n\n### The GPT mode - ATT block\n\nThe RWKV-3 does not have any attention in the usual sense, but we will call this block ATT anyway.\n```python\nB, T, C = x.size() # x = (Batch,Time,Channel)\n\n# Mix x with the previous timestep to produce xk, xv, xr\nxx = self.time_shift(x) # self.time_shift = nn.ZeroPad2d((0,0,1,-1))\nxk = x * self.time_mix_k + xx * (1 - self.time_mix_k)\nxv = x * self.time_mix_v + xx * (1 - self.time_mix_v)\nxr = x * self.time_mix_r + xx * (1 - self.time_mix_r)\n\n# Use xk, xv, xr to produce k, v, r\nk = self.key(xk).transpose(-1, -2)\nv = self.value(xv).transpose(-1, -2)\nr = self.receptance(xr)\nk = torch.clamp(k, max=60) # clamp k to avoid overflow\nk = torch.exp(k)\nkv = k * v\n\n# Compute the W-curve = [e^(-n * e^time_decay), e^(-(n-1) * e^time_decay), ..., 1, e^(time_first)]\nself.time_w = torch.cat([torch.exp(self.time_decay) * self.time_curve.to(x.device), self.time_first], dim=-1)\nw = torch.exp(self.time_w)\n\n# Use W to mix kv and k respectively. Add K_EPS to wk to avoid divide-by-zero\nif RUN_DEVICE == 'cuda':\n    wkv = TimeX.apply(w, kv, B,C,T, 0)\n    wk = TimeX.apply(w, k, B,C,T, K_EPS)\nelse:\n    w = w[:,-T:].unsqueeze(1)\n    wkv = F.conv1d(nn.ZeroPad2d((T-1, 0, 0, 0))(kv), w, groups=C)\n    wk = F.conv1d(nn.ZeroPad2d((T-1, 0, 0, 0))(k), w, groups=C) + K_EPS\n\n# The RWKV formula\nrwkv = torch.sigmoid(r) * (wkv \u002F wk).transpose(-1, -2)\nrwkv = self.output(rwkv) # final output projection\n```\n\nThe self.key, self.receptance, self.output matrices are all initialized to zero.\n\nThe time_mix, time_decay, time_first vectors are transferred from a smaller trained model (note: I sort & smooth them too).\n\n### The GPT mode - FFN block\n\nThe FFN block has three tricks comparing with the usual GPT:\n\n1. My time_mix trick.\n\n2. 
The sqReLU from the Primer paper.\n\n3. An extra receptance-gate (similar to the receptance-gate in ATT block).\n```python\n# Mix x with the previous timestep to produce xk, xr\nxx = self.time_shift(x)\nxk = x * self.time_mix_k + xx * (1 - self.time_mix_k)\nxr = x * self.time_mix_r + xx * (1 - self.time_mix_r)\n\n# The usual FFN operation\nk = self.key(xk)\nk = torch.square(torch.relu(k)) # from the Primer paper\nkv = self.value(k)\n\n# Apply an extra receptance-gate to kv\nrkv = torch.sigmoid(self.receptance(xr)) * kv\nreturn rkv\n```\nThe self.value, self.receptance matrices are all initialized to zero.\n\n## RWKV-4 improvements\n\n![RWKV-v3-plan](RWKV-v3-plan.png)\n\n## From GPT to RWKV (the formulas)\n\nLet F[t] be the system state at t.\n\nLet x[t] be the new external input at t.\n\nIn GPT, predicting F[t+1] requires considering F[0], F[1], .. F[t]. So it takes O(T^2) to generate a length T sequence.\n\nThe **simplified formula** for GPT:\n\n![F[\\mathrm{t}+1]=\\frac{\\sum_{\\mathrm{i}=0}^{\\mathrm{t}} \\exp (\\mathbf{Q}x[\\mathrm{t}] * \\mathbf{K}F[\\mathrm{i}]) \\cdot(\\mathbf{V}F[\\mathrm{i}])}{\\sum_{\\mathrm{i}=0}^{\\mathrm{t}} \\exp (\\mathbf{Q}x[\\mathrm{t}] * \\mathbf{K}F[\\mathrm{i}])}](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+F%5B%5Cmathrm%7Bt%7D%2B1%5D%3D%5Cfrac%7B%5Csum_%7B%5Cmathrm%7Bi%7D%3D0%7D%5E%7B%5Cmathrm%7Bt%7D%7D+%5Cexp+%28%5Cmathbf%7BQ%7Dx%5B%5Cmathrm%7Bt%7D%5D+%2A+%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bi%7D%5D%29+%5Ccdot%28%5Cmathbf%7BV%7DF%5B%5Cmathrm%7Bi%7D%5D%29%7D%7B%5Csum_%7B%5Cmathrm%7Bi%7D%3D0%7D%5E%7B%5Cmathrm%7Bt%7D%7D+%5Cexp+%28%5Cmathbf%7BQ%7Dx%5B%5Cmathrm%7Bt%7D%5D+%2A+%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bi%7D%5D%29%7D)\n\nIt's very capable in theory, however that **does not mean we can fully utilize its capability with usual optimizers**. 
I suspect the loss landscape is too difficult for our current methods.\n\nCompare with the **simplified formula** for RWKV (the parallel mode, looks similar to Apple's AFT):\n\n![F[\\mathrm{t}+1]=\\sigma(\\mathbf{R}x[\\mathrm{t}]) \\cdot \\frac{\\sum_{\\mathrm{i}=0}^{\\mathrm{t}} \\exp (\\mathbf{W} \\cdot(\\mathrm{t}-\\mathrm{i})) \\cdot \\exp (\\mathbf{K}F[\\mathrm{i}]) \\cdot(\\mathbf{V}F[\\mathrm{i}])}{\\sum_{\\mathrm{i}=0}^{\\mathrm{t}} \\exp (\\mathbf{W} \\cdot(\\mathrm{t}-\\mathrm{i})) \\cdot \\exp (\\mathbf{K }F[\\mathrm{i}])}](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+F%5B%5Cmathrm%7Bt%7D%2B1%5D%3D%5Csigma%28%5Cmathbf%7BR%7Dx%5B%5Cmathrm%7Bt%7D%5D%29+%5Ccdot+%5Cfrac%7B%5Csum_%7B%5Cmathrm%7Bi%7D%3D0%7D%5E%7B%5Cmathrm%7Bt%7D%7D+%5Cexp+%28%5Cmathbf%7BW%7D+%5Ccdot%28%5Cmathrm%7Bt%7D-%5Cmathrm%7Bi%7D%29%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bi%7D%5D%29+%5Ccdot%28%5Cmathbf%7BV%7DF%5B%5Cmathrm%7Bi%7D%5D%29%7D%7B%5Csum_%7B%5Cmathrm%7Bi%7D%3D0%7D%5E%7B%5Cmathrm%7Bt%7D%7D+%5Cexp+%28%5Cmathbf%7BW%7D+%5Ccdot%28%5Cmathrm%7Bt%7D-%5Cmathrm%7Bi%7D%29%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B%5Cmathrm%7Bi%7D%5D%29%7D)\n\nThe R, K, V are trainable matrices, and W is a trainable vector (time-decay factor for each channel).\n\nIn GPT, the contribution of F[i] to F[t+1] is weighted by ![ \\exp (\\mathbf{Q}x[\\mathrm{t}] * \\mathbf{K}F[\\mathrm{i}]) ](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle++%5Cexp+%28%5Cmathbf%7BQ%7Dx%5B%5Cmathrm%7Bt%7D%5D+%2A+%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bi%7D%5D%29+).\n\nIn RWKV-2, the contribution of F[i] to F[t+1] is weighted by ![\\sigma(\\mathbf{R}x[\\mathrm{t}]) \\cdot \\exp (\\mathbf{W} \\cdot(\\mathrm{t}-\\mathrm{i})) \\cdot \\exp (\\mathbf{K}F[\\mathrm{i}]) 
](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+%5Csigma%28%5Cmathbf%7BR%7Dx%5B%5Cmathrm%7Bt%7D%5D%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BW%7D+%5Ccdot%28%5Cmathrm%7Bt%7D-%5Cmathrm%7Bi%7D%29%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bi%7D%5D%29+).\n* The ![\\sigma](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+%5Csigma) is a non-linearity and we can use sigmoid. \n* Note ![\\sigma(\\mathbf{R}x[\\mathrm{t}])](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+%5Csigma%28%5Cmathbf%7BR%7Dx%5B%5Cmathrm%7Bt%7D%5D%29) is not in the denominator, and I call R the \"receptance\".\n* The ![\\exp (\\mathbf{W} \\cdot(\\mathrm{t}-\\mathrm{i}))](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+%5Cexp+%28%5Cmathbf%7BW%7D+%5Ccdot%28%5Cmathrm%7Bt%7D-%5Cmathrm%7Bi%7D%29%29) is the time-decay factor. I proposed the same idea (scaling the attention by distance) in Aug 2020 and called it the \"time-weighting\" (check the commit history of https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FminGPT-tuned).\n\nHere comes the punchline: we can rewrite it into a RNN (recursive formula). 
Note:\n\n![F[1]=\\sigma(\\mathbf{R }x[0]) \\cdot \\frac{ \\exp (\\mathbf{K }F[0]) \\cdot(\\mathbf{V }F[0])}{\\exp (\\mathbf{K }F[0])}](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+F%5B1%5D%3D%5Csigma%28%5Cmathbf%7BR+%7Dx%5B0%5D%29+%5Ccdot+%5Cfrac%7B+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B0%5D%29+%5Ccdot%28%5Cmathbf%7BV+%7DF%5B0%5D%29%7D%7B%5Cexp+%28%5Cmathbf%7BK+%7DF%5B0%5D%29%7D)\n\n![F[2]=\\sigma(\\mathbf{R }x[1]) \\cdot \\frac{ \\exp (\\mathbf{K }F[1]) \\cdot(\\mathbf{V }F[1])+\\exp (\\mathbf{W} ) \\cdot \\exp (\\mathbf{K }F[0]) \\cdot(\\mathbf{V }F[0])}{ \\exp (\\mathbf{K }F[1])+\\exp (\\mathbf{W} ) \\cdot \\exp (\\mathbf{K }F[0])}](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+F%5B2%5D%3D%5Csigma%28%5Cmathbf%7BR+%7Dx%5B1%5D%29+%5Ccdot+%5Cfrac%7B+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B1%5D%29+%5Ccdot%28%5Cmathbf%7BV+%7DF%5B1%5D%29%2B%5Cexp+%28%5Cmathbf%7BW%7D+%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B0%5D%29+%5Ccdot%28%5Cmathbf%7BV+%7DF%5B0%5D%29%7D%7B+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B1%5D%29%2B%5Cexp+%28%5Cmathbf%7BW%7D+%29+%5Ccdot+%5Cexp+%28%5Cmathbf%7BK+%7DF%5B0%5D%29%7D)\n\nTherefore it's straightforward to verify:\n\n![F[t+1]=\\sigma(\\mathbf{R }x[t]) \\cdot \\frac{\\exp (\\mathbf{K}F[\\mathrm{t}]) \\cdot(\\mathbf{V}F[\\mathrm{t}])+\\exp (\\mathbf{W}) \\cdot A[\\mathrm{t}]}{ \\exp (\\mathbf{K}F[\\mathrm{t}])+\\exp (\\mathbf{W}) \\cdot B[\\mathrm{t}]}](https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Ccolor%7Bblack%7D%5Cdisplaystyle+F%5Bt%2B1%5D%3D%5Csigma%28%5Cmathbf%7BR+%7Dx%5Bt%5D%29+%5Ccdot+%5Cfrac%7B%5Cexp+%28%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bt%7D%5D%29+%5Ccdot%28%5Cmathbf%7BV%7DF%5B%5Cmathrm%7Bt%7D%5D%29%2B%5Cexp+%28%5Cmathbf%7BW%7D%29+%5Ccdot+A%5B%5Cmathrm%7Bt%7D%5D%7D%7B+%5Cexp+%28%5Cmathbf%7BK%7DF%5B%5Cmathrm%7Bt%7D%5D%29%2B%5Cexp+%28%5Cmathbf%7BW%7D%29+%5Ccdot+B%5B%5Cmathrm%7Bt%7D%5D%7D)\n\nwhere A[t] and B[t] are 
the numerator and denominator of the previous step, respectively.\n\nI believe RWKV is performant because W is like repeatedly applying a diagonal matrix. Note (P^{-1} D P)^n = P^{-1} D^n P, so it is similar to repeatedly applying a general diagonalizable matrix.\n\nMoreover it's possible to turn it into a continuous ODE (a bit similar to State Space Models). I will write about it later.\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=BlinkDL\u002FRWKV-LM&type=Date)](https:\u002F\u002Fstar-history.com\u002F#BlinkDL\u002FRWKV-LM&Date)\n\n## Multimodal ideas\n\nI have an idea for [text --> 32x32 RGB image] using a LM (transformer, RWKV, etc.). Will test it soon.\n\nFirstly, LM loss (instead of L2 loss), so the image will not be blurry.\n\nSecondly, color quantization. For example, only allowing 8 levels for R\u002FG\u002FB. Then the image vocab size is 8x8x8 = 512 (for each pixel), instead of 2^24.\nTherefore, a 32x32 RGB image = a len1024 sequence of vocab512 (image tokens), which is a typical input for usual LMs.\n(Later we can use diffusion models to upsample and generate RGB888 images. We might be able to use a LM for this too.)\n\nThirdly, 2D positional embeddings that are easy for the model to understand.\nFor example, add one-hot X & Y coords to the first 64(=32+32) channels. Say if the pixel is at x=8, y=20, then we will add 1 to channel 8 and channel 52 (=32+20).\nMoreover probably we can add the float X & Y coords (normalized to 0~1 range) to another 2 channels. And other periodic pos. encoding might help too (will test). \n\nFinally, RandRound when doing the color quantization in the DataLoader.\nFor example, if the float level is 4.578, then there is a 57.8% chance to use 5, and (1-57.8%) chance to use 4.\nAnd we can allow both 4 and 5 in the prediction, but the loss will be higher if the prediction is 4.\n\nMulti-task training might help too. 
I will try this dataset format:\n[TxtFirst] [Desc of Img (txt tokens)] [Img] [img tokens]\nand sometimes\n[ImgFirst] [img tokens] [Txt] [Desc of Img (txt tokens)]\n... the order of the imgs should be randomized in the DataLoader, and [TxtFirst] [ImgFirst] [Img] [Txt] are special tokens\nand do random sampling of the full dataset. So sometimes the model will see the img tokens first and then the corresponding txt tokens, which is a [img -> txt] task. And the model will see some partial imgs and partial txts. I think a char-level LM might help the model to write correct text on images.\n\n## How to sample a large dataset (for training)\n\nI am using a trick to sample the Pile deterministically yet randomly enough.\n\nLet's say the Pile has x chunks (a chunk = ctx_len tokens).\n\nPick a prime number p just less than x, and make sure p = 2 (mod 3). With this choice, cubing is a permutation of the residues mod p, so every chunk index below p is visited exactly once in every p consecutive steps.\n\nUse (step * step * step) mod p to sample it. Add some bias to step for extra randomness.\n\n## The top-p-x sampling method (for inference)\n\nWe propose a new sampling method called top-p-x:\n\nit's like top-p, and the only difference is you also keep all tokens whose prob > x.\n\nTry x = 0.01 first.\n\n## Better Learning Rate Schedule via Variational Method of Loss Curve\n\nI propose a simple new method to find better LR schedules. The method is cost-efficient and practical for large LMs. The takeaway is we can model the loss curve dynamics (phenomenology) w.r.t. the LR, and a nice closed-form LR curve can be directly computed from it using a variational method. 
Moreover we can predict the final loss with reasonable accuracy.\n\nUPDATE: In \"Conclusion 1.\", use the best-fitting regime (ignore the initial steps where our approximations break down) to fit the parameters.\n\nTry this: fixed lr for 1 hr, then exponential decay to 0.2 * lr in 12 hrs, and choose the t=[1hr, 13hr] segment.\n\nIn the last three plots, black = predicted loss curve of the new LR schedule, blue = original (unoptimized) real loss curve, orange = new LR schedule.\n\n![better_lr_schedule](Research\u002Fbetter_lr_schedule.png)\n\n# RWKV v1\n\nWe propose the RWKV language model, with alternating time-mix and channel-mix layers:\n\n\u003Cimg src=\n\"https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Cdisplaystyle+%5Cbegin%7Balign%2A%7D%0A%5Ctext%7BTime-mix+%3A%7D+%26%26+%5Ctext%7BTM%7D_%7Bt%2Cc%7D+%26%26%3D%26%26%5Ctext%7Bsigmoid%7D%28%5Ctext%7BR%7D_%7Bt%2Cc%7D%29+%26%26%5Ccdot%26%26+%26%26%5Ctextstyle%5Csum_%7Bu%7D+%26%26%5Ctextbf%7BW%7D_%7Bt%2Cu%2Cc%7D+%26%26%5Ccdot%26%26+%5Ctext%7Bsoftmax%7D_t%28%5Ctext%7BK%7D_%7Bu%2Cc%7D%29+%26%26%5Ccdot%26%26+%5Ctext%7BV%7D_%7Bu%2Cc%7D%5C%5C%0A%5Ctext%7BChannel-mix+%3A%7D+%26%26+%5Ctext%7BCM%7D_%7Bt%2Cc%7D+%26%26%3D%26%26%5Ctext%7Bsigmoid%7D%28%5Ctext%7BR%7D_%7Bt%2Cc%7D%29+%26%26%5Ccdot%26%26+%26%26%5Ctextstyle%5Csum_d+%26%26%5Ctextbf%7BW%7D_%7Bc%2Cd%7D+%26%26%5Ccdot%26%26+%5Ctext%7Bgelu%7D%28%5Ctext%7BK%7D_%7Bt%2Cd%7D%29+%26%26%5Ccdot%26%26+%5Ctext%7BV%7D_%7Bt%2Cd%7D%0A%5Cend%7Balign%2A%7D%0A\" \nalt=\"\\begin{align*}\n\\text{Time-mix :} && \\text{TM}_{t,c} &&=&&\\text{sigmoid}(\\text{R}_{t,c}) &&\\cdot&& &&\\textstyle\\sum_{u} &&\\textbf{W}_{t,u,c} &&\\cdot&& \\text{softmax}_t(\\text{K}_{u,c}) &&\\cdot&& \\text{V}_{u,c}\\\\\n\\text{Channel-mix :} && \\text{CM}_{t,c} &&=&&\\text{sigmoid}(\\text{R}_{t,c}) &&\\cdot&& &&\\textstyle\\sum_d &&\\textbf{W}_{c,d} &&\\cdot&& \\text{gelu}(\\text{K}_{t,d}) &&\\cdot&& \\text{V}_{t,d}\n\\end{align*}\n\">\n\n* The R, K, V are generated by linear 
transforms of input, and W is parameter. The idea of RWKV is to decompose attention into R(target) * W(src, target) * K(src). So we can call R \"receptance\", and sigmoid means it's in 0~1 range.\n\n* The Time-mix is similar to AFT (https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14103). There are two differences.\n\n(1) We changed the normalization (denominator). For masked language models, we define:\n\n\u003Cimg src=\n\"https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Cdisplaystyle+%5Ctext%7Bsoftmax%7D_t%28%5Ctext%7BK%7D_%7Bu%2Cc%7D%29+%3D+%5Cfrac%7B%5Cexp%28%5Ctext%7BK%7D_%7Bu%2Cc%7D%29%7D%7B%5Csum_%7Bv+%5Cleq+t%7D%5Cexp%28%5Ctext%7BK%7D_%7Bv%2Cc%7D%29%7D\" \nalt=\"\\text{softmax}_t(\\text{K}_{u,c}) = \\frac{\\exp(\\text{K}_{u,c})}{\\sum_{v \\leq t}\\exp(\\text{K}_{v,c})}\">\n\n**(UPDATE: We are using the original AFT normalization in v2)**\n \nInitialize K and R matrices (and the output projection matrix) to ZERO for fast & stable convergence.\n \n(2) We decompose W_{t,u,c} and introduce multi-head W (here h is the corresponding head of c):\n\n\u003Cimg src=\n\"https:\u002F\u002Frender.githubusercontent.com\u002Frender\u002Fmath?math=%5Cdisplaystyle+W_%7Bt%2Cu%2Cc%7D%3Df_h%28t-u%29%5Ccdot+%5Calpha_h%28u%29+%5Ccdot+%5Cbeta_h%28t%29\" \nalt=\"W_{t,u,c}=f_h(t-u)\\cdot \\alpha_h(u) \\cdot \\beta_h(t)\">\n\nMoreover we multiply the final output of Time-mix layer by γ(t). The reason for the α β γ factors, is because the context size is smaller when t is small, and this can be compensated using the α β γ factors.\n\n**(UPDATE: We remove α β γ factors in v2-RNN and restrict W to be of a simple form and hence able to rewrite it as RNN)**\n\n* The Channel-mix is similar to GeGLU (https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05202) with an extra R factor. 
Initialize R and W matrices to ZERO for fast & stable convergence.\n\n* Finally, we add extra token-shift (time-shift mixing) as in (https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FminGPT-tuned).\n\n# Token-shift (time-shift mixing)\n\nThe token-shift explicitly uses (half the channels of this token) & (half the channels of prev token) to generate all vectors (QKV, RWKV, ...).\n\n```\nself.time_shift = nn.ZeroPad2d((0,0,1,-1))\n\nx = torch.cat([self.time_shift(x[:, :, :C\u002F\u002F2]), x[:, :, C\u002F\u002F2:]], dim = -1)\n```\n\nDividing channels by 2 and shift-1 works great for char-level English and char-level Chinese LMs.\n\nHowever for a BPE-level English LM, it's only effective if your embedding is large enough (at least 1024 - so the usual small L12-D768 model is not enough).\n\nMy theory on the effectiveness of token-shift:\n\nWhen we train a GPT, the hidden representation of a token has to accomplish two different objectives:\n\n1. Predict the next token. Sometimes this is easy (obvious next token).\n\n2. Collect all previous context info, so later tokens can use it. This is always hard.\n\nThe shifted channels can focus on (2), so we have good propagation of info. It's like some kind of residual connection, or a small RNN inside the transformer.\n\nYou can use token-shift in usual QKV self-attention too. I looked at the weights, and found V really likes the shifted channels, less so for Q. Makes sense if you think about it. I also found you may want to use less mixing in higher layers.\n\np.s. There is a MHA_pro model in this repo with strong performance. Give it a try :)\n\n# The Head-QK Trick: learning to copy and avoid tokens\n\nIn a usual transformer, a small model has difficulty copying tokens (such as person names) in the context. We add extra Q & K to the final output such that the model can directly copy (or avoid) tokens in the context. 
Afterwards the model will teach itself NER (named entity recognition), as you can see by looking at the learned weights.\n```\nq = self.head_q(x)[:,:T,:] # projecting to 256-d\nk = self.head_k(x)[:,:T,:] # projecting to 256-d\nc = (q @ k.transpose(-2, -1)) * (1.0 \u002F 256)\nc = c.masked_fill(self.copy_mask[:T,:T] == 0, 0)\nc = c @ F.one_hot(idx, num_classes = self.config.vocab_size).float()       \nx = self.head(x) + c\n```\nNote: when a token occurs multiple times in the context, it might be better to use max(prob) instead of sum(prob).\n\n# The top-a sampling method\n\nWe also propose a new sampling method called top-a (as in src\u002Futils.py):\n\n(1) Find the max probability p_max after softmax.\n\n(2) Remove all entries whose probability is lower than 0.2 * pow(p_max, 2). So it's adaptive, hence \"top-a\".\n\n(3) Feel free to tune the 0.2 and 2 factors. Tune 0.2 first.\n\nThe idea of top-a:\n1. If max_prob=0.9, then remove all tokens with prob \u003C 0.162 (so, removing all alternatives)\n2. If max_prob=0.5, then remove all tokens with prob \u003C 0.05  (so, allowing more choices)\n3. If max_prob=0.1, then remove all tokens with prob \u003C 0.002 (so, allowing lots of possibilities)\n\n```\nprobs = F.softmax(logits, dim=-1)\n\nlimit = torch.pow(torch.max(probs), 2) * 0.2\nlogits[probs \u003C limit] = -float('Inf')\n```\n\n# Performance\n\nCharacter-level loss on simplebooks-92 dataset https:\u002F\u002Fdldata-public.s3.us-east-2.amazonaws.com\u002Fsimplebooks.zip\n\n![RWKV-vs-MHA](RWKV-vs-MHA.png)\n\nGray: usual MHA+Rotary+GeGLU - performance not as good. 17.2M params.\n\nRed: RWKV (\"linear\" attention) - VRAM friendly - considerably faster when ctx window is long - good performance. 16.6M params.\n\nGreen: MHA+Rotary+GeGLU+Token_shift. 17.2M params.\n\nBlue: MHA_pro (MHA with various tweaks & RWKV-type-FFN) - slow - needs more VRAM - good performance. 
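16.6M params.\n\n```\n@software{peng_bo_2021_5196578,\n  author       = {PENG Bo},\n  title        = {BlinkDL\u002FRWKV-LM: 0.01},\n  month        = aug,\n  year         = 2021,\n  publisher    = {Zenodo},\n  version      = {0.01},\n  doi          = {10.5281\u002Fzenodo.5196577},\n  url          = {https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.5196577}\n}\n```\n\n# Initialization\n\nWe use careful initialization for RWKV to get fast convergence - orthogonal matrices with proper scaling, and special time_w curves. Check model.py for details.\n\nSome learned time_w examples:\n\n![RWKV-time-w](RWKV-time-w.png)\n","RWKV is a language model that combines the strengths of RNNs and Transformers, achieving performance comparable to large language models while being directly trainable in parallel like a GPT. Its core features include linear time complexity, constant space usage (no kv-cache), fast training, and unlimited context length, which keep RWKV efficient while scaling well. The model is especially suited to applications that need high-performance language processing, such as large-scale text generation and multimodal applications. Developed in Python, RWKV implements these features on the PyTorch framework and is released as open source under the Apache License 2.0, encouraging community contribution and use.","2026-06-11 03:24:11","top_topic"]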