[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72032":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},72032,"LLM4Decompile","albertan017\u002FLLM4Decompile","albertan017","Reverse Engineering: Decompiling Binary Code with Large Language Models","https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.203",null,"Python",6702,532,74,43,0,5,20,64,15,90.58,"MIT License",false,"main",true,[27,28,29,30],"binary","decompile","large-language-models","reverse-engineering","2026-06-12 04:01:03","\n\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Fsamples\u002Flogo-dark.png\">\n    \u003Cimg alt=\"LLM4Decompile\" src=\"https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Fsamples\u002Flogo-light.png\" width=55%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Cp align=\"left\">\n    📊&nbsp;\u003Ca href=\"#evaluation\">Results\u003C\u002Fa>\n    | 🤗&nbsp;\u003Ca href=\"#models\">Models\u003C\u002Fa>\n    | 🚀&nbsp;\u003Ca href=\"#quick-start\">Quick Start\u003C\u002Fa>\n    | 📚&nbsp;\u003Ca href=\"#humaneval-decompile\">HumanEval-Decompile\u003C\u002Fa>\n    | 📎&nbsp;\u003Ca href=\"#citation\">Citation\u003C\u002Fa>\n    | 📝&nbsp;\u003Ca 
href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05286\">Paper\u003C\u002Fa>\n    | 🖥️&nbsp;\u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1X5TuUKuNuksGJZz6Cc83KKI0ATBP9q7r?usp=sharing\">Colab\u003C\u002Fa>\n    | ▶️&nbsp;\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x7knF3Z1yLk\">YouTube\u003C\u002Fa>\n\u003C\u002Fp>\n\nReverse Engineering: Decompiling Binary Code with Large Language Models\n\n[![GitHub Trend](https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F8664)](https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F8664)\n\n## Updates\n* [2025-10-04]: Release SK²Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin. Phase 1, Structure Recovery (Skeleton): transforms binary\u002Fpseudo-code into obfuscated intermediate representations 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fsk2decompile-struct-6.7b). Phase 2, Identifier Naming (Skin): generates human-readable source code with meaningful identifiers 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fsk2decompile-ident-6.7).\n* [2025-05-20]: Release [decompile-bench](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FLLM4Binary\u002Fdecompile-bench-68259091c8d49d0ebd5efda9), which contains two million binary-source function pairs for training and 70K function pairs for evaluation. Please refer to the [decompile-bench](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Ftree\u002Fmain\u002Fdecompile-bench) folder for details.\n* [2024-10-17]: Release [decompile-ghidra-100k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLLM4Binary\u002Fdecompile-ghidra-100k), a subset of 100k training samples (25k per optimization level). We provide a [training script](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Ftrain\u002FREADME.md) that runs in ~3.5 hours on a single A100 40G GPU. 
The resulting model achieves a re-executability rate of 0.26, so LLM4Decompile can be replicated quickly for a total cost of under $20.\n* [2024-09-26]: Update the [Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1X5TuUKuNuksGJZz6Cc83KKI0ATBP9q7r?usp=sharing) to demonstrate usage of the LLM4Decompile models, including examples for the LLM4Decompile-End and LLM4Decompile-Ref models.\n* [2024-09-23]: Release [LLM4Decompile-9B-v2](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-9b-v2), fine-tuned from [Yi-Coder-9B](https:\u002F\u002Fhuggingface.co\u002F01-ai\u002FYi-Coder-9B), which achieves a re-executability rate of **0.6494** on the Decompile benchmark.\n* [2024-06-19]: Release [V2](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-6.7b-v2) series (LLM4Decompile-Ref). The V2 models (1.3B-22B) build upon **Ghidra** and are trained on 2 billion tokens to **refine** the pseudo-code decompiled by Ghidra. The 22B-V2 version outperforms the 6.7B-V1.5 by an additional 40.1%. Please check the [ghidra folder](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Ftree\u002Fmain\u002Fghidra) for details.\n* [2024-05-13]: Release [V1.5](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-6.7b-v1.5) series (LLM4Decompile-End, which decompiles the binary directly using the LLM). The V1.5 models are trained on a larger dataset (15B tokens) with a maximum token **length of 4,096**, and deliver over **100% improvement** in performance compared to the previous model.\n* [2024-03-16]: Add the [llm4decompile-6.7b-uo](https:\u002F\u002Fhuggingface.co\u002Farise-sustech\u002Fllm4decompile-6.7b-uo) model, which is trained without prior knowledge of the optimization level (O0~O3). Its average re-executability is around 0.219, the best among our models at the time of this release.\n\n## About\n* **LLM4Decompile** is the pioneering open-source large language model dedicated to decompilation. 
The current version decompiles Linux x86_64 binaries, compiled with GCC at optimization levels O0 through O3, into human-readable C source code. Our team is committed to expanding this tool's capabilities, with ongoing efforts to incorporate a broader range of architectures and configurations.\n* **LLM4Decompile-End** focuses on decompiling the binary directly. **LLM4Decompile-Ref** refines the pseudo-code decompiled by Ghidra.\n\n## Evaluation\n\n### Framework\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Fsamples\u002Fcompile-decompile.png\" alt=\"image\" width=\"400\" height=\"auto\">\n\u003C\u002Fp>\n\nDuring compilation, the Preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the Compiler, which converts it into assembly code (ASM). This ASM is transformed into binary code (0s and 1s) by the Assembler. The Linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, involves converting binary code back into a source file. LLMs, being trained on text, lack the ability to process binary data directly. Therefore, binaries must first be disassembled by ```Objdump``` into assembly language (ASM). Because binary code and disassembled ASM are equivalent and can be converted into each other, we refer to them interchangeably. Finally, the loss is computed between the decompiled code and source code to guide the training. 
To assess the quality of the decompiled code (SRC'), it is tested for its functionality through test assertions (re-executability).\n\n### Metrics\n* **Re-executability** evaluates whether the decompiled code can execute properly and pass all the predefined test cases.\n\n### Benchmarks\n* **HumanEval-Decompile** A collection of 164 C functions that exclusively rely on **standard** C libraries.\n* **ExeBench** A collection of 2,621 functions drawn from **real** projects, each utilizing user-defined functions, structures, and macros.\n\n\n### Results\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Fsamples\u002Fresults_end_final.png\" alt=\"results\" width=\"800\" height=\"auto\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Fblob\u002Fmain\u002Fsamples\u002Fresults_refine_final.png\" alt=\"image\" width=\"800\" height=\"auto\">\n\u003C\u002Fp>\n\n## Models\nOur LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face.\n\n| Model                 | Checkpoint                                                        | Size | Re-executability       | Note |\n|-----------------------|-------------------------------------------------------------------|------|---------------------|----------------------|\n| **llm4decompile-1.3b-v1.5**| 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-1.3b-v1.5)   | 1.3B | **27.3%**   | Note 3 |\n| **llm4decompile-6.7b-v1.5**| 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-6.7b-v1.5)   | 6.7B | **45.4%**   | Note 3 |\n| **llm4decompile-1.3b-v2**| 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-1.3b-v2)   | 1.3B | **46.0%**   | Note 4 |\n| **llm4decompile-6.7b-v2**| 🤗 [HF 
Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-6.7b-v2)   | 6.7B | **52.7%**   | Note 4 |\n| **llm4decompile-9b-v2**| 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-9b-v2)   | 9B | **64.9%**  | Note 4 |\n| **llm4decompile-22b-v2**| 🤗 [HF Link](https:\u002F\u002Fhuggingface.co\u002FLLM4Binary\u002Fllm4decompile-22b-v2)   | 22B | **63.6%**   | Note 4 |\n\nNote 3: V1.5 series are trained with a larger dataset (15B tokens) and a maximum token size of 4,096, with remarkable performance (over 100% improvement) compared to the previous model.\n\nNote 4: V2 series are built upon **Ghidra** and trained on 2 billion tokens to **refine** the decompiled pseudo-code from Ghidra. Check [ghidra folder](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Ftree\u002Fmain\u002Fghidra) for details.\n\n## Quick Start\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1X5TuUKuNuksGJZz6Cc83KKI0ATBP9q7r?usp=sharing)\n\n**Setup:** Please use the script below to install the necessary environment.\n```\ngit clone https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile.git\ncd LLM4Decompile\nconda create -n 'llm4decompile' python=3.9 -y\nconda activate llm4decompile\npip install -r requirements.txt\n```\n\nHere is an example of how to use our model (Revised for V1.5. 
For previous models, please check the corresponding model page at HF).\nNote: **Replace the \"func0\" with the function name you want to decompile**.\n\n**Preprocessing:** Compile the C code into binary, and disassemble the binary into assembly instructions.\n```python\nimport subprocess\nimport os\nfunc_name = 'func0'\nOPT = [\"O0\", \"O1\", \"O2\", \"O3\"]\nfileName = 'samples\u002Fsample' #'path\u002Fto\u002Ffile'\nfor opt_state in OPT:\n    output_file = fileName +'_' + opt_state\n    input_file = fileName+'.c'\n    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'#compile the code with GCC on Linux\n    subprocess.run(compile_command, shell=True, check=True)\n    compile_command = f'objdump -d {output_file}.o > {output_file}.s'#disassemble the binary file into assembly instructions\n    subprocess.run(compile_command, shell=True, check=True)\n    \n    input_asm = ''\n    with open(output_file+'.s') as f:#asm file\n        asm= f.read()\n        if '\u003C'+func_name+'>:' not in asm: #IMPORTANT replace func0 with the function name\n            raise ValueError(\"compile fails\")\n        asm = '\u003C'+func_name+'>:' + asm.split('\u003C'+func_name+'>:')[-1].split('\\n\\n')[0] #IMPORTANT replace func0 with the function name\n        asm_clean = \"\"\n        asm_sp = asm.split(\"\\n\")\n        for tmp in asm_sp:\n            if len(tmp.split(\"\\t\"))\u003C3 and '00' in tmp:\n                continue\n            idx = min(\n                len(tmp.split(\"\\t\")) - 1, 2\n            )\n            tmp_asm = \"\\t\".join(tmp.split(\"\\t\")[idx:])  # remove the binary code\n            tmp_asm = tmp_asm.split(\"#\")[0].strip()  # remove the comments\n            asm_clean += tmp_asm + \"\\n\"\n    input_asm = asm_clean.strip()\n    before = f\"# This is the assembly code:\\n\"#prompt\n    after = \"\\n# What is the source code?\\n\"#prompt\n    input_asm_prompt = before+input_asm.strip()+after\n    with open(fileName +'_' + opt_state 
+'.asm','w',encoding='utf-8') as f:\n        f.write(input_asm_prompt)\n```\n\nAssembly instructions should be in the format:\n\n\u003CFUNCTION_NAME>:\\nOPERATIONS\\nOPERATIONS\\n\n\nTypical assembly instructions may look like this:\n```\n\u003Cfunc0>:\nendbr64\nlea    (%rdi,%rsi,1),%eax\nretq\n```\n\n\n**Decompilation:** Use LLM4Decompile to translate the assembly instructions into C:\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport torch\n\nmodel_path = 'LLM4Binary\u002Fllm4decompile-6.7b-v1.5' # V1.5 Model\ntokenizer = AutoTokenizer.from_pretrained(model_path)\nmodel = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()\n\nwith open(fileName +'_' + OPT[0] +'.asm','r') as f:#optimization level O0\n    asm_func = f.read()\ninputs = tokenizer(asm_func, return_tensors=\"pt\").to(model.device)\nwith torch.no_grad():\n    outputs = model.generate(**inputs, max_new_tokens=2048)### max length to 4096, max new tokens should be below the range\nc_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])\n\nwith open(fileName +'.c','r') as f:#original file\n    func = f.read()\n\nprint(f'original function:\\n{func}')# Note we only decompile one function, where the original file may contain multiple functions\nprint(f'decompiled function:\\n{c_func_decompile}')\n```\n\n### Docker setup\n\n```\n# build docker\ndocker build -t llm4decompile .\n\n# run docker with GPU\ndocker run --gpus all -it --name llm4decompile llm4decompile \u002Fbin\u002Fbash\n\n# run demo.py (choose a model suitable for your resources before running)\ncd ghidra\npython demo.py\n```\n\n## HumanEval-Decompile\nData are stored in ``llm4decompile\u002Fdecompile-eval\u002Fdecompile-eval-executable-gcc-obj.json``, using JSON list format. 
There are 164*4 (O0, O1, O2, O3) samples, each with five keys:\n\n*   ``task_id``: the ID of the problem.\n*   ``type``: the optimization level, one of [O0, O1, O2, O3].\n*   ``c_func``: the C solution to the HumanEval problem.\n*   ``c_test``: C test assertions.\n*   ``input_asm_prompt``: assembly instructions with prompts, which can be derived as shown in our [preprocessing example](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile?tab=readme-ov-file#quick-start).\n\nPlease check the [evaluation scripts](https:\u002F\u002Fgithub.com\u002Falbertan017\u002FLLM4Decompile\u002Ftree\u002Fmain\u002Fevaluation).\n\n## On Going\n* Larger training dataset with the cleaning process. (done:2024.05.13)\n* Support for popular languages\u002Fplatforms and settings.\n* Support for executable binaries. (done:2024.05.13)\n* Integration with decompilation tools (e.g., Ghidra, Rizin)\n\n## License\nThis code repository is licensed under the MIT and DeepSeek licenses.\n\n## Citation\n```\n@misc{tan2024llm4decompile,\n      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, \n      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},\n      year={2024},\n      eprint={2403.05286},\n      archivePrefix={arXiv},\n      primaryClass={cs.PL}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=albertan017\u002FLLM4Decompile&type=Timeline)](https:\u002F\u002Fstar-history.com\u002F#albertan017\u002FLLM4Decompile&Timeline)\n","LLM4Decompile is a tool that uses large language models to decompile binary code. Its core functionality includes a two-phase approach (from skeleton to skin) that converts binary or pseudo-code into readable source code: the first phase recovers structure, and the second generates meaningful identifiers. The project is written in Python and provides pretrained models and large-scale datasets for researchers to use in training and evaluation. The tool suits scenarios that require reverse engineering binaries to recover source code, such as software security analysis and code auditing.",2,"2026-06-11 03:40:01","high_star"]