[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72296":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},72296,"LimiX","limix-ldm-ai\u002FLimiX","limix-ldm-ai","LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03505","https:\u002F\u002Fwww.limix.ai",null,"Python",3438,300,75,10,0,26,46,55,78,29.44,"Apache License 2.0",false,"main",true,[27,28,29,30],"foundation-models","limix","machine-learning","structured-data","2026-06-12 02:03:01","\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fdoc\u002FLimiX-Logo.png\" alt=\"LimiX summary\" width=\"89%\">\n\u003C\u002Fdiv>\n\n#  :boom: News\n - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. 
The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.\n - 2025-08-29: LimiX V1.0 Released.\n\n#  ⚡ Latest Results Compared with SOTA Models\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fdoc\u002FBCCO-CLS.png\"  width=\"30%\">\n  \u003Cimg src=\".\u002Fdoc\u002FTabArena-CLS.png\"  width=\"30%\">\n  \u003Cimg src=\".\u002Fdoc\u002FTabZilla-CLS.png\" width=\"30%\">  \n\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fdoc\u002FBCCO-REG.png\"  width=\"30%\">\n  \u003Cimg src=\".\u002Fdoc\u002FTabArena-REG.png\" width=\"30%\">\n  \u003Cimg src=\".\u002Fdoc\u002FCTR23-REG.png\" width=\"30%\">\n\u003C\u002Fdiv>\n\n\n# ➤ Overview\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fdoc\u002FLimiX_Summary.png\" alt=\"LimiX summary\" width=\"89%\">\n\u003C\u002Fdiv>\nWe introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.\n\nLimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks. 
\n\nFor details, see the technical report: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03505) or [LimiX_Technical_Report.pdf](https:\u002F\u002Fgithub.com\u002Flimix-ldm\u002FLimiX\u002Fblob\u002Fmain\u002FLimiX_Technical_Report.pdf).\n\n# ➤ Superior Performance \nLimiX achieves state-of-the-art (SOTA) performance on classification, regression, and missing value imputation, as shown in the benchmark results below.\n\n## ➩ Classification \n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"doc\u002Fclassification_tabarena_lite.png\" width=\"60%\">\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"doc\u002FClassifier.png\" width=\"45%\" style=\"margin-right:2%;\">\n  \u003Cimg src=\"doc\u002FTabArena_lite_CLS.png\" width=\"42.5%\">\n\u003C\u002Fdiv>\n\n## ➩ Regression \n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"doc\u002Fregression_tabarena_lite.png\" width=\"60%\">\n\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"doc\u002FRegression.png\" width=\"45%\" style=\"margin-right:2%;\">\n  \u003Cimg src=\"doc\u002FTabArena_REG.png\" width=\"40.3%\">\n\u003C\u002Fdiv>\n\n## ➩ Missing Value Imputation \n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"doc\u002FMissingValueImputation.png\" alt=\"Missing value imputation\" width=\"60%\">\n\u003C\u002Fdiv>\n\n# ➤ Tutorials \n## ➩ Installation\n### Option 1 (recommended): Use the Dockerfile\nDownload the [Dockerfile](https:\u002F\u002Fgithub.com\u002Flimix-ldm\u002FLimiX\u002Fblob\u002Fmain\u002FDockerfile) and build the image:\n```bash\ndocker build --network=host -t limix\u002Finfe:v1 --build-arg FROM_IMAGES=nvidia\u002Fcuda:12.2.0-base-ubuntu22.04 -f Dockerfile .\n```\n\n### Option 2: Build manually\nDownload the prebuilt flash_attn wheel:\n```bash\nwget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl 
https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002Fdownload\u002Fv2.8.0.post2\u002Fflash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl\n```\nInstall the Python dependencies:\n```bash\n# Requires Python 3.12.7. Install the interpreter separately; it is not a pip package.\npip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1\npip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl\npip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt\n```\n\n### Download source code\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Flimix-ldm\u002FLimiX.git\ncd LimiX\n```\n\n# ➤ Inference\nLimiX supports tasks such as classification, regression, and missing value imputation.\n## ➩ Model download\n| Model size | Download link | Tasks supported |\n| --- | --- | --- |\n| LimiX-16M | [LimiX-16M.ckpt](https:\u002F\u002Fhuggingface.co\u002Fstableai-org\u002FLimiX-16M\u002Ftree\u002Fmain) | ✅ classification ✅ regression ✅ missing value imputation |\n| LimiX-2M | [LimiX-2M.ckpt](https:\u002F\u002Fhuggingface.co\u002Fstableai-org\u002FLimiX-2M\u002Ftree\u002Fmain) | ✅ classification ✅ regression |\n\n## ➩ Interface description\n\n### Model Creation\n```python\nclass LimiXPredictor:\n    def __init__(self,\n                 device:torch.device,\n                 model_path:str,\n                 mix_precision:bool=True,\n                 inference_config: list|str,\n                 categorical_features_indices:List[int]|None=None,\n                 outlier_remove_std: float=12,\n                 softmax_temperature:float=0.9,\n                 task_type: Literal['Classification', 'Regression']='Classification',\n                 mask_prediction:bool=False,\n                 inference_with_DDP: bool = False,\n                 seed:int=0)\n```\n| Parameter | Data Type | Description |\n|--------|----------|----------|\n| device | torch.device | The device on which the model is loaded 
|\n| model_path | str | The path to the model checkpoint to load |\n| mix_precision | bool | Whether to enable mixed-precision inference |\n| inference_config | list\u002Fstr | The configuration used for inference (a list, or the path to a configuration file) |\n| categorical_features_indices | list | The indices of the categorical columns in the tabular data |\n| outlier_remove_std | float | The threshold for outlier removal, expressed as a multiple of the standard deviation |\n| softmax_temperature | float | The temperature applied to the softmax operator |\n| task_type | str | The task type, either \"Classification\" or \"Regression\" |\n| mask_prediction | bool | Whether to enable missing value imputation |\n| inference_with_DDP | bool | Whether to enable DDP during inference |\n| seed | int | The random seed |\n\n### Predict\n```python\ndef predict(self, x_train:np.ndarray, y_train:np.ndarray, x_test:np.ndarray) -> np.ndarray:\n```\n| Parameter   | Data Type    | Description           |\n| ------- | ---------- | ----------------- |\n| x_train  | np.ndarray  | The input features of the training set   |\n| y_train  | np.ndarray  | The target variable of the training set   |\n| x_test   | np.ndarray  | The input features of the test set   |\n\n## Inference Configuration File Description\n| Configuration File Name | Description | Difference |\n| ------- | ---------- | ----- |\n| cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |\n| cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |\n| reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |\n| reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | 
Faster speed, lower memory requirements |\n| reg_default_noretrieval_MVI.json | Default inference configuration file for **missing value imputation task** |  |\n\n## ➩ Ensemble Inference Based on Sample Retrieval\n\nFor a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https:\u002F\u002Fgithub.com\u002Flimix-ldm\u002FLimiX\u002Fblob\u002Fmain\u002FLimiX_Technical_Report.pdf).\n\nConsidering inference speed and memory requirements, ensemble inference based on sample retrieval currently only supports hardware with specifications higher than the NVIDIA RTX 4090 GPU.\n\n### Classification Task\n\n```\npython inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data\n```\n\n### Regression Task\n\n```\npython inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data\n```\n\n### Customizing Data Preprocessing for Inference Tasks\n#### First, Generate the Inference Configuration File\n\n```python\ngenerate_inference_config()\n```\n\n### Classification Task\n#### Single GPU or CPU\n\n```\npython  inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data\n```\n\n#### Multi-GPU Distributed Inference\n\n```\ntorchrun --nproc_per_node=8  inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP\n```\n\n### Regression Task\n#### Single GPU or CPU\n\n```\npython  inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data\n```\n\n#### Multi-GPU Distributed Inference\n\n```\ntorchrun --nproc_per_node=8  inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP\n```\n\n### Retrieval 
Optimization Project\nThis component tunes the retrieval parameters with Optuna to get the best performance on a given dataset.\n#### Installation\nEnsure you have the required dependency installed:\n```\npip install optuna\n```\n#### Usage\nTo search for retrieval parameters on your own data, use the code below:\n```\nsearchInference = RetrievalSearchHyperparameters(\n           dict(device_id=0,model_path=model_path), X_train, y_train, X_test, y_test,\n)\nconfig, result = searchInference.search(n_trials=10, metric=\"AUC\",\n              inference_config='config\u002Fcls_default_retrieval.json',task_type=\"cls\")\n```\nThis will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.\n\n## ➩ Classification\n```python\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.metrics import accuracy_score, roc_auc_score\nfrom sklearn.model_selection import train_test_split\nfrom huggingface_hub import hf_hub_download\nimport numpy as np\nimport torch\nimport os, sys\n\n# Single-process settings for the distributed environment variables\nos.environ[\"RANK\"] = \"0\"\nos.environ[\"WORLD_SIZE\"] = \"1\"\nos.environ[\"MASTER_ADDR\"] = \"127.0.0.1\"\nos.environ[\"MASTER_PORT\"] = \"29500\"\n\n# Make the repository root importable\nROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), \"..\"))\nif ROOT_DIR not in sys.path:\n    sys.path.insert(0, ROOT_DIR)\nfrom inference.predictor import LimiXPredictor\n\nX, y = load_breast_cancer(return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)\n\n# Download the checkpoint from Hugging Face\nmodel_file = hf_hub_download(repo_id=\"stableai-org\u002FLimiX-16M\", filename=\"LimiX-16M.ckpt\", local_dir=\".\u002Fcache\")\n\nclf = LimiXPredictor(device=torch.device('cuda'), model_path=model_file, inference_config='config\u002Fcls_default_retrieval.json')\nprediction = clf.predict(X_train, y_train, X_test)\n\nprint(\"roc_auc_score:\", roc_auc_score(y_test, prediction[:, 1]))\nprint(\"accuracy_score:\", 
accuracy_score(y_test, np.argmax(prediction, axis=1)))\n```\nFor additional examples, refer to [inference_classifier.py](.\u002Finference_classifier.py).\n\n## ➩ Regression\n```python\nfrom functools import partial\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import r2_score\nfrom huggingface_hub import hf_hub_download\ntry:\n    from sklearn.metrics import root_mean_squared_error as mean_squared_error\nexcept ImportError:\n    # Older scikit-learn: fall back to mean_squared_error with squared=False\n    from sklearn.metrics import mean_squared_error\n    mean_squared_error = partial(mean_squared_error, squared=False)\nimport torch\nimport os, sys\n\n# Single-process settings for the distributed environment variables\nos.environ[\"RANK\"] = \"0\"\nos.environ[\"WORLD_SIZE\"] = \"1\"\nos.environ[\"MASTER_ADDR\"] = \"127.0.0.1\"\nos.environ[\"MASTER_PORT\"] = \"29500\"\n\n# Make the repository root importable\nROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), \"..\"))\nif ROOT_DIR not in sys.path:\n    sys.path.insert(0, ROOT_DIR)\nfrom inference.predictor import LimiXPredictor\n\nhouse_data = fetch_california_housing()\nX, y = house_data.data, house_data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n\n# Normalize the targets with the training-set statistics\ny_mean = y_train.mean()\ny_std = y_train.std()\ny_train_normalized = (y_train - y_mean) \u002F y_std\ny_test_normalized = (y_test - y_mean) \u002F y_std\n\nmodel_path = hf_hub_download(repo_id=\"stableai-org\u002FLimiX-16M\", filename=\"LimiX-16M.ckpt\", local_dir=\".\u002Fcache\")\n\nmodel = LimiXPredictor(device=torch.device('cuda'), model_path=model_path, inference_config='config\u002Freg_default_retrieval.json')\ny_pred = model.predict(X_train, y_train_normalized, X_test)    \n\n# Compute RMSE and R²\ny_pred = y_pred.to('cpu').numpy()\nrmse = mean_squared_error(y_test_normalized, y_pred)\nr2 = r2_score(y_test_normalized, y_pred)\n\nprint(f'RMSE: {rmse}')\nprint(f'R2: {r2}')\n```\nFor additional examples, refer to [inference_regression.py](.\u002Finference_regression.py).\n\n## ➩ Missing value imputation\nFor the demo 
file, see [examples\u002Fdemo_missing_value_imputation.py](examples\u002Fdemo_missing_value_imputation.py)\n\n# ➤ Link\n - LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03505)\n - LimiX Technical Report: [LimiX_Technical_Report.pdf](https:\u002F\u002Fgithub.com\u002Flimix-ldm\u002FLimiX\u002Fblob\u002Fmain\u002FLimiX_Technical_Report.pdf)\n - Detailed instructions for using LimiX: [Visit the official LimiX documentation](https:\u002F\u002Fwww.limix.ai\u002Fdoc\u002F)\n - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstableai-org\u002Fbcco_cls)\n - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstableai-org\u002Fbcco_reg)\n\n# ➤ License\nThe code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.\n\n# ➤ Citation\n```\n@article{zhang2025limix,\n  title={Limix: Unleashing structured-data modeling capability for generalist intelligence},\n  author={Zhang, Xingxuan and Ren, Gang and Yu, Han and Yuan, Hao and Wang, Hui and Li, Jiansheng and Wu, Jiayun and Mo, Lang and Mao, Li and Hao, Mingchao and others},\n  journal={arXiv preprint arXiv:2509.03505},\n  year={2025}\n}\n```\n","LimiX is a generalist model designed for structured data. It aims to handle classification, regression, missing-value imputation, feature selection, sample selection, and causal inference through a single training and inference recipe. The project uses an optimized Transformer architecture that applies attention across both the sample and feature dimensions to identify key patterns and produce high-dimensional representations that support diverse predictive tasks. LimiX is particularly suited to applications that involve complex tabular data, such as financial analysis and healthcare data analysis, and it achieves state-of-the-art results on multiple benchmarks. The project also provides detailed documentation and a technical report to help users understand and use it.",2,"2026-06-11 03:41:14","high_star"]