[Paper Reading] [TabPFN] Accurate predictions on small data with a tabular foundation model

Enterprise Development 2025-04-10 00:54:26

Deep learning can now handle tabular data as well; the field is moving fast. This paper was published in Nature and is available at:
https://www.nature.com/articles/s41586-024-08328-6

Accurate predictions on small data with a tabular foundation model

Abstract

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science[1,2]. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories[3–5], gradient-boosted decision trees[6–9] have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest baselines tuned for 4 h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm development. By improving modelling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains.

Introduction (untitled in the original)

Throughout the history of artificial intelligence, manually created algorithmic components have been replaced with better-performing end-to-end learned ones. Hand-designed features in computer vision, such as SIFT (Scale Invariant Feature Transform)[10] and HOG (Histogram of Oriented Gradients)[11], have been replaced by learned convolutions; grammar-based approaches in natural language processing have been replaced by learned transformers[12]; and the design of customized opening and end-game libraries in game playing has been superseded by end-to-end learned strategies[3,13]. Here we extend this end-to-end learning to the ubiquitous domain of tabular data.

The diversity of tabular data sets them apart from unprocessed modalities such as text and images. While in language modelling for example the meaning of a word is consistent across documents, in tabular datasets the same value can mean fundamentally different things. A drug discovery dataset, for example, might record chemical properties, whereas another dataset in materials science might document thermal and electric properties. This specialization leads to a proliferation of smaller, independent datasets and associated models. To illustrate, on the popular tabular benchmarking website openml.org, 76% of the datasets contain less than 10,000 rows at the time of writing.

Deep learning methods have traditionally struggled with tabular data, because of the heterogeneity between datasets and the heterogeneity of the raw data itself: Tables contain columns, also called features, with various scales and types (Boolean, categorical, ordinal, integer, floating point), imbalanced or missing data, unimportant features, outliers and so on. This made non-deep-learning methods, such as tree-based models, the strongest contender so far[14,15].
However, these traditional machine learning models have several drawbacks. Without substantial modifications, they yield poor out-of-distribution predictions and poor transfer of knowledge from one dataset to another[16]. Finally, they are hard to combine with neural networks, as they do not propagate gradients.

As a remedy, we introduce TabPFN, a foundation model for small- to medium-sized tabular data. This new supervised tabular learning method can be applied to any small- to moderate-sized dataset and yields dominant performance for datasets with up to 10,000 samples and 500 features. In a single forward pass, TabPFN significantly outperforms state-of-the-art baselines on our benchmarks, including gradient-boosted decision trees, even when these are allowed 4 h of tuning, a speedup of 5,140× (classification) and 3,000× (regression). Finally, we demonstrate various foundation model characteristics of TabPFN, including fine-tuning, generative abilities and density estimation.
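The quoted speedups can be checked against the runtimes with simple arithmetic. The sketch below assumes the speedup is the ratio of the 4 h tuning budget to TabPFN's runtime. The 2.8 s classification figure comes from the abstract; the regression runtime is not stated in this excerpt, so the value computed here is only what a 3,000× speedup would imply. The names `TUNING_BUDGET_S` and `speedup` are made up for this illustration.

```python
# Sanity check of the speedup figures quoted above.
TUNING_BUDGET_S = 4 * 60 * 60  # 4 h of baseline tuning, in seconds (14,400 s)


def speedup(tabpfn_seconds: float) -> float:
    """Ratio of the baseline tuning budget to TabPFN's runtime."""
    return TUNING_BUDGET_S / tabpfn_seconds


# 2.8 s is the classification runtime stated in the abstract.
classification_speedup = speedup(2.8)

# Runtime implied by a 3,000x speedup (an inference, not a quoted figure).
implied_regression_s = TUNING_BUDGET_S / 3000

print(round(classification_speedup))  # 5143
print(implied_regression_s)           # 4.8
```

The classification ratio of about 5,143 matches the quoted 5,140× after rounding to three significant figures.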

Principled in-context learning

TabPFN leverages in-context learning (ICL)[17], the same mechanism that led to the astounding performance of large language models, to generate a powerful tabular prediction algorithm that is fully learned. Although ICL was first observed in large language models, recent work has shown that transformers can learn simple algorithms such as logistic regression through ICL[18–21]. Prior-data Fitted Networks (PFNs) have shown that even complex algorithms, such as Gaussian Processes and Bayesian Neural Networks, can be approximated with ICL[22]. ICL enables us to learn a wider space of possible algorithms, including cases for which a closed-form solution does not exist.
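To make the in-context interface concrete, here is a minimal sketch of what "prediction without parameter updates" looks like as a function signature: labelled context rows and unlabelled query rows go in, and class probabilities come out in one pass. This is not TabPFN's architecture. The body is a hand-written stand-in, a single softmax-attention step using negative squared distance as the score, where a trained transformer would sit. The function name `in_context_predict` and the `temperature` parameter are assumptions made for this example.

```python
import numpy as np


def in_context_predict(X_ctx, y_ctx, X_query, n_classes, temperature=1.0):
    """Predict class probabilities for X_query from labelled context rows.

    Stand-in for a trained transformer: each query row attends to all
    context rows (score = negative squared distance) and averages their
    one-hot labels. Nothing is fitted or updated.
    """
    X_ctx = np.asarray(X_ctx, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise squared distances, shape (n_query, n_ctx).
    d2 = ((X_query[:, None, :] - X_ctx[None, :, :]) ** 2).sum(axis=-1)
    scores = -d2 / temperature
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over context rows
    one_hot = np.eye(n_classes)[np.asarray(y_ctx)]  # (n_ctx, n_classes)
    return weights @ one_hot


X_ctx = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
y_ctx = np.array([0, 0, 1, 1])
X_query = np.array([[0.05, 0.1], [3.1, 3.0]])

proba = in_context_predict(X_ctx, y_ctx, X_query, n_classes=2)
print(proba.argmax(axis=1))  # [0 1]
```

The point of the sketch is the calling convention: the "training set" is an argument of the forward pass, not something baked into the weights. In a PFN the fixed rule above is replaced by a learned network.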

We build on a preliminary version of TabPFN[23], which demonstrated the applicability of in-context learning[17] for tabular data in principle but had many limitations that rendered it inapplicable in most cases. Based on a series of improvements, the new TabPFN scales to 50× larger datasets; supports regression tasks, categorical data and missing values; and is robust to unimportant features and outliers.

The key idea behind TabPFN is to generate a large corpus of synthetic tabular datasets and then train a transformer-based[12] neural network to learn to solve these synthetic prediction tasks. Although traditional approaches require hand-engineered solutions for data challenges such as missing values, our method autonomously learns effective strategies by solving synthetic tasks that include these challenges. This approach leverages ICL as a framework for exemplar-based declarative programming of algorithms. We design desired algorithmic behaviour by generating diverse synthetic datasets that demonstrate the desired behaviour and then train a model to encode an algorithm that satisfies it. This shifts the algorithm design process from writing explicit instructions to defining input–output examples, opening up possibilities for creating algorithms in various domains. Here, we apply this approach to the high-impact field of tabular learning, generating a powerful tabular prediction algorithm.
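The "define input–output examples instead of writing instructions" idea can be sketched as a small task generator. The code below is a toy illustration under stated assumptions, not the prior used by the authors: the data-generating rule (a random linear score passed through `tanh` and thresholded at the median), the missing-value rate, and the names `sample_synthetic_task` and `split_context_query` are all invented here. It shows only the shape of the pipeline: sample many datasets, inject a challenge such as missing values, and split each into a labelled context and held-out queries.

```python
import numpy as np


def sample_synthetic_task(rng, n_rows=64, max_features=8, missing_rate=0.1):
    """Draw one toy binary-classification dataset from a hand-written prior.

    Hypothetical prior: random linear scores passed through tanh, thresholded
    at the median, with some cells blanked out to mimic missing values.
    """
    n_features = int(rng.integers(2, max_features + 1))
    X = rng.normal(size=(n_rows, n_features))
    w = rng.normal(size=n_features)
    latent = np.tanh(X @ w) + 0.1 * rng.normal(size=n_rows)
    y = (latent > np.median(latent)).astype(int)
    # Inject missing values so a model trained on such tasks must cope with them.
    mask = rng.random(X.shape) < missing_rate
    X = np.where(mask, np.nan, X)
    return X, y


def split_context_query(X, y, rng, query_fraction=0.25):
    """Split rows into a labelled context set and a held-out query set."""
    order = rng.permutation(len(y))
    n_query = max(1, int(len(y) * query_fraction))
    q, c = order[:n_query], order[n_query:]
    return (X[c], y[c]), (X[q], y[q])


rng = np.random.default_rng(0)
tasks = [sample_synthetic_task(rng) for _ in range(3)]
(ctx_X, ctx_y), (qry_X, qry_y) = split_context_query(*tasks[0], rng)
print(ctx_X.shape[0], qry_X.shape[0])  # 48 16
```

In a training loop, a model would receive `ctx_X`, `ctx_y` and `qry_X` and be optimized to predict `qry_y`, repeated over a very large number of freshly sampled tasks. Changing the generator changes the behaviour the trained model is asked to exhibit.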

Our ICL approach differs fundamentally from standard supervised deep learning. Usually, models are trained per dataset, updating model parameters