[Translation] Statistical Modeling: The Two Cultures (Part 6)

Please do not repost without notifying me first, and above all do not plagiarize.

Abstract 

1. Introduction 

2. Road map

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks


Statistical Modeling: The Two Cultures 

Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

6. THE LIMITATIONS OF DATA MODELS


With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

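In other words: because of the insistence on data models, the multivariate analysis tools of statistics have stalled at discriminant analysis and logistic regression for classification, and at multiple linear regression for regression. Nobody truly believes that multivariate data follow a multivariate normal distribution, yet that data model fills a large share of the pages in every graduate textbook on multivariate statistical analysis.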

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

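When data come from uncontrolled observation of complex systems whose physical, chemical, or biological mechanisms are unknown, assuming a priori that nature generated the data through a parametric model picked by the statistician can lead to questionable conclusions, ones that goodness-of-fit tests and residual analysis cannot back up. As a rule, forcing a simple parametric model onto data produced by a complex system (medical or financial data, for example) loses accuracy and information compared with an algorithmic model (see Section 11).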
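Translator's illustration (not from Breiman's paper): the loss of accuracy described above can be seen on toy data. The sketch below assumes a made-up "complex mechanism", y = sin(2x) plus Gaussian noise, and compares a straight-line data model with a k-nearest-neighbour predictor, used here only as a simple stand-in for the algorithmic models the paper has in mind. The function names, sample sizes, noise level, and k = 5 are all assumptions chosen for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regression: average the k closest training responses."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(predict, xs, ys):
    """Mean squared prediction error on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    # Hypothetical mechanism: a nonlinear signal plus Gaussian noise (sd 0.1).
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [math.sin(2.0 * x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is reproducible
x_train, y_train = simulate(400, rng)
x_test, y_test = simulate(400, rng)

linear_mse = mse(fit_line(x_train, y_train), x_test, y_test)
knn_mse = mse(fit_knn(x_train, y_train), x_test, y_test)

print(f"straight-line data model, test MSE: {linear_mse:.3f}")
print(f"k-NN predictor, test MSE:           {knn_mse:.3f}")
```

On this toy mechanism the straight line cannot follow the oscillation, so its held-out error stays far above the noise level, while the neighbour average gets close to it. This only illustrates the claim on invented data; it says nothing about any particular medical or financial data set.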

There is an old saying “If all a man has is a hammer, then every problem looks like a nail.” The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature’s mechanism.

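As the old saying goes, "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that, lately, some problems no longer look like nails. My guess is that running into this wall is why more and more complicated data models show up in recently published applications. Bayesian methods combined with Markov Chain Monte Carlo are turning up everywhere. This may mean that as data grow more complex, data models grow more cumbersome and lose their advantage of giving a simple, clear picture of nature's mechanism.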

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.

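Approaching a problem by hunting for a data model puts on an a priori straitjacket that limits statisticians' ability to handle a wide range of statistical problems. The best available solution to a data problem may be a data model, or it may just as well be an algorithmic model. The data and the problem should guide the solution. Solving a wider range of data problems takes a larger set of tools.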
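Translator's illustration (not from Breiman's paper): one way to let "the data and the problem guide the solution" is to score each candidate by cross-validated prediction error and keep whichever does better. The sketch below assumes two invented data sets, one truly linear and one oscillating, and two candidates, a straight-line data model and a k-nearest-neighbour predictor standing in for an algorithmic model. The names `cv_mse` and `choose_model`, the fold count, and k = 3 are hypothetical choices for this example.

```python
import math
import random

DATA_MODEL = "data model (straight line)"
ALGORITHMIC = "algorithmic model (k-NN)"


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: average the k closest training responses."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def cv_mse(fit, xs, ys, folds=5):
    """Squared error averaged over all points, each predicted while held out."""
    n = len(xs)
    total = 0.0
    for f in range(folds):
        held_out = set(range(f, n, folds))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        predict = fit(train_x, train_y)
        total += sum((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return total / n


def choose_model(xs, ys):
    """Return the candidate with the lowest cross-validated error, plus all scores."""
    candidates = {DATA_MODEL: fit_line, ALGORITHMIC: fit_knn}
    scores = {name: cv_mse(fit, xs, ys) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


rng = random.Random(1)  # fixed seed so the run is reproducible
xs = [rng.uniform(-3.0, 3.0) for _ in range(600)]
noise = [rng.gauss(0.0, 0.5) for _ in xs]
linear_ys = [0.8 * x + e for x, e in zip(xs, noise)]         # truly linear mechanism
wavy_ys = [math.sin(2.0 * x) + e for x, e in zip(xs, noise)]  # nonlinear mechanism

best_for_linear, linear_scores = choose_model(xs, linear_ys)
best_for_wavy, wavy_scores = choose_model(xs, wavy_ys)

print("linear mechanism ->", best_for_linear, linear_scores)
print("wavy mechanism   ->", best_for_wavy, wavy_scores)
```

Neither candidate is favoured in advance: when the mechanism really is linear the data model should win, and when it is not the algorithmic one should. The point of the design is that the held-out error makes the choice, which is the wider toolbox the paragraph argues for.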

Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

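Perhaps the damaging consequence of insisting on data models is that statisticians have shut themselves out of some of the most interesting and challenging statistical problems, problems that arose from the rapidly growing ability of computers to store and manipulate data. These problems appear more and more in many fields, scientific and commercial alike, and the solutions are being found by nonstatisticians.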

Reposted from blog.csdn.net/weixin_39965890/article/details/83008609