[Translation] Statistical Modeling: The Two Cultures (Parts 4 and 5)

Please do not repost without notifying me, and above all do not plagiarize.

Abstract 

1. Introduction 

2. ROAD MAP

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks


Statistical Modeling: The Two Cultures 


Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, “What’s the model for the data?”


4.1 Statistical Research


Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

"Assume that the data are generated by the following model: ..."

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form: Assume that the data are generated by the following model: ...

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.
 

[Translator's note: I once read a paper arguing that p-values are misused, which I take to be roughly the same point. Of course everyone likes a model that predicts accurately, but many machine learning algorithms lack the kind of complete theory that statistics has, so it is unclear how to guarantee their stability. That seems to me the bigger problem.]

5. THE USE OF DATA MODELS


Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

• The conclusions are about the model’s mechanism, and not about nature’s mechanism. It follows that:
• If the model is a poor emulation of nature, the conclusions may be wrong.


These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are infallible.


5.1 An Example

I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

$$ y = b_0 + \sum_{m=1}^{M} b_m x_m + \varepsilon \qquad \text{(R)} $$

where the coefficients {b_m} are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.
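The point that a significant coefficient says nothing about whether model (R) holds can be illustrated with a small simulation. This is only a sketch under stated assumptions: the case M = 1, an exponential mechanism chosen arbitrarily to stand in for "nature", and a hypothetical helper `ols_slope_t` that is not from the paper.

```python
import math
import random

def ols_slope_t(xs, ys):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, t statistic for b1)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)           # estimate of sigma^2, valid only if (R) holds
    se_b1 = math.sqrt(sigma2_hat / sxx)  # textbook standard error of the slope
    return b0, b1, b1 / se_b1

rng = random.Random(0)
xs = [rng.uniform(0.0, 3.0) for _ in range(200)]
# Assumed mechanism for illustration: exponential in x, so the linear model is wrong.
ys = [math.exp(x) + rng.gauss(0.0, 1.0) for x in xs]

b0, b1, t = ols_slope_t(xs, ys)
print(f"b1 = {b1:.2f}, t = {t:.1f}")
```

The t statistic comes out far beyond the usual 5% cutoff of about 1.96 even though the data were not generated by a linear model. The test is computed inside the model and cannot by itself reveal that the model is a poor emulation.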

[Translator's note: but does machine learning get at nature's mechanism either?]

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R², nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a database set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.
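How R² gets inflated by extra parameters can be sketched as follows. The data, the sample size of 30, and the helper names `solve` and `r_squared` are all assumptions made for illustration. One weak real predictor is joined by a growing number of pure-noise predictors, and the in-sample R² is recomputed each time.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(rows, ys):
    """In-sample R^2 of a least-squares fit with an intercept (normal equations)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    mean_y = sum(ys) / len(ys)
    rss = sum((y - sum(b * v for b, v in zip(beta, x))) ** 2 for x, y in zip(X, ys))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - rss / tss

rng = random.Random(1)
n = 30
signal = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.5 * s + rng.gauss(0, 1) for s in signal]
noise = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(n)]  # unrelated to y

r2_by_k = []
for k in (0, 5, 10, 20):
    rows = [[signal[i]] + noise[i][:k] for i in range(n)]
    r2_by_k.append(r_squared(rows, ys))
    print(f"{1 + k:2d} predictors: R^2 = {r2_by_k[-1]:.3f}")
```

Because the models are nested, in-sample R² cannot decrease as predictors are added, even when the added predictors carry no information about the response.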

[Translator's note: the inflation of R² from using too many parameters sounds like overfitting.]

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model—Can the data gathered answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.


The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write “The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties.”


Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994); the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.


5.2 Problems in Current Data Modeling


Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.


Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

[Translator's note: What? In everything I have studied, residual analysis is used with a dozen or more variables.]

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.


There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) give a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants presents other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.


Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.


5.3 The Multiplicity of Data Models


One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature’s mechanism and lead to different conclusions.

[Translator's note: the interpretability of data models, a simple picture of how the inputs relate to the response, was stressed in my classes too.]

McCullagh and Nelder (1989) write “Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this.” Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, “It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory.”
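A toy version of this multiplicity, offered only as a sketch: the two nearly collinear predictors, the assumed mechanism, and the helper `r2_single` are invented for illustration and do not come from the paper. Two analysts each fit a one-variable model to the same response, one using `x1` and the other using `x2`.

```python
import random

def r2_single(xs, ys):
    """R^2 of the least-squares line y = b0 + b1*x (the squared sample correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

rng = random.Random(3)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]  # x2 nearly duplicates x1
ys = [a + rng.gauss(0, 1) for a in x1]    # assumed mechanism: only x1 drives y

r2_model_a = r2_single(x1, ys)  # model A: "y depends on x1"
r2_model_b = r2_single(x2, ys)  # model B: "y depends on x2"
print(f"model A R^2 = {r2_model_a:.3f}, model B R^2 = {r2_model_b:.3f}")
```

The two fits are almost indistinguishable, yet the models tell different stories about which variable matters. A yes-or-no check of fit would accept both.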


Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.


5.4 Predictive Accuracy


The most obvious way to see how well the model box emulates nature’s box is this: put a case x down nature’s box getting an output y. Similarly, put the same case x down the model box getting an output y′. The closeness of y and y′ is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.


Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as “noise.” But the extent to which the model box emulates nature’s box is a measure of how well our model can reproduce the natural phenomenon producing the data.

[Translator's note: Here something clicked for me. At bottom, data science is still about working with the data: we hope to use modeling to emulate the process that generated the data, not to build flashy models. Solving the problem is what matters in practice.]

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, “At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes μ̂ (the model predicted value) very close to y (the response value).” Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.
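The difference between predicting the same data used for fitting and cross-validating can be sketched in a few lines. Everything here is an assumption made for illustration: the simulated data, the straight-line model, and the helper names `fit_line`, `mse` and `cross_val_mse`. This is not Stone's procedure verbatim, only the basic idea of predicting each case from a model fit without it.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on the given cases."""
    b0, b1 = model
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_mse(xs, ys, k):
    """k-fold cross-validation: each case is predicted by a model fit without its fold."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test)
    return total / n

rng = random.Random(2)
xs = [rng.uniform(0, 1) for _ in range(20)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

resub = mse(fit_line(xs, ys), xs, ys)    # predict the same data used for fitting
loo = cross_val_mse(xs, ys, k=len(xs))   # leave-one-out
five_fold = cross_val_mse(xs, ys, k=5)
print(f"resubstitution {resub:.3f}, leave-one-out {loo:.3f}, 5-fold {five_fold:.3f}")
```

For a least-squares fit the leave-one-out error is never smaller than the resubstitution error, which is one way to see the optimism of judging a model on the data it was fit to. The gap widens as the number of parameters grows relative to the sample size.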


Mosteller and Tukey (1977) were early advocates of cross-validation. They write, “Cross-validation is a natural route to the indication of the quality of any data-derived quantity... . We plan to cross-validate carefully wherever we can.”

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.



Reposted from blog.csdn.net/weixin_39965890/article/details/82997898