[Translation] Statistical Modeling: The Two Cultures (Part 3)

Please do not repost this translation without notifying me, and do not plagiarize it.

Abstract 

1. Introduction 

2. ROAD MAP

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks


Statistical Modeling: The Two Cultures 


Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.

Using mass spectra to identify halogen-containing compounds.

Predicting the class of a ship from high altitude radar returns.

Using sonar returns to predict the class of a submarine.

Identity of hand-sent Morse Code.

Toxicity of chemicals.

On-line prediction of the cause of a freeway traffic breakdown.

Speech recognition.

The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.


3.1 The Ozone Project

In the mid-to-late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.


Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the (n+1)st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day’s ozone level y.


To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure—the false alarm rate of the final predictor was too high. I have regrets that this project can’t be revisited with the tools available today.
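The workflow just described (a large linear regression, variable selection, quadratic and interaction terms among the retained variables, then a second round of pruning) maps directly onto today's standard tooling. Below is a minimal sketch of it with scikit-learn on synthetic stand-in data; the column count, the five-year/two-year split, and the target are invented for illustration and are not the project's data:

```python
# Sketch of the ozone pipeline above, redone with scikit-learn on synthetic
# stand-ins (the real project had over 450 meteorological variables).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n_days, n_vars = 7 * 365, 60                      # seven years of daily readings
X = rng.normal(size=(n_days, n_vars))             # stand-in meteorological predictors
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=n_days)

split = 5 * 365                                   # first five years train, last two test
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Step 1: a large linear regression followed by variable selection.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10)
selector.fit(X_train, y_train)
kept = selector.get_support()

# Step 2: quadratic and interaction terms among the retained variables,
# then a second round of selection to prune the equation.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    SequentialFeatureSelector(LinearRegression(), n_features_to_select=10),
    LinearRegression(),
)
model.fit(X_train[:, kept], y_train)

# Breiman's criterion: predictive accuracy on the held-out test years.
print("test R^2:", round(model.score(X_test[:, kept], y_test), 3))
```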


3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.


Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.


Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectra has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.
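Put differently, each spectrum can be stored as a vector of fragment frequencies indexed by integer molecular weight, and because its length equals the compound's own molecular weight, different compounds give vectors of different lengths. A tiny illustrative sketch, with peak positions and intensities that are made up rather than taken from the project:

```python
import numpy as np

def spectrum_vector(peaks: dict[int, float], molecular_weight: int) -> np.ndarray:
    """Fragment frequencies indexed by molecular weight 1..molecular_weight.

    Most entries are zero, and the vector's length is tied to the compound's
    own molecular weight, so its dimensionality varies from compound to compound.
    """
    spec = np.zeros(molecular_weight)
    for weight, frequency in peaks.items():
        spec[weight - 1] = frequency
    return spec

# Two invented compounds with different molecular weights give predictor
# vectors x of different lengths -- the variable dimensionality discussed below.
x_small = spectrum_vector({35: 0.9, 36: 0.3, 70: 1.0}, molecular_weight=70)
x_large = spectrum_vector({35: 0.4, 112: 1.0, 150: 0.2}, molecular_weight=300)
print(len(x_small), len(x_large))   # 70 300
```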


The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

y = 1: contains chlorine, y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes–no questions that could be applied to a mass spectra of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).
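The trick is to turn a spectrum of any length into a fixed-length vector of answers to yes–no questions, which an ordinary decision tree can then split on. Here is a rough sketch of that idea, with a handful of invented questions standing in for the 1,500 domain-designed ones; the toy spectra and labels are likewise made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy spectra of different lengths (fragment frequency indexed by molecular
# weight); peak positions, intensities and labels are invented for illustration.
spectra = [
    np.bincount([35, 36, 70], weights=[0.9, 0.3, 1.0], minlength=71)[1:],
    np.bincount([112, 150], weights=[1.0, 0.2], minlength=301)[1:],
]
labels = [1, 2]   # y = 1: contains chlorine, y = 2: does not contain chlorine

def questions(spec: np.ndarray) -> list[bool]:
    """A few invented yes/no questions, each answerable for a spectrum of any
    dimensionality; the project used ~1,500 such questions built from chemists'
    knowledge of chlorine's hallmarks in mass spectra."""
    peaks = set(np.flatnonzero(spec) + 1)          # weights with nonzero frequency
    return [
        len(spec) > 100,                           # compound heavier than 100?
        35 in peaks or 37 in peaks,                # fragment at a chlorine-isotope mass?
        any(w + 2 in peaks for w in peaks),        # two peaks spaced 2 mass units apart?
        float(spec.max()) > 0.5,                   # any dominant fragment?
    ]

# Every spectrum becomes the same fixed-length boolean vector, so an
# ordinary decision tree can be trained regardless of spectrum length.
X = np.array([questions(s) for s in spectra])
tree = DecisionTreeClassifier().fit(X, labels)
print(tree.predict(X))
```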


Linear Discriminant Analysis (LDA)

A method for classifying objects into two or more classes based on their features, used in statistics, pattern recognition, and machine learning.

LDA is very similar to PCA. The difference is that LDA projects the data so as to separate the classes, whereas PCA projects the data onto the directions along which the samples are most alike; in both cases the dimensionality reduction is carried out through eigenvalues and eigenvectors.

The PCA transform is based on minimizing the mean squared error between the original data and its reduced-dimension reconstruction. PCA tends to extract the features the data have most in common and to ignore small differences between samples, so an OCR system built on PCA has trouble telling the letters Q and O apart. LDA, by contrast, maximizes between-class variance and minimizes within-class variance: it shrinks the differences within each class and widens the differences between classes, so in some applications LDA performs better than PCA.

—— excerpted from "An Introduction to the LDA (Linear Discriminant Analysis) Algorithm", OpenCV学堂 [1]
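The contrast is easy to see with scikit-learn: PCA picks the direction of greatest variance without looking at the class labels, while LDA picks the direction that best separates the classes. A small sketch on synthetic two-class data, unrelated to the chlorine project:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes that differ along a low-variance axis: PCA's leading component
# follows the high-variance axis and ignores the labels, while LDA's
# discriminant direction follows the axis that separates the classes.
cov = [[5.0, 0.0], [0.0, 0.2]]
X = np.vstack([rng.multivariate_normal([0, -1], cov, 200),
               rng.multivariate_normal([0, 1], cov, 200)])
y = np.repeat([0, 1], 200)

pca = PCA(n_components=1).fit(X)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
print("PCA direction:", pca.components_[0])     # roughly along the high-variance x-axis
print("LDA direction:", lda.scalings_[:, 0])    # roughly along the class-separating y-axis
```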

Quadratic Discriminant Analysis

Similar to LDA, except that it can form nonlinear boundaries, and the Gaussian distribution assumed for each class has its own covariance matrix.

—— excerpted from "A Brief Look at sklearn (5): Discriminant Analysis", by CSDN user NirHeavenX [2]
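And where LDA pools a single covariance matrix across classes and so draws linear boundaries, QDA estimates one covariance per class and draws quadratic ones. A minimal sketch on synthetic data in which the two classes share a mean but differ in spread:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Class 0 is tightly concentrated, class 1 is spread out around it: no single
# straight line separates them, but per-class covariances (QDA) carve out an
# elliptical boundary that does.
X = np.vstack([rng.normal(0.0, 0.5, size=(300, 2)),
               rng.normal(0.0, 3.0, size=(300, 2))])
y = np.repeat([0, 1], 300)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA training accuracy:", lda.score(X, y))   # near chance
print("QDA training accuracy:", qda.score(X, y))   # much higher
```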

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution—that’s what consultants get paid for.

(b) Live with the data before you plunge into modeling.

(c) Search for a model that gives a good solution, either algorithmic or data.

(d) Predictive accuracy on test sets is the criterion for how good the model is.

(e) Computers are an indispensable partner.


References:

[1] https://www.sohu.com/a/159765142_823210

[2] https://blog.csdn.net/qsczse943062710/article/details/75977118


Reposted from blog.csdn.net/weixin_39965890/article/details/82182002