[Paper] Translation of Tishby's Talk on the Information Bottleneck (Part 2)

Copyright notice: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/qq_20936739/article/details/82492804

To follow this part of the talk, first prepare a little background:

https://www.youtube.com/watch?v=EQTtBRM0sIs&feature=youtu.be&t=42m55s

Another video by Tishby, which introduces the material in more detail.

1. PAC learning: Probably Approximately Correct. The PAC framework is mainly used to determine whether the data are learnable, how many training samples are needed, and the time and space complexity of learning.

2. Hypothesis set. Machine learning is essentially learning a mapping f such that f(X) → Y. There are many candidate f's; as the video above says, if you are learning a "circle", then f actually ranges over the set of all circle equations. From the observed samples we eventually settle on one best f as the final mapping. All of these candidate f's form a set, the hypothesis space, and the "cardinality of the hypothesis set" is the number of such candidates. Take the melon example from Zhou Zhihua's Machine Learning: we might hypothesize that "a melon with green color, a curled-up stem, and a dull knock is a good melon", or that "a melon with dark color, a curled-up stem, and a dull knock is a good melon"; altogether there are 65 such choices. The set of these 65 choices is the hypothesis set, and 65 is its cardinality. See also an earlier post: https://blog.csdn.net/qq_20936739/article/details/77982056
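The count of 65 can be reproduced directly: a hypothesis fixes each attribute to one of its concrete values or to a wildcard "*" (any value is acceptable), plus one extra hypothesis saying that no good melon exists. A minimal sketch, assuming (as in Zhou Zhihua's book) three attributes with three possible values each:

```python
# Counting the hypothesis space for the melon example.
# Assumption (from the book's setup, not this post): the three attributes
# 色泽, 根蒂, 敲声 each take 3 possible values.
values_per_attribute = [3, 3, 3]

product = 1
for v in values_per_attribute:
    product *= v + 1  # +1 for the wildcard "*" on each attribute

cardinality = product + 1  # +1 for the "no melon is good" hypothesis
print(cardinality)  # 4 * 4 * 4 + 1 = 65
```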

3. \epsilon is the gap between the empirical error and the generalization error (the error on the test set versus the model's true error). The smaller \epsilon is, the closer the empirical error is to the generalization error, and the better the test-set error reflects the true error on unseen samples in actual use. \epsilon cannot be computed exactly, but its upper bound can; making that upper bound small enough guarantees that the empirical error reflects the generalization error.

4. \delta is the confidence parameter: the inequality holds with probability at least 1 - \delta. As \delta gets closer to 1, we have less assurance that the inequality holds, but \epsilon becomes smaller. In other words, we cannot guarantee the bound holds 100% of the time, but whenever it does hold, the empirical error tracks the true error more closely. The log(1/\delta) term is usually negligible.

5. On H_\epsilon: the full set of hypotheses is actually very large and hard to pin down, so it is usually approximated by a "ball" whose size depends on \epsilon and the VC dimension. My own understanding is sketched in the figure below.

6. m is the number of samples; this needs no explanation: the more samples, the better.
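Points 3 through 6 fit together in one formula. For a finite hypothesis class, the standard Hoeffding-style bound says that with probability at least 1 - \delta, the gap between empirical and true error satisfies \epsilon <= sqrt((ln|H| + ln(1/\delta)) / (2m)). A minimal sketch (the sample sizes are illustrative, not from the talk):

```python
import math

def pac_epsilon(h_cardinality, m, delta):
    """Hoeffding-style bound for a finite hypothesis class: with
    probability >= 1 - delta, empirical and true error differ by
    at most the returned epsilon."""
    return math.sqrt((math.log(h_cardinality) + math.log(1.0 / delta)) / (2 * m))

# The melon hypothesis space has |H| = 65; more samples shrink the gap.
print(pac_epsilon(65, 100, 0.05))    # m = 100
print(pac_epsilon(65, 10000, 0.05))  # m = 10000: a 100x larger m gives a 10x smaller gap
```

Note how a larger |H| widens the gap while a larger m narrows it, exactly the trade-off described in points 2 and 6.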

All right, here is the divider marking the official start ~~~~~~~~~~


So I hope that some of you know something about PAC learning and learning theory, and then you should be familiar with what I call the old type of generalization bound. Essentially, the generalization error, the probability of error outside my training data, or its square, is bounded by the log of the hypothesis class cardinality. For finite cardinality this is simply what's called the cardinality bound, but for a general class we usually use what we call an epsilon-cover of the class: we actually sample it on a grid such that all the hypotheses there are epsilon-close to each other, and then we can settle for the log-cardinality of an epsilon-cover of the hypothesis class divided by the number of samples, times some constant I don't care about, plus a small number having to do with the confidence, which for typical large problems is completely negligible. Okay, everybody who took a first course in learning theory knows this bound. And then usually we have something like the VC dimension, or other dimensionalities of the class, that tells us that the cardinality of an epsilon-cover of the class scales like one over epsilon to some dimension d. When I plug this in here, I get d/m as the main factor, which is really telling me that as long as the number of examples is smaller than the dimension of the class, you are not generalizing; once it is above it, you start to generalize like 1/sqrt(m), or like one over some other power of m. That's classical. The problem is, as I'm sure you all know but maybe don't appreciate, or maybe you don't know, that deep neural networks don't use this bound. It's useless for deep learning. It doesn't work. Why is it useless?
Actually, it gets even worse: when you show me that the network can express very complicated functions, those very sophisticated expressivity bounds essentially push the dimension higher. So this dimension is now of the order of millions or tens of millions, and I have hundreds of thousands of samples, and the network actually generalizes very well. So obviously this doesn't explain anything, and it has moved many people in the wrong direction.

I hope you know something about PAC learning and learning theory; then the "old" generalization bound I am about to describe should be familiar. The probability of error on samples outside the training set, or its square, is bounded by log|H|, where H is the hypothesis space. For a finite hypothesis space this bound is called the "cardinality bound". For a general class, we usually use an epsilon-cover of the hypothesis space: we sample the hypotheses on a grid so that neighboring hypotheses are very close to each other, then divide the log-cardinality of this cover by the number of samples, plus a small, usually negligible constant related to the confidence. Anyone with the relevant background has seen this bound. Usually, via the VC dimension or some other notion of dimension, we know that the cardinality of an epsilon-cover scales like (1/epsilon)^d. Plugging this approximation in, the ratio of dimension to sample count, d/m, becomes the dominant factor. As long as the number of samples is smaller than the dimension of the class (the bound exceeds 1), there is no meaningful generalization; above it, the generalization error behaves like 1 over the square root of m, or some other power. That is the classical view. The problem, which you may or may not appreciate, is that deep learning does not obey this bound; it simply does not work, and why not? Neural networks can express extremely complicated functions, so expressivity results push the dimension even higher: with the old bound things only get worse, since the dimension can be in the millions or tens of millions, while the network generalizes very well on hundreds of thousands of samples. So the bound does not reflect the true generalization behavior and has even led many people in the wrong direction.
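The d/m behavior above can be made concrete: if the epsilon-cover has cardinality roughly (1/\epsilon)^d, then log|H_\epsilon| ≈ d·ln(1/\epsilon), and the gap scales like sqrt(d/m) up to logarithmic factors. A sketch with invented numbers showing why the bound becomes vacuous at deep-learning scale:

```python
import math

def vc_style_gap(d, m, eps_cover=0.1, delta=0.05):
    """Plug |H_eps| ~ (1/eps)^d into the cardinality bound:
    gap ~ sqrt((d * ln(1/eps) + ln(1/delta)) / (2 * m))."""
    return math.sqrt((d * math.log(1.0 / eps_cover) + math.log(1.0 / delta)) / (2 * m))

# Classical regime, m >> d: a meaningful guarantee.
print(vc_style_gap(d=100, m=100000))
# Deep-net-sized dimension with far fewer samples: the bound exceeds 1, i.e. vacuous.
print(vc_style_gap(d=10**7, m=10**5))
```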

So I suggest a different bound. I call it the input compression bound. It's quite new, and it's actually surprisingly related to many things that people have known for a long time, like nearest neighbors and many other partition methods. So instead of focusing on the hypothesis class, which I actually think is a completely irrelevant notion for deep learning, I think about what happens to the input. Now, I already told you that the layers induce a partition of the input. So I'm going to quantify this partition by how homogeneous the cells are with respect to the label. So if I somehow manage to compress my input, to cover my input space with cells which are more or less homogeneous with respect to the label, then I'm doing a much finer job in terms of (?). So imagine that I actually cover the space of inputs X. In the case of images, say it's all the possible images we care about; it is a very big space, and I cover it with spheres which are essentially groups of images, clusters of images. If you want, it can be a soft or hard partition. Essentially what I'm saying is that I can then replace the cardinality of the hypothesis class. Take boolean functions just for simplicity: it moves from 2 to the cardinality of X, which counts all boolean functions on X, to 2 to the cardinality of the partition. Because if I manage to epsilon-cover my input, then the number of labels that I really need moves from 2^|X| to 2^|T_epsilon|: essentially one label per cell of the partition. So that looks okay; it's an exponential decrease, but it's okay.

So I suggest a different bound, which I call the input compression bound. It is newly proposed, and it is related to many methods people have known for a long time, such as nearest neighbors and other clustering methods. We need not worry about the hypothesis space; it is completely irrelevant for deep learning. What we should think about instead is what happens to the input. I have already shown that the network's layers partition the input, so I want to quantify this partition, and the measure is how homogeneous the cells are with respect to the labels. Suppose I cover the whole input space; in the case of the space of all possible images, this space is enormous, and I cover it with small spheres, which effectively group, or cluster, the images. The cardinality of the hypothesis space then changes: taking boolean functions for simplicity, it drops from 2^N to 2^k (where N is the number of elements of the original input space X, and k is the number of cells X is partitioned into). Using an epsilon-cover, the count of labellings becomes 2^|T_epsilon|; from the labels' point of view, each cell of the partition needs just one label. This looks good, since the number of possible hypotheses decreases exponentially.
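To see why replacing 2^|X| by 2^|T_epsilon| matters, invert the cardinality bound and ask how many samples a target gap requires: m grows linearly with log|H|. The cell counts below are hypothetical, chosen only to illustrate the scale:

```python
import math

def samples_needed(log2_h, eps=0.1, delta=0.05):
    """Invert eps = sqrt((ln|H| + ln(1/delta)) / (2m)) for m,
    given log2|H| (bits needed to name a hypothesis)."""
    ln_h = log2_h * math.log(2.0)
    return (ln_h + math.log(1.0 / delta)) / (2 * eps ** 2)

# log2|H| = |X| when every input can be labelled freely (say 10^6 inputs) ...
print(samples_needed(log2_h=10**6))
# ... versus log2|H| = |T_eps| with one label per cluster (say 10^3 cells).
print(samples_needed(log2_h=10**3))
```

The required sample count drops by roughly the compression factor |X| / |T_epsilon|, which is the whole point of the input compression bound.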

Now there are two questions. The first one is: how do I make sure that the partition I get is indeed homogeneous with respect to the labels? This requires some sort of a distortion function. I mean, I compress, so I can use rate-distortion theory: I want my grid, or my codebook, or my partition, to have small distortion. The only thing I'm saying is that the distortion we need to use is actually... emmm... if I use this information bottleneck distortion, it actually bounds the L1 distortion, which means that if this is going to be an epsilon-partition with respect to one, it is also going to be an epsilon-partition with respect to the other; it's the same topology. Therefore, since the average information bottleneck distortion is precisely related to the mutual information in the representation, minimizing this is equivalent to maximizing the mutual information on the output. Okay, that's very nice, though not surprising to anyone.

However, there are two questions here. The first is how to guarantee that the resulting partition is indeed homogeneous with respect to the labels; this requires some distortion function as a measure. That is, the compression can be handled with rate-distortion theory: I want the grid, the codebook, in other words the partition, to have a small distortion. The distortion function we need to use is... well... if we use the information bottleneck distortion, it actually bounds the L1 distortion; that is, an epsilon-partition with respect to one is also an epsilon-partition with respect to the other, since they induce the same topology. And because the average information bottleneck distortion is precisely related to the mutual information in the representation, minimizing it is equivalent to maximizing the mutual information with the output.
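The claim that homogeneity of the partition shows up as mutual information can be illustrated numerically: a partition T whose cells almost determine the label Y has high I(T;Y), while a label-irrelevant partition has I(T;Y) ≈ 0. A small sketch; the joint distributions are invented for illustration only:

```python
import math

def mutual_information(joint):
    """I(T;Y) in bits from a joint probability table p(t, y),
    given as a list of rows (one row per cell t)."""
    p_t = [sum(row) for row in joint]            # marginal over cells
    p_y = [sum(col) for col in zip(*joint)]      # marginal over labels
    mi = 0.0
    for t, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_t[t] * p_y[y]))
    return mi

# Homogeneous partition: each cell t almost determines the label y.
print(mutual_information([[0.45, 0.05], [0.05, 0.45]]))
# Useless partition: cells are independent of the label.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```

Maximizing I(T;Y), equivalently minimizing the information bottleneck distortion, pushes the partition toward the first, label-homogeneous case.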
