[Paper] Translation of Tishby's Talk on the Information Bottleneck (Part 1)

Copyright notice: this is the blogger's original article; reproduction without the blogger's permission is prohibited. https://blog.csdn.net/qq_20936739/article/details/82453558

Prof. Naftali Tishby proposes using the information bottleneck framework to understand neural networks. The main papers are the following three:

"The Information Bottleneck Method": the paper that introduced the information bottleneck theory.

"Deep Learning and the Information Bottleneck Principle": shows that deep learning and the information bottleneck theory are connected.

"Opening the Black Box of Deep Neural Networks via Information": uses the information bottleneck theory to further explore how neural networks actually work.

Prof. Tishby's talk:

https://www.youtube.com/watch?v=bLqJHjXihK8&t=262s

So essentially this shouldn't be a surprise to anyone here. I look at the input layer as one random variable, a high-dimensional random variable; think of the pixels of an image, for example. This is multi-millions of variables, millions of binary variables if you think about it. So I call it X, just the input variable, the whole layer, and usually this is a high-entropy variable in the sense that there's a lot of randomness there, so I think of X as a high-entropy random variable. Usually the label Y is much simpler; I mean, in the classical supervised learning scenario, Y can be one bit, okay, or a few bits, okay? So what you are given is a sample of the joint distribution of X and Y, which I call data or training data, and what the layers or the network are doing is essentially moving it through a cascade of hidden layers. This is the first hidden layer, the second layer, and so on, in a way which is actually a Markov chain, where each layer can be calculated only from the previous one. So this is a Markov chain of successive representations, and eventually the last layer, which I consider to be very important in this theory, is generating another random variable, which is supposed to be as close as possible to the original Y, and which I call Y hat. So if my training works well, this is some sort of refinement of X through the layers and should eventually lead to a good prediction of Y, but it is not exactly Y, of course; that's why I call it Y hat. Of course, what's actually going on in these layers is a mystery; I mean, this is the biggest question: how does this particular succession work? You all know what these neurons are: nonlinearities applied to products of weights with the previous layer. But in a sense these variables are just what I call X hat in the bottleneck framework. Essentially these are some sort of representations, made out of a succession of many layers of internal representations like this, each one calculated from the previous one, and this is supposed to be some sort of relevant representation for the prediction. So this, to me, is what we call a bottleneck problem. I want to squeeze out of X everything I can except the thing that is relevant for Y. In some sense I want to compress X, exactly in the sense that, through this process, through these layers, only the information that X shares with Y is kept and eventually floated out into one linear classifier at the end, okay? That is deep learning for me. The question is how it happens.

First, as everyone knows, the input layer is a random variable, and a high-dimensional one at that, for example the pixels of an image. The input layer has millions of variables, countless binary variables. Here I call the input variable (the whole input layer) X. Usually X has high entropy, because X is highly random, so X is a high-entropy random variable. In classical supervised learning, by contrast, the label Y is usually very simple, just one bit or a few bits. Given a sample of the joint distribution of X and Y (that is, the training data), the neural network passes the data through a cascade of hidden layers, from the first hidden layer to the second, and so on. Each hidden layer can only be computed from the output of the previous layer, so the network is in fact a Markov chain, a chain of successive representations. Finally, the last layer, which is also the most important one in this theory, outputs a random variable that should be as close as possible to the label Y; I call this output random variable \hat Y. The biggest question is that what happens inside the network has remained a mystery: how exactly is the data transformed? Of course, you all know that each layer takes a weighted combination of the previous layer's outputs and applies a nonlinearity. In the information bottleneck framework these hidden representations are called \hat X. In short, through the succession of layers the network is looking for a representation of the data that keeps only what is relevant for predicting Y. To me, the information bottleneck problem is exactly this: squeezing the information about Y out of X, or in other words, letting only the information that X shares with Y survive the squeezing process through the layers, so that at the end a single linear classifier can make the prediction. That, to me, is deep learning, and the question is how this process happens.
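To keep the notation straight in what follows (the symbols T_1, ..., T_k for the hidden layers are my shorthand; the talk only says "the first hidden layer, the second layer, and so on"), the setup above is a Markov chain of representations:

X -> T_1 -> T_2 -> ... -> T_k -> \hat Y,    with p(t_i | x, t_1, ..., t_{i-1}) = p(t_i | t_{i-1}),

where each layer T_i is computed only from the previous one, and \hat Y is the network's prediction of Y.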

So, just very quickly, a few more minutes. These are two functions that I hope all of you know, but I just want to remind you of them. One of them is the famous KL divergence, also called relative entropy, cross-entropy, or information divergence; it has many different names. It is essentially the average log-likelihood ratio of two distributions, averaged with respect to one of them, so it's not a symmetric quantity. It is non-negative, and it is zero when the two distributions are equal almost everywhere. It is an important function that I'm going to use again and again, and it has many good properties that I don't have time to review. The second one is the mutual information, which is essentially the KL divergence between the joint distribution of the variables and the product of the marginals. It is zero only if the variables are independent, and it can be written in various ways; one of the important ways for me here is H(X) - H(X|Y), which measures how much of the uncertainty of X is removed when I know Y. Now remember that X is given, so in most cases I have no effect on the entropy of X, but I do have an effect on the conditional entropy, and we'll talk about that. There are two properties of the mutual information I want you to remember. One of them is what is called the data processing inequality: as you move along a Markov chain X -> Y -> Z and so on, the mutual information decreases, which means the mutual information between X and Z cannot be larger than the mutual information between X and Y. So if you move along the layers of a network, the mutual information about the input variable can only decrease, and the same is true for the information about Y. The other thing which is really important about mutual information is that it is invariant with respect to reparameterization. If I apply any invertible function, say some sort of permutation, it doesn't matter whether the variable is continuous or discrete, the mutual information does not change. In some sense this is a special case of the data processing inequality, but with equality: the mutual information keeps the same value. If you think about layers and networks, even if I shuffle or permute the variables in a layer in a completely crazy way, as long as the map is invertible, the information stays the same.

Let me briefly review a few concepts; I hope you can remember these two formulas. One is the famous KL divergence, also called relative entropy or cross-entropy; it goes by many names. It is essentially the average of the log-likelihood ratio of two distributions, where the average is taken with respect to one of them, so the divergence is not symmetric. The KL divergence is non-negative, and it is zero exactly when the two distributions are equal almost everywhere. This important quantity will be used over and over, because it has many nice properties that there is no time to review one by one. The second formula is the mutual information, which is the KL divergence between the joint distribution of the variables and the product of the two marginals. The mutual information is zero when the two variables are independent. It can be written in several forms; the most important one here is H(X) - H(X|Y): the mutual information measures how much the uncertainty about X is reduced once Y is known. For now we only need to note that X is given, so in most cases the entropy of X is unaffected; what changes is the conditional entropy H(X|Y). Mutual information has two important properties. One is the data processing inequality: moving forward along a Markov chain X -> Y -> Z, the mutual information keeps decreasing, which means I(X;Z) can never exceed I(X;Y). So as you move forward through the layers of a neural network, the mutual information with the input variable can only decrease, and likewise for the information about Y. The other important property is that mutual information is invariant under reparameterization, whether the variables are discrete or continuous. A special case is the processing done by the neurons themselves: as long as the map is invertible, even an arbitrary transformation of the variables within a layer leaves the mutual information unchanged.
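As a small numerical sketch of these two quantities and the two properties just mentioned (this example is mine, not from the talk; the toy joint distribution and the function names are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) of two discrete distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(pxy):
    """I(X;Y) = D( p(x,y) || p(x)p(y) ), with the joint given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

# A toy joint distribution p(x, y) over 4 values of X and 2 values of Y.
pxy = np.array([[0.30, 0.05],
                [0.20, 0.05],
                [0.05, 0.15],
                [0.05, 0.15]])
print("I(X;Y)    =", mutual_information(pxy))

# Reparameterization invariance: relabeling X by an invertible map
# (here a permutation of its values) leaves the mutual information unchanged.
print("I(f(X);Y) =", mutual_information(pxy[[2, 0, 3, 1], :]))

# Data processing inequality: a non-invertible map T = g(X) (here, merging the
# X values {0,1} and {2,3}) forms the chain Y -> X -> T, so I(T;Y) <= I(X;Y).
pty = np.vstack([pxy[:2].sum(axis=0), pxy[2:].sum(axis=0)])
print("I(g(X);Y) =", mutual_information(pty))
```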

So I think the information bottleneck was already introduced, but just one slide about it. There is a classical notion in statistics called a minimal sufficient statistic; you talk about minimality and sufficiency. Briefly, using the data processing inequality (DPI), you can see that a sufficient statistic preserves all the information about the parameter, in our case about the labels. So S(X) is a sufficient statistic of X with respect to Y if the mutual information between S(X) and Y is the same as that between X and Y. It is a minimal sufficient statistic if, among all the sufficient statistics, it preserves the minimum information about X. That is a variational, practical definition of the minimal sufficient statistic. Then we relaxed it in '99: let's say that I don't keep all the information, I preserve as much of the mutual information as I can, and then you have this variational problem: minimize the mutual information with X subject to a constraint on the mutual information with Y. I noticed by looking at Sandra's slides that my beta is actually the inverse of his beta. So it is zero when you have no information about Y and it goes to infinity when you have all the information about Y. It is just like a temperature in statistical physics, and of course the problem can be formulated as a rate-distortion problem with a very specific distortion function, which is precisely the KL divergence between how well you predict Y from the original X and how well you predict Y from any compression of X, or any statistic of X. It is a perfect sufficient statistic if the distortion is zero; otherwise, the distortion is directly related to the mutual information between Y and \hat X.

Now one slide on the information bottleneck theory. In statistics there is a classical notion called the minimal sufficient statistic; we usually ask whether a statistic is both minimal and sufficient. Briefly, using the DPI one can see that a sufficient statistic preserves all the information about the parameter, which in our setting means all the information about the label Y. So, for predicting Y, S(X) is a sufficient statistic of X when I(S(X);Y) = I(X;Y). S(X) is a minimal sufficient statistic when I(S(X);X) is as small as possible among all sufficient statistics (translator's note: intuitively, S(X) discards everything in X that is irrelevant to Y, hence "minimal"). This is a variational form of the definition of the minimal sufficient statistic. We then relax the requirement: instead of demanding that all the information be preserved, we only keep as much of the mutual information as we can, that is, we minimize I(S(X);X) subject to a constraint on I(S(X);Y). Looking at Sandra's slides, I noticed that my beta is the inverse of his beta: it is zero when nothing is known about Y and goes to infinity when all the information about Y is kept, just like a temperature in statistical physics. Of course the problem can also be formulated as a rate-distortion problem with a particular distortion function. That distortion function is a KL divergence measuring the gap between predicting Y from the original X and predicting Y from the compressed version of X. If the distortion is zero, the statistic is a perfect sufficient statistic; otherwise the distortion is directly related to the mutual information between Y and \hat X.
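Written out explicitly (this is the formulation from the 1999 paper, using the \hat X notation of the paragraph above, so it is a restatement rather than anything new), the relaxed problem is the variational principle

min over p(\hat x | x) of  L = I(X; \hat X) - beta * I(\hat X; Y),

i.e. minimize I(X; \hat X) subject to a constraint on I(\hat X; Y), where beta is the trade-off parameter described above (beta -> 0 keeps no information about Y, beta -> infinity demands all of it). The rate-distortion view uses the distortion d(x, \hat x) = D_KL[ p(y|x) || p(y|\hat x) ], whose expectation under the Markov chain Y -> X -> \hat X is exactly I(X;Y) - I(\hat X;Y); that is why zero distortion corresponds to a sufficient statistic.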

So if you go back to the neural networks, what we have are two lines of inequalities, which I'm going to call the information paths of the network. Essentially I would love to calculate the mutual information between each of the layers (which later I'm going to call T instead of h; it's the same thing) and the input, and between each layer and the output. I know that these two chains of values are monotonically decreasing, again because of the data processing inequality. So the mutual information only goes down: the information about X goes down, and the information about Y goes down. Now notice that for each one of those layers, let me use a partition of the input to think about the values that the hidden layer can take. It can be a hard partition or a soft partition, depending on how you interpret the softness of the units: if it's a hard unit, like a sign, then it's a hard partition; if it's something like a sigmoid, interpreted probabilistically, then it's a soft partition. But essentially any layer is really inducing a partition of the input, which is actually very important; I'm going to talk a lot about those partitions. In the language of information theory we can call this a successive refinement of the relevant information. I'm inducing a code which gets closer and closer to the relevant information as I move through the layers: you start with a very fine code at the input layer, and the layers that follow refine the information in some sense.

Coming back to neural networks, we get two chains of inequalities, which I call the information paths of the network. I will compute the mutual information between each layer and the input, and between each layer and the output (the layers, written T here, are the same hidden variables written h earlier). Along both paths, from the input layer to the output layer, the values decrease monotonically, again because of the DPI: the mutual information of each layer with X and with Y can only go down. Note also that each layer effectively partitions the input. The partition can be hard or soft, depending on how you interpret the units: a hard unit such as the sign function gives a hard partition, while something like a sigmoid, interpreted probabilistically, gives a soft partition. In any case, every layer induces a partition of the input, which matters a great deal, and I will discuss these partitions at length. In the language of information theory, this process is a successive refinement of the relevant information: a code is induced and passed along the layers, getting closer and closer to the relevant information. The input layer starts from a very fine code, and the following layers keep refining the information.
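In symbols (again using T_1, ..., T_k for the layers, my shorthand), the two information paths and the ordering forced by the DPI are:

I(X;Y) >= I(T_1;Y) >= I(T_2;Y) >= ... >= I(T_k;Y) >= I(\hat Y;Y)

H(X) >= I(X;T_1) >= I(X;T_2) >= ... >= I(X;T_k) >= I(X;\hat Y)

so each successive layer can only lose information, both about the input X and about the label Y.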

Another way of thinking about it, before this theorem, is to think of the input as being encoded by a layer. Call this layer Ti; I call the map from the input to the layer the encoder of the layer, and the map from the layer to the output the decoder of the layer. Here comes a recent result which I find quite important. Essentially I argue that for large enough problems, which means when X is large enough, we can actually talk about typical X's, meaning I am going to confine myself to what information theory calls typical sequences or typical variables. When the variables become very large, almost all the values of X, in a very specific sense, become equally probable and completely dominated by the entropy of the source X. This is true for information sources in information theory, for those of you who know a little about it, and it is also true for most of our big learning problems, like image recognition or whatever it is. We have been talking for a long time about large-scale problems, and that's really the beauty of it, because now I can use all the asymptotics of information theory in a way that is completely rigorous and completely okay. So essentially my argument is that when you talk about such a large neural network, the sample complexity, which is how many samples you need in order to achieve a certain accuracy in generalization, is completely determined, asymptotically, by the value of the mutual information of the encoder, which means how much information there is between X and the last hidden layer of the network. And the accuracy, or the generalization error, the probability of making an error on a new sample, is completely determined by the value of the mutual information between the layer and the input (translator's note: he means the output, a slip of the tongue), the desired output Y. So these two quantities are in some sense sufficient to characterize the two main things we care about: the number of samples and the precision, for large enough data. That's a very bold statement; I know that none of you believe it at this point.

Another way to see it is that the network encodes the input. Take a hidden layer Ti: I call the map from the input to Ti the encoder, and the map from Ti to the output the decoder. Recently I obtained a result that I consider very important. I argue that for large enough problems, that is, when the input X is large enough, we can restrict attention to typical X's, what information theory calls typical sequences or typical variables. When the variables are very large, in a specific sense almost all values of X are essentially determined by the entropy of X. As those of you who know some information theory are aware, this holds for information sources, and it also holds for most large-scale learning problems such as image recognition. We have been discussing large-scale learning for a long time, and the beauty of it is that the asymptotics of information theory can then be applied rigorously. So my claim is that for a large neural network, the sample complexity, that is, how many samples are needed to reach a given accuracy, is determined by the mutual information of the encoder, i.e. the information between X and the last hidden layer (translator's note: this should be the last hidden layer of the encoder, namely Ti). The accuracy, or the error, i.e. the probability of misclassifying a new sample, is determined by the information between that hidden layer (again Ti) and the output Y. So these two quantities are enough to characterize the two key things we care about: the number of samples, and the prediction accuracy given enough data. These are bold claims, and you probably find them hard to believe.

So I'm actually going to try to convince you this is true, and then show you why it's so important. Before I show some mathematics I want to show you a nice movie. I know it's early in the morning and you're sleepy... Anyway, I'm going to play quite a lot of these movies, and I want you to understand them. This is what I call the information plane. This is information in bits; actually it's not in bits here, it's in natural-log units, but never mind. It is the information of a layer: each point is the whole layer as one random variable. Forget about individual neurons, they are not important here; it's layers. So this axis is the information between the whole layer and the input variable, I(T;X), and this axis is the information between the whole layer and the output variable. Since this is a binary problem, this axis can get to one bit, which in these units is about 0.7, and this axis can get to about twelve bits, which in these units is about eight point something. Never mind; this is a small problem. I'll show you exactly what the network is, and then I'll argue that this is a very general picture. What I want to show you: there are little spheres, and these belong to the last hidden layer; these are the layer before it, and so on. Each color is a different layer of the network. And there are 50 different random repetitions of the same problem, with different initial conditions and different orders of the examples. So there is some stochasticity here, due to the initial conditions of the network and the order of the training examples.

Now I want to convince you that this is true, and show you why it matters. Before going into the mathematics, let me show you an interesting clip. It is early in the morning, so I know you are all sleepy... never mind. I will show many of these videos and I hope you can follow the idea. Let me introduce what I call the information plane. The information plane is... what the figure shows is information, in bits; actually it is not in bits but computed with the natural log. That does not matter; in any case it is the information of each layer. Note that a layer here means all the random variables of that layer taken together, the whole layer. Forget that they are individual neurons; that is not important, it is the entire layer. For a layer T, one axis is I(T;X) with the input, and the other is I(T;Y) with the output. Since this is a binary problem, the output axis can reach one bit (about 0.7 in these units), and the input axis about twelve bits (about eight in these units). That does not matter much; this is a small problem. Next I will show exactly what the network is, and then argue that the picture is very general. As shown in the figure, there are many small dots: these are the positions of the last hidden layer in the information plane, this is the previous hidden layer, and so on; each color corresponds to one layer of the network. Here I ran 50 experiments with different initial values and different orderings of the training samples, so the different initializations and sample orders introduce some randomness.
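As a rough sketch of how points in such an information plane can be estimated for a small problem (this is my own illustration, not the speaker's code: the binning of activations, the 30-bin choice, and the synthetic data are all assumptions made for the example):

```python
import numpy as np

def discrete_mi(a, b):
    """Estimate I(A;B) in bits from two aligned arrays of discrete symbols."""
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(joint, (ia, ib), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa * pb)[nz])))

def layer_information(activations, x_ids, y_labels, n_bins=30):
    """Information-plane coordinates (I(T;X), I(T;Y)) for one layer.
    T is obtained by binning each unit's activation into n_bins equal-width bins
    and treating the whole binned vector as one discrete symbol (a simplifying choice)."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])
    t_ids = np.array([hash(row.tobytes()) for row in binned])
    return discrete_mi(t_ids, x_ids), discrete_mi(t_ids, y_labels)

# Purely illustrative demo: 256 distinct inputs, random binary labels, and a fake
# 8-unit "hidden layer" made from a random projection passed through tanh.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 12))
x_ids = np.arange(256)                      # every row is its own X value
y_labels = rng.integers(0, 2, size=256)
hidden = np.tanh(inputs @ rng.normal(size=(12, 8)))
print(layer_information(hidden, x_ids, y_labels))   # (I(T;X), I(T;Y)) in bits
```

Tracking these two numbers for every layer over the training epochs, across many random restarts, is what produces the moving clouds of points described next.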

And what you see here are the initial conditions. You see the highlighted point here: this is the input layer, or the first hidden layer. It preserves essentially all the information about the input and all the information about the output, as if I hadn't done anything. But when you start with random initial conditions, the information drops down very quickly, and the last hidden layer essentially has very little information about either the input or the output. So I want to show you how these things evolve when you train the network with stochastic gradient descent.

What you see now is the information plane of the network at the initial moment. Look at the highlighted points: this is the information of the input layer, or rather the first hidden layer. It keeps essentially all the information about the input and the output, because at this point I have done nothing except set random initial conditions. The information then drops very quickly, and the last hidden layer has almost no information about either the input or the output. Next I want to show you how the network evolves as it is trained with stochastic gradient descent.

So watch this; it's quite nice. Essentially what you see is that the layers come up, as you would expect. I mean, the information about the label increases, and then at this point they somehow slow down.

Watch this process; it is quite interesting. At first you can see the information of every layer going up (figure above: epoch 213), which is what everyone would expect; each layer acquires more and more information about the label. At this moment, the rate at which they gain information starts to slow down (figure above: epoch 614).

Then they start to do a very funny thing. They move up and to the left, which means they compress the representation in some sense; they forget the input. Eventually, see the number of epochs as it runs, they converge, believe me. They don't really converge here, but you see that even the last hidden layer is already stuck in one place, while the previous layers still move to the left.

Then something funny happens. They start moving toward the upper left, which means they are compressing the representation, or in other words, forgetting the input. Eventually, note the number of epochs, they converge. "Converge" is not quite the right word, but you can see that the last hidden layer is already stuck at a fixed spot in the information plane, while the earlier layers keep moving to the left.

Let me show it again, because you probably haven't seen all the nice features of this. So again, the first phase comes up very quickly; this happens within 350 or so iterations. From this point something very strange happens: they move slowly and together. I mean, these are 50 different initial conditions of the network, which is why you see all these clusters of points, and eventually they move to the left.

You may not have caught all the interesting details, so let me play it again. In the first phase they rise quickly, which lasts for roughly 350 iterations. After that, something strange happens: they move slowly and stay bunched together. These 50 networks with different initial conditions, the cluster of dots you see, eventually all move toward the upper left.

Just to make this clearer, I'll show you the average over these 50 networks and trace it. So essentially, okay, this is the first phase, 300-something iterations, and then the very, very slow convergence to the left.

To make this clearer, I traced the trajectory of the mean information of the 50 networks. As shown in the figure (epoch 325 above), this is the first phase; then they converge very slowly toward the upper left (epoch 3999 above).

Okay, this is a real neural network, using no regularization, nothing special, no dropout, no tricks of any kind. And you see that out of four thousand epochs, essentially by epoch 350 you get to this point, more or less. So the information increases rather quickly, and from there starts a very, very slow motion upward and to the left for all the layers. This is what I call the compression phase of the network. The main part of this talk is to convince you, first of all, that this is a very general and very real phenomenon, and then to convince you that it is actually an extremely useful phenomenon, in my opinion the important part of the SGD algorithm, because it is not really about fitting the data but about forgetting it. Even with very small data you get to this point, very small data. It is this part, the compression, while keeping the relevant information high, which is so useful for the network. So let me go back to the theory.

This network was trained with no regularization, no early stopping, no dropout, no tricks of any kind. You can see that over the 4000 epochs, the information rises quickly up to around epoch 350, and after that all the layers move slowly toward the upper left. I call this the compression phase. In this talk I first want to convince you that this is a very general and very real phenomenon, and then that it is an extremely useful one. In my view it is the most important part of the SGD algorithm, because what matters is not so much fitting the data as forgetting it. Even with very little data you reach this point; it is this compression, while keeping the relevant information high, that is so useful for the network. Now let us go back to the theory.
