In this article we use the MNIST dataset to introduce the basic principles and methods of classification with a neural network.

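1. Forward computation of a neural network

Viewed as a black box, a neural network has roughly this structure:

Input (layer): an image

Computation: the neural network

Output (layer): the probability (or score) that the image belongs to each class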

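Let's go through each part.

The input is an image, or more precisely the vector obtained by flattening its pixel values. For MNIST this is a vector of shape ( 1, 28*28 ). Through training, we want the network to extract features from these pixel values on its own, so that when it sees an image it has never encountered before, it can still classify it from the features it recognizes.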
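As a rough illustration of the flattening step, the sketch below turns a 28 x 28 image into a (1, 784) row vector. The image here is random made-up data standing in for a real MNIST digit, and the variable names are only for illustration.

```python
import numpy as np

# Made-up stand-in for one MNIST image: 28 x 28 grayscale pixels in 0..255.
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(28, 28))

# Flatten row by row into a single row vector of shape (1, 784).
x = image.reshape(1, 28 * 28).astype(np.float64)

print(x.shape)
```

Because the flattening is row by row, element 28 of the vector is the first pixel of the second image row.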

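Computation:

As is well known, the basic unit is a single neuron, which computes a weighted sum of its inputs.

Written as a vector, its weights are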

W1 = [ w1, w2, w3 ]

Suppose the input image vector is I = [ i1, i2, i3 ]

Then the score we compute is S = W1 * I = w1 * i1 + w2 * i2 + w3 * i3

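This alone is not expressive enough, so we also add a bias. Since a single neuron produces a single score, the bias here is one scalar, b1.

So the score we actually compute is S = W1 * I + b1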
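The single-neuron score can be sketched in a few lines. The weights, inputs, and bias below are made-up numbers, and the bias is treated as one scalar because a single neuron yields a single score.

```python
import numpy as np

W1 = np.array([0.2, -0.5, 0.1])   # made-up weights w1, w2, w3
I = np.array([1.0, 2.0, 3.0])     # made-up inputs i1, i2, i3
b1 = 1.0                          # made-up scalar bias

# Weighted sum of the inputs plus the bias: w1*i1 + w2*i2 + w3*i3 + b1
S = np.dot(W1, I) + b1

print(S)
```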

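This is the most simplified version, but the core idea really is this simple.

With this structure we can compute only one score. To decide which class an input image belongs to, we need one score per class, so we now have to extend the weights.

Originally the weights were

W1 = [ w1, w2, w3 ]

Now we change the weights to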

$\begin{matrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{matrix}$

The input is still

$I = [ i_1, i_2, i_3 ]$

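After the computation we get a vector of scores

$S = [ s_1, s_2, s_3 ]$

These three values are the raw scores for the three classes. To turn them into probabilities we compute their Softmax values. The details are in another article of mine; here it is enough to think of Softmax as converting the scores into per-class probabilities.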
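A minimal sketch of the three-class case follows. All numbers are made up, and the row-vector-times-matrix convention (input on the left, one column of weights per class, one bias per class) is an assumption chosen to match the full code later in this article.

```python
import numpy as np

I = np.array([[1.0, 2.0, 3.0]])        # input, shape (1, 3)
W = np.array([[0.1,  0.2,  0.3],
              [0.0, -0.1,  0.2],
              [0.5,  0.0, -0.2]])      # shape (3, 3); column j holds the weights for class j
B = np.array([[0.0, 0.1, -0.1]])       # one made-up bias per class

# One matrix product gives all three class scores at once.
S = np.matmul(I, W) + B

# The class with the highest raw score is the prediction.
predicted_class = int(np.argmax(S, axis=1)[0])

print(S, predicted_class)
```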

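In practice, processing one image at a time is too slow, so we usually compute on a whole batch of images at once. The principle is the same; the input matrix simply changes from [ 1 * N ] to [ M * N ], where M is the number of images in the batch and N is the number of pixels per image.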
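The sketch below shows that the batched product yields the same scores as processing the images one at a time. The batch size, random data, and zero bias are made up for illustration.

```python
import numpy as np

M, N, K = 4, 784, 10                   # made-up batch size, pixels per image, classes
rng = np.random.RandomState(0)
batch = rng.rand(M, N)                 # M flattened images stacked as rows
W = 0.01 * rng.randn(N, K)
b = np.zeros((1, K))

# One matrix product scores the whole batch; b is broadcast across the M rows.
scores = np.matmul(batch, W) + b

# Scoring only the first image gives the first row of the batched result.
single = np.matmul(batch[0:1], W) + b

print(scores.shape, single.shape)
```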

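Next we introduce the concept of a hidden layer.

In the example above, the only nodes that hold data are one input layer and one output layer. That gives just a single level of abstraction, which is not expressive enough, so we add more nodes in between: the hidden layers.

A hidden layer can be understood as a further abstraction of the input vector; after abstracting step by step, we arrive at the probability of each class.

A hidden layer is structured much like the input layer, except that each step of computation abstracts the data further, so the number of nodes shrinks from layer to layer.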
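To make the shrinking layer sizes concrete, here is a small sketch of a forward pass through two hidden layers. The sizes 784, 100, 20, 10 are the ones used in the full code later in this article; the input batch is random made-up data, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
layer_sizes = [28 * 28, 100, 20, 10]   # input, hidden 1, hidden 2, output

# One (W, b) pair per layer transition, initialized with small random values.
params = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = 0.01 * rng.randn(n_in, n_out)
    b = 0.01 * rng.randn(1, n_out)
    params.append((W, b))

x = rng.rand(2, 28 * 28)               # made-up batch of 2 flattened images

# Hidden layers: linear step followed by ReLU.
h = x
for W, b in params[:-1]:
    h = np.maximum(0, np.matmul(h, W) + b)

# Output layer: linear step only, one score per class.
W_out, b_out = params[-1]
scores = np.matmul(h, W_out) + b_out

print(h.shape, scores.shape)
```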

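Neurons and what they represent

The meaning of the matrix operations

Nonlinear units

The nonlinear unit is a very important part of a neural network. It is simple in structure and simple to implement, yet indispensable: without a nonlinearity between them, stacked linear layers collapse into a single linear transformation, so adding layers would add no expressive power.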
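A tiny hand-picked example illustrates why the nonlinear unit matters. The matrices below are made up; biases are left out to keep the arithmetic easy to follow.

```python
import numpy as np

x = np.array([[1.0, -2.0]])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
W2 = np.array([[1.0],
               [1.0]])

# Two linear layers in a row are the same as one linear layer with weights W1 * W2.
two_linear = np.matmul(np.matmul(x, W1), W2)
one_linear = np.matmul(x, np.matmul(W1, W2))

# With a ReLU in between, the negative activation is clipped to 0,
# so the result can no longer be written as a single matrix product.
with_relu = np.matmul(np.maximum(0, np.matmul(x, W1)), W2)

print(two_linear, one_linear, with_relu)
```

Here the two purely linear versions both give -1, while the version with ReLU gives 1.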

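2. Softmax computation

Once the network has computed a score for each class for the current feature vector (that is, the image), we need to convert those scores into the probability that the feature vector belongs to each class. This conversion is done with Softmax.

See this article for details: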
http://blog.csdn.net/superCally/article/details/54291865
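For readers who want something runnable right away, here is a minimal Softmax sketch. The function name `softmax` and the max-subtraction trick for numerical stability are choices made for this example, not something taken from the linked article.

```python
import numpy as np

def softmax(scores):
    """Convert each row of raw scores into probabilities that sum to 1."""
    # Subtracting the row-wise maximum does not change the result
    # but keeps np.exp from overflowing on large scores.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

probs = softmax(np.array([[1.0, 2.0, 3.0],
                          [1000.0, 1000.0, 1000.0]]))
print(probs)
```

The second row shows the benefit of the shift: equal scores of 1000 come out as three equal probabilities instead of overflowing.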

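3. Gradient descent

This is a very important part of the neural network algorithm. Softmax gives us the probability that the current vector belongs to each class, and the labeled data tells us its correct class. From these two we can compute a loss, and use that loss to adjust the weights and optimize the network.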
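As a sketch of how probabilities and labels combine into a loss, the example below computes the cross-entropy loss (the negative log of the probability given to the correct class, which is what the full code later in this article uses) and its gradient with respect to the scores. The probabilities and labels are made up.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])    # made-up Softmax outputs for 2 images
labels = np.array([0, 1])              # made-up correct classes

rows = np.arange(probs.shape[0])

# Loss: mean of -log(probability assigned to the correct class).
correct_probs = probs[rows, labels]
loss = np.mean(-np.log(correct_probs))

# Gradient of that loss with respect to the scores:
# the probabilities, minus 1 at the correct class, averaged over the batch.
dscore = probs.copy()
dscore[rows, labels] -= 1
dscore /= probs.shape[0]

print(loss)
print(dscore)
```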

For example, suppose we have computed

$\frac{\partial Loss}{\partial W_1} = dW_1$

Then we can update the network's weights:

$W_1 += -stepsize * dW_1$

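If we compare gradient descent to walking down a mountain, computing the gradient means finding the direction in which the slope climbs most steeply at the current position, and then taking one step in the opposite direction, downhill. This kind of optimization can obviously get stuck at a local extremum, so there are many methods that improve on it; we will discuss them in detail in other articles.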
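The update rule can be tried out on a toy problem. The one-dimensional loss below, with its minimum at w = 3, is made up purely for illustration; the step size is also an arbitrary choice.

```python
def loss(w):
    """Made-up loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0          # starting position on the "mountain"
stepsize = 0.1

for _ in range(100):
    dw = grad(w)             # direction of steepest ascent
    w += -stepsize * dw      # step in the opposite direction

print(w, loss(w))
```

Each step shrinks the distance to the minimum by a constant factor, so w ends up very close to 3.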

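4. Backpropagation

Backpropagation is the hardest part of a neural network to understand (though it is not really that hard). Once we have computed the Loss, and from it the partial derivative of the Loss with respect to Wn, we also want the partial derivative of the Loss with respect to Wn-1, so that we can adjust the weights of the earlier layers as well.

Backpropagation has another benefit: each partial derivative only has to be computed once and is then reused by the layers before it. The amount of data here is not especially large, but this still improves efficiency.

The forward pass gives us the score, and the Loss is computed from it by definition, so we can compute the partial derivative of the Loss with respect to the score: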

$dscore = \frac{\partial Loss}{\partial score}$

As described earlier, the score is computed in the forward pass as follows, assuming the network has only one layer:

$score = Input * W_1 + b_1$

So the partial derivative with respect to W1 is

$dW_1 = \frac{\partial Loss}{\partial W_1} = \frac{\partial Loss}{\partial score} * \frac{\partial score}{\partial W_1} = dscore * Input$

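This is simply the chain rule, nothing more. For a batch, Input and dscore are matrices, so the product is written as the transpose of Input times dscore, which gives a result with the same shape as W1; this is the form the code uses.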
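One way to convince yourself of this formula is to compare it with a numerical gradient. The sketch below does that for a single layer; the squared-score loss is a made-up toy loss chosen so that dscore is easy to write down, and all data is random.

```python
import numpy as np

rng = np.random.RandomState(0)
Input = rng.randn(4, 3)                # batch of 4 inputs with 3 features each
W1 = rng.randn(3, 2)
b1 = np.zeros((1, 2))

def loss_fn(W):
    """Toy loss (an assumption for this example): half the sum of squared scores."""
    score = np.matmul(Input, W) + b1
    return 0.5 * np.sum(score ** 2)

# Analytic gradient via the chain rule.
score = np.matmul(Input, W1) + b1
dscore = score                          # dLoss/dscore for the toy loss
dW1 = np.matmul(Input.T, dscore)        # same shape as W1
db1 = np.sum(dscore, axis=0, keepdims=True)

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
num_dW1 = np.zeros_like(W1)
for r in range(W1.shape[0]):
    for c in range(W1.shape[1]):
        W_plus = W1.copy()
        W_plus[r, c] += eps
        W_minus = W1.copy()
        W_minus[r, c] -= eps
        num_dW1[r, c] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

print(np.max(np.abs(dW1 - num_dW1)))
```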

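The partial derivative of the ReLU function deserves special attention. Its expression is very simple:

$R(x) = \max(0, x)$

This is a piecewise function, so we differentiate it piece by piece.

With ReLU added, our expression looks like this:

$score = \max(0, Input * W_1 + b_1)$

When Input * W1 + b1 is greater than 0, the expression is the original one,

$score = Input * W_1 + b_1$

and the partial derivative is computed exactly as before.

When Input * W1 + b1 is less than or equal to 0, the expression becomes

$score = 0$

so the partial derivative is simply 0.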
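In code, the two cases amount to a mask on the incoming gradient. The values below are made up; `upstream` stands for the gradient of the Loss with respect to the ReLU output.

```python
import numpy as np

pre_activation = np.array([[-1.0, 0.5, 2.0, -3.0]])   # made-up Input * W1 + b1
score = np.maximum(0, pre_activation)                  # forward pass through ReLU

upstream = np.array([[0.1, 0.2, 0.3, 0.4]])           # made-up dLoss/dscore

# Backward pass: the gradient passes through where the pre-activation was positive
# and is set to 0 everywhere else.
d_pre = upstream.copy()
d_pre[pre_activation <= 0] = 0

print(score)
print(d_pre)
```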

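Implementation

Talk is cheap, Show me the code

After all that theory, let's return to the code and see how it is actually implemented; that will give us a better understanding of the theory.

Here is the complete code first: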

# -*- coding:utf-8
import numpy as np
import struct
import matplotlib.pyplot as plt
import random
import pickle
class Data:
    def __init__(self):

        self.K = 10
        self.N = 60000
        self.M = 10000
        self.BATCHSIZE = 2000
        self.reg_factor = 1e-3
        self.stepsize = 1e-2
        self.train_img_list = np.zeros((self.N, 28 * 28))
        self.train_label_list = np.zeros((self.N, 1))

        self.test_img_list = np.zeros((self.M, 28 * 28))
        self.test_label_list = np.zeros((self.M, 1))

        self.loss_list = []
        self.init_network()

        self.read_train_images( 'train-images-idx3-ubyte')
        self.read_train_labels( 'train-labels-idx1-ubyte')

        self.train_data = np.append( self.train_img_list, self.train_label_list, axis = 1 )


        self.read_test_images('t10k-images-idx3-ubyte')
        self.read_test_labels('t10k-labels-idx1-ubyte')

    def predict(self):
        # Forward pass on the test set: two ReLU hidden layers, then a linear output layer.
        hidden_layer1 = np.maximum(0, np.matmul(self.test_img_list, self.W1) + self.b1)
        hidden_layer2 = np.maximum(0, np.matmul(hidden_layer1, self.W2) + self.b2)

        # No ReLU on the output scores, matching the gradient used in train_network.
        scores = np.matmul(hidden_layer2, self.W3) + self.b3

        # The predicted class is the index of the highest score.
        prediction = np.argmax( scores, axis = 1 )
        prediction = np.reshape( prediction, ( self.M, 1 ) )
        print(prediction.shape)
        print(self.test_label_list.shape)
        accuracy = np.mean( prediction == self.test_label_list )
        print('The accuracy is: {}'.format(accuracy))
        return

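    def train(self):
        # Run 10000 training steps, each on a freshly sampled random batch.
        for i in range( 10000 ):
            # Shuffle the rows (pixels plus label) and take the first BATCHSIZE rows as the batch.
            np.random.shuffle( self.train_data )
            img_list= self.train_data[:self.BATCHSIZE,:-1]
            label_list = self.train_data[:self.BATCHSIZE, -1:]
            print("Train Time: {}".format(i))
            self.train_network( img_list, label_list )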


    def train_network(self, img_batch_list, label_batch_list):

        # forward pass: two ReLU hidden layers, then a linear output layer
        train_example_num = img_batch_list.shape[0]
        hidden_layer1 = np.maximum( 0, np.matmul( img_batch_list, self.W1 ) + self.b1 )
        hidden_layer2 = np.maximum( 0, np.matmul( hidden_layer1, self.W2 ) + self.b2 )

        # no ReLU on the output scores: the backward pass below treats this layer as linear
        scores = np.matmul( hidden_layer2, self.W3 ) + self.b3

        # softmax; subtracting the row-wise max avoids overflow in exp and leaves probs unchanged
        scores_e = np.exp( scores - np.max( scores, axis = 1, keepdims= True ) )
        scores_e_sum = np.sum( scores_e, axis = 1, keepdims= True )
        probs = scores_e / scores_e_sum

        # cross-entropy loss: -log of the probability assigned to the correct class
        labels = label_batch_list.astype( int ).ravel()
        correct_probs = probs[ np.arange( train_example_num ), labels ]
        loss_list = -np.log( correct_probs )

        # mean data loss plus L2 regularization on all weight matrices
        loss = np.mean( loss_list ) + \
               0.5 * self.reg_factor * np.sum( self.W1 * self.W1 ) + \
               0.5 * self.reg_factor * np.sum( self.W2 * self.W2 ) + \
               0.5 * self.reg_factor * np.sum( self.W3 * self.W3 )

        self.loss_list.append( loss )
        print("{} {}".format(loss, len(self.loss_list)))
        # backpropagation

        dscore = np.zeros( (train_example_num, self.K) )
        for i in range( train_example_num ):
            dscore[ i ][ : ] = probs[ i ][ : ]
            dscore[ i ][ int(label_batch_list[ i ]) ] -= 1

        dscore /= train_example_num


        dW3 = np.dot( hidden_layer2.T, dscore )
        db3 = np.sum( dscore, axis = 0, keepdims= True )

        dh2 = np.dot( dscore, self.W3.T )
        dh2[ hidden_layer2 <= 0 ] = 0

        dW2 = np.dot( hidden_layer1.T, dh2 )
        db2 = np.sum( dh2, axis = 0, keepdims= True )

        dh1 = np.dot( dh2, self.W2.T )
        dh1[ hidden_layer1 <= 0 ] = 0

        dW1 = np.dot( img_batch_list.T, dh1 )
        db1 = np.sum( dh1, axis = 0, keepdims= True )





        dW3 += self.reg_factor * self.W3
        dW2 += self.reg_factor * self.W2
        dW1 += self.reg_factor * self.W1




        self.W3 += -self.stepsize * dW3
        self.W2 += -self.stepsize * dW2
        self.W1 += -self.stepsize * dW1

        self.b3 += -self.stepsize * db3
        self.b2 += -self.stepsize * db2
        self.b1 += -self.stepsize * db1


        return


    def init_network(self):
        self.W1 = 0.01 * np.random.randn( 28 * 28, 100 )
        self.b1 = 0.01 * np.random.randn( 1, 100 )

        self.W2 = 0.01 * np.random.randn( 100, 20 )
        self.b2 = 0.01 * np.random.randn( 1, 20 )

        self.W3 = 0.01 * np.random.randn( 20, self.K )
        self.b3 = 0.01 * np.random.randn( 1, self.K )

    def read_train_images(self,filename):
        # Read the whole IDX image file; the with-block closes the file afterwards.
        with open(filename, 'rb') as binfile:
            buf = binfile.read()
        index = 0
        # Header: magic number, image count, rows, columns (big-endian unsigned 32-bit ints).
        magic, self.train_img_num, self.numRows, self.numColums = struct.unpack_from('>IIII', buf, index)
        print('{} {} {} {}'.format(magic, self.train_img_num, self.numRows, self.numColums))
        index += struct.calcsize('>IIII')
        for i in range(self.train_img_num):
            # Each image is 784 unsigned bytes, stored row by row.
            im = struct.unpack_from('>784B', buf, index)
            index += struct.calcsize('>784B')
            im = np.array(im)
            im = im.reshape(1, 28 * 28)
            self.train_img_list[ i , : ] = im

            # plt.imshow(im, cmap='binary')  # display in black and white
            # plt.show()

    def read_train_labels(self,filename):
        binfile = open(filename, 'rb')
        index = 0
        buf = binfile.read()
        binfile.close()

        magic, self.train_label_num = struct.unpack_from('>II', buf, index)
        index += struct.calcsize('>II')

        for i in range(self.train_label_num):
            # for x in xrange(2000):
            label_item = int(struct.unpack_from('>B', buf, index)[0])
            self.train_label_list[ i , : ] = label_item
            index += struct.calcsize('>B')

    def read_test_images(self, filename):
        # Read the whole IDX image file; the with-block closes the file afterwards.
        with open(filename, 'rb') as binfile:
            buf = binfile.read()
        index = 0
        # Header: magic number, image count, rows, columns (big-endian unsigned 32-bit ints).
        magic, self.test_img_num, self.numRows, self.numColums = struct.unpack_from('>IIII', buf, index)
        print('{} {} {} {}'.format(magic, self.test_img_num, self.numRows, self.numColums))
        index += struct.calcsize('>IIII')
        for i in range(self.test_img_num):
            # Each image is 784 unsigned bytes, stored row by row.
            im = struct.unpack_from('>784B', buf, index)
            index += struct.calcsize('>784B')
            im = np.array(im)
            im = im.reshape(1, 28 * 28)
            self.test_img_list[i, :] = im

    def read_test_labels(self,filename):
        binfile = open(filename, 'rb')
        index = 0
        buf = binfile.read()
        binfile.close()

        magic, self.test_label_num = struct.unpack_from('>II', buf, index)
        index += struct.calcsize('>II')

        for i in range(self.test_label_num):
            # for x in xrange(2000):
            label_item = int(struct.unpack_from('>B', buf, index)[0])
            self.test_label_list[i, :] = label_item
            index += struct.calcsize('>B')

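def main():
    data = Data()
    data.train()
    data.predict()
    # Save the recorded loss values; the file is opened in binary mode and closed by the with-block.
    with open( "gradient_data", "wb" ) as f:
        pickle.dump( data.loss_list, f, 0 )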

if __name__ == '__main__':
    main()

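As you can see, the overall flow of the code is: read the data, train on the training set, then run predictions on the test set.

The main functions are the following: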

def init_network(self)

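This function initializes the network: it uses numpy.random to generate matrices of the given shapes, with values drawn from a normal distribution and scaled by 0.01.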

self.read_train_images()
self.read_train_labels()

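These two functions read the image data and the label data. The file format is covered in detail in a separate article, for readers who are interested.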

def train(self):

This function calls the function that trains the network; in the code above it is called 10000 times.

def train_network(self, img_batch_list,label_batch_list)

This is the main training function.

Training uses randomly sampled batches, which trades a little training accuracy for much better training efficiency.
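The code above draws a batch by shuffling the whole training array on every step. As a possible alternative (an assumption of this sketch, not what the article's code does), a batch can be drawn by sampling row indices, which avoids moving all the data around. The array sizes and names below are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
N, BATCHSIZE = 1000, 32                       # made-up dataset and batch sizes
images = rng.rand(N, 28 * 28)                 # stand-in for the training images
labels = rng.randint(0, 10, size=(N, 1))      # stand-in for the training labels

# Pick BATCHSIZE distinct row indices at random, then index both arrays with them
# so that each image stays paired with its label.
idx = rng.choice(N, size=BATCHSIZE, replace=False)
img_batch = images[idx]
label_batch = labels[idx]

print(img_batch.shape, label_batch.shape)
```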

def __init__(self)

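This function initializes the whole system: it sets the hyperparameters, reads the files, and so on.

Let's look at how the loss changes over the course of training:

(Figure: loss curve over training iterations)

The Loss jumps abruptly at a few iterations in the middle, but the overall result is still good, and the final recognition accuracy reaches 97%.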

References

http://cs231n.stanford.edu/

Recognizing the MNIST dataset (2): implementing a neural network in Python