Getting Started with PyTorch: Examples

The examples in this article come from the official PyTorch website. Original article: Learning PyTorch with Examples — PyTorch Tutorials 1.9.1+cu102 documentation

We will use a problem of fitting y=sin(x) with a third order polynomial as our running example. The network will have four parameters, and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.

In other words, we fit y = sin(x) with a third order polynomial. The model has four parameters (a, b, c, d), and gradient descent is used to find their optimal values by minimizing the Euclidean distance between the prediction and the true output.

NumPy

Even though this is a PyTorch tutorial, implementing the code with NumPy first helps with understanding:

import numpy as np
import math

# Create the input x and output y
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize the four weights
a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

# Set the learning rate
learning_rate = 1e-6


for t in range(2000):
    # Forward pass: compute the prediction y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute the loss; here loss = sum((y_pred - y)^2) (the version I learned divides by m or 2m, but that only rescales the gradients)
    loss = np.square(y_pred - y).sum()
    
    # Print the loss every 100 iterations (the schedule is arbitrary, just to watch it shrink)
    if t % 100 == 99:
        print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to a, b, c, d.
    # Each gradient is the partial derivative of the loss with respect to that parameter; see the derivation below the code.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update the weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')

For the gradients part:

$$loss = \sum_{i=1}^{2000}(y_{pred}-y)^2=\sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2$$

Taking the partial derivatives of the expression above:

$$\frac{\partial loss}{\partial a} = \frac{\partial \sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2}{\partial a} = \frac{\partial \sum_{i=1}^{2000}(y_{pred}-y)^2}{\partial (y_{pred}-y)} \cdot \frac{\partial (a+bx+cx^2+dx^3-y)}{\partial a} = \sum_{i=1}^{2000}2(y_{pred}-y)$$

$$\frac{\partial loss}{\partial b} = \frac{\partial \sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2}{\partial b} = \frac{\partial \sum_{i=1}^{2000}(y_{pred}-y)^2}{\partial (y_{pred}-y)} \cdot \frac{\partial (a+bx+cx^2+dx^3-y)}{\partial b} = \sum_{i=1}^{2000}2x(y_{pred}-y)$$

Similarly:

$$\frac{\partial loss}{\partial c} = \sum_{i=1}^{2000}2x^2(y_{pred}-y)$$

$$\frac{\partial loss}{\partial d} = \sum_{i=1}^{2000}2x^3(y_{pred}-y)$$
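
These analytic gradients are easy to sanity-check numerically. Below is a minimal sketch (my own addition, not part of the original tutorial) that compares grad_a and grad_b from the formulas above against central finite differences; the variable names mirror the NumPy code.

import math
import numpy as np

x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)
a, b, c, d = np.random.randn(4)

def loss_fn(a, b, c, d):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    return np.square(y_pred - y).sum()

# analytic gradients from the derivation above
grad_y_pred = 2.0 * (a + b * x + c * x ** 2 + d * x ** 3 - y)
grad_a = grad_y_pred.sum()
grad_b = (grad_y_pred * x).sum()

# central finite differences
eps = 1e-6
num_a = (loss_fn(a + eps, b, c, d) - loss_fn(a - eps, b, c, d)) / (2 * eps)
num_b = (loss_fn(a, b + eps, c, d) - loss_fn(a, b - eps, c, d)) / (2 * eps)

print(grad_a, num_a)  # the two values should agree to several significant digits
print(grad_b, num_b)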

PyTorch

If all of this can be written in NumPy, why use PyTorch at all?

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients.

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

NumPy provides an n-dimensional array object and many convenient functions for working with it; it is a general framework for scientific computing. However, NumPy knows nothing about computation graphs, deep learning, or gradients, and it cannot use GPUs to accelerate its numerical computations, so NumPy alone is not enough for modern deep learning. That is why we use PyTorch.
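
As a small illustration (my own addition, not from the tutorial), PyTorch code can be pointed at a GPU just by changing the device; torch.cuda.is_available() lets a script fall back to the CPU when no GPU is present.

import torch

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.linspace(-1.0, 1.0, 5, device=device)
print(x.device)  # prints cuda:0 on a GPU machine, cpu otherwise

Here is the same polynomial-fitting example rewritten with PyTorch tensors: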

import torch
import math


dtype = torch.float
device = torch.device("cpu")
# Uncomment the line below to run on a GPU
# device = torch.device("cuda:0") 

# Create the input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize the weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute the prediction
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to a, b, c, d
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update the weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

autograd

PyTorch: Tensors and autograd (automatic differentiation)

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad=True then x.grad is another Tensor holding the gradient of x with respect to some scalar value.

Compared with NumPy, besides the advantages above, we also no longer have to write the backward pass by hand, because PyTorch's autograd package can compute the backward pass of a neural network automatically. When we use autograd, the forward pass defines a computational graph: the nodes of the graph are tensors, and the edges are functions that produce output tensors from input tensors. Backpropagating through this graph then lets us obtain the gradients easily.

Although this sounds complicated, it is very simple to use in practice. Each tensor represents a node in the computational graph. If x is a tensor with x.requires_grad=True, then x.grad is another tensor holding the gradient of x with respect to some scalar value.
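
A tiny example (my own, not from the tutorial) of what requires_grad and .grad mean in isolation: for y = x^2 at x = 3, calling backward() should leave dy/dx = 6 in x.grad.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2      # the forward pass builds the computational graph
y.backward()    # the backward pass fills x.grad with dy/dx
print(x.grad)   # tensor(6.)

With that in mind, here is the polynomial example again, this time letting autograd compute the gradients: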

import torch
import math

dtype = torch.float
device = torch.device("cpu")
# Uncomment the line below to run on a GPU
# device = torch.device("cuda:0")  

# Create Tensors to hold input and outputs.
# By default requires_grad=False, meaning we do not need to compute gradients
# with respect to these tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize the weights. Setting requires_grad=True tells autograd
# that we want gradients with respect to these tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # loss is now a scalar (0-dimensional) tensor;
    # loss.item() gets the Python number held in it
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call computes the gradient
    # of the loss with respect to every tensor with requires_grad=True.
    # Afterwards a.grad, b.grad, c.grad and d.grad hold those gradients.
    loss.backward()

    # Manually update the weights using gradient descent.
    # Wrap the update in torch.no_grad(): the weights have requires_grad=True,
    # but we do not want autograd to record these update operations.
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating the weights.
        # Gradients accumulate across backward() calls, so they must be cleared
        # before the next iteration or the values would keep adding up.
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Tensors containing input data.

Under the hood, each primitive autograd operator is really two functions that operate on tensors:

  • Forward pass: computes the output tensors from the input tensors
  • Backward pass: receives the gradient of the output tensors with respect to some scalar value and computes the gradient of the input tensors with respect to that same scalar

In PyTorch we can implement our own forward and backward passes by defining a subclass of torch.autograd.Function. We can then use the new autograd operator by constructing an instance and calling it like a function.

Previously our model was $y=a+bx+cx^2+dx^3$. Now we change it to $y=a+bP_3(c+dx)$, where $P_3(x)=\frac{1}{2}\left(5x^3-3x\right)$ is the Legendre polynomial of degree three. We implement this new model with our own custom autograd function; its backward method needs the derivative of $P_3$.
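
One small added step, not spelled out in the original tutorial: the backward method below returns grad_output multiplied by $P_3'(\text{input})$, where

$$P_3'(x) = \frac{1}{2}\left(15x^2-3\right) = \frac{3}{2}\left(5x^2-1\right)$$

which is exactly the 1.5 * (5 * input ** 2 - 1) factor in the code.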

# -*- coding: utf-8 -*-
import torch
import math


class LegendrePolynomial3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive an input tensor and return an output tensor.
        ctx is a context object that can be used to stash information for the backward pass.
        You can cache arbitrary objects for use in the backward pass using the
        ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)


dtype = torch.float
device = torch.device("cpu")

# Create tensors to hold the input and output
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)


# Create the weights for y = a + b * P3(c + d * x); we need four weights a, b, c, d.
# They are initialized fairly close to the correct result to ensure convergence.
# (I do have a question here: how are you supposed to know the correct answer in advance?)
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our custom Function we use its .apply method, aliased here as P3
    P3 = LegendrePolynomial3.apply

    y_pred = a + b * P3(c + d * x)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum()
    if t % 500 == 0:
        print(t, loss.item())

    # Use autograd to compute the backward pass
    loss.backward()

    # Update the weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients: they accumulate across backward() calls,
        # so they must be cleared after each update or the values would keep adding up
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
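
As an extra check (my own addition, not part of the original tutorial), torch.autograd.gradcheck can compare the hand-written backward pass against numerically computed gradients; it expects double-precision inputs with requires_grad=True.

from torch.autograd import gradcheck

inp = torch.randn(20, dtype=torch.double, requires_grad=True)
# returns True if the analytic and numerical gradients match within tolerance
print(gradcheck(LegendrePolynomial3.apply, (inp,), eps=1e-6, atol=1e-4))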

Neural network modules


Reposted from juejin.im/post/7016596785254629390