Autoencoders in Deep Learning

Noisy data remains one of the most common machine learning problems that keeps us data scientists up at night.
With high-dimensional unstructured data (images, audio, text), figuring out how to extract useful feature information is just as troublesome.
Fortunately, we can now draw on a range of techniques for dimensionality reduction and reconstruction of data, one of which is the autoencoder.

What is an autoencoder

An autoencoder is a type of artificial neural network used for data compression: it compresses the input data into a smaller code that can then be decoded back into the original data. It can be viewed as an unsupervised learning algorithm because it does not require labeled data.


The structure of the autoencoder

Let's start with a quick overview of the architecture of an autoencoder. An autoencoder consists of the following three parts:

  1. Encoder: A module that compresses the input data (from the training, validation, and test sets) into an encoded representation, which is typically several orders of magnitude smaller than the input.
  2. Bottleneck: Contains the compressed knowledge representation and is therefore the most important part of the network.
  3. Decoder: A module that helps the network "decompress" the knowledge representation and reconstruct the data from its encoded form, which is then compared to the ground truth.

The whole architecture looks like this:

[Figure: overall autoencoder architecture — encoder, bottleneck, decoder]

Relationship between Encoder, Bottleneck and Decoder

Encoder: a set of convolutional blocks followed by pooling blocks that compresses the model's input into a compact representation called the bottleneck.

Behind the bottleneck is the decoder, consisting of a series of upsampling modules to convert the compressed features back into image form.

For simple autoencoders, the output is expected to be the same as the input but denoised.

But for a variational autoencoder, the output is an entirely new image, formed using the input information provided by the model.

Bottleneck: the most important, and also the smallest, part of the network. The bottleneck exists to restrict the flow of information from the encoder to the decoder, allowing only the most critical information through. Since the bottleneck is designed to capture as much of the input's information as possible, we can say that it forms a knowledge representation of the input.

Therefore, the encoder-decoder structure helps us extract the most useful information from images and establish useful relationships between the various inputs inside the network. As a compressed representation of the input, the bottleneck also prevents the neural network from memorizing the input and overfitting the data. As a rule of thumb: the smaller the bottleneck, the lower the risk of overfitting. However, a very small bottleneck limits how much information can be stored, increasing the chance that important information is lost through the encoder's pooling layers.

Decoder: a set of upsampling and convolution blocks that reconstructs the output from the bottleneck. Since the input to the decoder is the compressed knowledge representation, the decoder acts as a "decompressor" and rebuilds the image from its latent attributes.
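
To make these three parts concrete, below is a minimal PyTorch sketch of an encoder-bottleneck-decoder network. The 28x28 grayscale input and the layer sizes are illustrative assumptions, not the exact architectures used in the implementations later in this post.

import torch
from torch import nn

class SimpleAutoencoder(nn.Module):
    def __init__(self, bottleneck_dim=16):
        super().__init__()
        # Encoder: convolution + downsampling blocks that compress the input.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(True),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, bottleneck_dim),      # Bottleneck: compressed code
        )
        # Decoder: upsampling blocks that rebuild the image from the code.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 32 * 7 * 7),
            nn.ReLU(True),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)      # compressed knowledge representation
        return self.decoder(code)   # reconstruction compared against the input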

How to train an autoencoder

Before training an autoencoder, you need to set four hyperparameters:

  • Code size: The code size, or bottleneck size, is the most important hyperparameter for tuning autoencoders. It determines how much the data is compressed and also acts as a regularization term.
  • Number of layers: Like all neural networks, an important hyperparameter for tuning autoencoders is the depth of the encoder and decoder. Higher depths increase model complexity, lower depths process faster.
  • Nodes per layer: The number of nodes per layer defines the weights we use for each layer. Typically, the number of nodes in each subsequent layer in an autoencoder gradually decreases, as the input to each layer gets smaller and smaller across layers.
  • Reconstruction Loss: The loss function we use to train the autoencoder is highly dependent on the type of input and output we want the autoencoder to adapt to. If we are dealing with image data, the most commonly used loss functions for reconstruction are Mean Squared Error (MSE Loss) and L1 Loss. If the input and output are in the range [0, 1], such as the MNIST dataset, then we can also use binary cross-entropy as the reconstruction loss.
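
To illustrate the reconstruction-loss options just mentioned, here is a small sketch comparing them on dummy tensors (the shapes and values are made-up placeholders):

import torch
from torch import nn

# Dummy batch of inputs and reconstructions scaled to [0, 1] (e.g. MNIST-like pixels).
x = torch.rand(128, 784)
x_hat = torch.rand(128, 784)

mse_loss = nn.MSELoss()(x_hat, x)   # mean squared error
l1_loss = nn.L1Loss()(x_hat, x)     # L1 (mean absolute error)
bce_loss = nn.BCELoss()(x_hat, x)   # binary cross-entropy, valid because values lie in [0, 1]
print(mse_loss.item(), l1_loss.item(), bce_loss.item())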

Finally, let's explore the different types of autoencoders.

5 types of autoencoders

The concept of an autoencoder is not new. In fact, the earliest applications can be traced back to the 1980s. Originally used for dimensionality reduction and feature learning, over time the concept of an autoencoder has evolved into a widely used technique for learning generative models on data. Here are five popular autoencoder types:

1. Undercomplete Autoencoders

Undercomplete autoencoders are among the simplest autoencoders and were first applied to dimensionality reduction and feature learning as early as the 1980s. The principle is simple: compress the input data into a latent space, then decompress it back into the original data. Since this is unsupervised learning, no labels are required. An undercomplete autoencoder can be viewed as a dimensionality reduction technique that projects high-dimensional data into a low-dimensional latent space.

Its formulation: the input data $x$ passes through the encoder $f$ to produce a latent feature vector $h$:

$$h = f(x)$$

The feature vector $h$ then passes through the decoder $g$ to produce the reconstructed data $x'$:

$$x' = g(h)$$

The loss function of an undercomplete autoencoder is the reconstruction error, which can be expressed using a variety of different error functions, such as L1 loss or mean square error.
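
The two formulas map directly onto code. A minimal sketch, with layer sizes chosen only for illustration:

import torch
from torch import nn

# Encoder f: x -> h (784 -> 64) and decoder g: h -> x' (64 -> 784); sizes are assumptions.
f = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
g = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(32, 784)            # dummy batch in [0, 1]
h = f(x)                           # h = f(x): low-dimensional latent code
x_recon = g(h)                     # x' = g(h): reconstruction
loss = nn.MSELoss()(x_recon, x)    # reconstruction error (MSE here; L1 also works)
loss.backward()                    # gradients flow into both encoder and decoder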

By using an undercomplete autoencoder, we can compress high-dimensional data into a low-dimensional latent space and be able to reconstruct the original data. This technique is very useful in practice, such as image processing, speech signal processing, natural language processing and other fields. Compared with other dimensionality reduction techniques, undercomplete autoencoders can learn non-linear relationships and thus perform better dimensionality reduction while preserving data information.

If you compare this model to a person, it's like being allowed only one piece of luggage when you travel while needing to take as many items as possible: you have to think about how to pack everything in, and then reassemble your items when you arrive. An undercomplete autoencoder is that piece of "luggage", helping you compress data and reconstruct it.

2. Sparse Autoencoders

The sparse autoencoder is an extension of the undercomplete autoencoder. Its distinguishing feature is the addition of a sparsity constraint, which lets it learn the characteristics of the data better. The algorithm's principle and formulas are as follows:

Algorithm principle

Sparse autoencoders limit the average activation of neurons in the hidden layer, forcing only some neurons to be activated, so that the model is more robust. This constraint can be achieved by adding a penalty term to the objective function.

Calculation formula

For sparse autoencoders, the objective function consists of two parts: the reconstruction error and the sparsity penalty.

The reconstruction error part is the same as in the undercomplete autoencoder, i.e. the network is trained by minimizing the error between the input and the output:

$$J_{reconstruction}(W,b;x^{(i)}) = \frac{1}{2}\left\|y(x^{(i)}) - x^{(i)}\right\|^2$$

where $W$ and $b$ are the network's weights and biases, $x^{(i)}$ is the $i$-th sample in the training set, and $y(x^{(i)})$ is the network's output.

The sparsity penalty part can be realized by adding a sparsity constraint, and its formula is as follows:

$$J_{sparse}(a) = \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j)$$

where $s$ is the number of neurons in the hidden layer, $a$ is the hidden-layer output, $\rho$ is the target average activation, and $\hat{\rho}_j$ is the measured average activation of neuron $j$. The KL divergence is given by:

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$

In the objective function, the weight of the sparsity penalty term can be tuned by hyperparameters.
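
As a hedged sketch of how this penalty can be computed in practice (the target activation rho = 0.05 and the tensor shapes are illustrative assumptions):

import torch

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    # Average activation of each hidden unit over the batch (rho_hat_j).
    rho_hat = hidden_activations.mean(dim=0)
    rho_hat = rho_hat.clamp(1e-6, 1 - 1e-6)   # avoid log(0)
    # KL(rho || rho_hat_j), summed over all hidden units.
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# Usage: hidden-layer outputs in (0, 1), e.g. from a sigmoid layer.
a = torch.sigmoid(torch.randn(128, 64))
penalty = kl_sparsity_penalty(a)
# total_loss = reconstruction_loss + weight * penalty, where the weight is the tunable hyperparameter.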

Sparse autoencoders can be compared to the learning process of the human brain. When the human brain is learning new things, it will try to find some of these features to deepen its understanding of things. Similarly, sparse autoencoders make the network more robust and generalizable by forcing the network to learn only part of the features, so that it can better learn and understand the input data.

3. Contractive Autoencoders

A contractive autoencoder is an unsupervised learning algorithm that learns low-dimensional representations of data. Compared with undercomplete and sparse autoencoders, contractive autoencoders pay more attention to the local structure of the data.

The algorithm traces back to the 2011 paper "Contractive Auto-Encoders: Explicit Invariance During Feature Extraction" by Rifai et al. The contractive autoencoder proposed there builds mainly on the undercomplete autoencoder: by constraining sensitivity to the local structure of the data, the encoder becomes more robust to small changes and thus learns a more stable and interpretable low-dimensional representation.

Algorithm principle

Contractive autoencoders build on undercomplete autoencoders by adding an extra term that penalizes the network's sensitivity to small changes in the input data. This extra term is obtained by computing the Frobenius norm of the encoding layer's Jacobian matrix with respect to the input. The Frobenius norm is the square root of the sum of the squares of the matrix's elements, so this additional term measures how much the encoding changes when the input changes slightly, thereby constraining the representation to the local structure of the data.

Calculation formula

Let $W$ be the encoder's weights, $h(x)$ the encoder's output, and $J$ the Jacobian matrix of $h(x)$ with respect to $x$. The Frobenius norm is computed as:

$$\|J\|_F^2 = \sum_{i,j}\left(\frac{\partial h_j(x)}{\partial x_i}\right)^2$$

The loss function of the contractive autoencoder generally includes two parts: the reconstruction error and the constraint on the encoding layer (the contractive penalty). The usual expression is:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^N\|x_i - g(f(x_i))\|_2^2 + \lambda\|J\|_F^2$$

where $g$ is the decoder, $f$ is the encoder, $x_i$ is the input data, $N$ is the number of samples, and $\lambda$ is the constraint coefficient that balances the reconstruction error and the penalty term.
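
In practice, for a single sigmoid encoder layer the Jacobian has a closed form, so the Frobenius norm can be computed cheaply without explicit differentiation. A hedged sketch under that assumption (layer sizes are illustrative):

import torch
from torch import nn

encoder = nn.Linear(784, 64)   # single sigmoid encoder layer (sizes are assumptions)

def contractive_penalty(x):
    h = torch.sigmoid(encoder(x))              # h(x), shape (batch, 64)
    # For h = sigmoid(Wx + b): dh_j/dx_i = h_j (1 - h_j) W_ji, hence
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    dh_sq = (h * (1 - h)) ** 2                 # (batch, 64)
    w_sq = (encoder.weight ** 2).sum(dim=1)    # (64,): row-wise sum over input dimensions
    return (dh_sq * w_sq).sum(dim=1).mean()    # average ||J||_F^2 over the batch

x = torch.rand(32, 784)
penalty = contractive_penalty(x)   # added to the reconstruction loss with weight lambda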

Imagine you're learning how to draw, but you feel like your style is lacking certain elements. You decide to join a bootcamp to learn how to get better at drawing people's faces. Before you can start training, you need to understand how to describe these features and how to put them together to paint a complete face.

It's like a contractive autoencoder, which learns the most important features from an image and then keeps only those features when compressing the data. In this way, we can preserve the important information while compressing the data.

4. Denoising Autoencoder


A denoising autoencoder is an autoencoder that extracts data features by learning to remove noise. It emerged to address the fact that, in real environments, data is often corrupted by noise.

Algorithm principle

Denoising autoencoders differ from standard autoencoders in that their training data has random noise added to it. A denoising autoencoder adds random noise to the input data, takes the noisy data as input, and reconstructs the clean data. In this process it not only learns the characteristics of the data but also learns how to remove noise, making it more robust.

Calculation formula

The loss of the denoising autoencoder looks like that of a standard autoencoder, except that the reconstruction is computed from the corrupted input while the target is the clean data:

$$\mathcal{L}_{DAE} = \sum_{i=1}^n\left\|g_\theta(\widetilde{x}_i) - x_i\right\|^2$$

where $\widetilde{x}_i$ is the input with random noise added, $g_\theta$ denotes the composition of encoder and decoder, and $\mathcal{L}_{DAE}$ is the training loss.
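
The essential point is that the network sees the corrupted input but is scored against the clean target. A hedged sketch of a single training step (the noise level and the tiny model are illustrative assumptions):

import torch
from torch import nn

# Any autoencoder works here; this tiny fully-connected one is just a placeholder.
model = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 784), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)                                  # clean batch
x_noisy = (x + 0.3 * torch.randn_like(x)).clamp(0, 1)     # corrupt the input with Gaussian noise

x_recon = model(x_noisy)            # reconstruct from the noisy input: g_theta(x_tilde)
loss = nn.MSELoss()(x_recon, x)     # compare against the clean target x
optimizer.zero_grad()
loss.backward()
optimizer.step()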

Anthropomorphic explanation: The denoising autoencoder is like learning a thing in a noisy environment. We will remove the noise as much as possible and keep some key information. For example, at a loud concert, although we can't hear every note, we can still hear the main theme of a song. The principle of the denoising autoencoder is the same. Through training, it can extract useful information from the data disturbed by noise and remove the noise.

5. Variational Autoencoder VAE (for generative models)

Variational Autoencoder (VAE) is a neural network-based generative model whose main purpose is to learn the probability distribution of the data and generate new data samples. VAE was originally proposed by Diederik P. Kingma and Max Welling in 2013 as an improvement on the traditional autoencoder (AE).

In traditional autoencoders, both the encoder and decoder are deterministic functions, and the model is trained by minimizing the reconstruction error. VAE, on the other hand, introduces a probability-based generative model to describe the underlying distribution of input data. VAEs encode data into a latent variable vector and use a decoder to map this vector back into the original data space. Unlike traditional autoencoders that use deterministic encoders and decoders, VAEs use random encoders to encode input data into a distribution of latent variables and generate new data samples by sampling.

The goal of VAE is to minimize the reconstruction error while constraining the distribution of latent variables to obey the standard normal distribution. This is achieved by minimizing the reconstruction error and the KL divergence, which measures the difference between the latent variable distribution and the standard normal distribution. This method enables VAE to generate data samples with diversity and continuity, and can control the degree of diversity of data generation.

Algorithm principle

VAE is a generative model that can learn low-dimensional representations of data distributions for data generation and reconstruction. It is based on the basic structure of the autoencoder (AE) but uses a different training strategy that lets it learn a continuous distribution in the latent space. This training strategy is based on variational inference (VI).

The VAE uses an encoder and a decoder. The encoder maps the input data $x$ to the latent variable $z$, and the decoder maps the latent variable $z$ back to the reconstructed data $x'$. In this way, the VAE can be viewed as a function that transforms the input $x$ into the latent variable $z$ and then converts $z$ back into the reconstruction $x'$, namely:

$$x' = f_\theta(z), \qquad z = g_\phi(x)$$

Here $f_\theta$ denotes the decoder, $g_\phi$ the encoder, and $\theta$ and $\phi$ their respective parameters.

To train the VAE, we would like to maximize the log-likelihood $\log p_\theta(x)$. However, computing the log-likelihood directly is infeasible, since it involves an integral over $z$. Therefore, VAE uses variational inference and instead maximizes the lower bound:

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x)\,\|\,p(z))$$

where $q_\phi(z|x)$ is the posterior distribution of $z$ given $x$, $p(z)$ is the prior distribution, and $KL(q_\phi(z|x)\,\|\,p(z))$ is the KL divergence between the posterior and the prior.

We can decompose this lower bound into two parts: the reconstruction error and the regularization term. The reconstruction error measures the difference between the decoder's reconstruction $x'$ and the original data $x$, while the regularization term encourages the posterior distribution $q_\phi(z|x)$ to stay close to the prior distribution $p(z)$.

Algorithm process

The VAE training process and its formulas are as follows:

  1. Data preprocessing

Image data usually enters the VAE model as a pixel matrix, so it must be preprocessed first, for example by normalizing the pixels to real numbers in [0, 1], or standardizing them by subtracting the mean and dividing by the standard deviation.

  2. Forward propagation

Use the encoder to map the input data $x$ to the latent variable $z$, where the encoder outputs the mean vector $\mu$ and the log-variance vector $\log\sigma^2$:

$$\mu, \log\sigma^2 = g_\phi(x)$$

Parameterizing $\log\sigma^2$ ensures that the variance stays positive while the trainable parameter itself remains unconstrained, avoiding any negative-variance cases.

  3. Sampling

Sample $z$ from the posterior distribution $q_\phi(z|x)$:

$$z = \mu + \epsilon \odot \sigma, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\epsilon$ is a noise vector sampled from a standard normal distribution and $\odot$ denotes element-wise multiplication.

  4. Decoding

Use the decoder to map the latent variable $z$ back to the reconstructed data $x'$:

$$x' = f_\theta(z)$$

  5. Compute the reconstruction error

Use the reconstruction error to measure the difference between the decoder's reconstruction $x'$ and the original data $x$. Assuming the data is binary, cross-entropy serves as the reconstruction error:

$$CE = -\sum_{i=1}^{N}\left[x_i\log x'_i + (1-x_i)\log(1-x'_i)\right]$$

where $N$ is the dimensionality of the data.

  6. Compute the KL divergence

Compute the KL divergence between the posterior distribution $q_\phi(z|x)$ and the prior distribution $p(z)$:

$$KL(q_\phi(z|x)\,\|\,p(z)) = -\frac{1}{2}\sum_{j=1}^J\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $J$ is the dimensionality of the latent variable $z$, and $\mu_j$ and $\sigma_j^2$ are the mean and variance of the $j$-th latent variable output by the encoder.

  7. Compute the loss function

Combining the reconstruction error and the KL divergence gives the VAE loss function:

$$L = CE + \beta\, KL(q_\phi(z|x)\,\|\,p(z))$$

where $\beta$ is a hyperparameter that balances the weight of the reconstruction error and the KL divergence.

  8. Backpropagation

Backpropagate the loss function $L$ with respect to the model parameters $\phi$ and $\theta$ and update them:

$$\theta \leftarrow \theta - \alpha\frac{\partial L}{\partial \theta}, \qquad \phi \leftarrow \phi - \alpha\frac{\partial L}{\partial \phi}$$

where $\alpha$ is the learning rate. Repeat the above steps until the loss function converges or the preset number of iterations is reached.

Autoencoder code implementation


Linear autoencoder implementation

import torch
import torchvision
from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from torchvision.datasets import MNIST
import os

if not os.path.exists('./vae_img'):
    os.mkdir('./vae_img')


def to_img(x):
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x


num_epochs = 100
batch_size = 128
learning_rate = 1e-3

img_transform = transforms.Compose([
    transforms.ToTensor()
    # transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

dataset = MNIST('../data', transform=img_transform, download=True)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20)
        self.fc22 = nn.Linear(400, 20)
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparametrize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        if torch.cuda.is_available():
            eps = torch.cuda.FloatTensor(std.size()).normal_()
        else:
            eps = torch.FloatTensor(std.size()).normal_()
        eps = Variable(eps)
        return eps.mul(std).add_(mu)

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        # return F.sigmoid(self.fc4(h3))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparametrize(mu, logvar)
        return self.decode(z), mu, logvar


model = VAE()
if torch.cuda.is_available():
    # model.cuda()
    print('cuda is OK!')
    model = model.to('cuda')
else:
    print('cuda is NO!')

reconstruction_function = nn.MSELoss(reduction='sum')  # sum over elements (replaces the deprecated size_average=False)


def loss_function(recon_x, x, mu, logvar):
    """
    recon_x: generating images
    x: origin images
    mu: latent mean
    logvar: latent log variance
    """
    BCE = reconstruction_function(recon_x, x)  # mse loss
    # loss = 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD_element = mu.pow(2).add_(logvar.exp()).mul_(-1).add_(1).add_(logvar)
    KLD = torch.sum(KLD_element).mul_(-0.5)
    # KL divergence
    return BCE + KLD


optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for batch_idx, data in enumerate(dataloader):
        img, _ = data
        img = img.view(img.size(0), -1)
        img = Variable(img)
        if torch.cuda.is_available():
            img = img.cuda()
        optimizer.zero_grad()
        recon_batch, mu, logvar = model(img)
        loss = loss_function(recon_batch, img, mu, logvar)
        loss.backward()
        # train_loss += loss.data[0]
        train_loss += loss.item()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch,
                batch_idx * len(img),
                len(dataloader.dataset), 100. * batch_idx / len(dataloader),
                # loss.data[0] / len(img)))
                loss.item() / len(img)))

    print('====> Epoch: {} Average loss: {:.4f}'.format(
        epoch, train_loss / len(dataloader.dataset)))
    if epoch % 10 == 0:
        save = to_img(recon_batch.cpu().data)
        save_image(save, './vae_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './vae.pth')

Convolutional autoencoder implementation

import os
import datetime

import torch
import torchvision
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from torchvision.datasets import MNIST


if not os.path.exists('./dc_img'):
    os.mkdir('./dc_img')


def to_img(x):
    x = 0.5 * (x + 1)
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x


num_epochs = 100
batch_size = 128
learning_rate = 1e-3

img_transform = transforms.Compose([
    transforms.ToTensor(),
    # transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    transforms.Normalize([0.5], [0.5])
])

dataset = MNIST('./data', transform=img_transform, download=True)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),  # b, 16, 10, 10
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),  # b, 16, 5, 5
            nn.Conv2d(16, 8, 3, stride=2, padding=1),  # b, 8, 3, 3
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)  # b, 8, 2, 2
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=2),  # b, 16, 5, 5
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),  # b, 8, 15, 15
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),  # b, 1, 28, 28
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = autoencoder().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,weight_decay=1e-5)
starttime = datetime.datetime.now()

for epoch in range(num_epochs):
    for data in dataloader:
        img, label = data
        img = Variable(img).to(device)
        # ===================forward=====================
        output = model(img)
        loss = criterion(output, img)
        # ===================backward====================
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # ===================log========================
    endtime = datetime.datetime.now()
    print('epoch [{}/{}], loss:{:.4f}, time:{:.2f}s'.format(epoch+1, num_epochs, loss.item(), (endtime-starttime).seconds))
    
    # if epoch % 10 == 0:
    pic = to_img(output.cpu().data)
    save_image(pic, './dc_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './conv_autoencoder.pth')

Variational autoencoder implementation


import torch
import torchvision
from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from torchvision.datasets import MNIST
import os
import datetime

if not os.path.exists('./vae_img'):
    os.mkdir('./vae_img')


def to_img(x):
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x


num_epochs = 100
batch_size = 128
learning_rate = 1e-3

img_transform = transforms.Compose([
    transforms.ToTensor()
    # transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

dataset = MNIST('./data', transform=img_transform, download=True)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20)
        self.fc22 = nn.Linear(400, 20)
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparametrize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        if torch.cuda.is_available():
            eps = torch.cuda.FloatTensor(std.size()).normal_()
        else:
            eps = torch.FloatTensor(std.size()).normal_()
        eps = Variable(eps)
        return eps.mul(std).add_(mu)

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        # return F.sigmoid(self.fc4(h3))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparametrize(mu, logvar)
        return self.decode(z), mu, logvar


strattime = datetime.datetime.now()
model = VAE()
if torch.cuda.is_available():
    # model.cuda()
    print('cuda is OK!')
    model = model.to('cuda')
else:
    print('cuda is NO!')

reconstruction_function = nn.MSELoss(reduction='sum')  # sum over elements (replaces the deprecated size_average=False)


def loss_function(recon_x, x, mu, logvar):
    """
    recon_x: generating images
    x: origin images
    mu: latent mean
    logvar: latent log variance
    """
    BCE = reconstruction_function(recon_x, x)  # mse loss
    # loss = 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD_element = mu.pow(2).add_(logvar.exp()).mul_(-1).add_(1).add_(logvar)
    KLD = torch.sum(KLD_element).mul_(-0.5)
    # KL divergence
    return BCE + KLD


optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for batch_idx, data in enumerate(dataloader):
        img, _ = data
        img = img.view(img.size(0), -1)
        img = Variable(img)
        img = (img.cuda() if torch.cuda.is_available() else img)
        optimizer.zero_grad()
        recon_batch, mu, logvar = model(img)
        loss = loss_function(recon_batch, img, mu, logvar)
        loss.backward()
        # train_loss += loss.data[0]
        train_loss += loss.item()
        optimizer.step()
        if batch_idx % 100 == 0:
            endtime = datetime.datetime.now()
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f} time:{:.2f}s'.format(
                epoch,
                batch_idx * len(img),
                len(dataloader.dataset), 
                100. * batch_idx / len(dataloader),
                loss.item() / len(img), 
                (endtime-strattime).seconds))
    print('====> Epoch: {} Average loss: {:.4f}'.format(
        epoch, train_loss / len(dataloader.dataset)))
    if epoch % 10 == 0:
        save = to_img(recon_batch.cpu().data)
        save_image(save, './vae_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './vae.pth')


Origin blog.csdn.net/weixin_42010722/article/details/129969178