Representation Learning, Part 1: Pretext Tasks

"Artificial Intelligence and Machine Learning" from Prof. Manolis Kellis (Director of MIT Computational Biology)

The main content is part 1 of Representation Learning: pretext tasks (also called proxy tasks, pre-tasks, or auxiliary tasks), which can be understood as indirect tasks designed in service of a specific target task.

It covers these parts: inferring structure, transformation prediction, reconstruction, exploiting time, multimodality, and instance classification.

Below is the corresponding YouTube lecture:

Generative Models, Adversarial Networks GANs, Variational Autoencoders VAEs, Representation Learning

Representation Learning

Representation Learning: Pretext tasks, embedding spaces, knowledge representation, next word prediction, image placement prediction, Variational AutoEncoders.

Representation Learning: This is a machine learning approach that aims to automatically identify better ways to represent input data to a learning algorithm. The idea is that downstream tasks such as classification or regression should become easier given the correct data representation.

Pretext tasks: In self-supervised learning, pretext tasks are designed as auxiliary tasks through which the model learns rich feature representations from unlabeled data, which can then be used for the main task. Examples of pretext tasks are predicting the next word in a sentence, image completion, or colorization of black-and-white images.

Embedding spaces: These are high-dimensional vector spaces where similar objects are close together and dissimilar objects are far apart. They are often used to represent categorical variables or discrete objects such as words (in Word2Vec or GloVe), sentences (in Sentence-BERT), or even graphs (in graph neural networks).

Knowledge representation: This is the part of the field of artificial intelligence that focuses on representing information about the world in a form that computer systems can use to solve complex tasks, such as diagnosing a medical condition or conducting a conversation using natural language. It includes representations of action, time, causality, and belief, among others.

Next Word Prediction: This is a language modeling task where the model predicts the next word in a sentence given the previous words. It is commonly used to train deep learning models, such as Transformers (such as GPT-3 or GPT-4), with the goal of predicting the next token in a sequence.

Image Position Prediction: This could be a task where the goal is to predict the correct position or arrangement of an image based on some context. For example, given a comic series that is missing one panel, the task would be to predict the correct location of the missing panel. This kind of task requires a good understanding of visual narrative and context.

Variational Autoencoders (VAEs): These are generative models that use neural networks for efficient Bayesian inference of complex, intractable probabilistic models. VAEs have a specific architecture that allows them to generate new data similar to the training data. They are especially useful for tasks such as anomaly detection, denoising, and generating new samples.
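As a concrete illustration of the VAE idea, here is a minimal PyTorch sketch; the flattened 784-dimensional input, layer sizes, and MSE reconstruction term are illustrative assumptions rather than a reference implementation.

```python
# Minimal VAE sketch (assumed sizes; not the lecture's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Sampling new data then only requires drawing z from the standard normal prior and passing it through the decoder.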

Deep learning is a powerful framework for understanding and learning data representations. The key idea is representation learning, which translates raw data into a more meaningful form that is easier to use for tasks such as classification.

  • A commonly used deep learning model is the Convolutional Neural Network (CNN), often used for image classification tasks. A typical CNN includes multiple layers such as convolutional layers, ReLU (rectified linear unit) activation layers, pooling layers, and fully connected layers (a minimal code sketch follows this list).
    • In the convolutional layers, the model extracts features from the input image through convolution operations, i.e., interactions between image pixels within a local receptive field.
    • The ReLU activation layer sets all negative values to 0 to introduce non-linearity.
    • The pooling layer (such as max pooling) reduces the spatial size of the feature maps and increases the robustness and computational efficiency of the model.
    • The fully connected layers sit at the end of the network and act as a classifier, analyzing the previously extracted features and outputting the prediction.
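To make the pipeline concrete, here is a minimal CNN sketch in PyTorch; the layer sizes, 32×32 RGB input, and 10-class output are illustrative assumptions, not the lecture's exact network.

```python
# Minimal CNN of the kind described above (assumed sizes).
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution over local receptive fields
    nn.ReLU(),                                   # non-linearity: negatives set to 0
    nn.MaxPool2d(2),                             # pooling: shrink the feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected classifier (32x32 input, 10 classes)
)
```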

The main difference between deep learning and traditional neural networks is that it combines the two tasks of feature extraction and classification: the classification task drives the extraction of features. This is an extremely powerful and general model, but we still need to keep innovating because the field is still in its infancy.

New application domains (such as those beyond images) may have structures that cannot be captured or exploited by current architectures. For example, genomics, biology, and neuroscience may help drive the development of new architectures.

When we say that new application domains may have structures that current architectures cannot capture or exploit, it is as if you had a super advanced toolbox with all kinds of tools, hammers, screwdrivers, wrenches, and so on, that are very effective for repairing furniture or cars. But if your challenge right now is cooking a good dinner, you may need some brand-new tools, such as pots, knives, and an oven, that are not in that toolbox.

Likewise, our current deep learning architectures, such as convolutional neural networks or recurrent neural networks, perform extremely well on image, audio, and text data. However, when we try to apply deep learning to new fields, such as genomics (DNA sequence analysis), biology (such as protein structure prediction), or neuroscience (such as brain wave analysis), we may find that our existing tools do not quite fit. We may need to develop new deep learning architectures that can better understand and exploit the data structures in these domains.

As an example, a DNA sequence can be viewed as a long string made up of four bases (A, T, C, G). While we can process this kind of data in a way similar to text, DNA sequences have special structures and properties (such as three bases, a codon, encoding one amino acid) that may not be fully exploited by current deep learning architectures. Therefore, we may need to develop new architectures specifically to capture and exploit these properties.

  • Representation Learning in Unsupervised Learning
  • How to learn useful representations from unlabeled data.
    • Predicting the future: for example, using a recurrent neural network (RNN) to predict what comes next, or predicting the next frame in a video sequence. In this setting, predicting the future becomes a way of learning data representations.
    • Compression: An autoencoder is a type of neural network that attempts to reconstruct its input from a low-dimensional representation (called a latent space ), and thus can also be seen as a type of compression.
    • Pretext task: This is a constructed task whose purpose is to drive the learning of useful representations rather than to directly solve the task we care about. Examples include predicting missing parts of an image, predicting the rotation angle of an image, colorizing a black-and-white image, and upsampling a low-resolution image.
    • Capturing parameter distributions (variational): Variational autoencoders (VAEs) attempt to learn a latent probability distribution of the input data, such that latent representations sampled from this distribution can generate new data similar to the input data.
    • **Making Latent Space Parameters Meaningful:** The latent space can be designed such that each dimension has specific meaning, which can be orthogonal, explicit, or tunable.
    • **Use a second network for training:** Generative adversarial networks (GANs) include a generator network and a discriminator network; through adversarial training between the two, the generator learns to produce increasingly convincing fake data.
  • **Infinite Possibilities:** The above are just some existing methods, and there are infinite possibilities in this field waiting for us to explore. Your innovative ideas may open up new avenues.

In general, the topic of this passage is representation learning without labeled data via unsupervised learning or self-supervised learning. This learning approach provides us with a powerful tool to learn useful knowledge from large amounts of unlabeled data.

This article is about the Pretext task

In the field of deep learning, self-supervised learning is a form of unsupervised learning in which the training signals (aka labels) are generated from the input data itself rather than provided by humans. The goal of this approach is to learn good data representations; we do not ultimately care about the outcome of the pretext task itself.

Pretext (proxy) tasks implement self-supervised learning by constructing a task whose supervisory signal can be derived from the input data itself. In fact, we do not care about the result of the pretext task itself; we only care about whether it pushes the model to learn useful data representations (such as genuinely recognizing a cat in a photo).

Pretext tasks can be roughly divided into the following categories:

  1. **Inferring structure:** This type of task requires the model to infer some structure or pattern from the input data.
  2. **Transformation prediction:** This type of task requires the model to predict some transformation of the data, such as rotation, translation, or scaling.
  3. **Reconstruction:** This type of task requires the model to reconstruct its input, usually after some kind of transformation (such as adding noise). Autoencoders are examples of such tasks.
  4. **Leveraging time:** In this type of task, the model needs to understand the chronological order of the data or predict future events. For example, in natural language processing, a model may need to predict the next word.
  5. **Multimodal tasks:** These tasks involve multiple types of data, such as images and text, and the goal is to learn representations across modalities.
  6. **Instance Classification:** This is a special type of task where each data instance is treated as its own class.

It should be noted that this is only a rough classification of pretext tasks, and some tasks may fit into more than one category.

Inferring structure



  1. Contextual Prediction: This is a self-supervised learning method in which the model learns to predict one part of the input from other parts of the input. This is a useful way to understand parts or features of objects in an image. However, this approach assumes that the training images are taken in a canonical (standard) orientation, which may not always be true.
  2. Disadvantages of contextual prediction: Contextual prediction has some problems.
    • First, it assumes that all images are taken from one canonical orientation, which may not match reality. For example, when shooting clouds, there is no standard direction.
    • It trains on patches (small parts of the image) but aims to learn a representation of the whole image.
    • Second, these models often "cheat" by using hints that were not available at test time, so special care needs to be taken when designing these models. There is a gap between training and evaluation due to differences in data distribution, and possibly a lack of fine-grained distinction in training.
    • There may be limitations if the classifier is not trained with negative examples (examples that are not trying to recognize) from other images. For example, it might not be able to distinguish cat eyes from dog eyes without negative examples.
    • Small output space: This seems to refer to problems where the model predicts a limited number of classes or cases. In this example, the model may only be able to distinguish 8 different locations, which may limit the effectiveness of the model.

The following section is included because parts of the lecture felt a bit abstract to me; it consists of questions I raised and GPT-4's answers, and can be skipped. In essence, it says that representation learning automatically learns many representations (features), and these features are higher-level and have far more potential than the features people hand-craft one by one in classical machine learning.

Unsupervised visual representation learning by context prediction

Basic concepts of unsupervised visual representation learning: Unsupervised visual representation learning through context prediction is a self-supervised learning method whose goal is to enable machine learning models to learn useful representations from input unlabeled images. This learning method works by predicting the context in an image, usually by predicting one part of the image to understand other parts of the image.

Let's take a simple example to illustrate the concept:

Say you have a photo of a cat playing with a ball. We cut this photo into several parts, and then randomly remove one of the parts, and our task is to predict what the removed part should be.

In this task, the model needs to understand various contextual cues, such as the shape of the cat, the color and shape of the ball, and their relative positions in the image. In this way, the model can learn visual representations of cats and balls as well as their spatial relationships. That way, even if it encounters new pictures of cats or balls at test time, the model can understand and classify them based on the representations it has learned.

This is the basic concept of unsupervised visual representation learning through context prediction. Note that this is a self-supervised learning method, that is, it learns without human labels, and thus can handle large amounts of unlabeled data.
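A hedged sketch of one common form of context prediction, relative patch position (in the spirit of Doersch et al.): a shared encoder embeds two patches and a small head classifies which of 8 neighboring positions the second patch came from. The two-tower architecture and sizes below are assumptions for illustration, not the exact paper setup.

```python
# Relative-patch-position pretext task: the label (0-7) comes from the image itself.
import torch
import torch.nn as nn

class PatchPositionNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                 # shared patch encoder
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(64, feat_dim))
        self.classifier = nn.Linear(2 * feat_dim, 8)  # 8 possible relative positions

    def forward(self, center_patch, neighbor_patch):
        f1 = self.encoder(center_patch)
        f2 = self.encoder(neighbor_patch)
        return self.classifier(torch.cat([f1, f2], dim=1))

# Training pairs: crop a center patch plus one randomly chosen neighbor; the neighbor's
# position index is the label, so no human annotation is needed.
```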

  • The second question: representation space

"Representation space" in machine learning generally refers to a model's internal representation of input data. In CNN, convolutional layers and pooling layers (if present) process the input image and generate a complex feature map. This feature map is the representation of the input image in the model. These representations encode a lot of information about the image, such as color, texture, shape, etc.

In unsupervised visual representation learning, the goal of the model is to find a representation that is most helpful for predictive tasks such as context prediction. In this case, we hope that the representation space of the model can capture important features and structures in the image.

Taking CNN as an example, let's say we use the context prediction task to train a CNN. During training, the CNN's convolutional layers learn how to extract useful features from images, which are encoded in feature maps (i.e., representation spaces). After training, a CNN should be able to generate representations that are useful for contextual prediction tasks.

This is the relationship between unsupervised visual representation learning and representation space. In unsupervised learning, our goal is to find a representation that is useful for prediction tasks.

  • The third question: how to quantify the representation space

Representation space is quantifiable. In a convolutional neural network (CNN), the representation space usually refers to the output of one or more layers in the network, which are a kind of encoding or representation of the original input (such as an image). These representations are usually numerical vectors with many dimensions and are therefore quantizable.

Specific to a new image, the representation space works like this:

  1. First, new images are fed into the trained CNN.
  2. The convolutional layer of CNN will start to process the image, and each layer will extract some features of the image. Early layers may extract some basic features such as edges and colors, while deeper layers may extract more complex features such as parts or wholes of objects.
  3. These extracted features constitute the representation of the new image in the model, that is, points in the "representation space". These representations are usually fed into fully connected layers or other types of layers for further processing to accomplish the final prediction task (such as classification).

As an example, let's say you have a CNN for distinguishing between cats and dogs. You feed it a new picture of a cat. The convolutional layer of CNN will extract various features in the picture and generate a representation. This representation captures a lot of information about the cat in the picture, such as its shape, color, texture, etc. Then, this representation is fed into the fully connected layer, which predicts the label "cat" from this representation. This is what the representation space does when dealing with new images.

  • Jigsaw puzzles: an even more complicated variant, where the model must recover the permutation of shuffled image patches

Transformation prediction

Rotation prediction

  • The model is trained to predict by how many degrees the image has been rotated.

First, the image is randomly rotated by an angle (such as 90°, 180°, 270°), and then input into a neural network (such as ConvNet), and the task of the model is to predict this rotation angle. This process can be optimized by Classification Loss, because this is actually a 4-class classification problem (assuming we only consider the four rotations of 0°, 90°, 180°, 270°).
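A minimal sketch of this rotation pretext task as a 4-way classification; `backbone` and `head` stand for any feature extractor and linear classifier and are assumptions here, not names from the lecture.

```python
# Rotation prediction as 4-way classification over 0/90/180/270 degrees.
import torch
import torch.nn.functional as F

def rotate_batch(x):
    """Rotate each image (N, C, H, W) by a random multiple of 90 degrees; return images and labels."""
    labels = torch.randint(0, 4, (x.size(0),), device=x.device)
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(x, labels)])
    return rotated, labels

def rotation_loss(backbone, head, x):
    rotated, labels = rotate_batch(x)
    logits = head(backbone(rotated))        # head maps features to 4 logits
    return F.cross_entropy(logits, labels)  # standard classification loss
```

The supervision here is exactly the 4-class classification loss described above; no human labels are involved.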

advantage:

  • It's easy to do, and works pretty well.

shortcoming:

  • Assume that the training images are all taken with a standard orientation (and that there is a standard orientation).
  • Training-Evaluation Gap: There are no rotated images in the evaluation phase.
  • Since there are no negative examples from other images, the learned features are not fine-grained; for example, the model never needs to distinguish a cat from a dog.
  • Small output space - only 4 cases (rotations) need to be distinguished, adding categories is not easy.
  • In some domains the task may be too easy; in Street View imagery, for instance, finding the sky is enough to determine the rotation.

It should be noted that although this method has its limitations, it provides a simple and effective way to learn visual representation without labels, which is very useful for large-scale unlabeled datasets.

  • Q: What is self-supervised learning
    • A: This method of generating labels from the input data itself is called self-supervised learning. Self-supervised learning is a form of unsupervised learning by designing a task (such as predicting the rotation angle of an image) so that the model can receive supervision signals from the input data itself. This approach does not require manually labeled labels, which is its main advantage, since manually labeling large amounts of data is usually time- and resource-intensive.
    • In the "rotation prediction" example, the goal of the model is to predict the angle by which the image will be rotated. This angle is the label obtained from the input data itself. In this way, the model can learn useful visual representations without human labels. This is the basic concept of self-supervised learning.
  • Where did the previous context prediction label come from?
    • In the self-supervised learning task of context prediction, labels are likewise generated from the input data itself. How to generate labels depends on the specific context prediction task you use.
    • Take a common contextual prediction task as an example - predicting missing parts in an image. In this task, you first choose a part of the image as input, and then choose another part of the image as the object you want to predict. This target is your label. For example, you might randomly pick the left half of an image as input, and use the right half as target. In this case, the image on the right half is your label.
    • It should be noted that this is only one possible form of the context prediction task. There are many different ways to design such a task, depending on your specific goals and your data. But in all these cases the labels are generated directly from the input data itself. That's why this learning method is called self-supervised learning.

Relative transformation prediction

  • Relative Transformation Prediction, a self-supervised learning strategy for estimating the transformation between two images. This approach usually requires good feature extraction.

In this approach, the goal of the model is to estimate the transformation from one image to another. Suppose we have two images x and t(x), where t(x) is the image of x after some transformation (such as rotation, scaling, etc.). The model will first extract the features of these two images, denoted as E(x) and E(t(x)), and then predict the transformation from E(x) to E(t(x)).
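A hedged sketch of this setup: a shared encoder E produces E(x) and E(t(x)), and a small head predicts which transformation was applied. Framing it as a classification over a fixed set of transformations, and the module names, are assumptions for illustration.

```python
# Relative transformation prediction with a shared encoder E.
import torch
import torch.nn as nn

class RelativeTransformNet(nn.Module):
    def __init__(self, encoder, feat_dim, num_transforms):
        super().__init__()
        self.encoder = encoder                        # shared feature extractor E
        self.head = nn.Linear(2 * feat_dim, num_transforms)

    def forward(self, x, t_x):
        e1, e2 = self.encoder(x), self.encoder(t_x)   # E(x), E(t(x))
        return self.head(torch.cat([e1, e2], dim=1))  # predict which transformation t was applied
```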

  • advantage:

    • Dovetailing with traditional computer vision methods, for example, SIFT (Scale Invariant Feature Transform) was developed for image matching.
  • shortcoming:

    • Train-Evaluate Gap: There are no transformed images in the evaluation phase.

    • Since there are no negative examples from other images, the learned features are not fine-grained; for example, the model never needs to distinguish a cat from a dog.

    • There are questions about semantics and the importance of low-level features (assuming we are concerned with semantics).

      • Features may not be invariant to transformations.

Reconstruction

Reconstruction: corrupt part of the original input and learn to predict the missing or corrupted content.

Denoising Autoencoders

A self-supervised learning method based on reconstruction, that is, using denoising autoencoders (Denoising Autoencoders). A denoising autoencoder is a special kind of autoencoder that takes a noisy input signal and tries to reconstruct the original, uncontaminated signal.

A denoising autoencoder consists of two parts: the encoder (Encoder) and the decoder (Decoder). The encoder maps the input signal to an intermediate representation, and the decoder maps this representation back to the original signal space. By minimizing the reconstruction loss (Reconstruction Loss), i.e., the difference between the decoder output and the original signal not contaminated by noise, denoising autoencoders can be trained to extract useful features from noisy inputs.

For example, denoising autoencoders can be used to extract useful features from images of handwritten digits. Even if the image is corrupted by noise, for example random pixels added to it, the denoising autoencoder can still learn how to extract useful information about the handwritten digits from the noisy image.
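A minimal denoising-autoencoder sketch; the Gaussian corruption, flattened input, and layer sizes are illustrative assumptions.

```python
# Denoising autoencoder: corrupt the input, reconstruct the clean signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    def __init__(self, dim=784, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        noisy = x + 0.3 * torch.randn_like(x)        # corrupt the input with Gaussian noise
        return self.decoder(self.encoder(noisy))

def dae_loss(model, x):
    return F.mse_loss(model(x), x)                   # reconstruction loss against the CLEAN signal
```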

  • advantage:

    • Denoising autoencoders are a simple, classic approach.

    • Besides being able to learn useful representations, we also get a denoiser for free.

  • shortcoming:

    • Train-Evaluate Gap: Training on Noisy Data.

    • This task may be oversimplified and semantic understanding may not be required - low-level cues may suffice.

Denoising autoencoders are an effective approach in self-supervised learning, although it may be oversimplified for some complex tasks that require deep semantic understanding. However, this method is still very useful, especially when denoising or restoring a signal contaminated by noise is required.

Context encoders

Another kind of reconstruction-based method.

The most effective way to predict the missing content is to actually understand what the image shows.

"Context Encoders". Context encoders try to predict occluded or missing parts of an image. This method is also common in the field of natural language processing, such as word2vec and the masked language model task in the BERT model.

In this method, the input of the model is a part of the occluded or missing image, and the task of the model is to predict the occluded or missing part. This usually requires the model to understand the context information of the image, because only by understanding the context of the image can the model predict what the occluded or missing part may be.

For example, if an image shows an elephant but part of the elephant is occluded, a model that understands it is looking at an elephant may be able to accurately predict what the occluded part is.
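A hedged sketch of the context-encoder idea: mask out a central square and penalize the reconstruction only on the occluded region. The mask size, the plain MSE loss (the perceptual loss mentioned under the advantages below is omitted), and the function names are assumptions.

```python
# Context encoder (inpainting) pretext task: fill in a masked central region.
import torch
import torch.nn.functional as F

def mask_center(x, size=16):
    """Zero out a central size x size square of each image (N, C, H, W)."""
    masked = x.clone()
    h, w = x.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    masked[..., top:top + size, left:left + size] = 0.0
    return masked, (top, left, size)

def inpainting_loss(model, x):
    masked, (top, left, size) = mask_center(x)
    pred = model(masked)                              # model reconstructs the full image
    # Only the occluded region contributes to the reconstruction loss.
    return F.mse_loss(pred[..., top:top + size, left:left + size],
                      x[..., top:top + size, left:left + size])
```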

  • advantage:

    • Fine-grained information needs to be preserved.

    • Reconstruction + perceptual loss: Can be used to train the model to understand images better.

  • shortcoming:

    • Training-Evaluation Gap: No occlusions during the evaluation phase.

    • Reconstruction tasks can be too difficult and ambiguous.

    • A lot of effort was spent on "useless" details like accurate colors, nice borders, etc.

While context encoding can be a complex and ambiguous task, it provides a powerful way to learn representations that capture image context, which is very valuable for many computer vision tasks.

Colorization

A summary of the image colorization (color reconstruction) task: the model takes a grayscale image as input and tries to predict the original color image.

In this process, the encoder (Encoder) first encodes the input grayscale image into an intermediate representation (Representation), and then the decoder (Decoder) tries to reconstruct a color image from this representation. The quality of reconstruction is measured by Reconstruction Loss, which is the difference between the predicted color image and the original color image.
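A minimal sketch of this colorization objective; converting to grayscale by channel averaging and using an MSE reconstruction loss are simplifying assumptions (practical systems often predict color in other color spaces with different losses).

```python
# Colorization pretext task: predict the color image from its grayscale version.
import torch.nn.functional as F

def colorization_loss(model, rgb):
    gray = rgb.mean(dim=1, keepdim=True)   # (N, 1, H, W) grayscale input (simple channel average)
    pred_rgb = model(gray)                 # encoder-decoder predicts (N, 3, H, W)
    return F.mse_loss(pred_rgb, rgb)       # reconstruction loss against the true colors
```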

  • advantage:
    • Fine-grained information needs to be preserved because the model needs to extract enough information from grayscale images to predict color images.
  • shortcoming:
    • Reconstruction tasks can be too difficult and ambiguous because reconstructing color images from grayscale images requires models to understand complex color relationships, which is difficult in many cases.
    • A lot of work needs to be put into "useless" details like accurate colors and nice borders etc.
    • Evaluation needs to be performed on grayscale images, which may lose some information because grayscale images do not contain color information.

Split-brain encoders

This section describes a special form of "context encoders", known as "split-brain encoders". In this type of model, the input image is divided into two parts, each part is processed by part of the model, and then the model tries to predict information about the other part.

For example, a color image can be decomposed into grayscale and color channels. Then, one part of the model works on the grayscale channel, trying to predict the color channel, and the other part works on the color channel, trying to predict the grayscale channel. Thus, the model needs to learn how to infer information from one part of the image to other parts.

The two prediction results are fused to obtain the final prediction result.

advantage:

  • Fine-grained information needs to be preserved because the model needs to infer information from one part of the image to other parts.

shortcoming:

  • Reconstruction tasks can be too difficult and ambiguous because inferring information from one part of an image to others requires the model to understand complex color and brightness relationships.
  • A lot of work needs to be put into "useless" details like accurate colors and nice borders etc.
  • Different parts of the input need to be processed, which can make the model harder to train and evaluate.

Instance classification

**Instance Classification:** This is a special type of task where each data instance is treated as its own class.

Exemplar ConvNets

Exemplar convolutional neural networks, an unsupervised feature learning method.

An Exemplar ConvNet works by extracting multiple distorted crops from a single image and then letting the model decide which crops come from the same original image. This task becomes relatively easy once the model is robust to the desired transformations, such as geometric and color transformations. The model does this by classifying among K "categories" (the categories here are actually the original images).

  • advantage:

    • Representations learned in this way are invariant to desired transformations.

    • Fine-grained information needs to be preserved.

  • shortcoming:

    • Choosing the right data augmentation method is important.

    • As an exemplar approach, images of the same class or instance are negative samples, but there is no mechanism to prevent the model from focusing on the background.

    • The original design was not scalable (since the number of "categories" equals the size of the dataset).

A key idea of this approach is to use multiple distorted crops from the same image to make the model robust to the objects in the image: the model must learn to ignore changes in color and geometry and focus on recognizing the objects themselves.

Exemplar ConvNets via metric learning

How to implement Exemplar ConvNets with metric learning.

The original Exemplar ConvNet has a scalability problem: the number of "categories" is equal to the number of training images. To solve this problem, the task can be reformulated by means of metric learning.

Metric learning is a method whose goal is to learn a distance metric between data points so that the distance between data points of the same category is small, while the distance between data points of different categories is large. In Exemplar ConvNets, traditional metric learning loss functions such as Contrastive Loss or Triplet Loss can be used, as well as the more recent InfoNCE loss function.

InfoNCE loss function

The InfoNCE loss function is a particularly popular variant used by many recent methods such as CPC, AMDIM, SimCLR, MoCo, etc. It works like a ranking loss: a query should be close to its positives and far from its negatives. In terms of implementation, it can be regarded as a classification loss, but with the labels and classifier weights replaced by the candidate samples.
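A minimal sketch of InfoNCE as an in-batch classification loss, in the spirit of the methods named above; using the other samples in the batch as negatives and the specific temperature value are assumptions.

```python
# InfoNCE: each query's positive key is the "correct class" among all keys in the batch.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    """queries, keys: (N, D) embeddings; keys[i] is the positive for queries[i]."""
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature            # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)      # pull positives close, push negatives away
```

Viewed this way, the "classifier weights" are simply the key embeddings, which is what makes the approach scale with the batch rather than with the number of training images.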

In the corresponding lecture figure (not reproduced here), the top shows traditional classification, which may use a one-hot encoding over a very long vector of classes; the bottom shows metric learning, which learns the similarity between different samples and maps them into a latent space.

A key advantage of this approach is that it reformulates the problem of exemplified convolutional neural networks as a more scalable problem where learned representations preserve similarity measures between data points. While this may introduce some new challenges, such as how to select or generate negative samples, it also opens up new possibilities for self-supervised learning.

A bit abstract, for example

Let's say we have some images, which are pictures of different breeds of dogs. Our goal is for the machine to learn to distinguish between different types of dogs, even if it has not seen pictures of such dogs during training.

In the original Exemplar ConvNet, we would treat each picture of a dog as a separate "category". We then randomly crop multiple patches from each dog photo and let the network determine whether the patches come from the same photo. The problem with this approach is that if we have a very large number of dog pictures, then we have a very large number of "categories", which makes the network very difficult to train.

So, we turn to metric learning. In metric learning, we no longer care whether each picture of a dog constitutes a separate "category". Instead, we only care about the "similarity" between photos. For each dog photo, we randomly crop a segment from it as a "query"; we then crop other segments, some from the same dog photo (these are "positive samples") and others from photos of different dogs (these are "negative samples"). Then we train the network so that the distance between the query and the positive samples is small, and the distance between the query and the negative samples is large.

In this way, we can teach the network how to distinguish between different kinds of dogs, even if it has not seen pictures of such dogs during training. Because the network learns how to judge the "similarity" between photos of dogs, rather than remembering every photo of dogs. This is the application of metric learning to the example convolutional neural network.

Contrastive predictive coding (CPC)

Contrastive Predictive Coding (CPC) is a self-supervised learning method that is primarily used to learn useful representations for unsupervised data.

The basic idea of ​​CPC is to predict future parts of the data, and then use a contrastive loss (such as InfoNCE loss) to train the predictions. In the context of image processing, CPC can predict from one block of an image the representations of other blocks below. It then compares the predicted representation to the actual representation and compares this result to other negative samples (i.e. other images or other blocks of the same image). The goal of this is to enable the network to better understand the intrinsic structure and contextual information of the data.

Imagine we have a picture, which is a natural scene, with a blue sky above the picture, lush trees in the middle, and a lake below.

In CPC, we divide this image into three regions (or patches), which are sky, trees and lakes. Then, we'll pick a region, say the sky, and try to predict the representation of the underlying region (trees). This process is called "context prediction".

Next, we use the neural network to generate a representation of the predicted tree region, and then compare it with the actual representation of the tree region to see if the predicted representation is close to the actual representation. At the same time, we will also take some negative samples from other images (such as images of cityscapes, which may also be the sky) to see if the predicted representation is far enough from these negative samples.

Through such training, the neural network will learn how to predict the following area (such as trees) based on one area (such as the sky). In this way, even in the testing phase, when the neural network only sees part of the sky, it may accurately predict that there may be trees below, rather than other objects, such as buildings or the ocean.

Advantages of CPC include:

  1. It is a general framework that can be applied to many fields such as image, video, audio, natural language processing and so on.
  2. It needs to preserve fine-grained information, which helps to better understand the characteristics of the data.
  3. It can help the network learn various parts of objects through context prediction.

However, CPC also has some disadvantages:

  1. It is exemplar-based, that is, images of the same category or the same instance are considered as negative samples. This may affect the performance of the model.
  2. Training-Evaluation Gap: CPC uses small patches of images for training and the entire image for evaluation, which may lead to a certain gap between training and evaluation.
  3. CPC assumes that the training images are taken in a canonical orientation (and that such a canonical orientation exists), which may limit its applicability.
  4. The training process of CPC can be slow due to the need to divide the image into many small patches.

Exploiting time

Watching objects move

"Watch objects move" is a self-supervised learning method whose main goal is to predict which pixels will move. This process tends to become relatively easy once we can segment objects out

Specifically, the network extracts features from the image and tries to predict which pixels will move in the next frame. The prediction is made per pixel, so this method requires pixel-level labels, which are usually generated by an external motion segmentation algorithm.
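A hedged sketch of the training objective: a segmentation-style network scores each pixel, and a binary cross-entropy loss compares the scores with the motion mask supplied by the external motion-segmentation algorithm. The function names are assumptions.

```python
# Per-pixel "will this pixel move?" prediction, supervised by an external motion mask.
import torch.nn.functional as F

def motion_prediction_loss(segmentation_net, frame, motion_mask):
    """frame: (N, 3, H, W); motion_mask: (N, 1, H, W) float tensor, 1.0 = moving pixel."""
    logits = segmentation_net(frame)                      # (N, 1, H, W) per-pixel scores
    return F.binary_cross_entropy_with_logits(logits, motion_mask)
```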

The advantages of "watching objects move" include:

  1. Spontaneous behavior: The network may spontaneously learn to segment objects (to separate an object from its surroundings and grasp the concept of an object), because knowing which pixels will move is very helpful for understanding object boundaries.
  2. No training-evaluation gap: The network makes pixel-level predictions on ordinary images in both the training and evaluation phases, so there is no training-evaluation gap.

However, this approach also has some disadvantages:

  1. "Blind spots": For stationary objects, this method may not handle correctly, because it mainly focuses on pixels that will move.
  2. May over-focus on large conspicuous objects: Large, conspicuous objects tend to generate more moving pixels, so the network may over-focus on these objects while ignoring small or less conspicuous objects.
  3. Reliance on an external motion segmentation algorithm: Generating pixel-level labels requires a motion segmentation algorithm, which leads to the fact that the performance of this method is largely dependent on the performance of the motion segmentation algorithm.
  4. Does not extend well to temporal networks: when processing video, one might instead predict the next frame directly, but that task becomes too easy, because most of the next frame is identical to the current frame.

Tracking by colorization

"Color Pursuit" is a self-supervised learning method whose main goal is to colorize new frames using color information from earlier frames. This task becomes relatively easy if all objects can be tracked.

Specifically, the network needs to extract color information from reference frames (frames that have been colored), and then use this color information to color the input frames (frames that have not been colored). This is equivalent to tracking the movement of color information in the video.

Advantages of tracking by colorization include:

  1. Spontaneous behavior: The network may spontaneously learn techniques such as tracking, matching, optical flow, and segmentation, as these techniques are very helpful for correctly extracting color information from reference frames and applying it to input frames.

However, this approach also has some disadvantages:

  1. Low-level cues are effective: color information is a very direct, low-level cue, so the network may rely on these cues for learning while ignoring higher-level, more semantic information.
  2. Evaluate on grayscale frames: Since the input frame is not colored, the network must be evaluated on grayscale frames, which results in the loss of some color information.

Temporal ordering

"Is this set of frame sequences in the correct order" is a self-supervised learning method whose main goal is to judge whether a set of video frames is in the correct time order. This task becomes relatively easy if we can recognize actions and human poses in videos.

Specifically, the network needs to extract features for each frame and analyze these features to determine whether the sequence of frames is in the correct order. This is equivalent to tracking changes in movement and human posture over time.
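A hedged sketch of order verification: encode each frame with a shared encoder and classify whether the sequence is in the correct order, where the label comes from whether we shuffled the frames ourselves. The architecture and sizes are assumptions for illustration.

```python
# Temporal-order verification: correct order vs. shuffled, with self-generated labels.
import torch
import torch.nn as nn

class OrderVerificationNet(nn.Module):
    def __init__(self, frame_encoder, feat_dim, num_frames=3):
        super().__init__()
        self.frame_encoder = frame_encoder                  # shared per-frame encoder
        self.classifier = nn.Linear(num_frames * feat_dim, 2)

    def forward(self, frames):                              # frames: (N, T, C, H, W)
        feats = [self.frame_encoder(frames[:, t]) for t in range(frames.size(1))]
        return self.classifier(torch.cat(feats, dim=1))     # logits: correct vs. shuffled
```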

The advantages of "arranging this sequence of frames in the correct order" include:

  1. No training-evaluation gap: In both training and evaluation phases, the network is making sequence judgments, so there is no training-evaluation gap.
  2. Learning the ability to recognize human poses: Since the successful solution of this task requires the recognition of actions and human poses in the video, the network may learn the ability to recognize human poses in the process of solving this task.

However, this approach also has some disadvantages:

  1. Mainly focus on human pose: This method mainly focuses on human pose, but sometimes it is impossible to determine the correct sequence of frame sequences based on human pose alone, because different actions may have the same human pose.
  2. Scalability is questionable: although this method works well on frame sequences, it is unclear whether it extends to temporal networks (such as RNNs), because in that setting the task may become too simple.

In addition, this method has some extension directions:

  1. Randomly insert a frame among N frames and have the network find it. This requires the network not only to judge whether the frame sequence is correct, but also to find the frame that does not belong to the sequence.
  2. Use a ranking loss: the network should generate similar embeddings for frames that are close in time, and different embeddings for frames that are far away in time. This requires the network to be able to recognize the temporal distance between frames.

Multimodal

Bag-of-Words (BoW)

"Bag-of-Words (BoW)" is a technique commonly used in natural language processing and computer vision. The basic idea is to decompose an input (such as a piece of text or an image) into a set of "words" and then construct a "Bag of words" to represent this input.

In natural language processing, a "word" is a word in a text. In computer vision, a "word" can be a local feature or a certain pattern in an image.

We first perform feature extraction on images using a pretrained self-supervised convolutional neural network. Then, the extracted features are assigned to a visual vocabulary to form a "visual bag of words". Next, we can apply some random perturbations to the image (such as rotation, cropping, etc.) and try to predict the "bag of words" of the original image from the perturbed image.

Let us try to explain this concept with a more familiar example. Suppose we have an image containing multiple animals, such as cats, dogs, and rabbits.

In the method of using Bag-of-Words (BoW for short), first, we need a pre-trained neural network model that can recognize and extract features in images . For example, in our example, a neural network model might recognize cat features (such as tail, ears, and eyes), dog features (such as nose, legs, and tail), and rabbit features (such as ears and feet).

These features are regarded as **"visual words"**, and we put them all into a "bag of words", just as in text analysis we put all the words of a text into a bag of words. Therefore, no matter where these animals appear in the picture, or how their poses change, as long as these features are present in the picture, we can find the corresponding "visual words" in the bag of words.

Then, we apply some random perturbations to the image, such as rotating, zooming in, zooming out, cropping, etc. Next, we try to predict the bag of visual words of the original image from this perturbed image. This requires the neural network model to have strong learning and reasoning capabilities, and to be able to correctly identify visual words belonging to the original image from the perturbed image.
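A hedged sketch of this pipeline: a visual vocabulary is built by clustering features from the pretrained network (the k-means centroids are assumed to be given), the target is the bag-of-words histogram of the original image, and a student network is trained to predict that histogram from the perturbed image. All names and the KL objective are illustrative assumptions.

```python
# Bag-of-visual-words prediction: match the student's prediction on the perturbed image
# to the BoW histogram of the original image.
import torch
import torch.nn.functional as F

def bow_histogram(dense_features, vocabulary):
    """dense_features: (P, D) local features; vocabulary: (K, D) k-means centroids."""
    assignments = torch.cdist(dense_features, vocabulary).argmin(dim=1)   # nearest visual word
    hist = torch.bincount(assignments, minlength=vocabulary.size(0)).float()
    return hist / hist.sum()                                              # normalized BoW histogram

def bow_prediction_loss(student, pretrained_encoder, vocabulary, image, perturbed):
    with torch.no_grad():
        target = bow_histogram(pretrained_encoder(image), vocabulary)     # BoW of the ORIGINAL image
    logits = student(perturbed)                    # student maps the perturbed image to K logits
    pred = F.log_softmax(logits.squeeze(), dim=-1)
    return F.kl_div(pred, target, reduction="sum") # match predicted and target word distributions
```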

One of the great advantages of this method is that it can understand and describe images from different angles and scales, which is very important for many computer vision tasks, such as object recognition, scene understanding, etc. However, this method also has some limitations, such as it cannot capture the fine features in the image, and the relative position information between visual words. Therefore, although the bag of visual words is a powerful tool, in practical applications, we usually combine other methods, such as Convolutional Neural Networks (CNNs for short), to further improve the performance of our model.

Advantages of this approach include:

  1. The generated representation is invariant to the required transformations: that is, no matter how the image is rotated or cropped, as long as it contains the same "words", the same "bag of words" will be generated.
  2. Learn contextual reasoning skills: Since the perturbed image needs to predict the "bag of words" of the original image, the network needs to learn how to infer some parts of the image from other parts of the image.
  3. Inferring words in missing image regions: If a part of the image is missing or covered, we can also use the "bag of words" to predict which "words" this part may contain.

However, this approach also has some disadvantages:

  1. Need to start from another network: this network cannot be learned from scratch, it must be started from another pre-trained network.
  2. Limited ability to learn fine-grained features: While bag-of-words methods can identify general features in images, they may have limited learning ability for finer-grained features such as color, texture, etc.

In addition, although the "bag of visual words" is an effective feature extraction method, it loses spatial information, such as the relative position information between features, which is very important in many applications. So there is an improved approach, a "spatial bag of words", which retains part of the spatial information while keeping the bag-of-words characteristics.

Audio-visual correspondence

The self-supervised learning task of "audio-visual correspondence" is performed by combining audio with images. The goal is to tell based on the image and sound whether they are a match.

Let's illustrate with a simple example: Suppose you have a video clip where a football is kicked, and you hear the sound of a kick. In this example, the image and sound match because what you see visually as the football being kicked matches what you hear audibly.

However, if we replace the audio in this video with a cat meowing, the image and sound no longer match, because what you see is a football being kicked, but what you hear is a cat meowing.

In the task of "audio-visual correspondence", the goal of the neural network is to learn this correspondence. During training, the network needs to judge whether the input image and sound match. If there is a match, the network should output "yes"; if not, the network should output "no". [External link image transfer...(img-pjRJW98E-1686298179749)]

In this way, the network can learn how to extract useful features from visual and auditory signals and understand the correlation between the two signals.
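A minimal sketch of the two-stream setup: one encoder per modality and a small head that outputs match / no-match; mismatched pairs are assumed to be formed by pairing an image with audio from a different video. Module names and sizes are assumptions.

```python
# Audio-visual correspondence: binary classification of (image, audio) pairs.
import torch
import torch.nn as nn

class AVCorrespondenceNet(nn.Module):
    def __init__(self, image_encoder, audio_encoder, img_dim, aud_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.audio_encoder = audio_encoder
        self.head = nn.Sequential(nn.Linear(img_dim + aud_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2))        # logits: match / no match

    def forward(self, image, audio):
        fused = torch.cat([self.image_encoder(image), self.audio_encoder(audio)], dim=1)
        return self.head(fused)
```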

The advantage of this approach is that we can get representations of both modalities simultaneously without the need for additional data augmentation methods.

The disadvantage of this method is that not all images have corresponding sounds, that is, there are some "blind spots" that the network cannot learn. In addition, the instance-based nature of this method makes videos of the same category or instance negative samples, which may affect the results.

Leveraging narration


Origin blog.csdn.net/weixin_57345774/article/details/131118438