【Imitation Learning】: Offline and Online Imitation

1. Description

         Imitation Learning is a type of machine learning in which agents learn by observing and imitating the behavior of experts. In this approach, the agent is given a set of demonstrations or examples of the desired behavior and learns the mapping between input observations and output actions by attempting to replicate the behavior of an expert.

        Imitation learning is often used in scenarios where it is difficult to define an objective function for the agent to optimize, such as complex tasks like playing a game or driving a car. By learning from expert demonstrations, agents can achieve high levels of performance without complex, hand-designed reward functions.

        One of the main challenges of imitation learning is handling distribution shift when the agent is exposed to a new environment or a different set of inputs, which can lead to brittle learned behaviors or unexpected failures. Techniques such as domain adaptation and inverse reinforcement learning can be used to address this problem.


2. What is imitation learning?

        As the name itself suggests, nearly all species, including humans, learn through imitation and improvisation; in a word, it is evolution at work. Likewise, we can have machines imitate us and learn from human experts. Autonomous driving is a good example: we can have agents learn from millions of driver demonstrations and imitate expert drivers.

        This learning from demonstrations, also known as imitation learning (IL), is an emerging field of reinforcement learning and artificial intelligence in general. The application of IL in robots is ubiquitous, and robots can learn policies by analyzing policy demonstrations performed by human supervisors.

        Expert Absence vs. Presence: Imitation learning takes two directions, depending on whether the expert is absent during training or present to correct the agent's behavior. Let's start with the first case, where the expert is absent.

3. Absence of experts during training

        The absence of experts basically means that agents only have access to expert demonstrations and nothing more. In these "expert-absent" tasks, the agent tries to use a fixed training set of state-action pairs demonstrated by the expert to learn a policy that produces actions as similar as possible to the expert's. Such "expert-absent" tasks are also called offline imitation learning tasks.

        This problem can be framed as supervised learning. The expert demonstration consists of many training trajectories, each of which is a sequence of observations and a sequence of actions performed by the expert. These training trajectories are fixed and not affected by the agent's policy.

        The policy can therefore be obtained by solving a simple supervised learning problem: we train a model that directly maps states to actions, imitating the expert through his or her demonstrations. We call this approach "behavior cloning".

        Now we need a loss function to quantify the difference between the demonstrated behavior and the learned policy. We formulate this as maximizing the expected log-likelihood of the expert's actions under the learned policy.

$$\min_{\pi}\; \mathbb{E}_{(s,a)\sim D}\big[\,\|a - \pi(s)\|^2\,\big] \;\Longleftrightarrow\; \max_{\pi}\; \mathbb{E}_{(s,a)\sim D}\big[\,\log \mathcal{N}(a \mid \pi(s), \sigma^2 I)\,\big]$$
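        To see why, here is a brief derivation, assuming the policy outputs the mean of a fixed-variance Gaussian over actions:

$$\log \mathcal{N}(a \mid \pi(s), \sigma^2 I) = -\frac{\|a - \pi(s)\|^2}{2\sigma^2} + \text{const},$$

        so maximizing the expected log-likelihood is the same as minimizing the expected squared error between the expert's actions and the policy's predictions.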

        If we are solving a classification problem, we choose the cross-entropy loss, and if we are solving a regression problem, we choose the L2 loss. It is easy to see that minimizing the L2 loss is equivalent to maximizing the expected log-likelihood under a Gaussian distribution with fixed variance.
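        To make this concrete, here is a minimal behavior-cloning sketch in PyTorch. The dimensions, network size, hyperparameters, and the randomly generated "expert" data are illustrative assumptions, not from the original post; in practice the (state, action) pairs would come from recorded expert demonstrations. It trains a small policy network with an L2 (MSE) loss, which, as noted above, corresponds to maximum likelihood under a fixed-variance Gaussian.

```python
import torch
import torch.nn as nn

# Placeholder "expert" dataset: in practice these (state, action) pairs
# come from recorded expert demonstrations.
state_dim, action_dim = 4, 2
expert_states = torch.randn(1024, state_dim)
expert_actions = torch.randn(1024, action_dim)

# A small policy network that maps states directly to actions.
policy = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # L2 loss ~ Gaussian log-likelihood with fixed variance

# Behavior cloning = plain supervised regression on the expert data.
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(policy(expert_states), expert_actions)
    loss.backward()
    optimizer.step()
```

        For a discrete action space, the last layer would output logits and the MSE loss would be replaced by cross-entropy, matching the classification case mentioned above.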

4. Challenges

        So far, everything looks good, but an important disadvantage of behavior cloning is generalization. The expert collects only a subset of the infinite possible states that the agent can experience. A simple example: a professional driver would not go off course to collect unsafe and dangerous states, but an agent might encounter such dangerous states and would not learn corrective actions because there is no data for them. This is known as "covariate shift": the states encountered at test time differ from those encountered during training, which reduces robustness and generalization.

        One way to address this covariate shift is to collect more demonstrations of risky states, which can be very expensive. Having the expert present during training can help us address this issue and bridge the gap between the demonstrated policy and the learned policy.

5. Expert Presence: Online Learning

        In this section, we introduce the most famous online imitation learning algorithm, the dataset aggregation method DAgger. This approach is very effective at bridging the gap between the states encountered during training and the states encountered during testing, i.e., covariate shift.

        What if the expert could evaluate the learner's policy during the learning process? The expert provides examples of the correct actions to take in the states visited under the learner's own behavior. This is exactly what DAgger does. The main advantage of DAgger is that the expert teaches the learner how to recover from its own mistakes.

        The steps are simple and similar to behavioral cloning, except that we collect more trajectories based on what the agent has learned so far.

        1. Initialize the policy by behavior cloning on the expert demonstrations D, resulting in policy π1.
        2. The agent uses π1 to interact with the environment and generate a new dataset D1 of trajectories.
        3. D = D ∪ D1: we aggregate the newly generated dataset D1 into the expert demonstrations D.
        4. The aggregated dataset D is used to train the next policy π2, and the process repeats.

        To exploit the presence of the expert, a mixture of the expert and the learner is used to query the environment and collect data: at iteration i the rollout policy is πi = βi π* + (1 − βi) π̂i, where π* is the expert and π̂i is the current learned policy. Thus, DAgger learns a policy from expert labels gathered under the state distribution induced by the learned policy. Setting β = 0 means that all trajectories in that iteration are generated by the learning agent alone.
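        As a rough sketch of this loop (the environment interface, the expert_action function, the train_policy helper, and the β schedule below are illustrative assumptions, not part of the original post): each iteration rolls out a β-mixture of expert and learner, has the expert label every visited state with the correct action, aggregates the data, and retrains the policy.

```python
import random

def dagger(env, expert_action, train_policy, n_iters=5, horizon=200):
    """Sketch of the DAgger loop; the interfaces below are assumed for illustration.

    env            -- has reset() -> state and step(action) -> (next_state, done)
    expert_action  -- expert policy pi*: maps a state to the correct action
    train_policy   -- supervised learner: list of (state, action) pairs -> policy
    """
    dataset = []              # aggregated dataset D
    learner = expert_action   # before any training, fall back on the expert

    for i in range(n_iters):
        beta = 1.0 if i == 0 else 0.0   # simple schedule: expert first, learner afterwards
        state = env.reset()
        for _ in range(horizon):
            # The mixture policy pi_i = beta * pi* + (1 - beta) * learner picks the action...
            action = expert_action(state) if random.random() < beta else learner(state)
            # ...but the expert always labels the visited state:  D = D U D_i
            dataset.append((state, expert_action(state)))
            state, done = env.step(action)
            if done:
                break
        # Retrain the policy on the aggregated dataset (a behavior cloning step).
        learner = train_policy(dataset)
    return learner
```

        With β = 0 after the first iteration, all rollout states come from the learner's own policy, which is exactly the case described above, while the expert still supplies the action labels for those states.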

6. Algorithm

DAgger alleviates the problem of "covariate shift" (the state distribution induced by the learner's policy differs from the state distribution in the initial demonstration data). This approach significantly reduces the size of the training dataset needed to achieve satisfactory performance.

7. Conclusion

        DAgger has achieved extraordinary success in robotic control and has been applied to control drones. Online learning methods such as DAgger are essential in these applications because the learner encounters states where the expert has not demonstrated how to act.

        In the next blog in this series, we will look at the disadvantages of the DAgger algorithm and, importantly, highlight its safety aspects.
