Improving Generalization in Reinforcement Learning–Based Trading by Using a Generative Adversarial Market Model
I. Introduction

Portfolio management is a long-standing component of quantitative trading, where the goal is to satisfy a predefined utility function by continuously reallocating funds among a set of financial products. Approaches to portfolio management fall into one of three types: 1) traditional approaches (such as momentum [1] and contrarian strategies [2]), 2) machine-learning methods (such as pattern matching [3]), and 3) reinforcement learning (RL)-based methods [4], [5]. With the rapid development of deep neural networks, many researchers have combined deep learning with RL and achieved remarkable results in many financial domains, such as foreign exchange trading [6], portfolio management [4], [5], [7], [8], and market making [9].

Most successful RL research uses realistic physics engines or dynamically interacting entities to build training environments. For example, AlphaZero [10] trains an agent to play a board game through self-play. Here, self-play means that the environment the training agent faces, i.e., the opponent it plays against, is generated by the best agent produced by the neural network over all previous iterations. The training agent receives continuous feedback in response to its own behavior, resulting in a robust and plausible interrelationship between the training environment and the agent. Research on RL-based portfolio management has been less successful. In such studies, historical price data are still used directly to construct the training environment [4], [5], [7], [8]. From the agent's point of view, such an environment provides no feedback in response to its behavior, which creates several problems when the agent optimizes its actions. First, the state obtained from the environment is independent of the agent's behavior. Interacting with such an unresponsive environment may violate the definition of a Markov decision process (MDP), which explicitly defines state transitions as depending on both the current state and the action. Since the MDP is the foundational formalism of RL, violating its definition can render the optimization process of an RL-based portfolio agent unsound. Second, this unresponsiveness means that the environment cannot react appropriately to the agent's behavior in the market; in other words, an environment built on historical price data cannot model the agent's influence on the market. As a result, agents optimized on historical price data may generalize poorly: trading knowledge learned from in-sample (training) data cannot be applied out-of-sample (testing). Regardless of how well a model fits the training data, a model that generalizes poorly is useless for practical decision problems. Generalization can therefore be considered the biggest hurdle in building an RL-based portfolio management model. The studies in [9] and [11] improve the generalization ability of RL-based trading agents by injecting randomization into the environment. However, these studies still use historical price data to construct the environment, so injecting random noise does not directly address the issues above.
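To make this unresponsiveness concrete, consider the following minimal sketch (our illustration only, not code from the cited studies; the gym-style interface and all names are assumed): the agent's action enters the reward computation, but the next state is read directly from the historical table and depends only on the time index, so P(s_{t+1} | s_t, a_t) effectively collapses to P(s_{t+1} | s_t).

```python
import numpy as np

class HistoricalReplayEnv:
    """Gym-style environment built directly from a fixed table of historical prices."""

    def __init__(self, prices: np.ndarray, window: int = 30):
        self.prices = prices        # shape (T, n_assets), historical closing prices
        self.window = window
        self.t = window

    def reset(self) -> np.ndarray:
        self.t = self.window
        return self.prices[self.t - self.window:self.t]

    def step(self, action: np.ndarray):
        # The action (portfolio weights) influences the reward ...
        asset_returns = self.prices[self.t] / self.prices[self.t - 1] - 1.0
        reward = float(action @ asset_returns)
        # ... but the next state is read straight from the dataset:
        # it depends only on the time index, never on the action taken.
        self.t += 1
        next_state = self.prices[self.t - self.window:self.t]
        done = self.t >= len(self.prices) - 1
        return next_state, reward, done
```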

In our opinion, two solutions can address these problems. The first is to have the RL-based portfolio agent interact with a real stock exchange while optimizing its portfolio. The second is to use another AI model to build a realistic virtual market for the RL agent to interact with. The first solution rewards the agent based on trading results in real financial markets; however, because of its high cost and the long data collection time required for the agent to converge, it cannot be practically applied to RL-based portfolio optimization. The second approach is where our main contribution lies. In this study, a variant of generative adversarial networks (GANs) is proposed to simulate market order behavior by modeling the distribution of historical limit orders. The generative model is then used to construct a synthetic stock exchange as the training environment for the agent. The proposed learning framework enables the agent to obtain simulated market responses to its trading decisions, strengthening the causal relationship between state and action. Furthermore, simulating a stock exchange keeps the agent from violating the definition of the MDP by allowing the agent to participate in the state transition process; this justifies the use of RL in portfolio optimization by ensuring that the fundamental assumptions underpinning the RL framework hold. By interacting with a simulated stock exchange, the agent can also explore a wider range of previously unseen market situations, making the training data more diverse. To the best of our knowledge, this is the first study to use generative models to reconstruct financial markets in RL-based portfolio management with the goal of improving agent generalization. The main contributions of this study are as follows:

  • A generative model called the limit order book GAN (LOB-GAN) models the distribution of historical limit orders. LOB-GAN is used to simulate the aggregate order behavior of investors in the market.

  • A limit order conversion module is introduced so that LOB-GAN synthesizes relative order quantities instead of directly predicting order prices and their corresponding quantities.

  • By pairing LOB-GAN's generator with a security matching system, a comprehensive simulated stock exchange, called the virtual market, is constructed. The virtual market can present simulated market reactions to the agent's trading decisions.

  • A novel RL-based learning framework for portfolio optimization that utilizes the virtual market is proposed. The framework ensures that the definition of an MDP is not violated by establishing a tighter interrelationship between actions and state transitions (a schematic sketch follows this list).
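To make the intended interaction concrete, the following schematic sketch shows how such a framework could be wired together; the LOBGenerator, MatchingEngine, and agent interfaces are hypothetical placeholders of our own, not the actual implementation described in Sections IV and V.

```python
def train_agent(agent, lob_generator, matching_engine, n_episodes=100, horizon=240):
    """Schematic training loop for an RL portfolio agent inside a simulated exchange."""
    for _ in range(n_episodes):
        state = matching_engine.reset()                  # initial order book / price state
        for _ in range(horizon):
            action = agent.act(state)                    # target portfolio weights
            agent_orders = agent.to_limit_orders(action, state)
            # The generator synthesizes the rest of the market's limit orders,
            # conditioned on the current (agent-influenced) market state.
            market_orders = lob_generator.sample(state)
            # The matching system clears all orders together; the resulting
            # book and price become the next state, so the transition depends
            # on the agent's action.
            next_state, fills = matching_engine.match(agent_orders + market_orders)
            reward = agent.portfolio_return(fills, next_state)
            agent.observe(state, action, reward, next_state)
            state = next_state
        agent.update()                                   # policy improvement step
```

The essential design point is that the matching system clears the agent's orders together with the generated market orders, so the resulting state genuinely depends on the agent's action.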

The rest of the paper is organized as follows: Section II reviews the literature; Section III states the assumptions and defines the problem; Section IV presents the proposed market behavior simulator, the construction of the virtual market, and other generalization strategies; Section V presents the proposed RL-based portfolio optimization framework; Section VI presents the experimental results; and Section VII concludes the paper and discusses future research directions.

II. Literature Review

This section reviews three bodies of literature: the use of RL in finance, generalization techniques for RL, and artificial market simulation.

A. Financial Reinforcement Learning

RL has been widely used in several areas of finance, such as market making and foreign exchange trading, and is especially prominent in portfolio management. In this section, we focus on the literature on RL-based portfolio management. As a rule of thumb, portfolio management can be broken down into three main steps: portfolio selection, which chooses the portfolio assets; portfolio weighting, which determines the capital allocation; and portfolio rebalancing, which decides whether and when to change the portfolio weights. Sbruzzi et al. [12] focus on portfolio selection and use an RL framework in which an asset-pool selection agent optimizes the selection strategy. Wang et al. [4] bridge portfolio selection and weighting with their proposed AlphaStock method; specifically, they formulate a specialized cross-asset attention network (CAAN) in AlphaStock to capture the interrelationships among portfolio assets. Jiang et al. [7] focus on portfolio weighting and propose the ensemble of identical independent evaluators (EIIE) topology. Their portfolio selection strategy is based directly on trading volume, and their learning framework accounts for transaction costs, a key issue in executing algorithmic trading strategies. The authors examine several time-series feature extraction models under the EIIE topology. Shi et al. [5] extend the EIIE topology with their ensemble of identical independent Inception (EIII) topology, which exploits the Inception network to consider price movements at different scales simultaneously. Their experimental results show that the EIII topology yields better portfolio performance than the original EIIE. Ye et al. [8] also extend the EIIE topology with their state-augmented RL (SARL) framework, which incorporates heterogeneous data sources to help the agent make better predictions. Tang et al. [13] likewise emphasize combining multiple sources, where traditional metrics and modules of pretrained GANs each constitute different data streams. Lee et al. [14] apply a novel RL algorithm that utilizes stacked denoising autoencoders (SDAEs) to build agents with robust state representations. Despite these advances, most studies on RL-based portfolio optimization use historical data to optimize agents, which may lead to agents with poor generalization ability.

B. Generalization in Reinforcement Learning

The problem of generalization in RL has been studied in various fields. Whitson et al. [15] split the generalization problem into within-task and off-task variants. In the within-task variant, generalization is satisfactory if an agent optimized on training trajectories performs well on test trajectories in the same environment. In the off-task variant, generalization is satisfactory when the agent performs well in an environment different from the training environment. The methods used to address the generalization problem in RL can be divided into five categories.

  • Regularization methods: Techniques such as dropout and L2 regularization are applied to prevent the agent from overfitting to a limited state space [16]. Igl et al. [17] propose selective noise injection (SNI), which preserves the regularization effect while alleviating its side effects on the gradients, to make such regularization better suited to RL.

  • Adversarial training: Different perturbation-generation strategies have been introduced into RL-based trading [9], [11]. The injected noise can 1) help the agent learn robust representations and 2) diversify the training environment.

  • Data augmentation: Transformations are applied to the state to make the training data more diverse [18], [19].

  • Transfer learning: Transfer learning is widely used for domain adaptation [20], with a focus on helping agents generalize to new tasks. Gamrian and Goldberg [21] further utilize GANs to map visual observations from the target domain to the source domain.

  • Meta-learning: The agent learns meta-policies that help it adapt quickly to other domains [22]. Wang et al. [23] also focus on enabling agents to adapt quickly to new tasks; they do so by extending recurrent networks to support meta-learning in RL.

In this study, we focus on the within-task generalization ability of agents, whose goal is to learn a general trading strategy that yields comparable portfolio performance during testing and training. This objective is similar to those in [9] and [11]. However, like other RL research in finance, research on improving generalization in finance has relied on training environments built from historical prices. Therefore, the aforementioned problems of using historical data remain unresolved in the literature.

C. Artificial Market Simulation

Researchers have long attempted to model investor behavior. Groundbreaking research centered on the efficient market hypothesis (EMH) [24], which holds that people are always rational enough to make optimal decisions. However, other researchers found that people do make irrational decisions, for example when herding [25], and behavioral economics was proposed to model this irrationality. Recent research has focused on behavior prediction. According to Lovric et al. [26], an investment decision can be modeled as the result of the interaction between the investor and the environment. Research also suggests several interdependent variables that affect the investment process, such as time preference, risk attitude, and personality. Furthermore, in the framework proposed by Shantha et al. [27], investors learn from their own trading experience (individual learning) or by imitating others (social learning).

Artificial market simulations allow researchers to construct situations that cannot be captured in historical data. Consequently, such simulations are widely used to analyze various issues in finance, such as short-selling regulations [28], transaction taxes [29], and the speed of order-matching systems [30]. Agent-based simulation, which combines multiple agents to reproduce the stylized facts of real markets, is the most common technique in artificial market simulation. The simulation process consists of several parts. First, the intelligence level, utility function, and learning ability of the participating agents are defined [31]. Second, the asset pricing mechanism is determined [32]. Third, the types and quantities of traded assets in the artificial market are declared [33]. Fourth, the learning process, which is highly correlated with the agents' intelligence level, is determined [34], [35]. Fifth, and finally, the simulated market is calibrated and validated: calibration selects the parameters that make the simulated market behave most like the real market, while validation checks whether the simulated market indeed behaves like the real market. In addition to constructing simulated markets with agent-based models, Li et al. [36] proposed Stock-GAN to generate limit order data with high fidelity to support market design and analysis in continuous trading systems. In this study, we utilize generative models to construct financial markets. We not only reconstruct a financial market with a realistic pricing mechanism, but also couple the simulated market with RL trading agents. By combining market simulation with an RL-based portfolio optimization framework, we overcome the aforementioned shortcomings of using historical price data for agent optimization.

III. Preliminaries

This section states the assumptions, discusses the limitations of this study, and defines the problem of applying RL to portfolio management.

A. Assumptions

We propose a generative model to simulate market responses to the agent's actions. Therefore, the following assumptions must be made:

  • Since the simulated financial market is responsible for generating plausible responses to the agent's actions, the agent is assumed to have the ability to influence the behavior of other investors in the market.

  • Investors' order behavior fully reflects the impact of exogenous variables on the financial market. Therefore, we model only the market's order behavior when synthesizing plausible market responses.

In addition to these assumptions, the study has a limitation. Because we still lack a systematic way to verify the authenticity of generated limit orders, assessing portfolio performance in the simulated financial market may expose the agent to the risk of unrealistic estimates. Therefore, we use historical price data to evaluate generalization ability.

B. Problem Definition

Portfolio management is a decision-making process in which funds are continually reallocated among different assets. The formulation of a portfolio strategy can be expressed as an MDP, represented as a tuple ⟨S, A, P, R, p_0, γ⟩, where S is the state space, A the action space, P the state transition function, R the reward function, p_0 the probability distribution of the initial state, and γ ∈ [0, 1) the reward discount factor. In portfolio management, the agent aims to find the optimal policy π(a | s), in which the action a ∈ A is optimal with respect to the state s ∈ S. Under the optimal policy, the expected return is maximized:

$$\pi^{*} = \arg\max_{\pi} \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right], \qquad (1)$$

where s_0 ∼ p_0, a_t ∼ π(· | s_t), and s_{t+1} ∼ P(· | s_t, a_t). The RL-based portfolio management framework mainly comprises an environment and an agent. The mapping from the MDP to the learning framework is described as follows.

1) Environment

The design of the environment includes the following elements: (1) the state s_t ∈ S, which contains the agent's transaction state or the window of the price sequence provided by the environment; (2) the state transition P(· | s_t, a_t), which presents the next state s_{t+1} given the previous state and action; and (3) the reward function R(s_t, a_t), a utility function that defines the agent's portfolio performance and serves as the objective the agent maximizes.
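As a concrete illustration of the reward element, the sketch below computes a one-period portfolio log return net of proportional transaction costs, in the spirit of the EIIE line of work [7]; the function name, the cost model, and the default cost rate are our own assumptions for illustration, and the reward actually used in this study is defined in Section V.

```python
import numpy as np

def portfolio_reward(weights: np.ndarray,
                     prev_weights: np.ndarray,
                     price_relatives: np.ndarray,
                     cost_rate: float = 0.0025) -> float:
    """One-period portfolio log return net of proportional transaction costs.

    weights         -- weights chosen by the agent for this period (sum to 1)
    prev_weights    -- weights held before rebalancing
    price_relatives -- element-wise p_{t+1} / p_t for each asset
    cost_rate       -- proportional transaction cost rate (assumed value)
    """
    turnover = np.abs(weights - prev_weights).sum()
    gross_growth = float(weights @ price_relatives)           # portfolio value growth factor
    net_growth = gross_growth * (1.0 - cost_rate * turnover)  # deduct rebalancing cost
    return float(np.log(net_growth))
```

For example, with weights [0.5, 0.5], previous weights [1.0, 0.0], and price relatives [1.02, 0.99], the function returns the net log growth of the portfolio over one period.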
