Article Directory

Application of Machine Learning to Quantitative Models
- Machine Learning Quantification Application Scenarios
- Thinking about the Effectiveness of Quantitative Models
Application of Machine Learning Models in Quantitative Timing
- Training and Prediction Process
- Training Data Feature Construction
SVM model and calculation

ChatGPT has been very popular recently, and those who work in NLP will feel this more deeply than most. It is a good thing that NLP applications are becoming widely known and actively deployed, but each application-level scenario rests on a task that successive SOTA models in the field have gradually conquered. Unfortunately, breakthroughs on individual tasks at the algorithm level have slowed markedly in recent years, even as progress at the application level has accelerated.

ps: So far the word "Skynet" has not come up in the news, hhhhhhh. Back when VR and AR had barely produced anything, mentions of "Skynet" were everywhere.

Here we use SVM, a relatively simple and commonly used machine learning model, to assist with market timing and obtain excess returns.


Application of Machine Learning to Quantitative Models

Machine Learning Quantification Application Scenarios

The blogger groups the ways machine learning is applied to quantitative strategies into the following three scenarios:

- Construct a quantitative strategy with a win rate above 50%. Whether or not the model is explainable, increasing the number of trades makes the overall return converge toward its mean, which yields the expected excess return
- Within a logical framework that can plausibly earn excess returns, use a machine learning model to optimize the details so that, with the model's help, the average expected return shifts higher
- Based on a pricing model, earn excess returns as the market corrects

Each scenario corresponds to a different quantitative approach and to a different researcher background:

- The first type suits people with a strong engineering background. The difficulty is to demonstrate, under the premise that "history will not repeat itself", that the model can earn excess returns and that earning them is a high-probability event. This is mainly high-frequency trading
- The second type suits finance practitioners who can program. The difficulty lies in demonstrating the logical chain that leads to excess returns
- The third type suits finance practitioners with both programming ability and experience. The difficulty lies in identifying and eliminating noise in the market, or in correcting and optimizing the pricing model

Thinking about the Effectiveness of Quantitative Models

The current consensus is that investment tasks are far too complex for machine learning to handle on its own, so machine learning models are usually used to optimize within a logical framework defined by humans.

Having studied this far and read many quantitative books and strategies, the blogger has some thoughts to share:

In fact, many readers, like the blogger, moved from computer science into finance, so "quantification" is a good entry point for us: the closer the work is to data analysis, the more comfortable we are. But compare humans with algorithms:
- Humans are good at stripping away noise, summarizing, and distilling a thick book down to its essentials
- Machines are good at statistics, reasoning, and expanding a thin book into ever more detail

More than half a century of econometric modeling has shown that the "result data" of finance and pricing are chaotic and random in their information content. It is therefore best not to let machines "think in your place": an algorithm's output can at most offer some inspiration and falls well short of assisting the thinking itself. Likewise, do not assume that "the more features, the better". Garbage features are a source of noise, and machines cannot filter them out on their own. "Humans" must first understand finance and have a logic, and only then construct the algorithm.

Apart from parameter tuning, there are generally two ways to improve a machine learning model:
- Manually construct feature series that withstand logical scrutiny
- Do not eliminate features in advance based on the standard rules of data analysis

From experience, take the random forest model the blogger often uses: when you want to improve results purely by adjusting features and data, without tuning parameters, the first rule is not to drop a feature just because its distribution is skewed or the like. Each feature is a perspective; some perspectives are more accurate and some are odd, but every one has value. This is where human involvement is needed: construct suitable perspectives to match these features and reprocess them. The less important a feature appears, the more it can serve as a source of inspiration and the greater the room for improvement! Eliminating it in advance would be a big loss.

Differences in professional training make us see the world from different perspectives. As the saying goes, what you study shapes the kind of person you become. Finance students put "risk management" first and have an almost instinctive eye for "survivorship bias", which is very powerful! However, from what I have observed, in pursuit of the "theoretical mean" many quantitative strategies lean too heavily on data-driven theory and indulge the model, which calls for particular caution.

This blog only uses the SVM model for calculations. For more machine learning models, please refer to: https://blog.csdn.net/weixin_35757704/article/details/89280669

Application of Machine Learning Models in Quantitative Timing

Training and Prediction Process

Applying machine learning usually involves the following steps:

- Clean the data
- Split the data into a training set and a test set
- Cross-validate on the training set to check the model's stability
- Use the test set to judge the model's effectiveness
- Apply the model and backtest

Therefore, we divide the time into the following two parts:

Training and test data time: 2015-01-01 to 2020-01-01
Application model calculation and backtest time: 2020-01-01 to 2023-01-01
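
The two periods above can be separated with a simple date filter. This is a minimal sketch, assuming each record is a dict with a `date` field (a hypothetical layout, not the blogger's actual data format). Since 2020-01-01 appears in both ranges, the sketch treats both ranges as half-open so that this date lands only in the backtest period; that boundary choice is also an assumption.

```python
from datetime import date

# Period boundaries taken from the text; half-open intervals are an assumption.
TRAIN_START = date(2015, 1, 1)
BACKTEST_START = date(2020, 1, 1)
BACKTEST_END = date(2023, 1, 1)


def split_by_period(rows):
    """Split records into (train/test rows, backtest rows) by their 'date' field."""
    train_test = [r for r in rows if TRAIN_START <= r["date"] < BACKTEST_START]
    backtest = [r for r in rows if BACKTEST_START <= r["date"] < BACKTEST_END]
    return train_test, backtest


# Hypothetical records, for illustration only.
rows = [
    {"date": date(2014, 12, 31), "close": 10.0},
    {"date": date(2019, 12, 31), "close": 11.0},
    {"date": date(2020, 1, 1), "close": 11.2},
    {"date": date(2023, 1, 1), "close": 12.0},
]
train_test_rows, backtest_rows = split_by_period(rows)
print(len(train_test_rows), len(backtest_rows))
```

Keeping the backtest period strictly after the training and test period means the model never sees data from the window on which it is later evaluated.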

Training Data Feature Construction

To keep things easy to follow, we construct a fairly simple set of features:

- Average turnover rate over the past 5 days
- Average turnover rate over the past 10 days
- Price change over the past 5 days
- Price change over the past 10 days
- MACD indicator DIF value
- MACD indicator DEA value
- MACD value
- Aroon indicator (a momentum indicator) DOWN value
- Aroon indicator UP value
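
The nine features can be computed with pandas roughly as follows. This is a sketch under stated assumptions, not the blogger's actual code: the column names `close`, `high`, `low`, and `turnover` are hypothetical, the MACD uses the common 12/26/9 settings, the MACD value is taken as 2 × (DIF − DEA) as in some Chinese charting software, and the Aroon window is 25 days. None of these parameters is given in the article.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the nine features for one stock; rows must be sorted by date."""
    close = df["close"]
    out = pd.DataFrame(index=df.index)

    # Average turnover rate over the past 5 and 10 trading days.
    out["turnover_ma5"] = df["turnover"].rolling(5).mean()
    out["turnover_ma10"] = df["turnover"].rolling(10).mean()

    # Price change over the past 5 and 10 trading days.
    out["ret_5"] = close / close.shift(5) - 1
    out["ret_10"] = close / close.shift(10) - 1

    # MACD with 12/26/9 exponential moving averages (assumed settings).
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    out["dif"] = ema_fast - ema_slow
    out["dea"] = out["dif"].ewm(span=9, adjust=False).mean()
    out["macd"] = 2 * (out["dif"] - out["dea"])  # factor 2 is an assumed convention

    # Aroon over n days (n = 25 assumed): 100 * (n - days since extreme) / n.
    n = 25
    window = n + 1  # today plus the n previous days
    out["aroon_down"] = df["low"].rolling(window).apply(
        lambda w: 100.0 * (n - np.argmin(w[::-1])) / n, raw=True
    )
    out["aroon_up"] = df["high"].rolling(window).apply(
        lambda w: 100.0 * (n - np.argmax(w[::-1])) / n, raw=True
    )
    return out


# Synthetic, steadily rising prices, for illustration only.
days = 60
demo = pd.DataFrame({
    "close": [10 + 0.1 * i for i in range(days)],
    "turnover": [1.5] * days,
})
demo["high"] = demo["close"] + 0.05
demo["low"] = demo["close"] - 0.05
features = build_features(demo)
print(features.tail(3))
```

The first rows contain NaN values until each rolling window is full, so they would normally be dropped before training.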

SVM model and calculation

SVM training and prediction

Once the data is ready, a model whose ultimate goal is returns usually has one of the following training targets:

- Directly predict the rate of return over a future period
- Predict the range in which the return over a future period will fall

Because the predictive power of machine learning models is limited, when the ultimate goal is the rate of return one usually chooses to "predict the range in which the return over a future period will fall".

Therefore, we train and predict according to the following rules:

- Use 70% of the data as the training set and 30% as the test set
- Take the [price change over the next 5 days] as the prediction target and bin it into three classes:
  - Return interval: (-∞, -1)
  - Return interval: [-1, 1]
  - Return interval: (1, +∞)
- On the training set, run 10-fold cross-validation
- On the test set, compute the confusion matrix and visualize it
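
The rules above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the blogger's actual code. The assumptions are: the thresholds -1 and 1 are in percent, the classes are numbered 0, 1, 2 from the lowest to the highest interval, the 70/30 split is random, the features are standardized before the SVM, and the kernel is RBF. The article specifies none of these details.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins: 600 samples, 9 features, 5-day forward return in percent.
X = rng.normal(size=(600, 9))
future_ret = rng.normal(loc=0.0, scale=2.0, size=600)

# Bin the target: 0 -> below -1, 1 -> within [-1, 1], 2 -> above 1.
y = np.where(future_ret < -1.0, 0, np.where(future_ret <= 1.0, 1, 2))

# 70% training set, 30% test set (random split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Standardization and the RBF kernel are assumed choices.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 10-fold cross-validation on the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Fit on the full training set, then evaluate once on the test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Row-normalized confusion matrix: each row shows how one true class was predicted.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2], normalize="true")

print("CV accuracy per fold:", np.round(cv_scores, 4))
print("Test accuracy:", round(test_accuracy, 4))
print(np.round(cm, 3))
```

Because the synthetic features carry no information about the labels, the scores here only demonstrate the mechanics; they say nothing about results on real data.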

The "cross-validation" above is there to diagnose over-fitting and under-fitting. Many articles habitually blame poor results on "over-fitting", which is clearly not always the right diagnosis. For more on over-fitting and under-fitting, see: https://blog.csdn.net/weixin_35757704/article/details/123931046
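
One common way to read cross-validation results for this purpose is to compare the score on the training folds with the score on the held-out folds. The sketch below uses scikit-learn's `cross_validate` on synthetic, pure-noise data; the data and the rule of thumb in the comments are assumptions added for illustration, not part of the original article.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 9))     # synthetic features
y = rng.integers(0, 3, size=300)  # synthetic labels with no real signal

# return_train_score=True also reports accuracy on the training folds.
results = cross_validate(SVC(), X, y, cv=10, return_train_score=True)
train_mean = results["train_score"].mean()
valid_mean = results["test_score"].mean()
gap = train_mean - valid_mean

# Rule of thumb (assumed):
# - training score far above validation score -> over-fitting
# - both scores low and close together -> under-fitting, or no signal in the features
print(f"train={train_mean:.3f}  validation={valid_mean:.3f}  gap={gap:.3f}")
```

If both scores are low and close together, blaming over-fitting is the wrong diagnosis; the features or the model simply do not capture the target.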

Effect measurement

The calculation process is as follows:

- Collect data for every non-ST stock from 2015-01-01 to 2020-01-01
- From each stock's price history, construct the 9 features above
- Use 70% of the data as the training set and 30% as the test set
- Run 10-fold cross-validation on the training set

Training and predicting according to the rules above gives the following results:

The accuracy on the test set is 0.4751
The normalized confusion matrix is as follows:
The results using 10-fold cross-validation are as follows:

Fold	1	2	3	4	5	6	7	8	9	10
Accuracy	0.492502	0.488092	0.478529	0.473529	0.485882	0.477647	0.477059	0.484118	0.480882	0.486176
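
To compare the fold scores with the test-set accuracy, it helps to reduce them to a mean and a spread. A small sketch using only the numbers reported above:

```python
from statistics import mean, stdev

# Fold accuracies and test accuracy copied from the results above.
cv_scores = [0.492502, 0.488092, 0.478529, 0.473529, 0.485882,
             0.477647, 0.477059, 0.484118, 0.480882, 0.486176]
test_accuracy = 0.4751

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)      # sample standard deviation across the 10 folds
gap = cv_mean - test_accuracy  # distance between the CV mean and the test score

print(f"CV mean={cv_mean:.4f}, std={cv_std:.4f}, gap to test={gap:.4f}")
```

A small spread across folds together with a small gap to the test score is what "stable" means in this context; it says nothing about whether the accuracy itself is good.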

In actual use, we act on the model's prediction: if the model predicts a positive return, we buy; if it predicts a negative return, we sell.
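
This trading rule amounts to a lookup from predicted class to action. The sketch below assumes the classes are numbered 0 (return below -1), 1 (between -1 and 1), and 2 (above 1), and that the middle class means "hold"; the article only specifies the buy and sell cases, so the numbering and the "hold" action are assumptions.

```python
# Assumed class numbering: 0 = negative return, 1 = flat, 2 = positive return.
SIGNAL_BY_CLASS = {0: "sell", 1: "hold", 2: "buy"}


def to_signal(predicted_class: int) -> str:
    """Translate a predicted return class into a trading action."""
    return SIGNAL_BY_CLASS[predicted_class]


# Hypothetical model predictions for four consecutive days.
predictions = [2, 1, 0, 1]
signals = [to_signal(c) for c in predictions]
print(signals)
```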

Effectiveness analysis

- The cross-validation scores are close to the accuracy on the test set, which indicates that the SVM model's performance is relatively stable
- The SVM assigns samples from classes 0, 1, and 2 to class 1 almost indiscriminately, regardless of their true class; its accuracy on classes 0 and 2 is only about 10%

This result is about what one would expect: with no optimization, no parameter tuning, and no subjectively constructed features, the bare model has almost no effect...

Quantitative Timing - SVM Machine Learning Quantitative Timing (Part 1 - Factor Measurement)