[Data analysis] A/B test study notes

Reprinted Source: https://www.cnblogs.com/zichun-zeng/p/9042779.html

AB Test description:

https://vwo.com/ab-testing/

Why A/B testing matters:

Data analysis tells us whether something is worth doing. A/B test feedback tells us whether we did it well, where the problems are, and how much growth a change can bring with what degree of certainty.

First, the theoretical basis

 

1, the central limit theorem:

 

The mean (or sum) of a large number of independent random variables has a normal distribution as its limiting distribution. In other words, when certain conditions are met (for example, the sample size is fairly large), the distribution of the sample mean gets closer and closer to a normal distribution as the sample size grows. The remarkable part of the theorem is that it holds whatever the distribution of the underlying random variable, provided its variance is finite.
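The theorem can be checked with a small simulation. The sketch below is only an illustration: the exponential source distribution, the sample sizes and the helper names (`sample_means`, `skewness`, `draw_exponential`) are assumptions chosen for the example, not part of the original notes. It compares the skewness of raw draws with the skewness of sample means; a value near 0 indicates a roughly symmetric, bell-like shape.

```python
import random
import statistics


def draw_exponential(rng):
    """One draw from a strongly skewed source: exponential with mean 1."""
    return rng.expovariate(1.0)


def sample_means(draw, sample_size, num_samples, rng):
    """Return num_samples means, each computed over sample_size draws."""
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


rng = random.Random(42)  # fixed seed so the run is repeatable
raw = [draw_exponential(rng) for _ in range(20000)]
means = sample_means(draw_exponential, sample_size=100, num_samples=2000, rng=rng)

print("skewness of raw draws:   ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(means), 2))
```

The raw draws stay heavily skewed, while the means of 100 draws are already much closer to symmetric, which is the behaviour the theorem describes.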

 

2, the law of large numbers:

 

Put simply: take a random variable X, keep observing it and draw n sample values, then compute the average of those n values. As n tends to positive infinity, this average converges to the expectation of X.
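A quick way to see this is to track a running average. The sketch below assumes a fair six-sided die as the random variable X (expectation 3.5); the function name and checkpoints are illustrative choices only.

```python
import random


def running_mean_error(num_draws, checkpoints, rng):
    """Roll a fair die num_draws times; record |running mean - 3.5| at each checkpoint."""
    total = 0
    errors = {}
    for n in range(1, num_draws + 1):
        total += rng.randint(1, 6)
        if n in checkpoints:
            errors[n] = abs(total / n - 3.5)
    return errors


errors = running_mean_error(100_000, {10, 1_000, 100_000}, random.Random(7))
for n in sorted(errors):
    print(f"n={n:>7}: distance from the expectation = {errors[n]:.4f}")
```

The distance is not guaranteed to shrink at every step, but for large n it becomes small with very high probability.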

 

3, confidence intervals and statistical significance

Reference:

https://zhuanlan.zhihu.com/p/24399612

 

Concepts: sample and population

 

A confidence interval is an interval estimate of a population parameter built from a sample (for example, a range around the sample mean). The probability that an interval constructed this way contains the population parameter is called the confidence level;

 

The confidence level expresses how reliable the estimate is. In general, interval estimates use a 95% confidence level.

 

What the confidence interval means in an A/B test (the confidence interval of the difference between the two population means):

 

The Z value is computed with the large-sample form of the t-test (from the sample means, sample sizes and sample variances; combined with the distribution of the statistic, it also yields the p value used to decide whether to reject the null hypothesis). Then, from the sample means, standard deviations and sample sizes of the two groups, the 95% confidence interval of the difference between the two population means is given by the standard large-sample formula:

(mean_B - mean_A) ± 1.96 × sqrt(sd_A² / n_A + sd_B² / n_B)

where A is the control version and B is the test version.
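As a worked illustration of this calculation, the sketch below computes the z statistic, the two-sided p value and the confidence interval for the difference (test minus control) as the difference plus or minus the critical value times the standard error. The function name `two_sample_z` and the input numbers are made up for the example; only the standard library is used.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b, confidence=0.95):
    """Large-sample z test and confidence interval for mean_b - mean_a."""
    diff = mean_b - mean_a
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z, p_value, (diff - z_crit * se, diff + z_crit * se)


# Hypothetical data: control (A) and test (B) with 1000 users each.
z, p_value, (low, high) = two_sample_z(10.0, 2.0, 1000, 10.3, 2.0, 1000)
print(f"z = {z:.3f}, p = {p_value:.5f}, 95% CI = ({low:.4f}, {high:.4f})")
```

With these inputs the interval lies entirely above zero, so the difference would count as statistically significant at the 95% level.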

 

 

It is worth noting that when the upper and lower limits of the confidence interval are both positive or both negative, this only shows that the test is statistically significant (that is, the test version and the control version differ). The difference may still be very small and insignificant in practice. Only a result that is both statistically significant and large enough in effect shows that the version is usable and worth releasing.
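The two conditions can be written down as a small decision rule. This is a sketch only: `min_effect` is a hypothetical threshold for the smallest improvement worth shipping, and the wording of the outcomes is illustrative.

```python
def release_decision(ci_low, ci_high, min_effect):
    """Classify a confidence interval for (test - control) against a minimum worthwhile effect."""
    if ci_low <= 0 <= ci_high:
        # The interval contains zero: no statistically significant difference.
        return "not statistically significant"
    if ci_high < 0:
        return "significantly worse than control"
    if ci_low >= min_effect:
        return "significant and large enough to release"
    return "statistically significant, but the effect may be too small"


print(release_decision(-0.10, 0.20, min_effect=0.05))
print(release_decision(0.01, 0.03, min_effect=0.05))
print(release_decision(0.06, 0.10, min_effect=0.05))
```

Requiring the lower limit to clear `min_effect` is a deliberately conservative choice; a team could instead compare the point estimate with the threshold.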

 

Second, precautions for A/B test experiments

 

1, time consistency;

 

2, data distribution consistency;

 

3, statistically significant results can guide decision-making;

 

4, experiment bucket design (traffic should be evenly distributed):

 

If differences between users are not reflected in how the experiment is bucketed (for example, the buckets end up with different user mixes), the apparent gap between the algorithms is amplified, which can produce Simpson's paradox;

 

5, confidence

 

Trustworthy test results require a certain amount of traffic (sample) and time. If the traffic (sample) is too small or unevenly bucketed, the results are subject to chance and may not be reliable; the same holds if the test runs for too short a time;

 

6, time

 

The experiment period should avoid the influence of external factors and be as stable as possible over time, to reduce external interference;

 

Sometimes, to ensure confidence in the results and to keep low or uneven traffic from distorting the conclusions, the traffic allocation is increased gradually during the test while the trend of the key metrics is monitored, so that confidence is built up step by step;
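Point 4 above (uneven buckets leading to Simpson's paradox) is easy to reproduce with arithmetic. The numbers below are invented for illustration: the test version converts better in each user segment, yet looks worse overall because the two buckets received very different mixes of heavy and light users.

```python
# (users, conversions) per segment and bucket; made-up numbers for illustration.
data = {
    "heavy users": {"control": (8000, 1600), "test": (2000, 440)},
    "light users": {"control": (2000, 100), "test": (8000, 480)},
}


def rate(users, conversions):
    """Conversion rate of one group."""
    return conversions / users


def overall_rate(bucket):
    """Conversion rate of a bucket with all segments pooled together."""
    users = sum(segment[bucket][0] for segment in data.values())
    conversions = sum(segment[bucket][1] for segment in data.values())
    return rate(users, conversions)


for name, segment in data.items():
    print(f"{name}: control {rate(*segment['control']):.3f}, test {rate(*segment['test']):.3f}")
print(f"overall: control {overall_rate('control'):.3f}, test {overall_rate('test'):.3f}")
```

Keeping the user mix the same in both buckets (random, even assignment) removes the reversal.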

 

Third, how traffic splitting and bucketing work

The following must be ensured:

(1) within one experiment, assignment to the different buckets is random;

(2) across different scenarios and experiments, users are shuffled into buckets again;

(3) when designing the experiment, decide which factor needs to be verified; buckets can then be divided according to that factor;

 

The relationship between traffic splitting and bucketing:

  Traffic splitting means drawing a random sample of some percentage of the population to run the experiment on;

  Bucketing means randomly dividing the traffic inside the experiment into buckets according to the factor to be verified;
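One common way to implement both steps is deterministic hashing of the user id. The sketch below is an assumption about how such a system might look, not a description of any particular platform: the function names, the salt format and the use of SHA-256 are illustrative. Salting the hash with the experiment name means users are reshuffled for every experiment, and using separate salts for the two steps keeps the traffic split independent of the bucket assignment.

```python
import hashlib


def _hash_fraction(user_id, salt):
    """Map (salt, user_id) to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 16 ** 8


def assign(user_id, experiment, traffic_share, buckets):
    """Return the bucket name for user_id, or None if the user is outside the experiment."""
    # Traffic splitting: only a share of all users enters the experiment.
    if _hash_fraction(user_id, f"{experiment}:split") >= traffic_share:
        return None
    # Bucketing: a second, independent hash spreads the selected users over the buckets.
    index = int(_hash_fraction(user_id, f"{experiment}:bucket") * len(buckets))
    return buckets[index]


users = [f"user-{i}" for i in range(20000)]
assigned = [assign(u, "exp-1", traffic_share=0.2, buckets=["A", "B"]) for u in users]
in_experiment = [b for b in assigned if b is not None]
print("share in experiment:", len(in_experiment) / len(users))
print("share of bucket A:  ", in_experiment.count("A") / len(in_experiment))
```

Because the assignment depends only on the user id and the experiment name, the same user always lands in the same bucket for a given experiment.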

Fourth, ways to check for unbalanced buckets

 

1, AA testing

 

An A/A test can be understood as an A/B test run on two identical versions. Its usual purpose is to verify that the tool used to run the test is statistically fair. In an A/A test that is run correctly, the control and experimental groups should show no difference.

 

If an A/B test compares the merits of several options, an A/A test is an effective way to build confidence in the A/B testing tool itself.

 

Situations in which an A/A test is worth running:

 

(1) you have just installed a new testing tool or changed the settings of the existing one;

 

(2) you find that the A/B test data differ from the results in your analytics tool;

 

An A/A test is generally run before the A/B test, or an A/A/B test is run at the same time as the A/B test, to see whether the two identical A groups show a statistically significant difference and so judge whether the bucketing rule is sound. Some analysts argue that it is better to skip this and simply make the control bucket (bucket A) twice as large as the experimental bucket (bucket B) (so-called pooling).

 

2, statistical tests;
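What a healthy A/A test looks like can be simulated. In the sketch below both buckets share the same true conversion rate, so any "significant" result is a false positive; at a 5% significance level roughly 5% of runs should be flagged. The rate, bucket size, number of runs and function name are assumptions made for the example, and a two-proportion z test stands in for whatever test the real tool applies.

```python
import math
import random
from statistics import NormalDist


def aa_test_p_value(n_per_bucket, rate, rng):
    """Simulate one A/A test on a conversion metric and return the two-sided p value."""
    conv_a = sum(rng.random() < rate for _ in range(n_per_bucket))
    conv_b = sum(rng.random() < rate for _ in range(n_per_bucket))
    pooled = (conv_a + conv_b) / (2 * n_per_bucket)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_bucket)
    if se == 0:
        return 1.0  # no conversions or all conversions: no evidence of a difference
    z = ((conv_b - conv_a) / n_per_bucket) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(2024)
p_values = [aa_test_p_value(500, 0.1, rng) for _ in range(1000)]
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print("share of A/A runs flagged as significant:", false_positive_rate)
```

If a real A/A test flags differences far more often than the chosen significance level, the bucketing or the measurement pipeline deserves a closer look.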

 

 

Fifth, solutions for unbalanced buckets

 

1, evolve the experiment from an A/A/B setup into an A : B = 2 : 1 traffic allocation;

 

2, make the comparison while gradually ramping up the traffic;
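Whether the observed traffic actually matches an intended A : B = 2 : 1 allocation can be checked with a chi-square goodness-of-fit test. The sketch below is one possible check, not something prescribed by the original notes; the function name and the counts are hypothetical. For one degree of freedom the p value can be computed with `math.erfc`, so no third-party library is needed.

```python
import math


def allocation_check(count_a, count_b, ratio_a=2, ratio_b=1):
    """Chi-square goodness-of-fit test (1 degree of freedom) against an intended A : B ratio."""
    total = count_a + count_b
    expected_a = total * ratio_a / (ratio_a + ratio_b)
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(allocation_check(20050, 9950))  # close to 2 : 1
print(allocation_check(21000, 9000))  # clearly off the intended ratio
```

A very small p value suggests the buckets are not receiving traffic in the intended proportion.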

 

 

Sixth, applications of A/B tests

 

1, selecting the best option;

2, testing the system;

3, causal inference;

 

Pros, cons and usage scenarios of A/B tests versus offline evaluation:

1, building and maintaining an A/B test system has a cost and comes with technical requirements; if the system is built poorly, using it does more harm than good. An A/B system makes it more convenient to measure the effect of an algorithm or product optimization, but for start-ups short of staff it is not really necessary;

2, the most important thing in offline evaluation is to simulate the real online scenario; if the simulation is poor, the offline results are not credible either;

  However, while the company's A/B test system is not yet well built, offline evaluation is still necessary: at least some obvious algorithm problems can be spotted through offline testing, model selection and tuning also require offline testing, and offline tests do not affect the online service, whereas A/B experiments do;

3, when the product is in a fiercely competitive market and a project has to launch to seize an opportunity, whether to launch is often a strategic decision rather than the outcome of an A/B experiment. An A/B test needs an observation period and relatively stable external conditions to reach objective conclusions, so it suits products in a relatively stable phase of development, where it uses data to guard against wrong decisions;

  Therefore, real-time data analysis is necessary, but the need for real-time A/B testing is not very strong;

4, most A/B test systems cannot keep observing after the decision to roll a change out to all traffic. Some features or algorithms that serve the long-term strategic goals of the company or product may cause short-term metrics to fall or not rise noticeably, yet they must still go online;

5, an A/B test helps you get more revenue from existing traffic, raise ROI on existing traffic, or increase activity within the existing user base; for measuring user growth or acquiring new traffic, it is of little help.

6, a drawback of A/B tests is that they can only compare effects on a small scale, for example the effect of different algorithms in the same scenario. They cannot tell us whether the recommendation algorithm of business A is better than that of business B; in other words, they cannot measure how well a model transfers and generalizes;

 

 

How A/B tests relate to algorithms and data analysis:

 1, when the model is a deep learning algorithm, the A/B test measures the end-to-end effect; the model is then interpreted through statistical analysis or ML methods, or the features are analyzed before modeling.

 

 

Other:

 

1, solution to the uneven traffic distribution problem in the Darwin verification system: set up an A/A test

 

2, the control group should match the baseline being optimized, and the experimental design should match the conclusion we need to verify;

 

3, does the model actually need fine-tuning? (Weigh tuning the chosen model against trying a new algorithm: check whether the goal is precision, recall, or a balance of the two. Business needs aside, a good model is one whose precision and recall are close to each other and both high; if there is a specific business need, choose a model whose characteristics fit that need.)

 

4, for online model training, is it better to train on a sample or on the full data set? (Optimizing how the training samples are drawn; requires experimental verification.)

 

5, optimize features comprehensively: increase feature diversity and improve how features are processed and represented;
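For point 3 above, precision, recall and their harmonic mean (F1) can be computed directly from the counts of true positives, false positives and false negatives. The sketch below uses invented counts for two hypothetical models; it only illustrates why a model with precision and recall that are both high and close together scores better on a balanced measure.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical models: X is balanced, Y trades recall for precision.
print("model X:", precision_recall_f1(80, 20, 20))
print("model Y:", precision_recall_f1(45, 5, 55))
```

When the business cares mainly about one of the two measures, that measure rather than F1 would be the one to compare.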

 

 Data analysis -> basic feature processing -> model design -> tooling platform (feature engineering, model training and prediction) -> experiment design and verification -> (feed the results back into any earlier step and run the steps again in order)

 


Origin blog.csdn.net/YYIverson/article/details/103845019