Does sample size affect the significance of hypothesis test results? An A/B Test Example

In today's lecture I heard this conclusion: if the sample size being tested is large, then the significance level α should be set a bit smaller.

 

Why? I couldn't work it out, so I went online to look for an answer. It turns out many people online are still tangled up in this question: if the sample being tested is large, will the hypothesis test very easily come out significant? Is that really true? Would that make a large sample a bad thing?

 

My reaction: ??? I had long felt that this argument doesn't hold, but I had never examined the question carefully. This time I collected answers on StackExchange and Zhihu, and found that many teachers, and even many editions of textbooks, get it wrong, so it is worth clarifying and writing it down here.

 

First, the reasons why some people believe that a large sample makes hypothesis test results more likely to come out significant are as follows:

 

Coin flipping: the more tosses there are, the less likely the same value of the test statistic is to appear by chance. (Image taken from: https://www.zhihu.com/question/53199900?sort=created )
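To make that intuition concrete, here is a minimal sketch of my own (the 60% heads figure is just an illustrative assumption, not from the image): the same observed proportion of heads gets a smaller and smaller p-value against a fair coin as the number of tosses grows.

```python
# Minimal sketch of the coin-flipping intuition: pretend we always observe 60% heads
# and test against a fair coin. The same observed proportion becomes less and less
# likely under the null as the number of tosses grows, so the p-value shrinks.
from scipy.stats import binomtest

for n in (10, 100, 1000, 10000):
    heads = int(0.6 * n)                   # hypothetical observation: 60% heads
    p = binomtest(heads, n, p=0.5).pvalue  # two-sided exact binomial test
    print(f"n={n:>6}  heads={heads:>5}  p-value={p:.4g}")
```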

 

 

Opponents say: this actually shows the benefit of a large sample. If the sample size is small, the result of a hypothesis test may well be due to chance. The larger the sample, the more confident we can be that the test result is accurate.

 

Another reason goes like this: take the t test as an example. According to the formula t = (x̄ − μ0) / (s / √n), the larger the sample size n, the smaller the standard error s/√n, so the larger the t value and the smaller the resulting p-value. Doesn't that show that the larger the sample size, the more likely the result is to be significant?
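To see the arithmetic behind this argument, here is a small sketch with made-up numbers (the observed mean, null mean, and standard deviation below are my own illustrative choices): holding the observed difference and s fixed, t grows like √n and the p-value shrinks.

```python
# Sketch of the "larger n -> larger t -> smaller p" reasoning, with the observed
# difference and sample standard deviation held artificially fixed while n grows.
import numpy as np
from scipy import stats

xbar, mu0, s = 10.3, 10.0, 2.0             # illustrative observed mean, null mean, sample std
for n in (20, 200, 2000, 20000):
    t = (xbar - mu0) / (s / np.sqrt(n))    # t = (x̄ - μ0) / (s / √n)
    p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value
    print(f"n={n:>6}  t={t:7.2f}  p={p:.3g}")
```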

 

Opponents say: that would be correct if the effect size stayed the same. But with the other parts (α, 1−β) held constant, the larger n is, the smaller the effect size, so the t value does not actually become larger.

 

The opponents do concede that with a large sample we will detect small differences that sometimes have no practical significance. In other words, even if the result of a hypothesis test is statistically significant, the effect size behind it may be so small that the result means little in practice. Take the example in "A/B Test Example": the conversion rate goes from 30% to 33%. That lift is the smallest difference we want the hypothesis test to be able to detect, and from it we can compute the effect size. From the sample size calculation we can see that, with the other parts (α, 1−β) unchanged, the smaller the effect size, the larger the sample we need. In other words, the larger the sample, the more sensitive the hypothesis test, and the easier it is to detect small differences. This is not to say that we should avoid large samples, but that our interpretation of the test result depends on the effect size and the sensitivity. If the effect size is small and the sensitivity is high, the result may well be statistically significant yet have no practical meaning.
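As a concrete version of that calculation, here is a sketch using statsmodels (the 30% → 33% lift is the one from the "A/B Test Example"; α = 0.05 and power = 0.80 are common defaults I am assuming here, not values stated above): the smaller the minimum detectable difference, the smaller the effect size and the larger the required sample per group.

```python
# Sketch of the sample-size calculation: effect size from a 30% -> 33% conversion lift,
# then solve for the per-group sample size at assumed alpha = 0.05 and power = 0.80.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

alpha, power = 0.05, 0.80                       # assumed values, not from the original example
h = proportion_effectsize(0.33, 0.30)           # Cohen's h for the 30% -> 33% lift
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=alpha, power=power,
                                           alternative="two-sided")
print(f"effect size h = {h:.4f}, n per group ≈ {n_per_group:.0f}")

# A smaller minimum detectable lift (30% -> 31%) means a smaller effect size,
# and therefore a much larger required sample.
h_small = proportion_effectsize(0.31, 0.30)
n_small = NormalIndPower().solve_power(effect_size=h_small, alpha=alpha, power=power)
print(f"effect size h = {h_small:.4f}, n per group ≈ {n_small:.0f}")
```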

 

So why does everyone keep arguing about this, with neither side able to convince the other? I think it is because they never make their premises explicit, so they are not even talking on the same wavelength.

 

If we keep the effect size fixed, that is, we pin down the smallest difference we want to detect, and we also fix the power we want to achieve, then with a larger sample the value of the test statistic is indeed more likely to come out significant. In that case we should set α a bit smaller, which lets us keep both the Type I and the Type II error rates well under control.
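As a rough sketch of this point (the per-group sample sizes and the alternative α = 0.005 below are my own illustrative picks), we can hold the 30% → 33% effect size fixed and compute power at different n and α: at large n, power is so high at α = 0.05 that we can afford a much smaller α and still keep the Type II error rate low.

```python
# Sketch: with the minimum detectable lift fixed, power rises quickly with n, so at
# large n a much smaller alpha still leaves the Type II error rate acceptably low.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.33, 0.30)   # same fixed effect size as above
analysis = NormalIndPower()

for n in (2000, 5000, 20000):           # per-group sample sizes, chosen for illustration
    for alpha in (0.05, 0.005):
        pw = analysis.power(effect_size=h, nobs1=n, alpha=alpha)
        print(f"n per group={n:>6}  alpha={alpha:<6}  power={pw:.3f}")
```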

 

Coming back to the conclusion mentioned at the start: I asked the professor, and the answer was that there is nothing wrong with the design of hypothesis testing itself, but people often use it incorrectly. It is not that a large sample is bad; rather, when the sample is large we should set the significance level α a bit smaller, instead of mechanically sticking to α = 0.05.

 


Origin: www.cnblogs.com/HuZihu/p/12228418.html