Software engineering APR paper: A correlation study between automated program repair and test-suite metrics [EMSE 2018]

Copyright notice: if you repost or cite this post, please credit the source. https://blog.csdn.net/weixin_39278265/article/details/82495153

Preface

This post introduces a software engineering paper on automated program repair (APR): A correlation study between automated program repair and test-suite metrics [EMSE 2018]

1 Author information

Jooyong Yi · Shin Hwei Tan · Sergey Mechtaev · Marcel Böhme · Abhik Roychoudhury

Yi J, Tan S H, Mechtaev S, et al. A correlation study between automated program repair and test-suite metrics[J]. Empirical Software Engineering, 2018, 23(5): 2948-2979.

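2 What does the abstract say?

1) These accomplished authors really know how to write: every time, they find a fresh way to say how hot APR is, or how painful debugging is.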

Automated program repair is increasingly gaining traction, due to its potential to
reduce debugging cost greatly.

traction
UK [ˈtrækʃn] US [ˈtrækʃən]
n. pulling power; hauling; adhesive friction (here, figuratively: gaining momentum or acceptance)

2) The current trend:
The feasibility of automated program repair has been shown
in a number of works, and the research focus is gradually shifting toward the quality of
generated patches.

The focus really is shifting toward the quality of generated patches.

3) The direction the authors point to

One promising direction is to control the quality of generated patches
by controlling the quality of test-suites used for automated program repair.

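**Improve patch quality by controlling the quality of the test suites?**

**I don't quite follow this yet, since test suites are inherently incomplete, but I should understand it once I read further.**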

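2 A surprise: this is the authors' idea, and nobody had done it before (it feels hard to come up with but not hard to carry out, yet no one had tried it; as soon as I saw the idea I expected it to yield improvements)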


In this paper, we ask the following research question: “Can traditional test-suite metrics proposed for the purpose of software testing also be used for the purpose of automated program repair?”

Bringing traditional test-suite metrics into APR is a really impressive idea.

The paper elaborates on this later:
This problem of automated program repair is akin to the problem of
software testing; even if all available tests pass the software under test, there is generally no
guarantee that no other new tests will fail the software under test. Despite this limitation, it
is possible to improve software quality by improving the quality of a test-suite. Likewise, is
it possible to control the quality of automatically generated repair by controlling the quality
of a test-suite? This is our key high-level research question we aim to answer in this paper.
Apart from this main research question, we also investigate how test-suite metrics affect
repairability (repair success rate) and repair time.

3 "largest-scale experiments" caught my attention; I would think that an experiment that does not reach the thousands could hardly be called the largest.

We conduct the largest-scale experiments of
this kind to date with real-world software, and for the first time perform a correlation study
between various test-suite metrics and the reliability of generated repairs

4 More on this idea

I noticed this sentence: Our results show
that in general, with the increase of traditional test suite metrics, the reliability of repairs
tend to increase.


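I think this work is very meaningful. It strikes me as foundational research, not necessarily a technical innovation, but it genuinely answers a question that everyone in the APR field has been puzzled by, or at least curious about.
So, hats off to the authors.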

5 Introduction

1) Researchers have experimentally shown that automated program repair is possible for real-world large-scale software such as the PHP interpreter and Heartbleed-containing OpenSSL (Le Goues et al, 2012a; Nguyen et al, 2013; Le Goues et al, 2012b; Weimer et al, 2013; Nguyen et al, 2013; Kim et al, 2013; White et al, 2011; Dallmeier et al, 2009; Xuan et al, 2017; Assiri and Bieman, 2014; Pei et al, 2014; Debroy and Wong, 2010; Samimi et al, 2012; Qi et al, 2014, 2013; Mechtaev et al, 2016).

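That is a lot of repair tools. Still, this sentence is very well written: it uses Heartbleed-containing OpenSSL and the PHP interpreter to show how far APR has come, so a reader immediately feels that APR has made real progress.

2) The shift in direction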
Currently, the research focus is gradually shifting from the feasibility of automated
program repair to the quality of generated patches (Assiri and Bieman, 2014; Smith et al,
2015; Qi et al, 2015; Long and Rinard, 2016a)

Assiri FY, Bieman JM (2014) An assessment of the quality of automated program operator
repair. In: Proceedings of the 2014 IEEE Seventh International Conference on Software
Testing, Verification and Validation, ICST ’14, pp 273–282

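I have not read this paper. Clearly I still read too little; it is worth a look.


I should also pay more attention to ICST papers from now on, since they seem quite relevant to automated program repair (and there are quite a few of them).

6 The definition of a correct patch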

In particular, these latest research results
raise a question about how to generate a “correct” patch—a patch that not only passes all
tests available to a repair system, but also indeed fixes the bug.

A correct patch both passes all the tests and actually fixes the bug.

indeed
ADV indeed; certainly
You use indeed to confirm or agree with something that has just been said.
ADV (adding emphasis to a further point) in fact; actually; to be precise
You use indeed to introduce a further comment or statement which strengthens the point you have already made.
ADV (at the end of a clause, modifying "very" or stressing a word) really; truly
You use indeed at the end of a clause to give extra force to the word ‘very’, or to emphasize a particular word.

7 The role test suites play in automated repair, and the problems and limitations that come with it

Most of the automated program repair approaches use a test-suite as a proxy of software specification, since formal specification is hardly used in the industry.
The test suite serves as a proxy for the software specification.

While the fact that software tests are widely
available is advantageous, the fact that a test-suite is an incomplete specification can make
a generated repair incomplete; there is generally no guarantee that no other new tests will
fail a generated repair.
But it is an incomplete specification.
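
To make the "incomplete specification" point concrete for myself, here is a minimal toy sketch. It is not from the paper: the absolute-value function, the two patches, and both test lists are hypothetical names and values I made up. It shows a patch that passes every test available to a repair tool and still fails a test written later.

```python
# Toy illustration (not from the paper): a test suite as an incomplete specification.

def overfitted_patch(x):
    # Hypothetical "repair" that special-cases the single failing input.
    if x == -3:
        return 3
    return x

def intended_fix(x):
    # What a developer would write for an absolute-value function.
    return -x if x < 0 else x

# Each test is an (input, expected_output) pair. All values are made up.
available_tests = [(0, 0), (5, 5), (-3, 3)]   # visible to the repair tool
new_tests = [(-7, 7), (2, 2)]                 # written after the repair

def passes_all(func, tests):
    """Return True if func produces the expected output on every test."""
    return all(func(arg) == expected for arg, expected in tests)

print(passes_all(overfitted_patch, available_tests))  # True
print(passes_all(overfitted_patch, new_tests))        # False
print(passes_all(intended_fix, available_tests + new_tests))  # True
```

Both patches look identical from the point of view of `available_tests`; only the later tests tell them apart.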

akin
UK [əˈkɪn] US [əˈkɪn]
adj. similar; of the same kind; of the same origin; closely related
is akin to: is similar to

8 Patch reliability: how the paper explains patch reliability and regressions

With regard to the quality of automatically generated repairs, we focus on the reliability
of a generated repair, that is, whether regressions occur in a repair. Judging whether a repair
is correct is often subjective and difficult to be automated in the absence of formal specifications. Previous studies investigate the reliability of repairs instead, because whether a
generated repair causes regressions can be checked in an automated way (Assiri and Bieman,
2014; Ke et al, 2015; Kong et al, 2015; Smith et al, 2015). That is, once a repair is generated, this repair can be tested with a test-universe (held-out test-suite) that contains tests that
were not available at the time of generating the repair. If a failing test is found in the test-universe, it is considered that the repair causes regressions. As in previous studies, we also
similarly investigate how often regressions occur to measure the quality of a repair.


So what the authors study is not correctness but reliability.
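
The held-out check described in the quote can be sketched roughly as follows. This is my own simplification, not the paper's tooling: repairs are modeled as plain Python functions, tests as (input, expected) pairs, and `share_of_regressing_repairs` is a hypothetical helper that counts the fraction of repairs failing at least one held-out test. The paper's exact definition of regression ratio may differ from this.

```python
from typing import Callable, List, Sequence, Tuple

Test = Tuple[int, int]          # (input, expected output)
Repair = Callable[[int], int]   # a repaired function under test

def causes_regression(repair: Repair, test_universe: Sequence[Test]) -> bool:
    """Flag a repair if at least one held-out test fails on it."""
    return any(repair(arg) != expected for arg, expected in test_universe)

def share_of_regressing_repairs(repairs: Sequence[Repair],
                                test_universe: Sequence[Test]) -> float:
    """Fraction of repairs that fail at least one held-out test (my assumption)."""
    if not repairs:
        raise ValueError("need at least one repair")
    flagged = sum(causes_regression(r, test_universe) for r in repairs)
    return flagged / len(repairs)

# Hypothetical repairs of an absolute-value function; none come from the paper.
repairs: List[Repair] = [
    lambda x: -x if x < 0 else x,   # behaves correctly
    lambda x: 3 if x == -3 else x,  # overfits one input
    lambda x: abs(x),               # behaves correctly
    lambda x: x * x,                # wrong for most inputs
]

# Held-out tests that were not available when the repairs were generated.
test_universe: List[Test] = [(-7, 7), (2, 2), (0, 0), (-3, 3)]

print(share_of_regressing_repairs(repairs, test_universe))  # 0.5
```

The attraction of this kind of check is that it needs no human judgment: it only runs tests, which is why reliability can be measured automatically while correctness cannot.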

9 Brilliant authors, superbly written: the tools they use and how they explain them

we obtain automatically generated repairs by running GENPROG (Le Goues et al,
2012b; Weimer et al, 2013). In total, we collected 3818 repairs from 142 buggy versions of
10 different programs of various sizes (173–1046K LoC), using 14600 randomly sampled
test suites. We sample test suites from the whole test cases available in our subjects. While
we retrieve the main results from GENPROG-generated repairs, we also conduct smaller
scale experiments with another repair tool SEMFIX (Nguyen et al, 2013) to see whether our
main results extend beyond GENPROG. GENPROG and SEMFIX are first search-based and
constraint-based repair tools, respectively. Search-based repair tools navigate a set of repair
candidates through a search algorithm until a repair is found, while constraint-based repair
tools first construct repair constraints that should be satisfied by a repair and symbolically
search for a repair satisfying the repair constraint using a theorem prover. While our experiments may not generalize to all other repair tools, GENPROG, the repair system we mainly
use in our study, has been used in many previous studies on automated program repair (Smith
et al, 2015; Kong et al, 2015; Qi et al, 2015; Le Goues et al, 2012a,b; Weimer et al, 2013;
Le Goues et al, 2013). Our experimental results obtained from GENPROG complement the
results from earlier studies.

The explanation here is really good and worth learning from.

10 Experimental results
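
To see what "navigate a set of repair candidates until a repair is found" means at its simplest, here is a toy generate-and-validate loop. It is only plain in-order enumeration that I wrote for illustration; it is not GENPROG's actual algorithm, and the candidate patches, test suites, and the name `first_plausible_patch` are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Test = Tuple[Tuple[int, int], int]                 # ((a, b), expected)
Candidate = Tuple[str, Callable[[int, int], int]]  # (description, patched function)

def first_plausible_patch(candidates: Sequence[Candidate],
                          tests: Sequence[Test]) -> Optional[str]:
    """Try candidates in order and return the first one that passes every test."""
    for description, func in candidates:
        if all(func(*args) == expected for args, expected in tests):
            return description
    return None  # no candidate in the search space passes the suite

# Hypothetical candidate patches for a buggy max(a, b).
candidates: List[Candidate] = [
    ("return a", lambda a, b: a),
    ("return b", lambda a, b: b),
    ("return a if a > b else b", lambda a, b: a if a > b else b),
]

weak_suite: List[Test] = [((1, 2), 2)]
stronger_suite: List[Test] = [((1, 2), 2), ((3, 1), 3)]

print(first_plausible_patch(candidates, weak_suite))      # return b
print(first_plausible_patch(candidates, stronger_suite))  # return a if a > b else b
```

With the weak suite the loop stops at a patch that merely happens to pass; adding one more test is enough to steer it to the intended fix, which is the link between test-suite quality and repair quality that the paper studies.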

1) Our results show that in general, the traditional metrics of test-suites, that is, statement
coverage, branch coverage, test-suite size, and mutation score, are negatively correlated with
the likelihood that a generated repair causes a regression. In other words, as the traditional
metrics of a test-suite increase, generated repair tend to cause regressions less often.

In short, a negative correlation.

2)Our result implies that the traditional test suite metrics proposed for software testing can also be used for automated program repair.

3)Among the test-suite metrics we investigate, statement coverage is shown to be most strongly correlated with regression ratio. A practical implication is that to reduce regression ratio, increasing statement coverage is likely to be more effective than improving the other test-suite metrics such as branch coverage.

4)However, it should be noted that the highest correlation of statement coverage does not necessarily imply that a statement coverage-adequate test-suite is better than a branch coverage-adequate test-suite.
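
For readers who want to see what "negatively correlated" looks like numerically, here is a small sketch. I use the plain Pearson coefficient only because it is easy to write by hand; the statistical method the paper actually uses may differ. The numbers are made up for illustration and are NOT results from the paper.

```python
import math
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Made-up values, one pair per hypothetical test suite (NOT data from the paper).
statement_coverage = [0.2, 0.4, 0.6, 0.8]
regression_share = [0.7, 0.5, 0.4, 0.1]

r = pearson(statement_coverage, regression_share)
print(r < 0)  # True: higher coverage goes with fewer regressions in this toy data
```

A coefficient near -1 would mean the two quantities move in opposite directions almost linearly; it still says nothing about causation, which matches the caution in point 4).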

11 Unusually, only two contributions are listed (papers usually list three or more, which is curious)

– We for the first time conduct a correlation study of automated program repair with various test-suite metrics such as statement coverage, branch coverage, test-suite size, and
mutation score. According to our study, traditional test-suite metrics proposed for software testing are negatively correlated with the likelihood that a generated repair causes
regressions. Therefore, improving a test-suite based on traditional test-suite metrics is
beneficial both for software testing and automated program repair. Among test-suite
metrics we investigate, statement coverage is shown to be most strongly correlated.

– We conduct the largest experiments to date about the correlation between test-suite quality and the performance of automated program repair (in particular, the reliability of
repairs). Our subject programs contain four large-scale real-world programs. Our experimental results provide strong empirical evidences that repair quality problem is indeed quite severe (the average regression ratio of 3818 repairs we obtained from
GENPROG is 40%), and traditional test suite metrics can be used to control the quality
of automatically generated repairs.
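
Of the four metrics named above, mutation score is the one I find least obvious, so here is a minimal sketch of the usual definition (killed mutants divided by total mutants, ignoring equivalent mutants). The function under test, the hand-written mutants, and the two test suites are hypothetical examples of mine and have nothing to do with the paper's subjects or tooling.

```python
from typing import Callable, List, Sequence, Tuple

Test = Tuple[int, int]  # (input, expected output)

def original(x: int) -> int:
    # Program under test: absolute value.
    return -x if x < 0 else x

# Hand-written mutants, each a small syntactic change to original().
mutants: List[Callable[[int], int]] = [
    lambda x: -x if x > 0 else x,  # condition flipped
    lambda x: x,                   # negation removed
    lambda x: -x,                  # branch removed
]

def passes_all(func: Callable[[int], int], tests: Sequence[Test]) -> bool:
    return all(func(arg) == expected for arg, expected in tests)

def mutation_score(tests: Sequence[Test]) -> float:
    """Share of mutants on which at least one test fails (a 'killed' mutant)."""
    if not passes_all(original, tests):
        raise ValueError("the suite must pass on the original program")
    killed = sum(not passes_all(m, tests) for m in mutants)
    return killed / len(mutants)

small_suite: List[Test] = [(5, 5)]
larger_suite: List[Test] = [(5, 5), (-3, 3)]

print(mutation_score(small_suite) < mutation_score(larger_suite))  # True
```

The small suite never exercises a negative input, so the "negation removed" mutant survives; adding one negative test kills it and raises the score.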

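12 A hasty wrap-up

The rest of the paper is fairly long, and I think knowing this much is enough for now. In my current state and mood, I can't keep going.

So I'll set it aside for now and read the rest another day.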

