Notes on a software engineering automated repair paper: A novel co-evolutionary approach to automatic software bug fixing [from CEC 2008]

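Preface

At first I did not plan to read this paper, because it dates from 2008 rather than the last few years. But it is related to genetic algorithms (the family of algorithms GenProg uses), and since I happened to come across it, I decided to read it anyway.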

Basic information

Authors: Andrea Arcuri and Xin Yao

Citation: Arcuri, A., & Yao, X. (2008, June). A novel co-evolutionary approach to automatic software bug fixing. In Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on (pp. 162-168). IEEE.

I often come across papers by Andrea Arcuri; he strikes me as a remarkably strong researcher.

1 What the abstract says

1) Background
Many tasks in Software Engineering are very expensive, and that has led the investigation to how to automate them.

In particular, Software Testing can take up to half of the resources of the development of new software.

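In other words, software engineering is expensive, and software testing (which I assume includes debugging activity) is especially expensive.

2) Although there has been a lot of work on automating the testing phase, fixing a bug after its presence has been discovered is still a duty of the programmers.

This is well put. The phrase "after its presence has been discovered" also suggests that automated fault localization is indeed somewhat more mature than automated program repair.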

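2 I did not expect claims to be written this way back then (APR effectiveness and repair capability); the paper looks worth reading after all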

The user needs only to provide a buggy program and a formal specification of it.

No other information is required.

Hence, the approach may work for any implementable software.

We show some preliminary experiments in which bugs in an implementation of a sorting algorithm are automatically fixed.

3 What the authors set out to do (main work)

In this paper we propose an
evolutionary approach to automate the task of fixing bugs. This
novel evolutionary approach is based on Co-evolution, in which
programs and test cases co-evolve, influencing each other with
the aim of fixing the bugs of the programs. This competitive
co-evolution is similar to what happens in nature for predators
and prey.

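I do not think I had seen repair through the co-evolution of programs and test cases before: "influencing each other with the aim of fixing the bugs of the programs."

Very interesting; worth following further.

4 What the introduction says

1) Again, software testing is expensive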

It is estimated that testing requires around 50% of
the total cost of software development [2]. This cost is paid
because software testing is very important. Releasing bug-ridden and non-functional software is indeed an easy way to
lose customers.

Vocabulary note: "ridden" [ˈrɪdn] (adjective, usually in compounds) means full of a particular unpleasant thing, e.g. "a disease-ridden slum". It is also the past participle of "ride".

2) Prior work (surprisingly, it is covered inside the introduction; that might count as a limitation of the paper.)

Hence, there has been effort in developing Automated Debugging techniques (e.g., [4], [5] and [6]) to help the programmers to locate the bugs.
However, to our best knowledge, only little work has been done on the actual automation of repairing software (e.g., [7], [8] and [9]),
and it is able to address only specific types of bugs.

For example, in [8] only expressions and left-hand side of assignments might be repaired.

5 Where the authors' idea comes from

We name this task with the expression Automatic Bug Fixing (ABF). We gave a first idea of ABF in [10], and in this paper we show preliminary experiments to confirm its validity and to see how its performance can be increased.

[10] A. Arcuri, “On the automation of fixing software bugs,” to appear in the Doctoral Symposium of the IEEE International Conference on Software Engineering (ICSE), 2008.

Impressive: [10] was written by the same author, so this paper is effectively extended work.

There is a second source:

The system presented in this paper is based on our previous work on Automatic Programming (AP) [11]. The idea of that paper is to use Genetic Programming (GP) to evolve programs that satisfy a formal specification.

[11] A. Arcuri and X. Yao, “Coevolving programs and unit tests from their specification,” in IEEE International Conference on Automated Software Engineering (ASE), 2007, pp. 397–400.

Remarkable, and worth learning from.

Some of the techniques used (I did not follow them at first):
The training set is composed by Unit Tests [12]. Because we use the formal specification for generating an oracle, we can yield as many unit tests as we want. However, we need to carefully choose a relatively small set of unit tests, because using all of them is infeasible (i.e., the computational cost of evaluating an evolutionary program would be too high).

Hence, we use GP for evolving programs for passing the current set of unit tests, but at the same time we use Search Based Software Testing techniques [13] to yield new unit tests for finding bugs in the evolutionary programs.

That generates a co-evolution [14], that hopefully will lead to an arms race that will bring to the evolution of a program that satisfies the given formal specification.

Now I follow it. My remaining question is: what exactly is this formal specification, and why can it produce an oracle and "as many unit tests as we want"?
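
A minimal sketch of one possible answer, under my own assumptions rather than the paper's actual setup: if the formal specification of sorting is written as an executable postcondition (the output is ordered and is a permutation of the input), it can judge any input/output pair, so every generated input becomes a unit test with a built-in oracle. The names `satisfies_sort_spec`, `buggy_sort`, and `failing_inputs` are hypothetical, and plain random generation stands in here for the search-based testing the paper uses.

```python
import random
from collections import Counter


def satisfies_sort_spec(inp, out):
    """Executable postcondition: out is ordered and is a permutation of inp."""
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    permutation = Counter(inp) == Counter(out)
    return ordered and permutation


def buggy_sort(values):
    """Hypothetical buggy bubble sort: the inner loop stops one element early."""
    data = list(values)
    n = len(data)
    for i in range(n):
        for j in range(n - 2 - i):  # bug: should be n - 1 - i
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return data


def failing_inputs(program, n_tests, seed=0):
    """Generate n_tests random inputs and return those on which program violates the spec."""
    rng = random.Random(seed)
    inputs = [
        [rng.randint(0, 9) for _ in range(rng.randint(0, 6))]
        for _ in range(n_tests)
    ]
    # No expected outputs are stored anywhere: the specification is the oracle.
    return [inp for inp in inputs if not satisfies_sort_spec(inp, program(inp))]
```

Calling `failing_inputs(buggy_sort, 100)` returns only inputs that expose the bug, while `failing_inputs(sorted, 100)` returns an empty list. Because the number of possible inputs is unbounded, so is the number of unit tests, which is why a small subset has to be chosen.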

[14] W. D. Hillis, “Co-evolving parasites improve simulated evolution as an optimization procedure,” Physica D, vol. 42, no. 1-3, pp. 228–234, 1990.

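The way the authors arrived at their idea is well worth learning from.
It deserves more study and thought.

6 Most important: the authors' outlook on APR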

Because the programs implemented by software developers are usually close to being correct [15], we expect that
ABF would be a much easier task than AP. That legitimates
us to speculate that industrial applications might be possible
in a not far future. However, we claim that ABF is more
difficult than software testing, because for example one of
its components is software testing itself. Therefore, if we
consider the fact that, although the automation of software
testing has been heavily investigated since the 1970s (e.g.,
[16]), it has not been solved yet, hence we expect that ABF
will require at least the same amount of research effort.

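I think this is really worth remembering. The authors say that the automation of software testing has been investigated since the 1970s (more than 30 years at the time of writing) and is still not solved, so they expect ABF (Automatic Bug Fixing) to require at least the same amount of research effort.
Looking at it now, ten years on, APR is still advancing steadily, and many strong researchers are working on it.
I hope I can contribute as well.

7 The value of the paper (the authors' writing is excellent, truly expert)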

This paper gives to the Software Engineering community the important contribution of showing how the highly
expensive task of bug fixing might be automated. At the
same time, it brings to the Natural Computation community
a novel context in which well known research fields such
as co-evolution and GP are combined together. This novel
combination yields many research questions that require to
be investigated.

The paper makes an important contribution to the SE community because it shows "how the highly expensive task of bug fixing might be automated."

8 Paper structure: the paper is organised as follows.

Section II: the framework that we developed for automatically fixing bugs.
Section III: a case study to validate our novel approach.
Section IV: conclusion.

9 What Section II covers

1) Techniques used:

Genetic Programming (GP).
Distance functions derived from formal specifications.
Search Based Software Testing.
Co-evolution.

However, we are sceptical about this latter option,
because being structurally near to a global optimum does
not necessarily mean being in its basin of attraction. In other
words, a program tree that is close to a correct one might
have a very bad fitness value (this is particularly true for
bugs that are in areas of the code that are always executed
regardless of the input, like for example all the statements
before the first conditional branch in the execution flow).

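A variant that is structurally close to the correct program does not necessarily have a good fitness score (it may be very poor).

10 So the term overfitting already appeared in automated repair back in 2008, which is cool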

However, there are two main differences from the normal use of GP:
• the training set does not contain any noise.
• we are not looking for a program that on average performs well, but we want a program that always gives the expected results. Hence, a program does not need to worry about over-fitting the training set, it has to overfit it. In fact, even if only one test in the training set is failed, that means that the specification is not satisfied.
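
The difference between "performs well on average" and "always gives the expected results" can be sketched as follows. This is my own illustration, not code from the paper; `almost_abs`, `pass_rate`, and `is_repaired` are hypothetical names, and the built-in `abs` plays the role of the specification.

```python
def pass_rate(program, tests):
    """Average-case view: fraction of (input, expected) pairs the program gets right."""
    passed = sum(1 for inp, expected in tests if program(inp) == expected)
    return passed / len(tests)


def is_repaired(program, tests):
    """Repair view: a single failing test means the specification is not satisfied."""
    return all(program(inp) == expected for inp, expected in tests)


def almost_abs(x):
    """Hypothetical candidate: correct everywhere except for one input."""
    return 0 if x == -7 else abs(x)


# Noise-free training set: expected outputs come straight from the specification.
tests = [(x, abs(x)) for x in range(-50, 50)]
```

Here `pass_rate(almost_abs, tests)` is 99 out of 100, which would look excellent in an ordinary learning setting, yet `is_repaired(almost_abs, tests)` is `False`, so the candidate is still a buggy program.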

11 The authors were already considering symbolic execution back then, impressive

although we use search based software testing
techniques for sampling the unit tests, any other automated
software testing technique could be employed (e.g., symbolic
execution [16]).

[16] J. C. King, “Symbolic execution and program testing,” Communications of the ACM, pp. 385–394, 1976.

As we can see, these are all software testing techniques.

12 The authors' explanation of co-evolution, simple and easy to follow

This type of co-evolution is similar to what in nature
happens between predators and prey. For example, faster
prey escape predators more easily, and hence they have
higher probability of generating offspring. This influences
the predators, because they need to evolve as well to get
faster if they want to feed and survive. In our context, the
evolutionary programs can be considered as prey, whereas
the unit tests are predators. Hopefully, this co-evolution will
lead to an arms race that will lead to the evolution of a
program that satisfies the given formal specification.
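
The predator/prey loop described above can be sketched in a toy setting. This is a heavily simplified illustration under my own assumptions, not the paper's framework: a "program" is just a coefficient triple (a, b, c) for a*x*x + b*x + c instead of a GP syntax tree, `spec` stands in for the formal specification, and all function names are hypothetical. Programs are selected for being close to the specification on the current tests, and tests are selected for making many current programs fail.

```python
import random


def spec(x):
    """Stand-in for the formal specification: the intended result for input x."""
    return (x + 1) ** 2


def run(program, x):
    """Evaluate a toy 'program', i.e. a coefficient triple (a, b, c)."""
    a, b, c = program
    return a * x * x + b * x + c


def error(program, tests):
    """Program fitness (to minimise): total distance from the spec on the tests."""
    return sum(abs(run(program, x) - spec(x)) for x in tests)


def kills(x, programs):
    """Test fitness (to maximise): how many current programs fail on input x."""
    return sum(run(p, x) != spec(x) for p in programs)


def mutate(program, rng):
    """Change one randomly chosen coefficient by +1 or -1."""
    coeffs = list(program)
    coeffs[rng.randrange(3)] += rng.choice((-1, 1))
    return tuple(coeffs)


def coevolve(buggy, generations=40, pop_size=20, n_tests=5, seed=0):
    rng = random.Random(seed)
    programs = [buggy] * pop_size  # seed the population with the buggy program
    tests = [rng.randint(-10, 10) for _ in range(n_tests)]
    for _ in range(generations):
        # Predators: keep the inputs that expose the most failing programs.
        test_pool = tests + [rng.randint(-10, 10) for _ in range(n_tests)]
        tests = sorted(test_pool, key=lambda x: kills(x, programs), reverse=True)[:n_tests]
        # Prey: keep the programs closest to the spec on the surviving inputs.
        program_pool = programs + [mutate(p, rng) for p in programs]
        programs = sorted(program_pool, key=lambda p: error(p, tests))[:pop_size]
    return programs, tests
```

A call such as `programs, tests = coevolve((1, 2, 0))` starts from a triple that is one coefficient away from the correct (1, 2, 1). `programs[0]` is then the candidate closest to the specification on the final tests; whether it is actually correct still has to be checked against the full specification, since nothing in this loop guarantees convergence.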

Unfortunately, producing an arms race in co-evolutionary
algorithms is more difficult than it looks [23]. In fact,
problems such as mediocre stable states and loss of gradient
might arise.

It just seems hard to achieve in practice.

[23] S. G. Ficici and J. B. Pollack, “Challenges in coevolutionary learning:
Arms-race dynamics, open-endedness, and mediocre stable states,” in
Artificial Life VI, 1998, pp. 238–247.

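I keep feeling that GP is related to artificial intelligence; after all, this paper was published at a Computational Intelligence conference.

13 I think Section II-D explains GP-based bug fixing very thoroughly; I really learned something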

Because software developers do not create programs at
random [15], the buggy program would be structurally near
to the optimal solution. However, its fitness value (based on
eq.1) might be extremely poor. This happens because a single
bug might completely change the output of a program. The
problem is that GP is driven by the fitness values and not
by the structural distance from the optimal solution (that is
unknown). Therefore, in the case of that type of bug, it is very
likely that during the first generations all the genetic material
of the buggy program would be lost because replaced by
smaller and simpler programs. These smaller programs will
be far away from the optimum, but because they are smaller
and likely having a higher fitness, they will be preferred by
the GP evolution. Eventually, these smaller programs might
evolve up to the optimal solution [11], but they would do it
without exploiting any useful information given by the buggy
program.
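
The effect described above can be reproduced in a small sketch. The assumptions are mine: fitness is simply the fraction of passed tests (the paper's eq.1 is not reproduced in these notes), and `flipped_sort` and `identity` are hypothetical programs. A bubble sort with one flipped comparison is structurally one token away from correct but fails every test, while a much smaller program that does nothing at all scores higher.

```python
def flipped_sort(values):
    """Bubble sort with a single flipped comparison (hypothetical bug)."""
    data = list(values)
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            if data[j] < data[j + 1]:  # bug: should be >
                data[j], data[j + 1] = data[j + 1], data[j]
    return data


def identity(values):
    """A much smaller program, structurally far from any sorting algorithm."""
    return list(values)


def fitness(program, tests):
    """Assumed fitness: fraction of test inputs whose output equals the sorted input."""
    return sum(program(t) == sorted(t) for t in tests) / len(tests)


tests = [[1, 2, 3], [2, 1], [1, 2], [3, 1, 2], [4, 5, 6, 7], [2, 3, 1]]
```

On these tests `fitness(flipped_sort, tests)` is 0.0, because the output always comes out in descending order, whereas `fitness(identity, tests)` is 0.5, because half of the inputs happen to be sorted already. A purely fitness-driven selection would therefore prefer `identity` and discard the genetic material of the nearly correct program.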

14 future work

Evolving correct programs from scratch is currently a very
hard task [11]. However, we think that our framework for
bug fixing might be a very useful tool for the software
developers. The reason is that a bug that is difficult to be
fixed by a human might be, on the other hand, very easy
for our framework. That would be the case of bugs that
consist of only small differences from the correct code,
but which are located in parts of the source code that are
very complex for a human to analyse. However, to really
scale up to real-world software, it will be compulsory to
design techniques for narrowing down the search space of
the evolutionary operators. One possibility could be the
exploitation of automated debugging techniques.

15 Machine configuration

In our experiments on a 3.0 GHz machine, a typical run of
the framework for fixing a bug takes between two and four
minutes.

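16 To understand the concrete process by which genetic programming (GP) fixes a bug, read Section III closely: it includes figures and a very detailed description, and is well worth a look.

17 A brief summary

Thanks to this paper, I gained a better understanding of GP and of some outlooks on the future development of APR.

As for further reading, there is plenty to look forward to.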