Learning Combinatorial Interaction Test Generation Strategies using Hyperheuristic Search

Overview

The surge of search based software engineering research has been hampered by the need to develop customized search algorithms for different classes of the same problem. For instance, two decades of bespoke Combinatorial Interaction Testing (CIT) algorithm development, our exemplar problem, has left software engineers with a bewildering choice of CIT techniques, each specialized for a particular task. This paper proposes the use of a single hyperheuristic algorithm that learns search strategies across a broad range of problem instances, providing a single generalist approach. We have developed a Hyperheuristic algorithm for CIT, and report experiments that show that our algorithm competes with known best solutions across constrained and unconstrained problems: For all 26 real world subjects, it equals or outperforms the best result previously reported in the literature. We also present evidence that our algorithm’s strong generic performance results from its unsupervised learning. Hyperheuristic search is thus a promising way to relocate CIT design intelligence from human to machine.

Benchmarks

Syn-2: contains 14 different pairwise (2-way) synthetic models without constraints. These models are benchmarks that have been used to compare both mathematical constructions and search based techniques.

Subjects	Models	Constraints
S2-1	3⁴	N/A
S2-2	5¹3⁸2²	N/A
S2-3	3¹³	N/A
S2-4	4¹3³⁹2²⁵	N/A
S2-5	5¹4⁴3¹¹2⁵	N/A
S2-6	4¹⁵3¹⁷2²⁹	N/A
S2-7	6¹5¹4⁶3⁸2³	N/A
S2-8	7¹6¹5¹4⁵3⁸2³	N/A
S2-9	4¹⁰⁰	N/A
S2-10	6¹⁶	N/A
S2-11	7¹⁶	N/A
S2-12	8¹⁶	N/A
S2-13	8¹⁷	N/A
S2-14	10²⁰	N/A
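The Models column uses the exponential notation that is standard in the CIT literature: a term vᵏ stands for k parameters that each take v values, so 5¹3⁸2² (S2-2) describes one 5-valued, eight 3-valued and two 2-valued parameters. The short Python sketch below expands this notation and counts the t-way value combinations that a covering array of strength t has to cover. Its helper names are hypothetical and are not part of HHSA.

```python
import re
from itertools import combinations
from math import prod

SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")
# One term: an ASCII base (values per parameter) followed by a superscript
# exponent (number of parameters with that many values).
_TERM = f"([0-9]+)([{SUPERSCRIPT_DIGITS}]+)"


def parse_model(notation: str) -> list[int]:
    """Expand a model such as '5¹3⁸2²' into one value count per parameter."""
    if not re.fullmatch(f"(?:{_TERM})+", notation):
        raise ValueError(f"not a model in exponential notation: {notation!r}")
    levels: list[int] = []
    for base, exponent in re.findall(_TERM, notation):
        levels += [int(base)] * int(exponent.translate(_TO_ASCII))
    return levels


def count_t_way_interactions(levels: list[int], t: int) -> int:
    """Number of distinct t-way value combinations a strength-t array must cover."""
    return sum(prod(combo) for combo in combinations(levels, t))


print(count_t_way_interactions(parse_model("3⁴"), 2))      # 54
print(count_t_way_interactions(parse_model("5¹3⁸2²"), 2))  # 492
```

For S2-1 (3⁴) there are six parameter pairs with 3 × 3 = 9 value pairs each, hence 54 pairs to cover. Enumerating parameter subsets with `itertools.combinations` is fine for t = 2 or 3 on these models, but the number of subsets grows quickly with t.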

Syn-3: contains 15 different 3-way synthetic models without constraints. These benchmarks have likewise been used to evaluate both mathematical constructions and search based techniques.

Subjects	Models	Constraints
S3-1	3⁶	N/A
S3-2	4⁶	N/A
S3-3	3²4²5²	N/A
S3-4	5⁶	N/A
S3-5	5⁷	N/A
S3-6	6⁶	N/A
S3-7	6⁶4²2²	N/A
S3-8	10¹6²4³3¹	N/A
S3-9	8⁸	N/A
S3-10	7⁷	N/A
S3-11	9⁹	N/A
S3-12	10⁶	N/A
S3-13	10¹⁰	N/A
S3-14	12¹²	N/A
S3-15	14¹⁴	N/A

Syn-C2: contains 30 different 2-way synthetic models with constraints. These models were generated by Cohen et al. to simulate the configurations and constraints of real world programs, and were adopted in follow-up research by Garvin et al.

Subjects	Models	Constraints
C2-S1	2⁸⁶3³4¹5⁵6²	2²⁰3³4¹
C2-S2	2⁸⁶3³4³5¹6¹	2¹⁹3³
C2-S3	2²⁷4²	2⁹3¹
C2-S4	2⁵¹3⁴4²5¹	2¹⁵3²
C2-S5	2¹⁵⁵3⁷4³5⁵6⁴	2³²3⁶4¹
C2-S6	2⁷³4³6¹	2²⁶3⁴
C2-S7	2²⁹3¹	2¹³3²
C2-S8	2¹⁰⁹3²4²5³6³	2³²3⁴4¹
C2-S9	2⁵⁷3¹4¹5¹6¹	2³⁰3⁷
C2-S10	2¹³⁰3⁶4⁵5²6⁴	2⁴⁰3⁷
C2-S11	2⁸⁴3⁴4²5²6⁴	2²⁸3⁴
C2-S12	2¹³⁶3⁴4³5¹6³	2²³3⁴
C2-S13	2¹²⁴3⁴4¹5²6²	2²²3⁴
C2-S14	2⁸¹3⁵4³6³	2¹³3²
C2-S15	2⁵⁰3⁴4¹5²6¹	2²⁰3²
C2-S16	2⁸¹3³4²6¹	2³⁰3⁴
C2-S17	2¹²⁸3³4²5¹6³	2²⁵3⁴
C2-S18	2¹²⁷3²4²5¹6³	2²³3⁴4¹
C2-S19	2¹⁷²3⁹4⁹5³6⁴	2³⁸3⁵
C2-S20	2¹³⁸3⁴4⁵5⁴6⁷	2⁴²3⁶
C2-S21	2⁷⁶3³4²5¹6³	2⁴⁰3⁶
C2-S22	2⁷³3³4³	2³¹3⁴
C2-S23	2²⁵3¹6¹	2¹³3²
C2-S24	2¹¹⁰3²5³6⁴	2²⁵3⁴
C2-S25	2¹¹⁸3⁶4²5²6⁶	2²³3³4¹
C2-S26	2⁸⁷3¹4³5⁴	2²⁸3⁴
C2-S27	2⁵⁵3²4²5¹6²	2¹⁷3³
C2-S28	2¹⁶⁷3¹⁶4²5³6⁶	2³¹3⁶
C2-S29	2¹³⁴3⁷5³	2¹⁹3³
C2-S30	2⁷²3⁴4¹6²	2²⁰3²
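Following the convention of Cohen et al., the Constraints column reuses the same notation for constraints: the base gives the number of parameters a constraint involves and the exponent gives how many such constraints there are, so 2²⁰3³4¹ denotes 20 constraints over two parameters, 3 over three parameters and 1 over four parameters. With constraints, a test suite must contain only valid tests and needs to cover only the t-way tuples that can occur in some valid test. The sketch below illustrates that idea. The dictionary encoding of a constraint as a forbidden partial assignment is a hypothetical one chosen for illustration and is not the constraint file format of HHSA or CASA.

```python
from itertools import combinations, product

# Hypothetical representation chosen for illustration (NOT the HHSA or CASA format):
# a constraint is a forbidden partial assignment {parameter_index: value}.
Forbidden = dict[int, int]


def is_valid_test(test: tuple[int, ...], forbidden: list[Forbidden]) -> bool:
    """A full test is valid if it contains none of the forbidden assignments."""
    return not any(all(test[p] == v for p, v in f.items()) for f in forbidden)


def valid_t_tuples(levels: list[int], t: int, forbidden: list[Forbidden]) -> set:
    """Return the t-way tuples that appear in at least one valid full test.

    Only these tuples need to be covered. Enumerating every configuration also
    catches implicit constraints, but it only scales to toy models.
    """
    required = set()
    for test in product(*(range(v) for v in levels)):
        if is_valid_test(test, forbidden):
            for params in combinations(range(len(levels)), t):
                required.add(tuple((p, test[p]) for p in params))
    return required


# Toy model 2³ with one 2-parameter constraint: p0 and p1 may not both be 1.
forbid = [{0: 1, 1: 1}]
print(len(valid_t_tuples([2, 2, 2], 2, forbid)))  # 11 of the 12 unconstrained pairs

# Also forbidding (p0=1, p1=0) makes p0=1 impossible, which implicitly rules out
# pairs such as (p0=1, p2=0) even though no constraint mentions p2.
forbid_implicit = forbid + [{0: 1, 1: 0}]
print(((0, 1), (2, 0)) in valid_t_tuples([2, 2, 2], 2, forbid_implicit))  # False
```

The second example shows why enumerating whole configurations is convenient in a sketch: it catches implicit constraints that no single rule states. Real models such as GCC (2¹⁸⁹3¹⁰) are far too large to enumerate, so constrained CIT tools typically hand validity checks to a SAT or constraint solver.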

Real-1: contains real world models from a recent benchmark created by Segall et al. There are 20 CIT problems in this subject set, generated by or for IBM customers. The 20 problems cover a wide range of applications, including telecommunications, healthcare, storage and banking systems.

Subjects	Models	Constraints
Concurrency	2⁵	2⁴3¹5²
Storage1	2¹3¹4¹5¹	4⁹⁵
Banking1	3⁴4¹	5¹¹²
Storage2	3⁴6¹	-
CommProtocol	2¹⁰7¹	2¹⁰3¹⁰4¹²5⁹⁶
SystemMgmt	2⁵3⁴5¹	2¹³3⁴
Healthcare1	2⁶3²5¹6¹	2³3¹⁸
Telecom	2⁵3¹4²5¹6¹	2¹¹3¹4⁹
Banking2	2¹⁴4¹	2³
Healthcare2	2⁵3⁶4¹	2¹3⁶5¹⁸
NetworkMgmt	2²4¹5³10²11¹	2²⁰
Storage3	2⁹3¹5³6¹8¹	2³⁸3¹⁰
Proc.Comm1	2³3⁶4⁶	2¹³
Services	2³3⁴5²8²10²	3³⁸⁶4²
Insurance	2⁶3¹5¹6²11¹13¹17¹31¹	-
Storage4	2⁵3⁷4¹5²6²7¹9¹13¹	2²⁴
Healthcare3	2¹⁶3⁶4⁵5¹6¹	2³¹
Proc.Comm2	2³3¹²4⁸5²	1⁴2¹²¹
Storage5	2⁵3⁸5³6²8¹9¹10²11¹	2¹⁵¹
Healthcare4	2¹³3¹²4⁶5²6¹7¹	2²²

Real-2: contains 6 real world constrained subjects that have been widely studied in the literature. The TCAS model was first presented by Kuhn et al.; TCAS is a traffic collision avoidance system from the Siemens suite. The remaining models in this subject set were introduced by Cohen et al. SPIN-S and SPIN-V are the model simulation and model verification components of the SPIN model checker, GCC is the well known compiler system from the GNU Project, Apache is a web server application and Bugzilla is a web-based bug tracking system.

Subjects	Models	Constraints
TCAS	2⁷3²4¹10²	2³
Spin-S	2¹³4⁵	2¹³
Spin-V	2⁴²3²4¹¹	2⁴⁷3²
GCC	2¹⁸⁹3¹⁰	2³⁷3³
Apache	2¹⁵⁸3⁸4⁴5¹6¹	2³3¹4²5¹
Bugzilla	2⁴⁹3¹4²	2⁴3¹

Experiment Settings

All experiments but one are carried out on a desktop computer with a 6-core 3.2GHz Intel CPU and 8GB of memory. The exception is the study of the trade-off between the quality of the results and the cost of the hyperheuristic approach, for which we used the Amazon EC2 cloud. All experiments are repeated five times, and we report the best and the average results over these five runs.

Results

This section contains only the research questions and the results tables and figures; a more detailed explanation and discussion of these results can be found in our paper.

RQ1: What is the quality of the test suites generated using the hyperheuristic approach?

One of the primary goals of CIT is to find the smallest test suite (defined by a covering array) that achieves the desired strength of coverage. It is trivial to generate an arbitrarily large covering test suite: simply include one test case for each interaction to be covered. However, such a naïve approach to test generation would yield exponentially many test cases. All CIT approaches therefore tackle the problem of finding a minimal size covering array, that is, the smallest test suite that achieves 100% t-way interaction coverage for some chosen interaction strength t.
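To make the covering-array definition concrete, the following Python sketch measures the fraction of t-way interactions that a test suite covers; the function name is hypothetical. It uses S2-1 (3⁴) and a classical orthogonal-array construction based on arithmetic mod 3, a textbook mathematical construction rather than an HHSA result. Its nine tests give 100% pairwise coverage, and no smaller suite is possible, because any two 3-valued parameters already have 3 × 3 = 9 value pairs and each test covers only one of them.

```python
from itertools import combinations
from math import prod


def t_way_coverage(suite: list[tuple[int, ...]], levels: list[int], t: int) -> float:
    """Fraction of all t-way value combinations that appear in at least one test."""
    param_sets = list(combinations(range(len(levels)), t))
    covered = {(ps, tuple(test[p] for p in ps)) for test in suite for ps in param_sets}
    total = sum(prod(levels[p] for p in ps) for ps in param_sets)
    return len(covered) / total


levels = [3, 3, 3, 3]  # model S2-1 (3⁴): four parameters with three values each
# Orthogonal array: rows (i, j, i+j, i+2j) mod 3 cover every value pair exactly once.
suite = [(i, j, (i + j) % 3, (i + 2 * j) % 3) for i in range(3) for j in range(3)]
print(len(suite), t_way_coverage(suite, levels, 2))  # 9 1.0

# The naive one-test-per-interaction suite would need one test per value pair.
print(sum(prod(c) for c in combinations(levels, 2)))  # 54
```

Because each pair appears exactly once in this array, dropping any one of the nine rows loses coverage, whereas the naïve suite would need 54 tests.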

In the following tables, we include the best reported solution from the literature, followed by the smallest CIT sample found and its running time for each of the three settings of HHSA.

Summary of RQ1

We conclude that the quality of results obtained by HHSA is high. While we do not produce the best results on every model, we are highly competitive, and for all of the real world subjects we are as good as, or improve upon, the best known results.

RQ2: How efficient is the hyperheuristic approach and what is the trade off between the quality of the results and the running time?

Another important issue in CIT is the time needed to find a test suite that is as close to the minimal one as possible, given the time budgeted for the search. Depending on the application, one might want to sacrifice minimality for efficiency (or vice versa). This research question therefore investigates whether the hyperheuristic approach can generate small test suites in reasonable time.

The following tables summarize the average execution time in seconds per subject within each group of benchmarks, for each of the three configurations of HHSA.

Summary of RQ2

We conclude that the HHSA algorithm is efficient when run at the lowest level (HHSA-low). When run at the higher levels we see a cost-quality tradeoff. In practice, however, the monetary cost of running these algorithms is very small.

RQ3: How efficient and effective is each search navigation operator in isolation?

In order to collect baseline results for each of the operators that the hyperheuristic approach can choose, we study the effects of each operator in isolation. That is, we ask how well each operator can perform on its own.

Should it turn out that there is a single operator that performs very well, then there would be no need for further study; we could simply use the high performing operator in isolation. Similarly, should one operator prove to perform poorly and to be expensive then we might consider removing it from further study.

The following figures show additional results: the rank of the solution found by each operator at every 100 generations. Each algorithm variant was run using a single operator, and every 100 generations we rank the quality of its solution.

Summary of RQ3

We conclude that the operators differ in effectiveness and that combining them contributes to a better quality solution.
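The RQ3 ranking of single-operator runs every 100 generations can be sketched as below. This is an illustration under stated assumptions, not the scripts used in the study. Each run is assumed to record a cost per generation where lower is better (for example, the number of uncovered interactions), the function name is hypothetical, and the traces are toy numbers rather than experimental data.

```python
def rank_at_checkpoints(
    traces: dict[str, list[float]], interval: int = 100
) -> dict[int, dict[str, int]]:
    """Rank operators every `interval` generations (rank 1 = best, lower cost is better).

    traces[op][g - 1] is the cost of op's best solution after generation g.
    Ties share a rank (competition ranking: 1, 1, 3, ...).
    """
    horizon = min(len(trace) for trace in traces.values())
    ranks: dict[int, dict[str, int]] = {}
    for gen in range(interval, horizon + 1, interval):
        costs = {op: trace[gen - 1] for op, trace in traces.items()}
        # Rank = 1 + number of operators with a strictly lower cost.
        ranks[gen] = {op: 1 + sum(other < c for other in costs.values())
                      for op, c in costs.items()}
    return ranks


# Hypothetical toy traces (illustrative numbers only, not results from the study).
traces = {
    "op_a": [10] * 100 + [8] * 100,
    "op_b": [12] * 100 + [8] * 100,
    "op_c": [9] * 200,
}
print(rank_at_checkpoints(traces))
# {100: {'op_a': 2, 'op_b': 3, 'op_c': 1}, 200: {'op_a': 1, 'op_b': 1, 'op_c': 3}}
```

Competition ranking keeps tied operators on equal footing, which matters when several operators reach the same array size.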

RQ4: Do we see evidence that the hyperheuristic approach is learning?

Should it turn out that the hyperheuristic approach performs well, finding competitively sized covering arrays in reasonable time, then we have evidence to suggest that the adaptive learning used by the hyperheuristic approach is able to learn which operator to deploy. However, is it really learning? This RQ investigates, in more detail, the nature of the learning taking place as the algorithm searches for interaction test suites. We explore how the problem difficulty varies over time for each of the CIT problems we study, and then ask which operators are chosen at each stage of difficulty; is there evidence that the algorithm is selecting different operators for different types of problems?
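As background for the per-operator reward scores examined in this RQ, the sketch below shows one generic way an adaptive selector can learn which operator to deploy. It uses an epsilon-greedy multi-armed bandit that tries each operator once, then mostly picks the operator with the best average reward and occasionally explores at random. This is a standard reinforcement-learning technique chosen for illustration, not HHSA's actual learning agent. The class name, operator names, reward definition and success rates are all hypothetical.

```python
import random


class EpsilonGreedySelector:
    """Generic epsilon-greedy operator selection (illustrative; not HHSA's agent)."""

    def __init__(self, operators: list[str], epsilon: float = 0.1, seed: int = 0) -> None:
        self.operators = operators
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.reward_sum = {op: 0.0 for op in operators}
        self.uses = {op: 0 for op in operators}

    def score(self, op: str) -> float:
        """Average reward observed so far for `op` (0.0 if never used)."""
        return self.reward_sum[op] / self.uses[op] if self.uses[op] else 0.0

    def choose(self) -> str:
        untried = [op for op in self.operators if self.uses[op] == 0]
        if untried:
            return untried[0]  # try every operator once before exploiting
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.operators)  # explore at random
        return max(self.operators, key=self.score)  # exploit the best average reward

    def update(self, op: str, reward: float) -> None:
        self.reward_sum[op] += reward
        self.uses[op] += 1


# Hypothetical environment: reward 1.0 when the chosen move improves the current
# solution; op_b succeeds 70% of the time, the other operators 10%.
success = {"op_a": 0.1, "op_b": 0.7, "op_c": 0.1}
env = random.Random(1)
selector = EpsilonGreedySelector(list(success))
for _ in range(1000):
    op = selector.choose()
    selector.update(op, 1.0 if env.random() < success[op] else 0.0)
print({op: round(selector.score(op), 2) for op in success}, selector.uses)
```

The small exploration rate keeps collecting evidence about operators that currently look weak, which is what would allow such a selector to change its preference as the search moves through stages of different difficulty.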

The following figures show two examples of the learning algorithm's reward scores for each operator over time.

We examine this further by breaking down the data from each benchmark set into stages. The results are shown in the table below.

Summary of RQ4

We see evidence that the hyperheuristic algorithm is learning, both at different stages of the search and across different types of subjects.

Downloads

Models

This download includes all the subjects we used in our study. Please note that the format of the constraint files used by HHSA is different from the format used by CASA.

Acknowledgments

The FITTEST project FP7/ICT/257574 supports Yue Jia. Mark Harman is supported by the EPSRC, EP/J017515/1 (DAASE), EP/I033688/1 (GISMO), EP/I010165/1 (RE-COST) and EP/G060525/2 (Platform Grant). RE-COST and DAASE also completely support Justyna Petke. Myra Cohen is supported in part by NSF award CCF-1161767.