Call for Participation — Procedure

The competition compares state-of-the-art tools for automatic software testing with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating test-generation tools are executed on benchmark test tasks, and the number of found bugs, the achieved coverage, and the runtime are measured. The competition is performed and presented by the Test-Comp competition chair.

This page is the authoritative reference for the rules of the competition. For the most recent changes, please refer to the changelog. If in doubt, please also read the FAQ with more descriptions.

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the STTT special issue on TOOLympics 2019) a contribution paper that gives an overview of the competition candidate (the selection criterion is not novelty but the quality of the tool description and the qualification of the tool). An overview paper by the competition organizers will describe the competition procedure and present the results. The paper-submission deadline is specified on the dates page; submission requirements are explained on the submission page.

In addition, every qualified competition candidate is granted a demonstration slot in the ETAPS/TOOLympics program to present the competition candidate to the ETAPS audience.

Definition of Test Task

A test task consists of a C program and a test specification. A test run is a non-interactive execution of a competition candidate on a single test task, in order to generate a result value and test suite according to the test specification.

The result of a test run is a triple (ANSWER, TEST-SUITE, TIME). ANSWER has the value DONE if the tool finished computing a TEST-SUITE that tries to achieve the coverage goal defined in the test specification, and another value (such as UNKNOWN) if the tool terminated prematurely due to a crash, a time-out, or running out of memory.

TEST-SUITE is a directory of files according to the format for exchangeable test suites (test-suite format). Irrespective of the ANSWER, the (partial) TEST-SUITE is validated.

TIME is the consumed CPU time until the tester terminates or is terminated. It includes the consumed CPU time of all processes that the tester starts. If TIME is equal to or larger than the time limit, then the tester is terminated.

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below.

Test Specifications

The specifications for testing a program are given in files properties/coverage-error-call.prp and properties/coverage-branches.prp.

The definition init(main()) gives the initial states of the program by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered, COVER EDGES(@BASICBLOCKENTRY) means that all statements should be covered, and COVER EDGES(@CALL(__VERIFIER_error)) means that function __VERIFIER_error should be called [lang def].

Bug Finding:

COVER( init(main()), FQL(COVER EDGES(@CALL(__VERIFIER_error))) )    

Formula: COVER EDGES(@CALL(__VERIFIER_error))
Definition: The test suite contains at least one test that executes function __VERIFIER_error.
Coverage:

COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) )    

Formula: COVER EDGES(@DECISIONEDGE)
Definition: The test suite contains tests such that all branches of the program are executed.
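
As an illustration of the two properties, consider the following minimal program (a hypothetical example, not part of the benchmark set). A test suite for the bug-finding property must contain a test whose input reaches the call of __VERIFIER_error; a test suite for the branch-coverage property needs at least one test for each outcome of the if condition.

extern int __VERIFIER_nondet_int(void);
extern void __VERIFIER_error(void);

int main(void) {
    int x = __VERIFIER_nondet_int();  /* input value provided by a test */
    if (x == 42) {                    /* branch 1: condition true       */
        __VERIFIER_error();           /* reached only for input 42      */
    } else {                          /* branch 2: condition false      */
        x = 0;
    }
    return 0;
}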

Evaluation by Scores and Runtime

Finding Bugs

The first category evaluates the ability of the testers to discover bugs. The benchmark set for this category consists of programs that contain a bug.

Evaluation by scores and runtime

Every run will be started by a batch script, which produces for every tool and every test task (a C program) one of the following scores:

+1 point: for a generated TEST-SUITE that contains a test witnessing the specification violation and TIME is less than the time limit
0 points: otherwise

The participating test tools are ranked according to the sum of points. Tools with the same sum of points are ranked according to success-runtime. The success-runtime for a tool is the total CPU time over all test tasks for which the tool produced a test suite that revealed the bug.
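
As a sketch of this evaluation, the per-tool sum of points and success-runtime could be computed as follows from the individual runs (a simplified, hypothetical helper, not the official evaluation script):

#include <stddef.h>

/* One bug-finding run of a tester on one test task (hypothetical record). */
struct run {
    int bug_witnessed;  /* 1 if the validated TEST-SUITE contains a test that triggers the bug */
    double cpu_time;    /* TIME: consumed CPU time in seconds */
};

/* Sum of points and success-runtime of one tester over all its runs. */
void evaluate_bug_finding(const struct run runs[], size_t n, double time_limit,
                          int *points, double *success_runtime) {
    *points = 0;
    *success_runtime = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (runs[i].bug_witnessed && runs[i].cpu_time < time_limit) {
            *points += 1;                          /* +1 point                         */
            *success_runtime += runs[i].cpu_time;  /* counted only for successful tasks */
        }
    }
}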

Code Coverage

The goal in the second category is to cover as many program branches as possible. This coverage criterion was chosen because many test-generation tools support it by default. Other coverage criteria can be reduced to branch coverage by transformation.

Evaluation by scores and runtime

Every run will be started by a batch script, which produces for every tool and every test task (a C program) the branch coverage (as reported by gcov; a value between 0 and 1) that the generated test cases achieve on the program:

+cov points: for a generated TEST-SUITE that yields coverage cov and TIME is less than the time limit
0 points: otherwise

The participating test tools are ranked according to the accumulated coverage. Tools with the same coverage are ranked according to success-runtime. The success-runtime for a tool is the total CPU time over all test tasks for which the tool reported a test suite.
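
Analogously, a sketch of the accumulation of the coverage score, assuming per-task branch-coverage values as reported by gcov (again a hypothetical helper, not the official evaluation script):

#include <stddef.h>

/* Accumulated coverage score of one tester: the sum of the per-task
 * branch-coverage values cov (each between 0 and 1) of all runs that
 * finished within the time limit. */
double accumulate_coverage(const double cov[], const double cpu_time[],
                           size_t n, double time_limit) {
    double score = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (cpu_time[i] < time_limit) {
            score += cov[i];  /* +cov points */
        }
    }
    return score;
}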

Runtime

Runtime is the consumed CPU time until the tester terminates. It includes the consumed CPU time of all processes that the tester starts. If TIME is equal to or larger than the time limit, then the tester is terminated.

Benchmark Test Tasks

The training test tasks will be provided by a specified date on this web page. A subset of the training test tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all test tasks are taken). We avoid using test tasks that were not available for training, because the semantics of the C language are underspecified; the training phase lets participants, jury, and community ensure that all involved parties agree on the intended meaning of the test tasks. (This is in contrast to competitions over completely specified problems, such as SAT or SMT.)

The programs are assumed to be written in GNU C (some of them adhere to ANSI C). Each program contains all code that is needed for testing, i.e., the programs are already preprocessed and contain no #include directives (cpp -P). Some programs are provided in CIL (C Intermediate Language) [1].

Potential competition participants are invited to submit benchmark test tasks until the specified date. Test tasks have to fulfill two requirements, to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C, and (2) the program has to come with a specification given by one of the properties stated above. Other specifications are possible, but need to be proposed and discussed.

New proposed categories will be included if at least three different tools or teams participate in the category (i.e., not the same tool twice with a different configuration).

For each category, we specify whether the programs are written for an ILP32 (32-bit) or an LP64 (64-bit) architecture (cf. http://www.unix.org/whitepapers/64bit.html).

In the following, we list a few conventions that are used in some of the test tasks, in order to express special information that is difficult to capture with the C language.

__VERIFIER_error(): For checking reachability we use the function __VERIFIER_error(). The test tool can assume the following implementation:
void __VERIFIER_error() { abort(); }
Hence, a call to __VERIFIER_error() never returns, and the program terminates inside the function __VERIFIER_error().

__VERIFIER_assume(expression): A test tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: if 'expression' evaluates to '0', then the function loops forever, otherwise the function returns (no side effects). The test tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

__VERIFIER_nondet_X(): In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {bool, char, int, float, double, loff_t, long, pchar, pointer, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, ushort} (no side effects, pointer for void *, etc.). The test tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }
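
For illustration, a test task might combine these functions as follows to restrict the inputs of interest (hypothetical snippet, not part of the benchmark set):

extern void __VERIFIER_assume(int expression);
extern unsigned int __VERIFIER_nondet_uint(void);

int main(void) {
    unsigned int n = __VERIFIER_nondet_uint();  /* arbitrary input value          */
    __VERIFIER_assume(n > 0 && n < 100);        /* only inputs in 1..99 continue  */
    unsigned int sum = 0;
    for (unsigned int i = 1; i <= n; i++) {
        sum += i;                               /* loop bounded by the assumed range */
    }
    return 0;
}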

__VERIFIER_atomic_begin(): For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, those statements can be placed between two function calls __VERIFIER_atomic_begin() and __VERIFIER_atomic_end() (deprecated but still valid: those statements can be placed in a function whose name matches __VERIFIER_atomic_). The testers are instructed to assume that the execution between those calls is not interrupted. The two calls need to occur within the same control-flow block; nesting or interleaving of those function calls is not allowed.
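
For example, a multi-threaded program might mark an atomic update of a shared variable as follows (hypothetical snippet); both calls occur in the same control-flow block and are neither nested nor interleaved:

extern void __VERIFIER_atomic_begin(void);
extern void __VERIFIER_atomic_end(void);

int counter = 0;

void *worker(void *arg) {
    __VERIFIER_atomic_begin();
    counter = counter + 1;  /* executed without interruption by other threads */
    __VERIFIER_atomic_end();
    return arg;
}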

malloc(), free(): We assume that the functions malloc and alloca always return a valid pointer, i.e., the memory allocation never fails, and function free always deallocates the memory and makes the pointer invalid for further dereferences.

Competition Environment and Requirements

Competition Environment

Each test run will be started on a machine with a GNU/Linux operating system (x86_64-linux, Ubuntu 18.04); there are three resource limits for each test run: a memory limit of 15 GB (14.6 GiB) of RAM, a runtime limit of 15 min of CPU time, and a limit of 8 processing units of the CPU. Further technical parameters of the competition machines are available in the repository that also contains the benchmark definitions.

Benchmark Definition

The competition environment feeds the candidates with parameters when started for a test run. There is one global set of parameters that can be used to tune the tester to the benchmark programs. This set of (command-line) parameters has to be defined in the competition contribution paper and in the benchmark definition. One parameter defines the specification file. Some categories need to be tested with the information that a 64-bit architecture is used; thus, if the tester has a parameter to handle this aspect, it needs to be defined as well.

In order to participate in Test-Comp, a benchmark definition in the Test-Comp repository is necessary. Technically, the benchmark definition needs to be integrated into the Test-Comp repository under the directory benchmark-defs using a merge request.

The benchmark definition defines the categories that the tester is to be executed on, the parameters to be passed, and the resource limits.

Tool-Info Module

In order to participate in Test-Comp, a tool-info module in the BenchExec repository is necessary. Technically, the tool-info module needs to be integrated into the BenchExec repository under the directory benchexec/tools using a pull request.

The task of the tool-info module is (besides other tasks) to translate the output of a tester to uniform results. For running the contributed tester, the organizer follows the installation requirements and executes the tester, relying on the tool-info module for correct execution of the tester and correct interpretation of its results.

The tool-info module must be tested, cf. instructions on tool integration.

Tester

The submitted system has to meet the following requirements:

Qualification

Tester. A test tool is qualified to participate as a competition candidate if the tool (a) is publicly available for download and fulfills the above license requirements, (b) works on the GNU/Linux platform (more specifically, it must run on an x86_64 machine), (c) is installable with user privileges (no root access required, except for required standard Ubuntu packages) and without hard-coded absolute paths for access to libraries and non-standard external tools, (d) succeeds, for more than 50 % of all training programs, in parsing the input and starting the test process (a tool crash during the test phase does not disqualify), and (e) produces test suites that adhere to the exchange format (syntactically correct).

Person. A person (participant) is qualified as competition contributor for a competition candidate if the person (a) is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool's project web page, a tool paper, or in the revision logs) or (b) is authorized by the competition organizer (after the designer/developer was contacted about the participation).

Paper. A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and meets the scientific standards of the venue (conference or journal) as competition-candidate representation in the conference proceedings or journal issue.

Note: A test tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

The participants have the opportunity to check the tester results against their own expected results and discuss inconsistencies with the competition chair.

Opting-out from Categories

Every team can submit for every category (including meta categories, i.e., categories that consist of test tasks from other categories) an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt out), *all* test tasks that belong to that category are executed with the tester. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means that a tool can participate in a meta category and at the same time opt out from a sub-category; the real results of the tool are then counted towards the meta category, but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The score in such a meta category is computed from the normalized scores of its sub-categories.

Procedure

The score for a (meta) category is computed from the scores of all k contained (sub-) categories using a normalization by the number of contained test tasks: the normalized score sn_i of a tester in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i); then the sum st = SUM(sn_1, ..., sn_k) of the normalized scores is multiplied by the average number of tasks per category.
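
For example, two sub-categories with n_1 = 100 and n_2 = 400 tasks and scores s_1 = 80 and s_2 = 200 yield sn_1 = 0.8 and sn_2 = 0.5, hence st = 1.3 and a meta-category score of 1.3 * 250 = 325. A sketch of this computation (hypothetical helper, not the official evaluation script):

/* Score of a meta category, computed from the scores s[i] and task
 * counts n[i] of its k sub-categories. */
double meta_category_score(const double s[], const int n[], int k) {
    double st = 0.0;      /* st = sn_1 + ... + sn_k with sn_i = s_i / n_i */
    int total_tasks = 0;
    for (int i = 0; i < k; i++) {
        st += s[i] / n[i];
        total_tasks += n[i];
    }
    return st * ((double) total_tasks / k);  /* times the average number of tasks per category */
}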

Motivation

The goal is to reduce the influence of a test task in a large category compared to a test task in a small category, and thus to balance the weight across categories.

Last changed: 2022-08-19T16:18:06+02:00