Flaky tests - tests that can both pass and fail despite no changes having been made to the code - cost developers significant time and energy through repeated testing, which makes identifying and repairing them expensive.
Researchers Owain Parry and Phil McMinn aim to produce techniques to identify, debug and repair flaky tests that could ultimately be implemented as automated tools for developers. These tools could save time, money and energy.
During the development of a software project, the code needs to be tested throughout to make sure the application does what it’s designed to do. The software should either work properly every time and pass the test, or not work correctly and fail it. Generally, a failing test means there is a bug somewhere in the program that needs to be fixed. However, in reality, this is not always the case. Seemingly at random, the same test of the same code can produce different results, causing it to be labelled “flaky”, and hence unreliable.
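As a hypothetical illustration (not an example from the researchers’ own work), the Python test below is flaky because its outcome depends on wall-clock timing rather than on the code being correct or incorrect:

```python
import random
import time
import unittest


def simulate_external_call():
    # Stand-in for a network or database call with variable latency.
    time.sleep(random.uniform(0.01, 0.1))


class PaymentServiceTest(unittest.TestCase):
    def test_processes_payment_within_timeout(self):
        # The assertion depends on how long the simulated call happens to
        # take, so the same test on the same code can pass on one run and
        # fail on the next - the defining behaviour of a flaky test.
        start = time.time()
        simulate_external_call()
        elapsed = time.time() - start
        self.assertLess(elapsed, 0.05)


if __name__ == "__main__":
    unittest.main()
```

The class and function names here are invented for illustration; real flaky tests arise from similar causes such as timing, concurrency, test-order dependence or reliance on external services.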
It is often assumed that failing tests are due to bugs in the program, but in reality the failure could be down to an issue with the test itself. At face value, there’s no way to distinguish between the two.
“This means a great deal of time and energy can be spent looking for a bug that isn’t there.”
Owain Parry
PhD student and member of the Department’s Testing Research Group
Running tests over and over to detect flakiness, particularly in larger projects, can take impractical amounts of time and resources, meaning the bug, or the issue with the test itself, often goes undetected. This can lead to problems with the software much further down the line.
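A minimal sketch of this brute-force approach, for a single Python test function, shows why the cost quickly becomes impractical: it scales with both the number of reruns and the number of tests in the project.

```python
def detect_flaky(test_fn, reruns=100):
    """Rerun a single test many times and report whether its outcome varies.

    Sketch of the brute-force approach: every test in the suite has to be
    executed `reruns` times, so the total cost grows with the size of the
    project as well as the number of repetitions.
    """
    outcomes = set()
    for _ in range(reruns):
        try:
            test_fn()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    # Observing both a pass and a fail on identical code marks the test as flaky.
    return len(outcomes) > 1
```

Even a modest suite of a few thousand tests, rerun 100 times each, amounts to hundreds of thousands of test executions.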
Using a machine learning model, the team is developing a tool that can predict, with a reasonable degree of accuracy, whether a test is flaky. This dramatically reduces the number of times a test needs to be run, saving time, money and energy. Moving forward, the team hopes to create a tool that will not only detect flaky tests but also identify their root cause.
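The article does not describe the model itself, but flakiness prediction is commonly framed as supervised classification over features extracted from the test code, so that new tests can be assessed without rerunning them. The sketch below shows that framing only; the feature names and training data are entirely hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features extracted statically from each test, e.g. whether it
# touches the network, the filesystem, or threads, whether it calls sleep,
# and how many assertions it contains. Labels come from tests whose
# flakiness is already known (for example, from past reruns).
training_features = [
    # [uses_network, uses_filesystem, spawns_threads, calls_sleep, num_asserts]
    [1, 0, 1, 1, 3],   # known flaky test
    [0, 1, 0, 0, 5],   # known stable test
    [1, 0, 0, 1, 2],   # known flaky test
    [0, 0, 0, 0, 4],   # known stable test
]
training_labels = [1, 0, 1, 0]  # 1 = flaky, 0 = stable

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(training_features, training_labels)

# Predict for a new, unseen test described by the same features - no reruns needed.
new_test = [[1, 0, 1, 0, 2]]
print("predicted flaky" if model.predict(new_test)[0] else "predicted stable")
```

The benefit of this framing is that the expensive rerunning is only needed once, to label the training data; after that, predictions for new tests are effectively free.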
“There is a danger that if you give a developer a list of flaky tests, they will mark them in some way so they don’t fail the software build, with the intention of fixing them later,” added Owain.
“Then you end up with fewer tests, which means it’s more likely bugs will be missed - and this can have significant impact.
“If we can build something that’s relatively straightforward to set up and use, and it can give developers more information than whether a test is flaky or not, that can ultimately lead to better software.
“What we’re doing may not look very glamorous, but it could have a significant benefit to developers and translate into real world impact.”